Thesis
1 Introduction
1.1 Warm-up
[What does a compiler do]Todo
[This thesis aims to build a framework which helps write compilers.]Todo
[Our focus is on compilers, not virtual machines or other run-time systems. We are also not concerned with parsers — there are lots of existing approaches and libraries which help writing parsers, although parsing in general is not yet a solved problem on all accounts.]Todo
[IR = the macroscopic structure of the program (i.e. the meta-model (explain what it is)) + the code of functions and/or methods (statements and expressions, basic blocks of statements, or bytecode instructions)]Todo
1.2 The challenges of writing compilers
Brain surgery? — It’s not exactly rocket science, is it?
That Mitchell & Webb Look, Series 3 — BBC Two
Compilers are an essential part of today’s software systems. Compilers translate high-level languages with complex semantics into lower-level languages. A compiler will parse the program, transform it in various ways, perform more or less advanced static checks, and optimise the input program before producing an output in the desired target language. A compiler must be correct, reusable and fast. It must be correct because programmers are concerned with logical errors in their own code, and should not fear that the compiler introduces erroneous behaviour on its own. It must also be well-architected: extensible, because the language is likely to evolve over time; modular, in the hope that some components can be improved independently of the rest of the compiler (e.g. replacing or improving an optimisation phase, or changing the compiler’s front-end to support another input language); and more generally reusable, so that parts can be repurposed to build other compilers, or other tools (analysers, IDEs, code instrumentation and so on). Finally, a fast compiler is desirable because the programmer’s edit-build-test cycle should be as frequent as possible[ (Sandberg 1988)]Todo.
Given their broad role, the complexity of the transformations involved, and the stringent requirements, writing compilers is a difficult task. Multiple pitfalls await the compiler engineer, which we will discuss in more detail below. This thesis aims to improve the compiler-writing toolkit currently available, in order to help compiler developers produce compilers which are closer to correctness, and easier to maintain.
The overall structure of a compiler will usually include a lexer and parser, which turn the program’s source into an in-memory representation. This initial representation will often be translated into an intermediate representation (IR) better suited to the subsequent steps. At some early point, the program will be analysed for syntactical or semantic inconsistencies (ranging from missing parentheses to duplicate definitions of the same variable), and the compiler may also perform a more thorough static analysis. Finally, code in the target language or for the target architecture is generated. The translation can additionally include optimisation phases at several points: during code generation, using locally-recognisable patterns, or earlier, using the results of the program-wide analysis performed separately.
We identify three pitfalls which await the compiler-writer:
It is easy to excessively reuse a single intermediate representation type, instead of properly distinguishing the specifics of the input and output types of each pass;
There is a high risk associated with the definition of large, monolithic passes, which are hard to test, debug, and extend;
The fundamental structure of the program being compiled is often a graph, but compilers often work on an Abstract Syntax Tree, which requires explicit handling of the backward arcs. This is a source of bugs which could be avoided by using a better abstraction.
The first two issues are prone to manifestations of some form or another of the “god object” anti-pattern¹. The last issue is merely caused by the choice of an abstraction which does not accurately represent the domain. We will discuss each of these ailments in more detail in the following sections, and detail the undesirable symptoms associated with them.
¹ The “god object” anti-pattern describes object-oriented classes which do too much or know too much. The size of these classes tends to grow out of control, and there is usually a tight coupling between the methods of the object, which in turn means that performing small changes may require understanding the interactions between random parts of a very large file, in order to avoid breaking existing functionality.
1.2.1 Large monolithic passes
Large, monolithic passes, which perform many transformations simultaneously, have the advantage of possibly being faster than several smaller passes chained one after another. Furthermore, as one begins writing a compiler, it is tempting to incrementally extend an initial pass to perform more work, rather than starting all over again with a new intermediate representation, and new scaffolding to support its traversal.
However, the drawback is that large compiler passes are harder to test (as there are many more combinations of paths through the compiler’s code to test), harder to debug (as many unrelated concerns interact to some extent with each other), and harder to extend (for example, adding a new special form to the language will necessitate changes to several transformations, but if these are mingled in a single pass, the changes may be scattered through it, and interact with a significant amount of surrounding code). This higher maintenance cost also comes with another drawback: formal verification of the compiler will clearly be more difficult when large, tangled chunks of code which handle different semantic aspects are involved.
[Talk a bit about compcert here (one of the few/ the only formally verified compilers).]Todo
1.2.2 Overusing a single intermediate representation
The static analysis, optimisation and code generation phases could in principle work on the same intermediate representation. However, several issues arise from this situation.
In principle, new information gained by the static analysis may be added to the existing representation via mutation, or the optimiser could directly alter the IR. This means that the IR will initially contain holes (e.g. represented by null values), which will get filled in gradually. Manipulating these parts is then risky, as it is easy to accidentally attempt to retrieve a value before it was actually computed. Using the same IR throughout the compiler also makes it difficult for later passes to assume that some constructions have been eliminated by previous simplification passes, and correctness relies on a fixed order of execution of the passes; parts of the code which access data introduced or modified by other passes are more brittle and may be disrupted when code is refactored (for example, when moving the computation of some information to a later pass).
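To make the hazard concrete, here is a small sketch in Python (the names FunDef, infer_types and closure_convert are illustrative, not taken from any real compiler): a single IR class whose analysis results start out as empty holes, with pass ordering enforced by nothing but convention.

```python
# Sketch (hypothetical names, not from a real compiler): a single IR class is
# reused by every pass, with analysis results stored in initially-empty holes.

class FunDef:
    def __init__(self, name, body):
        self.name = name
        self.body = body
        self.inferred_type = None   # hole: filled in by a type-inference pass
        self.free_vars = None       # hole: filled in by a free-variable analysis

def infer_types(fn):
    fn.inferred_type = "int -> int"          # stand-in for real inference

def closure_convert(fn):
    # Silently depends on the free-variable analysis having already run;
    # nothing in the types documents or enforces that ordering.
    return (fn.name, sorted(fn.free_vars))

fn = FunDef("f", body="x + y")
infer_types(fn)
try:
    closure_convert(fn)                      # the analysis never ran...
except TypeError:
    print("hole accessed before it was filled")   # ...so the hole is still None
```

Nothing in `closure_convert`’s interface reveals that it needs `free_vars`; the error only surfaces at run time, and only on executions that actually reach the hole.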
This situation becomes worse during the maintenance phase of the compiler’s lifecycle: when considering the data manipulated by a small portion of code (in order to fix or improve said code), it is unclear which parts are supposed to be filled in at that point, as well as which parts have been eliminated by prior simplification passes.
Furthermore, a mutable IR hinders parallel execution of compiler passes. Indeed, some compiler passes will perform global transformations or analyses, and such code may be intrinsically difficult to parallelise. Many other passes however are mere local transformations, and can readily be executed on distinct parts of the abstract syntax tree, as long as there is no need to synchronise concurrent accesses or modifications.
Using immutable intermediate representations (and performing shallow copies when updating data) can help with this second issue. However, there is more to gain if, instead of having many instances of the same type, each intermediate representation is given a distinct, precise type. The presence or absence of computed information can be known from the input and output type of a pass, instead of relying on the order of execution of the passes in order to know what the input data structure may contain.
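The fix suggested above can be sketched in Python with one frozen dataclass per representation (the names are again illustrative; the thesis realises this idea in Typed Racket):

```python
# Sketch: give each pass its own immutable input/output type, so the presence
# of analysis results is visible in the types rather than in the pass ordering.
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class FunDef:                 # IR before analysis: no holes to misuse
    name: str
    body: str

@dataclass(frozen=True)
class AnalysedFunDef:         # IR after analysis: free_vars is always present
    name: str
    body: str
    free_vars: FrozenSet[str]

def analyse(fn: FunDef) -> AnalysedFunDef:
    # Toy "analysis": collect identifiers occurring in the body.
    fvs = frozenset(v for v in fn.body.split() if v.isidentifier() and v != fn.name)
    return AnalysedFunDef(fn.name, fn.body, fvs)

def closure_convert(fn: AnalysedFunDef) -> tuple:
    # Can only be applied to the analysed IR: the dependency is in the signature.
    return (fn.name, sorted(fn.free_vars))

print(closure_convert(analyse(FunDef("f", "x + y"))))   # → ('f', ['x', 'y'])
```

Calling `closure_convert` on a plain `FunDef` is now a type error for a static checker, instead of a latent `None` access.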
1.2.3 Graphs
Nontrivial programs are inherently graphs: they may contain mutually recursive functions (which directly refer to each other, and therefore will form a cycle in a representation of the program), circular and (possibly mutually) recursive datatypes may syntactically contain (possibly indirect) references to themselves, and the control flow graph of a function or method can, as its name implies, contain instructions which perform conditional or unconditional backwards branches.
However, nearly every compiler textbook will mention the use of Abstract Syntax Trees (ASTs) to represent the program. This means that a structure, which intrinsically has the shape of a graph, is encoded as a tree.
Edges in the graph which may embody backward references can be made explicit in various ways:
By using a form of unique identifier, like a name bearing some semantic value (e.g. the fully qualified name of the type or function that is referred to), an index into an array of nodes (e.g. the offset of an instruction in a function’s bytecode may be used to refer to it in the control flow graph), or an automatically-generated unique identifier.
Manipulation of these identifiers introduces the potential for several sorts of bugs: name clashes can occur if the chosen qualification is not sufficient to always distinguish nodes. Compiler passes which duplicate nodes (for example when specialising functions) or merge them must therefore be careful to update identifiers correctly.
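A minimal Python sketch of this encoding and its failure mode (the helpers define and specialise are hypothetical): nodes are referenced by name in a global table, and a pass that duplicates a definition must mint a fresh identifier, or the copy silently overwrites the original.

```python
# Sketch (hypothetical helpers): backward references encoded as names into a
# global table. A duplicating pass must generate a fresh, distinct identifier.

functions = {}

def define(name, body):
    functions[name] = body

def specialise(name, suffix):
    fresh = name + suffix
    assert fresh not in functions, "identifier clash"   # guard against collisions
    functions[fresh] = functions[name] + " (specialised)"
    return fresh

define("sum", "generic loop over elements")
specialise("sum", "$int")
print(sorted(functions))   # → ['sum', 'sum$int']
```

Without the freshness check (or with a suffix scheme that can collide), the clash goes unnoticed and later name lookups resolve to the wrong node.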
Anecdotally, we can mention a bug in the "Mono.Cecil" library (which allows easy manipulation of .NET bytecode). When “resolving” a reference to a primitive type, it can happen in some cases that "Mono.Cecil" returns a Type metadata object which references a type with the correct name, but [exported]Todo from the wrong DLL library.
Alternatively, backward references may be encoded as a form of path from the referring node. De Bruijn indices can be used in such an encoding, for example.
Once again, manipulating these references is risky, and De Bruijn indices are particularly brittle: for example, when adding a wrapper around a node (i.e. adding an intermediate node on the path from the root), the De Bruijn indices used in some of the descendants of that node (but not all) must be updated. It is understandably easy to implement these index updates incorrectly, and a single off-by-one error can throw the graph’s representation into an inconsistent state.
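The update step described above can be sketched as a standard De Bruijn shift (illustrative Python, not code from the framework); the `cutoff` bookkeeping, which separates bound indices from free ones, is precisely where off-by-one errors creep in:

```python
# A standard De Bruijn shift, sketched in Python.
# Terms: ("var", n) | ("lam", body) | ("app", fun, arg)

def shift(term, by, cutoff=0):
    kind = term[0]
    if kind == "var":
        n = term[1]
        # Indices below the cutoff are bound inside `term` and stay put;
        # indices at or above it are free and must be shifted.
        return ("var", n + by) if n >= cutoff else term
    if kind == "lam":
        return ("lam", shift(term[1], by, cutoff + 1))   # one more binder
    return ("app", shift(term[1], by, cutoff), shift(term[2], by, cutoff))

def wrap_in_binder(term):
    # Adding an intermediate binder above `term` puts one more lambda between
    # its free variables and their targets, so those indices must grow by one.
    return ("lam", shift(term, 1))

# \x. (x y), where y is free (index 1 under the lambda):
t = ("lam", ("app", ("var", 0), ("var", 1)))
print(wrap_in_binder(t))   # → ('lam', ('lam', ('app', ('var', 0), ('var', 2))))
```

Note that only the free occurrence is renumbered: swapping `>=` for `>` in `shift`, or forgetting `cutoff + 1` under a binder, silently corrupts every term the pass touches.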
The program’s representation could also contain actual pointers (thereby really representing the program as an “Abstract Syntax Graph”), using mutation to patch nodes after they are initially created.
In order to prevent undesired mutation of the graph after it is created, it is possible to “freeze” the objects contained within[references]Todo. This intuitively gives guarantees similar to those obtained from a purely immutable representation. However, the use of mutation could obstruct formal verification efforts, as some invariants will need to take into account the two phases of an object’s lifetime (during the creation of the containing graph, and after freezing it). More generally speaking, it is necessary to ensure that no mutation of objects happens during the graph construction, with the exception of the mutations required to patch cycles.
The compiler could also manipulate lazy data structures, where the actual value of a node in the graph is computed on the fly when that node is accessed.
Lazy programs are however harder to debug (Morris 1982; Nilsson and Fritzson 1993; Wadler 1998), as the computation of the various parts of the data manipulated does not occur in an intuitive order. Among other things, accidental infinite recursion could be triggered by totally unrelated code which merely reads a value.
Finally, Higher-Order Abstract Syntax, or HOAS for short, is a technique which encodes variable bindings as anonymous functions in the host language (whose parameters reify bindings at the level of the host language). Variable references are then nothing more than actual uses of the variable at the host language’s level. Substitution, a common operation in compilers and interpreters, then becomes a simple matter of calling the anonymous function with the desired substitute. HOAS has the additional advantage that it enforces well-scopedness, as it is impossible to refer to a variable outside of its scope in the host language.
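A minimal HOAS sketch in Python (illustration only; the thesis’s setting is Racket): a binder is represented by a host-language function, so substitution is just a call, and a variable cannot be mentioned outside the host-language scope that binds it.

```python
# HOAS sketched in Python.
# Terms: ("lit", n) | ("lam", f) | ("app", fun, arg)

def eval_term(term):
    kind = term[0]
    if kind == "lit":
        return term[1]
    if kind == "app":
        fun, arg = term[1], term[2]
        assert fun[0] == "lam"
        return eval_term(fun[1](arg))   # substitution = applying the closure
    raise ValueError("cannot evaluate an open term")

# \x. x + 1, with the binding encoded as a Python lambda:
succ = ("lam", lambda x: ("lit", eval_term(x) + 1))
print(eval_term(("app", succ, ("lit", 41))))   # → 42
```

There is no "variable" constructor at all: an occurrence of the bound variable is an occurrence of the Python parameter `x`, so ill-scoped terms simply cannot be written.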
Parametric HOAS, dubbed PHOAS, also allows encoding the type of the variables in the representation. [Can extra information other than the type be stored?]Todo
There are a few drawbacks with HOAS and PHOAS:
The “target” of a backward reference must be above all uses in the tree (i.e. a node may be the target of either backward references or forward references, but not a mix of both). This might not always be the case. For example, pre/post-conditions could, in an early pass in the compiler, be located outside of the normal scope of a function’s signature, but still refer to the function’s parameters. If the pre/post-condition language allows breaking encapsulation, these could even refer to some temporary variables declared inside the function.
[PHOAS naturally lends itself to the implementation of substitutions, and therefore is well-suited to the writing of interpreters. However, the representation cannot be readily traversed and accessed as would be done with normal structures, and therefore the model could be counterintuitive.]Todo
[It seems difficult to encode an arbitrary number of variables bound in a single construct (e.g. to represent bound type names across the whole program, or an arbitrary number of mutually-recursive functions declared via let … and … in …, with any number of and clauses in the compiled language).]Todo
Although some of these seem like viable solutions (e.g. explicitly freezing objects), they still involve low-level mechanisms to create the graph. When functionally replacing a node with a new version of itself, all references to it must be updated to point to the new version. Furthermore, code traversing the graph to gather information or perform updates needs to deal with cycles, in order to avoid running into an infinite loop (or infinite recursion). Finally, backward edges are not represented in the same way as forward edges, introducing an arbitrary distinction when fetching data from the neighbours of a node. This last aspect reduces the extensibility of code which manipulates graphs where access to fields is not done uniformly: supposing new features of the language to be compiled require turning a “seamless” edge into one which must be explicitly resolved in some way (e.g. because this node, in the updated IR, may now be part of cycles), this change of interface will likely break old code which relied on seamless access to that field.
We think that the compiler engineer could benefit from abstracting over these implementation details, and think of compiler passes in terms of graph transformations. Programmers using functional languages often write list transformations using functions like map, foldl, filter, zip and so on, instead of explicitly writing recursive functions.
The graph can be manipulated by updating some or all nodes of a given type, using an old-node to new-node transformation function. This transformation function could produce more than one node, by referencing the extra nodes from the replacement one. It should furthermore be possible to locally navigate through the graph, from its root and from the node currently being transformed. This interface would allow cycles to be handled seamlessly — transformations which apply over a whole collection of nodes need not be concerned with cycles — and would still allow local navigation, without distinguishing backward and forward edges.
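Such an interface could look roughly as follows (a hypothetical Python sketch, not the framework’s actual API): the traversal memoises copies of visited nodes, so cycles terminate and sharing is preserved, and the per-node update runs on the fresh copies, so every reference, including back-edges, sees the new node.

```python
# Sketch (hypothetical API): rewrite every node of a given kind in a possibly
# cyclic graph, without the transformation function worrying about cycles.

class Node:
    def __init__(self, kind, **fields):
        self.kind = kind
        self.fields = fields

def map_nodes(root, kind, update):
    copies = {}
    def copy(node):
        if id(node) in copies:
            return copies[id(node)]     # back-edge or shared node: reuse copy
        new = Node(node.kind, **node.fields)
        copies[id(node)] = new          # register before recursing: cycles OK
        for name, value in new.fields.items():
            if isinstance(value, Node):
                new.fields[name] = copy(value)
        return new
    new_root = copy(root)
    for new in copies.values():
        if new.kind == kind:
            update(new)                 # mutate the fresh copy in place
    return new_root

# Two mutually recursive functions form a cycle:
f = Node("fun", name="even")
g = Node("fun", name="odd", callee=f)
f.fields["callee"] = g

new_f = map_nodes(f, "fun", lambda n: n.fields.update(name=n.fields["name"].upper()))
print(new_f.fields["name"], new_f.fields["callee"].fields["name"])  # → EVEN ODD
assert new_f.fields["callee"].fields["callee"] is new_f             # cycle preserved
assert f.fields["name"] == "even"                                   # original intact
```

The transformation function only sees one node at a time; the cycle through `even` and `odd` is handled entirely by the traversal, and backward and forward edges are accessed the same way.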
1.2.4 Expressing the data dependencies of a pass via row types
It is easy enough to test a compiler by feeding it sample programs and checking that the compiled output behaves as expected. However, exercising a specific set of conditions inside a given pass, in order to achieve reasonably complete code coverage, may prove to be a harder task: previous passes may modify the input program in unexpected ways, and obtaining a certain data configuration at some point in the compiler requires the developer to mentally execute the compiler’s code backwards from that point, in order to determine the initial conditions which will produce the desired configuration many steps later. This means that extensively testing corner cases which may occur in principle, but are the result of unlikely combinations of features in the input program, is a cumbersome task.
If the compiler consists of many small passes, whose inputs and outputs are serializable, then it becomes possible to thoroughly test a single pass in isolation, by supplying an artificial, crafted input, and checking some properties of its output.
However, a compiler written following the Nanopass philosophy will feature many small passes which read and update only a small part of their input. Specifying actual values for the irrelevant parts of the data not only makes the test cases more verbose than they need to be, but also obscures which variations of the input matter (and may thus allow the detection of new errors), and which parts of the input are mere placeholders whose actual value will not influence the outcome of the pass, aside from being copied over without changes.
It is desirable to express, in a statically verifiable way, which parts of the input are relevant, and which parts are copied verbatim (modulo updated sub-elements). Furthermore, it would be useful to be able to only specify the relevant parts of tests, and omit the rest (instead of supplying dummy values).
Row polymorphism allows us to satisfy both expectations. The irrelevant fields of a record and the irrelevant cases of a variant type can be abstracted away under a single row type variable. “Row” operations on records allow reading and updating relevant fields, while keeping the part abstracted by the row type variable intact. When invoking a compiler pass, the row type variables may be instantiated to the full set of extra fields present in the real IR, when the pass is called as part of the actual compilation; it is also possible, when the pass is called during testing, to instantiate them to an empty set of fields (or to use a single field containing a unique identifier, used to track “object identity”).
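The idea can be approximated dynamically in Python, using a dictionary as a stand-in for a record whose extra fields are abstracted by a row variable (an untyped illustration only; `constant_fold_add` is a hypothetical pass, and our implementation checks such “row” operations statically):

```python
# Sketch: the pass names only the fields it consumes ("op", "left", "right")
# and the field it produces ("value"); the rest of the "row" rides along.

def constant_fold_add(node):
    if node.get("op") == "+":
        rest = {k: v for k, v in node.items() if k not in ("op", "left", "right")}
        rest["value"] = node["left"] + node["right"]
        return rest
    return node

# During actual compilation, nodes carry many extra fields, which survive:
full = {"op": "+", "left": 2, "right": 3, "span": (1, 4), "comments": []}
print(constant_fold_add(full))   # → {'span': (1, 4), 'comments': [], 'value': 5}

# During testing, only the relevant fields need to be supplied:
minimal = {"op": "+", "left": 20, "right": 22}
assert constant_fold_add(minimal) == {"value": 42}
```

With row types, the same pass is given a type along the lines of `(op left right | ρ) → (value | ρ)`: the row variable ρ is instantiated to the full set of extra fields in the real compiler, and to the empty row in tests.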
1.2.5 Verification
[Needs a transition from the previous section, or this should be moved elsewhere.]Todo
The implementation presented in this thesis cannot be immediately extended to support end-to-end formal verification of the compiler being written. However, it helps pave the way for writing formally verified compilers. Firstly, the smaller passes are easier to verify. Secondly, the use of intermediate representations which closely match the input and output data can be used, given a formal semantics for each IR, to assert that a transformation pass systematically preserves the semantics of the input. Thirdly, the use of a typed language instead of the currently “untyped” Nanopass framework means that many properties can be ensured by relying on the type system. Typed Racket’s type checker is not formally verified itself and would have to be trusted (alternatively, the adventurous researcher could attempt to derive well-typedness proofs automatically, by hijacking the type checker to generate traces of the steps involved, or manually, only using the type checker as a helper tool to detect and weed out issues during development). Fourthly, the explicit specification of the dependencies of passes on their input via row types is a form of frame specification, and can significantly ease the verification effort, as the engineer can rely on the fact that irrelevant parts are not modified in the output. These are statically enforced by our extension of Typed Racket’s type system, which we formalise in chapter [??]Todo. This relies on trusting Typed Racket itself once again, and on the correctness of the implementation of our translation from the extended type system to Typed Racket’s core type system. Fifthly, we provide means to express graph transformations as such, instead of working with an encoding of graphs as abstract syntax trees (or directed acyclic graphs) with explicit backward references. We are hopeful that eliminating this mismatch will be beneficial to the formal verification of the transformation passes.
These advantages would be directly available to engineers attempting a formal proof of their compiler, while trusting the correctness of Typed Racket, as well as that of our framework. However, the implementation of our framework is not hardened, and Typed Racket itself suffers from a small number of known sources of unsoundness². In order to perform an end-to-end verification of a compiler, it would be necessary to port our approach to a language better suited to formal verification. Alternatively, Racket could in principle be extended to help formalisation efforts. Two approaches come to mind: the first consists in proving that the macros correctly implement the abstraction which they attempt to model; the second would be to have the macros inject annotations and hints indicating properties that must be proven, in the same way that type annotations are currently inserted. These hints could then be used by the prover to generate proof obligations, which could then be solved manually or automatically.
² See https://github.com/racket/typed-racket/issues.
Mixing macros and language features which help obtaining static guarantees is a trending topic in the Racket ecosystem and in other communities.
Typed Racket is still in active development, and several other projects were presented recently.
Hackett³, mainly developed by King, is a recent effort to bring a Haskell 98-like type system to Racket.
³ https://github.com/lexi-lambda/hackett
Hackett is built on the techniques developed by Chang, Knauth and Greenman in (Chang et al. 2017), which led to the Turnstile library⁴. Turnstile is a helper library which facilitates the creation of typed languages in Racket. Macros are amended with typing rules, which are used to thread type information through uses of macros, definition forms and other special forms. Type checking is therefore performed during macro-expansion, and does not rely on an external type checker which would work on the expanded code. As a result, new languages built with Racket and Turnstile are not limited to a pre-existing type system, but can rather devise their own from the ground up. This approach brings a lot of flexibility, the drawback being that more trust is put in the language designer.
⁴ https://bitbucket.org/stchang/macrotypes.git
The work presented in this thesis aims to follow a different path from that taken by Turnstile: we chose to implement our extended type system as an abstraction over the existing type system of Typed Racket. This means that we do not rely as much on the correctness of our typing rules: instead of verifying the well-typedness of compilers written using our framework ourselves, we inject type annotations into the expanded code, which are then verified by Typed Racket. Therefore, we are confident that type errors will be caught by Typed Racket, safe in the knowledge that the code obtained after macro-expansion is type-safe⁵. This increased serenity comes at the cost of flexibility, as Typed Racket’s type system was not able to express the type constructs that we wanted to add. We therefore had to resort to a few hacks to translate our types into constructs that could be expressed using Typed Racket.
⁵ We actually use unsafe-cast on a few occasions. Such a blunt instrument is however only used in cases where Typed Racket already has the required type information, but the type checker fails to deduce the equivalence between two formulations of the same type, or does not acknowledge that the term being checked has the expected type. These issues are being worked on by the developers of Typed Racket, and we hope to be able to remove these uses of unsafe-cast in later versions.
The approach of building more complex type constructs atop a small trusted kernel has been pursued by Cur⁶, developed by Bowman. Cur is a dependently typed language which permits theorem proving and verified programming. It is based on a small kernel (the Curnel), which does not contain language features which can be expressed by macro transformations. Most notably, the prover’s tactics are defined using metaprogramming tools, and are not part of the core language.
⁶ https://github.com/wilbowma/cur
Another tool worth mentioning is Rosette⁷[reference]Todo. Rosette, mainly developed by Torlak, is a solver-aided language: it tightly integrates an SMT solver with some Racket constructs, so that powerful queries can be performed and answered, for example “what input values to the function f will generate outputs which satisfy the predicate p?”. It can also generate simple functions which satisfy given conditions. These features allow it to be used both as a helper tool during development, for engineers coming from various domains, and as a verifier, as the solver can be used to assert that a function will never give an erroneous output, given a set of constraints on its input.
⁷ https://github.com/emina/rosette
The idea of expressing the type of macro transformations at some level (e.g. by indicating the type of the resulting expression in terms of the type of the input terms, as is allowed by Turnstile) is not new: in 2001, Ganz, Sabry and Taha presented MacroML (Ganz et al. 2001), an experimental language which allows type checking programs before macro-expansion. However, it seems that there is interest, both in the Racket community and elsewhere, for languages with powerful metaprogramming facilities, coupled with an expressive type system. We are hopeful that this will lead to innovations concerning the formal verification of programs making heavy use of complex macros.
1.3 Summary
Once upon a time…
Unknown
1.3.1 Extensible type system
We implemented a different type system on top of Typed Racket, using macros. Macros have been used not only to extend a language’s syntax (control structures, contract annotations, and so on), but also to reduce the amount of “boilerplate” code and obtain clearer syntax for common or occasional tasks. Macros have further been used to extend the language with new paradigms, like adding an object system (CLOS (Bobrow et al. 1988)[is it really implemented using macros?]Todo, Racket classes (Flatt et al. 2006)) or supporting logic programming (Racket Datalog and Racklog, Rosette (Torlak and Bodik 2013, 2014)). In the past, Typed Racket (Tobin-Hochstadt and Felleisen 2008) has proved that a type system can be successfully fitted onto an existing “untyped” language, using macros. We implemented the type-expander library atop Typed Racket, without altering Typed Racket itself. Our type-expander library allows macros to be used in contexts where a type is expected.
This shows that an existing type system can be made extensible using macros, without altering the core implementation of the type system. We further use these type expanders to build new kinds of types which were not initially supported by Typed Racket: non-nominal algebraic types, with row polymorphism. Typed Racket has successfully been extended in the past (for example by adding type primitives for Racket’s class system, which incidentally also support row polymorphism), but these extensions required modifications to the trusted core of Typed Racket. Aside from a small hack (needed to obtain non-nominal algebraic types which remain equivalent across multiple files), our extension integrates seamlessly with other built-in types, and code using these types can use idiomatic Typed Racket features.
Typed Racket was not initially designed with this extension in mind, nor, as far as we know, was it designed with the goal of being extensible. We therefore argue that a better choice of primitive types supported by the core implementation could enable many extensions without the need to resort to hacks like the ones needed in our case. In other words, a better design of the core types with extensibility in mind would have made our job easier.
In particular, Types in Typed Clojure (Bonnaire-Sergeant et al. 2016) support fine-grained typing of heterogeneous hash tables; this would likely allow us to build the “strong duck typing” primitives on which our algebraic data types are based much more easily, without the need to resort to hacks.
In languages making heavy uses of macros, it would be beneficial to design type systems with a well-chosen set of primitive types, on top of which more complex types can be built using macros.
Building the type system via macros atop a small kernel is an approach that has been pursued by Cur, a dependently-typed language developed with Racket, in which the tactics language is entirely built using macros, and does not depend on Cur’s trusted type-checking core.
1.3.2 Compiler-writing framework
Supports writing a compiler using many small passes (in the spirit of Nanopass)
Allows writing the compiler in a strongly-typed language
Uses immutable data structures for the Intermediate Representations (ASTs)
Supports backwards branches in the AST, making it rather an Abstract Syntax Graph (this is challenging due to the use of immutable data structures).
Provides easy manipulation of the Intermediate Representations: local navigation from node to node, global higher-order operations over many nodes, easy construction, easy serialization, with the guarantee that at no point can an incomplete representation be manipulated. These operations should handle backward arcs seamlessly.
Enforces structural invariants (either at compile-time or at run-time), and ensures via the type system that unchecked values cannot be used where a value respecting the invariant is expected.
Has extensive support for testing the compiler, by allowing the generation of random ASTs [(note that the user guides the random generation, it’s not fully automatic like quickcheck)]Todo, making it possible to read and write ASTs from files and compare them, and allows compiler passes to consume ASTs containing only the relevant fields (using row polymorphism).
1.3.3 Overview
The rest of this document is structured as follows:
Chapter 2 presents related work. It discusses approaches to extensible type systems in Section 2.1. Section 2.2 considers how structures, variants and polymorphism may exhibit different properties in different languages, and makes the case for bounded row polymorphism as a well-suited tool for building compilers. The Nanopass Compiler Framework is presented in Section 2.3, and techniques used to encode and manipulate graphs are studied in Section 2.4.
In Chapter 3, we give some example uses of the compiler framework described in this thesis. We indicate the pitfalls, bugs and boilerplate avoided thanks to its use. [Move this above the related work?]Todo
Chapter 4 gives an overview of the features of Typed Racket’s type system. It then formalises the features relevant to our work, by giving the subtyping rules, as well as the builder and accessor primitives for values of these types.
Section 5.1 explains how we implemented support for type-level macros as an extension to Typed Racket, without modifying the core implementation. We detail the reduction rules which allow translating a type making use of expanders into a plain type.
Section 5.2 discusses the flavour of algebraic datatypes which we chose to implement atop Typed Racket, as well as its extension with row types. We explain in detail our goals and constraints, and present two implementations — a first attempt, which unfortunately required verbose annotations for some “row operations”, and a second solution, which greatly reduces the number of annotations required, by using a different implementation strategy. We give formal semantics for these extensions, give the reduction rules which translate them back to Typed Racket’s type system, and show that this reduction preserves the original semantics. We then finally define a layer of syntactic sugar which allows convenient use of these new type primitives.
Section 6.1 builds upon these algebraic datatypes and row types to define a typed version of the Nanopass Compiler Framework which operates on abstract syntax trees. We explain the notions of tree nodes, mapping functions and catamorphisms, show how these interact with row typing, and give an overview of the implementation. We then extend this with detached mappings: this feature allows the user to map a function over all nodes of a given type in a graph, regardless of the structure of the graph outside of the relevant parts which are manipulated by the mapping function. This allows test data to omit not only irrelevant branches of the abstract syntax tree (by omitting the fields pointing to these branches), but also irrelevant parts located above the interesting nodes. In other words, this feature allows cutting away the top of the abstract syntax tree and gluing together the resulting forest. We also formally describe the result of applying a set of detached or regular mapping functions to an input tree.
Section 6.2 extends this typed version of Nanopass to handle directed acyclic graphs. We start by considering concerns such as equality of nodes (for which the previous chapter assumed a predicate existed, without actually implementing it), and hash consing. This allows us to prepare the ground for the extension presented in the next chapter, namely graphs.
We can then introduce support for cycles in Section 6.3: instead of describing abstract syntax tree transformations, it then becomes possible to describe graph transformations. To this end, we introduce new kinds of nodes: placeholders, which are used during the construction of the graph, with-indices nodes, which encode references to neighbouring nodes as indices into arrays containing all nodes of a given type, for the current graph, and with-promises nodes, which hide away this implementation detail by lazily resolving all references, using a uniform API. This allows all fields of a given node to be accessed in the same way, while still allowing logical cycles built atop a purely immutable representation.
We give an overview of how our implementation handles cycles, using worklists which gather and return placeholders when the mapping functions perform recursive calls, and subsequently turn the results into with-indices nodes, then into with-promises ones. We then update our notion of equality and hash consing to account for cycles, and update the formal semantics too.
Extra safety can be obtained by introducing some structural invariants which constrain the shape of the graph. For example, it is possible to ensure the well-scopedness of variable references. Another desirable property would be the absence of cycles, either to better model an IR in which cyclic references are forbidden at some point by the target language, or to detect early on changes in the IR which may break code assuming the absence of cycles. A third option would be to ensure that “parent” pointers are correct, and that the node containing them is indeed referenced by the parent (i.e., ensure the presence of well-formed cycles). Section 6.4 presents an extension of our graph manipulation framework, which allows the specification of structural invariants. These can in some cases be checked statically; in other cases it may be necessary to follow a macro-enforced discipline, and as a last resort, a dynamic check may be performed.
We further explain how we use phantom types to reflect at the type level which invariants were checked on an instance of a graph. The types used to represent that an instance satisfies an invariant are chosen so that instances with stronger invariants are subtypes of instances with weaker invariants.
Finally, in Section 6.5 we succinctly present some extensions which could be added to the framework presented in the previous chapters. We discuss how it would be possible to garbage-collect unused parts of the graph when only a reference to an internal node is kept, and the root is logically unreachable. Another useful feature would be the ability to define general-purpose graph algorithms (depth-first traversal, topological sort, graph colouring, and so on), operating on a subset of the graph’s fields. This would allow these common operations to be performed while considering only a subgraph of the one being manipulated. Finally, we mention the possibility of implementing an α-equivalence comparison operator.
In Chapter 7, we present more examples and revisit the initial examples illustrating our goals in the light of the previous chapters.
We ported the most complete compiler written with Nanopass (a Scheme compiler) to our framework; we summarise our experience and compare our approach with Nanopass, by indicating the changes required, in particular how many additional type annotations were necessary.
Finally, we conclude and list future work directions.
2 State of the art
\def\ifmathjax#1{#1}\def\iflatex#1{}
2.1 Extending the type system via macros
Our work explores one interesting use of macros: their use to extend a programming language’s type system.
Chang, Knauth and Greenman (Chang et al. 2017) took the decision to depart from Typed Racket, and implemented a new approach, which allows type systems to be implemented as macros. Typing information about identifiers is threaded across the program at compile-time, and macros can decide whether a term is well-typed or not.
Another related project is Cur (https://github.com/wilbowma/cur), a dependent type system implemented using Racket macros.
Bracha suggests that pluggable type systems should be used (Bracha 2004). One such system, JavaCOP, is presented by Andreae, Noble, Markstrum and Millstein (Andreae et al. 2006) as a tool which “enforces user-defined typing constraints written in a declarative and expressive rule language”.
In contrast, Typed Racket (Tobin-Hochstadt and Felleisen 2008) was implemented using macros on top of “untyped” Racket, but was not designed as an extensible system: new rules in the type system must be added to the core implementation, a system which is complex to approach.
Following work by Asumu Takikawa (https://github.com/racket/racket/pull/604), we extended Typed Racket with support for macros in the type declarations and annotations. We call this sort of macro “type expanders”, following the commonly-used naming convention (e.g. “match expanders” are macros which operate within patterns in pattern-matching forms). These type expanders allow users to easily extend the type system with new kinds of types, as long as these can be translated back to the types offered natively by Typed Racket.
While the Type Systems as Macros approach by Chang, Knauth and Greenman (Chang et al. 2017) allows greater flexibility by not relying on a fixed set of core types, it also places on the programmer the burden of ensuring that the type checking macros are sound. In contrast, our type expanders rely on Typed Racket’s type checker, which will still catch type errors in the fully-expanded types. In other words, writing type expanders is safe: they do not require any specific trust, and translate down to plain Typed Racket types, where any type error would still be caught.
An interesting aspect of our work is that the support for type expanders was implemented without any change to the core of Typed Racket. Instead, the support for type expanders itself is available as a library, which overrides special forms like define, lambda or cast, enhancing them by pre-processing type expanders, and translating back to the “official” forms from Typed Racket. It is worth noting that Typed Racket itself is implemented in a similar way: special forms like define and lambda support plain type annotations, and translate back to the “official” forms from so-called “untyped” Racket. In both cases, this approach follows the Racket spirit of implementing languages as libraries (Tobin-Hochstadt et al. 2011).
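The overriding idiom can be sketched as follows. This is a minimal, hypothetical illustration of the general “languages as libraries” pattern, not the actual type expander implementation: a module exports its own define under the official name, pre-processing its input before delegating to the base form.

```racket
#lang racket
;; Hypothetical sketch of the "languages as libraries" idiom: a module
;; exports its own `define` under the official name, pre-processing
;; before delegating to the base form.
(module mylang racket
  (provide (except-out (all-from-out racket) define)
           (rename-out [my-define define]))
  (define-syntax-rule (my-define name val)
    ;; a pre-processing step (e.g. expanding type expanders) would go
    ;; here; this sketch just delegates to the base `define`
    (define name val)))

;; A client module written in the `mylang` "language":
(module use (submod ".." mylang)
  (provide x)
  (define x 42))

(require (submod "." use))
x ; => 42
```

A client of mylang sees the usual surface syntax; only the expansion differs.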
2.2 Algebraic datatypes for compilers
I used polymorphic variants to solve the assert false problems in my lexical analyser, along with some subtyping. You have to write the typecasts explicitly, but that aside it is very powerful (a constructor can “belong” to several types etc.).
Personal communication from a friend
The phc-adt library implements algebraic datatypes (variants and structures) which are adapted to compiler-writing.
“datatype” uses nominal typing, while we use structural typing (i.e. two identical type declarations in distinct modules yield the same type in our case). This avoids the need to centralize the type definition of ASTs.
“datatype” uses closed variants, where a constructor can only belong to one variant. We simply define variants as a union of constructors, where a constructor can belong to several variants. This allows later passes in the compiler to add or remove cases of variants, without the need to duplicate all the constructors under slightly different names.
“datatype” does not support row polymorphism, or similarly the update and extension of its product types with values for existing and new fields, regardless of optional fields. We implement the latter.
[Cite the variations on variants paper (for Haskell)]Todo
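The contrast above can be sketched in Typed Racket. This is a hypothetical toy AST, not the actual phc-adt API: constructors are ordinary structs, and variants are plain unions, so the same constructor may belong to several variants without duplication.

```racket
#lang typed/racket
;; Hypothetical toy AST (not the actual phc-adt API): constructors are
;; ordinary structs, and variants are plain unions, so the same
;; constructor may belong to several variants.
(define-type Expr (U Num Add Let))
(define-type Core (U Num Add)) ; a later pass has eliminated Let
(struct Num ([value : Integer]) #:transparent)
(struct Add ([l : Expr] [r : Expr]) #:transparent)
(struct Let ([x : Symbol] [e : Expr] [body : Expr]) #:transparent)

;; The very same constructors build values of both variants:
(ann (Add (Num 1) (Num 2)) Core)
(ann (Add (Num 1) (Num 2)) Expr)
```

With closed variants, the Core pass would instead need renamed copies of Num and Add.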
2.2.1 The case for bounded row polymorphism
[Explain the “expression problem”.]Todo [Talk about the various ways in which it can be “solved”, and which tradeoffs we aim for. Mention extensible-functions, another solution to the expression problem in Racket, but with rather different tradeoffs.]Todo
We strive to implement compilers using many passes. A pass should be able to accept a real-world AST, and accordingly produce a real-world transformed AST as its output. It should also be possible to use lightweight “mock” ASTs, containing only the values relevant to the passes under test (possibly only the pass being called, or multiple passes tested from end to end). The pass should then return a corresponding simplified output AST, omitting the fields which are irrelevant to this pass (and were not part of the input). Since the results of the pass on a real input and on a test input are equivalent modulo the irrelevant fields, this allows testing the pass in isolation with simple data, without having to fill in each irrelevant field with null (and hope that they are indeed irrelevant, without a comforting guarantee that this is the case), as is commonly done when creating “mocks” to test object-oriented programs.
This can be identified as a problem similar to the “expression problem”. In our context, we do not strive to build a compiler which can be extended in an “open world”, by adding new node types to the AST, or new operations handling these nodes. Rather, the desired outcome is to allow passes to support multiple known subsets of the same AST type, from the start.
We succinctly list below some of the different sorts of polymorphism, and argue that row polymorphism is well-suited for our purpose. More specifically, bounded row polymorphism gives the flexibility needed when defining passes which keep some fields intact (without reading their value), but the boundedness ensures that changes in the input type of a pass do not go unnoticed, so that the pass can be amended to handle the new information, or its type can be updated to ignore this new information.
Subtyping (also called inclusion polymorphism, subtype polymorphism, or nominal subtyping?): subclasses and interfaces in C# and Java, sub-structs and union types in Typed Racket, polymorphic variants in CAML (Minsky et al. 2013b, chap. 6, sec. “Polymorphic Variants”)
Multiple inheritance. NIT, CLOS, C++, C# interfaces, Java interfaces. As an extension in “untyped” Racket with Alexis King’s safe multimethods (https://lexi-lambda.github.io/blog/2016/02/18/simple-safe-multimethods-in-racket/).
This could in principle help in our case: AST nodes would have .withField(value) methods returning a copy of the node with the field’s value updated, or a node of a different type with that new field, if it is not present in the initial node type. This would however require the declaration of many such methods in advance, so that they can be used when needed (or with a recording mechanism like the one we use, so that between compilations the methods called are remembered and generated on the fly by a macro). Furthermore, Typed Racket lacks support for multiple inheritance on structs. It supports multiple inheritance for classes [?]Todo, but classes currently lack the ability to declare immutable fields, which in turn causes problems with occurrence typing (see the note in the “row polymorphism” point below).
Parametric polymorphism: Generics in C# and Java, polymorphic types in CAML and Typed Racket
F-bounded polymorphism: Java, C#, C++, Eiffel. Possible to mimic to some extent in Typed Racket with (unbounded) parametric polymorphism and intersection types. [Discuss how it would work/not work in our case.]Todo
Operator overloading (also called overloading polymorphism?) and multiple dispatch:
Operator overloading in C#
Nothing in Java aside from the built-in cases for arithmetic and string concatenation, but those are not extensible
C++
typeclasses in Haskell? [I’m not proficient enough in Haskell to be sure or give a detailed description, I have to ask around to double-check.]Todo
LISP (CLOS): has multiple dispatch
nothing built-in in Typed Racket.
CAML?
Coercion polymorphism (automatic casts to a given type). This includes Scala’s implicits, C# implicit coercion operators (user-extensible, but at most one coercion operator is applied automatically, so if there is a coercion operator A \rightarrow{} B, and a coercion operator B \rightarrow{} C, it is still impossible to supply an A where a C is expected without manually coercing the value), and C++’s implicit conversions, where single-argument constructors are treated as implicit conversion operators, unless annotated with the explicit keyword. Similarly to C#, C++ allows only one implicit conversion, not two or more in a chain.
Struct type properties in untyped Racket can somewhat be used to that effect, although they are closer to Java’s interfaces than to coercion polymorphism. Struct type properties are unsound in Typed Racket and are not represented within the type system, so their use is subject to caution anyway.
Coercion (downcasts). Present in most typed languages. This would not help in our case, as the different AST types are incomparable (especially since Typed Racket lacks multiple inheritance).
Higher-kinded polymorphism: Type which is parameterized by a \mathit{Type} \rightarrow{} \mathit{Type} function. Scala, Haskell. Maybe CAML?
The type expander library which we developed for Typed Racket supports \Lambda{}, used to describe anonymous type-level macros. They enable some limited form of \mathit{Type} \rightarrow{} \mathit{Type} functions, but are actually applied at macro-expansion time, before typechecking is performed, which limits their usefulness in some cases. For example, they cannot cooperate with type inference. Also, any recursive use of type-level macros must terminate, unless the type “function” manually falls back to using \mathit{Rec} to create a regular recursive type. This means that a declaration like F(X) := X × F(F(X)) is not possible using anonymous type-level macros alone.
As an example of this use of the type expander library, our cycle-handling code uses internally a “type traversal” macro. In the type of a node, it performs a substitution on some subparts of the type. It is more or less a limited form of application of a whole family of type functions a{}_i \rightarrow{} b{}_i, which have the same inputs a{}_i \ldots{}, part of the initial type, but different outputs b{}_i \ldots{} which are substituted in place of the a{}_i \ldots{} in the resulting type. The “type traversal” macro expands the initial type into a standard polymorphic type, which accepts the desired outputs b{}_i \ldots{} as type arguments.
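For contrast, here is a minimal example of a regular recursive type in Typed Racket: Rec ties the recursive knot inside the type system itself, whereas an anonymous type-level macro would have to terminate during expansion.

```racket
#lang typed/racket
;; Rec ties the recursive knot inside the type system itself, whereas
;; an anonymous type-level macro must terminate during expansion.
(define-type (ListOf A) (Rec L (U Null (Pair A L))))
(ann '(1 2 3) (ListOf Integer))
```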
Lenses. These can in a way be compared to explicit coercions, where the coercion is reversible and the accessible parts can be altered.
Structural polymorphism (also sometimes called static duck-typing): Scala, TypeScript. It is also possible in Typed Racket, using the algebraic datatypes library which we implemented. Possible to mimic in Java and C# with interfaces “selecting” the desired fields, but the interface has to be explicitly implemented by the class (i.e. at the definition site, not at the use-site).
Palmer et al. present TinyBang (Palmer et al. 2014), a typed language in which flexible manipulation of objects is possible, including adding and removing fields, as well as changing the type of a field. They implement in this way a sound, decidable form of static duck typing, with functional updates which can add new fields and replace the value of existing fields. Their approach is based on two main aspects:
Type-indexed records supporting asymmetric concatenation: by concatenating two records r{}_1 \& r{}_2, a new record is obtained containing all the fields from r{}_1 (associated to their value in r{}_1), as well as the fields from r{}_2 which do not appear in r{}_1 (associated to their value in r{}_2). Primitive types are eliminated by allowing the use of type names as keys in the records: integers then become simply records with a int key, for example.
Dependently-typed first-class cases: pattern-matching functions are expressed as pattern \mathbin{\texttt{->}} expression, and can be concatenated with the \& operator, to obtain functions matching against different cases, possibly with a different result type for each case. The leftmost cases can shadow existing cases (i.e. the case which is used is the leftmost case for which the pattern matches successfully).
TinyBang uses an approach which is very different from the one we followed in our Algebraic Data Types library, but contains the adequate primitives to build algebraic data types which would fulfill our requirements (aside from the ability to bound the set of extra “row” fields). We note that our flexible structs, which are used within the body of node-to-node functions in passes, do support an extension operation, which is similar to TinyBang’s \&, with the left-hand side containing a constant and fixed set of fields.
Row polymorphism: apparently, quoting a post on Quora (https://www.quora.com/Object-Oriented-Programming-What-is-a-concise-definition-of-polymorphism):
Mostly only concatenative and functional languages (like Elm and PureScript) support this.
Classes in Typed Racket can have a row type argument. However, classes in Typed Racket cannot (yet) have immutable fields, and therefore occurrence typing does not work on class fields. Occurrence typing is an important idiom in Typed Racket, used to achieve safe but concise pattern-matching, a feature frequently used when writing compilers.
Our Algebraic Data Types library implements a bounded form of row polymorphism, and a separate implementation (used within the body of node-to-node functions in passes) allows unbounded row polymorphism.
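As a minimal illustration of occurrence typing (on an ordinary immutable binding, since class fields are not eligible), the predicate check narrows the type of the variable in each branch:

```racket
#lang typed/racket
;; Occurrence typing: the predicate check narrows the type of x in
;; each branch, giving safe but concise dispatch on a union type.
(: describe (-> (U String Integer) String))
(define (describe x)
  (if (string? x)
      (string-append "string: " x)   ; here x : String
      (number->string (add1 x))))    ; here x : Integer

(describe "hi") ; => "string: hi"
(describe 41)   ; => "42"
```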
[Virtual types]Todo
So-called “untyped” or “uni-typed” languages: naturally support most of the above, but without static checks.
Operator overloading can be present in “untyped” languages, but is really an incarnation of single or multiple dispatch, based on the run-time, dynamic type (as there is no static type based on which the operation could be chosen). However, it is not possible in “untyped” languages and languages compiled with type erasure to dispatch on “types” with a non-empty intersection: it is impossible to distinguish the values, as they are not annotated statically with a type.
As mentioned above, Typed Racket does not have operator overloading, and since the inferred types cannot be accessed reflectively at compile-time, it is not really possible to construct it as a compile-time feature via macros. Typed Racket also uses type erasure, so the same limitation as for untyped languages applies when implementing some form of single or multiple dispatch at run-time: namely, the intersection of the types must be empty. [Here, mention (and explain in more detail later) our compile-time “empty-intersection check” feature (does not work with polymorphic variables).]Todo
[Overview of the existing “solutions” to the expression problems, make a summary table of their tradeoffs (verbosity, weaknesses, strengths).]Todo
[Compare the various sorts of subtyping and polymorphism in that light (possibly in the same table), even those which do not directly pose as a solution to the expression problem.]Todo
“Nominal types”: our tagged structures and node types are not nominal types.
The “trivial” Racket library tracks static information about types in simple cases. The “turnstile” Racket language [is a follow-up]Todo work, and allows the definition of new typed Racket languages. It tracks the types of values as they are assigned to variables or passed as arguments to functions or macros. These libraries could be used to implement operator overloads based on the static type of the arguments. They could also be used to implement unbounded row polymorphism in a way that does not cause a combinatorial explosion of the size of the expanded code. [Have a look at the implementation of row polymorphism in Typed Racket classes, cite their work if there is something already published about it.]Todo
From the literate program (tagged-structure-low-level):
Row polymorphism, also known as "static duck typing", is a type system feature which allows a single type variable to be used as a placeholder for several omitted fields, along with their types. The phc-adt library supports a limited form of row polymorphism: for most operations, a set of tuples of omitted field names must be specified, thereby indicating a bound on the row type variable.
This is both a limitation of our implementation (to reduce the combinatorial explosion of possible input and output types), as well as a desirable feature. Indeed, this library is intended to be used to write compilers, and a compiler pass should have precise knowledge of the intermediate representation it manipulates. It is possible that a compiler pass may operate on several similar intermediate representations (for example a full-blown representation for actual compilation and a minimal representation for testing purposes), which makes row polymorphism desirable. It is however risky to allow as an input to a compiler pass any data structure containing at least the minimum set of required fields: changes in the intermediate representation may add new fields which should, semantically, be handled by the compiler pass. A catch-all row type variable would simply ignore the extra fields, without raising an error. Thanks to the bound which specifies the possible tuples of omitted field names, changes to the input type will raise a type error, bringing the programmer’s attention to the issue. If the new type is legitimate, and does not warrant a modification of the pass, the fix is easy to implement: simply adding a new tuple of possibly omitted fields to the bound (or replacing an existing tuple) will allow the pass to work with the new type. If, on the other hand, the pass needs to be modified, the type system will have successfully caught a potential issue.
2.3 Writing compilers using many small passes (a.k.a. following the Nanopass Compiler Framework philosophy)
2.4 Cycles in intermediate representations of programs
[There already were a few references in my proposal for JFLA.]Todo [Look for articles about graph rewriting systems.]Todo
The following sections present the many ways in which cycles within the AST, CFG and other intermediate representations can be represented.
2.4.1 Mutable data structures
Hard to debug
When using lazy-loading, for example, it is easy to mistakenly load a class or method after the Intermediate Representation was frozen. Furthermore, unless a .freeze() method actually enforces this conceptual change from a mutable to an immutable representation, it can be unclear at which point the IR (or parts of it) is guaranteed to be complete and its state frozen. This is another factor making maintenance of such code difficult.
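A minimal sketch of the problem (with hypothetical node, add-child! and freeze! names): unless the freeze step is actually enforced, a late mutation silently corrupts a conceptually immutable IR.

```racket
#lang racket
;; Hypothetical sketch: a mutable IR node with an explicit freeze step.
;; Unless freeze! is actually enforced, late mutations go unnoticed.
(struct node (name [children #:mutable] [frozen? #:mutable]))

(define (add-child! n c)
  (when (node-frozen? n)
    (error 'add-child! "IR was already frozen"))
  (set-node-children! n (cons c (node-children n))))

(define (freeze! n) (set-node-frozen?! n #t))

(define root (node 'root '() #f))
(add-child! root (node 'f '() #f))
(freeze! root)
;; A late lazy-load is now caught at run time instead of silently
;; mutating the frozen IR:
;; (add-child! root (node 'g '() #f)) ; => error: IR was already frozen
```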
We are using ML to build a compiler that does low-level optimization. To support optimizations in classic imperative style, we built a control-flow graph using mutable pointers and other mutable state in the nodes. This decision proved unfortunate: the mutable flow graph was big and complex, and it led to many bugs. We have replaced it by a smaller, simpler, applicative flow graph based on Huet’s (1997) zipper. The new flow graph is a success; this paper presents its design and shows how it leads to a gratifyingly simple implementation of the dataflow framework developed by Lerner, Grove, and Chambers (2002).
2.4.2 Unique identifiers used as a replacement for pointers
Mono uses this approach (Evain and others 2008; Köplinger et al. 2014): it is very easy to use an identifier which refers to a missing object, or to an object from another version of the AST. It is also very easy to get things wrong when duplicating nodes (e.g. while specializing methods based on their caller), or when merging or removing nodes.
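A sketch of the hazard (with a hypothetical encoding): node references are bare integers into a side table, so nothing ties an identifier to the table, or to the AST version, it came from.

```racket
#lang racket
;; Sketch of the pitfall (hypothetical encoding): node references are
;; bare integers into a side table, so nothing ties an identifier to
;; the table (or AST version) it came from.
(define nodes (hash 0 '(call 1)      ; node 0 references node 1 by id
                    1 '(const 42)))

(define (resolve id) (hash-ref nodes id #f))

(resolve 1) ; => '(const 42)
(resolve 7) ; => #f : a dangling identifier fails silently
```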
2.4.3 Explicit use of other common graph representations
Adjacency lists, De Bruijn indices.
Error prone when updating the graph (moving nodes around, adding, duplicating or removing nodes).
Needs manual
2.4.4 Using lazy programming languages
- Lazy programming is harder to debug.
Quote (Nilsson and Fritzson 1993): Traditional debugging techniques are, however, not suited for lazy functional languages since computations generally do not take place in the order one might expect.
Quote (Nilsson and Fritzson 1993): Within the field of lazy functional programming, the lack of suitable debugging tools has been apparent for quite some time. We feel that traditional debugging techniques (e.g. breakpoints, tracing, variable watching etc.) are not particularly well suited for the class of lazy languages since computations in a program generally do not take place in the order one might expect from reading the source code.
To be usable, a language system must be accompanied by a debugger and a profiler. Just as with interlanguage working, designing such tools is straightforward for strict languages, but trickier for lazy languages.
Constructing debuggers and profilers for lazy languages is recognized as difficult. Fortunately, there have been great strides in profiler research, and most implementations of Haskell are now accompanied by usable time and space profiling tools. But the slow rate of progress on debuggers for lazy languages makes us researchers look, well, lazy.
How does one debug a program with a surprising evaluation order? Our attempts to debug programs submitted to the lazy implementation have been quite entertaining. The only thing in our experience to resemble it was debugging a multi-programming system, but in this case virtually every parameter to a procedure represents a new process. It was difficult to predict when something was going to happen; the best strategy seems to be to print out well-defined intermediate results, clearly labelled.
So-called “infinite” data structures constructed lazily have problems with equality and serialization. The latter is especially important for serializing and de-serializing Intermediate Representations for the purpose of testing, and is also very important for code generation: the backend effectively needs to turn the infinite data structure into a finite one. The Revised⁶ Report on Scheme requires the "equal?" predicate to correctly handle cyclic data structures, but efficient algorithms implementing this requirement are nontrivial (Adams and Dybvig 2008). Although any representation of cyclic data structures will at some point have to deal with equality and serialization, it is best if these concerns are abstracted away as much as possible.
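For example, Racket’s equal? terminates on cyclic data built with the reader’s graph notation:

```racket
#lang racket
;; Racket's equal? (following the R6RS requirement) terminates on
;; cyclic data; the reader's graph notation #0=... #0# builds cycles.
(define x (read (open-input-string "#0=(1 2 . #0#)")))
(define y (read (open-input-string "#0=(1 2 . #0#)")))
(equal? x y) ; => #t, even though both lists are cyclic
```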
2.4.5 True graph representations using immutable data structures
The Huet zipper (Huet 1997). Implementation in untyped Racket, but not Typed Racket (see the zippers package: https://github.com/david-christiansen/racket-zippers)
3 Goals, constraints and examples
\def\ifmathjax#1{#1}\def\iflatex#1{}
4 Typed Racket
\def\ifmathjax#1{#1}\def\iflatex#1{}
We start this section with some history: Lisp, the language with lots of parentheses, shortly following Fortran as one of the first high-level programming languages, was initially designed between 1956 and 1958, and subsequently implemented (McCarthy 1981). Dialects of Lisp generally support a variety of programming paradigms, including (but not limited to) functional programming and object-oriented programming (e.g. via CLOS, the Common Lisp Object System). One of the most prominent aspects of Lisp is homoiconicity, the fact that programs and data structures look the same. This enables programs to easily manipulate other programs, and led to the extensive use of macros. Uses of macros usually look like function applications, but, instead of invoking a target function at run-time, a macro will perform some computation at compile-time, and expand to some new code, which is injected as a replacement of the macro’s use.
The two main dialects of Lisp are Common Lisp and Scheme. Scheme follows a minimalist philosophy, where a small core is standardised (Abelson et al. 1998; Shinn et al. 2013; Sperber et al. 2009) and subsequently extended via macros and additional function definitions.
Racket, formerly named PLT Scheme, started as a Scheme implementation. Racket evolved, and the Racket Manifesto (Felleisen et al. 2015) presents it as a “programming-language programming language”, a language which helps with the creation of small linguistic extensions as well as entirely new languages. The Racket ecosystem features many languages covering many paradigms:
The racket/base language is a full-featured programming language which mostly encourages functional programming.
racket/class implements an object-oriented system, implemented atop racket/base using macros, and can be used along with the rest of the racket/base language.
racklog is a logic programming language in the style of Prolog. The Racket ecosystem also includes an implementation of Datalog.
Scribble can be seen as an alternative to LaTeX, and is used to create the Racket documentation. It also supports literate programming, by embedding chunks of code in the document which are then aggregated together. This thesis is in fact written using Scribble.
slideshow is a DSL (domain-specific language) for the creation of presentations, and can be thought of as an alternative to Beamer and SliTeX.
r5rs and r6rs are implementations of the corresponding scheme standards.
Redex is a DSL which allows the specification of reduction semantics for programming languages. It features tools to explore and test the defined semantics.
Typed Racket (Tobin-Hochstadt 2010; Tobin-Hochstadt and Felleisen 2008) is a typed variant of the main racket language. It is implemented as a macro which takes over the whole body of the program. That macro fully expands all other macros in the program, and then typechecks the expanded program.
Turnstile allows the creation of new typed languages. It takes a different approach when compared to Typed Racket, and threads the type information through assignments and special forms, in order to be able to typecheck the program during expansion, instead of doing so afterwards.
In the remainder of this section, we will present the features of Typed Racket’s type system, and then present formal semantics for a subset of those, namely the part which is relevant to our work. The Typed Racket Guide and The Typed Racket Reference provide good documentation for programmers who desire to use Typed Racket; we will therefore keep our overview succinct and gloss over most details.
4.1 Overview of Typed Racket’s type system
4.1.1 Simple primitive types
Typed Racket has types matching Racket’s repertoire of primitive values: Number, Boolean, Char, String, Void13 The Void type contains only a single value, #<void>, and is equivalent to the void type in C. It is the equivalent of the unit type of CAML and Haskell, and is often used as the return type of functions which perform side effects. It should not be confused with Nothing, the bottom type, which is not inhabited by any value and is similar to the type of Haskell’s undefined. Nothing can be used for example as the type of functions which never return; in that way it is similar to C’s __attribute__ ((__noreturn__)). and so on.
> (ann #true Boolean) - : Boolean
#t
> 243 - : Integer [more precisely: Positive-Byte]
243
> "Hello world" - : String
"Hello world"
> #\c - : Char
#\c
; The void function produces the void value.
; Void values on their own are not printed,
; so we place it in a list to make it visible.
> (list (void)) - : (Listof Void) [more precisely: (List Void)]
'(#<void>)
For numbers, Typed Racket offers a “numeric tower” of partially-overlapping types: Positive-Integer is a subtype of Integer, which is itself a subtype of Number. Zero, the type containing only the number 0, is both a subtype of Nonnegative-Integer (numbers ≥ 0) and of Nonpositive-Integer (numbers ≤ 0).
Typed Racket also includes a singleton type for each primitive value of these types: we already mentioned Zero, which is an alias of the 0 type. Every number, character, string and boolean value can be used as a type, which is only inhabited by the same number, character, string or boolean value. For example, 243 belongs to the singleton type 243, which is a subtype of Positive-Integer.
> 0 - : Integer [more precisely: Zero]
0
> (ann 243 243) - : Integer [more precisely: 243]
243
> #t - : Boolean [more precisely: True]
#t
4.1.2 Pairs and lists
Pairs are the central data structure of most Lisp dialects. They are used to build linked lists of pairs, terminated by '(), the null element. The null element has the type Null, while the pairs which build the list have the type (Pairof A B), where A and B are replaced by the actual types for the first and second elements of the pair. For example, the pair built using (cons 729 #true), which contains 729 as its first element, and #true as its second element, has the type (Pairof Number Boolean), or using the most precise singleton types, (Pairof 729 #true).
> (cons 729 #true) - : (Pairof Positive-Index True)
'(729 . #t)
> '(729 . #true) - : (Pairof Positive-Index True)
'(729 . #t)
Heterogeneous linked lists of fixed length can be given a precise type by nesting the same number of pairs at the type level. For example, the list built with (list 81 #true 'hello) has the type (List Number Boolean Symbol), which is a shorthand for the type (Pairof Number (Pairof Boolean (Pairof Symbol Null))). Lists in Typed Racket can thus be seen as the equivalent of a chain of nested 2-tuples in languages like CAML or Haskell. The analog in object-oriented languages with support for generics would be a class Pair<A, B>, where the generic type argument B could be instantiated by another instance of Pair, and so on.
> (cons 81 (cons #true (cons 'hello null))) - : (Listof (U 'hello Positive-Byte True))
'(81 #t hello)
> (ann (list 81 #true 'hello) (Pairof Number (Pairof Boolean (Pairof Symbol Null)))) - : (Listof (U Boolean Complex Symbol)) [more precisely: (List Number Boolean Symbol)]
'(81 #t hello)
The type of variable-length homogeneous linked lists can be described using the Listof type operator. The type (Listof Integer) is equivalent to (Rec R (U (Pairof Integer R) Null)). The Rec type operator describes recursive types, and U describes unions. Both of these features are described below, for now we will simply say that the previously given type is a recursive type R, which can be a (Pairof Integer R) or Null (to terminate the linked list).
> (ann (range 0 5) (Listof Number)) - : (Listof Number)
'(0 1 2 3 4)
4.1.3 Symbols
Another of Racket’s primitive datatypes is symbols. Symbols are interned strings: two occurrences of a symbol produce values which are pointer-equal if the symbols are equal (i.e. they represent the same string)14 This is true with the exception of symbols created with gensym and the like. gensym produces a fresh symbol which is not interned, and is therefore different from all existing symbols, and from all symbols created in the future.
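Interning can be observed directly with eq?, the pointer-equality predicate (a small sketch of ours):

```racket
> (eq? 'foo 'foo)          ; two occurrences of the same interned symbol
#t
> (eq? (gensym 'foo) 'foo) ; gensym produces a fresh, uninterned symbol
#f
```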
Typed Racket includes the Symbol type, to which all symbols belong. Additionally, there is a singleton type for each symbol: the type 'foo is only inhabited by the symbol 'foo.
> 'foo - : Symbol [more precisely: 'foo]
'foo
Singleton types containing symbols can be seen as similar to constructors without arguments in CAML and Haskell, and as globally unique enum values in object-oriented languages. The main difference resides in the scope of the declaration: two constructor declarations with identical names in two separate files will usually give distinct types and values. Similarly, when using the “type-safe enum” design pattern, two otherwise identical declarations of an enum will yield objects of different types. In contrast, two uses of an interned symbol in Racket and Typed Racket will produce identical values and types. A way of seeing this is that symbols are similar to constructors (in the functional programming sense) or enums which are implicitly declared globally.
> (module m1 typed/racket (define sym1 'foo) (provide sym1))
> (module m2 typed/racket (define sym2 'foo) (provide sym2))
> (require 'm1 'm2)
; The two independent uses of 'foo are identical:
> (eq? sym1 sym2) - : Boolean
#t
4.1.4 Unions
These singleton types may not seem very useful on their own. They can however be combined together with union types, which are built using the U type operator.
The union type (U 0 1 2) is inhabited by the values 0, 1 and 2, and by no other value. The Boolean type is actually defined as (U #true #false), i.e. the union of the singleton types containing the #true and #false values, respectively. The Nothing type, which is not inhabited by any value, is defined as the empty union (U). The type Any is the top type, i.e. it is a super-type of all other types, and can be seen as a large union including all other types, including those which will be declared later or in other units of code.
Unions of symbols are similar to variants which contain zero-argument constructors, in CAML or Haskell.
> (define v : (U 'foo 'bar) 'foo)
> v - : Symbol [more precisely: (U 'bar 'foo)]
'foo
> (set! v 'bar)
> v - : Symbol [more precisely: (U 'bar 'foo)]
'bar
; This throws an error at compile-time:
> (set! v 'oops) eval:5:0: Type Checker: type mismatch;
mutation only allowed with compatible types
expected: (U 'bar 'foo)
given: 'oops
in: oops
A union such as (U 'ca (List 'cb Number) (List 'cc String Symbol)) can be seen as roughly the equivalent of a variant with three constructors, ca, cb and cc, where the first has no arguments, the second has one argument (a Number), and the third has two arguments (a String and a Symbol).
The main difference is that a symbol can be used as part of several unions, e.g. (U 'a 'b) and (U 'b 'c), while constructors can often only be part of the variant used to declare them. Unions of symbols are in this sense closer to CAML’s so-called polymorphic variants (Minsky et al. 2013a) than to regular variants.
> (define-type my-variant (U 'ca (List 'cb Number) (List 'cc String Symbol)))
> (define v₁ : my-variant 'ca)
> (define v₂ : my-variant (list 'cb 2187))
> (define v₃ : my-variant (list 'cc "Hello" 'world))
Finally, it is possible to mix different sorts of types within the same union: the type (U 0 #true 'other) is inhabited by the number 0, the boolean #true, and the symbol 'other. Translating such a union to a language like CAML could be done by explicitly tagging each case of the union with a distinct constructor.
Implementation-wise, all values in the so-called “untyped” version of Racket are tagged: a few bits within the value’s representation are reserved and used to encode the value’s type. When considering the target of a pointer in memory, Racket is therefore able to determine if the pointed-to value is a number, boolean, string, symbol and so on. Typed Racket preserves these run-time tags. They can then be used to detect the concrete type of a value when its static type is a union. This detection is done simply by using Racket’s predicates: number?, string?, symbol? etc.
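For example, these predicates narrow a union type in the branches of a conditional (the function describe below is a hypothetical example of ours):

```racket
(: describe (→ (U Number String Symbol) String))
(define (describe v)
  (cond [(number? v) "a number"]   ; v : Number in this branch
        [(string? v) "a string"]   ; v : String in this branch
        [else "a symbol"]))        ; v : Symbol in this branch
```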
4.1.5 Intersections
Intersections are the converse of unions: instead of allowing a mixture of values of different types, an intersection type, described using the ∩ type operator, only allows values which belong to all types.
The intersection type (∩ Nonnegative-Integer Nonpositive-Integer) is the singleton type 0. The intersection of (U 'a 'b 'c) and (U 'b 'c 'd) will be (U 'b 'c), as 'b and 'c belong to both unions.
; :type shows the given type, or a simplified version of it
> (:type (∩ (U 'a 'b 'c) (U 'b 'c 'd))) (U 'b 'c)
Typed Racket is able to reduce some intersections, such as those given above, at compile-time. However, in some cases, it is forced to keep the intersection type as-is. For example, structs (described below) can, using special properties, impersonate functions. This mechanism is similar to PHP’s __invoke, or to the ability to overload operator() in C++. Typed Racket does not handle these properties (yet), and therefore cannot determine whether a given struct type also impersonates a function or not. This means that the intersection (∩ s (→ Number String)), where s is a struct type, cannot be reduced to Nothing, because Typed Racket cannot determine whether the struct s can act as a function or not.
Another situation where Typed Racket cannot reduce the intersection is when intersecting two function types (presented below).
(∩ (→ Number String) (→ Number Symbol))
(∩ (→ Number String) (→ Boolean String))
The first intersection seems like it could be simplified to (→ Number (∩ String Symbol)), and the second one could be simplified to (→ (U Number Boolean) String); however, the equivalence between these types has not been implemented (yet) in Typed Racket, so we do not rely on it. Note that this issue is not a soundness issue: it only prevents using some values at types to which they belong in principle, but it cannot be exploited to assign a value to a variable with an incompatible type.
Finally, when some types are intersected with a polymorphic type variable, the intersection cannot be computed until the polymorphic type is instantiated.
When Typed Racket is able to perform a simplification, occurrences of Nothing (the bottom type) propagate outwards in some cases: pairs and struct types which contain Nothing as one of their elements are collapsed to Nothing. This propagation of Nothing starts from occurrences of Nothing in the parts of the resulting type which are traversed by the intersection operator. It collapses the containing pairs and struct types to Nothing, moving outwards until the ∩ operator itself is reached. In principle, the propagation could continue past that point, but this is not implemented yet in Typed Racket15 See Issue #552 on Typed Racket’s GitHub repository for more details on what prevents implementing a more aggressive propagation of Nothing.
The type (∩ 'a 'b) therefore gets simplified to Nothing, and the type (∩ (Pairof 'a 'x) (Pairof 'b 'x)) also simplifies to Nothing (Typed Racket initially pushes the intersection down the pairs, so that the type first becomes (Pairof (∩ 'a 'b) (∩ 'x 'x)), which is simplified to (Pairof Nothing 'x), and the occurrence of Nothing propagates outwards). However, if the user directly specifies the type (Pairof (∩ 'a 'b) Integer), it is simplified to (Pairof Nothing Integer), but the Nothing does not propagate outwards beyond the initial use of ∩.
> (:type (∩ 'a 'b)) Nothing
> (:type (∩ (Pairof 'a 'x) (Pairof 'b 'x))) Nothing
> (:type (Pairof (∩ 'a 'b) Integer))
(Pairof Nothing Integer)
[can expand further: Integer]
A simple workaround exists: the outer type, which could be collapsed to Nothing, can be intersected again with a type of the same shape. The outer intersection will traverse both types (the desired one and the “shape”), and propagate the leftover Nothing further out.
> (:type (Pairof (∩ 'a 'b) Integer))
(Pairof Nothing Integer)
[can expand further: Integer]
> (:type (∩ (Pairof (∩ 'a 'b) Integer) (Pairof Any Any))) Nothing
These intersections are not very interesting on their own, as in most cases it is possible to express the resulting simplified type without using the intersection operator. They become more useful when mixed with polymorphic types: intersecting a polymorphic type variable with another type can be used to restrict the actual values that may be used. The type (∩ A T), where A is a polymorphic type variable and T is a type defined elsewhere, is equivalent to the use of bounded type parameters in Java or C#. In C#, for example, the type (∩ A T) would be written using a where A : T clause.
4.1.6 Structs
Racket also supports structs, which are mappings from fields to values. A struct is further distinguished by its struct type: instances of two struct types with the same name and fields, declared in separate files, can be differentiated using the predicates associated with these structs. Structs in Racket can be seen as the analog of classes containing only fields (but no methods) in C# or Java. Such classes are sometimes called “Plain Old Data (POD) Objects”. Structs belong to a single-inheritance hierarchy: instances of the descendants of a struct type are recognised by their ancestor’s predicate. When a struct inherits from another, it includes its parent’s fields, and can add extra fields of its own.
Each struct declaration within a Typed Racket program additionally declares a corresponding type.
> (struct parent ([field₁ : (Pairof String Symbol)]) #:transparent)
> (struct s parent ([field₂ : Integer] [field₃ : Symbol]) #:transparent)
> (s (cons "x" 'y) 123 'z) - : s
(s '("x" . y) 123 'z)
In Typed Racket, structs can have polymorphic type arguments, which can be used inside the types of the struct’s fields.
> (struct (A B) poly-s ([field₁ : (Pairof A B)] [field₂ : Integer] [field₃ : B]) #:transparent)
> (poly-s (cons "x" 'y) 123 'z) - : (poly-s String (U 'y 'z))
(poly-s '("x" . y) 123 'z)
Racket further supports struct type properties, which can be seen as a limited form of method definitions for a struct, thereby making them closer to real objects. The same struct type property can be implemented by many structs, and the declaration of a struct type property is therefore roughly equivalent to the declaration of an interface with a single method.
Struct type properties are often considered a low-level mechanism in Racket. Among other things, a struct type property can only be used to define a single property at a time. When multiple “methods” have to be defined at once (for example, when defining the prop:equal+hash property, which requires the definition of an equality comparison function and two hashing functions), these can be grouped together in a list of functions, which is then used as the property’s value. “Generic interfaces” are a higher-level feature which, among other things, allows the definition of multiple “methods” as part of a single generic interface, and offers a friendlier API for specifying the “generic interface” itself (i.e. what object-oriented languages call an interface), as well as for specifying the implementation of said interface.
Typed Racket unfortunately offers no support for struct type properties and generic interfaces for now. It is impossible to assert at the type level that a struct implements a given property, and it is also not possible, for example, to describe the type of a function accepting any struct implementing a given property or generic interface. Finally, no type checks are performed on the body of functions bound to such properties, and no check verifies that a function implementation with the right signature is supplied to a given property. Since struct type properties and generics cannot be used in a type-safe way for now, we refrain from using these features, and only use them to implement some very common properties16 We built a thin macro wrapper which allows typechecking the implementation and signature of the functions bound to these two properties.: prop:custom-write, which is the equivalent of Java’s void toString(), and prop:equal+hash, which is equivalent to Java’s boolean equals(Object o) and int hashCode().
4.1.7 Functions
Typed Racket provides rich function types, to support some of the flexible use patterns allowed by Racket.
The simple function type below indicates that the function expects two arguments (an integer and a string), and returns a boolean:
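Such a type could, for example, be written as follows (the function long-enough? is a hypothetical example of ours):

```racket
(: long-enough? (→ Integer String Boolean))
(define (long-enough? n s)
  ;; Is the string s at least n characters long?
  (>= (string-length s) n))
```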
We note that unlike Haskell and CAML functions, Racket functions are not implicitly curried. To express the corresponding curried function type, one would write:
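A sketch of the curried counterpart (the name long-enough?/curried is a hypothetical example of ours; note the nested arrow in the type, and the curried define header):

```racket
(: long-enough?/curried (→ Integer (→ String Boolean)))
(define ((long-enough?/curried n) s)
  (>= (string-length s) n))
```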
A function may additionally accept optional positional arguments, and keyword (i.e. named) arguments, both mandatory and optional:
; Mandatory string, optional integer and boolean arguments:
(->* (String) (Integer Boolean) Boolean)
; Mandatory keyword arguments:
(→ #:size Integer #:str String Boolean)
; Mandatory #:str, optional #:size and #:opt:
(->* (#:str String) (#:size Integer #:opt Boolean) Boolean)
Furthermore, functions in Racket accept a catch-all “rest” argument, which allows for the definition of variadic functions. Typed Racket also allows expressing this at the type level, as long as the arguments covered by the “rest” clause all have the same type:
; The function accepts one integer and any number of strings:
(-> Integer String * Boolean)
; Same thing, with an optional symbol in between:
(->* (Integer) (Symbol) #:rest String Boolean)
One of Typed Racket’s main goals is to be able to typecheck idiomatic Racket programs. Such programs may include functions whose return type depends on the values of the input arguments. Similarly, case-lambda can be used to create lambda functions which dispatch to multiple behaviours based on the number of arguments passed to the function.
Typed Racket provides the case→ type operator, which can be used to describe the type of these functions:
; Allows 1 or 3 arguments, with the same return type.
(case→ (→ Integer Boolean)
       (→ Integer String Symbol Boolean))
; A similar type based on optional arguments allows 1, 2 or 3
; arguments in contrast:
(->* (Integer) (String Symbol) Boolean)
; The output type can depend on the input type:
(case→ (→ Integer Boolean)
       (→ String Symbol))
; Both features (arity and dependent output type) can be mixed:
(case→ (→ Integer Boolean)
       (→ Integer String (Listof Boolean)))
Another important feature, which can be found in the type system of most functional programming languages, and most object-oriented languages, is parametric polymorphism. Typed Racket allows the definition of polymorphic structs, as detailed above, as well as polymorphic functions. For example, the function cons can be considered as a polymorphic function with two polymorphic type arguments A and B, which takes an argument of type A, an argument of type B, and returns a pair of A and B.
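The type described above can be written as follows (a sketch using the ∀ notation; the type Typed Racket actually assigns to cons is somewhat more complex):

```racket
(∀ (A B) (→ A B (Pairof A B)))
```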
Typed Racket supports polymorphic functions with multiple polymorphic type variables, such as the one shown above. Furthermore, it allows one of the polymorphic variables to be followed by ellipses, indicating a variable-arity polymorphic type (Tobin-Hochstadt 2010). The dotted polymorphic type variable can be instantiated with a tuple of types, and will be replaced with that tuple where it appears. For example, the type
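A type of this kind could, for example, have the following shape (a sketch; the Void return type is an arbitrary choice of ours):

```racket
(∀ (A ...) (→ (List Number A ... A) Void))
```

Here (List Number A ... A) is a list type whose tail is the dotted variable A; instantiating A ... with Number String Boolean turns it into (List Number Number String Boolean).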
can be instantiated with Number String Boolean, which would yield the type for a function accepting a list of four elements: two numbers, a string and a boolean.
Dotted polymorphic type variables can only appear in some places. A dotted type variable can be used as the tail of a List type, so (List Number B ...) (a Number followed by any number of Bs) is a valid type, but (List B ... String) (any number of Bs followed by a String) is not. A dotted type variable can also be used to describe the type of a variadic function, as long as it appears after all other arguments, so (→ String B ... Void) (a function taking a String, then any number of Bs, and returning Void) is a valid type, but (→ B ... String Void) (a function taking any number of Bs, then a String, and returning Void) is not. Finally, a dotted type variable can be used to represent the last element of the tuple of returned values, for functions which return multiple values (which are described below).
When the context makes it unclear whether an ellipsis (…) indicates a dotted type variable, or is used to indicate a metasyntactic repetition at the level of mathematical formulas, we will write the first using τ₁ … τₙ, explicitly indicating the first and last elements, or using an overline, τ̄ [and we will write the second using tvar ooo]Todo
Functions in Racket can return one or several values. When the number of values returned is different from one, the result tuple can be destructured using special functions such as (call-with-values f g), which passes each value returned by f as a distinct argument to g. The special form (let-values ([(v₁ … vₙ) e]) body) binds each value returned by e to the corresponding vᵢ (the expression e must produce exactly n values). The type of a function returning multiple values can be expressed using the following notation:
(→ In₁ … Inₙ (Values Out₁ … Outₘ))
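As a brief illustration, the hypothetical function div+mod below (an example of ours) returns two values, which are then destructured with let-values:

```racket
(: div+mod (→ Integer Integer (Values Integer Integer)))
(define (div+mod n d)
  (values (quotient n d) (remainder n d)))

> (let-values ([(q r) (div+mod 17 5)])
    (list q r))
'(3 2)
```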
Finally, predicates (functions whose results can be interpreted as booleans) can be used to gain information about the type of their argument, depending on the result. The type of a predicate can include positive and negative filters, indicated with #:+ and #:-, respectively. The type of the string? predicate is:
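In the filter notation used here, that type can be sketched as follows (Typed Racket’s printed form may differ slightly):

```racket
(→ Any Boolean : #:+ String #:- (! String))
```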
In this notation, the positive filter #:+ String indicates that when the predicate returns #true, the argument is known to be a String. Conversely, when the predicate returns #false, the negative filter #:- (! String) indicates that the input could not (!) possibly have been a string. The information gained this way allows regular conditionals based on arbitrary predicates to work like pattern-matching:
> (define (f [x : (U String Number Symbol)])
    (if (string? x)
        ; x is known to be a String here:
        (ann x String)
        ; x is known to be a Number or a Symbol here:
        (ann x (U Number Symbol))))
The propositions do not necessarily need to refer to the value as a whole, and can instead give information about a sub-part of the value. Right now, the user interface for specifying paths can only target the left and right members of cons pairs, recursively. Internally, Typed Racket supports richer paths, and the type inference can produce filters which give information about individual structure fields, or about the result of forced promises, for example.
4.1.8 Occurrence typing
Typed Racket is built atop Racket, which does not natively support pattern matching. Instead, pattern matching forms are implemented as macros, and expand to nested uses of if.
As a result, Typed Racket needs to typecheck code with the following structure:
(λ ([v : (U Number String)])
  (if (string? v)
      (string-append v ".")
      (+ v 1)))
In this short example, the type of v is a union type including Number and String. After applying the string? predicate, the type of v is narrowed down to String in the then branch, and it is narrowed down to Number in the else branch. The type information gained thanks to the predicate comes from the filter part of the predicate’s type (as explained in Functions).
Occurrence typing only works on immutable variables and values. Indeed, if the variable is modified with set!, or if the subpart of the value stored within which is targeted by the predicate is mutable, it is possible for that value to change between the moment the predicate is executed, and the moment when the value is actually used. This places a strong incentive to mostly use immutable variables and values in Typed Racket programs, so that pattern-matching and other forms work well.
In principle, it is always possible to copy the contents of a mutated variable to a temporary one (or copy a mutable subpart of the value to a new temporary variable), and use the predicate on that temporary copy. The code in the then and else branches should also use the temporary copy, to benefit from the typing information gained via the predicate. In our experience, however, it seems that most macros which perform tasks similar to pattern-matching do not provide an easy means to achieve this copy. It therefore remains more practical to avoid mutation altogether when possible.
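The workaround can be sketched as follows (a minimal example of ours, assuming v could otherwise be mutated between the test and its uses):

```racket
(λ ([v : (U Number String)])
  (let ([tmp v])            ; copy into an immutable binding
    (if (string? tmp)       ; occurrence typing applies to tmp
        (string-append tmp ".")
        (+ tmp 1))))
```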
4.1.9 Recursive types
Typed Racket allows recursive types, both via (possibly mutually-recursive) named declarations, and via the Rec type operator.
In the following examples, the types Foo and Bar are mutually recursive. The type Foo matches lists with an even number of alternating Integer and String elements, starting with an Integer:
(define-type Foo (Pairof Integer Bar))
(define-type Bar (Pairof String (U Foo Null)))
This same type could alternatively be defined using the Rec operator. The notation (Rec R T) builds the type T, where occurrences of R are interpreted as recursive occurrences of T itself.
(Rec R (Pairof Integer (Pairof String (U R Null))))
4.1.10 Classes
The racket/class module provides an object-oriented system for Racket. It supports the definition of a hierarchy of classes with single inheritance, interfaces with multiple inheritance, mixins and traits (methods and fields which may be injected at compile-time into other classes), method renaming, and other features.
The typed/racket/class module makes most of the features of racket/class available to Typed Racket. In particular, it defines the following type operators:
Class is used to describe the features of a class, including the signature of its constructor, as well as the public fields and methods exposed by the class. We will note that a type expressed with Class does not mention the name of the class: any concrete implementation which exposes all (and only) the required methods, fields and constructor will inhabit the type. In other words, the types built with Class are structural, not nominal.
Object is used to describe the methods and fields which an already-built object bears.
The (Instance (Class …)) type is a shorthand for the Object type of instances of the given class type. It can be useful to describe the type of an instance of a class without repeating all the fields and methods (which could have been declared elsewhere).
In types described using Class and Instance, it is possible to omit fields which are not relevant. These fields get grouped under a single row polymorphic type variable. A row polymorphic function can, for example, accept a class with some existing fields, and produce a new class extending the existing one:
> (: add-my-field (∀ (r #:row)
                     (-> (Class (field [existing Number]) #:row-var r)
                         (Class (field [existing Number] [my-field String]) #:row-var r))))
> (define (add-my-field parent%)
    (class parent%
      (super-new)
      (field [my-field : String "Hello"])))
The small snippet of code above defines a function add-my-field which accepts a parent% class exporting at least the existing field (and possibly other fields and methods). It then creates and returns a subclass of the given parent% class, extended with the my-field field.
We consider the following class, with the required existing field, and a supplementary other field:
> (define a-class%
    (class object%
      (super-new)
      (field [existing : Integer 0]
             [other : Boolean #true])))
When passed to the add-my-field function, the row type variable is implicitly instantiated with the field [other Boolean]. The result of that function call is therefore a class with the three fields existing, my-field and other.
> (add-my-field a-class%) - : (Class (field (existing Number) (my-field String) (other Boolean)))
#<class:add-my-field>
These mechanisms can be used to perform reflective operations on classes like adding new fields and methods to dynamically-created subclasses, in a type-safe fashion.
The Row operator can be used along with row-inst to explicitly instantiate a row type variable to a specific set of fields and methods. The following call to add-my-field is equivalent to the preceding one, but does not rely on the automatic inference of the row type variable.
> ({row-inst add-my-field (Row (field [other Boolean]))} a-class%) - : (Class (field (existing Number) (my-field String) (other Boolean)))
#<class:add-my-field>
We will not further describe this object system here, as our work does not rely on this aspect of Typed Racket’s type system. We invite the curious reader to refer to the documentation for racket/class and typed/racket/class for more details.
We will simply note one feature which is so far missing from Typed Racket’s object system: immutability. It is not possible yet to indicate in the type of a class that a field is immutable, or that a method is pure (in the sense of always returning the same value given the same input arguments). The absence of immutability means that occurrence typing does not work on fields. After testing the value of a field against a predicate, it is not possible to narrow the type of that field, because it could be mutated by a third party between the check and future uses of the field.
4.1.11 Local type inference
Typed Racket’s type system is rich and contains many features. Among other things, it mixes polymorphism and subtyping, which notoriously make typechecking difficult.
The authors of Typed Racket claim that global type inference often produces indecipherable error messages, with a small change having repercussions on the type of terms located in other distant parts of the program.
The authors of Typed Racket suggest that type annotations are often beneficial as documentation. Since the type annotations are checked at compile-time, they additionally will stay synchronised with the code that they describe, and will not become out of date.
Instead of relying on global type inference, Typed Racket uses local type inference to determine the type of local variables and expressions. Typed Racket’s type inference can also determine the correct instantiation for most calls to polymorphic functions. It however requires type annotations in some places. For example, it is usually necessary to indicate the type of the parameters when defining a function.
4.2 Formal semantics for part of Typed Racket’s type system
The following definitions and rules are copied and adjusted from (Tobin-Hochstadt 2010), with the author’s permission. Some of the notations were changed to use those of (Kent et al. 2016).
We include below the grammar, semantics and typing rules related to the minimal core of the Typed Racket language17 The core language is defined in (Tobin-Hochstadt 2010), pp. 61–70., dubbed λ_TS, including extensions which add pairs18 The extensions needed to handle pairs are described in (Tobin-Hochstadt 2010), pp. 71–75., functions of multiple arguments, variadic functions and variadic polymorphic functions19 The extensions needed to handle functions of multiple arguments, variadic functions, and variadic functions where the type of the “rest” arguments is not uniform are described in (Tobin-Hochstadt 2010), pp. 91–77., intersection types, recursive types, symbols and promises. These features have been informally described in Overview of Typed Racket’s type system.
We purposefully omit extensions which allow advanced logic reasoning when propagating information gained by complex combinations of conditionals2020 The extensions which allow advanced logic reasoning are described in (Tobin-Hochstadt 2010), pp. 75–78., refinement types2121 The extensions which introduce refinement types are described in (Tobin-Hochstadt 2010), pp. 85–89., dependent refinement types2222 Dependent refinement types are presented in (Kent et al. 2016). (which allow using theories from external solvers to reason about values and their type, e.g. using bitvector theory to ensure that a sequence of operations does not produce a result exceeding a certain machine integer size), structs and classes. These extensions are not relevant to our work2323 We informally describe a translation of our system of records into structs in section [[??]]Todo, but settle for an alternative implementation in section [[??]]Todo which does not rely on structs., and their inclusion in the following semantics would needlessly complicate things.
4.2.1 Notations
We note a sequence of elements with \overrightarrow{y}. When there is more than one sequence involved in a rule or equation, we may use the notation \overrightarrow{y}\overset{\scriptscriptstyle\,\ifmathjax{\raise1mu{n}}\iflatex{n}}{\vphantom{y}} to indicate that there are n elements in the sequence. Two sequences can be forced to have the same number of elements in that way. We represent a set of elements (an “unordered” sequence) with the notation \def\overrightbracedarrow#1{\overset{\scriptscriptstyle{\raise1mu{\{}}}{\vphantom{#1}}\overrightarrow{#1}\overset{\scriptscriptstyle{\raise1mu{\}}}}{\vphantom{#1}}}\overrightbracedarrow{y}. The use of ellipses in α \mathbf{\ldots{}} does not indicate the repetition of α. Instead, it indicates that α is a variadic polymorphic type variable: a placeholder for zero or more types which will be substituted for occurrences of α when the polymorphic type is instantiated. These ellipses appear as such in the Typed Racket source code, and are the reason we use the notation \overrightarrow{y} to indicate repetition, instead of the ellipses commonly used for that purpose. Finally, an empty sequence of repeated elements is sometimes noted ϵ.
The judgements are written following the usual convention, where Γ is the environment which associates variables to their type. The Δ environment contains the type variables within scope, and is mostly used to determine the validity of types.
The environments can be extended as follows:
The typing information \mathrm{R} associated with an expression contains the type of the expression, as well as aliasing information and other propositions which are known to be conditionally true depending on the value of the expression at run-time. These pieces of information are described in more detail below. Since the typing information \mathrm{R} is often inlined in the typing judgement, a typing judgement will generally have the following form:
In this notation, the + and - signs in \phi{}^+ and \phi{}^- are purely syntactical, and serve to distinguish the positive and negative filters, which are instances of the nonterminal \phi.
The various nonterminals used throughout the language are written in italics and are defined using the notation:
Additionally, a symbol assigned to a nonterminal may be used as a placeholder in rules and definitions, implicitly indicating that the element used to fill in that placeholder should be an instance of the corresponding nonterminal. When multiple such placeholders are present in a rule, they will be subscripted to distinguish between the different occurrences. The subscripts i, j, k and l are often used for repeated sequences.
In later chapters, we extend already-defined non-terminals using the notation:
Typing rules and other rules are described following the usual natural deduction notation.
Metafunctions (i.e. functions which operate on types as syntactical elements, or on other terms of the language) are written in a roman font. The meta-values \mathrm{true} and \mathrm{false} indicate logical truth and falsehood respectively.
Language operators are written in bold face:
Values are written using a bold italic font:
Type names start with a capital letter, and are written using a bold font:
We indicate the syntactical substitution of y with z in w using the notation w[y↦z]. When given several elements to replace, the substitution operator performs a parallel substitution (that is, w[x↦y\ y↦z] will not replace the occurrences of y introduced by the first substitution).
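The distinction between parallel and sequential substitution can be made concrete with a small sketch (the helper parallel-subst is ours, introduced purely for illustration): every replacement is looked up against the original term, so occurrences of y introduced by the substitution x↦y are not themselves rewritten by y↦z.

```racket
#lang racket

;; A sketch of parallel substitution on s-expressions: all
;; replacements are looked up in the original term, so occurrences
;; of y introduced by x ↦ y are not themselves rewritten.
(define (parallel-subst term mapping)
  (cond
    [(assq term mapping) => cdr]          ; a variable with a replacement
    [(pair? term)
     (cons (parallel-subst (car term) mapping)
           (parallel-subst (cdr term) mapping))]
    [else term]))

;; w[x ↦ y, y ↦ z] applied to (x y) yields (y z), not (z z):
(parallel-subst '(x y) '((x . y) (y . z)))  ; => '(y z)
```

A sequential substitution, by contrast, would first rewrite (x y) to (y y) and then to (z z).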
[Succinctly describe the other conventions used in the thesis, if any were omitted above.]Todo
[Define the meta substitution and equality operators precisely.]Todo
4.2.2 Names and bindings
In the following sections, we assume that all type variable names which occur in binding positions are unique. This assumption could be made irrelevant by explicitly renaming, in the rules below, all type variables to fresh, unique names. Performing this substitution would be a way of encoding a notion of scope and the possibility for one identifier to hide another. However, the Racket language features macros which routinely produce new binding forms. The macro system in Racket is a hygienic one, and relies on a powerful model of the notion of scope (Flatt 2016). Extending the rules below with a simplistic model of Racket’s notion of scope would not do justice to the actual system, and would needlessly complicate the rules. Furthermore, Typed Racket only typechecks fully-expanded programs, in which the binding of each identifier has normally already been determined2424 Typed Racket actually still determines the binding for type variables by itself, but we consider this to be an implementation detail..
4.2.3 Expressions
The following expressions are available in the subset of Typed Racket which we consider. These expressions include references to variables, creation of basic values (numbers, booleans, lists of pairs ending with \textbf{$\mathord{{\bfit \text{null}}}$}, symbols, promises), a variety of lambda functions with different handling of rest arguments (fixed number of arguments, polymorphic functions with a uniform list of rest arguments and variadic polymorphic functions, as well as polymorphic abstractions), a small sample of primitive functions which are part of Racket’s library and a few operations manipulating these values (function application and polymorphic instantiation, forcing promises, symbol comparison and so on).
Symbol literals are noted as s \in{} \mathcal{S} and the universe of symbols (which includes symbol literals and fresh symbols created via (\textbf{gensym})) is noted as s{}’ \in{} \mathcal{S}{}’.
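The distinction between the universe of symbol literals \mathcal{S} and the larger universe \mathcal{S}{}’ of fresh symbols can be observed directly in plain Racket (an illustrative snippet, not part of the formal development):

```racket
#lang racket

;; Two occurrences of the same symbol literal denote the same symbol…
(eq? 'foo 'foo)          ; => #t

;; …whereas gensym creates a fresh, uninterned symbol, distinct
;; from every existing symbol even when it shares a print name.
(define fresh (gensym 'foo))
(eq? fresh 'foo)         ; => #f
(eq? fresh fresh)        ; => #t
```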
4.2.4 Primitive operations (library functions)
Racket offers a large selection of library functions, which we consider as primitive operations. A few of these are listed below; their types are given later, once the type system has been introduced. \textit{number?}, \textit{pair?} and \textit{null?} are predicates for the corresponding types. \textit{car} and \textit{cdr} are accessors for the first and second elements of a pair, which can be created using \textit{cons}. The \textit{identity} function returns its argument unmodified, and \textit{add1} returns its numeric argument plus 1. These last two functions are simply listed as examples.
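The behaviour of these primitives can be checked at a Racket prompt (an illustrative snippet; the values chosen are arbitrary):

```racket
#lang racket

;; The primitive operations listed above, applied to simple values.
(number? 42)            ; => #t
(pair? (cons 1 2))      ; => #t
(null? '())             ; => #t
(car (cons 1 2))        ; => 1
(cdr (cons 1 2))        ; => 2
(add1 41)               ; => 42
(identity 'unchanged)   ; => 'unchanged
```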
4.2.5 Values
These expressions and primitive functions may produce or manipulate the following values:
The \textbf{$\mathord{{\bfit \text{list}}}$} value notation is defined as a shorthand for a \textbf{$\mathord{{\bfit \text{null}}}$}-terminated linked list of pairs.
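This shorthand mirrors the behaviour of Racket’s own list constructor (illustrative snippet):

```racket
#lang racket

;; The list shorthand expands to null-terminated nested pairs:
(equal? (list 1 2 3)
        (cons 1 (cons 2 (cons 3 null))))   ; => #t
```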
4.2.6 Evaluation contexts
The operational semantics given below rely on the following evaluation contexts:
4.2.7 Typing judgement
The Γ ; Δ ⊢ e : \mathrm{R} typing judgement indicates that the expression e has type τ. The Γ typing environment maps variables to their type (and to extra information), while the Δ environment stores the polymorphic type variables, variadic polymorphic type variables and recursive type variables which are in scope.
Additionally, the typing judgement indicates a set of propositions \phi{}^- which are known to be true when the run-time value of e is \textbf{$\mathord{{\bfit \text{false}}}$}, and a set of propositions \phi{}^+ which are known to be true when the run-time value of e is \textbf{$\mathord{{\bfit \text{true}}}$}2525 Any other value is treated in the same way as \textbf{$\mathord{{\bfit \text{true}}}$}, as values other than \textbf{$\mathord{{\bfit \text{false}}}$} are traditionally considered as true in languages of the Lisp family.. The propositions indicate that the value of a separate variable belongs (or does not belong) to a given type. For example, the \phi{}^- proposition \mathbf{Number}_y indicates that when e evaluates to \textbf{$\mathord{{\bfit \text{false}}}$}, the variable y necessarily holds a number.
Finally, the typing judgement can indicate with o that the expression e is an alias for a sub-element of another variable in the environment. For example, if the object o is \mathrm{car} :: \mathrm{cdr}(y), it indicates that the expression e produces the same value that (car (cdr y)) would, i.e. that it returns the second element of a (possibly improper) list stored in y.
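Occurrence typing, as implemented in Typed Racket, can be observed in the following sketch (the function name describe is ours, chosen for illustration): the \phi{}^+ proposition attached to (number? x) lets the typechecker narrow x to \mathbf{Number} in the “then” branch, and exclude \mathbf{Number} in the “else” branch.

```racket
#lang typed/racket

;; In the "then" branch the φ+ proposition Number_x lets (add1 x)
;; typecheck; in the "else" branch x is known NOT to be a number,
;; so it must be a String.
(: describe (-> (U Number String) String))
(define (describe x)
  (if (number? x)
      (format "a number: ~a" (add1 x))   ; x : Number here
      (string-append "a string: " x)))   ; x : String here
```

Without occurrence typing, both branches would see x at type (U Number String), and neither (add1 x) nor (string-append … x) would typecheck.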
Readers familiar with abstract interpretation can compare the \phi propositions to the Cartesian product of the abstract domains of pairs of variables. A static analyser can track the possible pairs of values contained in pairs of distinct variables, and will represent this information using an abstract domain indicating which combinations of values may be possible, and which may not. Occurrence typing similarly exploits the fact that the type of other variables may depend on the value of e.
4.2.8 Types
Typed Racket handles the types listed below. Aside from the top type (⊤), which is the supertype of all other types, this list includes singleton types for numbers, booleans, symbols and the \textbf{$\mathord{{\bfit \text{null}}}$} value. The types \mathbf{Number} and \mathbf{Symbol} are the infinite unions of all number and symbol singletons, respectively. Also present are function types (with fixed arguments, homogeneous rest arguments and the variadic polymorphic functions which accept heterogeneous rest arguments, as well as polymorphic abstractions), unions of other types, intersections of other types, and the types of pairs and promises. The value assigned to a variadic polymorphic function’s rest argument will have a type of the form (\textbf{List}\ τ \mathbf{\ldots{}}_{α}). Finally, Typed Racket allows recursive types to be described with the \textbf{Rec} combinator.
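As a concrete illustration of the \textbf{Rec} combinator (the type name IntTree and the function tree-sum are ours, not part of the formal development), the following Typed Racket sketch defines a recursive type of integer-leaved binary trees and a function consuming it:

```racket
#lang typed/racket

;; A recursive type built with the Rec combinator: either an
;; integer leaf, or a pair of two subtrees.
(define-type IntTree (Rec T (U Integer (Pair T T))))

(: tree-sum (-> IntTree Integer))
(define (tree-sum t)
  (if (pair? t)                               ; occurrence typing:
      (+ (tree-sum (car t)) (tree-sum (cdr t))) ; t : (Pair IntTree IntTree)
      t))                                       ; t : Integer

(tree-sum (cons (cons 1 2) 3))   ; => 6
```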