diff options
| author | Who? Me?! <mark-i-m@users.noreply.github.com> | 2020-06-05 11:11:09 -0500 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2020-06-05 13:11:09 -0300 |
| commit | 71e7090872d8d2bc2e5c036df2a1243771dbe741 (patch) | |
| tree | 0631fdd929e3e509721c538f7437f6f371cbc83e /src/doc/rustc-dev-guide | |
| parent | 2d4344fed1065bd4548493a37f136cc7b95d8b93 (diff) | |
| download | rust-71e7090872d8d2bc2e5c036df2a1243771dbe741.tar.gz rust-71e7090872d8d2bc2e5c036df2a1243771dbe741.zip | |
Add a bit about various type system concepts (#697)
* add a bit on dataflow analysis * add a bit on quanitification * add a bit on debruijn index * add a bit on early and late bound params * add missing link * Typos Co-authored-by: Tshepang Lekhonkhobe <tshepang@gmail.com> * clarify dataflow example * fix formatting * fix typos * Typos Co-authored-by: Tshepang Lekhonkhobe <tshepang@gmail.com> * fix errors in background * remove dup material and make early/late intro short * adjust intro * Niko's intro Co-authored-by: Niko Matsakis <niko@alum.mit.edu> Co-authored-by: Tshepang Lekhonkhobe <tshepang@gmail.com> Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
Diffstat (limited to 'src/doc/rustc-dev-guide')
| -rw-r--r-- | src/doc/rustc-dev-guide/src/SUMMARY.md | 1 | ||||
| -rw-r--r-- | src/doc/rustc-dev-guide/src/appendix/background.md | 146 | ||||
| -rw-r--r-- | src/doc/rustc-dev-guide/src/appendix/glossary.md | 1 | ||||
| -rw-r--r-- | src/doc/rustc-dev-guide/src/early-late-bound.md | 107 |
4 files changed, 253 insertions, 2 deletions
diff --git a/src/doc/rustc-dev-guide/src/SUMMARY.md b/src/doc/rustc-dev-guide/src/SUMMARY.md index 0a9dc777377..f45e8ba9663 100644 --- a/src/doc/rustc-dev-guide/src/SUMMARY.md +++ b/src/doc/rustc-dev-guide/src/SUMMARY.md @@ -76,6 +76,7 @@ - [Generic arguments](./generic_arguments.md) - [Type inference](./type-inference.md) - [Trait solving](./traits/resolution.md) + - [Early and Late Bound Parameters](./early-late-bound.md) - [Higher-ranked trait bounds](./traits/hrtb.md) - [Caching subtleties](./traits/caching.md) - [Specialization](./traits/specialization.md) diff --git a/src/doc/rustc-dev-guide/src/appendix/background.md b/src/doc/rustc-dev-guide/src/appendix/background.md index b688d2c75db..fe69f6bfda6 100644 --- a/src/doc/rustc-dev-guide/src/appendix/background.md +++ b/src/doc/rustc-dev-guide/src/appendix/background.md @@ -79,16 +79,158 @@ cycle. [*Static Program Analysis*](https://cs.au.dk/~amoeller/spa/) by Anders Møller and Michael I. Schwartzbach is an incredible resource! -*to be written* +_Dataflow analysis_ is a type of static analysis that is common in many +compilers. It describes a general technique, rather than a particular analysis. + +The basic idea is that we can walk over a [CFG](#cfg) and keep track of what +some value could be. At the end of the walk, we might have shown that some +claim is true or not necessarily true (e.g. "this variable must be +initialized"). `rustc` tends to do dataflow analyses over the MIR, since that +is already a CFG. + +For example, suppose we want to check that `x` is initialized before it is used +in this snippet: + +```rust,ignore +fn foo() { + let mut x; + + if some_cond { + x = 1; + } + + dbg!(x); +} +``` + +A CFG for this code might look like this: + +```txt + +------+ + | Init | (A) + +------+ + | \ + | if some_cond + else \ +-------+ + | \| x = 1 | (B) + | +-------+ + | / + +---------+ + | dbg!(x) | (C) + +---------+ +``` + +We can do the dataflow analysis as follows: we will start off with a flag `init` +which indicates if we know `x` is initialized. As we walk the CFG, we will +update the flag. At the end, we can check its value. + +So first, in block (A), the variable `x` is declared but not initialized, so +`init = false`. In block (B), we initialize the value, so we know that `x` is +initialized. So at the end of (B), `init = true`. + +Block (C) is where things get interesting. Notice that there are two incoming +edges, one from (A) and one from (B), corresponding to whether `some_cond` is true or not. +But we cannot know that! It could be the case the `some_cond` is always true, +so that `x` is actually always initialized. It could also be the case that +`some_cond` depends on something random (e.g. the time), so `x` may not be +initialized. In general, we cannot know statically (due to [Rice's +Theorem][rice]). So what should the value of `init` be in block (C)? + +[rice]: https://en.wikipedia.org/wiki/Rice%27s_theorem + +Generally, in dataflow analyses, if a block has multiple parents (like (C) in +our example), its dataflow value will be some function of all its parents (and +of course, what happens in (C)). Which function we use depends on the analysis +we are doing. + +In this case, we want to be able to prove definitively that `x` must be +initialized before use. This forces us to be conservative and assume that +`some_cond` might be false sometimes. So our "merging function" is "and". That +is, `init = true` in (C) if `init = true` in (A) _and_ in (B) (or if `x` is +initialized in (C)). But this is not the case; in particular, `init = false` in +(A), and `x` is not initialized in (C). Thus, `init = false` in (C); we can +report an error that "`x` may not be initialized before use". + +There is definitely a lot more that can be said about dataflow analyses. There is an +extensive body of research literature on the topic, including a lot of theory. +We only discussed a forwards analysis, but backwards dataflow analysis is also +useful. For example, rather than starting from block (A) and moving forwards, +we might have started with the usage of `x` and moved backwards to try to find +its initialization. <a name="quantified"></a> ## What is "universally quantified"? What about "existentially quantified"? -*to be written* +In math, a predicate may be _universally quantified_ or _existentially +quantified_: + +- _Universal_ quantification: + - the predicate holds if it is true for all possible inputs. + - Traditional notation: ∀x: P(x). Read as "for all x, P(x) holds". +- _Existential_ quantification: + - the predicate holds if there is any input where it is true, i.e., there + only has to be a single input. + - Traditional notation: ∃x: P(x). Read as "there exists x such that P(x) holds". + +In Rust, they come up in type checking and trait solving. For example, + +```rust,ignore +fn foo<T>() +``` +This function claims that the function is well-typed for all types `T`: `∀ T: well_typed(foo)`. + +Another example: + +```rust,ignore +fn foo<'a>(_: &'a usize) +``` +This function claims that for any lifetime `'a` (determined by the +caller), it is well-typed: `∀ 'a: well_typed(foo)`. + +Another example: + +```rust,ignore +fn foo<F>() +where for<'a> F: Fn(&'a u8) +``` +This function claims that it is well-typed for all types `F` such that for all +lifetimes `'a`, `F: Fn(&'a u8)`: `∀ F: ∀ 'a: (F: Fn(&'a u8)) => well_typed(foo)`. + +One more example: + +```rust,ignore +fn foo(_: dyn Debug) +``` +This function claims that there exists some type `T` that implements `Debug` +such that the function is well-typed: `∃ T: (T: Debug) and well_typed(foo)`. <a name="variance"></a> +## What is a DeBruijn Index? + +DeBruijn indices are a way of representing which variables are bound in +which binders using only integers. They were [originally invented][wikideb] for +use in lambda calculus evaluation. In `rustc`, we use a similar idea for the +[representation of generic types][sub]. + +[wikideb]: https://en.wikipedia.org/wiki/De_Bruijn_index +[sub]: ../generics.md + +Here is a basic example of how DeBruijn indices might be used for closures (we +don't actually do this in `rustc` though): + +```rust,ignore +|x| { + f(x) // de Bruijn index of `x` is 1 because `x` is bound 1 level up + + |y| { + g(x, y) // index of `x` is 2 because it is bound 2 levels up + // index of `y` is 1 because it is bound 1 level up + } +} +``` + ## What is co- and contra-variance? Check out the subtyping chapter from the diff --git a/src/doc/rustc-dev-guide/src/appendix/glossary.md b/src/doc/rustc-dev-guide/src/appendix/glossary.md index dddf628f277..f8042170c08 100644 --- a/src/doc/rustc-dev-guide/src/appendix/glossary.md +++ b/src/doc/rustc-dev-guide/src/appendix/glossary.md @@ -15,6 +15,7 @@ CTFE <div id="ctfe"/> | Short for Compile-Time Function Eval cx <div id="cx"/> | We tend to use "cx" as an abbreviation for context. See also `tcx`, `infcx`, etc. DAG <div id="dag"/> | A directed acyclic graph is used during compilation to keep track of dependencies between queries. ([see more](../queries/incremental-compilation.html)) data-flow analysis <div id="data-flow"/> | A static analysis that figures out what properties are true at each point in the control-flow of a program; see [the background chapter for more](./background.html#dataflow). +DeBruijn Index <div id="debruijn"> | A technique for describing which binder a variable is bound by using only integers. It has the benefit that it is invariant under variable renaming. ([see more](./background.md#debruijn)) DefId <div id="def-id"/> | An index identifying a definition (see `librustc_middle/hir/def_id.rs`). Uniquely identifies a `DefPath`. See [the HIR chapter for more](../hir.html#identifiers-in-the-hir). Double pointer <div id="double-ptr"/> | A pointer with additional metadata. See "fat pointer" for more. drop glue <div id="drop-glue"/> | (internal) compiler-generated instructions that handle calling the destructors (`Drop`) for data types. diff --git a/src/doc/rustc-dev-guide/src/early-late-bound.md b/src/doc/rustc-dev-guide/src/early-late-bound.md new file mode 100644 index 00000000000..a703cbab5dd --- /dev/null +++ b/src/doc/rustc-dev-guide/src/early-late-bound.md @@ -0,0 +1,107 @@ +# Early and Late Bound Variables + +In Rust, item definitions (like `fn`) can often have generic parameters, which +are always [_universally_ quantified][quant]. That is, if you have a function +like + +```rust +fn foo<T>(x: T) { } +``` + +this function is defined "for all T" (not "for some specific T", which would be +[_existentially_ quantified][quant]). + +[quant]: ./appendix/background.md#quantified + +While Rust *items* can be quantified over types, lifetimes, and constants, the +types of values in Rust are only ever quantified over lifetimes. So you can +have a type like `for<'a> fn(&'a u32)`, which represents a function pointer +that takes a reference with any lifetime, or `for<'a> dyn Trait<'a>`, which is +a `dyn` trait for a trait implemented for any lifetime; but we have no type +like `for<T> fn(T)`, which would be a function that takes a value of *any type* +as a parameter. This is a consequence of monomorphization -- to support a value +of type `for<T> fn(T)`, we would need a single function pointer that can be +used for a parameter of any type, but in Rust we generate customized code for +each parameter type. + +One consequence of this asymmetry is a weird split in how we represesent some +generic types: _early-_ and _late-_ bound parameters. +Basically, if we cannot represent a type (e.g. a universally quantified type), +we have to bind it _early_ so that the unrepresentable type is never around. + +Consider the following example: + +```rust,ignore +fn foo<'a, 'b, T>(x: &'a u32, y: &'b T) where T: 'b { ... } +``` + +We cannot treat `'a`, `'b`, and `T` in the same way. Types in Rust can't have +`for<T> { .. }`, only `for<'a> {...}`, so whenever you reference `foo` the type +you get back can't be `for<'a, 'b, T> fn(&'a u32, y: &'b T)`. Instead, the `T` +must be substituted early. In particular, you have: + +```rust,ignore +let x = foo; // T, 'b have to be substituted here +x(...); // 'a substituted here, at the point of call +x(...); // 'a substituted here with a different value +``` + +## Early-bound parameters + +Early-bound parameters in rustc are identified by an index, stored in the +[`ParamTy`] struct for types or the [`EarlyBoundRegion`] struct for lifetimes. +The index counts from the outermost declaration in scope. This means that as you +add more binders inside, the index doesn't change. + +For example, + +```rust,ignore +trait Foo<T> { + type Bar<U> = (Self, T, U); +} +``` + +Here, the type `(Self, T, U)` would be `($0, $1, $2)`, where `$N` means a +[`ParamTy`] with the index of `N`. + +In rustc, the [`Generics`] structure carries the this information. So the +[`Generics`] for `Bar` above would be just like for `U` and would indicate the +'parent' generics of `Foo`, which declares `Self` and `T`. You can read more +in [this chapter](./generics.md). + +[`ParamTy`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.ParamTy.html +[`EarlyBoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.EarlyBoundRegion.html +[`Generics`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Generics.html + +## Late-bound parameters + +Late-bound parameters in `rustc` are handled quite differently (they are also +specialized to lifetimes since, right now, only late-bound lifetimes are +supported, though with GATs that has to change). We indicate their potential +presence by a [`Binder`] type. The [`Binder`] doesn't know how many variables +there are at that binding level. This can only be determined by walking the +type itself and collecting them. So a type like `for<'a, 'b> ('a, 'b)` would be +`for (^0.a, ^0.b)`. Here, we just write `for` because we don't know the names +of the things bound within. + +Moreover, a reference to a late-bound lifetime is written `^0.a`: + +- The `0` is the index; it identifies that this lifetime is bound in the + innermost binder (the `for`). +- The `a` is the "name"; late-bound lifetimes in rustc are identified by a + "name" -- the [`BoundRegion`] struct. This struct can contain a + [`DefId`][defid] or it might have various "anonymous" numbered names. The + latter arise from types like `fn(&u32, &u32)`, which are equivalent to + something like `for<'a, 'b> fn(&'a u32, &'b u32)`, but the names of those + lifetimes must be generated. + +This setup of not knowing the full set of variables at a binding level has some +advantages and some disadvantages. The disadvantage is that you must walk the +type to find out what is bound at the given level and so forth. The advantage +is primarily that, when constructing types from Rust syntax, if we encounter +anonymous regions like in `fn(&u32)`, we just create a fresh index and don't have +to update the binder. + +[`Binder`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/struct.Binder.html +[`BoundRegion`]: https://doc.rust-lang.org/nightly/nightly-rustc/rustc_middle/ty/enum.BoundRegion.html +[defid]: ./hir.html#identifiers-in-the-hir |
