diff options
Diffstat (limited to 'compiler/rustc_codegen_llvm/src/debuginfo/doc.md')
| -rw-r--r-- | compiler/rustc_codegen_llvm/src/debuginfo/doc.md | 180 |
1 files changed, 180 insertions, 0 deletions
diff --git a/compiler/rustc_codegen_llvm/src/debuginfo/doc.md b/compiler/rustc_codegen_llvm/src/debuginfo/doc.md new file mode 100644 index 00000000000..f983d092039 --- /dev/null +++ b/compiler/rustc_codegen_llvm/src/debuginfo/doc.md @@ -0,0 +1,180 @@ +# Debug Info Module + +This module serves the purpose of generating debug symbols. We use LLVM's +[source level debugging](https://llvm.org/docs/SourceLevelDebugging.html) +features for generating the debug information. The general principle is +this: + +Given the right metadata in the LLVM IR, the LLVM code generator is able to +create DWARF debug symbols for the given code. The +[metadata](https://llvm.org/docs/LangRef.html#metadata-type) is structured +much like DWARF *debugging information entries* (DIE), representing type +information such as datatype layout, function signatures, block layout, +variable location and scope information, etc. It is the purpose of this +module to generate correct metadata and insert it into the LLVM IR. + +As the exact format of metadata trees may change between different LLVM +versions, we now use LLVM +[DIBuilder](https://llvm.org/docs/doxygen/html/classllvm_1_1DIBuilder.html) +to create metadata where possible. This will hopefully ease the adaption of +this module to future LLVM versions. + +The public API of the module is a set of functions that will insert the +correct metadata into the LLVM IR when called with the right parameters. +The module is thus driven from an outside client with functions like +`debuginfo::create_local_var_metadata(bx: block, local: &ast::local)`. + +Internally the module will try to reuse already created metadata by +utilizing a cache. The way to get a shared metadata node when needed is +thus to just call the corresponding function in this module: + + let file_metadata = file_metadata(cx, file); + +The function will take care of probing the cache for an existing node for +that exact file path. + +All private state used by the module is stored within either the +CrateDebugContext struct (owned by the CodegenCx) or the +FunctionDebugContext (owned by the FunctionCx). + +This file consists of three conceptual sections: +1. The public interface of the module +2. Module-internal metadata creation functions +3. Minor utility functions + + +## Recursive Types + +Some kinds of types, such as structs and enums can be recursive. That means +that the type definition of some type X refers to some other type which in +turn (transitively) refers to X. This introduces cycles into the type +referral graph. A naive algorithm doing an on-demand, depth-first traversal +of this graph when describing types, can get trapped in an endless loop +when it reaches such a cycle. + +For example, the following simple type for a singly-linked list... + +``` +struct List { + value: i32, + tail: Option<Box<List>>, +} +``` + +will generate the following callstack with a naive DFS algorithm: + +``` +describe(t = List) + describe(t = i32) + describe(t = Option<Box<List>>) + describe(t = Box<List>) + describe(t = List) // at the beginning again... + ... +``` + +To break cycles like these, we use "forward declarations". That is, when +the algorithm encounters a possibly recursive type (any struct or enum), it +immediately creates a type description node and inserts it into the cache +*before* describing the members of the type. This type description is just +a stub (as type members are not described and added to it yet) but it +allows the algorithm to already refer to the type. After the stub is +inserted into the cache, the algorithm continues as before. If it now +encounters a recursive reference, it will hit the cache and does not try to +describe the type anew. + +This behavior is encapsulated in the 'RecursiveTypeDescription' enum, +which represents a kind of continuation, storing all state needed to +continue traversal at the type members after the type has been registered +with the cache. (This implementation approach might be a tad over- +engineered and may change in the future) + + +## Source Locations and Line Information + +In addition to data type descriptions the debugging information must also +allow to map machine code locations back to source code locations in order +to be useful. This functionality is also handled in this module. The +following functions allow to control source mappings: + ++ `set_source_location()` ++ `clear_source_location()` ++ `start_emitting_source_locations()` + +`set_source_location()` allows to set the current source location. All IR +instructions created after a call to this function will be linked to the +given source location, until another location is specified with +`set_source_location()` or the source location is cleared with +`clear_source_location()`. In the later case, subsequent IR instruction +will not be linked to any source location. As you can see, this is a +stateful API (mimicking the one in LLVM), so be careful with source +locations set by previous calls. It's probably best to not rely on any +specific state being present at a given point in code. + +One topic that deserves some extra attention is *function prologues*. At +the beginning of a function's machine code there are typically a few +instructions for loading argument values into allocas and checking if +there's enough stack space for the function to execute. This *prologue* is +not visible in the source code and LLVM puts a special PROLOGUE END marker +into the line table at the first non-prologue instruction of the function. +In order to find out where the prologue ends, LLVM looks for the first +instruction in the function body that is linked to a source location. So, +when generating prologue instructions we have to make sure that we don't +emit source location information until the 'real' function body begins. For +this reason, source location emission is disabled by default for any new +function being codegened and is only activated after a call to the third +function from the list above, `start_emitting_source_locations()`. This +function should be called right before regularly starting to codegen the +top-level block of the given function. + +There is one exception to the above rule: `llvm.dbg.declare` instruction +must be linked to the source location of the variable being declared. For +function parameters these `llvm.dbg.declare` instructions typically occur +in the middle of the prologue, however, they are ignored by LLVM's prologue +detection. The `create_argument_metadata()` and related functions take care +of linking the `llvm.dbg.declare` instructions to the correct source +locations even while source location emission is still disabled, so there +is no need to do anything special with source location handling here. + +## Unique Type Identification + +In order for link-time optimization to work properly, LLVM needs a unique +type identifier that tells it across compilation units which types are the +same as others. This type identifier is created by +`TypeMap::get_unique_type_id_of_type()` using the following algorithm: + +1. Primitive types have their name as ID + +2. Structs, enums and traits have a multipart identifier + + 1. The first part is the SVH (strict version hash) of the crate they + were originally defined in + + 2. The second part is the ast::NodeId of the definition in their + original crate + + 3. The final part is a concatenation of the type IDs of their concrete + type arguments if they are generic types. + +3. Tuple-, pointer-, and function types are structurally identified, which + means that they are equivalent if their component types are equivalent + (i.e., `(i32, i32)` is the same regardless in which crate it is used). + +This algorithm also provides a stable ID for types that are defined in one +crate but instantiated from metadata within another crate. We just have to +take care to always map crate and `NodeId`s back to the original crate +context. + +As a side-effect these unique type IDs also help to solve a problem arising +from lifetime parameters. Since lifetime parameters are completely omitted +in debuginfo, more than one `Ty` instance may map to the same debuginfo +type metadata, that is, some struct `Struct<'a>` may have N instantiations +with different concrete substitutions for `'a`, and thus there will be N +`Ty` instances for the type `Struct<'a>` even though it is not generic +otherwise. Unfortunately this means that we cannot use `ty::type_id()` as +cheap identifier for type metadata -- we have done this in the past, but it +led to unnecessary metadata duplication in the best case and LLVM +assertions in the worst. However, the unique type ID as described above +*can* be used as identifier. Since it is comparatively expensive to +construct, though, `ty::type_id()` is still used additionally as an +optimization for cases where the exact same type has been seen before +(which is most of the time). |
