diff --git a/DOCUMENTATION.md b/DOCUMENTATION.md
index 8c8cf63..6a5454b 100644
--- a/DOCUMENTATION.md
+++ b/DOCUMENTATION.md
@@ -2,7 +2,7 @@

When writing a program, I usually make the most primitive and smallest code I can that does the job. If it turns out I miscalculated the complexity, or I must add some feature that isn't compatible with the codebase, I'll obviously have to refactor it. Still, I've been programming this way for probably my entire life.

-That being said, if you know this compiler took since 2019 to get to its current state, you will correctly guess that I DO NOT KNOW WHAT I AM DOING. Compiler literature and online discussion is abstract to the point where it is not useful for real-world archs. When it gets specific, it's often too simplistic. It's common to say instruction selection should happen before register allocation, but how can you know which instructions to emit when some of them only work with specific registers? Imagine how long it took me to realize real-world IRs are not at all generic, and are actually quite close to their target architectures. As a result, much of what you see in the source is the result of a lot of experimentation. There's definitely better ways to do the things I show here, but I figured it's better to have at least some resource on how a "real" compiler works.
+That being said, if you know this compiler has taken since 2019 to get to its current state, you will correctly guess that I DO NOT KNOW WHAT I AM DOING. Compiler literature and online discussion are abstract to the point where they are not useful for real-world archs. When they get specific, they're often too simplistic. It's common to say instruction selection should happen before register allocation, but how can you know which instructions to emit when some of them only work with specific registers? People say spilling causes more loads and stores, but that's not necessarily true; x86 has memory operands! Imagine how long it took me to realize real-world IRs are not at all generic, and are actually quite close to their target architectures. As a result, much of what you see in the source is me basically banging rocks. There are definitely better ways to do the things I show here, but I figured it's better to have at least some resource on how a "real" compiler works.

The core idea behind the compiler is to progressively iterate through the AST, turning it into a more primitive form step by step. Once this primitivization ends, the code generator is given the code in a form it will understand. Doing it this way is necessary because machine code itself is primitive, and instructions typically have 0-3 operands. Thanks to both this, and Nectar itself being highly low-level, the need for an IR disappears. On the other hand, making sure the AST is in a correct state between steps is the prime source of bugs.

@@ -16,9 +16,9 @@ An AST node may not be shared by multiple parent nodes. Also, the internal Necta

Each block of code is called a "chunk", likely a term I took from Lua. Chunks may contain one another; the least deep one within a function is called the top-level chunk (very important). Top-level chunks may contain other top-level chunks, because user-defined functions are within the "global scope", which is considered a function in itself. After all, nothing stops you from directly inserting instructions in the `.text` section of an executable, without attaching it to a label.

-During parsing, a tree of maps is used to handle scopes and variable declarations, called `VarTable`. Its entries are of type `VarTableEntry` (VTE), which may be of kind `VAR`, `SYMBOL` (global variables) or `TYPE` (type-system entries). Shadowing in vartables is allowed, like in Nectar itself.
+During parsing, a tree of maps, `Scope`, is used to handle scopes for variables, types, symbols and constant expressions. Its entries are of type `ScopeItem` (often called VTE in the source for historical reasons). Shadowing of scope items is allowed, like in Nectar itself.
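+
+To make the scope rules concrete, here is a minimal C sketch of how a lookup over such a tree of maps could work. The names and the flat-array storage are purely illustrative, not the actual `Scope`/`ScopeItem` layout in the source:
+
+    #include <stddef.h>
+    #include <string.h>
+
+    typedef struct ScopeItem ScopeItem;   /* stand-in for the real ScopeItem */
+
+    typedef struct Scope {
+        struct Scope *parent;   /* enclosing scope, NULL for the global scope */
+        const char  **names;    /* hypothetical storage: parallel arrays */
+        ScopeItem   **items;
+        size_t        count;
+    } Scope;
+
+    /* Walk from the innermost scope outwards; the first match wins,
+       which is exactly what makes shadowing work. */
+    static ScopeItem *scope_lookup(Scope *s, const char *name) {
+        for (; s != NULL; s = s->parent)
+            for (size_t i = s->count; i-- > 0; )   /* later entries shadow earlier ones */
+                if (strcmp(s->names[i], name) == 0)
+                    return s->items[i];
+        return NULL;   /* not declared anywhere */
+    }
+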
-The top-level chunk keeps a list of variables within its `ASTChunk` structure. After a chunk is finished parsing, all local variables in the current `VarTable` are added to its top-level chunk's variable list. Names may conflict, but at this point they're no longer important. Also worth mentioning is that this flat list contains `VarTableEntry` structs, even though `VarTable`s are now irrelevant. Said VTEs are all of type `VAR`; the rest are ignored because they're not subject to coloring.
+The top-level chunk keeps a list of variables within its `ASTChunk` structure. After a chunk has finished parsing, all local variables in the current `Scope` are added to its top-level chunk's variable list. Names may conflict, but at this point they're no longer important. Also worth mentioning is that this flat list contains `ScopeItem` structs, even though `Scope`s are now irrelevant. Said items are all of type `SCOPEITEM_VAR`; the rest are ignored because they're not subject to coloring.

There are enough types of passes to justify a generic way of invoking the visitor pattern on the AST. Because passes may do many different things to the AST, including modifying it, the definition of a generic visitor is very broad. Most functionality is unused by each pass, but all of it is needed.

@@ -154,21 +154,38 @@ Now, why did I choose UD chains? Why, simplicity, obviously...

At this point we have a very distorted kind of Nectar AST in our function. Sure, we've got blocks and other familiar things, but all variables are in a flat list. These variables are essentially the "virtual registers" you hear a lot about. Because x86 only has six allocatable general-purpose registers (`esp` and `ebp` are excluded), we must assign each of these variables (VTEs) to a physical machine register.

-This problem is a large area of study in itself, but a common approach is to imagine it as a graph coloring problem, where vertices are VTEs, and edges connect conflicting VTEs that cannot have the same color. Said edges are determined using the UD-chains of both VTEs.
+The x86 register set is highly irregular:
+
+1. `ax` may be split into two halves (`ah` and `al`), but `eax` may not be split. Likewise with `ebx`, `ecx` and `edx`, but nothing else.
+2. The low bytes of `si`, `di`, `bp` and `sp` are accessible only in 64-bit mode.
+3. Registers `r8` through `r15` are accessible only in 64-bit mode.
+4. If an instruction uses any 64-bit-only register, then `ah`, `bh`, `ch` and `dh` are not accessible.
+5. Of the 16-bit registers, only `bx`, `bp`, `di` and `si` may be dereferenced. Any 32-bit register can be dereferenced, including in 16-bit mode.
+6. Obviously, the floating-point, control and debug registers form separate spaces.
+
+A compiler cannot simply pretend these registers are interchangeable. The trick is to imagine the CPU as a bitmap of resources.
+
+`al` and `ah` are separate registers in their own right, so we assign 1 bit to each (for clarity, let them be the ints 1 and 2). `ax` is also a separate register, but as it is the concatenation of `al` and `ah`, it is assigned the OR of both bits (int 3). You could reserve a bit for the high 16 bits of `eax`, but as that half is not independently accessible, you may as well give `eax` the same bits as `ax`. Repeat the same for the rest of the registers.
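+
+In code, the resource bitmap could look something like the following sketch (the names are illustrative, not the compiler's actual definitions):
+
+    /* One bit per smallest addressable resource; larger registers OR together
+       the bits of the pieces they are made of. */
+    enum {
+        RES_AL  = 1 << 0,          /* int 1 */
+        RES_AH  = 1 << 1,          /* int 2 */
+        RES_AX  = RES_AL | RES_AH, /* int 3: ax = ah:al */
+        RES_EAX = RES_AX,          /* the high half of eax gets no bit of its own */
+        RES_CL  = 1 << 2,
+        RES_CH  = 1 << 3,
+        RES_CX  = RES_CL | RES_CH,
+        RES_ECX = RES_CX
+        /* ...and so on for b, d, si, di, etc. */
+    };
+
+    /* Two registers can hold live values at the same time only if their
+       resource masks do not intersect. */
+    static int resources_overlap(unsigned a, unsigned b) {
+        return (a & b) != 0;
+    }
+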
+From these we form sets of registers called "register classes", which can be thought of as "ways in which a register can be used". The resource mask of a register class is the union (bitwise OR) of all bits used by all of its registers.
+
+This compiler currently considers 3 register classes: `REG_CLASS_8` for `al`, `ah`, `bl`, `bh`, `cl`, `ch`, `dl`, `dh`; `REG_CLASS_NOT_8` for `ax`, `eax`, `bx`, `ebx`, `cx`, `ecx`, `dx`, `edx`, `di`, `edi`, `si`, `esi`; `REG_CLASS_IA16_PTRS` for `di`, `si`, `bx`. As you can see, registers are not unique under this abstraction (`ax` and `eax` share the same resources, and `di`, `si` and `bx` appear in two classes), but this is necessary, because the abstraction treats the CPU as a soup of resources.
+
+(64-bit mode is not considered.)
+
+Finally we begin coloring. Register allocation is a large area of study in itself, but a common approach is to model it as a graph coloring problem, where vertices are VTEs, and edges connect conflicting VTEs that cannot have the same color. "Color" in this case means the register class plus the index within that register class. Edges in the graph are determined using the UD-chains of both VTEs and their resource masks (registers with an empty resource mask intersection cannot interfere).

The actual coloring algorithm used is Welsh-Powell, which sorts the VTEs/vertices by descending degree before attempting greedy coloring.
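+
+A minimal C sketch of that coloring step, assuming the interference graph has already been built as an adjacency matrix; none of these names are the compiler's own:
+
+    #include <stdlib.h>
+
+    #define MAX_VERTS 256
+
+    /* adj[i][j] is 1 when VTEs i and j have overlapping UD-chains and
+       intersecting resource masks, i.e. they may not share a color. */
+    static int adj[MAX_VERTS][MAX_VERTS];
+    static int degree[MAX_VERTS];
+    static int color[MAX_VERTS];     /* -1 = not yet colored */
+    static int order[MAX_VERTS];     /* vertex indices, sorted by degree */
+
+    static int by_degree_desc(const void *a, const void *b) {
+        return degree[*(const int *)b] - degree[*(const int *)a];
+    }
+
+    static void welsh_powell(int n) {
+        for (int i = 0; i < n; i++) { order[i] = i; color[i] = -1; }
+        qsort(order, (size_t)n, sizeof order[0], by_degree_desc);
+
+        for (int i = 0; i < n; i++) {
+            int v = order[i];
+            /* greedily pick the smallest color not used by any neighbor */
+            for (int c = 0; ; c++) {
+                int taken = 0;
+                for (int u = 0; u < n && !taken; u++)
+                    taken = adj[v][u] && color[u] == c;
+                if (!taken) { color[v] = c; break; }
+            }
+        }
+    }
+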
-If there's more colors than there are physical registers, then we have a conflict, and must spill. There are two ways to do so: ~~spill2var~~ and spill2stack. ~~The former is necessary on boundaries where suddenly a specific register/color must be used (e.g. returning in `eax`).~~ The latter transforms every use of a local variable (`ASTExprVar` where its VTE is of type `VARTABLEENTRY_VAR`) into the form `@stack + n`.
+If there are more colors than there are physical registers, then we have a conflict, and must spill. `spill2stack` transforms every use of a local variable (`ASTExprVar` where its VTE is of type `SCOPEITEM_VAR`) into the form `@stack + n`.

If spill2stack is used, then CG must fail once so that dumbification can be applied again.

## Pre-coloring

-NOTE: spill2var turned out to be pushing the problem a step back rather than solving it. Because it is known in advance what must be pre-colored, any such expressions are immediately placed in their own variable by another pass (dumbification?). If the assignment turns out to have been redundant, the register allocator should coaslesce the moves.
+NOTE: `spill2var` turned out to be pushing the problem a step back rather than solving it. Because it is known in advance what must be pre-colored, any such expressions are immediately placed in their own variable by another pass. If the assignment turns out to have been redundant, the register allocator should coalesce the moves.

-~~I skipped forward a bit. Coloring assumes that all registers have equal importance, which is never true. A return value must be in `eax`, the remainder of division must be in `edx`, etc. In 64-bit, the index of an argument determines in which register it may end up.~~
-
-~~The pre-coloring visitor applies said rules to the AST, setting the colors in the VTE. It is completely plausible that a conflict can occur here, too, from two variables having overlapping live ranges and the same color, but it can also be from demanding more than one color from the same variable. In the latter case, the pre-coloring visitor gives up as soon as its detected. In both cases we do spill2var, not spill2stack, because spilling to the stack doesn't solve the pre-coloring problem.~~
+TODO: preclassing.

## Callee-saved pass

@@ -246,9 +263,57 @@ Upon parsing the above statement, the parser:

How's that for a hack?

+## x86 Segmentation
+
+In x86, absolutely all memory access by software goes through one of six "segments", which translates the address from the segment's space to true linear space.
+
+In 32-bit mode and above, segmentation has mostly been forgotten. Major OSes set all\* segments to the same setting, letting software pretend that segmentation doesn't exist. This is called a flat memory model. In 16-bit mode, pointers are only big enough for 64kB of memory, so Intel added segmentation to expand the address space to 20 bits, giving the 8086 one megabyte to use.
+
+This splits the pointer type into three subtypes:
+
+1. Near: the kind of pointer everyone knows, which uses some "default" segment
+2. Far: a pair of integers, one of which defines the segment, while the other defines the memory offset
+3. Huge: a far pointer that is slower but better behaved
+
+By default, pointers in Nectar are just integers, i.e. near, limiting a Nectar program to only 64kB. The parameter `mem` sets all unspecified pointers in a Nectar source file to huge pointers.
+
+Far and huge pointers have to be dumbified. Let us have the following:
+
+    u8 @far* a;
+
+    *a = 5;
+
+    a = a + 1;
+
+A data access goes through the data segment `ds`, so prior to dereferencing, we must make sure `ds` is set to the segment part of `a`:
+
+    FarPointer a;
+
+    $sreg_ds = a.segment;
+    *a.offset = 5;
+
+    a.offset = a.offset + 1;
+
+This example benefits from scalar replacement:
+
+    u16 a_segment;
+    u16 a_offset;
+
+    $sreg_ds = a_segment;
+    *a_offset = 5;
+
+    a_offset = a_offset + 1;
+
+If `a` were a `u8 @huge*`, we would have to account for overflow, and the last statement would instead become:
+
+    a.offset = a.offset + 1;
+    if(a.offset == 0) {
+        a.segment = a.segment + 4096;
+    }
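+
+Since a real-mode linear address is `segment * 16 + offset`, a full 64kB of offset corresponds to 65536 / 16 = 4096 segment units, which is where the `4096` above comes from. The same arithmetic as a C sketch (the `HugePtr` type and helpers are purely illustrative, not compiler code):
+
+    #include <stdint.h>
+
+    typedef struct { uint16_t segment; uint16_t offset; } HugePtr;
+
+    /* What the hardware computes when a huge pointer is dereferenced. */
+    static uint32_t huge_to_linear(HugePtr p) {
+        return (uint32_t)p.segment * 16u + p.offset;
+    }
+
+    /* Advance by one byte, carrying into the segment exactly like the
+       generated `if(a.offset == 0)` check above. */
+    static HugePtr huge_add1(HugePtr p) {
+        p.offset = (uint16_t)(p.offset + 1);
+        if (p.offset == 0)
+            p.segment = (uint16_t)(p.segment + 4096);
+        return p;
+    }
+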
## Other problems with this approach (1)

-Short-circuit evaluation is when the evaluation of an expression is guaranteed to stop once the output is already known. For example, if in `A || B` `A` is already truthy, then `B` is not evaluated. This is not just an optimization, but an important semantical detail, as evaluation of the operands may have side-effects.
+Short-circuit evaluation is when the evaluation of an expression is guaranteed to stop once the output is already known. For example, if in `A || B` the operand `A` is already truthy, then `B` is not evaluated. This is not an optimization, but an important semantic detail, as evaluation of the operands may have side effects.

Let us write `if(x == 1 || y == 1) { do stuff; }` in x86:

@@ -273,7 +338,7 @@ Even worse, the dumbification pass will try to move the condition into a variabl

And now we need 2 new registers for no reason...

-Lack of gotos also makes function inlining impossible (!!).
+The lack of gotos also makes function inlining impossible, as returns would also have to become gotos (!!).

In conclusion, what? Should a good IR actually be 100% flat and have nothing but jumps? Can this be solved by modelling the code as a graph of basic blocks? I don't know, but for now I have given up on short-circuit evaluation, and I do not actually support either `||` or `&&`.

@@ -324,7 +389,7 @@ Lastly, the codegen pass must recognize the above sequence as a multiplication a

In `cg.c` is a function called `xop`, which returns an x86 operand string, given a trivially compilable Nectar expression. Because we've guaranteed that the other operand may not be a constant, we do not need to check the XOP type, but it's a good idea to insert `assert`s and `abort`s everywhere to prevent hard-to-find bugs.

-Once all that is done and tested, now we can add the following dumbification rules: all binary operations with the operand `AST_BINOP_MUL` or `AST_BINOP_MULHI` must be the whole expression within an assignment statement. If not, extract into a separate assignment & new variable with `varify`. The destination of the assignment, and both operands of the binary operation must be of type `AST_EXPR_VAR`, with their corresponding variables being of type `VARTABLEENTRY_VAR`, not `VARTABLEENTRY_SYMBOL` or `VARTABLEENTRY_TYPE`. If any of those don't apply, `varify` the offenders. Each such assignment have a neighboring, symmetric assignment, so that both A and D are caught by the pre-coloring pass.
+Once all that is done and tested, we can add the following dumbification rules: every binary operation with the operator `AST_BINOP_MUL` or `AST_BINOP_MULHI` must be the whole expression within an assignment statement. If not, extract it into a separate assignment and a new variable with `varify`. The destination of the assignment and both operands of the binary operation must be of type `AST_EXPR_VAR`, with their corresponding variables being of type `SCOPEITEM_VAR`, not `SCOPEITEM_SYMBOL`, `SCOPEITEM_TYPE` nor `SCOPEITEM_CEXPR`. If any of those conditions don't hold, `varify` the offenders. Each such assignment must have a neighboring, symmetric assignment, so that both A and D are caught by the pre-coloring pass.

A common bug when writing a dumbification rule is ending up with one that is always successful. If this happens, the compiler will become stuck endlessly dumbifying, which is nonsense. It would be nice if you could formally prove that won't happen. Another common bug is not realizing that the order in which dumbification rules are applied matters :).
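+
+One cheap safety net against that failure mode is to cap the number of dumbification rounds and abort loudly when the cap is hit, so a bad rule shows up as a hard error instead of a hang. A sketch with hypothetical names (`dumbify_pass` stands in for whatever applies one round of rules and reports whether anything changed):
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    /* Returns nonzero if it rewrote anything; declared elsewhere. */
+    int dumbify_pass(void *ast);
+
+    #define MAX_DUMBIFY_ROUNDS 1000
+
+    void dumbify_until_fixpoint(void *ast) {
+        for (int round = 0; dumbify_pass(ast); round++) {
+            if (round >= MAX_DUMBIFY_ROUNDS) {
+                fprintf(stderr, "dumbification did not converge; is some rule always successful?\n");
+                abort();
+            }
+        }
+    }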