Updated DOCUMENTATION.md

Mid 2025-05-03 10:03:26 +03:00
parent 56c10daaa7
commit 0d808de34c


When writing a program, I usually make the most primitive and smallest code I can that does the job. If it turns out I miscalculated the complexity, or I must add some feature that isn't compatible with the codebase, I'll obviously have to refactor it. Still, I've been programming this way for probably my entire life.
That being said, if you know this compiler has taken since 2019 to get to its current state, you will correctly guess that I DO NOT KNOW WHAT I AM DOING. Compiler literature and online discussion are abstract to the point where they are not useful for real-world archs. When it gets specific, it's often too simplistic. It's common to say instruction selection should happen before register allocation, but how can you know which instructions to emit when some of them only work with specific registers? Imagine how long it took me to realize real-world IRs are not at all generic, and are actually quite close to their target architectures. As a result, much of what you see in the source is the result of a lot of experimentation. There are definitely better ways to do the things I show here, but I figured it's better to have at least some resource on how a "real" compiler works.
The core idea behind the compiler is to progressively iterate through the AST, turning it into a more primitive form step by step. Once this primitivization ends, the code generator is given the code in a form it will understand. Doing it this way is necessary because machine code itself is primitive, and instructions typically have 0-3 operands. Thanks both to this and to Nectar itself being highly low-level, the need for an IR disappears. On the other hand, making sure the AST is in a correct state between steps is the prime source of bugs.
The top-level chunk keeps a list of variables within its `ASTChunk` structure.
There are enough types of passes to push us to have a generic way to invoke the visitor pattern on the AST. Because passes may do many different things to the AST, including modify it, the definition of a generic visitor is very broad. Most functionality is unused by each pass, but all of it is needed.
    void generic_visitor(AST **nptr, AST *stmt, AST *stmtPrev, AST *chu, AST *tlc, void *ud, GenericVisitorHandler preHandler, GenericVisitorHandler postHandler);
`*nptr` is the actual node that is currently being visited. It is behind an additional indirection, because the node may be replaced by another.
If the current node is within a statement (most are), `stmt` is equal to that statement. `stmtPrev` is the previous statement. This is necessary for patching in the linked list of statements within a chunk during modification passes. If there is no previous statement, then the head pointer of the singly-linked list must be patched through the `chu` node. The `tlc` is the top-level chunk, which may be equal to `chu`.
A handler may be called before or after delving deeper into the tree (hence the pre and post handlers). Most passes use the prehandler, but type checking will be better with a posthandler, since we want type checks to happen bottom to top.
## Pre-dumbification
Before dumbification we need to make sure the code at least matches the semantics of the x86 architecture.
NOTE: Later someone called this normalization, which is a much less stupid word.
I hate these things. Another is def-use chains, but both are horribly underdocumented. Their only use in most literature is so the author can immediately move to SSA form.
For each variable, its UD chain is a list of each usage in the AST, with the corresponding potential definition of the variable at that use. For each potential definition that exists at that point, there is one UD element in the chain. If there's only one potential definition at a point, then it's definitely the true one. Users of UD chains include optimizers and codegen. UD chains are always regenerated for use between passes by using the UD visitor on the top-level chunk.
At its simplest, the code `u8 x = 0;` has an empty UD-chain, because there are no uses. Its definition could even be classified as dead code.
Clearly, a definition of a variable overrides every definition before it, but that is only within a basic block. In the following code, a variable has a single potential definition in each branch of the if statement, but afterward it will have two:
    u8 x = 0; /* Potential definitions: [x = 0]
               * UD-chain of x:
This problem is a large area of study in itself, but a common approach is to imagine the variables as vertices of a graph, with an edge wherever two live ranges overlap, turning register allocation into graph coloring.
The actual coloring algorithm used is Welsh-Powell, which sorts the VTEs/vertices by degree before attempting greedy coloring.
If more colors are needed than there are physical registers, then we have a conflict, and must spill. There are two ways to do so: ~~spill2var~~ and spill2stack. ~~The former is necessary on boundaries where suddenly a specific register/color must be used (e.g. returning in `eax`).~~ The latter transforms every use of a local variable (`ASTExprVar` where its VTE is of type `VARTABLEENTRY_VAR`) into the form `@stack + n`.
If spill2stack is used, then CG must fail once so that dumbification can be applied again.
## Pre-coloring
NOTE: spill2var turned out to be pushing the problem a step back rather than solving it. Because it is known in advance what must be pre-colored, any such expressions are immediately placed in their own variable by another pass (dumbification?). If the assignment turns out to have been redundant, the register allocator should coalesce the moves.
~~I skipped forward a bit. Coloring assumes that all registers have equal importance, which is never true. A return value must be in `eax`, the remainder of division must be in `edx`, etc. In 64-bit, the index of an argument determines in which register it may end up.~~
~~The pre-coloring visitor applies said rules to the AST, setting the colors in the VTE. It is completely plausible that a conflict can occur here, too, from two variables having overlapping live ranges and the same color, but it can also be from demanding more than one color from the same variable. In the latter case, the pre-coloring visitor gives up as soon as it's detected. In both cases we do spill2var, not spill2stack, because spilling to the stack doesn't solve the pre-coloring problem.~~
## Callee-saved pass
Using the same Fibonacci example as above, this is the result.
    mov eax, ecx
    ret
## Generics
**NOTE: I intend to place this section in a different Markdown file entirely. It will be simply too big.**
    record Foo[T, U, V] {
        T t;
        U u;
        V v;
    }
Nectar does generics similarly to C++. Structures are simple to make generic. When parsing a generic structure definition we must introduce a new scope, so we can introduce the generic types as instances of `TypeGeneric`. If we encounter a parametrization like `Foo[u8, u16, u32]`, we walk up the tree formed by the type of `Foo`, and replace all `TypeGeneric` instances with the concrete types. This is done by `type_parametrize` which takes a `Parametrization` structure. Note that generic type names are not used, but the indices at which they appear.
    bar: [T]T(T a, T b) -> {
        return a + b;
    };
If a function is defined with a generic type, parsing it is skipped until an explicit instantiation. This is because type checking is coupled with parsing. It needn't be this way, but it's a refactoring I'm not interested in doing at the moment. This ended up bringing other complexities. Because of the parser-type checker coupling, we must know what a generic type's name originally was, so `TypeGeneric`s must store this in addition to the index.
    @instantiate bar[u32];
Upon parsing the above statement, the parser:
1. Creates a new scope
2. Finds the generic type names (using an output value of `type_parametrize` not mentioned until now)
3. Inserts the concrete types into the scope under the generic type names
4. Jumps to the generic function definition (in fact, to *right after the `[...]` block* to ignore the genericness)
5. Begins parsing the function's code block
6. Pops the scope
7. Jumps back to the end of the `@instantiate` statement
8. Inserts the function code block into a new symbol, appending the concrete type names to the original function name separated by underscores (`bar_u32`)
How's that for a hack?
## Other problems with this approach (1)
Short-circuit evaluation is when the evaluation of an expression is guaranteed to stop once the output is already known. For example, if in `A || B` `A` is already truthy, then `B` is not evaluated. This is not just an optimization, but an important semantic detail, as evaluation of the operands may have side-effects.
Let us write `if(x == 1 || y == 1) { do stuff; }` in x86:
    cmp eax, 1
    je .L1
    cmp ebx, 1
    jne .L2
    .L1:
    ; do stuff
    .L2:
Note that the two jump instructions are basically goto statements. As the Nectar IR is defined without gotos, it is practically impossible for the compiler to output the neat code shown above. You could insert special logic for this case, but in general it'll fail.
Even worse, the dumbification pass will try to move the condition into a variable:
    do stuff;
    }
And now we need 2 new registers for no reason.
In conclusion, what? Should a good IR actually be 100% flat and have nothing but jumps? Can this be solved by modelling the code as a graph of basic blocks? I don't know, but for now I have given up on short-circuit evaluation, and I do not support `||` or `&&` at all.
## Other problems with this approach (2)
The `denoop_visitor` pass is incredibly important in normalizing the AST to something other passes will accept. Here's one case I found when trying to implement a statically allocated list class:
    T* data = &((*this).data[0]);
It seems innocent enough, but it actually becomes:
    T* data = &*(&*((&*this + 4) as T[4]*) + 0);
As of writing, `denoop_visitor` had produced this:
    T* data = (this + 4) as T*;
The code generator failed to accept this, because the `as T*` cast meant that it could not match any pattern. The dumbifier also failed to decompose this to `data = this; data = data + 4;` for the same reason.
What was my solution? IGNORE ALL POINTER CASTS! As I wrote above, the Nectar AST does not support pointer arithmetic like that of C. By this point, all complex types should have already been converted into integers. Therefore, it does not even matter.
By adding the rule (`x as A*` -> `x` *only* if x's type is a pointer), we obtain the following after denooping:
    T* data = this + 4;
## Adding a Feature
When adding a feature, first write it out in Nectar in the ideal dumbified form. Make sure this compiles correctly. Afterward, implement dumbification rules so that code can be written in any fashion. If specific colorings are required, then the pre-coloring and spill2var passes must be updated. The following is an example with multiplication, as this is what I'm adding as of writing.
Note the way `mul` works on x86 (yes, I'm aware `imul` exists). Firstly, one of the operands is the destination, because `mul` is a 2-op instruction. Secondly, the other operand cannot be an immediate, because it is defined as r/m (register or memory), so if the second operand is a constant, it must be split into a variable (`varify` in `dumberdowner.c`). Thirdly, the destination must be the A register, so one of the operands must be pre-colored to A. Fourthly, `mul` clobbers the D register with the high half of the product. In other words, we have an instruction with *two* output registers, which the Nectar AST does not support. But we can't have the register allocator assign anything to D here.
To account for this, we can have a second assignment statement right next to the multiplication. Because the main multiplication clobbers the source operand, the mulhi assignment must come before the mul. Putting all this together, this is the canonical way to do `z = x * y` with an x86 target:
    w = z *^ y;
    z = z * y;
But this is without pre-coloring. We want precolored nodes to live as little as possible, because separately solving pre-coloring collisions whilst also keeping the code dumbified *and* not horrible turned out to be practically impossible (spill2var).
    k = x;
    w = k *^ y;