Page 1 of 1

Toolchain Structure

Posted: Thu Aug 14, 2014 2:46 pm
by SoLDMG
Hi everyone, does this toolchain structure tree make any sense (i.e. is it practical), and does anyone have any better ideas/improvements?

Tree:

1.1: C compiler front end + optimizer (could be C++ or FreeBASIC, just an example)
1.2: Assembler front end + optimizer (platform specific)
3: Intermediate language translator (translates to backend language)
4: Backend (platform specific)

[1.1] [1.2]
| |
| <generates>. | <generates>
| <intermediary> | <backend>
| <language> | <language>
| |
[2] [3]
|
| <generates>
| <backend>
| <language>
|
[3]

Thanks in advance.

Re: Toolchain Structure

Posted: Thu Aug 14, 2014 2:50 pm
by SoLDMG
Okay, I found out that the forum software doesn't take kindly to spaces. I'll post a picture ASAP.

Re: Toolchain Structure

Posted: Fri Aug 15, 2014 12:17 am
by Combuster
It helps to use tags. And the preview button.

Re: Toolchain Structure

Posted: Fri Aug 15, 2014 12:46 am
by SoLDMG
Alright, thanks to Combuster I managed to fix the spacing problem:

Code: Select all

1.1: C compiler front end + optimizer (could be C++ or FreeBASIC, just an example)
1.2: Assembler front end + optimizer (platform specific)
2: Intermediate language translator (translates to specific backend language)
3: Backend (platform specific)

    [1.1]                      [1.2]
        |                        |
        | <generates>            | <generates>
        | <intermediary>         | <backend>
        |  <language>            | <language>
        |                        |
       [2]                      [3]
        |
        | <generates>
        | <backend>
        | <language>
        |
       [3]

Re: Toolchain Structure

Posted: Fri Aug 15, 2014 12:52 am
by SoLDMG
Combuster wrote:It helps to use tags. And the preview button.
Thank you so much, man. You really helped me out here.

Re: Toolchain Structure

Posted: Fri Aug 15, 2014 1:21 am
by iansjack
I'm unclear as to what exactly you are asking here. You seem to have provided a very general outline of a typical compiler.

Re: Toolchain Structure

Posted: Fri Aug 15, 2014 2:21 am
by SoLDMG
iansjack wrote:I'm unclear as to what exactly you are asking here. You seem to have provided a very general outline of a typical compiler.
In my original post I asked if it is a practical implementation i.e. is it usable and implementable.

Re: Toolchain Structure

Posted: Fri Aug 15, 2014 2:23 am
by Brendan
Hi,
SoLDMG wrote:Hi everyone, does this toolchain structure tree make any sense (i.e. is it practical), and does anyone have any better ideas/improvements?
Some random comments (that may or may not help):

a) The first steps of most compilers are tokenising (converting text into tokens), parsing (building an abstract syntax tree) and sanity checking. If you have 2 compilers (e.g. one for "C to IL" and another for "IL to native") then both compilers end up doing parsing and sanity checking.

b) Because the first compiler can't do any machine-specific optimisations; most (all?) of the optimisations done by the first compiler can be done on the AST itself.

c) If you use the AST as the basis for an intermediate representation, then the parsing done by the second compiler can be extremely minimal.

d) When converting from one language to another, there will be "expressivity mismatches" where the first language can contain things that are hard to express in the second language (or the reverse). When you chain 2 or more compilers together the "expressivity mismatches" are multiplied. For a simple example; the first compiler might see a "foreach item in array { }" loop that is an extremely good candidate for conversion to SIMD; the intermediate "byte-code" language may not have any support for "variable width SIMD" forcing the first compiler to convert the "foreach" into a sequence ending with "if(cond)" and "jump startLoop" instructions; and now the back-end compiler (which may be converting to a language like 80x86 with SSE/AVX) has to try to detect that the loop (that no longer looks like a loop) could/should be converted to SIMD after the first compiler has made it much more difficult to do so.

e) The "expressivity mismatches" can severely effect the second compiler's ability to generate sane error messages and/or information needed for run-time debugging. You either need to ensure that the first compiler detects/reports all possible errors (including run-time errors - zero chance of that when the source code is in C); or you need to provide a way for the second compiler and/or run-time environment to map pieces of generated code (in the intermediate language and in the final/native executable) back to the "column X in line Y of file Z" location in the original source code, where this mapping has to survive all transformations/optimisations done by any compiler. Note: Some compilers use a special "debugging output" mode where all optimisations are disabled (to make it easier to create the mapping between final/native code and its original location). This is an extremely poor alternative with significant problems (e.g. bugs that disappear as soon as you try to debug them) and (in my opinion) should be avoided if at all possible.

f) For modern programming environments (IDEs), you want the editor to highlight errors in the source code (so that the programmer doesn't need to compile before finding them). For this purpose, I'd want to design the toolchain so that the first compiler's parsing and sanity checking can be used as a static analyser by an IDE. Basically; you'd have a "check but don't optimise or generate any output" mode in the first compiler; and the IDE would use this to find errors in the source code (and highlight them in the editor) while the programmer is typing. Now; old compilers tend to stop as soon as they find the first error (it makes writing the compiler easier); but when the compiler is being used for error detection (and not compilation) by an IDE you want the editor to highlight as many errors in the source code as possible, and you definitely do not want the compiler to stop as soon as it sees the first error.

g) In general; to avoid "stop on first error"; when the compiler detects an error it needs to begin searching for something recognisable. For example, if you write "for( 1234 ) { x = y; }" the compiler will detect that "for( 1234 )" is an error, and can then skip forward until it sees the "{" (the next recognisable thing) and can continue checking the "{ x = y; }" for correct syntax, etc. Due to "terseness" and the ability to (e.g.) define functions and types inside functions; this "searching for something recognisable" is extremely difficult for C. For example, if a function looks wrong, you can't just search forward until you see the start of the next function definition because the next function definition can be nested within the "previously reported as erroneous" function. The end result of this is that (for example), if/when the programmer accidentally forget a '}' or semicolon C compilers tend to generate a large number of "nonsense errors" until it reaches the end of the file. This means that (for the purpose of integrating it into an "integrated development environment") if a C compiler doesn't stop on first error, often only the first error is useful, and often none of the errors are useful.

h) The first compiler will have to support inline assembly for every CPU that the second compiler supports. This may include the ability to check inline assembly for as many kinds of errors as possible; and means that whatever you use as intermediate representation will have to handle (possibly many dialects of) assembly language (and be able to map native instructions back to the "column X in line Y of file Z" location in the original inline assembly at run-time).

i) A toolchain consists of more than just compilers - how will you be handling libraries and linking (and "whole program optimisation")?


Cheers,

Brendan

Re: Toolchain Structure

Posted: Fri Aug 15, 2014 2:15 pm
by SoLDMG
Brendan wrote: a) The first steps of most compilers are tokenising (converting text into tokens), parsing (building an abstract syntax tree) and sanity checking. If you have 2 compilers (e.g. one for "C to IL" and another for "IL to native") then both compilers end up doing parsing and sanity checking.
I've already thought out some of the IL, and I'm thinking of going with a BASIC-like language so that unneeded parsing/tokenizing overhead is kept to a minimum + the IL is the AST.
Brendan wrote: b) Because the first compiler can't do any machine-specific optimisations; most (all?) of the optimisations done by the first compiler can be done on the AST itself.
Indeed. I didn't even think of that, but indeed. I am going to supply an optimization argument for the IL translator just in case though.
Brendan wrote: c) If you use the AST as the basis for an intermediate representation, then the parsing done by the second compiler can be extremely minimal.
Yeah, that's why I'm going with a BASIC-like language so that parsing already is fast and easy (like I said in my response to point a).
Brendan wrote: d) When converting from one language to another, there will be "expressivity mismatches" where the first language can contain things that are hard to express in the second language (or the reverse). When you chain 2 or more compilers together the "expressivity mismatches" are multiplied. For a simple example; the first compiler might see a "foreach item in array { }" loop that is an extremely good candidate for conversion to SIMD; the intermediate "byte-code" language may not have any support for "variable width SIMD" forcing the first compiler to convert the "foreach" into a sequence ending with "if(cond)" and "jump startLoop" instructions; and now the back-end compiler (which may be converting to a language like 80x86 with SSE/AVX) has to try to detect that the loop (that no longer looks like a loop) could/should be converted to SIMD after the first compiler has made it much more difficult to do so.
Wow. I guess extensions added before runtime to the IL in the compiler could fix that problem which is then also passed through to the IL translator so it can still generate code.
Brendan wrote: e) The "expressivity mismatches" can severely effect the second compiler's ability to generate sane error messages and/or information needed for run-time debugging. You either need to ensure that the first compiler detects/reports all possible errors (including run-time errors - zero chance of that when the source code is in C); or you need to provide a way for the second compiler and/or run-time environment to map pieces of generated code (in the intermediate language and in the final/native executable) back to the "column X in line Y of file Z" location in the original source code, where this mapping has to survive all transformations/optimisations done by any compiler. Note: Some compilers use a special "debugging output" mode where all optimisations are disabled (to make it easier to create the mapping between final/native code and its original location). This is an extremely poor alternative with significant problems (e.g. bugs that disappear as soon as you try to debug them) and (in my opinion) should be avoided if at all possible.
You mean with runtime errors and C the annoying "compiles-but-doesn't-run-let's-dump" phenomenon? I hate that. And I hope I addressed this in point d?
Brendan wrote: f) For modern programming environments (IDEs), you want the editor to highlight errors in the source code (so that the programmer doesn't need to compile before finding them). For this purpose, I'd want to design the toolchain so that the first compiler's parsing and sanity checking can be used as a static analyser by an IDE. Basically; you'd have a "check but don't optimise or generate any output" mode in the first compiler; and the IDE would use this to find errors in the source code (and highlight them in the editor) while the programmer is typing. Now; old compilers tend to stop as soon as they find the first error (it makes writing the compiler easier); but when the compiler is being used for error detection (and not compilation) by an IDE you want the editor to highlight as many errors in the source code as possible, and you definitely do not want the compiler to stop as soon as it sees the first error.
I believe that separate programs that can serve multiple goals is better than one large chunk that does one thing. So yes, I will divide the usual compiler structure (lexers and backends etc.) into separate programs.
Brendan wrote: g) In general; to avoid "stop on first error"; when the compiler detects an error it needs to begin searching for something recognisable. For example, if you write "for( 1234 ) { x = y; }" the compiler will detect that "for( 1234 )" is an error, and can then skip forward until it sees the "{" (the next recognisable thing) and can continue checking the "{ x = y; }" for correct syntax, etc. Due to "terseness" and the ability to (e.g.) define functions and types inside functions; this "searching for something recognisable" is extremely difficult for C. For example, if a function looks wrong, you can't just search forward until you see the start of the next function definition because the next function definition can be nested within the "previously reported as erroneous" function. The end result of this is that (for example), if/when the programmer accidentally forget a '}' or semicolon C compilers tend to generate a large number of "nonsense errors" until it reaches the end of the file. This means that (for the purpose of integrating it into an "integrated development environment") if a C compiler doesn't stop on first error, often only the first error is useful, and often none of the errors are useful.
I personally like it when compilers yell at me when they come across the first error so I can gradually spray the bugs, so that's what I'll have the ones that I write do.
Brendan wrote: h) The first compiler will have to support inline assembly for every CPU that the second compiler supports. This may include the ability to check inline assembly for as many kinds of errors as possible; and means that whatever you use as intermediate representation will have to handle (possibly many dialects of) assembly language (and be able to map native instructions back to the "column X in line Y of file Z" location in the original inline assembly at run-time).
I'll pass the inline assembly off to the correct assembly frontend, and then later concatenate the backend language files back together into one backend language file.
Brendan wrote: i) A toolchain consists of more than just compilers - how will you be handling libraries and linking (and "whole program optimisation")?
I don't know yet. I'm trying to scrape money together for a book on linkers. And I don't know. I'm reading books on compilers and the source code to TCC just to find out more about the internals although I had a pretty good idea of it before I started Maza.

Thanks for your feedback.

Re: Toolchain Structure

Posted: Sat Aug 16, 2014 1:30 am
by Brendan
Hi,
SoLDMG wrote:
Brendan wrote: a) The first steps of most compilers are tokenising (converting text into tokens), parsing (building an abstract syntax tree) and sanity checking. If you have 2 compilers (e.g. one for "C to IL" and another for "IL to native") then both compilers end up doing parsing and sanity checking.
I've already thought out some of the IL, and I'm thinking of going with a BASIC-like language so that unneeded parsing/tokenizing overhead is kept to a minimum + the IL is the AST.
A BASIC-like language in text form that has to be tokenised and parsed; or binary tokens representing a BASIC-like functionality that have to be parsed?
SoLDMG wrote:
Brendan wrote: c) If you use the AST as the basis for an intermediate representation, then the parsing done by the second compiler can be extremely minimal.
Yeah, that's why I'm going with a BASIC-like language so that parsing already is fast and easy (like I said in my response to point a).
Note: What I'm doing is using "serialised abstract syntax tree" as the output of the first compiler (and input to the second compiler). More specifically; the first compiler combines the node's data with a "last sibling" flag and a "first child" flag, and the second compiler uses those 2 flags to reconstruct the tree.
SoLDMG wrote:
Brendan wrote: d) When converting from one language to another, there will be "expressivity mismatches" where the first language can contain things that are hard to express in the second language (or the reverse). When you chain 2 or more compilers together the "expressivity mismatches" are multiplied. For a simple example; the first compiler might see a "foreach item in array { }" loop that is an extremely good candidate for conversion to SIMD; the intermediate "byte-code" language may not have any support for "variable width SIMD" forcing the first compiler to convert the "foreach" into a sequence ending with "if(cond)" and "jump startLoop" instructions; and now the back-end compiler (which may be converting to a language like 80x86 with SSE/AVX) has to try to detect that the loop (that no longer looks like a loop) could/should be converted to SIMD after the first compiler has made it much more difficult to do so.
Wow. I guess extensions added before runtime to the IL in the compiler could fix that problem which is then also passed through to the IL translator so it can still generate code.
Let's have a simple example. Imagine the original source code looks like this:

Code: Select all

    for(i = 0; i < 8192; i++) {
        myArrayOfBytes[i] = i;
    }
After all the optimising; the resulting native assembly might load the (packed bytes) constant 0x0706050403020100 into an SSE register and the constant 0x0F0E0D0C0B0A0908 into another SSE register; then have a loop (with 512 iterations) that stores both registers into memory and adds 0x0202020202020202 to each register.

Now; what does the IL created by the first compiler look like?
SoLDMG wrote:
Brendan wrote:e) The "expressivity mismatches" can severely effect the second compiler's ability to generate sane error messages and/or information needed for run-time debugging. You either need to ensure that the first compiler detects/reports all possible errors (including run-time errors - zero chance of that when the source code is in C); or you need to provide a way for the second compiler and/or run-time environment to map pieces of generated code (in the intermediate language and in the final/native executable) back to the "column X in line Y of file Z" location in the original source code, where this mapping has to survive all transformations/optimisations done by any compiler. Note: Some compilers use a special "debugging output" mode where all optimisations are disabled (to make it easier to create the mapping between final/native code and its original location). This is an extremely poor alternative with significant problems (e.g. bugs that disappear as soon as you try to debug them) and (in my opinion) should be avoided if at all possible.
You mean with runtime errors and C the annoying "compiles-but-doesn't-run-let's-dump" phenomenon? I hate that. And I hope I addressed this in point d?
I mean; if the second compiler has to report an error, then how does the second compiler know where the error came from (so that it can say "error on line 123 of file foo.c" instead of "error but I'm too stupid to tell you where"); given that the first compiler does optimisations (and each optimisation it does transforms/mutates the code into a slightly different form) and therefore the input to the second compiler may look nothing like the input to the first compiler.

I also mean; if the final executable crashes at run-time, then how does the run-time environment (e.g. possibly a debugger like GDB) know where the error came from (so that it can say "general protection fault on line 123 of file foo.c" instead of "general protection fault but I'm too stupid to tell you where"); given that both the first compiler and the second compiler have transformed/mutated the code into something that looks nothing like the input to the first compiler.

For example, most linkers have problems with this. If a linker notices an error (e.g. "symbol not found") it will probably generate a bad error message (like "symbol foo not found at offset 1234567 in object file bar.o") and won't generate a useful error message (like "symbol foo not found on line 123 of C source file bar.c").
SoLDMG wrote:
Brendan wrote:f) For modern programming environments (IDEs), you want the editor to highlight errors in the source code (so that the programmer doesn't need to compile before finding them). For this purpose, I'd want to design the toolchain so that the first compiler's parsing and sanity checking can be used as a static analyser by an IDE. Basically; you'd have a "check but don't optimise or generate any output" mode in the first compiler; and the IDE would use this to find errors in the source code (and highlight them in the editor) while the programmer is typing. Now; old compilers tend to stop as soon as they find the first error (it makes writing the compiler easier); but when the compiler is being used for error detection (and not compilation) by an IDE you want the editor to highlight as many errors in the source code as possible, and you definitely do not want the compiler to stop as soon as it sees the first error.
I believe that separate programs that can serve multiple goals is better than one large chunk that does one thing. So yes, I will divide the usual compiler structure (lexers and backends etc.) into separate programs.
Separate programs? Every tool in your tool-chain converts "input data" into "output error messages" (and "output data" if you're lucky). Having more separate tools just spreads the same problem (getting the error messages back to the IDE) further; while also increasing the number of times you have to encode output (in one tool) and decode input (in the next tool).
SoLDMG wrote:
Brendan wrote:g) In general; to avoid "stop on first error"; when the compiler detects an error it needs to begin searching for something recognisable. For example, if you write "for( 1234 ) { x = y; }" the compiler will detect that "for( 1234 )" is an error, and can then skip forward until it sees the "{" (the next recognisable thing) and can continue checking the "{ x = y; }" for correct syntax, etc. Due to "terseness" and the ability to (e.g.) define functions and types inside functions; this "searching for something recognisable" is extremely difficult for C. For example, if a function looks wrong, you can't just search forward until you see the start of the next function definition because the next function definition can be nested within the "previously reported as erroneous" function. The end result of this is that (for example), if/when the programmer accidentally forget a '}' or semicolon C compilers tend to generate a large number of "nonsense errors" until it reaches the end of the file. This means that (for the purpose of integrating it into an "integrated development environment") if a C compiler doesn't stop on first error, often only the first error is useful, and often none of the errors are useful.
I personally like it when compilers yell at me when they come across the first error so I can gradually spray the bugs, so that's what I'll have the ones that I write do.
I personally like IDEs that highlight all the errors before I compile, so that I don't need to waste my time doing "compile, then fix one error and retry". In a similar way, I like word-processors that highlight spelling errors while I type (where I don't need to manually run a "spelling error checker" and fix my spelling mistakes in a slow and painful one-by-one way). ;)
SoLDMG wrote:
Brendan wrote:h) The first compiler will have to support inline assembly for every CPU that the second compiler supports. This may include the ability to check inline assembly for as many kinds of errors as possible; and means that whatever you use as intermediate representation will have to handle (possibly many dialects of) assembly language (and be able to map native instructions back to the "column X in line Y of file Z" location in the original inline assembly at run-time).
I'll pass the inline assembly off to the correct assembly frontend, and then later concatenate the backend language files back together into one backend language file.
Erm, OK. Whatever you use as intermediate representation (e.g. the file format for that "all backend language files concatenated together" file) will have to handle (possibly many dialects of) assembly language (and be able to map native instructions back to the "column X in line Y of file Z" location in the original inline assembly at run-time).


Cheers,

Brendan

Re: Toolchain Structure

Posted: Sat Aug 16, 2014 3:52 pm
by SoLDMG
I'm gonna rethink all of this. The brother of a friend of mine has the Dragon Book and he's willing to sell it to me for a few euros, so I'll postpone the linker book. Thanks for asking all these questions, I clearly didn't have enough thinking/theory done to actually program something that people would use.