Brendan wrote:
a) The first steps of most compilers are tokenising (converting text into tokens), parsing (building an abstract syntax tree) and sanity checking. If you have 2 compilers (e.g. one for "C to IL" and another for "IL to native") then both compilers end up doing parsing and sanity checking.
I've already thought out some of the IL, and I'm thinking of going with a BASIC-like language, so that unneeded parsing/tokenizing overhead is kept to a minimum and the IL is effectively the AST in serialized form.
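To give an idea of what I mean (the IL syntax below is completely made up and not final): if every IL statement is one line of "OPCODE arg arg ...", the translator's "parser" can be little more than splitting on whitespace. A minimal sketch in C:

Code:
    #include <stdio.h>
    #include <string.h>

    #define MAX_ARGS 8

    /* One IL statement == one line of "OPCODE arg arg ...". */
    struct il_stmt {
        char *opcode;
        char *arg[MAX_ARGS];
        int   nargs;
    };

    /* Split one IL line in place; returns 0 on success, -1 on a blank line. */
    static int il_parse_line(char *line, struct il_stmt *s)
    {
        char *tok;

        s->opcode = strtok(line, " \t\n");
        if (s->opcode == NULL)
            return -1;
        s->nargs = 0;
        while ((tok = strtok(NULL, " \t\n")) != NULL && s->nargs < MAX_ARGS)
            s->arg[s->nargs++] = tok;
        return 0;
    }

    int main(void)
    {
        char line[] = "ADD t0 t1 t2";   /* hypothetical IL for "t0 = t1 + t2" */
        struct il_stmt s;

        if (il_parse_line(line, &s) == 0)
            printf("op=%s, %d args\n", s.opcode, s.nargs);
        return 0;
    }

With something like that, the translator spends essentially no time on parsing and can get straight to code generation.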
Brendan wrote:
b) Because the first compiler can't do any machine-specific optimisations; most (all?) of the optimisations done by the first compiler can be done on the AST itself.
Indeed; I hadn't even thought of that. I'm still going to supply an optimization argument for the IL translator, though, just in case.
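For example, something like constant folding only needs the tree; a rough sketch, assuming a hypothetical node layout:

Code:
    #include <stdio.h>
    #include <stdlib.h>

    enum kind { K_CONST, K_ADD };

    struct node {
        enum kind    kind;
        long         value;        /* used when kind == K_CONST */
        struct node *lhs, *rhs;    /* used when kind == K_ADD   */
    };

    /* Fold "const + const" into a single constant node, bottom-up. */
    static struct node *fold(struct node *n)
    {
        if (n->kind == K_ADD) {
            n->lhs = fold(n->lhs);
            n->rhs = fold(n->rhs);
            if (n->lhs->kind == K_CONST && n->rhs->kind == K_CONST) {
                n->kind  = K_CONST;
                n->value = n->lhs->value + n->rhs->value;
                free(n->lhs);
                free(n->rhs);
                n->lhs = n->rhs = NULL;
            }
        }
        return n;
    }

    int main(void)
    {
        /* Build "(2 + 3)" by hand and fold it to "5" (no malloc checks; sketch only). */
        struct node *a = malloc(sizeof *a), *b = malloc(sizeof *b), *sum = malloc(sizeof *sum);
        a->kind = K_CONST; a->value = 2; a->lhs = a->rhs = NULL;
        b->kind = K_CONST; b->value = 3; b->lhs = b->rhs = NULL;
        sum->kind = K_ADD; sum->lhs = a; sum->rhs = b;
        sum = fold(sum);
        printf("%ld\n", sum->value);   /* prints 5 */
        free(sum);
        return 0;
    }

The pass never looks at the target CPU, which is exactly why it belongs in the first compiler.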
Brendan wrote:
c) If you use the AST as the basis for an intermediate representation, then the parsing done by the second compiler can be extremely minimal.
Yeah, that's why I'm going with a BASIC-like language: parsing is already fast and easy (as I said in my response to point a).
Brendan wrote:
d) When converting from one language to another, there will be "expressivity mismatches" where the first language can contain things that are hard to express in the second language (or the reverse). When you chain 2 or more compilers together the "expressivity mismatches" are multiplied. For a simple example; the first compiler might see a "foreach item in array { }" loop that is an extremely good candidate for conversion to SIMD; the intermediate "byte-code" language may not have any support for "variable width SIMD" forcing the first compiler to convert the "foreach" into a sequence ending with "if(cond)" and "jump startLoop" instructions; and now the back-end compiler (which may be converting to a language like 80x86 with SSE/AVX) has to try to detect that the loop (that no longer looks like a loop) could/should be converted to SIMD after the first compiler has made it much more difficult to do so.
Wow, I hadn't considered that. I guess the compiler could add extensions to the IL before run time (e.g. keeping high-level constructs like that loop intact), and those extensions would then be passed through to the IL translator so it can still generate good code.
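To restate Brendan's example in C (my own illustration; the second function uses goto to mimic what a pure compare-and-jump IL would force):

Code:
    /* What the first compiler sees - trivially vectorisable: */
    void scale(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] * 2.0f;
    }

    /* What an "if(cond) + jump" IL hands to the back-end - same semantics,
       but the back-end must prove it is a loop all over again before it
       can use SSE/AVX: */
    void scale_lowered(float *a, const float *b, int n)
    {
        int i = 0;
    startLoop:
        if (i >= n)
            goto done;
        a[i] = b[i] * 2.0f;
        i++;
        goto startLoop;
    done:
        return;
    }

    int main(void)
    {
        float a[4], b[4] = { 1, 2, 3, 4 };
        scale(a, b, 4);
        scale_lowered(a, b, 4);
        return 0;
    }

An IL extension that keeps the "loop over an array" shape intact would let the translator emit SIMD directly instead of rediscovering the loop.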
Brendan wrote:
e) The "expressivity mismatches" can severely effect the second compiler's ability to generate sane error messages and/or information needed for run-time debugging. You either need to ensure that the first compiler detects/reports all possible errors (including run-time errors - zero chance of that when the source code is in C); or you need to provide a way for the second compiler and/or run-time environment to map pieces of generated code (in the intermediate language and in the final/native executable) back to the "column X in line Y of file Z" location in the original source code, where this mapping has to survive all transformations/optimisations done by any compiler. Note: Some compilers use a special "debugging output" mode where all optimisations are disabled (to make it easier to create the mapping between final/native code and its original location). This is an extremely poor alternative with significant problems (e.g. bugs that disappear as soon as you try to debug them) and (in my opinion) should be avoided if at all possible.
You mean the annoying "it compiles but dumps core at run time" phenomenon you get with C? I hate that. And I hope my answer to point d addresses this as well?
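Concretely, I picture the mapping Brendan describes as something like this (a hypothetical layout, not a worked-out design): every IL instruction carries its origin, and every pass that rewrites instructions has to copy that field to whatever it emits.

Code:
    /* Hypothetical debug-mapping sketch: "column X in line Y of file Z"
       attached to every IL instruction, surviving all transformations. */
    struct src_loc {
        const char *file;   /* file Z   */
        unsigned    line;   /* line Y   */
        unsigned    column; /* column X */
    };

    struct il_insn {
        int            opcode;
        struct src_loc origin;  /* copied, never dropped, by every pass */
        /* ... operands ... */
    };

Any pass that drops or merges instructions then has to decide which origin the result keeps.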
Brendan wrote:
f) For modern programming environments (IDEs), you want the editor to highlight errors in the source code (so that the programmer doesn't need to compile before finding them). For this purpose, I'd want to design the toolchain so that the first compiler's parsing and sanity checking can be used as a static analyser by an IDE. Basically; you'd have a "check but don't optimise or generate any output" mode in the first compiler; and the IDE would use this to find errors in the source code (and highlight them in the editor) while the programmer is typing. Now; old compilers tend to stop as soon as they find the first error (it makes writing the compiler easier); but when the compiler is being used for error detection (and not compilation) by an IDE you want the editor to highlight as many errors in the source code as possible, and you definitely do not want the compiler to stop as soon as it sees the first error.
I believe that separate programs which can each serve multiple goals are better than one large chunk that does one thing. So yes, I will divide the usual compiler structure (lexer, backends, etc.) into separate programs.
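For the IDE case, the front-end program just needs a mode that runs the checks and emits nothing. A sketch with a made-up flag name and stubbed-out passes:

Code:
    #include <stdio.h>
    #include <string.h>

    static int run_checks(const char *path) { (void)path; return 0; } /* lex+parse+sanity (stub) */
    static int emit_il(const char *path)    { (void)path; return 0; } /* code generation (stub)  */

    int main(int argc, char **argv)
    {
        int check_only = 0;
        const char *path = NULL;

        for (int i = 1; i < argc; i++) {
            if (strcmp(argv[i], "--check") == 0)
                check_only = 1;            /* IDE mode: report errors, emit nothing */
            else
                path = argv[i];
        }
        if (path == NULL) {
            fprintf(stderr, "usage: mazacc [--check] file\n");  /* "mazacc" is a made-up name */
            return 2;
        }
        if (run_checks(path) != 0)
            return 1;
        return check_only ? 0 : emit_il(path);
    }

The IDE would then run the "--check" mode on save (or while typing) and highlight whatever it reports.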
Brendan wrote:
g) In general; to avoid "stop on first error"; when the compiler detects an error it needs to begin searching for something recognisable. For example, if you write "for( 1234 ) { x = y; }" the compiler will detect that "for( 1234 )" is an error, and can then skip forward until it sees the "{" (the next recognisable thing) and can continue checking the "{ x = y; }" for correct syntax, etc. Due to "terseness" and the ability to (e.g.) define functions and types inside functions; this "searching for something recognisable" is extremely difficult for C. For example, if a function looks wrong, you can't just search forward until you see the start of the next function definition because the next function definition can be nested within the "previously reported as erroneous" function. The end result of this is that (for example), if/when the programmer accidentally forgets a '}' or semicolon C compilers tend to generate a large number of "nonsense errors" until they reach the end of the file. This means that (for the purpose of integrating it into an "integrated development environment") if a C compiler doesn't stop on first error, often only the first error is useful, and often none of the errors are useful.
I personally like it when compilers yell at me as soon as they come across the first error, so I can gradually spray the bugs one at a time; that's what I'll have the ones I write do.
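That said, for anyone who wants the other behaviour, the "skip to something recognisable" trick Brendan describes looks roughly like this (the lexer is stubbed with a fixed token stream, just to make the sketch self-contained):

Code:
    #include <stdio.h>

    enum tok { TOK_LBRACE, TOK_RBRACE, TOK_SEMI, TOK_EOF, TOK_OTHER };

    /* Stub lexer over a fixed token stream. */
    static enum tok stream[] = { TOK_OTHER, TOK_OTHER, TOK_SEMI, TOK_EOF };
    static int pos = 0;
    static enum tok peek(void)    { return stream[pos]; }
    static void     advance(void) { if (stream[pos] != TOK_EOF) pos++; }

    /* On a syntax error, skip ahead to the next "recognisable" token
       ('{', '}', ';' or end of file) and resume checking from there. */
    static void recover(void)
    {
        while (peek() != TOK_LBRACE && peek() != TOK_RBRACE &&
               peek() != TOK_SEMI   && peek() != TOK_EOF)
            advance();
    }

    int main(void)
    {
        recover();
        printf("resynchronised at token %d\n", pos);  /* prints 2 (the ';') */
        return 0;
    }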
Brendan wrote:
h) The first compiler will have to support inline assembly for every CPU that the second compiler supports. This may include the ability to check inline assembly for as many kinds of errors as possible; and means that whatever you use as intermediate representation will have to handle (possibly many dialects of) assembly language (and be able to map native instructions back to the "column X in line Y of file Z" location in the original inline assembly at run-time).
I'll pass the inline assembly off to the correct assembly front-end, and later concatenate the resulting backend-language files back into a single backend-language file.
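Roughly, I picture each inline assembly block travelling through the IL as an opaque, tagged blob (the field names below are made up):

Code:
    /* Made-up sketch: inline asm passes through the IL untouched, tagged
       with its dialect (so the right assembler front-end is invoked) and
       its origin (so errors map back to the original source). */
    struct asm_block {
        const char *dialect;  /* e.g. "x86-att", "arm64" */
        const char *text;     /* the inline assembly, verbatim */
        const char *file;     /* original source location ...  */
        unsigned    line;     /* ... for error reporting       */
    };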
Brendan wrote:
i) A toolchain consists of more than just compilers - how will you be handling libraries and linking (and "whole program optimisation")?
I don't know yet; I'm trying to scrape together money for a book on linkers. In the meantime I'm reading books on compilers, and the source code of TCC, to find out more about the internals, although I had a pretty good idea of them before I started Maza.
Thanks for your feedback.