Public Domain C/C++ Compiler

~ · Post by ~ » Sun Nov 26, 2017 12:38 pm

Source code:
http://sourceforge.net/projects/c-compiler/files/

COMPILER-2017-11-08start.zip

Theory-only:
http://devel.archefire.org/forum/viewtopic.php?t=2359

Source-code-only:
http://devel.archefire.org/forum/viewtopic.php?t=2362

Repetitive code reading for functional mental redundancy of rare/heavy ideas:
http://devel.archefire.org/forum/viewtopic.php?t=2367

The plan of this project is writing a C/C++ compiler that can at least be compiled with MinGW or Open Watcom. We can only use the language that those compilers can understand in common.

The resulting compiler needs to be able to understand old and new versions of the language, needs to be able to understand idioms from all versions of Visual C++, Turbo C++, Borland C++, Open Watcom, and GCC mainly.

The goal is that we can easily port software with a compiler that is capable of compiling recent software under old platforms like DOS and Windows 98. That's why the compiler itself is written only in the simple C/C++ that the current Open Watcom understands, which should run under Windows 9x. GCC is used just as a control measure to make sure that the code will still compile under newer platforms with GCC-based compilers. No fancy code should be present in the compiler even if it becomes fully capable of understanding such code. This is to make it possible to always be able to compile it directly in old platforms and in exchange have a modern and capable compiler.

______________________________________________
______________________________________________
______________________________________________
______________________________________________

I'm starting this project. So far the code is only able to open a file and record the start/end offsets of each line, and also to count the number of lines present in the given source file.

Usage:

Code:

C source.c

The code should also compile under Linux.
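A minimal sketch of the line-indexing step described above, not the project's actual code. The names `LineSpan` and `index_lines` are made up for illustration, and the stream is assumed to be opened in binary mode so that the counted offsets match the byte offsets in the file:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record: byte offsets of one line. 'end' is exclusive and
   points at the terminating '\n' (or at end of file for the last line). */
typedef struct {
    long start;
    long end;
} LineSpan;

/* Appends one span, growing the array when needed. Returns 0 on failure. */
static int push_span(LineSpan **spans, size_t *used, size_t *cap,
                     long start, long end)
{
    if (*used == *cap) {
        size_t new_cap = *cap ? *cap * 2 : 64;
        LineSpan *tmp = (LineSpan *)realloc(*spans, new_cap * sizeof(LineSpan));
        if (tmp == NULL) return 0;
        *spans = tmp;
        *cap = new_cap;
    }
    (*spans)[*used].start = start;
    (*spans)[*used].end = end;
    (*used)++;
    return 1;
}

/* Reads fp from its current position to EOF and returns a malloc'ed array
   of line spans; *count receives the number of lines. Returns NULL (with
   *count == 0) for an empty stream or on allocation failure. */
LineSpan *index_lines(FILE *fp, size_t *count)
{
    LineSpan *spans = NULL;
    size_t used = 0, cap = 0;
    long pos = 0, line_start = 0;
    int c, ok = 1;

    *count = 0;
    while (ok && (c = getc(fp)) != EOF) {
        pos++;
        if (c == '\n') {
            ok = push_span(&spans, &used, &cap, line_start, pos - 1);
            line_start = pos;
        }
    }
    /* A last line without a trailing newline still counts as a line. */
    if (ok && pos > line_start)
        ok = push_span(&spans, &used, &cap, line_start, pos);
    if (!ok) {
        free(spans);
        return NULL;
    }
    *count = used;
    return spans;
}
```

The caller owns the returned array and releases it with `free`. Only standard C library calls are used, in keeping with the goal of building with both Open Watcom and GCC.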

Currently what I need to do next is to parse the source code to detect comments, strings and end-of-instruction/end-of-parameter characters like ; and , ... This is so that extremely long lines can be supported by loading only the span between the point where an instruction starts and the point where it ends (line, column). After that I will start by implementing the tasks that the preprocessor should perform.
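As a rough illustration of that scanning step, here is a small state machine that reports the offsets of `;` and `,` characters that are outside comments, string literals and character literals. It is only a sketch: `find_delimiters` and the state names are hypothetical, and trigraphs, backslash-newline splicing and C++ raw strings are deliberately not handled:

```c
#include <stddef.h>

enum ScanState { IN_CODE, IN_LINE_COMMENT, IN_BLOCK_COMMENT, IN_STRING, IN_CHAR };

/* Scans the NUL-terminated text 'src' and stores the offsets of ';' and ','
   found in plain code into out[0..max-1]. Returns the total number found,
   which may be larger than 'max'. */
size_t find_delimiters(const char *src, size_t *out, size_t max)
{
    enum ScanState st = IN_CODE;
    size_t i, found = 0;

    for (i = 0; src[i] != '\0'; i++) {
        char c = src[i];
        char next = src[i + 1];   /* at worst this reads the terminator */

        switch (st) {
        case IN_CODE:
            if (c == '/' && next == '/') { st = IN_LINE_COMMENT; i++; }
            else if (c == '/' && next == '*') { st = IN_BLOCK_COMMENT; i++; }
            else if (c == '"') st = IN_STRING;
            else if (c == '\'') st = IN_CHAR;
            else if (c == ';' || c == ',') {
                if (found < max) out[found] = i;
                found++;
            }
            break;
        case IN_LINE_COMMENT:
            if (c == '\n') st = IN_CODE;
            break;
        case IN_BLOCK_COMMENT:
            if (c == '*' && next == '/') { st = IN_CODE; i++; }
            break;
        case IN_STRING:
        case IN_CHAR:
            if (c == '\\' && next != '\0') i++;   /* skip the escaped character */
            else if ((st == IN_STRING && c == '"') ||
                     (st == IN_CHAR && c == '\''))
                st = IN_CODE;
            break;
        }
    }
    return found;
}
```

A file-based version would read characters with `getc` instead of indexing a buffer, but the state transitions would stay the same.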

If somebody wants to help me, just say so. The code is freely available so we will all learn. You can see what I've done and the explanations of what each part does and how to add them to a compiler. You can help me by telling me how to implement things and maybe implementing functions so I can review them to fully adapt them to the compiler.

It also has to be able to understand inline assembly for AS, GAS, TASM, MASM, and NASM, and also that for other architectures. For x86, all assembly code should be internally converted to NASM syntax.

We will start from zero supported keywords and add support for more as we try to compile more and more existing code. Thus our compiler will be no more than a collection of compiling capabilities, developed on demand and grouped in the same program as the need to compile more C/C++ code appears over time.

Solar · Post by **Solar** » Mon Nov 27, 2017 8:22 am

Everybody, please excuse the upcoming negativity, but (not only) I have tried to talk to ~ before, and he remained unresponsive. At this point I consider his advertising his "magic compiler" to be bordering on spam.

We've always been giving OS projects some flak if they come up with beautiful utopias, to see if they have an idea about the necessary groundwork. I take the liberty of doing the same for this project (where even the "utopia" seems to be rather vaguely defined). If my reply crosses the line, I apologize in advance and will hold no grudge if mods take it down.

~ wrote:The plan of this project is writing a C/C++ compiler that can at least be compiled with MinGW or Open Watcom. We can only use the language that those compilers can understand in common.

MinGW is an environment, not a compiler...

The language all halfway-competent C++ compilers "understand in common" is C++. The same goes for C, although it should be noted that several of the compilers you mention in the course of your post don't even come as far as C99.

Seeing how a compiler needs nothing special in the way of library support (basic file I/O is fully sufficient), plain old standard compliance should be enough.

~ wrote:The resulting compiler needs to be able to understand old and new versions of the language, needs to be able to understand idioms from all versions of Visual C++, Turbo C++, Borland C++, Open Watcom, and GCC mainly.

We have talked about this before. I still get the impression that you are confusing and commingling several things when you talk of "idioms".

When software written for one compiler doesn't compile on another, that is either 1) a (potentially dangerous) bug in the software, 2) an issue of the build system (which is quite separate from anything the compiler has influence on), 3) an issue with third-party headers / libraries, or 4) some explicit usage of compiler extensions.

I shudder to think of an attempt to bring all the extensions of Visual C++ and GCC into one compiler. Let alone the other three, which, to put it bluntly, are so outdated as to be irrelevant.

(TurboC++ --> BorlandC++ --> Embarcadero C++Builder, and that is using the Clang compiler core by now. OpenWatcom has not even reached C++11 standards by now, and is rather unlikely to ever get there.)

~ wrote:The goal is that we can easily port software with a compiler that is capable of compiling recent software under old platforms like DOS and Windows 98.

Again, you are confusing things. "Recent software" will make use of recent OS features. If software uses recent versions of, e.g., DirectX, or OpenGL, or any other API, you would have to port those new APIs to these old platforms. A good many of these APIs are proprietary, so there is nothing for you to compile, let alone port. And you would need drivers for recent hardware to run on those old platforms, unless you expect recent software using recent APIs to run on 20+ year old hardware.

None of these issues are addressed by a compiler, no matter how advanced it might be. At which point I challenge you to demonstrate that you even know what you are talking about, to give an architectural overview on how you expect things to work out.

Not just words, you tried that often enough. Paint us a picture. Hardware, OS layers, drivers, APIs.

~ wrote:No fancy code should be present in the compiler even if it becomes fully capable of understanding such code. This is to make it possible to always be able to compile it directly in old platforms and in exchange have a modern and capable compiler.

Do you know what a "self-hosted" compiler is?

It is a compiler that can compile itself. That is one of the first major milestones in a compiler's development. Once you have achieved that, you are no longer depending on what other compilers can, or cannot, do.

A new platform B gets supported by adding the necessary logic (e.g. binary formats) to the compiler backend, then compiling the compiler for platform B on platform A, getting a binary for platform B that is then capable of compiling on B. This is called "bootstrapping".

Again, you seem to be ignorant of some very basic concepts, which casts doubt on whether you should be embarking on this project, let alone asking for contributions, at this point in time.

~ wrote:So far the code is only able to open a file and record the start/end offsets of each line, and also to count the number of lines present in the given source file.

Which is about the level of sophistication displayed by the "Quickstart 1 - A word counter" tutorial example of Boost.Spirit / Lex. Only that you have so far ignored the existence of something like lexer / parser generators, despite multiple hints and pointers in that direction.

~ wrote:The code should also compile under Linux.

Believe me, that is the least of your problems.

~ wrote:Currently what I need to do next is to parse the source code to detect comments, strings and end-of-instruction/end-of-parameter characters like ; and , ...

Everyone, note how ~ is eschewing several decades of experience in how compilers are usually built, and is trying not only to reinvent the wheel, but to reinvent the concept of roundness in the process. He is actually going at this bottom-up. No syntax tree, no lexer / parser...

(Note, at this point, that rather experienced people have attempted to get full C++ compliance with such generated parsers, and -- after years of learning -- have found that it cannot really be done, that such generated parsers still need manual tweaking. Note that this is manual work added on top of generated parsers...)

I have, for some time, dabbled in maintaining AStyle, a C/C++/Java reformatter. It does (or did, back then) the "state machine" parsing you seem to be aiming for. It wasn't pretty, it was fragile as all hell, and next to unmaintainable. (I tip my hat to Jim Pattee, who took over, and had more success in improving things than I had.) You don't want to walk that road for a "real" compiler. Trust me. I've been there.

(It works, somewhat, as long as you are only going for source that is halfway-decent to begin with. Unfortunately C, and especially C++, allow some things that will make your toenails curl, but which are still 100% legit and must be parsed correctly.)

~ wrote:You can help me by telling me how to implement things...

I have pointed you to the Dragon Book before, to GNU Bison (C) resp. Boost.Spirit (C++). I have also pointed out that you will fail at playing catch-up with C++ single-handedly. I have so far refrained from pointing you to various style guides, seeing how your function names have reached a rather extensive state of illegibility before your source even starts doing things.

Solar · Post by **Solar** » Mon Nov 27, 2017 8:44 am

Bottom line, if I were you, I would start with a parser for C99 (or any other given version of ISO/IEC 9899). Turning source into a digital representation. No more. Don't bother with code generation yet. Just the parser. Then you can widen the scope, gradually...

AJ · Post by AJ » Mon Nov 27, 2017 9:24 am

@solar: Seems eminently sensible to me - no mods will be taking it down.

For my own curiosity I just want to clarify the goal - "to compile modern software under DOS / Windows 98". Why? I can (almost) understand using a modern system to compile for an older system but am just interested in the case for doing it the other way around.

Another point I don't get - surely if you compile code that complies with a particular compiler, then whether it runs on a particular OS depends on the system call library it is linked against? With a well-defined interface between OS-agnostic code and OS-specific code, getting it to run on a particular platform is the easy part. At least when compared with the actual parsing / code generation / optimisation / sensible message generation and so on.

Cheers,
Adam

iansjack · Post by **iansjack** » Mon Nov 27, 2017 1:43 pm

Well, have fun, but it's not my idea of a viable - or useful - joint project.

alexfru · Post by **alexfru** » Mon Nov 27, 2017 2:36 pm

~ wrote: The resulting compiler needs to be able to understand old and new versions of the language, needs to be able to understand idioms from all versions of Visual C++, Turbo C++, Borland C++, Open Watcom, and GCC mainly.

Practically impossible. Too much work. The latest C++ standard includes over 1000 pages of poorly readable text. And then it refers to the C standard for some things, AFAIK, which is another 600 or so. Do you know and understand at least 95% of all of that text? Or are you hoping to quickly learn it in the process?

~ wrote: The goal is that we can easily port software with a compiler that is capable of compiling recent software under old platforms like DOS and Windows 98.

Recent software uses things that either never existed in DOS and Windows or have been introduced in Windows recently as well. You know, all those libraries and system APIs. So, you're planning to port or emulate the missing parts? Btw, how are you going to run Windows 98 and why? It wouldn't run on modern PCs AFAIK. And it's an insecure and unreliable OS that nobody should use today.

~ wrote: ... and in exchange have a modern and capable compiler.

We have modern compilers already, e.g.: gcc, clang. There's even a DOS port of gcc, called DJGPP, which already brings bits of POSIX to DOS (not all bits, though).

~ wrote: I'm starting this project. So far the code is only able to open a file and record the start/end offsets of each line, and also to count the number of lines present in the given source file.

Standing ovation. You have completed... 0.01% of the project.

~ wrote: Currently what I need to do next is to parse the source code to detect comments, strings and end-of-instruction/end-of-parameter characters like ; and , ... This is so that extremely long lines can be supported by loading only the span between the point where an instruction starts and the point where it ends (line, column).

Instruction? Are you writing an assembler as well?

~ wrote: After that I will start by implementing the tasks that the preprocessor should perform.

Are you talking about a C/C++ preprocessor? A C preprocessor is a tricky thing. It's poorly documented and generally poorly understood. Implementing it correctly is not trivial. And a correct implementation is about as big as the rest of the C compiler (without any major optimization functionality).

~ wrote: If somebody wants to help me, just say so.

I don't think you'll have many (if any) people desiring to join your overly ambitious project of nearly zero value.
But we can help. And we already started helping. Don't do it.

~ wrote: The code is freely available so we will all learn.

I've already learned some things while writing mine. If learning compiler construction is your real goal, scale it down. You won't be able to do a tenth of what you told us above. And for the remaining nine tenths, if you don't lose your interest when you're done with the first tenth, you should drop your compiler and contribute to something that is more practical and valuable, e.g. gcc, clang, etc.

~ wrote: You can see what I've done and the explanations of what each part does and how to add them to a compiler.

Don't care. There's nothing to see yet.

~ wrote: You can help me by telling me how to implement things and maybe implementing functions so I can review them to fully adapt them to the compiler.

So, you're saying we should just dictate your code to you? And what would be your learning part?

Solar · Post by **Solar** » Mon Nov 27, 2017 3:23 pm

alexfru wrote:The latest C++ standard includes over 1000 pages of poorly readable text. And then it refers to the C standard for some things, AFAIK, which is another 600 or so.

To be completely fair, a lot of all that is actually referring to the respective standard libraries. And, at least the C language standard is rather well-written, as far as tech docs go. (I've rummaged through that over and over and over while working on PDCLib.) Didn't read through the C++ standard yet.

None of this in any way invalidates the remaining criticism.

Schol-R-LEA · Post by **Schol-R-LEA** » Mon Nov 27, 2017 7:45 pm

Quick question for Tilde: can you post the definition for your Token struct type?

If you don't have one yet (either your own, or one defined by a lexer generator if you choose to use one), or you don't know what that is, then you aren't writing a C compiler at all.[1]

Because frankly, that is the very first step in writing a compiler.[2]

Mind you, it isn't the first step in designing a compiler, because compilers for insanely complex languages like C aren't something you can just pound a keyboard for. Even Ron Cain took a long time to think out how to write Small C, and Small C is about the simplest thing one could write and still call a C compiler. Sure, he used ad-hoc matching for the lexer, and a recursive-descent, direct-emit compiler that was almost shocking in its simplicity, and became the go-to example of a simple, hand-coded compiler for a generation (it was still referenced as an example in a compiler course I took in 2008, though only indirectly), but you have to recall that a) this was seven years before the first formal standard for C, and the K&R C of 1980 was a vastly simpler language than the C of today, or of even twenty years ago; b) it had terrible performance and produced ludicrously poor code; and c) it deliberately avoided implementing any of the really hairy parts of C (it didn't even have floating-point numbers) - he added exactly enough features to make writing a self-hosting C compiler bearable, and nothing more.

And you'll note that it may have had a lexer consisting primarily of brute-force string matching, but it still had a lexer.
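For readers unsure what such a Token type looks like, here is a minimal sketch in C. It is not taken from Small C or from ~'s code; `Token`, `Lexer`, `lexer_init` and `next_token` are illustrative names, and the lexer only knows identifiers, decimal integers and single-character punctuation:

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT } TokenKind;

/* One lexical unit: what it is, where its text lives, where it was found. */
typedef struct {
    TokenKind kind;
    const char *start;   /* points into the source buffer, not NUL-terminated */
    size_t length;
    unsigned line;       /* 1-based */
    unsigned column;     /* 1-based */
} Token;

typedef struct {
    const char *cur;
    unsigned line;
    unsigned column;
} Lexer;

void lexer_init(Lexer *lx, const char *src)
{
    lx->cur = src;
    lx->line = 1;
    lx->column = 1;
}

/* Skips whitespace, then returns the next token from the buffer. */
Token next_token(Lexer *lx)
{
    Token t;

    while (isspace((unsigned char)*lx->cur)) {
        if (*lx->cur == '\n') { lx->line++; lx->column = 1; }
        else lx->column++;
        lx->cur++;
    }
    t.start = lx->cur;
    t.line = lx->line;
    t.column = lx->column;

    if (*lx->cur == '\0') {
        t.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*lx->cur) || *lx->cur == '_') {
        while (isalnum((unsigned char)*lx->cur) || *lx->cur == '_') lx->cur++;
        t.kind = TOK_IDENT;
    } else if (isdigit((unsigned char)*lx->cur)) {
        while (isdigit((unsigned char)*lx->cur)) lx->cur++;
        t.kind = TOK_NUMBER;
    } else {
        lx->cur++;               /* any other character is one punctuator */
        t.kind = TOK_PUNCT;
    }
    t.length = (size_t)(lx->cur - t.start);
    lx->column += (unsigned)t.length;
    return t;
}
```

A real C lexer would also need keywords, multi-character operators, literals with suffixes and escapes, and preprocessing tokens, but the shape of the data structure stays the same: the parser consumes a stream of these records rather than raw characters.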

As I explained in my recent post in my own language design thread, a meta-circular translator for some homoiconic languages like Lisp can mostly get away with not explicitly defining a Token type or class because the data primitives plus their type tags basically amount to a set of implicit token types. Even then, a professional-grade compiler or interpreter will still use a proper lexer WRT the data structures describing the Deterministic Finite-state Automata used to recognize the tokens, because ad-hoc methods for tokenizing almost invariably have terrible performance.

C and C++ aren't homoiconic - C source code is not itself a literal for a C data structure. They are also languages which are frightfully complicated to tokenize and parse - especially C++ - because their grammars aren't truly context-free, meaning that even if they were completely regular and consistent (which they aren't), they would still need some ad-hoc code in addition to a standard parser, because there is no really workable approach to unambiguously parsing mildly context-sensitive languages like C.

Recognizing an arbitrary string as being in a context-sensitive grammar in general is a provably PSPACE-complete problem - and while there are ways to reduce this for specific languages such as C, where only a small part of the grammar is context-sensitive, those solutions are all a bit hacky.
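The usual illustration of that context sensitivity is the statement `T * x;`, whose meaning depends on whether `T` currently names a type. The two functions below are a self-contained sketch of this (the function names are made up for the example); the token sequence is identical in both, yet one is a declaration and the other an expression:

```c
/* Case 1: T names a type, so "T * x;" declares x as a pointer to int. */
int as_declaration(void)
{
    typedef int T;
    int value = 7;
    T * x;              /* a declaration */

    x = &value;
    return *x;
}

/* Case 2: T and x name variables, so the very same tokens form an
   expression statement that multiplies them and discards the result. */
int as_expression(void)
{
    int T = 6, x = 7;
    int product;

    T * x;              /* an expression statement (compilers may warn) */
    product = T * x;
    return product;
}
```

A parser cannot choose between the two readings from the grammar alone; it has to consult the symbol table built from earlier declarations, which is the kind of ad-hoc feedback between parser and lexer mentioned above.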

Trying to handle the complexities of the modern C grammar in an entirely ad-hoc fashion? That's a good way to get a trip to a mental hospital. It can't be done! I've seen guys eat their keyboards trying! (NSFW)

Basically, there is every indication that you are experiencing the Dunning-Kruger Effect - you don't have enough experience to realize how little you understand about what you are trying to do. Contrary to the legend of Percival, this rarely works out well for anyone. You are going to have to wipe the lemon juice off of your face before you have any real chance of success.

[1] This isn't to say you can't write one, or will never be able to write one - though those could be the case, I dunno - just that you aren't writing one now, and probably aren't ready to start.

[2] Technically, the first step in coding a compiler is often writing a set of internal input and buffer-management functions and data structures, but that's not really specific to compiler development. And, to reiterate, you really need to have a solid idea of your compiler design before you write any code.

alexfru · Post by **alexfru** » Mon Nov 27, 2017 8:13 pm

Solar wrote:
alexfru wrote:The latest C++ standard includes over 1000 pages of poorly readable text. And then it refers to the C standard for some things, AFAIK, which is another 600 or so.
To be completely fair, a lot of all that is actually referring to the respective standard libraries. And, at least the C language standard is rather well-written, as far as tech docs go. (I've rummaged through that over and over and over while working on PDCLib.) Didn't read through the C++ standard yet.

The C standard is better than the C++ one, IMO. As for the library, P. J. Plauger's book might be helpful in some areas.

Schol-R-LEA · Post by **Schol-R-LEA** » Mon Nov 27, 2017 8:35 pm

And now, a word from the Shameless Self-Promotion Dept.: if you need a guide in understanding this better - and you probably do - feel free to take a look at the simple Algol-60 subset compiler I wrote for the course I mentioned above, and maybe compare it to some of the different Small C implementations, as well as any of the many other student compilers of this type that you can find floating around the web. Like with the various tutorials around on the subject (such as the infamous "Let's Write a Compiler" by Jack Crenshaw), none of them (especially mine, which not only betrays my own ludicrously poor understanding of DFAs for lexical analysis at the time I wrote it, but is also wretchedly incomplete - I keep meaning to get back to it, but...) are really adequate explanations on their own, but they would at least give you a baseline to start at - sort of like the various known-broken OS dev tutorials such as Bran's or Brokenthorn.

Also, this post of mine discusses several of the books and tutorials on the topic that are around (ignore the part about the CompilerDev site, that unfortunately died after an ignominiously short time). As I said in the thread, most college and university libraries will have at least a few of these, usually the Dragon book already mentioned several times.

Some of them can also be gotten from Amazon (who have a mountain of books on the topic to pick from) as used copies or Kindle eBooks for relatively cheap, though others are very expensive indeed - in particular, the prices for some editions of A Small C Compiler by James Hendrix (based on a derivative of Ron Cain's compiler) are absolutely insane, though that's more due to rarity and its collectible status - so you'll have to decide what's worth paying for.

My recommendation, if feasible, is to get Ronald Mak's Writing Compilers and Interpreters: A Software Engineering Approach for the groundwork, then go on to Modern Compiler Design, which I personally find better written than Aho, Ullman, et al., and is also available as an eBook from the Springer-Verlag website for $70 (or $30 per chapter if you don't want the whole book - but trust me, you want the whole book, and you'd definitely want more than two chapters in any case).

There, that fulfills my windmill-tilting quota for the day.

~ · Post by ~ » Wed Nov 29, 2017 3:28 am

The first thing I will process will be #include directives.

_____________
_____________
_____________
_____________
_____________
For #includes, I need:

- Offset of previous structure element on disk (to return to the previous source file and delete unused entries for already-processed files).

- Length of full path string of the source file (maybe relative, I will test).

- Source file path string (ASCIIZ).

- Last source file pointer position, but mainly the last processed line/character position processed (should be after the last/current #include directive).

_____________
_____________
_____________
_____________
_____________
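A minimal sketch of how the fields listed above could be laid out and stored in a file, assuming a simple append-only layout. `IncludeRecord`, `include_push`, `include_read` and the 260-byte path limit are all hypothetical; the fields are written in the host's native byte order, so the file is only meant to be read back by the same build of the program:

```c
#include <stdio.h>
#include <string.h>

#define INC_MAX_PATH 260   /* assumed limit, chosen only for this sketch */

typedef struct {
    long prev_offset;          /* offset of the previous record, -1 if none */
    unsigned long path_len;    /* path length, not counting the NUL */
    char path[INC_MAX_PATH];   /* ASCIIZ path of the source file */
    long resume_line;          /* position just after the #include directive */
    long resume_column;
} IncludeRecord;

/* Appends rec at the end of fp. Returns the record's offset, or -1 on error. */
long include_push(FILE *fp, const IncludeRecord *rec)
{
    long off;
    size_t n = (size_t)rec->path_len + 1;   /* path bytes plus the NUL */

    if (rec->path_len >= INC_MAX_PATH) return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0) return -1L;
    off = ftell(fp);
    if (off < 0L) return -1L;
    if (fwrite(&rec->prev_offset, sizeof rec->prev_offset, 1, fp) != 1 ||
        fwrite(&rec->path_len, sizeof rec->path_len, 1, fp) != 1 ||
        fwrite(rec->path, 1, n, fp) != n ||
        fwrite(&rec->resume_line, sizeof rec->resume_line, 1, fp) != 1 ||
        fwrite(&rec->resume_column, sizeof rec->resume_column, 1, fp) != 1)
        return -1L;
    return off;
}

/* Reads the record stored at 'offset'. Returns 1 on success, 0 on error. */
int include_read(FILE *fp, long offset, IncludeRecord *rec)
{
    size_t n;

    if (fseek(fp, offset, SEEK_SET) != 0) return 0;
    if (fread(&rec->prev_offset, sizeof rec->prev_offset, 1, fp) != 1) return 0;
    if (fread(&rec->path_len, sizeof rec->path_len, 1, fp) != 1) return 0;
    if (rec->path_len >= INC_MAX_PATH) return 0;
    n = (size_t)rec->path_len + 1;
    if (fread(rec->path, 1, n, fp) != n) return 0;
    rec->path[rec->path_len] = '\0';
    if (fread(&rec->resume_line, sizeof rec->resume_line, 1, fp) != 1) return 0;
    if (fread(&rec->resume_column, sizeof rec->resume_column, 1, fp) != 1) return 0;
    return 1;
}
```

Returning to the including file then means reading the record at `prev_offset`. Deleting finished entries is left out, because standard C has no portable way to truncate a file.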

Start of lines and end of lines are also code elements.

Before trying to process the next code element I need to be able to skip blank spaces and comments to get to the actual code no matter how padding is arranged.

The source code itself already has a formal structure, so the compiler should be able to process the code back and forth in the same way as a CPU emulator following the sequence of opcodes. C code is more dense than the basic CPU instructions, but it can still be treated by opcode-styled functions for code generation.

Each C/C++ language element is a complete program in itself. Each element needs to be sequentially recorded individually with the line number, and start/end character/column number. Its type also needs to be recorded to make sure that the compiler is recognizing each element properly. Later the sequence of keywords will decide if it's a variable, function body, function declaration, preprocessor directive, etc., in the main compiler loop using a tree of IFs in the same way as a CPU emulator.

At the first level, the array of structures is always stored in files, not in memory, although the wrappers could be rewritten to use memory or other media.

The main parsing loop is designed to parse the code by precedence: the things that need to be resolved first are handled first. I will start by processing very simple programs, as they are already complex enough, supporting very few keywords at the beginning.

The compiler is fully expression-oriented. Everything is an extension to arithmetic or bitwise expressions. Comments, declarations, blank spaces, etc., are detected in the main loop and treated by their respective programs which will process them fully.

It's like a sequential CPU emulator, where the context is followed by first inspecting the first byte for an instruction, entering a global IF or subfunction call, and testing all known cases.

Wajideus · Post by **Wajideus** » Wed Nov 29, 2017 4:17 am

Everything you've said makes absolutely no sense at all. To begin with, there are a bunch of open-source C compilers out there. There's literally no reason to write one at all aside from personal ego.

But if that's really the route you wanna go, first, you need to understand that the C preprocessor and the C compiler are 2 completely different things. The C preprocessor is a macro processor. It takes a sequence of files, does a bunch of string manipulation, and outputs one really long concatenated file to its output. This typically works by inserting `#line` directives like:

Code:

#line 151 "file.c"

Into the output file, and using a stack to keep track of which file you're in. An "#include" directive would push a new file name onto the stack and reaching the end of that file would pop it off of the stack.
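A minimal sketch of that file stack in C, under the assumption that each frame tracks the line currently being processed. `FileStack`, `enter_file` and `leave_file` are illustrative names, and the fixed nesting depth is arbitrary:

```c
#include <stdio.h>

#define MAX_DEPTH 32   /* arbitrary nesting limit for this sketch */

typedef struct {
    const char *name;
    unsigned line;     /* line currently being processed in this file */
} FileFrame;

typedef struct {
    FileFrame frames[MAX_DEPTH];
    int depth;         /* number of frames in use; set to 0 before use */
} FileStack;

/* Called for the main file and for every #include: pushes the file and
   marks the start of its text in the output. Returns 0 if nested too deep. */
int enter_file(FileStack *st, const char *name, FILE *out)
{
    if (st->depth >= MAX_DEPTH) return 0;
    st->frames[st->depth].name = name;
    st->frames[st->depth].line = 1;
    st->depth++;
    fprintf(out, "#line 1 \"%s\"\n", name);
    return 1;
}

/* Called at end of file: pops the file and, if an including file remains,
   emits a marker for the line after its #include directive. */
int leave_file(FileStack *st, FILE *out)
{
    if (st->depth == 0) return 0;
    st->depth--;
    if (st->depth > 0) {
        FileFrame *top = &st->frames[st->depth - 1];
        top->line++;   /* the #include line itself has been consumed */
        fprintf(out, "#line %u \"%s\"\n", top->line, top->name);
    }
    return 1;
}
```

If `file.c` reaches an `#include` on its line 150, entering and then leaving the header produces `#line 1` for the header followed by `#line 151 "file.c"`, which lets later stages report errors against the original file and line.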

Aside from that, the preprocessor basically is just a glorified dictionary. A #define or #undef directive assigns a value to a symbol or removes it from the dictionary. The value is a tuple consisting of a lazily evaluated subdictionary and a string of text with recursively substituted symbols in it. While the preprocessor is a very simple thing conceptually, alexfru is right when he says that it's poorly documented, poorly understood, and tricky to implement. It certainly doesn't help that POSIX (in all their retardation) decided to eliminate whatever distinction there was between the preprocessor and the language compiler.
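The dictionary part can be sketched in a few lines of C. This covers only object-like macros stored as plain text: there are no parameters, no rescanning and no recursive substitution, which is where the real difficulty lies. `MacroTable` and the `macro_*` functions are made-up names, and the fixed sizes are arbitrary:

```c
#include <stddef.h>
#include <string.h>

#define MAX_MACROS 64
#define MAX_NAME   32
#define MAX_BODY   128

typedef struct {
    char name[MAX_NAME];
    char body[MAX_BODY];
    int in_use;
} Macro;

typedef struct {
    Macro slots[MAX_MACROS];
} MacroTable;

void macro_table_init(MacroTable *t)
{
    memset(t, 0, sizeof *t);
}

/* Linear search; fine for a sketch, a real table would use hashing. */
static Macro *macro_find(MacroTable *t, const char *name)
{
    size_t i;
    for (i = 0; i < MAX_MACROS; i++)
        if (t->slots[i].in_use && strcmp(t->slots[i].name, name) == 0)
            return &t->slots[i];
    return NULL;
}

/* Handles "#define name body". Returns 0 if the text does not fit or the
   table is full. Note: this silently replaces an existing definition,
   whereas standard C only allows redefinition with an identical body. */
int macro_define(MacroTable *t, const char *name, const char *body)
{
    Macro *m = macro_find(t, name);
    size_t i;

    if (strlen(name) >= MAX_NAME || strlen(body) >= MAX_BODY) return 0;
    for (i = 0; m == NULL && i < MAX_MACROS; i++)
        if (!t->slots[i].in_use) m = &t->slots[i];
    if (m == NULL) return 0;
    strcpy(m->name, name);
    strcpy(m->body, body);
    m->in_use = 1;
    return 1;
}

/* Handles "#undef name"; undefining an unknown name is not an error. */
void macro_undef(MacroTable *t, const char *name)
{
    Macro *m = macro_find(t, name);
    if (m != NULL) m->in_use = 0;
}

/* Returns the replacement text, or NULL if name is not defined. */
const char *macro_lookup(MacroTable *t, const char *name)
{
    Macro *m = macro_find(t, name);
    return m != NULL ? m->body : NULL;
}
```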

iansjack · Post by **iansjack** » Wed Nov 29, 2017 5:15 am

I do hope that you are not going to post details of every minute step towards your goal here. It's entry level computer science stuff and this is a forum primarily dedicated to OS design and implementation rather than language compilers. Blog about it on your website by all means, but this site is not your personal blog.

~ · Post by ~ » Wed Nov 29, 2017 5:31 am

iansjack wrote:I do hope that you are not going to post details of every minute step towards your goal here. It's entry level computer science stuff and this is a forum primarily dedicated to OS design and implementation rather than language compilers. Blog about it on your website by all means, but this site is not your personal blog.

The minute details are in the URLs pointed in the original message. I only answered Schol-R-LEA about which data structures I was going to use to parse the code.

Solar · Post by **Solar** » Wed Nov 29, 2017 5:58 am

~ wrote:I only answered Schol-R-LEA about which data structures I was going to use to parse the code.

Only, you didn't.

Schol-R-LEA explicitly asked you about the token data structures.

And your answer shows that you did not even understand the question, and still picture your compiler as something of a beefed-up stream processor.

Quoting, emphasis mine:

~ wrote:Each C/C++ language element is a complete program in itself. Each element needs to be sequentially recorded [...] At the first level, the array of structures is always stored in files [...] Comments, declarations, blank spaces, etc., are detected in the main loop and treated by their respective programs which will process them fully.

It's like a sequential CPU emulator...

Not only does your mental picture eschew the distinction between the preprocessor step and the compiler step, you also show a very casual disregard for separating lexing (ignoring whitespace and comments, generating tokens) from parsing (turning a sequence of tokens into an abstract syntax tree) from code generation, and that isn't even talking about optimization at all.
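To make that separation concrete, here is a toy pipeline for integer expressions with `+`, `-` and `*`. It is a sketch with made-up names (`lex`, `parse_expr`, `eval`), and evaluation stands in for code generation. Each phase only sees the output of the previous one: the parser never touches characters, and the back end never touches tokens:

```c
#include <ctype.h>
#include <stdlib.h>

/* ---- Phase 1: lexing (characters -> tokens) ---- */
typedef enum { T_NUM, T_OP, T_END } TokKind;
typedef struct { TokKind kind; long value; char op; } Tok;

/* Fills out[] (capacity max >= 1) and terminates it with T_END.
   Returns the number of tokens before T_END. */
size_t lex(const char *src, Tok *out, size_t max)
{
    size_t n = 0;
    while (*src != '\0' && n + 1 < max) {
        if (isspace((unsigned char)*src)) { src++; continue; }
        out[n].value = 0;
        out[n].op = 0;
        if (isdigit((unsigned char)*src)) {
            char *end;
            out[n].kind = T_NUM;
            out[n].value = strtol(src, &end, 10);
            src = end;
        } else {
            out[n].kind = T_OP;
            out[n].op = *src++;
        }
        n++;
    }
    out[n].kind = T_END; out[n].value = 0; out[n].op = 0;
    return n;
}

/* ---- Phase 2: parsing (tokens -> abstract syntax tree) ---- */
typedef struct Node { char op; long value; struct Node *lhs, *rhs; } Node;
typedef struct { const Tok *tok; Node pool[64]; size_t used; } Parser;

static Node *new_node(Parser *p, char op, long value, Node *l, Node *r)
{
    Node *n;
    if (p->used >= 64) return NULL;
    n = &p->pool[p->used++];
    n->op = op; n->value = value; n->lhs = l; n->rhs = r;
    return n;
}

static Node *parse_primary(Parser *p)          /* primary := NUMBER */
{
    if (p->tok->kind != T_NUM) return NULL;
    return new_node(p, 0, (p->tok++)->value, NULL, NULL);
}

static Node *parse_term(Parser *p)             /* term := primary { '*' primary } */
{
    Node *lhs = parse_primary(p);
    while (lhs && p->tok->kind == T_OP && p->tok->op == '*') {
        Node *rhs;
        p->tok++;
        rhs = parse_primary(p);
        lhs = rhs ? new_node(p, '*', 0, lhs, rhs) : NULL;
    }
    return lhs;
}

Node *parse_expr(Parser *p)                    /* expr := term { ('+'|'-') term } */
{
    Node *lhs = parse_term(p);
    while (lhs && p->tok->kind == T_OP && (p->tok->op == '+' || p->tok->op == '-')) {
        char op = (p->tok++)->op;
        Node *rhs = parse_term(p);
        lhs = rhs ? new_node(p, op, 0, lhs, rhs) : NULL;
    }
    return lhs;
}

/* ---- Phase 3: back end (here: walk the tree and compute the value) ---- */
long eval(const Node *n)
{
    switch (n->op) {
    case '+': return eval(n->lhs) + eval(n->rhs);
    case '-': return eval(n->lhs) - eval(n->rhs);
    case '*': return eval(n->lhs) * eval(n->rhs);
    default:  return n->value;                 /* leaf: a number */
    }
}
```

Operator precedence falls out of the grammar (`parse_term` binds tighter than `parse_expr`) instead of a tree of IFs over raw text, and a code generator could replace `eval` without touching the first two phases.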

Linking is also suspiciously absent from your deliberations.

I'd be very surprised if you had any idea of how C++ templates work, let alone lambdas, and what will be required of your compiler to properly support them.

(Hint: C++ templates are, in and of themselves, a Turing-complete language...)
