I don't know... multi-level-programming?
Posted: Thu Apr 26, 2012 4:21 am
by Jezze
I was thinking of something today... one of the things that bothers me about different programming languages is that they tend to be either
1. High-level and easy to use but limited in flexibility
2. Low-level and complex but very flexible
3. Somewhere in-between 1 and 2
If we take C as an example, I would put it in category 3. C is so good because it strikes a great trade-off: low-level enough to do complex things, high-level enough to be easy to use. ASM would be more of a pure category 2. You can argue whether I'm correct in these assumptions, but that is not the point; let's just agree that C is higher-level than ASM.
A compiler for C like GCC typically compiles C into ASM which in turn is compiled into machine instructions. One of the issues I found is that there is no easy way for me to take the resulting ASM and make some changes to it before passing it on to the assembler.
I find this strange... one of the benefits of having intermediate steps is that you could hand-optimize certain things before passing them on to whatever comes next. I don't think it would be very difficult to automate either.
As a real-world example, let's say I have a function written in C that we call A. I could have a separate file called A.patch that contains the changes I want to apply to A after it has been compiled into ASM. When compiling, I would just supply this file to the compiler and it would pick it up and apply it to A every time. Note that this is not a typical patch but something more language-specific: instead of applying itself to specific lines, it would apply itself to a specific function (or label). This means you can change the C code as much as you like, as long as you don't change A, in which case you might need to modify the patch.
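A minimal sketch of what such a label-keyed patcher might look like, in C. This is purely illustrative; the `patch_function` name and its behaviour are my own invention, not an existing tool. It replaces everything between a function's label and the next label in an assembly listing:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Replace the body of one labelled function in an assembly listing.
 * "asm_src" is the full .s text, "label" the function's label (e.g. "A"),
 * "patch" the replacement body.  Returns a malloc'd string, or NULL if
 * the label was not found.  Toy code: ignores local labels, directives
 * and whitespace-indented label forms. */
char *patch_function(const char *asm_src, const char *label, const char *patch)
{
    char tag[128];
    snprintf(tag, sizeof tag, "%s:\n", label);
    const char *start = strstr(asm_src, tag);
    if (!start)
        return NULL;
    start += strlen(tag);                 /* first line of the body */

    /* the body ends at the next label (a line ending in ':') or at EOF */
    const char *end = start;
    while (*end) {
        const char *eol = strchr(end, '\n');
        if (!eol) { end += strlen(end); break; }
        if (eol > end && eol[-1] == ':')  /* next label line found */
            break;
        end = eol + 1;
    }

    size_t head = (size_t)(start - asm_src);
    char *out = malloc(head + strlen(patch) + strlen(end) + 1);
    if (!out)
        return NULL;
    memcpy(out, asm_src, head);           /* everything up to the label */
    strcpy(out + head, patch);            /* the replacement body */
    strcat(out, end);                     /* the rest of the file */
    return out;
}
```

Because the patch is keyed to the label rather than to line numbers, unrelated edits elsewhere in the C file don't invalidate it.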
If all compilers worked this way, turning their input into intermediate steps that could each be patched automatically, it would be possible to write something in a very high-level language without any trade-offs, because you could always optimize every step leading down to the machine instructions.
Re: I don't know... multi-level-programming?
Posted: Thu Apr 26, 2012 4:39 am
by Solar
Jezze wrote:A compiler for C like GCC typically compiles C into ASM which in turn is compiled into machine instructions.
A common misconception. Somewhere in the process of turning C into machine instructions, GCC has the option of creating ASM source. In a normal compile, no ASM is created, not even temporarily; data is handed over from the frontend to the backend in an internal binary format.
Jezze wrote:One of the issues I found is that there is no easy way for me to take the resulting ASM and make some changes to it before passing it on to the assembler.
Easy: gcc -S. That's exactly the mechanism I described above. Edit the output, then assemble it with gas.
Re: I don't know... multi-level-programming?
Posted: Thu Apr 26, 2012 5:30 am
by xenos
As Solar wrote, technically it would be rather easy to create an ASM file from some C source, modify it, and assemble it to binary output. You could even put these steps into a makefile. However, I wonder how an "automated patch" could work. What would the contents of a patch file look like? Something like "Don't use a stack frame in function A.", which is some magic command that tells your patching program to check whether gcc has generated a stack frame for A, and removes this stack frame? Or something like "Use the GS segment register in this function."? Or rather something like a diff file, with lines to be removed, added or replaced? With the latter approach it would be necessary to know in advance what gcc will output, because otherwise searching for lines to be removed or replaced would fail.
Consider the following example. Let's say the function A contains some loop which has to be executed a large number of times and thus should be optimized by hand. You compile your code with gcc and obtain some ASM file. You can now read this file, locate the loop, modify it and assemble the file. Doing this by hand is simple. But how would you automate it? How would you tell your patch program where the loop can be found? If you replace it with some code using register ECX, but your gcc output relies on ECX to be unchanged after the loop, how would you deal with that? Would the patch program check whether ECX may be trashed and save its contents if necessary? What happens if you use a different gcc version which produces a different ASM file - would your patch program still find the loop and patch it correctly?
Re: I don't know... multi-level-programming?
Posted: Thu Apr 26, 2012 5:38 am
by Solar
XenOS wrote:What would the contents of a patch file look like? Something like "Don't use a stack frame in function A.", which is some magic command that tells your patching program to check whether gcc has generated a stack frame for A, and removes this stack frame?
-fomit-frame-pointer.
/me ducks and runs...
Re: I don't know... multi-level-programming?
Posted: Thu Apr 26, 2012 6:13 am
by iansjack
"Automatically hand-optimized" sounds like a contradiction in terms to me. If it can be automated a good compiler will already do it. If it's done by hand, why not just write the code in assembler to start with? If you examine the code produced by the compiler it can be quite difficult to follow all the interactions to see which code can be optimized; easier just to do it from scratch.
But it can be fun looking at the code that a compiler produces.
Re: I don't know... multi-level-programming?
Posted: Thu Apr 26, 2012 6:29 am
by NickJohnson
I'm pretty sure the intent of inline assembly is to allow you to more precisely control the code that the compiler outputs, if you prefer to "hand-optimize" your C code like that.
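For instance, a tiny sketch using GCC/Clang extended inline asm (x86/x86-64 only; the function name is made up). You pick the exact instruction, and the compiler still handles register allocation around it:

```c
#include <assert.h>

/* GCC/Clang extended inline asm, x86/x86-64 only.  Purely illustrative:
 * the "addl" here is hand-written, but the compiler chooses the actual
 * registers via the "r" constraints. */
static inline int asm_add(int a, int b)
{
    int result = a;
    __asm__ ("addl %1, %0"   /* result += b, as a hand-picked instruction */
             : "+r"(result)  /* read-write operand, any general register */
             : "r"(b));      /* input operand */
    return result;
}
```

This way the hand-optimized fragment lives inside the C source and survives recompiles, instead of being patched in afterwards.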
Re: I don't know... multi-level-programming?
Posted: Thu Apr 26, 2012 7:52 am
by Solar
iansjack wrote:"Automatically hand-optimized" sounds like a contradiction in terms to me. If it can be automated a good compiler will already do it.
+1. Most of the time, a good hard look at the optimization options of your compiler will be more productive. If you think there's a recurring pattern that could be handled better, get in contact with the compiler maintainers and ask for a new optimization option. They'll either be happy that you found the issue, or be able to explain to you (in detail) why it isn't a good idea / not generally desired.
Re: I don't know... multi-level-programming?
Posted: Fri Apr 27, 2012 5:04 am
by Rudster816
The efficiency of the code generated by compilers is much less of a concern than it was just a couple of years ago when dealing with PCs. x86 CPUs are just so damn fast that in the vast majority of software they aren't even close to becoming the bottleneck. Things are a little different on mobile platforms like Android/iOS, but the need for CPU optimizations is still pretty minimal; the abstractions provided by high-level platforms (e.g. the Android SDK or Cocoa) are easily worth any performance implications. Down at the embedded level, everything is typically done in C, and performance-critical parts of the code (e.g. loops) can be done in assembly.
One thing that I think may be feasible, and possibly beneficial, would be an optimizing assembler. Compilers are getting a lot better at producing instructions that execute quickly even on out-of-order superscalars, but it's still quite common to see large dependency chains that could be broken up. On the other hand, it's quite common to see novice or even experienced assembly programmers create equally 'bad' code through not knowing the CPU microarchitecture. In either situation, an assembler that could move instructions around, breaking up large dependency chains and optimizing execution unit usage, could be beneficial; although it would be quite complicated to write something that could effectively make large changes to the code without changing its semantics.
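As a rough illustration of what such an assembler would have to reason about, here is a toy C sketch (all names hypothetical) that measures the longest read-after-write chain in a simplified instruction stream. A real optimizing assembler would use this kind of analysis to decide which chains are worth breaking up:

```c
#include <assert.h>
#include <string.h>

/* Toy model for a dependency-aware assembler pass: each instruction
 * writes at most one register and reads up to two.  raw_chain_length()
 * returns the length of the longest read-after-write chain, i.e. the
 * minimum number of serial steps even on a perfectly parallel machine.
 * Purely illustrative; real instructions also have flags, memory, etc. */
enum { NREGS = 16 };

struct insn {
    int dst;        /* register written, or -1 for none */
    int src[2];     /* registers read, -1 = unused slot */
};

static int raw_chain_length(const struct insn *code, int n)
{
    int depth_of_reg[NREGS];   /* chain depth of each register's last writer */
    int longest = 0;
    memset(depth_of_reg, 0, sizeof depth_of_reg);

    for (int i = 0; i < n; i++) {
        int depth = 0;
        for (int s = 0; s < 2; s++)        /* must wait for both sources */
            if (code[i].src[s] >= 0 && depth_of_reg[code[i].src[s]] > depth)
                depth = depth_of_reg[code[i].src[s]];
        depth += 1;                        /* this instruction itself */
        if (code[i].dst >= 0)
            depth_of_reg[code[i].dst] = depth;
        if (depth > longest)
            longest = depth;
    }
    return longest;
}
```

A fully serial chain of three instructions has length 3; three instructions that all read the same untouched register have length 1, and could issue in parallel.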
Re: I don't know... multi-level-programming?
Posted: Fri Apr 27, 2012 8:52 am
by JAAman
actually, they already have that... and it is specially written for each and every x86 CPU that exists... in fact, it is impossible to run code on modern (P6+) x86 CPUs without first using it -- in fact, it's built into each CPU, and can even reorder things differently each time the code is run depending on what else the CPU is doing at that particular moment, with more information than any assembler could ever hope to have about optimizing for that particular CPU and set of unique circumstances -- whereas if it was done at assembly time, it would be neither as flexible nor as capable
modern CPUs not only reorder the instructions for maximum parallel execution, but also replace poorer performing instructions with faster equivalents at run-time (for example, older CPUs silently replace any multiplication or division by a power of 2 with a shift operation, which executes faster -- of course newer CPUs execute multiplication faster than shift, so such optimizations done at the assembly level actually result in worse code, while leaving it as multiplications results in just-as-fast code on older CPUs, and faster code on newer CPUs)
Re: I don't know... multi-level-programming?
Posted: Fri Apr 27, 2012 10:57 am
by Brendan
Hi,
Rudster816 wrote:One thing that I think may be feasible, and possibly beneficial, would be an optimizing assembler.
I've been thinking of writing an optimizing assembler for a long time. As I see it there are three advantages: the code optimisation itself, improved code maintainability, and (potentially) simpler compilers.
For a simple example, the assembler could turn this:
Code: Select all
%define VERSION 0
add eax,edx
or eax,7
mov [foo],eax
mov edi,VERSION
lea esi,[edi + 123]
add esi,111
mov ebx,[123]
lea ecx,[myStructure.myFirstField + ebx]
Into this:
Code: Select all
mov ebx,[123]
add eax,edx
xor edi,edi
or eax,7
mov esi,0+123+111
mov ecx,ebx
mov [foo],eax
If a programmer has to do these optimisations manually the code becomes a lot harder to maintain (e.g. code breaking if "VERSION" is defined with a different value or if "myFirstField" is shifted somewhere else in that structure, instruction scheduling that makes it harder to follow the code, etc).
JAAman wrote:modern CPUs not only reorder the instructions for maximum parallel execution,
Most modern CPUs do reorder instructions but some (e.g. Intel's Atom) don't. Even for CPUs that have very good instruction reordering there's practical limits and bottlenecks, and if the assembler did "pre-reordering" it'd make it possible for the CPU to go even further.
JAAman wrote:but also replace poorer performing instructions with faster equivalents at run-time (for example, older CPUs silently replace any multiplication or division by a power of 2 with a shift operation, which executes faster -- of course newer CPUs execute multiplication faster than shift, so such optimizations done at the assembly level actually result in worse code, while leaving it as multiplications results in just-as-fast code on older CPUs, and faster code on newer CPUs)
Even for modern CPUs, the ability to replace poorer performing instructions with faster equivalents is almost non-existent. There are no CPUs where multiplication/division is faster than shift (some may be as fast as shift for powers of 2, but not faster); and "lea" or "add eax,eax" (where possible) is often faster than shift left (or multiplication).
The other problem you're overlooking is that the CPU doesn't have as much information to work with as the assembler does, and doesn't have as much time to optimise either. For example, a CPU can't assume that a write to one address won't affect a subsequent read from a different address, but an assembler could know that the write doesn't affect the read and reorder anyway. A CPU also can't do extensive optimisation (it'd be far too slow and hurt performance), while an assembler can.
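The aliasing point can be illustrated at the source level with C99 `restrict` (a hedged sketch; the function is hypothetical): the programmer promises the two pointers never overlap, so the compiler is free to load `src[i+1]` before the store to `dst[i]` completes, a reordering the CPU could never prove safe on its own at run time.

```c
#include <assert.h>
#include <stddef.h>

/* C99 "restrict": the caller guarantees dst and src do not alias, which
 * lets the compiler/assembler reorder and vectorize loads past stores.
 * Semantics are identical either way; only the generated code differs. */
static void add_one(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}
```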
Also note that the same applies to compilers; but nobody would suggest that a compiler shouldn't bother doing any optimisation because the CPU might be able to optimise a little bit.
Cheers,
Brendan
Re: I don't know... multi-level-programming?
Posted: Fri Apr 27, 2012 11:27 am
by Rudster816
JAAman wrote:actually, they already have that... and it is specially written for each and every x86 CPU that exists... in fact, it is impossible to run code on modern (P6+) x86 CPUs without first using it -- in fact, it's built into each CPU, and can even reorder things differently each time the code is run depending on what else the CPU is doing at that particular moment, with more information than any assembler could ever hope to have about optimizing for that particular CPU and set of unique circumstances -- whereas if it was done at assembly time, it would be neither as flexible nor as capable
modern CPUs not only reorder the instructions for maximum parallel execution, but also replace poorer performing instructions with faster equivalents at run-time (for example, older CPUs silently replace any multiplication or division by a power of 2 with a shift operation, which executes faster -- of course newer CPUs execute multiplication faster than shift, so such optimizations done at the assembly level actually result in worse code, while leaving it as multiplications results in just-as-fast code on older CPUs, and faster code on newer CPUs)
I think you seriously failed to comprehend what I meant by an optimizing assembler. I also think you grossly overestimate the power of OOO/superscalar pipelines.
Your statement about multiplication being faster than shifts is just flat-out wrong. Even ~2 watt ARM chips have multiple barrel shifters.
CPUs cannot reorder instructions at will: they have to be committed in program order, and they can't depend on results from other instructions that haven't been executed yet. The hardware is far less flexible in how it can optimize execution. There's an entire branch of CPU architectures (VLIW) dedicated to offloading from the CPU the (rather unnecessary) task of figuring out which instructions can be executed in parallel. The compiler knows the entire semantics of the program, knows how to make instructions parallel (i.e. not reading a register immediately after writing to it), and can do it far better, because it can change any instruction order or rearrange an entire loop. CPUs are limited to looking at a handful of instructions at a time, and have to spend large amounts of silicon to do it. CPUs are incredibly good at what they do, but they aren't magic black boxes that can fix broken code.
Example:
...
mov rax, [0x746680]
add rax, 0x123450
mov rbx, rax
and rbx, 0xFF
test rbx, rbx
jz somelabel
mov [0x746680], rax
Every single instruction in that sequence has a RAW (read-after-write) dependency on the previous instruction. There is absolutely nothing that can be done in hardware to recover any parallelism from this sequence. Say the CPU fetches four instructions, with the last one being the first in the sequence above. Once it has dispatched all of those, it will fetch the next four, "look at them", and not be able to do a damn thing besides decode them, place them in a reservation station, and wait for the first instruction to finish. If there is some work done above the sequence that is relatively unrelated, then you can intersperse that work with this sequence (which only uses two registers out of 16) and "break up" the dependency. The CPU has no such luxury, because it has probably already started or finished executing any sequence above it. Large serial blocks like this one can absolutely kill your performance in a tight loop, and there isn't anything the CPU can do to help. You, on the other hand (or the compiler, or a theoretical optimizing assembler), can completely rearrange the instructions in the program to break up such chains and get rid of them entirely.
Simply put, you don't do something in hardware that can be done better in software at a much lower overall cost. You also don't do something at execution time that you can do at compile time. I think the latter has been reiterated by thousands of programmers and engineers.