OSDev.org

Posted: **Thu Mar 22, 2007 4:58 am**

Hello,

I have lately been busy making my assembly code more efficient, and I have been reading the Intel Optimization reference to improve some of it.

But does anyone know of any other good places where I can find additional info on what to do and what not to do?

For example:

  xor ebx,ebx
  mov bl,al
  shl ebx,8

should be (this is about 8 cycles faster :S):

  movzx ebx,al
  shl ebx,8

Also, what about LODSx and STOSx: should I use those? I can't find much about them in the Intel Optimization reference.

Any other info you might know on this subject is welcome.
I have the following target system: Core 2 Duo.

Regards
Wilco van Maanen

Posted: **Thu Mar 22, 2007 7:16 am**

lods and stos are fairly fast string instructions; with rep they are surely faster than a mov loop construct.

Posted: **Thu Mar 22, 2007 7:40 am**

String instructions are only faster than a simple MOV loop when they are used in their doubleword (STOSD/MOVSD/...) form. I remember Intel articles actually mentioning that.
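To illustrate why the doubleword form can win, here is a minimal C sketch that models a byte-wise fill (as REP STOSB would do it) against a doubleword fill with a byte tail (REP STOSD, then REP STOSB). The function names `fill_stosb` and `fill_stosd` are made up for this example. It only counts iterations; it says nothing about real cycle counts, which depend on the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of REP STOSB: one byte stored per iteration.
   Returns the number of iterations performed. */
size_t fill_stosb(uint8_t *dst, uint8_t al, size_t count) {
    size_t iters = 0;
    while (count--) {
        *dst++ = al;
        iters++;
    }
    return iters;
}

/* Hypothetical model of REP STOSD for count/4 doublewords, followed by
   REP STOSB for the remaining 0..3 bytes. Returns total iterations. */
size_t fill_stosd(uint8_t *dst, uint8_t al, size_t count) {
    uint32_t eax = 0x01010101u * al;   /* replicate AL into all four bytes */
    size_t iters = 0;
    for (size_t n = count / 4; n > 0; n--) {
        memcpy(dst, &eax, 4);          /* one 4-byte store, alignment-safe in C */
        dst += 4;
        iters++;
    }
    iters += fill_stosb(dst, al, count % 4);
    return iters;
}
```

For an 11-byte buffer the byte version takes 11 iterations, while the doubleword version takes 2 doubleword stores plus 3 tail bytes. Both leave the buffer with identical contents.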

As for optimization articles, I think every programmer should read Agner Fog's optimization manuals. They are available as 5 volumes:

1) Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.
2) Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3) The microarchitecture of Intel and AMD CPU’s: An optimization guide for assembly programmers and compiler makers.
4) Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPU's.
5) Calling conventions for different C++ compilers and operating systems.

As for the two code snippets you posted: both of them, not surprisingly, run in 2 clock cycles on my PIII 800 MHz with 512 MB of SDRAM, with the code segment aligned on a DWORD boundary.

The code below also runs in 2 clock cycles on the same machine, again with the code segment aligned on a DWORD boundary:

  XOR     EBX , EBX
  MOV     BH , AL

That does the same thing as your two snippets. The code below also runs in 2 clock cycles:

  MOV     EBX , EAX
  SHL     EBX , 08
  AND     EBX , 0x0000FF00

Now watch out for this one, which runs in 9 clock cycles on the same processor under the same code segment conditions:

  MOV     BH , AL
  AND     EBX , 0x0000FF00

That's because of a partial register stall (the AND reads the full EBX right after only BH was written), which is also described in Agner Fog's manuals.
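All the sequences in this thread compute the same value and differ only in timing. As a sanity check, here is a minimal C sketch that models each one on plain 32-bit integers. The helper and function names are invented for this example, and the model captures only the register contents, not stalls or cycle counts.

```c
#include <assert.h>   /* for callers that assert on the results */
#include <stdint.h>

/* MOV BL, src: replace bits 0..7 of EBX, keep the rest. */
static uint32_t mov_bl(uint32_t ebx, uint8_t src) {
    return (ebx & 0xFFFFFF00u) | src;
}

/* MOV BH, src: replace bits 8..15 of EBX, keep the rest. */
static uint32_t mov_bh(uint32_t ebx, uint8_t src) {
    return (ebx & 0xFFFF00FFu) | ((uint32_t)src << 8);
}

/* xor ebx,ebx / mov bl,al / shl ebx,8 */
uint32_t seq_xor_movbl_shl(uint32_t eax, uint32_t ebx) {
    ebx ^= ebx;
    ebx = mov_bl(ebx, (uint8_t)eax);
    return ebx << 8;
}

/* movzx ebx,al / shl ebx,8 */
uint32_t seq_movzx_shl(uint32_t eax, uint32_t ebx) {
    ebx = (uint8_t)eax;               /* zero-extend AL, old EBX is discarded */
    return ebx << 8;
}

/* xor ebx,ebx / mov bh,al */
uint32_t seq_xor_movbh(uint32_t eax, uint32_t ebx) {
    ebx ^= ebx;
    return mov_bh(ebx, (uint8_t)eax);
}

/* mov ebx,eax / shl ebx,8 / and ebx,0x0000FF00 */
uint32_t seq_mov_shl_and(uint32_t eax, uint32_t ebx) {
    ebx = eax;
    ebx <<= 8;
    return ebx & 0x0000FF00u;
}

/* mov bh,al / and ebx,0x0000FF00 (the sequence with the partial register stall) */
uint32_t seq_movbh_and(uint32_t eax, uint32_t ebx) {
    ebx = mov_bh(ebx, (uint8_t)eax);
    return ebx & 0x0000FF00u;
}

/* Returns 1 if every sequence yields AL << 8 for all 256 values of AL,
   given arbitrary upper bits in EAX and an arbitrary initial EBX. */
int all_sequences_agree(uint32_t eax_high, uint32_t ebx_in) {
    for (uint32_t al = 0; al < 256; al++) {
        uint32_t eax = (eax_high & 0xFFFFFF00u) | al;
        uint32_t want = al << 8;
        if (seq_xor_movbl_shl(eax, ebx_in) != want) return 0;
        if (seq_movzx_shl(eax, ebx_in) != want) return 0;
        if (seq_xor_movbh(eax, ebx_in) != want) return 0;
        if (seq_mov_shl_and(eax, ebx_in) != want) return 0;
        if (seq_movbh_and(eax, ebx_in) != want) return 0;
    }
    return 1;
}
```

Calling `all_sequences_agree(0xDEADBE00u, 0xFFFFFFFFu)` exercises every AL value with junk in the other bits of EAX and EBX, which is where a wrong mask or a missing zeroing step would show up.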

However, if you want to know how many clock cycles a piece of code takes, you should measure it with a profiler on the CPU architectures you care about. You should also know the latencies and throughputs of important instructions, such as the various forms of MOV.
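If no profiler is at hand, a crude timing loop is a starting point. The following is only a sketch in portable C: `time_kernel` and the two kernels are hypothetical names, and `clock()` measures CPU time rather than cycles. In C the compiler picks the instructions, so for a real instruction-level comparison the kernels would have to be assembly routines, timed on the actual target CPU.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*kernel_fn)(uint32_t);

/* Two hypothetical kernels that both compute (AL << 8). */
uint32_t kernel_shift_mask(uint32_t eax) {
    return (eax << 8) & 0x0000FF00u;
}

uint32_t kernel_zero_extend(uint32_t eax) {
    return (uint32_t)(uint8_t)eax << 8;
}

/* Runs fn for the given number of iterations and returns the elapsed
   CPU time in seconds. The volatile accumulator keeps the compiler from
   removing the loop; its final value is returned through checksum. */
double time_kernel(kernel_fn fn, uint32_t iterations, uint32_t *checksum) {
    volatile uint32_t sink = 0;
    clock_t start = clock();
    for (uint32_t i = 0; i < iterations; i++) {
        sink += fn(i);
    }
    clock_t end = clock();
    if (checksum) {
        *checksum = sink;
    }
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing the two checksums first confirms that the kernels compute the same thing; only then are the timings worth comparing, preferably over many runs.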

Posted: **Sat Mar 24, 2007 1:49 pm**

Thanks for the link, exactly what I was looking for.

x86 Efficient code
