x86 Efficient code

Post by PyroMathic »

Hello,

I have lately been busy making my assembly code more efficient, and have been reading the Intel Optimization reference to improve some of it.

But does anyone know of any other good places where I can find additional info on what to do and what not to do?


For example:
xor ebx,ebx
mov bl,al
shl ebx,8


should be (this is about 8 cycles faster :S):
movzx ebx,al
shl ebx,8
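
As a sanity check that the two sequences really are interchangeable, here is a minimal C sketch that models only the data flow through EBX. It says nothing about timing. The function names and the use of a `uint32_t` as a stand-in for EBX are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Models: xor ebx,ebx / mov bl,al / shl ebx,8 */
uint32_t via_xor_mov(uint32_t ebx, uint8_t al)
{
    ebx ^= ebx;                       /* xor ebx,ebx : ebx becomes 0 */
    ebx = (ebx & 0xFFFFFF00u) | al;   /* mov bl,al   : only the low byte changes */
    ebx <<= 8;                        /* shl ebx,8 */
    return ebx;
}

/* Models: movzx ebx,al / shl ebx,8 */
uint32_t via_movzx(uint32_t ebx, uint8_t al)
{
    ebx = (uint32_t)al;               /* movzx ebx,al : zero-extend, old ebx is ignored */
    ebx <<= 8;                        /* shl ebx,8 */
    return ebx;
}

/* Compares the two models for every possible value of AL. */
void check_equivalent(uint32_t old_ebx)
{
    for (unsigned al = 0; al < 256; al++)
        assert(via_xor_mov(old_ebx, (uint8_t)al) == via_movzx(old_ebx, (uint8_t)al));
}
```

Calling `check_equivalent` with any starting value for EBX should pass, since both sequences discard the old contents of the register before shifting.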


Also, what about LODSx and STOSx: should I use those? I can't find much about them in the Intel Optimization reference.



Any other info you might know on this subject is welcome.
I have the following target system: Core 2 Duo.


Regards
Wilco van Maanen

Post by earlz »

lods and stos are fairly fast string instructions; with rep they are surely faster than a mov loop construct.

Post by XCHG »

String instructions are only faster than simple MOV loops when they are used in their doubleword (STOSD/MOVSD/...) form. I remember Intel articles actually mentioning that.
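
To make the size argument concrete, here is a rough C model of what REP MOVSB and REP MOVSD do with ECX, ESI and EDI when the direction flag is clear. It only counts iterations and says nothing about real timings, which depend on the CPU. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models REP MOVSB with DF=0: one byte per iteration, ECX = byte count.
   Returns the number of iterations performed. */
size_t rep_movsb_model(uint8_t *edi, const uint8_t *esi, size_t ecx)
{
    assert(edi != NULL && esi != NULL);  /* the model does not handle bad pointers */
    size_t iterations = 0;
    while (ecx != 0) {
        *edi++ = *esi++;                 /* move 1 byte, advance both pointers */
        ecx--;
        iterations++;
    }
    return iterations;
}

/* Models REP MOVSD with DF=0: four bytes per iteration, ECX = dword count.
   Returns the number of iterations performed. */
size_t rep_movsd_model(uint8_t *edi, const uint8_t *esi, size_t ecx)
{
    assert(edi != NULL && esi != NULL);
    size_t iterations = 0;
    while (ecx != 0) {
        memcpy(edi, esi, 4);             /* one 32-bit move; memcpy sidesteps alignment rules in C */
        edi += 4;
        esi += 4;
        ecx--;
        iterations++;
    }
    return iterations;
}
```

Copying the same 16 bytes takes 16 iterations in the byte model and 4 in the doubleword model, which is the intuition behind preferring the D forms.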

As for optimization articles, I think every programmer should read Agner Fog's optimization manuals. They come in 5 volumes:

  1) Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.
  2) Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
  3) The microarchitecture of Intel and AMD CPU's: An optimization guide for assembly programmers and compiler makers.
  4) Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPU's.
  5) Calling conventions for different C++ compilers and operating systems.

About the two code snippets you posted: both of them, not surprisingly, run in 2 clock cycles on my PIII 800 MHz with 512 MB of SDRAM, with the code segment aligned on a DWORD boundary.

The code below also runs in 2 clock cycles on the same machine, again with CS aligned on a DWORD boundary:

  XOR     EBX , EBX
  MOV     BH , AL

That does the same thing as your two snippets. The code below also runs in 2 clock cycles:

  MOV     EBX , EAX
  SHL     EBX , 08
  AND     EBX , 0x0000FF00

Now watch out for this one, which takes 9 clock cycles on the same processor with the same CS conditions:

  MOV     BH , AL
  AND     EBX , 0x0000FF00

That's because of a partial register stall: BH is written and then the full EBX is read right after it. This is also described in Agner Fog's manuals.
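
To see why the slow variant has to wait, it helps to look at what MOV BH,AL does to the rest of EBX. The C sketch below models only the register contents (not the stall itself, which is a hardware effect); the function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* mov bh,al : replaces bits 8..15 of EBX and keeps the other 24 bits. */
uint32_t mov_bh_al(uint32_t ebx, uint8_t al)
{
    return (ebx & 0xFFFF00FFu) | ((uint32_t)al << 8);
}

/* Models: mov bh,al / and ebx,0x0000FF00 */
uint32_t slow_variant(uint32_t old_ebx, uint8_t al)
{
    uint32_t ebx = mov_bh_al(old_ebx, al);  /* still carries 24 bits of the old EBX */
    return ebx & 0x0000FF00u;               /* full-width read of EBX right after the partial write */
}

/* Models: mov ebx,eax / shl ebx,8 / and ebx,0x0000FF00 */
uint32_t full_width_variant(uint32_t eax)
{
    uint32_t ebx = eax;                     /* every write covers the whole register */
    ebx <<= 8;
    return ebx & 0x0000FF00u;
}

/* Checks that both variants agree whatever EBX held before. */
void check_variants(uint32_t old_ebx, uint32_t eax)
{
    assert(slow_variant(old_ebx, (uint8_t)eax) == full_width_variant(eax));
}
```

The results are identical, but in the slow variant the AND needs a value stitched together from the new BH and the old remainder of EBX, while the full-width variant never mixes partial and full writes.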

However, if you want to know how many clock cycles a piece of code really takes, you should measure it yourself with a profiler on different CPU architectures. You should also know the latencies and throughputs of important instructions, such as the different forms of MOV.
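
Without a profiler at hand, a crude timing loop is often enough for a first comparison. The sketch below is a hypothetical harness in portable C built on `clock()`. For cycle-level numbers like the ones quoted above you would read the CPU's time-stamp counter instead, which is not shown here. `best_of`, `work_fn` and `sample_work` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint32_t (*work_fn)(uint32_t);

/* Example workload: the C equivalent of movzx ebx,al / shl ebx,8. */
uint32_t sample_work(uint32_t eax)
{
    return (uint32_t)(uint8_t)eax << 8;
}

/* Runs fn `iters` times per trial and returns the fastest trial in clock()
   ticks. Taking the minimum filters out trials disturbed by interrupts or
   cold caches. */
clock_t best_of(work_fn fn, int trials, long iters)
{
    assert(trials > 0 && iters > 0);
    volatile uint32_t sink = 0;          /* keeps the compiler from deleting the loop */
    clock_t best = 0;
    for (int t = 0; t < trials; t++) {
        clock_t start = clock();
        for (long i = 0; i < iters; i++)
            sink += fn((uint32_t)i);
        clock_t elapsed = clock() - start;
        if (t == 0 || elapsed < best)
            best = elapsed;
    }
    (void)sink;
    return best;
}
```

A call such as `best_of(sample_work, 5, 10000000L)` gives a tick count to compare against another candidate run with the same arguments; divide by `CLOCKS_PER_SEC` to get seconds.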

Post by PyroMathic »

Thanks for the link, exactly what I was looking for :)