Hello,
I have lately been busy making my assembly code more efficient, and I have been reading the Intel optimization reference manual to improve some of it.
Does anyone know of other good places where I can find additional information on what to do and what to avoid?
For example:
xor ebx,ebx
mov bl,al
shl ebx,8
should be (this is about 8 cycles faster :S):
movzx ebx,al
shl ebx,8
Also, what about LODSx and STOSx: should I use those? I can't find much about them in the Intel optimization reference.
Any other information you might have on this subject is welcome.
My target system is a Core 2 Duo.
Regards
Wilco van Maanen
x86 Efficient code
String instructions are only faster than a plain MOV loop when they are used in their doubleword (STOSD/MOVSD/...) form. I remember Intel articles actually mentioning that.
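One rough way to picture why the doubleword forms can come out ahead: a STOSB-style fill stores one byte per iteration, while a STOSD-style fill stores four. The C sketch below is only a model of that difference. The names `fill_bytes` and `fill_dwords` are made up for illustration, and the iteration counts it returns are not cycle counts, so real timings still have to be measured on the target CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models a STOSB-style fill: one byte stored per iteration.
   Returns the number of store iterations performed. */
size_t fill_bytes(uint8_t *dst, uint8_t value, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = value;
    return n;
}

/* Models a STOSD-style fill: four bytes stored per iteration,
   followed by a byte-wise tail for the last n % 4 bytes.
   Returns the number of store iterations performed. */
size_t fill_dwords(uint8_t *dst, uint8_t value, size_t n)
{
    uint32_t pattern = 0x01010101u * value; /* replicate the byte 4 times */
    size_t iterations = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        memcpy(dst + i, &pattern, 4);       /* one 32-bit store */
        iterations++;
    }
    for (; i < n; i++) {                    /* leftover bytes */
        dst[i] = value;
        iterations++;
    }
    return iterations;
}
```

For a 10-byte buffer, the byte version takes 10 iterations and the doubleword version takes 4 (two 32-bit stores plus two tail bytes), and both leave the same bytes in memory.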
As for optimization articles, I think every programmer should read Agner Fog's optimization manuals. They are available in 5 volumes:
1) Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms.
2) Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
3) The microarchitecture of Intel and AMD CPU’s: An optimization guide for assembly programmers and compiler makers.
4) Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel and AMD CPU's.
5) Calling conventions for different C++ compilers and operating systems.
As for the two code snippets you posted, both of them, not surprisingly, run in 2 clock cycles on my 800 MHz PIII with 512 MB of SDRAM when the code segment is aligned on a DWORD boundary.
The code below also runs in 2 clock cycles on the same machine, again with the code segment aligned on a DWORD boundary:
Code:
XOR EBX , EBX
MOV BH , AL
It does the same thing as your two snippets. The code below also runs in 2 clock cycles:
Code:
MOV EBX , EAX
SHL EBX , 08
AND EBX , 0x0000FF00
Now watch out for this one, which runs in 9 clock cycles on the same processor under the same code segment conditions:
Code:
MOV BH , AL
AND EBX , 0x0000FF00
That's because of a partial register stall: the AND reads all of EBX right after the MOV has written only BH. This stall is also described in Agner Fog's manuals.
However, if you want to know how many clock cycles a piece of code takes, you should definitely test it with a profiler on different CPU architectures. You should also know the latencies and throughputs of important instructions, such as the different forms of MOV.
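Before timing the variants, it can help to confirm that they really compute the same value. The C sketch below models each sequence from this thread on plain 32-bit integers and checks that every one of them leaves AL in bits 8..15 of EBX with all other bits cleared. The helper and function names are made up for illustration, and the model only covers the results: it says nothing about cycle counts or partial register stalls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers modelling writes to the 8-bit sub-registers of EBX. */
uint32_t set_bl(uint32_t ebx, uint8_t v) { return (ebx & 0xFFFFFF00u) | v; }
uint32_t set_bh(uint32_t ebx, uint8_t v) { return (ebx & 0xFFFF00FFu) | ((uint32_t)v << 8); }

/* xor ebx,ebx / mov bl,al / shl ebx,8 */
uint32_t seq_xor_bl_shl(uint32_t eax, uint32_t ebx)
{
    ebx = 0;
    ebx = set_bl(ebx, (uint8_t)eax);
    return ebx << 8;
}

/* movzx ebx,al / shl ebx,8 */
uint32_t seq_movzx_shl(uint32_t eax, uint32_t ebx)
{
    ebx = (uint8_t)eax;          /* zero-extend AL into EBX */
    return ebx << 8;
}

/* xor ebx,ebx / mov bh,al */
uint32_t seq_xor_bh(uint32_t eax, uint32_t ebx)
{
    ebx = 0;
    return set_bh(ebx, (uint8_t)eax);
}

/* mov ebx,eax / shl ebx,8 / and ebx,0x0000FF00 */
uint32_t seq_mov_shl_and(uint32_t eax, uint32_t ebx)
{
    ebx = eax;
    ebx <<= 8;
    return ebx & 0x0000FF00u;
}

/* mov bh,al / and ebx,0x0000FF00 (the variant that stalls on the PIII) */
uint32_t seq_mov_bh_and(uint32_t eax, uint32_t ebx)
{
    ebx = set_bh(ebx, (uint8_t)eax);
    return ebx & 0x0000FF00u;
}

/* Checks every AL value, with junk in the upper bits of EAX and in EBX. */
void check_all_sequences(void)
{
    uint32_t al;
    for (al = 0; al < 256; al++) {
        uint32_t eax = 0xCAFE0000u | al;
        uint32_t junk = 0xDEADBEEFu;
        uint32_t expected = al << 8;
        assert(seq_xor_bl_shl(eax, junk) == expected);
        assert(seq_movzx_shl(eax, junk) == expected);
        assert(seq_xor_bh(eax, junk) == expected);
        assert(seq_mov_shl_and(eax, junk) == expected);
        assert(seq_mov_bh_and(eax, junk) == expected);
    }
}
```

Calling `check_all_sequences()` from a small `main` runs the comparison. Since all five variants agree on the result, the choice between them comes down to measured timings on the target CPU.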