Brendan wrote:
No. To load a normal register the CPU has to:
- fetch the value
- store that value into the register
To load a segment register the CPU has to do all of the following (everything between fetching the value and the final store is extra work that isn't done when loading a normal register):
- fetch the value
- determine if it's in the GDT or LDT
- check if the value is above the GDT/LDT limit
- check if the entry is marked as "present"
- check if there's a privilege level problem (CPL > DPL)
- load (and decode) the GDT/LDT entry and extract "base address", "limit" and "attributes"
- store that information into the register
All of those extra steps could be done in parallel in one cycle. Limit checking is done for every instruction / operation in protected mode anyway, and the linear address of the GDT / LDT already resides in a hardware register (GDTR / LDTR). The privilege check could be assumed to pass, with extra steps taken only when there actually is a problem (similar to branch prediction). Decoding descriptors could easily be done in hardware (it wouldn't require many transistors).
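To illustrate how little work the decode step actually is, here is a minimal C sketch of extracting "base address", "limit" and "attributes" from a raw 8-byte GDT/LDT entry. The struct and function names are mine, but the bit layout is the documented one; it's nothing but shifts and masks, exactly the kind of bit shuffling that is trivial to wire up in combinational logic:

```c
#include <stdint.h>

/* Decoded form of a protected-mode segment descriptor
   (hypothetical struct; the raw layout is per the manuals). */
struct seg_desc {
    uint32_t base;
    uint32_t limit;    /* in bytes, after granularity scaling */
    uint8_t  dpl;
    int      present;
};

static struct seg_desc decode_descriptor(uint64_t raw)
{
    struct seg_desc d;

    d.base  = (uint32_t)((raw >> 16) & 0xFFFFFF)        /* base 0..23  */
            | (uint32_t)((raw >> 56) & 0xFF) << 24;     /* base 24..31 */

    d.limit = (uint32_t)(raw & 0xFFFF)                  /* limit 0..15  */
            | (uint32_t)((raw >> 48) & 0xF) << 16;      /* limit 16..19 */

    if ((raw >> 55) & 1)              /* G bit: 4 KiB granularity */
        d.limit = (d.limit << 12) | 0xFFF;

    d.dpl     = (uint8_t)((raw >> 45) & 3);  /* descriptor privilege */
    d.present = (int)((raw >> 47) & 1);      /* P bit */
    return d;
}
```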
Brendan wrote:
Because loading a normal register is relatively simple it can be done without micro-code; and for some CPUs, in some cases (e.g. loading a register with another register) it isn't even a micro-op (e.g. handled by "register renaming" in the front-end). Loading a segment register is complicated (too complex for a simple micro-op) and therefore there's an extra "fetch micro-ops from micro-code" step involved (in addition to all the other work).
That's not an argument, but a preference of the chip designer, who bothers to optimize some things but not others. In fact, segment register loads were optimized in the beginning; not bothering to optimize them any more is a later development.
Brendan wrote:
For all these reasons; loading a segment register should probably be around 10 times slower than loading a normal register.
Look at the instruction timings for the i386, and you can see that you are mistaken.
Brendan wrote:
For using (rather than loading) a normal index register vs. using a segment register; the CPU would calculate the virtual address (and would probably be optimised to do this very quickly as it's done very often in all code), and would then convert that virtual address into a linear address.
Both the GDT and the LDT are already located by linear addresses, so there is one step less when fetching a descriptor.
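As a sketch of what "converting a virtual address into a linear address" actually costs, assuming the decoded base and limit are already sitting in the hidden part of the segment register (seg_fault() is a hypothetical stand-in for raising #GP/#SS):

```c
#include <stdint.h>

extern void seg_fault(void);  /* hypothetical; assumed not to return */

/* One compare for the limit check, one add for the base --
   both fit comfortably next to the address-generation logic. */
static uint32_t to_linear(uint32_t seg_base, uint32_t seg_limit,
                          uint32_t offset)
{
    if (offset > seg_limit)   /* predicted not-taken; penalty only on fault */
        seg_fault();
    return seg_base + offset; /* one 32-bit add */
}
```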
You might also compare this with the 4-level paging scheme used for long mode. If that had a similar level of optimization as segmentation has, code would run hundreds of times slower than it does. And the 4-level paging scheme affects each and every memory access, including code fetches, so it is really hardware intensive.
Scheme for long mode paging (every memory access):
- determine PML4 entry
- check PML4 entry and invoke page fault if bad
- determine directory ptr entry
- check directory ptr entry and invoke page fault if bad
- determine directory entry
- check directory entry and invoke page fault if bad
- determine page entry
- check page entry and invoke page fault if bad
- read data from memory
In addition to that, there are several page sizes that require additional logic to decode, and paging comes in two variants, so the hardware also needs to determine which variant is in use.
Also note that each of these steps must be carried out in sequence, since each step requires the physical address produced by the previous one, so it is impossible to do them in parallel.
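A software rendition of that walk makes the sequential dependence obvious. This is a minimal sketch, assuming hypothetical phys_read() / page_fault() helpers; the entry layout (present bit, PS bit, address mask) is the architectural one:

```c
#include <stdint.h>

#define PRESENT  (1ULL << 0)
#define PAGESIZE (1ULL << 7)                 /* PS bit: 1 GiB / 2 MiB page */
#define ADDRMASK 0x000FFFFFFFFFF000ULL

extern uint64_t phys_read(uint64_t paddr);   /* hypothetical accessor */
extern void     page_fault(uint64_t vaddr);  /* hypothetical; does not return */

/* Walk the 4 levels: PML4 -> PDPT -> PD -> PT. Every iteration
   needs the physical address produced by the previous one, so the
   four memory reads cannot be overlapped. */
uint64_t translate(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & ADDRMASK;

    for (int level = 3; level >= 0; level--) {
        uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
        uint64_t entry = phys_read(table + index * 8);

        if (!(entry & PRESENT))
            page_fault(vaddr);

        /* Extra decode logic: a set PS bit at the PDPT or PD level
           means a 1 GiB or 2 MiB page, terminating the walk early. */
        if (level == 1 || level == 2) {
            if (entry & PAGESIZE) {
                uint64_t pagemask = (1ULL << (12 + 9 * level)) - 1;
                return (entry & ADDRMASK & ~pagemask) | (vaddr & pagemask);
            }
        }
        table = entry & ADDRMASK;
    }
    return table | (vaddr & 0xFFF);          /* 4 KiB page */
}
```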
In light of this, adding a one-cycle penalty for adding two numbers to form a linear address is plain stupid. The adder could just as well always be used in protected mode.
Also note that the chip designer apparently has managed to implement exactly the same functionality for the FS and GS segment registers in long mode by loading an MSR, which requires several instructions to execute, as wrmsr takes its input in several registers. One might wonder why it was necessary to reinvent the wheel here. Wouldn't the already existing FS and GS bases have been just as good?
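For reference, this is what setting a base through the MSR interface looks like (a minimal sketch using GCC-style inline assembly; the MSR numbers are the architecturally defined ones, and set_gs_base() is a name I made up):

```c
#include <stdint.h>

#define IA32_FS_BASE 0xC0000100U
#define IA32_GS_BASE 0xC0000101U

/* wrmsr takes the MSR index in ECX and the 64-bit value split
   across EDX:EAX -- three registers just to set one base address.
   Ring 0 only. */
static inline void wrmsr(uint32_t msr, uint64_t value)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);
    __asm__ volatile("wrmsr" :: "c"(msr), "a"(lo), "d"(hi));
}

/* e.g. point GS at a (hypothetical) per-CPU data block */
static inline void set_gs_base(uint64_t base)
{
    wrmsr(IA32_GS_BASE, base);
}
```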
I might add that segment register loads apparently are still supported in long mode, and they do what we expect (fill the descriptor caches, check the descriptors and generate protection faults), and when the processor switches to compatibility mode these descriptors apparently function as if they had been loaded in protected mode. So it appears that the only thing long mode does to segmentation is to disable adding the base and checking the limit. I think they could have done better than that, and implemented a simple 64-bit adder in hardware instead of in microcode. Limit checking could be done like branch prediction: assume the limits are ok, and handle the exceptional cases with penalties.