Brendan wrote:
No. To load a normal register the CPU has to:
- fetch the value
- store that value into the register
To load a segment register the CPU has to do all of the following (everything between fetching the value and the final store is extra work that isn't done when loading a normal register):
- fetch the value
- determine if it's in the GDT or LDT
- check if the value is above the GDT/LDT limit
- check if the entry is marked as "present"
- check if there's a privilege level problem (CPL > DPL)
- load (and decode) the GDT/LDT entry and extract "base address", "limit" and "attributes"
- store that information into the register
All of those extra steps could be done in parallel in one cycle. Limit checking is done for every instruction / operation in protected mode anyway, and the linear address of the GDT / LDT already resides in a hardware register (GDTR / LDTR). The privilege check could be assumed to pass, with extra steps taken only when there actually is a problem (similar to branch prediction). Decoding descriptors could easily be done in hardware (it wouldn't require many transistors).
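To illustrate how little work the decode step actually is, here is a minimal C sketch of extracting "base address", "limit" and "attributes" from a raw 8-byte GDT/LDT entry. The struct and function names are mine, but the bit layout is the documented one; it's nothing but shifts and masks, exactly the kind of bit shuffling that is trivial to wire up in combinational logic:

```c
#include <stdint.h>

/* Decoded form of a protected-mode segment descriptor
   (hypothetical struct; the raw layout is per the manuals). */
struct seg_desc {
    uint32_t base;
    uint32_t limit;    /* in bytes, after granularity scaling */
    uint8_t  dpl;
    int      present;
};

static struct seg_desc decode_descriptor(uint64_t raw)
{
    struct seg_desc d;

    d.base  = (uint32_t)((raw >> 16) & 0xFFFFFF)        /* base 0..23  */
            | (uint32_t)((raw >> 56) & 0xFF) << 24;     /* base 24..31 */

    d.limit = (uint32_t)(raw & 0xFFFF)                  /* limit 0..15  */
            | (uint32_t)((raw >> 48) & 0xF) << 16;      /* limit 16..19 */

    if ((raw >> 55) & 1)              /* G bit: 4 KiB granularity */
        d.limit = (d.limit << 12) | 0xFFF;

    d.dpl     = (uint8_t)((raw >> 45) & 3);  /* descriptor privilege */
    d.present = (int)((raw >> 47) & 1);      /* P bit */
    return d;
}
```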
Brendan wrote:
Because loading a normal register is relatively simple it can be done without micro-code; and for some CPUs, in some cases (e.g. loading a register with another register) it isn't even a micro-op (e.g. handled by "register renaming" in the front-end). Loading a segment register is complicated (too complex for a simple micro-op) and therefore there's an extra "fetch micro-ops from micro-code" step involved (in addition to all the other work).
That's not an argument, but a preference of the chip designer, who bothers to optimize some things but not others. In fact, segment register loads were optimized in the beginning; not bothering to optimize them any more is a later development.
Brendan wrote:
For all these reasons; loading a segment register should probably be around 10 times slower than loading a normal register.
Look at the instruction timings for the i386, and you can see that you are mistaken.
Brendan wrote:
For using (rather than loading) a normal index register vs. using a segment register; the CPU would calculate the virtual address (and would probably be optimised to do this very quickly as it's done very often in all code), and would then convert that virtual address into a linear address.
Both the GDT and the LDT are already located by linear addresses, so there is one step less when fetching a descriptor.
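As a sketch of what "converting a virtual address into a linear address" actually costs, assuming the decoded base and limit are already sitting in the hidden part of the segment register (seg_fault() is a hypothetical stand-in for raising #GP/#SS):

```c
#include <stdint.h>

extern void seg_fault(void);  /* hypothetical; assumed not to return */

/* One compare for the limit check, one add for the base --
   both fit comfortably next to the address-generation logic. */
static uint32_t to_linear(uint32_t seg_base, uint32_t seg_limit,
                          uint32_t offset)
{
    if (offset > seg_limit)   /* predicted not-taken; penalty only on fault */
        seg_fault();
    return seg_base + offset; /* one 32-bit add */
}
```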
You might also compare this with the 4-level paging scheme used for long mode. If that had a similar level of optimization as segmentation has, code would run hundreds of times slower than it does. And the 4-level paging scheme affects each and every memory access, including code fetches, so it is really hardware intensive.
Scheme for long mode paging (every memory access):
- determine PML4 entry
- check PML4 entry and invoke page fault if bad
- determine directory ptr entry
- check directory ptr entry and invoke page fault if bad
- determine directory entry
- check directory entry and invoke page fault if bad
- determine page entry
- check page entry and invoke page fault if bad
- read data from memory
In addition to that, there are several page sizes that require additional logic to decode, and paging comes in two variants, so the hardware also needs to determine which variant is in use.
Also note that each of these steps must be carried out in sequence, since each step requires the physical address produced by the previous one, so it is impossible to do them in parallel.
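A software rendition of that walk makes the sequential dependence obvious. This is a minimal sketch, assuming hypothetical phys_read() / page_fault() helpers; the entry layout (present bit, PS bit, address mask) is the architectural one:

```c
#include <stdint.h>

#define PRESENT  (1ULL << 0)
#define PAGESIZE (1ULL << 7)                 /* PS bit: 1 GiB / 2 MiB page */
#define ADDRMASK 0x000FFFFFFFFFF000ULL

extern uint64_t phys_read(uint64_t paddr);   /* hypothetical accessor */
extern void     page_fault(uint64_t vaddr);  /* hypothetical; does not return */

/* Walk the 4 levels: PML4 -> PDPT -> PD -> PT. Every iteration
   needs the physical address produced by the previous one, so the
   four memory reads cannot be overlapped. */
uint64_t translate(uint64_t cr3, uint64_t vaddr)
{
    uint64_t table = cr3 & ADDRMASK;

    for (int level = 3; level >= 0; level--) {
        uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
        uint64_t entry = phys_read(table + index * 8);

        if (!(entry & PRESENT))
            page_fault(vaddr);

        /* Extra decode logic: a set PS bit at the PDPT or PD level
           means a 1 GiB or 2 MiB page, terminating the walk early. */
        if (level == 1 || level == 2) {
            if (entry & PAGESIZE) {
                uint64_t pagemask = (1ULL << (12 + 9 * level)) - 1;
                return (entry & ADDRMASK & ~pagemask) | (vaddr & pagemask);
            }
        }
        table = entry & ADDRMASK;
    }
    return table | (vaddr & 0xFFF);          /* 4 KiB page */
}
```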
In light of this, adding a one-cycle penalty for adding two numbers to form a linear address is plain stupid. The adder could just as well always be used in protected mode.
Also note that the chip designer apparently has managed to implement exactly the same functionality for the FS and GS segment registers in long mode by loading an MSR, which requires several instructions to execute, as wrmsr takes its input in several registers. One might wonder why it was necessary to reinvent the wheel here. Wouldn't the already existing FS and GS bases have been just as good?
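For reference, this is what setting a base through the MSR interface looks like (a minimal sketch using GCC-style inline assembly; the MSR numbers are the architecturally defined ones, and set_gs_base() is a name I made up):

```c
#include <stdint.h>

#define IA32_FS_BASE 0xC0000100U
#define IA32_GS_BASE 0xC0000101U

/* wrmsr takes the MSR index in ECX and the 64-bit value split
   across EDX:EAX -- three registers just to set one base address.
   Ring 0 only. */
static inline void wrmsr(uint32_t msr, uint64_t value)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);
    __asm__ volatile("wrmsr" :: "c"(msr), "a"(lo), "d"(hi));
}

/* e.g. point GS at a (hypothetical) per-CPU data block */
static inline void set_gs_base(uint64_t base)
{
    wrmsr(IA32_GS_BASE, base);
}
```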
I might add that segment register loads apparently are still supported in long mode, and they do what we expect (fill the descriptor caches, check the descriptors and generate protection faults), and when the processor switches to compatibility mode these descriptors apparently function as if they had been loaded in protected mode. So it appears that the only thing long mode does to segmentation is to disable adding the base and checking the limit. I think they could have done better than that, and implemented a simple 64-bit adder in hardware instead of in microcode. Limit checking could be done like branch prediction: assume the limits are ok, and handle the exceptional cases with penalties.