Multiple CPUs - what to share

JamesHarris · Post by **JamesHarris** » Mon Oct 21, 2024 9:39 am

I'm looking at how best to handle multiple x86 CPUs (or CPU cores, if you prefer) on a single machine, and the question arises: what should be shared between all the cores and what should be per-core? Advice would be welcome!

At one extreme I could have every CPU use the same sets of page tables, the same GDT, process table, etc. At the other extreme each CPU could have its own sets of page tables, its own GDT, its own processes etc (almost as though they were separate machines) with inter-CPU communications being used to balance resources between them.

Or there could be a compromise such as having some address space shared and some exclusive to each CPU. For example, each CPU could have its own high 4M, say, and share the rest. In that scenario, assuming a 32-bit 4G environment for the sake of simplicity, each CPU's page directory (the one its PDBR points to) would have its highest entry refer to an exclusive 4k page table and the rest refer to nearly 4M of shared page tables.
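A minimal sketch of that compromise, assuming 32-bit non-PAE paging with 4k pages. The function name, the flag macros and the idea that physical addresses come from some allocator are all hypothetical and only there for illustration:

```c
#include <stdint.h>

#define PD_ENTRIES   1024u   /* entries in a 32-bit, non-PAE page directory */
#define PDE_PRESENT  0x001u
#define PDE_WRITABLE 0x002u

/*
 * Fill one CPU's page directory: entries 0..1022 point at page tables shared
 * by every CPU, entry 1023 (the top 4M) points at a table private to this CPU.
 * The physical addresses are assumed to be 4k-aligned and to come from an
 * allocator that is not shown here.
 */
void build_cpu_page_directory(uint32_t pd[PD_ENTRIES],
                              const uint32_t shared_table_phys[PD_ENTRIES - 1],
                              uint32_t private_table_phys)
{
    for (uint32_t i = 0; i < PD_ENTRIES - 1; i++)
        pd[i] = shared_table_phys[i] | PDE_PRESENT | PDE_WRITABLE;

    pd[PD_ENTRIES - 1] = private_table_phys | PDE_PRESENT | PDE_WRITABLE;
}
```

Each CPU would then load the physical address of its own directory into CR3, so the same virtual address in the top 4M resolves to different memory on each CPU.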

I'm not at the point of handling multiple CPUs yet but I've been including code such as

if (this_cpu() == 0) /* boot CPU */

and

tasks = this_cpu_workspace()->tasks

where the two function calls return, respectively, the current CPU number and a pointer to a struct which holds data for the current CPU, i.e. the one which is running the code. So these questions bear on how best to handle structures such as the GDT (or GDTs), and even more so the IDT.
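The post does not show the helpers themselves. A rough sketch of how they might look on a build that only runs the boot CPU, with `MAX_CPUS`, `struct cpu_workspace` and its fields as purely hypothetical names:

```c
#include <stddef.h>

#define MAX_CPUS 8   /* hypothetical upper bound */

struct task;         /* the task layout is not shown in the thread */

struct cpu_workspace {
    unsigned     cpu_number;
    struct task *tasks;   /* head of this CPU's task list */
};

static struct cpu_workspace workspaces[MAX_CPUS];

/* Placeholder: while only the boot CPU is running, the answer is always 0.
 * A multiprocessor build would replace this body with a genuinely per-CPU
 * lookup (the Task Register, a segment base, the Local APIC ID, ...). */
unsigned this_cpu(void)
{
    return 0;
}

/* Pointer to the data block belonging to the CPU running this code. */
struct cpu_workspace *this_cpu_workspace(void)
{
    return &workspaces[this_cpu()];
}
```

Keeping every per-CPU access behind these two calls means that only the body of `this_cpu()` has to change once more CPUs are started.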

Hence, what do you share or intend to share? What kind of approach looks best to you? As I say, comments would be welcomed!

nullplan · Post by **nullplan** » Mon Oct 21, 2024 1:04 pm

Let's start with the simple stuff first: Data structures. On x86, in order to build an OS, you are going to need an IDT, a GDT, and a TSS. Since the TSS contains the stack addresses, and each CPU really should have its own stacks, you can definitely not share the TSS. You could in theory share the GDT. This would then require the CPUs to cooperate when booting up, so that each CPU gets its own TSS into its own slots. Personally, I don't really see the point though, since the GDT is just an array of 7 64-bit words (one null seg, one kcode seg, one kdata seg, one ucode seg, one udata seg, and two slots for the TSS). Might as well allocate that one per CPU.
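To make the seven-slot layout concrete, here is a hedged sketch assuming long mode, where a TSS descriptor is 16 bytes and therefore fills two slots. The enum and function names are hypothetical, and the segment constants are the commonly used flat descriptors, not values taken from the post:

```c
#include <stdint.h>

enum {   /* slot numbers follow the order given in the post */
    GDT_NULL, GDT_KCODE, GDT_KDATA, GDT_UCODE, GDT_UDATA,
    GDT_TSS_LO, GDT_TSS_HI, GDT_ENTRIES
};

/* Write a long-mode TSS descriptor, which spans two consecutive slots. */
void gdt_set_tss(uint64_t gdt[GDT_ENTRIES], uint64_t base, uint32_t limit)
{
    uint64_t lo = 0;
    lo |= (uint64_t)(limit & 0xFFFFu);              /* limit 15:0  */
    lo |= (base & 0xFFFFFFull) << 16;               /* base 23:0   */
    lo |= 0x89ull << 40;                            /* present, type 9 = available TSS */
    lo |= (uint64_t)((limit >> 16) & 0xFu) << 48;   /* limit 19:16 */
    lo |= ((base >> 24) & 0xFFull) << 56;           /* base 31:24  */
    gdt[GDT_TSS_LO] = lo;
    gdt[GDT_TSS_HI] = base >> 32;                   /* base 63:32, upper half reserved (zero) */
}

/* Fill one CPU's private GDT; only the TSS base differs between CPUs. */
void gdt_init(uint64_t gdt[GDT_ENTRIES], uint64_t tss_base, uint32_t tss_limit)
{
    gdt[GDT_NULL]  = 0;
    gdt[GDT_KCODE] = 0x00AF9A000000FFFFull;   /* ring 0, 64-bit code */
    gdt[GDT_KDATA] = 0x00CF92000000FFFFull;   /* ring 0, data        */
    gdt[GDT_UCODE] = 0x00AFFA000000FFFFull;   /* ring 3, 64-bit code */
    gdt[GDT_UDATA] = 0x00CFF2000000FFFFull;   /* ring 3, data        */
    gdt_set_tss(gdt, tss_base, tss_limit);
}
```

With one such table per CPU, every CPU loads the same selector (`GDT_TSS_LO << 3`) into TR, and only the descriptor contents differ.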

Leaves the IDT. I have a completely filled IDT, because between IOAPICs having a potentially unbounded number of inputs and MSI-X being a thing, there is no sensible upper limit to the number of IRQs a computer can have that is lower than what the architecture gives you. I'm using the IDT mostly just to transfer the IRQ number into a data item that the shared handler can look at, so there really is no need to have separate IDTs for each core, so I just share that one.
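The "transfer the IRQ number to a shared handler" idea might look something like the following. This is only a sketch of the C side: the per-vector assembly stubs that would call `common_interrupt` are not shown, all names are hypothetical, and a real kernel would need locking around registration:

```c
#include <stddef.h>

#define IDT_VECTORS 256

typedef void (*irq_handler_t)(unsigned vector);

/* One table for the whole machine, shared by all CPUs like the IDT itself. */
static irq_handler_t irq_handlers[IDT_VECTORS];

/* Example handler so the sketch has an observable effect. */
static unsigned irq_counts[IDT_VECTORS];
void count_interrupt(unsigned vector) { irq_counts[vector]++; }

/* Claim a vector; returns 0 on success, -1 if out of range or already taken. */
int irq_register(unsigned vector, irq_handler_t handler)
{
    if (vector >= IDT_VECTORS || irq_handlers[vector] != NULL)
        return -1;
    irq_handlers[vector] = handler;
    return 0;
}

/* Called by every entry stub with its own vector number.
 * Returns 1 if a handler ran, 0 if the interrupt was unclaimed. */
int common_interrupt(unsigned vector)
{
    if (vector >= IDT_VECTORS || irq_handlers[vector] == NULL)
        return 0;
    irq_handlers[vector](vector);
    return 1;
}
```

Because the stubs do nothing CPU-specific, one IDT and one handler table can serve every core.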

Next: Paging structures. I am building an upper-half OS, so the upper half of address space belongs to the kernel, and the lower half belongs to the user-space. So I have absolutely no problem sharing the kernel half with everything. Obviously, each virtual address space needs its own top-level structure, but that can just copy the kernel-side mappings from the master mapping table.
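Copying the kernel-side mappings could be as simple as the sketch below. It assumes x86-64 with 4-level paging, where entries 256 to 511 of the top-level table cover the upper half. The function and macro names are hypothetical:

```c
#include <stdint.h>
#include <string.h>

#define PML4_ENTRIES      512
#define KERNEL_FIRST_SLOT 256   /* entries 256..511 cover the upper half */

/* Initialise a fresh top-level table: empty user half, kernel half copied
 * from the master table so every address space sees the same kernel. */
void address_space_init(uint64_t new_pml4[PML4_ENTRIES],
                        const uint64_t master_pml4[PML4_ENTRIES])
{
    memset(new_pml4, 0, KERNEL_FIRST_SLOT * sizeof new_pml4[0]);
    memcpy(&new_pml4[KERNEL_FIRST_SLOT], &master_pml4[KERNEL_FIRST_SLOT],
           (PML4_ENTRIES - KERNEL_FIRST_SLOT) * sizeof new_pml4[0]);
}
```

A plain copy only stays in sync if the kernel half's top-level entries no longer change after the first copy is made, for instance because they are all preallocated at boot. That is an assumption of this sketch, not something the post states.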

For the lower half, mappings are shared on the level of the process, so if multiple cores run threads of the same process, they will also both have the same CR3. Why should they not?

To summarize: In my OS, the cores share everything except

- the stack
- the IST stacks
- the TSS
- the GDT (though the layout is the same everywhere)
- assorted CPU-local variables

I remember a certain oft-quoted tutorial, which uses paging tricks to unshare the stacks while keeping the virtual addresses the same. That technique has the problem that with it, cores cannot share pointers to stack variables. I on the other hand keep the address space sane and just move new threads to new stacks.

Multi-CPU scheduling is a bit of a chore. The way I do it, and it seems to work for now: Every CPU has its own task list, and there is a list of orphaned tasks. Whenever a scheduler passes over a runnable task more than twice in a row, it moves that task to the orphaned list and sends an IPI to all other CPUs to reschedule. Whenever a CPU needs to reschedule, it considers both the CPU-local tasks and the list of orphaned tasks (which guarantees that at some point, the orphaned task will have the highest priority). And if an orphaned task is accepted into a CPU, it is of course removed from the orphaned list. Otherwise, bog standard round robin with priorities in each CPU.

Experimentally, this seems to spread the load pretty evenly on the cores and prevent too many spurious moves, except under high load. Once there are more runnable processes than cores, the wild mass swapping begins. But that situation is rare.
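The scheduling scheme described above could be sketched roughly as follows. This is a deliberately simplified model, not nullplan's code: the lists are represented by an `owner` field on each task, round robin among equal priorities and any priority ageing are left out, and the IPI is replaced by a counter. All names are hypothetical:

```c
#include <stddef.h>

#define ORPHANED   (-1)   /* owner value for tasks on the orphaned list */
#define MAX_PASSES 2      /* "more than twice in a row" */

struct task {
    int owner;         /* CPU whose list holds the task, or ORPHANED */
    int priority;      /* larger value = more urgent */
    int runnable;
    int passed_over;   /* consecutive times a scheduler skipped this task */
};

/* Stand-in for broadcasting a reschedule IPI; here it only counts calls. */
static int reschedule_ipis_sent;
static void send_reschedule_ipi(void) { reschedule_ipis_sent++; }

/* Pick the next task for `cpu` from its own tasks plus the orphans. */
struct task *schedule(int cpu, struct task *tasks, size_t n)
{
    struct task *best = NULL;
    int orphaned_any = 0;

    /* Highest-priority runnable task among local and orphaned tasks. */
    for (size_t i = 0; i < n; i++) {
        struct task *t = &tasks[i];
        if (!t->runnable || (t->owner != cpu && t->owner != ORPHANED))
            continue;
        if (best == NULL || t->priority > best->priority)
            best = t;
    }

    /* Local runnable tasks that lost out are counted, and orphaned when starved. */
    for (size_t i = 0; i < n; i++) {
        struct task *t = &tasks[i];
        if (t == best || !t->runnable || t->owner != cpu)
            continue;
        if (++t->passed_over > MAX_PASSES) {
            t->owner = ORPHANED;
            t->passed_over = 0;
            orphaned_any = 1;
        }
    }

    if (best != NULL) {
        best->owner = cpu;   /* adopting an orphan removes it from that list */
        best->passed_over = 0;
    }
    if (orphaned_any)
        send_reschedule_ipi();
    return best;
}
```

In a real kernel the orphaned list would need a lock, since every CPU reads and modifies it.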

JamesHarris · Post by **JamesHarris** » Mon Oct 28, 2024 2:09 pm

Thanks for the reply.

nullplan wrote: ↑Mon Oct 21, 2024 1:04 pm Let's start with the simple stuff first: Data structures. On x86, in order to build an OS, you are going to need an IDT, a GDT, and a TSS. Since the TSS contains the stack addresses, and each CPU really should have its own stacks, you can definitely not share the TSS. You could in theory share the GDT. This would then require the CPUs to cooperate when booting up, so that each CPU gets its own TSS into its own slots. Personally, I don't really see the point though, since the GDT is just an array of 7 64-bit words (one null seg, one kcode seg, one kdata seg, one ucode seg, one udata seg, and two slots for the TSS). Might as well allocate that one per CPU.

That's a good analysis. I've been following it, though I've run into a small performance problem. With separate 7-entry GDTs I can't see a fast way to determine which CPU a piece of code is running on or where the current CPU's variables are stored. Maybe you have a way I've missed, but if each CPU's TR is different (i.e. each selects a different GDT index) then the CPU number can be determined with something like


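cpu_number_get:
  str eax                   ; eax = selector held in the Task Register
  movzx eax, ax             ; keep only the 16-bit selector
  shr eax, 3                ; drop the RPL and TI bits, leaving the GDT index
  sub eax, BASE_CPU_NUMBER  ; subtract the GDT index of the first CPU's TSS
  ret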

That's very fast (e.g. when last quantified by Intel in their documentation, str took just 2 cycles) and makes use of a genuinely per-CPU register, i.e. the Task Register. If each TR holds a different value, allocated in sequence, then the code can be used.

By contrast, if all TRs hold the same 16-bit value I can't think of a way to find out the CPU number so quickly.

So I'm thinking of having a single GDT where each CPU has its own TSS index.
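The arithmetic behind that plan, written out in C as a sketch. It assumes 32-bit TSS descriptors, which take one GDT slot each, so consecutive CPUs get consecutive indices. The value given to `BASE_CPU_NUMBER` here and both function names are hypothetical:

```c
#include <stdint.h>

#define BASE_CPU_NUMBER 5u   /* hypothetical: GDT index of CPU 0's TSS descriptor */

/* Selector to load into TR on the given CPU (TI = 0, RPL = 0). */
static inline uint16_t tss_selector_for_cpu(unsigned cpu)
{
    return (uint16_t)((BASE_CPU_NUMBER + cpu) << 3);
}

/* Inverse mapping: what cpu_number_get computes from the value STR returns. */
static inline unsigned cpu_from_tss_selector(uint16_t tr)
{
    return (unsigned)(tr >> 3) - BASE_CPU_NUMBER;
}
```

In long mode each TSS descriptor occupies two slots, so the same idea would need a stride of two indices per CPU.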

As an aside, I have some ideas but I'm not completely sure what your second TSS is for.

Octocontrabass · Post by **Octocontrabass** » Mon Oct 28, 2024 2:37 pm

JamesHarris wrote: ↑Mon Oct 28, 2024 2:09 pm With separate 7-entry GDTs I can't see a fast way to determine which CPU a piece of code is running on or where the current CPU's variables are stored.

Use segmentation. Long mode gives special treatment to GS so it can be used as your per-CPU data pointer. In protected mode, just add an extra kernel data segment with a nonzero base. (A 32-bit TSS only occupies one slot in the GDT, so it's still only 7 entries.)
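For the protected-mode variant, the extra descriptor could be built as in the sketch below. It encodes a ring-0 read/write data segment whose base is the address of one CPU's data block. The function name is hypothetical, and byte granularity is assumed, so `size` must be between 1 byte and 1 MiB:

```c
#include <stdint.h>

/* Encode a 32-bit ring-0 read/write data segment descriptor with byte
 * granularity, covering `size` bytes starting at linear address `base`. */
uint64_t percpu_data_descriptor(uint32_t base, uint32_t size)
{
    uint32_t limit = size - 1;                     /* limit is the last valid offset */
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);              /* limit 15:0  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;       /* base 23:0   */
    d |= 0x92ull << 40;                            /* present, DPL 0, data, writable */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;   /* limit 19:16 */
    d |= 0x4ull << 52;                             /* D/B = 1 (32-bit), G = 0 (bytes) */
    d |= (uint64_t)(base >> 24) << 56;             /* base 31:24  */
    return d;
}
```

Each CPU would place its own descriptor in the same slot of its own GDT and load the selector into a spare segment register once. After that, offset 0 in that segment is the start of the current CPU's data on every core.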

rdos · Post by **rdos** » Mon Oct 28, 2024 2:54 pm

I use a paging trick to quickly determine CPU id/data. I map the first selectors in the GDT to different pages for different cores, and so I can just load a specific selector and then I have the core data directly accessible. The TR trick doesn't work for me, and neither does GS. Each thread has its own TSS, and GS is used by the kernel since my kernel is segmented.

JamesHarris · Post by **JamesHarris** » Mon Oct 28, 2024 5:17 pm

rdos wrote: ↑Mon Oct 28, 2024 2:54 pm I use a paging trick to quickly determine CPU id/data. I map the first selectors in the GDT to different pages for different cores, and so I can just load a specific selector and then I have the core data directly accessible. The TR trick doesn't work for me, and neither does GS. Each thread has its own TSS, and GS is used by the kernel since my kernel is segmented.

Cool. If you are using the APIC system (which I am not) IIRC there's also the Local APIC ID Register.
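For reference, reading that register might look like the sketch below. It assumes xAPIC mode, where the Local APIC registers are memory-mapped and the ID sits in bits 31:24 of the register at offset 0x20. The function names and the `lapic` pointer are hypothetical, and the kernel is assumed to have mapped the register page already:

```c
#include <stdint.h>

#define LAPIC_DEFAULT_BASE 0xFEE00000u   /* default physical base; can be relocated via IA32_APIC_BASE */
#define LAPIC_ID_OFFSET    0x20u         /* Local APIC ID Register */

/* In xAPIC mode the ID occupies bits 31:24 of the register. */
static inline uint32_t lapic_id_from_register(uint32_t reg)
{
    return reg >> 24;
}

/* Read the ID through a pointer to the mapped Local APIC registers. */
static inline uint32_t lapic_read_id(const volatile uint32_t *lapic)
{
    return lapic_id_from_register(lapic[LAPIC_ID_OFFSET / sizeof(uint32_t)]);
}
```

APIC IDs are not guaranteed to be contiguous or to start at zero, so a kernel using this would typically translate the ID to its own CPU number through a small lookup table.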

Octocontrabass · Post by **Octocontrabass** » Mon Oct 28, 2024 5:46 pm

JamesHarris wrote: ↑Mon Oct 28, 2024 5:17 pm If you are using the APIC system (which I am not)

How exactly are you using more than one CPU without using any APICs?

JamesHarris · Post by **JamesHarris** » Tue Oct 29, 2024 3:44 am

Octocontrabass wrote: ↑Mon Oct 28, 2024 5:46 pm
JamesHarris wrote: ↑Mon Oct 28, 2024 5:17 pm If you are using the APIC system (which I am not)
How exactly are you using more than one CPU without using any APICs?

I'm not. As I said in the initial post, I'm not at the point of handling multiple CPUs yet but I've been including multiprocessor-aware code. That's so there's less to change when I get to that point.

Octocontrabass · Post by **Octocontrabass** » Tue Oct 29, 2024 8:16 pm

Aha, that makes sense.

OSDev.org
