[SOLVED] Single Kernel Stack Or Allocate for Each Process?

irvanherz · Post by **irvanherz** » Wed Dec 28, 2016 4:35 am

Does each process have its own kernel stack, or do we just need to initialize a single kernel stack?
Recently, I implemented software multitasking (before implementing user mode), where all registers are stored on the stack during a task switch.

But it has become a nightmare now that I know the CPU loads SS and ESP from the TSS as the kernel stack on an interrupt (I'm using a single TSS).

Could you give me suggestions, please?

JAAman · Post by **JAAman** » Wed Dec 28, 2016 8:10 am

Yes, most people use a separate kernel stack per thread, although it is possible to do it with only one stack. You don't need multiple TSSs: just patch the one TSS with the new thread's stack address on each task switch. Replace TSS.ESP0 with the new thread's ESP0 and you're good.

First you store the current context, then you switch stacks (current ESP, TSS.ESP0, and CR3), then you restore the context. It's really very simple.

To do this, you need to store ESP0 somewhere other than the stack, so that it can be restored from another thread without first needing to know where the stack is. Usually it is stored in a thread structure that holds various information about the thread, including its ESP0 and CR3.
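
The three steps can be sketched in C roughly as follows. This is a hosted simulation under stated assumptions, not real kernel code: `struct thread`, `struct tss_stub`, `struct cpu_state` and `switch_thread` are hypothetical names, the TSS stub does not follow the real hardware layout, and in a kernel the ESP and CR3 writes would be done in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread record; field names are illustrative only. */
struct thread {
    uint32_t esp;   /* saved kernel ESP while the thread is switched out */
    uint32_t esp0;  /* top of this thread's kernel stack */
    uint32_t cr3;   /* physical address of its page directory */
};

/* Only the TSS fields that matter here; NOT the real TSS layout. */
struct tss_stub {
    uint32_t esp0;
    uint16_t ss0;
};

/* Stand-in for the CPU registers so the logic can run in user space. */
struct cpu_state {
    uint32_t esp;
    uint32_t cr3;
    struct tss_stub tss;
};

void switch_thread(struct cpu_state *cpu, struct thread *prev, struct thread *next)
{
    assert(cpu != NULL && prev != NULL && next != NULL);

    /* 1. Remember where prev's context was pushed on its kernel stack. */
    prev->esp = cpu->esp;

    /* 2. Switch stacks: patch the single TSS so the next ring 3 -> ring 0
     *    transition lands on next's kernel stack, and change address
     *    space only when it actually differs. */
    cpu->tss.esp0 = next->esp0;
    if (cpu->cr3 != next->cr3)
        cpu->cr3 = next->cr3;

    /* 3. Continue on next's kernel stack; its context is restored from there. */
    cpu->esp = next->esp;
}
```

Because `esp0` and `cr3` live in the thread structure rather than on the stack, the switch code never needs to look at the outgoing thread's stack to find them.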

Boris · Post by **Boris** » Wed Dec 28, 2016 5:04 pm

On top of that, I suggest you allocate special stacks for double faults and non-maskable interrupts.
They will be used when you need to handle something bad or hardware watchdogs.

Brendan · Post by **Brendan** » Wed Dec 28, 2016 10:02 pm

Hi,

irvanherz wrote:Could you give me suggestions, please?

There are advantages and disadvantages.

The main advantages of "single kernel stack per CPU" are:

It reduces memory consumed by kernel stacks (e.g. with 4 KiB per kernel stack, 1000 threads and 8 CPUs it'd be "4 MiB vs. 32 KiB").
It improves the efficiency of CPU's caches, because each CPU's kernel stack is likely to remain in that CPU's caches.
It's slightly easier to optimise kernel stacks for NUMA (the "kernel stack that this CPU is using was allocated for a different NUMA domain" problem).
It's faster when 2 or more things that would cause task switches occur between "kernel entry" and "kernel exit" (e.g. between SYSENTER and SYSEXIT, or between an interrupt handler starting and IRET). This is because you end up saving user-space thread state at "kernel entry" and restoring user-space thread state at "kernel exit", so things that cause task switches to occur between "kernel entry" and "kernel exit" just cause a "which thread to return to" variable to be changed.
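
A minimal sketch of that last point, assuming a trimmed register set and hypothetical names (`struct user_state`, `kernel_entry`, `request_switch`, `kernel_exit`); it only models the bookkeeping, not the actual register save and restore:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed user-space register set, for illustration only. */
struct user_state { uint32_t eip, esp, eflags; };

struct thread {
    struct user_state saved;   /* filled in at kernel entry */
};

struct cpu {
    struct thread *current;    /* thread that entered the kernel */
    struct thread *return_to;  /* "which thread to return to" variable */
    unsigned restores;         /* counts full state restores */
};

/* Kernel entry: user state goes into the thread structure, not onto a
 * per-thread kernel stack, so the one per-CPU stack holds nothing that
 * must survive a task switch. */
void kernel_entry(struct cpu *c, const struct user_state *regs)
{
    assert(c->current != NULL);
    c->current->saved = *regs;
    c->return_to = c->current;
}

/* A "task switch" inside the kernel is just a pointer update. */
void request_switch(struct cpu *c, struct thread *next)
{
    c->return_to = next;
}

/* Kernel exit: one restore, no matter how many switches were requested. */
struct user_state kernel_exit(struct cpu *c)
{
    c->current = c->return_to;
    c->restores++;
    return c->current->saved;
}
```

If two switches are requested between entry and exit, only the final `return_to` is restored; the intermediate thread never costs a full context switch.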

The main advantages of "kernel stack per thread" are:

It's faster when nothing causes a task switch between "kernel entry" and "kernel exit". This is because you're not saving so much user-space thread state at kernel entry or restoring it at kernel exit.
It's much easier to handle kernel pre-emption (e.g. when kernel is in the middle of doing something lengthy/expensive and a high priority task unblocks, then you can switch to the high priority task). For "single kernel stack per CPU" you either end up with poor latency (because you can't preempt kernel and more important things have to wait) and/or end up using special/explicit "synch points" (where lengthy/expensive things are broken up into multiple smaller things separated by some sort of "should I switch to something else now that I've completed that last smaller thing" check).
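
To illustrate the explicit "synch points" approach, here is a sketch where a lengthy operation (zeroing many pages) is split into chunks with a check after each one. All names (`struct cpu`, `need_resched`, `yield_to_scheduler`, `zero_pages`) are assumptions for the example, and the scheduler call is a stub:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct cpu {
    bool need_resched;   /* set when a higher-priority task unblocks */
    unsigned yields;     /* how many times a synch point switched away */
};

/* Stub standing in for a real scheduler call. */
static void yield_to_scheduler(struct cpu *c)
{
    c->need_resched = false;
    c->yields++;
}

/* Zero `pages` pages, `chunk` pages at a time, checking for a pending
 * switch between chunks. Returns the number of pages zeroed. */
size_t zero_pages(struct cpu *c, unsigned char *mem,
                  size_t pages, size_t page_size, size_t chunk)
{
    assert(chunk > 0);
    size_t done = 0;
    while (done < pages) {
        size_t n = (pages - done < chunk) ? pages - done : chunk;
        memset(mem + done * page_size, 0, n * page_size);
        done += n;
        if (c->need_resched)          /* synch point */
            yield_to_scheduler(c);
    }
    return done;
}
```

The chunk size sets the worst-case latency: a high priority task waits at most one chunk before the kernel notices it.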

Note that for micro-kernels there's a lot less "lengthy/expensive kernel things" (and breaking "lengthy/expensive kernel things into multiple smaller things" is easier) because the kernel does less (creating a process is typically the most lengthy/expensive thing, and that can be naturally broken up into creating "process metadata", creating a virtual address space, and creating an initial thread); and for micro-kernels you typically end up with a lot more task switches (mostly because drivers aren't in the kernel) and you're more likely to expect "1 or more task switches between kernel entry and kernel exit" (and less likely to care about "no task switch between kernel entry and kernel exit").

Essentially; "single kernel stack per CPU" tends to favour micro-kernels, and "single kernel stack per thread" tends to favour monolithic kernels.

Cheers,

Brendan

irvanherz · Post by **irvanherz** » Thu Dec 29, 2016 12:17 am

JAAman wrote:yes, most people do use a separate kernel stack per thread, while it is possible to do it with only one stack, most people use a separate stack -- you don't need multiple TSSs, just patch the one TSS with the new thread's stack address on task switch -- just replace the TSS.ESP0 with the new TSS.ESP0 and you're good

The kernel stack is mapped in kernel space, right?
I think it will be dangerous if a stack overflow happens.

Boris wrote:On the top of that, I suggest you to allocate special stacks for double faults, and non maskable interrupts.
they will be used when you need to handle something bad or hardware watchdogs

Thanks for your suggestions.

Brendan wrote:Essentially; "single kernel stack per CPU" tends to favour micro-kernels, and "single kernel stack per thread" tends to favour monolithic kernels.

OK, I'll prefer a single kernel stack per thread. But is 4 KB enough for the kernel to service the task? Or should I switch to another stack when the kernel needs to do some heavy work?

FallenAvatar · Post by **FallenAvatar** » Thu Dec 29, 2016 12:26 am

irvanherz wrote:OK, I'll prefer a single kernel stack per thread. But is 4 KB enough for the kernel to service the task? Or should I switch to another stack when the kernel needs to do some heavy work?

4 KB is used because it is the smallest space a stack can take on x86, since pages are 4 KB at their smallest. You obviously should provide demand paging (or something else) to allow the stack to grow as much as is needed.

- Monk

Brendan · Post by **Brendan** » Thu Dec 29, 2016 6:41 am

Hi,

irvanherz wrote:
Essentially; "single kernel stack per CPU" tends to favour micro-kernels, and "single kernel stack per thread" tends to favour monolithic kernels.
OK, I'll prefer a single kernel stack per thread. But is 4 KB enough for the kernel to service the task? Or should I switch to another stack when the kernel needs to do some heavy work?

For "lean and mean micro-kernel in assembly", I've used 2 KiB kernel stacks without any problem. For monolithic you'd want larger, and for C (or worse, C++) you'd want larger. The best way is to worry about it later - take an educated guess (or just use something huge for now), and then measure it to find out what you actually do need. Typically (for measurement) it's enough to pre-fill the stack/s with a magic value (e.g. "0xF00DBABE") and let the system run for a while, then check to see how many of those magic values got overwritten (and then add some more for "just in case").
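
The measurement trick could look something like this. It is a sketch that assumes the stack is an array of 32-bit words with index 0 at the lowest address (x86 stacks grow downward), and `stack_prefill` / `stack_bytes_used` are made-up names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_MAGIC 0xF00DBABEu

/* Fill the whole stack with the magic value before the thread first runs. */
void stack_prefill(uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_MAGIC;
}

/* The stack grows down from the highest address, so untouched words form
 * a run of magic values starting at the lowest address. Everything above
 * the first overwritten word counts as used. */
size_t stack_bytes_used(const uint32_t *stack, size_t words)
{
    assert(stack != NULL);
    size_t untouched = 0;
    while (untouched < words && stack[untouched] == STACK_MAGIC)
        untouched++;
    return (words - untouched) * sizeof(uint32_t);
}
```

This reports the high-water mark, not current usage. It can under-report slightly if the deepest values written happen to equal the magic value, which is one more reason to add the "just in case" margin.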

Also note that (especially if you're expecting to need larger stacks and/or support IRQ nesting) it's probably worth considering using additional stacks for IRQ handlers. For example; ignoring IRQs your worst case might be 3 KiB, and with IRQ nesting (where one IRQ handler interrupts another that interrupted another) the actual worst case might be 6 KiB. Instead of having 8 KiB of kernel stack for every thread (to cope with "worst case including IRQ nesting"), maybe you only need 4 KiB for each thread plus a "4 KiB per CPU" stack that is only used by IRQ handlers (where you switch to the special "IRQ only stack" at the start of an IRQ handler). Note: This is something that Linux does.
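
One way to sketch the "switch to the IRQ-only stack at the start of an IRQ handler" idea is with a per-CPU nesting counter, so that only the outermost IRQ switches stacks and nested IRQs keep running on the IRQ stack. The names here (`struct cpu_irq`, `irq_enter`, `irq_exit`) are hypothetical, and the actual stack pointer load would be done in the assembly entry stub:

```c
#include <assert.h>
#include <stdint.h>

struct cpu_irq {
    uintptr_t irq_stack_top;  /* top of this CPU's IRQ-only stack */
    unsigned depth;           /* current IRQ nesting depth */
};

/* Returns the stack pointer the handler should run on. The outermost IRQ
 * moves to the per-CPU IRQ stack; a nested IRQ is already on it and keeps
 * its current stack pointer. */
uintptr_t irq_enter(struct cpu_irq *c, uintptr_t current_sp)
{
    return (c->depth++ == 0) ? c->irq_stack_top : current_sp;
}

void irq_exit(struct cpu_irq *c)
{
    assert(c->depth > 0);
    c->depth--;
}
```

With the numbers from the example above and 1000 threads on 8 CPUs, that is 1000 * 4 KiB + 8 * 4 KiB instead of 1000 * 8 KiB of kernel stack.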

If you're considering dynamically resized stacks (e.g. allocating more if you get a page fault because your existing stack needs to be larger); be extremely careful. The basic problem is something like (e.g.) you run out of stack and get a page fault, but then the CPU triple faults because there isn't enough stack to start the page fault handler. However; there's a huge number of corner cases and variations beyond just that basic scenario. For one random example; maybe something else causes a page fault, the page fault handler starts but is immediately (before it can save CR2) interrupted by an NMI, then the CPU runs out of stack space while trying to start the NMI handler and that causes a second page fault (which overwrites the previous value in CR2 that wasn't saved yet); and now you have to dig your way out of "trouble, 3 layers deep".

Cheers,

Brendan

rdos · Post by **rdos** » Sun Jan 01, 2017 6:23 am

I actually have both. Each thread has its own kernel stack so it can be preempted in kernel. Each core also has its own stack that is used by the scheduler. As soon as a thread blocks, the scheduler will switch to the per-core stack, and then the stack will be reloaded with a thread kernel stack once it schedules a new thread. This makes it possible to distribute the SMP scheduler so it can run on all cores at the same time. I also have a TSS per thread, but I don't use hardware task switching anymore. I use software to read and write the registers in the TSS instead.

As somebody pointed out, there should be special handlers for double faults and stack faults that are TSS-based. Otherwise, triple faults will occur when the kernel stack is exhausted or bad.
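
For 32-bit protected mode, "TSS-based" here means the IDT entry is a task gate that points at a separate TSS with its own known-good ESP, so the CPU does a hardware task switch instead of pushing onto the broken stack. A sketch of building such an IDT entry follows; `make_task_gate` is a made-up helper and the selector in the usage note is an assumption. Setting up the second TSS and its GDT descriptor is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit protected-mode task gate descriptor as a 64-bit value.
 * Layout: bits 16..31 = TSS selector, bits 40..43 = type (0101b = task
 * gate), bits 45..46 = DPL (0 here), bit 47 = present. The offset fields
 * are unused for task gates and stay zero. */
uint64_t make_task_gate(uint16_t tss_selector)
{
    /* TSS descriptors live in the GDT, so the selector's TI bit must be 0. */
    assert((tss_selector & 0x4) == 0);

    const uint64_t present        = 1ull << 47;
    const uint64_t type_task_gate = 0x5ull << 40;
    return present | type_task_gate | ((uint64_t)tss_selector << 16);
}
```

For example, with a hypothetical double-fault TSS at GDT selector 0x28, `idt[8] = make_task_gate(0x28);` would route vector 8 through it. In long mode there are no task gates; the IST mechanism serves the same purpose there.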

irvanherz · Post by **irvanherz** » Wed Jan 04, 2017 1:38 am

OK, it's all clear now. Thanks to you all.

OSDev.org
