Kernel preemption during software interrupts

max
Member
Posts: 616
Joined: Mon Mar 05, 2012 11:23 am
Libera.chat IRC: maxdev
Location: Germany

Post by max »

Hey,

I can't quite understand how a preemptible kernel is supposed to work, meaning one where a task can be interrupted while doing a syscall. My kernel supports user and kernel privilege threads. Each user level thread has a kernel stack and a userspace stack. Each kernel level thread has only one stack, where the registers are just pushed on top when an interrupt occurs. This means a system call from userspace currently looks like this:

Code:

Userspace task running
-> System call is fired
-> CPU pushes SS, ESP, EFLAGS, CS, EIP (to kernel stack of user task)
-> Enters interrupt handler
--> CLI so we don't get interrupted
--> Push general registers (to kernel stack of user task)
--> Store register state in task structure
---> System call is processed
--> Restore register state from task structure
--> Interrupt handler pops general registers (off the kernel stack of user task)
-> IRET pops EIP, CS, EFLAGS, ESP, SS
-> Returns into the userspace task
What I can't wrap my head around, how does it work if this is interrupted again, say by a timer interrupt? It would be a Ring 0 -> Ring 0 switch so the stack would not be switched.. where would you store the registers now? On top of the stack again? Like...

Code:

Userspace task running
-> System call is fired
-> CPU pushes SS, ESP, EFLAGS, CS, EIP (to kernel stack of user task)
-> Enters interrupt handler
--> CLI so we don't get interrupted
--> Push general registers (to kernel stack of user task)
--> Store register state in task structure
--> Now STI again?
---> System call is processed
---> We now get interrupted by timer interrupt
---> ???
Any clarifications greatly appreciated!

Best greets
alexfru
Member
Posts: 1111
Joined: Tue Mar 04, 2014 5:27 am

Re: Kernel preemption during software interrupts

Post by alexfru »

max wrote:... how does it work if this is interrupted again, say by a timer interrupt? It would be a Ring 0 -> Ring 0 switch so the stack would not be switched.. where would you store the registers now? On top of the stack again?
Yes, until you run out of stack space or interrupt priorities.
songziming
Member
Posts: 71
Joined: Fri Jun 28, 2013 1:48 am

Re: Kernel preemption during software interrupts

Post by songziming »

You should have a dedicated stack for interrupts. If your OS supports multi-core, then use one stack for each CPU. When handling an interrupt, first switch to the interrupt stack, then switch back afterwards.

Since the interrupt stack does not belong to any task (thread), you can achieve task preemption. Find a new task to run in the ISR, then retrieve its rip saved during its last interrupt, and switch to its stack.

Exceptions and syscalls don't cause task preemption, so there is no need to switch stacks for them (only switch stacks when CPL=3).
Reinventing the Wheel, code: https://github.com/songziming/wheel
JAAman
Member
Posts: 879
Joined: Wed Oct 27, 2004 11:00 pm
Location: WA

Re: Kernel preemption during software interrupts

Post by JAAman »

task preemption requires absolutely nothing on your part -- it just works

there are really no problems here, it is a lot simpler than you are thinking:

Code:

user stack:
systemCall -> switch to kernel stack

kernel stack:
<-user return CS:r/eIP, eflags, etc.
<-user preserved registers, segment registers, etc.

<-kernel using stack

--Interrupt occurs--
<-return information (r/eIP, etc.)
<-interrupt preserved information

<-interrupt usage

--interrupt decides to task switch--

--IMPORTANT: be sure to satisfy the hardware (whatever is required to prevent repeated interrupts and allow the hardware to proceed) and send EOI before moving on to the next point

<-interrupt CALLs task switch program (return state information stored on stack)
STACK SWITCH : task switching code switches to another task, switching stacks in the process

<-task switch code returns to whatever called it, in the new process; old process is suspended

here we will assume it was called by another ISR

<-interrupt completes, and local variables are removed from stack (note: this is the interrupt from the new task, not the old task)

<-ISR restores preserved information
<-IRET returns to kernel code which was interrupted by the interrupt

<-kernel syscall finishes processing, and removes locals from stack

<-kernel syscall restores preserved registers
<-kernel syscall returns to user code

:at this point the kernel stack is empty (or rather, at the same point it was before the syscall)
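
The central point of the walkthrough above, that the task switch is an ordinary call which only returns later, on a different stack, can be tried out in user space. This is a minimal sketch, assuming a POSIX system where the (obsolescent) `<ucontext.h>` API is still available; `swapcontext` stands in for the kernel's own stack-switching code, and `task_switch`, `task_b_entry`, `run_demo` and the trace buffer are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <ucontext.h>

static ucontext_t task_a, task_b;   /* saved register state of the two "tasks" */
static char stack_b[64 * 1024];     /* task B runs on its own stack */
static char trace[16];              /* records the order in which code ran */
static size_t trace_len;

static void note(char c) {
    assert(trace_len < sizeof trace - 1);
    trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

/* Save the caller's registers and stack pointer in *from, then resume *to.
 * The call only "returns" once some other task switches back to *from. */
static void task_switch(ucontext_t *from, ucontext_t *to) {
    int rc = swapcontext(from, to);
    assert(rc == 0);
    (void)rc;
}

static void task_b_entry(void) {
    note('b');
    task_switch(&task_b, &task_a);  /* B is suspended in the middle of this call */
    note('B');
}                                   /* returning resumes uc_link, i.e. task A */

static const char *run_demo(void) {
    trace_len = 0;
    int rc = getcontext(&task_b);
    assert(rc == 0);
    (void)rc;
    task_b.uc_stack.ss_sp = stack_b;
    task_b.uc_stack.ss_size = sizeof stack_b;
    task_b.uc_link = &task_a;
    makecontext(&task_b, task_b_entry, 0);

    note('a');
    task_switch(&task_a, &task_b);  /* comes back once B switches to A */
    note('A');
    task_switch(&task_a, &task_b);  /* B continues right after its own call */
    note('!');
    return trace;
}
```

Calling `run_demo()` leaves `abAB!` in the trace: each task picks up exactly where its own `task_switch` call left off, which is the same thing that happens when the ISR of the new task returns from the task switch code.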
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Kernel preemption during software interrupts

Post by nullplan »

max wrote:What I can't wrap my head around, how does it work if this is interrupted again, say by a timer interrupt? It would be a Ring 0 -> Ring 0 switch so the stack would not be switched.. where would you store the registers now? On top of the stack again? Like...
Yes, it is all pushed on top of the stack again. One difficulty you may encounter if you enable interrupts too early is that some of the work of switching to kernel space might not be done yet, even though the registers indicate that it is. For instance on x86_64, you may need to run the SWAPGS instruction, but only on entry from userspace, which you can typically identify by the privilege level of the pushed CS in the interrupt frame. But if you enable interrupts too early, and then get interrupted between entering the kernel and running SWAPGS, the interrupt handler will see that kernel mode was interrupted, because the pushed CS will be the kernel mode CS. But the GS is not swapped yet.

And in 32-bit mode you can have the same fun switching out the data segment registers.
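
The CS check described above might look roughly like this in C. It is only a sketch: the frame layout is what a 64-bit CPU pushes for an interrupt without an error code, while the selector values `KERNEL_CS` and `USER_CS` and the function names are assumptions for illustration, not taken from any particular kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What the CPU pushes on a 64-bit interrupt (no error code), lowest address first. */
struct interrupt_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};
_Static_assert(sizeof(struct interrupt_frame) == 40, "five 8-byte slots");

/* Hypothetical selector values; a real kernel takes these from its own GDT. */
enum { KERNEL_CS = 0x08, USER_CS = 0x18 | 3 };

/* The low two bits of the saved CS are the privilege level of the interrupted code. */
static bool came_from_user(const struct interrupt_frame *frame) {
    return (frame->cs & 3u) == 3u;
}

/* Only valid if interrupts stay disabled between kernel entry and SWAPGS:
 * otherwise an interrupt in that window sees a kernel CS with the user GS base. */
static bool needs_swapgs(const struct interrupt_frame *frame) {
    return came_from_user(frame);
}
```

The entry stub would call something like `needs_swapgs` before touching any per-CPU data, and make the same decision again on the way out.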

For this reason, and because running out of kernel stack is generally a bad thing, you should always use interrupt gates in your IDT (or task gates where the corresponding TSS's EFLAGS has IF set to 0). Then use STI judiciously (i.e. only with plenty of stack left, and only after finishing the context switch).
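
For reference, here is a sketch of how an interrupt gate could be built for a 64-bit IDT. The descriptor layout and the type values (0xE for an interrupt gate, which clears IF on entry, and 0xF for a trap gate, which does not) come from the architecture; the struct and function names are made up for this example, and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-byte entry of a 64-bit IDT. */
struct idt_gate {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* kernel code segment selector */
    uint8_t  ist;          /* bits 0..2: IST index, 0 = keep current stack */
    uint8_t  type_attr;    /* present bit, DPL, gate type */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));
_Static_assert(sizeof(struct idt_gate) == 16, "64-bit IDT gates are 16 bytes");

enum {
    GATE_PRESENT   = 0x80,
    GATE_INTERRUPT = 0x0E,  /* CPU clears IF on entry */
    GATE_TRAP      = 0x0F   /* CPU leaves IF unchanged on entry */
};

/* Build a ring 0 interrupt gate, so the handler always starts with interrupts off. */
static struct idt_gate make_interrupt_gate(uint64_t handler,
                                           uint16_t code_selector,
                                           uint8_t ist_index) {
    assert(ist_index <= 7);
    struct idt_gate gate = {0};
    gate.offset_low  = (uint16_t)(handler & 0xFFFF);
    gate.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    gate.offset_high = (uint32_t)(handler >> 32);
    gate.selector    = code_selector;
    gate.ist         = ist_index;
    gate.type_attr   = GATE_PRESENT | GATE_INTERRUPT;
    return gate;
}
```

With every entry built this way, the handler decides for itself when, if ever, it is safe to execute STI.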
Carpe diem!
max
Member
Posts: 616
Joined: Mon Mar 05, 2012 11:23 am
Libera.chat IRC: maxdev
Location: Germany

Re: Kernel preemption during software interrupts

Post by max »

Thanks for your replies, it was pretty helpful to understand it better. I'll have to adjust my mutex implementation though so it will also disable interrupts when one or more locks are taken on a processor, so it can't get interrupted and deadlock the processor :P
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: Kernel preemption during software interrupts

Post by nullplan »

Easiest implementation of that: Your lock function returns the previous EFLAGS register before disabling interrupts. Your unlock function then restores that register. That way, interrupts are only re-enabled when the outermost lock is released, and only if they were enabled when it was taken. Of course, that requires the first lock to be taken to also be the last lock to be released, but that is generally a reasonable requirement.
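
A rough sketch of that scheme, written so that it can be compiled and tried in user space. The flags register is simulated by a plain variable here: `read_flags`, `write_flags` and `disable_interrupts` are hypothetical stand-ins for PUSHF/POP, PUSH/POPF and CLI in a real kernel, and the other names are made up as well.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FLAGS_IF (1u << 9)  /* interrupt enable bit in EFLAGS */

/* Simulated flags register; a kernel would use PUSHF/POPF/CLI instead. */
static uint32_t sim_eflags = FLAGS_IF;

static uint32_t read_flags(void)          { return sim_eflags; }
static void write_flags(uint32_t flags)   { sim_eflags = flags; }
static void disable_interrupts(void)      { sim_eflags &= ~FLAGS_IF; }

typedef struct { atomic_flag held; } spinlock_t;

static void spinlock_init(spinlock_t *lock) {
    atomic_flag_clear(&lock->held);
}

/* Returns the flags as they were before interrupts got disabled. */
static uint32_t lock_irqsave(spinlock_t *lock) {
    uint32_t saved = read_flags();
    disable_interrupts();
    while (atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire)) {
        /* spin until the current holder (on another CPU) releases the lock */
    }
    return saved;
}

/* Must be called in reverse order of locking, with the matching saved flags. */
static void unlock_irqrestore(spinlock_t *lock, uint32_t saved) {
    assert(!(read_flags() & FLAGS_IF));  /* interrupts stay off while locked */
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
    write_flags(saved);
}
```

With two nested locks, the inner unlock restores flags that already have IF cleared, so interrupts only come back when the outer unlock restores the flags saved by the first `lock_irqsave`.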
Carpe diem!
bellezzasolo
Member
Posts: 110
Joined: Sun Feb 20, 2011 2:01 pm

Re: Kernel preemption during software interrupts

Post by bellezzasolo »

There are a few considerations here:
If you're dealing with a user->supervisor switch, kernel stack is loaded from TSS.RSP0. You just need a separate kernel stack for each CPU (each needs its own TSS which means a GDT). This is of course the trivial, non-preemptive case.

Now, with a kernel->kernel switch, the stack isn't switched automatically. This is why you need -mno-red-zone. However, on x86-64, we get the IST in the TSS - 7 entries per CPU where you can specify a custom stack to load for a specific interrupt. On 32-bit x86, you don't have the red zone, so, as long as the kernel stack isn't almost consumed, you're fine to handle an interrupt. So perhaps -mno-red-zone isn't necessary...

Of course, the IST only has 7 entries. You could get in trouble if you use it, and have preemption - you could have two interrupts sharing the same IST entry (pigeonhole principle and all that). Then, the stack is reset to the start, which is a bad thing.

This means that you need to update the IST before you STI, to a fresh stack (as you might load a new kernel stack on x86 manually). Or you use the existing kernel stack, which runs the risk of overrun if you deeply nest interrupts.

However, there is one more case to consider. The NMI or MCE case - these can fire when interrupts are disabled. These are cases where the IST is extremely useful (otherwise, you have to load a known good stack on entry)

So, if using the IST, you need an entry for NMI and MCE. You most likely want an IST entry for the page fault as well (if your kernel stack overflows). The rest is really a design choice, but using the IST should avoid red zone issues.
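
To make that concrete, here is a sketch of a per-CPU 64-bit TSS with IST entries for NMI, MCE and page fault. The TSS layout is architectural; which IST index goes to which exception, the stack size, and all names (`tss64`, `cpu_stacks`, `tss_set_ist`, `tss_init_ist`) are assumptions made for this example. The packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit TSS: 104 bytes, one per CPU. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];       /* rsp[0] is RSP0, loaded on user -> kernel entry */
    uint64_t reserved1;
    uint64_t ist[7];       /* IST1..IST7 */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iopb_offset;
} __attribute__((packed));
_Static_assert(sizeof(struct tss64) == 104, "architectural TSS size");

/* Hypothetical assignment of IST slots (1-based, as in the IDT gate field). */
enum { IST_NMI = 1, IST_MCE = 2, IST_PAGE_FAULT = 3 };

#define IST_STACK_SIZE 4096

/* One set of emergency stacks per CPU. */
struct cpu_stacks {
    _Alignas(16) uint8_t nmi[IST_STACK_SIZE];
    _Alignas(16) uint8_t mce[IST_STACK_SIZE];
    _Alignas(16) uint8_t page_fault[IST_STACK_SIZE];
};

/* Stacks grow downwards, so the TSS holds the address just past the buffer. */
static uint64_t stack_top(uint8_t *base, size_t size) {
    return (uint64_t)(uintptr_t)(base + size);
}

static void tss_set_ist(struct tss64 *tss, int index, uint64_t top) {
    assert(index >= 1 && index <= 7);
    assert(top % 16 == 0);
    tss->ist[index - 1] = top;
}

static void tss_init_ist(struct tss64 *tss, struct cpu_stacks *stacks) {
    tss_set_ist(tss, IST_NMI, stack_top(stacks->nmi, sizeof stacks->nmi));
    tss_set_ist(tss, IST_MCE, stack_top(stacks->mce, sizeof stacks->mce));
    tss_set_ist(tss, IST_PAGE_FAULT,
                stack_top(stacks->page_fault, sizeof stacks->page_fault));
}
```

The IDT gates for these vectors then carry the matching IST index, and every other gate leaves it at 0 so that the current kernel stack is used.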
Whoever said you can't do OS development on Windows?
https://github.com/ChaiSoft/ChaiOS
loonie
Posts: 7
Joined: Sat Jul 06, 2019 3:24 pm

Re: Kernel preemption during software interrupts

Post by loonie »

You need many different stacks, for security and to implement different approaches to things without rewriting everything.

A syscall handler that is invoked through the "syscall" instruction needs separate space to save/restore registers. You can set it up so that the syscall handler starts executing with ints disabled, then you manually change stack, then enable interrupts.
To exit syscall handler you manually disable ints (cli), change back to user stack and exit syscall with ints disabled. Ints are re-enabled after "sysret" execution.

If your design allows an IRQ handler or syscall handler to be interrupted, then you need a separate stack every time such an "interruption" kicks in.

Sometimes you need to exit an IRQ handler quickly to achieve reasonable performance.
Take the mouse IRQ, for example. You save regs, do sti, then do the mouse cursor redraw and maybe window events.
During this time additional mouse IRQs may come in. You need to pile up these mouse packets quickly and exit the extra IRQs, and your CPU will continue with the mouse cursor redraw.
After mouse cursor redraw you check if extra mouse packets are worth processing (maybe depends on cursor refresh rate).
Finally you exit cursor redraw code, disable ints, and return back to thread that was interrupted (during this - ints are enabled).
This design requires 2 separate mem locations to save/restore registers per >>> single IDT entry <<<.
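
The "pile up mouse packets" part could be sketched as a small ring buffer: the nested IRQs only push a packet and exit, and the redraw code drains whatever has accumulated. All names here (`mouse_packet`, `packet_queue`, `queue_push`, `queue_drain`) are invented for the example, and it assumes a single CPU where only the IRQ side writes `tail` and only the redraw side writes `head`; a real kernel would also need volatile or atomic accesses, or a brief cli around the drain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 16u  /* power of two, so unsigned wraparound stays consistent */

struct mouse_packet {
    int8_t  dx, dy;      /* relative movement */
    uint8_t buttons;     /* button state bits */
};

struct packet_queue {
    struct mouse_packet slots[QUEUE_SLOTS];
    unsigned head;       /* next slot to read, written by the redraw code */
    unsigned tail;       /* next slot to write, written by the IRQ handler */
};

/* IRQ side: store the packet and get out. Returns false if the queue is full. */
static bool queue_push(struct packet_queue *q, struct mouse_packet p) {
    if (q->tail - q->head == QUEUE_SLOTS)
        return false;    /* drop the packet rather than block inside the IRQ */
    q->slots[q->tail % QUEUE_SLOTS] = p;
    q->tail++;
    return true;
}

struct cursor_delta {
    int dx, dy;          /* movement summed over all drained packets */
    uint8_t buttons;     /* button state of the most recent packet */
    unsigned packets;    /* how many packets were merged */
};

/* Redraw side: merge everything that piled up into one cursor update. */
static struct cursor_delta queue_drain(struct packet_queue *q) {
    struct cursor_delta delta = {0, 0, 0, 0};
    while (q->head != q->tail) {
        struct mouse_packet p = q->slots[q->head % QUEUE_SLOTS];
        q->head++;
        delta.dx += p.dx;
        delta.dy += p.dy;
        delta.buttons = p.buttons;
        delta.packets++;
    }
    return delta;
}
```

After a redraw, the cursor code calls `queue_drain` once more and decides from `packets` whether another redraw is worth doing.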