Reentrant scheduler

OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Reentrant scheduler

Post by OSwhatever »

In practice you can just disable interrupts around the entire scheduler and be "safe" that way, but from an interrupt latency point of view this is bad. To reduce the latency, you instead lock only the regions needed to select the next execution token for the processor in question. This means an interrupt can happen anywhere inside the scheduler and it should still work.
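
As a rough illustration of this fine-grained approach, here is a minimal sketch; all names (local_irq_save, pick_next_thread, switch_to, and so on) are hypothetical, not from any particular kernel.

Code: Select all

/* GCC/Clang, x86_64 freestanding. Interrupts are masked only around
 * run-queue selection instead of across the whole scheduler. */
typedef unsigned long irqflags_t;

static inline irqflags_t local_irq_save(void)
{
    irqflags_t flags;
    __asm__ volatile("pushfq; popq %0; cli" : "=r"(flags) : : "memory");
    return flags;  /* RFLAGS as saved before interrupts were disabled */
}

static inline void local_irq_restore(irqflags_t flags)
{
    __asm__ volatile("pushq %0; popfq" : : "r"(flags) : "memory", "cc");
}

typedef struct { volatile int locked; } spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
        __asm__ volatile("pause");
}

static inline void spin_unlock(spinlock_t *l)
{
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}

struct thread;
extern struct thread *pick_next_thread(void); /* hypothetical run-queue walk */
extern void switch_to(struct thread *next);   /* hypothetical context switch */

static spinlock_t runqueue_lock;

void schedule(void)
{
    irqflags_t flags = local_irq_save();
    spin_lock(&runqueue_lock);
    struct thread *next = pick_next_thread();
    spin_unlock(&runqueue_lock);
    local_irq_restore(flags);

    /* An IRQ may fire right here; the rest of the scheduler
     * has to tolerate that, which is exactly the hard part. */
    if (next)
        switch_to(next);
}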

Everything works fine until an interrupt comes: you run the interrupt handler, and upon exit you run the scheduler in case the interrupt made a higher-priority job runnable. This also works fine, but the problem starts to manifest itself when an interrupt arrives inside that scheduler run, on top of the previous interrupt. First, the context you are running in is probably the interrupt stack, but the second interrupt typically runs in the same context, so you end up with two scheduler executions working on the same thread.

To solve this you could assign a new execution context to each new interrupt, but in order to do that you have to involve the entire memory system, which means some mutex will deadlock you somewhere.

Is there someone here who has solved the problem with a fully reentrant scheduler and fully reentrant interrupt handling?
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Reentrant scheduler

Post by Love4Boobies »

This question is almost entirely nonsense. Did you read up on any of the things you've mentioned? Because it sounds like a bunch of buzzwords strung together.
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: Reentrant scheduler

Post by Korona »

There are several separate problems that you have to solve:

The first one is allowing nested IRQs. x86 allows multiple IRQ priorities (determined by the vector number) and potentially there can be one active IRQ per priority level. In order to allow this you would either have to (i) run IRQs on the usual kernel stack and not via a separate task gate (on i386) or IST stack (on x86_64) or (ii) have some scheme to handle IRQs that overwrite live stacks. Both of these hurdles can possibly be overcome by some engineering. (i) increases the required size of all kernel stacks (which is a problem if your kernel uses one kernel stack per thread) while (ii) is much more difficult to implement correctly. Note that usage of syscall on x86_64 basically demands (ii) because if you run IRQs on the default stack there is a race between stack switching (i.e. movq gs:some_offset, %rsp; swapgs; sysret) and IRQs. This could probably be solved by swapping (no pun intended) swapgs and sysret, at the cost of an additional clobbered register.
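
For reference, this is roughly what option (i) looks like at the IDT level on x86_64: a gate whose IST field is 0 keeps the handler on the current kernel stack, so nested IRQs simply push deeper onto it, while a nonzero IST reuses one fixed stack and would overwrite a live frame on nesting. The set_gate helper below is illustrative, not from any particular kernel.

Code: Select all

#include <stdint.h>

/* x86_64 IDT gate descriptor. Bits 0-2 of "ist" select one of the
 * seven Interrupt Stack Table slots in the TSS; 0 means "stay on the
 * current stack", which is what permits IRQs to nest. */
struct idt_gate {
    uint16_t offset_low;
    uint16_t selector;    /* kernel code segment */
    uint8_t  ist;         /* 0 = current stack, 1-7 = IST slot */
    uint8_t  type_attr;   /* 0x8E = present, ring 0, interrupt gate */
    uint16_t offset_mid;
    uint32_t offset_high;
    uint32_t reserved;
} __attribute__((packed));

static void set_gate(struct idt_gate *g, void (*handler)(void),
                     uint16_t cs, uint8_t ist)
{
    uint64_t addr = (uint64_t)handler;
    g->offset_low  = addr & 0xFFFF;
    g->selector    = cs;
    g->ist         = ist & 0x7;
    g->type_attr   = 0x8E;
    g->offset_mid  = (addr >> 16) & 0xFFFF;
    g->offset_high = (uint32_t)(addr >> 32);
    g->reserved    = 0;
}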

The second problem then becomes the implementation of a lockfree scheduler. While something like the O(1) scheduler can be implemented using lockfree linked lists, advanced schedulers require more complex data structures. I do not know of any competitive lockfree tree implementation that would enable something like a lockfree CFS. Note that lockfree data structures usually perform much worse than locking data structures unless there is heavy contention. Another problem might be serializing access to thread data structures: For example my scheduler locks threads during scheduling in order to make sure that a thread is not killed/ptrace()d/whatever during waiting -> running and running -> waiting state transitions. That would have to be solved in a lockfree manner, too.
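
To make "lockfree linked lists" concrete, here is a minimal C11 sketch of a lock-free ready list. A Treiber stack is shown only because it is the simplest lock-free structure; a real run queue would want FIFO order, and the pop side as written is vulnerable to the classic ABA problem (tagged pointers or hazard pointers would be needed). All names are hypothetical.

Code: Select all

#include <stdatomic.h>
#include <stddef.h>

struct thread {
    struct thread *_Atomic next;
    int priority;
};

static struct thread *_Atomic ready_head;

/* Lock-free push: safe to call concurrently, including from IRQ context. */
void ready_push(struct thread *t)
{
    struct thread *old = atomic_load_explicit(&ready_head, memory_order_relaxed);
    do {
        atomic_store_explicit(&t->next, old, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 &ready_head, &old, t,
                 memory_order_release, memory_order_relaxed));
}

/* Lock-free pop. WARNING: ABA-unsafe as written; illustration only. */
struct thread *ready_pop(void)
{
    struct thread *old = atomic_load_explicit(&ready_head, memory_order_acquire);
    while (old && !atomic_compare_exchange_weak_explicit(
                      &ready_head, &old,
                      atomic_load_explicit(&old->next, memory_order_relaxed),
                      memory_order_acquire, memory_order_relaxed))
        ;
    return old;
}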

The last problem is how to perform a context switch with interrupts enabled. This involves multiple problems: For example on x86_64 the kernel might have to do a swapgs during context switch. The IRQ handler would have to deal with situations where the interrupt occurred after swapgs but before iret. This adds further complications (and thus runtime overhead) to each IRQ dispatch.

Because of these difficulties and overheads I doubt whether such a system would even be desirable.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: Reentrant scheduler

Post by OSwhatever »

Korona wrote:There are several separate problems that you have to solve: [...]
Sorry, I wasn't clear enough about what I meant by "lockfree scheduler". You can use lockfree primitives for a completely lockfree transition, but "quick" locks and brief interrupt disabling are also allowed. This is as opposed to a scheduler where you disable interrupts throughout most of the transition and interrupts are first allowed again "on the other side", in the new context. I'm currently investigating how to avoid disabling interrupts for that long in the scheduler, as that is really "forever" in computer time.

Correct me if I'm wrong, but I think it can be possible if you run in a new context every time, including for interrupts. In this case an IRQ stack isn't enough; you would need a fresh context and stack for every IRQ handler entry. You can solve this by preallocating those threads for the interrupt handlers, but for a system with 256 interrupt priorities, do you preallocate 256 threads? Allocating on demand wouldn't work well in interrupt handlers either, as it would likely involve the memory subsystem and you would end up in deadlocks.
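
Preallocation at boot avoids touching the memory subsystem from IRQ context entirely. A hedged sketch, assuming one statically allocated stack per priority class; the names and sizes are made up (see the next reply for why 14 levels can suffice on x86):

Code: Select all

#include <stdint.h>

#define IRQ_PRIORITY_LEVELS 14    /* usable x86 priority classes */
#define IRQ_STACK_SIZE      4096

/* One boot-time stack per priority level; no allocation ever happens
 * inside an interrupt handler. */
static uint8_t irq_stacks[IRQ_PRIORITY_LEVELS][IRQ_STACK_SIZE]
    __attribute__((aligned(16)));

/* Top of stack for a priority class; the low-level entry stub would
 * switch the stack pointer to this before calling the C handler. */
static inline void *irq_stack_top(unsigned priority)
{
    return irq_stacks[priority] + IRQ_STACK_SIZE;
}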
Korona
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm

Re: Reentrant scheduler

Post by Korona »

x86 supports 14 IRQ priorities: The IRQ priority is determined by the highest 4 bits of the interrupt vector and the first 32 interrupt vectors are reserved. Thus there can only be 14 active IRQs at the same time.
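In code, that mapping is just the top four bits of the vector:

Code: Select all

/* Priority class of an x86 interrupt vector. Vectors 0-31 are reserved
 * for exceptions, so of the 16 classes only 2..15 (14 classes) are
 * available for IRQs. */
static inline unsigned irq_priority_class(unsigned vector)
{
    return vector >> 4;
}
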
OSwhatever wrote:Correct me if I'm wrong, but I think it can be possible if you run in a new context every time, including for interrupts. [...]
That is possible on i386 via task gates. It is not possible on x86_64, though, as there are only 7 IST slots in the TSS and at least two of them are usually required for other purposes, namely the NMI and MCE handlers. You would need to do the stack switch yourself inside the handler code. Doing this should be possible, but I don't think it is easy.
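
The seven-slot limit is visible directly in the 64-bit TSS layout; here is a sketch of the structure (the field names are mine):

Code: Select all

#include <stdint.h>

/* x86_64 Task State Segment. There are exactly seven IST slots;
 * reserving two for the NMI and MCE handlers leaves five, which is
 * not enough for one dedicated stack per IRQ priority class. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];       /* RSP0-RSP2: stacks for privilege changes */
    uint64_t reserved1;
    uint64_t ist[7];       /* IST1-IST7: the only dedicated IRQ stacks */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iopb_offset;  /* I/O permission bitmap offset */
} __attribute__((packed));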
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
zaval
Member
Posts: 656
Joined: Fri Feb 17, 2017 4:01 pm
Location: Ukraine, Bachmut

Re: Reentrant scheduler

Post by zaval »

Instead of that, maybe it's worth making your scheduling tasks synchronized. Speaking in NT parlance, you have the DPC level of IRQL (DISPATCH_LEVEL), which is lower than any hardware IRQL (DIRQL). Dispatching (scheduling) is implemented as DPC routines and runs at DISPATCH_LEVEL, masking out the DISPATCH software interrupt. There is a per-processor DPC queue, which is drained when the IRQL drops below DISPATCH_LEVEL. The sequence looks like this:
1. The dispatcher does its job (IRQL == DISPATCH_LEVEL); an interrupt at DIRQL occurs and interrupts it.
2. The HW ISR adds a new dispatching DPC to the queue (a thread priority lift), then completes and lowers the IRQL.
3. Other pending device IRQs are run.
4. Finally the IRQL returns to DISPATCH_LEVEL and the dispatcher finishes its task. Then it lowers the IRQL and
5. the pending DPC interrupt fires and the DPC queue is drained - here the dispatcher again does its job, rescheduling the thread whose priority was increased, etc.
This way the interrupted scheduler won't be incorrectly called twice or more; its execution is synchronized. A minimal sketch of the pattern follows below.
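
Here is a hedged single-processor sketch of that pattern; the names (queue_dpc, lower_irql, the IRQL values) are illustrative, not the actual NT API.

Code: Select all

#include <stddef.h>

enum irql { PASSIVE_LEVEL = 0, DISPATCH_LEVEL = 2, DEVICE_LEVEL = 3 };

struct dpc {
    struct dpc *next;
    void (*routine)(void *arg);
    void *arg;
};

static enum irql   current_irql = PASSIVE_LEVEL;
static struct dpc *dpc_queue;   /* per-processor in a real kernel */

void raise_irql(enum irql level) { current_irql = level; }

/* Callable from a device ISR running at DIRQL. */
void queue_dpc(struct dpc *d)
{
    d->next = dpc_queue;
    dpc_queue = d;
}

void lower_irql(enum irql level)
{
    /* DPCs run only when the IRQL drops below DISPATCH_LEVEL, so a DPC
     * queued by an ISR that interrupted the dispatcher executes after
     * the dispatcher finishes, never reentrantly. */
    if (current_irql >= DISPATCH_LEVEL && level < DISPATCH_LEVEL) {
        current_irql = DISPATCH_LEVEL;
        while (dpc_queue) {
            struct dpc *d = dpc_queue;
            dpc_queue = d->next;
            d->routine(d->arg);    /* may queue further DPCs */
        }
    }
    current_irql = level;
}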
ANT - NT-like OS for x64 and arm64.
efify - UEFI for a couple of boards (mips and arm). suspended due to the loss of all the target boards (russians destroyed our town).
rdos
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: Reentrant scheduler

Post by rdos »

Another problem is when the scheduler runs on multiple cores, so you can actually schedule several threads at the same time on different cores. Then, of course, you could have IRQs going to a single core, or to multiple cores.

My scheduler has one stack per core (so it can run on multiple cores). When a thread is blocked and execution of a new thread is needed, the scheduler will switch to the core stack and then enable interrupts. The only time interrupts are disabled is inside a few spinlocks. Actually, when things run on multiple cores, enabling and disabling interrupts is not a usable way of synchronizing, so spinlocks are required instead.

A central design issue is how IRQs can interact with the scheduler. Different OSes solve this in different ways. For instance, NT (and probably Linux too) has a strange solution of different IRQ levels. This kind of design makes debugging IRQs almost impossible. RDOS typically uses server threads to handle the bulk of the things that NT and Linux do in IRQs. The central feature that IRQs in RDOS use to interact with the scheduler is the "signal" syscall. It can wake up a (typically high priority) thread that is waiting for an IRQ. This is the feature that makes it possible to move most of the code that other OSes have in IRQs to a server thread. So, basically, the only interaction that IRQs have with the scheduler is to wake up other threads. The scheduler handles this by keeping a wakeup list (protected by a spinlock), and when all IRQs have exited, all threads in the wakeup list are handled; if one of them has a higher priority, it is selected for execution immediately. A sketch of that scheme follows below.
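
A hedged sketch of that interaction, with a per-core nesting counter so the wakeup list is processed exactly once, at the outermost IRQ exit. Every name here is made up, not taken from RDOS.

Code: Select all

#include <stddef.h>

struct thread { struct thread *next; int priority; };

typedef struct { volatile int locked; } spinlock_t;
static inline void spin_lock(spinlock_t *l)
{
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
        ;
}
static inline void spin_unlock(spinlock_t *l)
{
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}

extern void make_runnable(struct thread *t);  /* hypothetical helpers */
extern void maybe_preempt(void);

/* Per-core state in a real kernel. */
static volatile int   irq_nesting;
static struct thread *wakeup_list;
static spinlock_t     wakeup_lock;

/* The "signal" step: an ISR only links the thread into the wakeup
 * list; it never touches the run queue directly. */
void irq_signal(struct thread *t)
{
    spin_lock(&wakeup_lock);
    t->next = wakeup_list;
    wakeup_list = t;
    spin_unlock(&wakeup_lock);
}

void irq_enter(void) { irq_nesting++; }

void irq_exit(void)
{
    if (--irq_nesting > 0)
        return;                 /* nested IRQ; outermost exit does the work */

    spin_lock(&wakeup_lock);
    struct thread *list = wakeup_list;
    wakeup_list = NULL;
    spin_unlock(&wakeup_lock);

    while (list) {
        struct thread *t = list;
        list = t->next;
        make_runnable(t);
    }
    maybe_preempt();            /* switch now if a higher-priority thread woke */
}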