Since event systems seem to be a recurring theme these days, let me share my experience with the fully asynchronous event handling in Managarm.
First, let's talk about IRQs. IRQs give you only
one bit of information: did the IRQ happen or not? The same applies to notifications that poll(), select() or epoll() return. You only get one bit of information per possible event: did this file become readable in the meantime? Did it become writable?
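To make the "one bit of information per possible event" point concrete, here is a plain POSIX poll() call (nothing Managarm-specific): readability and writability each show up as a single flag in revents, with no indication of how much data arrived or how many times the state changed.

Code: Select all

#include <poll.h>
#include <cstdio>

int main() {
    struct pollfd pfd;
    pfd.fd = 0;                    // stdin, just as an example
    pfd.events = POLLIN | POLLOUT; // the one-bit events we care about

    // Block until at least one of the requested bits is set.
    if (poll(&pfd, 1, -1) > 0) {
        if (pfd.revents & POLLIN)
            std::puts("became readable");
        if (pfd.revents & POLLOUT)
            std::puts("became writable");
    }
}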
The first crucial observation is that these kinds of events can be handled in a stateless way (from the kernel's point of view). The main technique required for this is sequence numbers. Sequence numbers are incredibly powerful, even for more complex events. In the case of IRQs, the mechanism works as follows: the IRQ handler increments a sequence number, and threads (whether in the kernel or in userspace) can wait for the sequence number to increase. If two IRQs happened in the meantime, the consumer just sees that the sequence number incremented twice (which makes no difference for well-designed drivers). This mechanism does not require any per-consumer state inside the kernel, since each consumer keeps track of the last sequence number that it saw. Thus, there is no resource exhaustion issue since
the kernel never queues an unbounded number of notifications.¹
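To illustrate the mechanism, here is a minimal sketch in C++ (the names IrqObject, raise() and awaitAfter() are made up for this post and are not Managarm's actual API; a real kernel would use its own wait queues rather than std::condition_variable):

Code: Select all

#include <condition_variable>
#include <cstdint>
#include <mutex>

struct IrqObject {
    // Called from the IRQ handler: bump the sequence number and wake waiters.
    void raise() {
        {
            std::lock_guard<std::mutex> lock{mutex_};
            ++sequence_;
        }
        cv_.notify_all();
    }

    // Called by a consumer: block until the sequence number exceeds the last
    // value this consumer has seen, then return the current value. If several
    // IRQs fired in the meantime, the consumer just observes a larger jump.
    uint64_t awaitAfter(uint64_t lastSeen) {
        std::unique_lock<std::mutex> lock{mutex_};
        cv_.wait(lock, [&] { return sequence_ > lastSeen; });
        return sequence_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    uint64_t sequence_ = 0;
};

// Consumer side: the only per-consumer state is `seen`, and it lives entirely
// with the consumer, not inside the kernel.
void driverLoop(IrqObject &irq) {
    uint64_t seen = 0;
    for (;;) {
        seen = irq.awaitAfter(seen);
        // handle all device work that accumulated since the last wakeup
    }
}

Note that the consumer loop cannot miss an IRQ: if the handler fires between awaitAfter() returning and the next call, the sequence number is already larger than `seen` and the next wait returns immediately.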
A second crucial observation is that a pull-based model is much easier to control than a push-based model w.r.t. resource accounting and concurrency issues. Here, by push-based, I mean a model where the kernel (or any
producer) pushes notifications to
consumers, e.g., by invoking a callback. In a pull-based model, on the other hand, the consumer asks for notifications (e.g., Linux's poll() follows this model). From a concurrency point of view, the main advantage of the pull-based model is that all operations flow in one direction: from the consumer to the producer. Let's visualize the direction of the operations:
Code: Select all
PUSH-BASED
Consumer -------- attach consumer -------> Producer
Consumer <-------- invoke callback ------- Producer [X]
Consumer -------- detach consumer -------> Producer
PULL-BASED
Consumer -------- attach consumer -------> Producer
Consumer ------- pull notification ------> Producer [✓]
Consumer -------- detach consumer -------> Producer
This means that the producer can freely take a lock to process the operation without any chance of deadlocking. In the push-based case, one needs to be careful not to introduce producer -> consumer -> producer deadlocks. It also makes attaching and detaching consumers easier since the attach/detach operations are ordered w.r.t. the operations that pull new notifications from the producer. In the push-based case, some mechanism is required that makes sure that after a consumer is detached, no new notifications will be sent (not even from concurrently running CPUs).
This is hard because producers cannot just call into consumers while holding locks (otherwise, the producer -> consumer -> producer deadlock can occur). Regarding accounting, it is easier to ensure that resource exhaustion does not happen: the number of concurrently queued operations can simply be limited on a per-consumer basis. In almost all situations, the producer only has to buffer a single operation per consumer at a time.
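To sketch what this buys you (hypothetical names, not Managarm's actual interface): the producer below holds its own lock across attach, detach, post and pull, never calls into a consumer, and keeps exactly one buffered notification slot per consumer, so memory usage is bounded by construction.

Code: Select all

#include <cstdint>
#include <mutex>
#include <optional>
#include <unordered_map>

struct Notification { uint64_t payload; };

class Producer {
public:
    void attach(int consumerId) {
        std::lock_guard<std::mutex> lock{mutex_};
        slots_.emplace(consumerId, std::nullopt);
    }

    void detach(int consumerId) {
        // Ordered w.r.t. pull(): once this returns, no notification can reach
        // the consumer anymore, and no callback can race with the teardown.
        std::lock_guard<std::mutex> lock{mutex_};
        slots_.erase(consumerId);
    }

    // Producer side: record at most one pending notification per consumer.
    void post(int consumerId, Notification n) {
        std::lock_guard<std::mutex> lock{mutex_};
        auto it = slots_.find(consumerId);
        if (it != slots_.end())
            it->second = n; // coalesce: the newer state overwrites the older one
    }

    // Consumer side: pull the pending notification, if there is one.
    std::optional<Notification> pull(int consumerId) {
        std::lock_guard<std::mutex> lock{mutex_};
        auto it = slots_.find(consumerId);
        if (it == slots_.end())
            return std::nullopt;
        auto n = it->second;
        it->second.reset();
        return n;
    }

private:
    std::mutex mutex_;
    std::unordered_map<int, std::optional<Notification>> slots_;
};

Note that post() coalesces just like the sequence numbers above: if the consumer has not pulled yet, the newer notification simply overwrites the older one instead of growing a queue.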
¹ As a side note: yes, (level-triggered) IRQs do have to be masked when they are not handled synchronously, but that's an orthogonal issue. Aside from the overhead of two additional writes to the PIC, it does not impact the system's performance.