SMP and the lost + wake-up problem

kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

SMP and the lost + wake-up problem

Post by kzinti »

I have decided that now is the time to get rid of my "big kernel lock" and start addressing a number of things I have been postponing. One of them is the infamous "lost wake-up" problem: you check for a condition and put the thread to sleep, but the condition might change between the time you check it and the time you actually suspend the thread.

Consider this mutex implementation for example:

Code: Select all

class Mutex
{
public:
    Mutex();

    void lock();
    bool try_lock();
    void unlock();

private:
    std::atomic_bool m_lock;
    int              m_owner;
    WaitQueue        m_waiters;
};

Code: Select all

Mutex::Mutex()
:   m_owner(-1)
{
}


void Mutex::lock()
{
    while (!try_lock())
    {
        // TODO: we are suffering from the lost wake-up problem here!
        m_waiters.Suspend(TaskState::Mutex);
    }
}


bool Mutex::try_lock()
{
    if (!m_lock.exchange(true, std::memory_order_acquire))
    {
        m_owner = cpu_get_data(task)->m_id;
        return true;
    }
    else
    {
        return false;
    }
}


void Mutex::unlock()
{
    assert(m_owner == cpu_get_data(task)->m_id);

    m_owner = -1;

    m_lock.store(false, std::memory_order_release);

    m_waiters.WakeupOne();
}
Inside the "Mutex::lock()" method, there is a comment showing where the problem is:

Q: what happens if the owner of the mutex unlocks it between the while() condition and the call to suspend the thread?
A: a lost wake-up. The thread is put to sleep waiting on the mutex to be released, but that will never happen because the mutex is in fact not locked anymore.

Now the solution to this without SMP is pretty straightforward: you disable interrupts between checking the condition and suspending the thread.
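
On a uniprocessor that looks roughly like this (interrupts_disable()/interrupts_enable() are placeholders for whatever cli/sti wrappers the kernel has):

Code: Select all

void Mutex::lock()
{
    // Non-SMP sketch only: with interrupts disabled, nothing can run between
    // a failed try_lock() and the Suspend(), so the wake-up cannot be lost.
    interrupts_disable();
    while (!try_lock())
    {
        // Suspend() is assumed to re-enable interrupts while this thread is
        // switched out and to return with them disabled again.
        m_waiters.Suspend(TaskState::Mutex);
    }
    interrupts_enable();
}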

How do you handle this with SMP though? Use a spinlock, I suppose... So basically you need to lock/unlock the spinlock where you would disable/enable interrupts in the non-SMP case.

So how can I accomplish that? What I think needs to happen is this: you lock the spinlock, you call Suspend(reason, spinlock) and after the thread switch (i.e. sched_schedule()/sched_switch()), you unlock that spinlock. This way no other thread running on some other CPU can release the mutex before your thread is properly put to sleep.

How do you pass the spinlock to the other thread? I am thinking that it will need to be stored in a per-cpu data structure (I have one that I access with the fs or gs register). So basically:

1) Add a spinlock inside the mutex class
2) When unlocking the mutex: lock the spinlock, wakeup a waiting thread if any, unlock the spinlock
3) When locking the mutex: lock the spinlock and try to get the mutex
3a) if that succeeds, great, unlock the spinlock and we are done
3b) if that fails, call suspend with a reference/pointer to the spinlock. Save the spinlock pointer in the per-cpu data structure, sched_schedule()/sched_switch(), and one of the first things to do when resuming the new thread is to check whether the previous thread left a spinlock held, and if so unlock it (see the sketch below).
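
Putting it together, the sketch I have in mind looks something like this (the Spinlock member, the Suspend() overload taking a spinlock, and the per-cpu "pending unlock" slot are all hypothetical at this point):

Code: Select all

void Mutex::lock()
{
    m_spinlock.lock();
    while (!try_lock())
    {
        // Suspend() queues us, switches away, and the *next* thread releases
        // m_spinlock (stashed in the per-cpu data), so unlock() on another
        // CPU cannot run its wakeup before we are really asleep.
        m_waiters.Suspend(TaskState::Mutex, &m_spinlock);
        m_spinlock.lock();  // re-acquired once we are woken up
    }
    m_spinlock.unlock();
}


void Mutex::unlock()
{
    m_spinlock.lock();

    assert(m_owner == cpu_get_data(task)->m_id);
    m_owner = -1;
    m_lock.store(false, std::memory_order_release);
    m_waiters.WakeupOne();

    m_spinlock.unlock();
}

// In the scheduler, right after sched_switch() lands in the new thread:
//     if (Spinlock* pending = cpu_get_data(pendingUnlock))
//     {
//         pending->unlock();
//         cpu_set_data(pendingUnlock, nullptr);
//     }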

Sounds like this would work... But I was wondering if people had some feedback / other ideas.
nullplan
Member
Posts: 1790
Joined: Wed Aug 30, 2017 8:24 am

Re: SMP and the lost + wake-up problem

Post by nullplan »

kzinti wrote:One of them is the infamous "lost wake-up" problem: you check for a condition and put the thread to sleep, but the condition might change between the time you check it and the time you actually suspend the thread.
One solution to this is to take a spinlock (disabling interrupts) before checking the condition and before changing it. If the condition means you have to go to sleep, you somehow register as waiter on that condition. Then you release the spinlock but don't necessarily restore interrupts. Instead you enter the scheduler. If the scheduler schedules another task, it restores the interrupt flag that it had, and if it goes to sleep, it does so with "sti; hlt;".

The manipulator also takes the spinlock before changing the condition. If there is a waiting list for the condition, the manipulator then sends wakeup IPIs to the waiters. If the manipulator manages to get in between the other process releasing the spinlock and going to sleep, the disabled interrupts block the IPI until interrupts are re-enabled. Since interrupts are only re-enabled one instruction before halting, there is no window in which the IPI can hit the waiting process without interrupting the "hlt".
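
The critical detail is that "sti" only takes effect after the following instruction has started, so a wakeup IPI that arrives while interrupts are still off stays pending and then interrupts the "hlt" as soon as they come back on. The sleep path is essentially just this (x86, GCC-style inline assembly):

Code: Select all

// Assumes interrupts are currently disabled (they were disabled together
// with the spinlock). Because "sti" defers interrupt delivery by one
// instruction, a pending wakeup IPI is delivered while we sit in "hlt" and
// wakes us up; there is no window in which it can be lost.
static inline void sleep_until_interrupt()
{
    asm volatile("sti; hlt" ::: "memory");
}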

However, for a mutex, what you really want is a futex. Admittedly this is a bit of a cop-out. Essentially, you encapsulate all of these synchronization measures inside one data structure, and then you only have to get it right once and then it never has to bother you again. The interface for a futex is that a futex is a pointer to an int. You can wait on it, specifying a value that the int should have, and you can wake it, waking up the threads that are waiting. The interface is such that if one thread waits with a specific value, and another simultaneously changes the value and calls the wake function, the waiter will definitely wake up or not even go to sleep.

The implementation is that there is a global linked list of futex waiters (for process shared futexes). The waiter takes a spinlock for that list, then checks that the futex still has the correct value. If it doesn't, the lock is released and the waiter doesn't even suspend. But if it does, it registers a waiter structure in the list, sets its own process state to interruptible sleep, releases the spinlock, and calls the scheduler. The waker also takes the spinlock, looks for the correct address in there, and wakes up a waiter by calling a scheduler primitive that will correctly wake up a sleeping process, no matter where the scheduler is right now.
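
In rough pseudo-C++ the waiter side looks like this (Thread, Spinlock, current_thread(), schedule() and the list itself are made-up kernel primitives; this is the shape of the algorithm, not Linux's actual code):

Code: Select all

struct FutexWaiter
{
    atomic_int*  addr;     // the futex address being waited on
    Thread*      thread;   // the sleeping thread
    FutexWaiter* next;
};

// One global list protected by one spinlock (a real kernel hashes by address).
static Spinlock     g_futexLock;
static FutexWaiter* g_futexWaiters;

int futex_wait(atomic_int* addr, int expected)
{
    g_futexLock.lock();

    // Re-check under the lock: if the value already changed, the waker has
    // been here (or will see the new value), so we must not sleep.
    if (addr->load(std::memory_order_relaxed) != expected)
    {
        g_futexLock.unlock();
        return -1;  // caller re-reads the value and retries
    }

    // The waiter structure lives on the sleeping thread's kernel stack.
    FutexWaiter waiter{ addr, current_thread(), g_futexWaiters };
    g_futexWaiters = &waiter;
    current_thread()->state = ThreadState::InterruptibleSleep;

    g_futexLock.unlock();
    schedule();     // futex_wake() uses a wakeup primitive that works even
                    // if we have not finished switching away yet

    // (removing ourselves from the list on the way out is omitted here)
    return 0;
}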

Obviously I'm glossing over some things here. Linux implements an optimization where the futexes can use a process-local list instead of a global one if they only want to talk among local threads. The address has to be a physical address if it is going to be shared between processes, but that also means the page has to be locked into memory (else the physical address corresponding to a given virtual one might change). And getting that scheduler wakeup primitive correct is a hard one.

But once you have that, a mutex is simple:

Code: Select all

class Mutex {
private:
  atomic_int lck = 0;
public:
  Mutex() = default;
  Mutex(const Mutex&) = delete;
  int try_lock() { return cas(&lck, 0, -1) == 0; }
  void lock() {
    while (!try_lock()) {
      futex_wait(&lck, -1); // sleep only while the lock is still held (-1)
    }
  }
  void unlock() {
    lck = 0;
    futex_wake(&lck, 1);
  }
};
This can also be improved, though it gets a bit more complicated, to avoid the futex_wake() call when there was no contention. You can copy the idea from musl if you want.
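
For the record, the musl-style idea is to keep the contention state in the lock word itself: 0 = unlocked, 1 = locked with no waiters, 2 = locked and possibly contended, so unlock() only pays for futex_wake() when the old value was 2. A sketch of that (not musl's actual code):

Code: Select all

class Mutex {
private:
  atomic_int lck = 0;   // 0 = free, 1 = locked, 2 = locked with waiters
public:
  void lock() {
    int c = 0;
    if (lck.compare_exchange_strong(c, 1))
      return;                       // uncontended fast path
    if (c != 2)
      c = lck.exchange(2);          // mark the lock as contended
    while (c != 0) {
      futex_wait(&lck, 2);          // sleep only while it is still contended
      c = lck.exchange(2);
    }
  }
  void unlock() {
    if (lck.exchange(0) == 2)       // only issue the wake under contention
      futex_wake(&lck, 1);
  }
};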
Carpe diem!
rdos
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: SMP and the lost + wake-up problem

Post by rdos »

Interesting to learn how futexes work in Linux. I have a variant of this where the futex is only created in kernel space if there is contention for the mutex in user space. When there is contention, a kernel object for the futex is allocated that has a wait list and a spinlock. I don't keep futexes in lists (instead, the caller needs to know the handle), and the only limitation is that they don't work between user processes.
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: SMP and the lost + wake-up problem

Post by kzinti »

nullplan wrote:One solution to this is to take a spinlock (disabling interrupts) before checking the condition and before changing it. If the condition means you have to go to sleep, you somehow register as waiter on that condition. Then you release the spinlock but don't necessarily restore interrupts. Instead you enter the scheduler. If the scheduler schedules another task, it restores the interrupt flag that it had, and if it goes to sleep, it does so with "sti; hlt;".
This doesn't sound realistic to me. Why would I want the new task to halt? I want that CPU to keep going. It should not wait for some other CPU to release the mutex (or other synchronization primitive).

I used a mutex for illustrative purposes. I have the same problem with semaphores, condition variables, wait queues, and so on. Basically, what is needed is a "suspend + unlock the spinlock" in one atomic operation. In user space you would use a condition variable, but in kernel space a condition variable would suffer from the same problem.
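
Interface-wise, what I am picturing is something like this (names are just what I would call them, nothing is implemented yet):

Code: Select all

class WaitQueue
{
public:
    // Hypothetical primitive: enqueue the current thread, switch away, and
    // only then release the spinlock protecting the condition, so a waker on
    // another CPU cannot slip in between the check and the sleep.
    void Suspend(TaskState reason, Spinlock* conditionLock);
    void WakeupOne();
};

// Caller pattern:
//     conditionLock.lock();
//     while (!condition)
//     {
//         waitQueue.Suspend(TaskState::Mutex, &conditionLock);
//         conditionLock.lock();   // re-taken once we are woken up
//     }
//     conditionLock.unlock();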
nullplan wrote:However, for a mutex, what you really want is a futex. Admittedly this is a bit of a cop-out. Essentially, you encapsulate all of these synchronization measures inside one data structure, and then you only have to get it right once and then it never has to bother you again.
I was going to look into futexes next for my user space implementation. If futexes can solve the lost wake-up problem then I might just implement them now and use them as the primitive for all my synchronization needs inside the kernel.

Thanks for the feedback!
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: SMP and the lost + wake-up problem

Post by thewrongchristian »

kzinti wrote: Now the solution to this without SMP is pretty straightforward: you disable interrupts between checking the condition and suspending the thread.

How do you handle this with SMP though? Use a spinlock, I suppose... So basically you need to lock/unlock the spinlock where you would disable/enable interrupts in the non-SMP case.

So how can I accomplish that? What I think needs to happen is this: you lock the spinlock, you call Suspend(reason, spinlock) and after the thread switch (i.e. sched_schedule()/sched_switch()), you unlock that spinlock. This way no other thread running on some other CPU can release the mutex before your thread is properly put to sleep.

How do you pass the spinlock to the other thread? I am thinking that it will need to be stored in a per-cpu data structure (I have one that I access with the fs or gs register). So basically:

1) Add a spinlock inside the mutex class
2) When unlocking the mutex: lock the spinlock, wakeup a waiting thread if any, unlock the spinlock
3) When locking the mutex: lock the spinlock and try to get the mutex
3a) if that succeeds, great, unlock the spinlock and we are done
3b) if that fails, call suspend with a reference/pointer to the spinlock. Save the spinlock pointer in the per-cpu data structure, sched_schedule()/sched_switch(), and one of the first things to do when resuming the new thread is to check whether the previous thread left a spinlock held, and if so unlock it.

Sounds like this would work... But I was wondering if people had some feedback / other ideas.
Rather than a bare spinlock, my lowest level primitive is a spinlock based monitor.

Basically, when you "enter" the monitor, you lock the spinlock (while also disabling interrupts).

While in the monitor, you can leave (unlocking the spinlock), signal the monitor (waking up other threads waiting for this monitor), or wait for a signal (going to sleep while also releasing the spinlock).

So my monitor structure is:

Code: Select all

struct interrupt_monitor_t {
  spinlock_t spin;
  thread_t owner; // For debugging and deadlock detection
  thread_queue_t waiting;
};
The act of waiting is an atomic sleep and spin lock release, allowing other threads to now enter the monitor. No need to pass spinlock ownership between threads, and no lost wakeups. In pseudo code:

Code: Select all

void interrupt_monitor_wait(interrupt_monitor_t * monitor)
{
  add current thread to monitor->waiting;
  interrupt_monitor_leave(monitor);
  // Unlocked at this point, and the current thread is asleep on monitor->queue
  reschedule();
  // Another thread has signaled us from monitor->waiting
  interrupt_monitor_enter(monitor);
}
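The other operations are thin wrappers around the spin lock and the queue; roughly (spin_lock_irq()/spin_unlock_irq(), thread_queue_pop() and sched_ready() are stand-ins for whatever your kernel calls them):

Code: Select all

void interrupt_monitor_enter(interrupt_monitor_t * monitor)
{
  /* Disable interrupts on this CPU and take the spin lock */
  spin_lock_irq(&monitor->spin);
}

void interrupt_monitor_leave(interrupt_monitor_t * monitor)
{
  /* Release the spin lock and re-enable (or restore) interrupts */
  spin_unlock_irq(&monitor->spin);
}

void interrupt_monitor_signal(interrupt_monitor_t * monitor)
{
  /* Caller is inside the monitor: move one waiter (if any) from
     monitor->waiting back onto the run queue */
  thread_t * waiter = thread_queue_pop(&monitor->waiting);
  if (waiter) {
    sched_ready(waiter);
  }
}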
Being spinlock based with interrupts disabled, it's SMP and interrupt safe, and is the primary means of synchronization between device drivers and their interrupt handlers.

It's no heavier weight than regular spinlocks in the common case of just entering/leaving the monitor, and it can be used to build higher-level objects. I use it to build a higher-level monitor that doesn't keep the spinlock held or interrupts disabled and that handles recursive use, as well as a reader/writer lock.

Conveniently, since it handles waiting processes, it is in fact the only structure I use to manage them: everything that is sleeping is waiting on one of these structures, otherwise it is on the run queue.

You might use it something like:

Code: Select all

  interrupt_monitor_enter(monitor);
  start_device();
  while(status != interrupt_complete) {
    interrupt_monitor_wait(monitor);
  }
  interrupt_monitor_leave(monitor);
In this case, the condition is waiting for an interrupt to have occurred (as indicated by status), so you'd have your interrupt handler do something like:

Code: Select all

  interrupt_monitor_enter(monitor);
  status = device_get_status();
  interrupt_monitor_signal(monitor);
  interrupt_monitor_leave(monitor);
All my interrupt handling uses these locks to co-ordinate between driver and interrupt handler.

I don't do SMP yet, but I don't think there are any SMP holes in my current implementation (assuming my spinlock is SMP safe.)
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: SMP and the lost + wake-up problem

Post by kzinti »

@thewrongchristian

That's interesting, because I am essentially building the same primitive as your monitor and call it a "WaitQueue". But it does suffer from the lost wake-up problem, hence my question here.

https://github.com/kiznit/rainbow-os/bl ... tqueue.hpp
https://github.com/kiznit/rainbow-os/bl ... tqueue.cpp

Question from your post:

Code: Select all

void interrupt_monitor_wait(interrupt_monitor_t * monitor)
{
  add current thread to monitor->waiting;
  interrupt_monitor_leave(monitor);
  // Unlocked at this point, and the current thread is asleep on monitor->queue
  reschedule();
  // Another thread has signaled us from monitor->waiting
  interrupt_monitor_enter(monitor);
}
You say "Unlocked at this point, and the current thread is asleep on monitor->queue". How can this be true? If the current thread is sleeping, it is not running. Both cannot be true.

How would you implement a mutex using your monitor primitive? How would you block & wait without losing any "signal" from a thread unlocking the mutex on a different CPU?
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: SMP and the lost + wake-up problem

Post by thewrongchristian »

kzinti wrote:@thewrongchristian

That's interesting, because I am essentially building the same primitive as your monitor and call it a "WaitQueue". But it does suffer from the lost wake-up problem, hence my question here.

https://github.com/kiznit/rainbow-os/bl ... tqueue.hpp
https://github.com/kiznit/rainbow-os/bl ... tqueue.cpp

Question from your post:

Code: Select all

void interrupt_monitor_wait(interrupt_monitor_t * monitor)
{
  add current thread to monitor->waiting;
  interrupt_monitor_leave(monitor);
  // Unlocked at this point, and the current thread is asleep on monitor->queue
  reschedule();
  // Another thread has signaled us from monitor->waiting
  interrupt_monitor_enter(monitor);
}
You say "Unlocked at this point, and the current thread is asleep on monitor->queue". How can this be true? If the current thread is sleeping, it is not running. Both cannot be true.
If you were to examine the "state" of this thread at this point, it would be THREAD_SLEEPING, and it's on a doubly linked list that forms the queue for the monitor.

Sure, its PC is still advancing, it's still running on its own stack, and is itself calling reschedule, but wherever you do that task switch, the outgoing thread still needs to get to the point that the task switch occurs.

But the point is that it is on the sleep queue before the structure is unlocked. Upon calling interrupt_monitor_leave(monitor), it is true that an interrupt could come in and whisk the code away into an interrupt handler, but that's OK, because an interrupt handler cannot assume any current thread state, and cannot sleep, so it cannot end up on another sleep queue. All it can do is potentially wake up another thread.

But, as I write this, it's made me think this may be the source of some thread queue corruption I'm seeing, which is why the linked code below is in a state of flux.

EDIT: Bingo, I've been banging my head against this for a couple of evenings, and disabling the recently added kernel level pre-emption sorts it. Pre-emption is done on interrupt return, so I'm being pre-empted at exactly this point and this is corrupting my lists. Thanks! (this is why it's good to chat about this stuff)
kzinti wrote: How would you implement a mutex using your monitor primitive? How would you block & wait without losing any "signal" from a thread unlocking the mutex on a different CPU?
My full source can be found here (warning: this code isn't fully working, I still have an unknown race condition in my monitor implementation, as I've rewritten much of it in the last week to focus all the sleep logic in the interrupt_monitor_t structure).

My mutex is the same as my monitor, so the mutex lock is just a call to the following function:

Code: Select all


struct interrupt_monitor_t {
	spin_t spin;
	thread_t * volatile owner;
	thread_t * volatile waiting;
	timer_event_t timer[1];
};

struct monitor_t {
	interrupt_monitor_t lock[1];
	thread_t * volatile owner;
	volatile int count;
};

void monitor_enter(monitor_t * monitor)
{
	INTERRUPT_MONITOR_AUTOLOCK(monitor->lock) {
		thread_t * thread = arch_get_thread();

		while(monitor->owner && thread != monitor->owner) {
			/* Locked by someone else */
			interrupt_monitor_wait(monitor->lock);
		}
		monitor->owner = thread;
		monitor->count++;
	}
	assert(monitor->owner == arch_get_thread());
}
Here, INTERRUPT_MONITOR_AUTOLOCK is a wrapper macro around my primitive interrupt_monitor_t, which bookends the enclosed block with enter/leave calls to the monitor. If I were using C++, I'd be using RAII.

For a mutex/monitor, we hold the interrupt_monitor_t spin lock while we inspect/change the monitor/mutex ownership and state, and upon leaving monitor_enter() the spin lock is no longer held. The rest of the monitor state stops other threads from entering an owned mutex/monitor.
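
In C++ the same bookending could be done with a small RAII guard instead of the macro, something like this (sketch only):

Code: Select all

class InterruptMonitorLock
{
public:
    explicit InterruptMonitorLock(interrupt_monitor_t * monitor)
        : m_monitor(monitor)
    {
        interrupt_monitor_enter(m_monitor);
    }

    ~InterruptMonitorLock()
    {
        interrupt_monitor_leave(m_monitor);
    }

    InterruptMonitorLock(const InterruptMonitorLock&) = delete;
    InterruptMonitorLock& operator=(const InterruptMonitorLock&) = delete;

private:
    interrupt_monitor_t * m_monitor;
};

// Usage:
//     {
//         InterruptMonitorLock guard(monitor->lock);
//         /* inspect / change monitor state */
//     }   // leave() runs here, even on early exit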
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: SMP and the lost + wake-up problem

Post by kzinti »

I am starting to think that my WaitQueue works just fine and it's the fact that I decided to build my mutex on top of it that is causing problems.

In your case, you are using the same mechanism for both (and not building a mutex on top of the monitor). That's where I think the difference is.

I'll spend some time thinking about this and looking at futexes. Thanks for the discussion.
Last edited by kzinti on Thu Jan 28, 2021 5:05 pm, edited 1 time in total.
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: SMP and the lost + wake-up problem

Post by thewrongchristian »

kzinti wrote:OK so let's say that thread B has the mutex and you are thread A.

You put the current thread A on the wait list, release the spinlock and call reschedule().

How do you wake up that blocked thread A when thread B releases the mutex?

Specifically what happens if B releases the mutex after your checks in the while() loop and before reschedule?

Code: Select all

while(monitor->owner && thread != monitor->owner) {
         /* Locked by someone else */
         interrupt_monitor_wait(monitor->lock);
      }


void interrupt_monitor_wait(interrupt_monitor_t * monitor)
{
  add current thread to monitor->waiting;
  interrupt_monitor_leave(monitor);
  // Unlocked at this point, and the current thread is asleep on monitor->queue

--> B releases the mutex here, then what?

  reschedule();
  // Another thread has signaled us from monitor->waiting
  interrupt_monitor_enter(monitor);
}

This is my monitor_leave:

Code: Select all

void monitor_leave(monitor_t * monitor)
{
	assert(monitor->owner == arch_get_thread());
	INTERRUPT_MONITOR_AUTOLOCK(monitor->lock) {
		monitor->count--;
		if (0 == monitor->count) {
			monitor->owner = 0;
			interrupt_monitor_signal(monitor->lock);
		}
	}
}
When B executes this code to release the monitor, interrupt_monitor_signal() is called while protected by monitor->lock, which is the spinlock-based monitor. At this point, we own this spin lock, and we know (certainly in the case of my currently UP kernel) that thread A cannot own monitor->lock. In the SMP case, thread A might actually still be running on another processor, between unlocking its hold on monitor->lock and reaching reschedule(), but reschedule has its own spin lock protecting the run queue, so there is no race to the ready queue.

Either:
  • A will reach the run queue lock first, in which case another thread (say thread C) will be scheduled and start running, after which thread B can take thread A off the wait queue and put it back onto the run queue.

  • B will reach the run queue lock first, in which case it will take thread A off the wait queue and deposit it on the run queue before A gets the run queue lock. In that case, by the time A gets the run queue lock, it is back on the run queue anyway and may well pick itself to schedule next.
I'm not saying there is no lost wakeup. In the course of this discussion I've already found that bug with pre-emption. But I can't think of a scenario where I can lose a wakeup.
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: SMP and the lost + wake-up problem

Post by kzinti »

Sorry I deleted the message you replied to. I think your monitor works fine. See my last post.
nexos
Member
Posts: 1081
Joined: Tue Feb 18, 2020 3:29 pm
Libera.chat IRC: nexos

Re: SMP and the lost + wake-up problem

Post by nexos »

Why do you need that loop at all? IMO it would be better to use something more like a semaphore. In my OS, I plan on using semaphores for all synchronization. This does cause the problem of convoys (which I am currently researching, trying to find a solution for), but it avoids the lost wakeup.
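
The point of the count is that an up() is never forgotten even when nobody is waiting yet; the check-and-sleep itself still has to be protected, so this assumes the wait queue provides the atomic "suspend + release the lock" discussed earlier in the thread. A sketch, borrowing kzinti's WaitQueue-style names (all placeholders):

Code: Select all

struct Semaphore
{
    Spinlock  lock;
    int       count;
    WaitQueue waiters;
};

void sem_down(Semaphore& sem)
{
    sem.lock.lock();
    while (sem.count == 0)
    {
        // Enqueue, release sem.lock and sleep as one step, so an up() on
        // another CPU cannot post and wake between the check and the sleep.
        sem.waiters.Suspend(TaskState::Semaphore, &sem.lock);
        sem.lock.lock();    // re-taken when we are woken
    }
    --sem.count;
    sem.lock.unlock();
}

void sem_up(Semaphore& sem)
{
    sem.lock.lock();
    ++sem.count;            // the count remembers the post even with no waiters
    sem.waiters.WakeupOne();
    sem.lock.unlock();
}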
"How did you do this?"
"It's very simple — you read the protocol and write the code." - Bill Joy
Projects: NexNix | libnex | nnpkg
thewrongchristian
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: SMP and the lost + wake-up problem

Post by thewrongchristian »

nexos wrote:Why do you need that loop at all? IMO it would be better to use something more like a semaphore. In my OS, I plan on using semaphores for all synchronization. This does cause the problem of convoys (which I am currently researching, trying to find a solution for), but it avoids the lost wakeup.
Which loop? This one?

Code: Select all

while(monitor->owner && thread != monitor->owner) {
         /* Locked by someone else */
         interrupt_monitor_wait(monitor->lock);
}
This is just the mechanism used to put the thread to sleep waiting for the mutex to become free. You'll need an equivalent in the semaphore down() operation if the semaphore value is already 0.
nexos
Member
Posts: 1081
Joined: Tue Feb 18, 2020 3:29 pm
Libera.chat IRC: nexos

Re: SMP and the lost + wake-up problem

Post by nexos »

I wasn't talking about your implementation :D . Your implementation looks quite good to me!
"How did you do this?"
"It's very simple — you read the protocol and write the code." - Bill Joy
Projects: NexNix | libnex | nnpkg
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: SMP and the lost + wake-up problem

Post by kzinti »

I might have found another problem... You could end up with a CPU picking up a "suspended" thread that is not yet really suspended:

Code: Select all

CPU 1 thread A                  CPU 2 thread B

A: enter monitor
A: try to get mutex (fail)
A: queue in wait list
A: leave monitor
                                 B: enter monitor
                                 B: release mutex
                                 B: wakeup thread A
                                 B: leave monitor
                                 B: schedule()
                                 A: thread A starts to run
A: schedule()
C: thread C starts to run
How would your code prevent the above from happening? This is basically revisiting something I previously mentioned: when you add the current thread to the monitor's wait list, it is still running. The CPU is still using its stack and address space at a minimum. That stack should not be used by another CPU until your thread switch completes on the current CPU.

PS: I am not trying to point out flaws in your implementation, I am just trying to fix mine and understand how to do it :).
Gigasoft
Member
Posts: 856
Joined: Sat Nov 21, 2009 5:11 pm

Re: SMP and the lost + wake-up problem

Post by Gigasoft »

You don't need a "switch to thread and release spinlock" function. It is fine to add a thread as waiting or ready even if it is still running. Threads should have a state variable that indicates that the thread is running. When trying to switch to a thread, you have to wait until it is no longer running, using a busy loop. This case should only happen rarely.
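
A sketch of that idea (field and helper names are made up): the flag stays set until the context switch away from a thread has fully completed, and whoever switches to that thread spins briefly if the flag is still set:

Code: Select all

// 'running' is an std::atomic<bool> in the task structure. It is set while a
// thread's registers/stack are still in use by the CPU it last ran on, and
// cleared only once the switch away from it has fully completed.

void sched_switch_to(Task* prev, Task* next)
{
    // 'next' may have been made ready before its old CPU finished switching
    // away from it. Wait for that to complete before touching its stack.
    // This should only happen rarely, so a busy loop is fine.
    while (next->running.load(std::memory_order_acquire))
        cpu_pause();

    next->running.store(true, std::memory_order_relaxed);
    switch_context(prev, next);
}

// Runs on the new thread's stack, immediately after the low-level switch:
void sched_finish_switch(Task* prev)
{
    prev->running.store(false, std::memory_order_release);
}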