Quick question: why have you implemented mutexes and not semaphores? As I'm sure you know, semaphores don't require much more effort and you can implement mutexes in terms of semaphores very easily.
Yes, I have mutexes, but you're right, it's easy just to add a counter to implement semaphores. I just wanted to make sure that I can build synchronization primitives on top of my scheduler interface (and started with mutex as the simplest and widely used one).
What do you do right after resuming a thread that is waiting on a mutex?
I have a list of paused threads, each thread is a structure holding a CPU context (and other stuff). This list is guarded by a spinlock. If a new thread tries to acquire mutex, it checks this list. If it isn't empty the thread adds itself to list and pauses (removes itself from scheduling queue). When some thread releases the mutex, it removes a next waiting thread from the mutex list and inserts it to the scheduler queue, so next quantums it will be scheduled. This is simplified explanation, there are many details I have thrown off.
Which 16 byte structure is then allocated on the heap only to be freed a couple of lines later on?
Didn't exactly understand your question, but if you emphasize inefficiency in memory allocation, I can respond that all allocations are done through memory pool (fixed sized chunks, yes it's still not implemented, just a temporary stub directly calling malloc/free), so this should be very efficient.