Multiprocessor-friendly Tree Based Allocator

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
Post Reply
wxwisiasdf
Member
Member
Posts: 34
Joined: Sat Sep 07, 2019 5:17 pm
Libera.chat IRC: Superleaf1995

Multiprocessor-friendly Tree Based Allocator

Post by wxwisiasdf »

In simple terms, this tree based allocate creates locks based on used nodes, the root node is oversaturated at start but a cache could be implemented so sizes >= cached_node[index] could be retrieved to make the recursion faster.

There are two types of lock:
read lock: "please anyone don't read anything because i will be writing over here"
write lock: "please anyone don't write anything because i will be reading over here"

Locks are created per node and unlocked when node is no longer used.

The tree does not self-balance, to make initial insertions faster

Any toughs of this?

Also happy new year! \o/
:-)
User avatar
bzt
Member
Member
Posts: 1584
Joined: Thu Oct 13, 2016 4:55 pm
Contact:

Re: Multiprocessor-friendly Tree Based Allocator

Post by bzt »

SuperLeaf1995 wrote:Any toughs of this?
Why do you need read locks? It's enough to have only write locks. If the lock is free, nodes can be read. If it's set, then further reads and writes are put on hold.
SuperLeaf1995 wrote:Also happy new year! \o/
To you too!

Cheers,
bzt
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Multiprocessor-friendly Tree Based Allocator

Post by Korona »

Non-balanced search trees are not that interesting for OSdev since you usually do care about worst-case behavior (e.g., in real-time scenarios). Search trees that use rotations are quite hard to use a in scalable way (scalable to multiple CPUs). The issue is that it is not safe to traverse the tree during rotations, even if the write side does use locks. In contrast, radix trees allow for completely wait-free readers (even if the write side still uses locks).
Last edited by Korona on Fri Jan 01, 2021 11:02 am, edited 1 time in total.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
thewrongchristian
Member
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: Multiprocessor-friendly Tree Based Allocator

Post by thewrongchristian »

SuperLeaf1995 wrote:In simple terms, this tree based allocate creates locks based on used nodes, the root node is oversaturated at start but a cache could be implemented so sizes >= cached_node[index] could be retrieved to make the recursion faster.
I'm not sure I understand the purpose here. Is your lock API based on something like:

Code: Select all

// Lock whatever p points to
lock(void * p);
// Unlock whatever p points to
unlock(void * p);
Where p will be managed in your tree, and the actual lock implementation based on the tree nodes?

This is something I also implemented, and intend to look at again, but I ended up just using inline synchronization primitives.

It is useful, though, for synchronization on data you may not necessarily have control over.
SuperLeaf1995 wrote: There are two types of lock:
read lock: "please anyone don't read anything because i will be writing over here"
write lock: "please anyone don't write anything because i will be reading over here"
Sounds like a standard read/write lock, but you have the terms the wrong way round.

Read lock: "please anyone don't write anything because i will be reading over here"
Write lock: "please anyone don't read/write anything because i will be writing over here"

So, you get a read lock when you want to read some data without fear of it being updated, and get a write lock when you want to update some data, without fear of corrupting other readers. Writers block readers, but multiple readers can read aiding in scalability for data that is not updated often.
SuperLeaf1995 wrote: Locks are created per node and unlocked when node is no longer used.

The tree does not self-balance, to make initial insertions faster

Any toughs of this?
Korona wrote: Non-balanced search trees are not that interesting for OSdev since you usually do care about worst-case behavior (e.g., in real-time scenarios). Search trees that use rotations are quite hard to use a in scalable way.
Worst case, the tree could degrade to a linked list, which would mean O(N) to find the node you want. For small N, this may not be a problem.

Note, also, that not balancing will NOT necessarily make insertions faster, as you'll still have to find where to insert, and for the worst case, this could mean O(N) traversal of the entire tree. Flattening the tree is critical for performance of inserts for this reason, and the extra time spent flattening could remove these worst case scenarios and improve performance of inserts as well.

But, you can replace a balanced tree implementation with a splay tree, which as well as being simpler to implement than a balanced tree, also keeps recently accessed nodes near the root of the tree, making contended or recent nodes quick to find.

BST form the basis of almost all my ordered data structures in my kernel, with balance algorithm chosen at tree creation time (currently, I use splay and treap.)

However, note also that each pointer is stand-alone data, unrelated to other pointers, so you don't actually need to keep them in an ordered list. A hash implementation might be a better choice, with a well chosen hash giving O(1) average access to the lock in question.
thewrongchristian
Member
Member
Posts: 426
Joined: Tue Apr 03, 2018 2:44 am

Re: Multiprocessor-friendly Tree Based Allocator

Post by thewrongchristian »

SuperLeaf1995 wrote:In simple terms, this tree based allocate creates locks based on used nodes, the root node is oversaturated at start but a cache could be implemented so sizes >= cached_node[index] could be retrieved to make the recursion faster.

Locks are created per node and unlocked when node is no longer used.

The tree does not self-balance, to make initial insertions faster

Any toughs of this?
Actually, I think I completely misinterpreted what you're trying to achieve.

You're writing a memory allocator, and want to do the locking based on the tree of existing allocations, is that right?

The tree would then allow you to quickly free memory using the passed in pointer to find the allocation node.

And you want it to be MP friendly, which means not locking nodes unnecessarily.

Based on that, my thoughts are similar to what I wrote previously in terms of tree implementation and balancing. You'll want to keep your tree balanced but as Korona noted, balancing in a scalable way is tricky, as you'll need write locks on individual nodes.

If you can do some sort of RCU, however, you can do the balancing in a lock-less manner, which will help scalability. RCU will allow you to do the rotations by basically doing copies of changed nodes. An example paper on such a tree:

http://people.csail.mit.edu/nickolai/pa ... bonsai.pdf

But, personally, I'd first implement the allocator using a simple BST with at least some balancing, with a single lock. Then measure how much lock contention you end up with based on this single lock (simple contended/uncontended counts might be sufficient), then address specific bottlenecks as and when necessary.

First, make it work. Then, make it work fast.

But, to be truly scalable across processors, you might be better off investigating alternatives such as slab based allocation. Slabs provide natural synchronization points, and as long as different concurrent threads are allocating from different slabs, they can operate in parallel trivially.
nullplan
Member
Member
Posts: 1801
Joined: Wed Aug 30, 2017 8:24 am

Re: Multiprocessor-friendly Tree Based Allocator

Post by nullplan »

SuperLeaf1995 wrote:read lock: "please anyone don't read anything because i will be writing over here"
write lock: "please anyone don't write anything because i will be reading over here"
This is usually called a read-write lock and the meaning is typically the reverse of what you want. A read-write lock can be read-locked an arbitrary number of times, but can be write-locked only when it is free (not even read-locked). The usual usage is to read-lock a structure in places where it will be read, and write-lock it in places where it will be written. A connection between read and write is necessary, because otherwise it is possible for readers and writers to miss each other, and cause data races, and nobody wants that.

A simple implementation of this would just use a counter. When the counters is 0, the lock is free, when it is positive, the lock is read-locked, and when it is -1, it is write-locked. Read-locks basically just atomically increment the counter, but special case a read of -1, which means the task has to be suspended until the counter is no longer -1. Unlocks from read-locks then just atomically decrement the counter, and maybe wake up any waiting thread if the counter is 0 by the end. Write locks just try to atomically replace a 0 with a -1 in the counter. If that fails (because the counter is not 0), they must block until it is zero. That implementation is simple, but can lead to writer starvation: If the lock is constantly in read-locked state, because multiple threads are constantly reading the same data structure, so statistically, some thread will always be in the critical section, then the writers will never be able to observe a 0 value on the counter, and a write lock will never succeed.

A more complicated algorithm would be the ticket algorithm. The idea is like at the DMV: Everyone who comes in takes a ticket and waits for that ticket to be called. This would allow for some fairness, but the algorithm can be very tricky to implement. Especially for a read-write lock. You may want to allow for early return for readers unless there is a writer queued up. So that requires you to track the number of writers queued up. Unlocking is a bit tricky as well. You can't just increase a counter, you would have to set it to your own ticket number, unless the counter is already higher than that. Then there is handling the multiple readers: You don't want to just always call up the next ticket, since the ticket just turned in might be a reader, but the next one might be a writer. With multiple readers allowed into the lock at the same time, it is possible that some might overtake each other. There is also the question of how to handle failure and back-off. Maybe a signal arrived in a thread while blocking. Alright, how do you back out of this ticket operation? By the time you back off, you might no longer be the most recent arrival. And how do you handle overflow? All of these are tricky questions to answer.
SuperLeaf1995 wrote:Locks are created per node and unlocked when node is no longer used.
Well, that is hard. How does that work? If the lock is inside the node, than whoever takes a lock must already know the address of a node. So the lock inside the node cannot protect the node itself, only its contents. Consider this pseudo-code:

Code: Select all

find_node(key):
  n = root
  while (n):
    lock(n->lock)
    if (key < n->key)
      next = n->left
    else if (key > n->key)
      next = n->right
    else
      return n
    unlock(n->lock)
    n = next
This already illustrates one problem: Does the code return the node in locked state or not? If yes, the calling code must unlock the node when it is done. Possible but error-prone. If not, then a different process might delete the node from the tree as soon as the unlock is processed. Or you could return only the value, but then you must copy the value into a temporary before releasing the lock, so there are no accesses to the node while the lock is not held.

But now for the actual reason to bring this up: The problem is, as soon as the unlock is called, that node and all of its children might vanish. Therefore, the access to the nodes child-links, even though copied to a temporary, might be invalid, since by the time the lock in the child nodes is taken, the child node itself might no longer exist. Or look at the problem from the other side: While the locator thread is between the unlock on its current node and the lock on the next, it is holding no locks. Therefore, finding the locks free is not actually a guarantee that nobody holds a reference to the node, and so it is never safe to delete a node. Garbage collection would take care of this at great cost.

Whereas, if you just had one big lock for the tree, if you hold it, you can manipulate the tree however you want, since you definitely are the only manipulator in the tree at that time. If the lock is also generally not held for a great length of time (only for a single look-up, insert, or remove), then the fine-grained approach may very well be more trouble than worth, especially considering:
SuperLeaf1995 wrote:The tree does not self-balance, to make initial insertions faster
Unless you know your input data is not going to devolve into pathological cases (sorted input sequence being a pathological case), this is a horrible idea. Especially with the fine-grained locking approach. An unbalanced tree can devolve into a linked list (every node with only one child). If that were to happen, the order of locking would be the same in all threads that attempt to use the list, and therefore no overtaking of threads would be possible. All threads would queue up at exactly the same locks in the same order, and you get the same effect as if you just had a big tree lock, but in a more complicated fashion.

With binary trees, deletion is always more complicated than insertion. If ease of implementation is your primary concern, might I suggest using Anderson trees? And if speed is the reason: Red-black trees will only require at most two tree rotations on insert. It does not get much better.
bzt wrote:Why do you need read locks? It's enough to have only write locks. If the lock is free, nodes can be read. If it's set, then further reads and writes are put on hold.
What? Then if a reader is preempted after testing the lock, but before actually performing the read, it will read stale data. A read-write lock is necessary so the writer can know with confidence that no other thread is currently accessing the same data structure.
Carpe diem!
User avatar
bzt
Member
Member
Posts: 1584
Joined: Thu Oct 13, 2016 4:55 pm
Contact:

Re: Multiprocessor-friendly Tree Based Allocator

Post by bzt »

nullplan wrote:What? Then if a reader is preempted after testing the lock, but before actually performing the read, it will read stale data.
That's what atomic operations are for. On x86 you have hardware support for them (with the prefix called - surprise, surprise - "lock"), but one could simply signal the kernel not to be preempted, etc. etc. etc. (but I guess the PMM is in the kernel so it cannot be preempted anyway...)
nullplan wrote:A read-write lock is necessary
Not really. You can also use monitors and critical sections too, there are many different ways to solve concurrency issues.

Cheers,
bzt
nexos
Member
Member
Posts: 1081
Joined: Tue Feb 18, 2020 3:29 pm
Libera.chat IRC: nexos

Re: Multiprocessor-friendly Tree Based Allocator

Post by nexos »

Okay, here is an example for why readers-writers locks are necessary.
So say the write executes this line of code

Code: Select all

new_item->prev = item;
new_item->next = item->next;
item->next->prev = new_item;
item->next = new_item;
Then if, a reader reads this in the middle of the operation, things could go south very quick, as it is reading inconsistent data.
"How did you do this?"
"It's very simple — you read the protocol and write the code." - Bill Joy
Projects: NexNix | libnex | nnpkg
nullplan
Member
Member
Posts: 1801
Joined: Wed Aug 30, 2017 8:24 am

Re: Multiprocessor-friendly Tree Based Allocator

Post by nullplan »

bzt wrote:That's what atomic operations are for.
But some operations are by nature not atomic. A tree rotation requires two writes at quite distant addresses before the tree is consistent again. Node insertion into/removal from a doubly linked list requires the same. How is that supposed to work in this case? Reader tests the lock, and then? If you want to say: "Then it reads the key and the next pointer", then that is already three operations that cannot be done atomically. While that is going on, another thread might take the lock, and change the data, or the structure.
bzt wrote:but one could simply signal the kernel not to be preempted,
Yes, and then everything fails as soon as another CPU is added, where preemption is not the only reason for concurrency. And on a uniprocessor system, disabling preemption is like taking a big global lock, so you are preventing everything else from running for that time. If you were going to do that anyway, might as well get rid of the fine-grained locking, create a big mutex for the entire tree, and everything gets a whole lot simpler.
bzt wrote:(but I guess the PMM is in the kernel so it cannot be preempted anyway...)
Again, multiprocessor systems exist. Preemptible kernels exist as well.
bzt wrote:Not really. You can also use monitors and critical sections too, there are many different ways to solve concurrency issues.
Well, if you cut up my sentence so it changes its meaning, then yes, that sentence was wrong. It wasn't what I wrote, though, I wrote that such a lock is necessary so the writer can know with confidence that no other thread is currently accessing the same data structure. You might notice that both monitors and critical sections serve the same purpose. In English, even a trailing sentence can change the meaning of the preceding sentence. I take great care not cut up other people's responses when quoting, at least not enough to change their meaning.
Carpe diem!
User avatar
bzt
Member
Member
Posts: 1584
Joined: Thu Oct 13, 2016 4:55 pm
Contact:

Re: Multiprocessor-friendly Tree Based Allocator

Post by bzt »

Considering all that you wrote there's still no need for two locks. One is enough for both the readers and writers. Because:
1. you want to lock-protect the reads too
2. for writers there's no difference if the object is locked by a reader or by another writer

However I'd like to point out that lock-protecting the reads in any way will result in terrible performance, and should be avoided at all costs. Using two locks won't solve this issue, using methods like ordered queue is the way to go. Or use a data structure that does not need locks in the first place.

Cheers,
bzt
Post Reply