
Allocator pondering

Posted: Sat Oct 13, 2007 12:18 am
by elderK
Yes, as always, I am pondering allocators.
Funnily enough, I'm still pondering interesting ways to implement a buddy allocator (the other allocator design I was working on has gotten slightly boring for now, so... here we are).

A long while ago, I remember JNC spoke of an allocator he created, which was bitmap-powered. IIRC, separate bitmaps for each granularity.

Back around then, I implemented a clone of the design he described, ran some basic benchmarks, and found it would be usable in a real system, even though splitting the entire managed address space into blocks of the minimum granularity cost a relatively high number of clock cycles.

So, I began to ponder...
and wander.
And ponder some more.
:P Much time passed, many bad things happened and life, oh life, how it has changed.

Out of the recent carnage, I have sat down and started tinkering with an alternative design for my buddy allocator. Unlike JNC's, it is powered by an AVL tree. Well, multiple AVL trees.

Now, I get the feeling that a lot of people will cringe at this, y'know, since trees typically have a large per-node overhead (pointers, etc.). But, because I strive to be an eager beaver, I have managed to cut the node size down insanely.

I have ended up encoding the following into 8 bytes (a sketch of one possible layout follows the list).
If anyone has any idea how I could make it 4 bytes, please speak up...

- 3 node pointers (left, parent, right).
- node balance.
- block address and size.
+ a few of the leftover bits used to make the AVL code slightly easier.
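To make that concrete, here is a minimal sketch of one way such an 8-byte node could be laid out, assuming all nodes live in a single array of at most 65536 entries (so 16-bit indices replace full pointers) and that the block size is implied by which granularity tree the node sits in, leaving only the block number to store. The field widths are my illustration, not necessarily the real layout:

Code:
#include <stdint.h>

struct avl_node {
    uint16_t left;      /* index of left child  (0xFFFF = none)       */
    uint16_t right;     /* index of right child (0xFFFF = none)       */
    uint16_t parent;    /* index of parent      (0xFFFF = tree root)  */
    uint16_t packed;    /* bits 0-11:  block number (offset >> shift),
                           bits 12-13: balance factor + 1 (0, 1 or 2),
                           bits 14-15: spare bits to ease the AVL code */
};

static inline int node_balance(const struct avl_node *n)
{
    return (int)((n->packed >> 12) & 3) - 1;    /* -1, 0 or +1 */
}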

The idea is similar: I have two arrays of trees, one tree per granularity for blocks of that size in use, and one for blocks available (see the sketch below).
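Concretely, the arrays are just per-granularity tree roots, something like the following (the constants match the 8 MiB / 4 KiB example further down):

Code:
#include <stdint.h>

#define MIN_SHIFT 12                        /* 4 KiB minimum blocks */
#define MAX_SHIFT 23                        /* one 8 MiB block      */
#define ORDERS    (MAX_SHIFT - MIN_SHIFT + 1)

static uint16_t avail_root[ORDERS];   /* AVL root per granularity: free   */
static uint16_t inuse_root[ORDERS];   /* AVL root per granularity: in use */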

Anyway, the question here isn't about the internals of my little idea, but more in the tone of: would using an AVL tree for this kind of system actually be worthwhile, considering the somewhat massive increase in overhead? (Addresses can arrive in random order, but are typically ascending; lookup must be fast, both for finding blocks of a requested size and for freeing them, as well as for merging blocks... which the AVL tree's sorted ordering will make deviously easy.)

i.e.: would the speed gain be WORTH the gain in overhead?

I.e., let's say we are tracking 8 MiB of space in total, with a minimum granularity of 4 KiB.

Using JNC's system, it would take around ~520 to ~540 bytes to store the bitmaps tracking granularities from 4 KiB up to 8 MiB. Using my system, it would take 16 KiB (2048 minimum-granularity blocks × 8 bytes per node). In this case, my system uses roughly 31 times the space JNC's requires.

The point is, would it actually be worth it in terms of performance? Or is this not really a straightforward question? Does it depend more on, say, whether the buddy allocator is going to be in use a GREAT deal, or whether it is one of those deeply buried kernel subsystems that is called once in a while?

If it is called once in a while, then it is relatively acceptable for it to be a tad slow. If it is going to be called a great deal, in short succession, then it really does need to be as fast as possible. Right?

Anyways, as usual, any feedback on this matter would be greatly appreciated.

I'd be really quite interested to see what your opinions are on this :)

~Z

Posted: Sun Oct 14, 2007 4:19 am
by jnc100
So I'll assume that when you talk about an 'allocator' here, you're referring to a physical page provider (because that's what I thought the original discussion was about), rather than a virtual address space management system or a malloc-like heap manager (AFAIK, Doug Lea's malloc uses some sort of tree structure for fast lookups and easy merging of adjacent regions).

To set the background: previously I was using a buddy system (with a few optimisations) to track physical memory, in chunks from 4 MiB down to 4 kiB. It allowed me to use various block sizes (e.g. I could allocate 4 kiB pages and 2 MiB/4 MiB large pages easily, along with other sizes; for example, I wanted 1 MiB of contiguous page tables to use as my kernel page tables, so they could be set up easily before paging was initialised), and I could also allocate specific blocks for DMA devices etc. It worked well, and life was good, but I've since tweaked it for performance reasons and settled on a hybrid system: a bitmap for regions < 16 MiB (DMA) and a stack for those above. I don't think you can improve much on the performance of a stack for pops/pushes.
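Roughly, the dispatch between the two halves looks like this (bitmap_alloc_below() and stack_pop_page() are illustrative stand-ins for the two sub-allocators, not my actual code):

Code:
#include <stdint.h>

#define DMA_LIMIT ((uintptr_t)16 * 1024 * 1024)

extern uintptr_t bitmap_alloc_below(uintptr_t limit);  /* bitmap half */
extern uintptr_t stack_pop_page(void);                 /* stack half  */

/* Hand out one 4 kiB physical page; returns 0 on failure. */
uintptr_t phys_alloc_page(int need_dma)
{
    if (need_dma)
        return bitmap_alloc_below(DMA_LIMIT);
    uintptr_t page = stack_pop_page();
    /* fall back to the low bitmap only when the stack is empty */
    return page ? page : bitmap_alloc_below(DMA_LIMIT);
}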

The system you are describing (I think) is like a buddy system, but with each granularity level implemented as an AVL tree rather than a bitmap. If so, it is a good improvement on the bitmap system. The problem was always having to test each bit of a large bitmap to find one free block (actually you could test 32/64 bits at once, but even so). In terms of size overhead, it's probably best to think of it as a fraction of total memory: e.g. you're suggesting using 16 kiB for 8 MiB of managed memory, which is 0.2%, so pretty acceptable to me.

I think the question you need to answer about using AVL trees (which I do use for virtual address management, by the way) is exactly why you need to be able to allocate varying block sizes and specific addresses throughout the entire physical address space. If you have a good reason to, then I believe the speed/size tradeoff you've made is a good one; otherwise, I recommend using a stack of 4 kiB pages for all addresses > 16 MiB (possibly until you get into PCI space etc...). There's a bit of a discussion about how to implement a physical memory stack here which, strange as it sounds, actually requires only 4 bytes of memory for the whole of the stack. Brendan also makes some good points about one stack per CPU, NUMA domain, page colour etc. That is the other good thing about quick physical memory allocators: in the (albeit rare, for a well-designed system) case that two CPUs need to access the same allocator at the same time, a short access time (with locks etc.) decreases the time both CPUs have to wait.
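Incidentally, the reason the stack only needs 4 bytes is that it is threaded through the free pages themselves: each free page stores the address of the next, so the allocator's only dedicated storage is the head. A sketch, assuming the page being touched is mapped somewhere writable at the time:

Code:
#include <stdint.h>

static uintptr_t stack_top = 0;   /* address of the top free page (the
                                     allocator's only dedicated storage) */

void stack_push_page(uintptr_t page)
{
    *(uintptr_t *)page = stack_top;   /* old top lives inside the page */
    stack_top = page;
}

uintptr_t stack_pop_page(void)
{
    uintptr_t page = stack_top;
    if (page)
        stack_top = *(uintptr_t *)page;   /* follow link to new top */
    return page;                          /* 0 means out of pages   */
}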

Regards,
John.

Posted: Thu Oct 18, 2007 7:31 am
by elderK
Heya John, it's good to see you still around here :).

I should have specified that this allocator isn't for use with physical memory. For that, I have a very... strange and specialized bitmap-stack system.

Like most things in my rather oddball kernel, everything is strange and specialized...

The reason I require a buddy allocator specifically, and why I need different block sizes and fast compaction, is that... the allocator will be working in kernel-space.

Now, I understand that by itself that doesn't mean all that much, so let me give some background...

My kernel uses a... state-based memory management system. For the most part, physical and virtual memory management (at least at the lowest levels) work through identical interfaces. This was developed so that the kernel could place itself at arbitrary memory locations, to best suit different situations, should they arise in the future.

Needless to say, I like this feature.
So, Kernelspace in the scope of this thread means the small contiguous, virtualized region of memory that the kernel and its structures reside in.

Previous iterations of my kernel used the bitmap-stack system to manage the kernel's virtualized space. This worked okay, apart from the fact... that it would get fragmented in heavy use.

This happens because... well, you can't virtual-virtual map virtual space :P.
Buddy allocation is my solution, since I already use (and spent a long, long time developing) a decent pool system for kernel structure allocation, so I can negate the internal fragmentation that comes with the buddy algorithm. The idea is that buddy should let me get rid of the external fragmentation / checkerboarding of kernel space, which would be a BIG deal!
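For the curious, here is the property that makes the merging deviously easy: for power-of-two blocks aligned to their own size, a block's buddy is just its address with the size bit flipped, so coalescing a freed block is one XOR plus one lookup in the free tree of that granularity. A sketch:

Code:
#include <stdint.h>
#include <stddef.h>

/* For power-of-two blocks aligned to their own size, the buddy of a
 * block is the block whose address differs only in the 'size' bit. */
static inline uintptr_t buddy_of(uintptr_t addr, size_t size)
{
    return addr ^ (uintptr_t)size;
}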

When you put 16 KiB as 0.2% of 8 MiB, that seems far more tolerable.
8 MiB is the size of kernelspace in all configurations (in future, this will be controllable via Lilac-configured bootloader stuff... but for now, it's hardwired).

As for using AVL, the reason I use the structure is that it is one of the fastest ADTs I have in my little grimoire of witchcraft... :P

Thanks for the feedback! :)

~z

Posted: Sun Oct 21, 2007 3:38 am
by jnc100
zeii wrote: Heya John, it's good to see you still around here :).
Yeah, still hovering on the fringes when I get time, which isn't often. I'm currently working flat out on my latest, top secret, rather hush-hush project.

Now, am I right in thinking that the idea here is to avoid fragmentation of the virtual address space available to your kernel? I.e., what you're trying to produce is a malloc-like interface? For that, I think choosing a minimum granularity of 4 kiB is a bit silly, since if you just need to allocate a small structure of several dozen bytes you waste a lot of space (i.e. a whole page).

To give you an idea of how I approach the problem: the virtual address space is managed by 'sections'. I store the used sections in a linked list (in previous versions) or a sorted binary tree (in more recent versions). I can add a new section either by attempting to add it at an absolute address (failing if that address is in use) or by searching for free space of a certain length. The search is an O(n) operation, as you have to examine the gap between each pair of sections sequentially. That is not much of a problem, however, as the list of sections is relatively small and adding a new one is rare (also, if the list is sorted, you can just add to the end, if there's room, making it O(1)).
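In code, the search is just a walk along the sections in address order; the structure and names here are illustrative rather than my actual ones:

Code:
#include <stddef.h>
#include <stdint.h>

struct section {
    uintptr_t base;            /* start address                 */
    size_t    length;          /* size in bytes                 */
    struct section *next;      /* list sorted ascending by base */
};

/* Return the base of the first gap of at least 'wanted' bytes, or 0
 * if none fits below 'space_end' (0 doubles as failure for brevity). */
uintptr_t find_free_space(struct section *list, size_t wanted,
                          uintptr_t space_end)
{
    uintptr_t cursor = 0;      /* assume the space starts at 0 */
    for (struct section *s = list; s != NULL; s = s->next) {
        if (s->base - cursor >= wanted)
            return cursor;     /* the gap before this section fits */
        cursor = s->base + s->length;
    }
    return (space_end - cursor >= wanted) ? cursor : 0;
}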

Now, what is a section? Well, it's an arbitrary region of memory with certain properties (e.g. readable, writable, executable, is_heap, is_stack, is_userspace, is_threadlocal, is_sharedmemory, is_device etc.). When a process is loaded, its executable sections are turned into sections at the required addresses. A heap is also created, and a stack is created per-thread. The same is true for the kernel, which defines sections for video memory etc. for its debugging console.
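As a bitmask, those properties might look like the following (the names just mirror the list above; purely illustrative):

Code:
/* Section properties as a bitmask. */
enum section_flags {
    SEC_READABLE        = 1 << 0,
    SEC_WRITABLE        = 1 << 1,
    SEC_EXECUTABLE      = 1 << 2,
    SEC_IS_HEAP         = 1 << 3,
    SEC_IS_STACK        = 1 << 4,
    SEC_IS_USERSPACE    = 1 << 5,
    SEC_IS_THREADLOCAL  = 1 << 6,
    SEC_IS_SHAREDMEMORY = 1 << 7,
    SEC_IS_DEVICE       = 1 << 8,
};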

Sections can be small, with the possibility to expand. For example, my userspace heap is defined to be initially 0 bytes in size, but it declares that it can grow up to 2 GiB (or whatever, more in 64-bit), so no new sections are placed in its potential space. Obviously you allocate the heap last, so it doesn't tread on any parts of memory the executable wishes to use. Furthermore, sections can expand manually or automatically. Manual expansion is possible for the heap, and is accomplished by the sbrk call (same for kernel or user space: if the call comes via an interrupt, expand the user heap, else expand the kernel heap). Automatic expand-down is enabled for the stack.
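A sketch of the manual-expansion path, with heap_end, heap_limit and map_page() as assumed helpers rather than my real names:

Code:
#include <stdint.h>

static uintptr_t heap_end;     /* current break                        */
static uintptr_t heap_limit;   /* top of the range reserved for growth */

extern void map_page(uintptr_t vaddr);  /* back one page with a frame */

/* Grow the heap; returns the old break, or (void *)-1 on failure.
 * Page-alignment details are omitted for brevity. */
void *kernel_sbrk(intptr_t increment)
{
    uintptr_t old_end = heap_end;
    if (old_end + increment > heap_limit)
        return (void *)-1;     /* would overrun the reserved range */
    for (uintptr_t p = old_end; p < old_end + increment; p += 4096)
        map_page(p);           /* back any newly exposed pages     */
    heap_end = old_end + increment;
    return (void *)old_end;
}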

My kernel-space malloc is simply dlmalloc, using the sbrk interface.

If, then, you are using AVL trees for section-style management like this, then yes, I agree it's a good idea (well, I use it :wink:), but I'm not convinced that using it for a malloc interface with a minimum granularity of 4 kiB is a good idea, unless your kernel only ever allocates blocks of that size.

Regards,
John.

Posted: Sun Oct 21, 2007 10:29 am
by elderK
;) It's a support mechanism for something called ABM. The Abstract Block Manager is the kernel's MALLOCy interface, if you could call it mallocy ;).

ABM is a pool system; it caches arrays of structures. It's very, very fast, but it needs an equally fast page mechanism in the background; thus the buddy allocator, to replace the older one that was fast but fragmented heavily...

Top secret hush hush huh?
:P My alter-project is a little MUD thing I'm making with a friend. We started the project back in April this year... and have yet to have a thing to show. :P
But that's how we like it. lol.

~Z

Posted: Sun Oct 28, 2007 3:56 pm
by jnc100
Well, if you're looking for a way to allocate virtual space, I suggest a simple heap with an easily expandable end-of-heap pointer as probably the easiest and quickest way. E.g. your pool system is asked for an area of 32 bytes, which it already has lying around somewhere, and returns that. Then it is asked for 8 kiB, which it doesn't have in its managed space. It asks the higher-level virtual space allocator for more space (via sbrk/morecore or similar), which simply expands the space the pool system is managing by enough that the free space reaches 8 kiB, composed of the free block at the end of the pool plus whatever else was requested. It then returns that.
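In rough C, the interplay is something like this (the pool_* helpers are hypothetical stand-ins for whatever your pool system provides):

Code:
#include <stddef.h>
#include <stdint.h>

extern void  *pool_take(size_t size);     /* from cached space, or NULL */
extern size_t pool_tail_free(void);       /* free bytes at pool's end   */
extern void   pool_absorb(size_t bytes);  /* extend the managed space   */
extern void  *kernel_sbrk(intptr_t increment);

void *pool_alloc(size_t size)
{
    void *p = pool_take(size);            /* fast path: already cached */
    if (p != NULL)
        return p;
    size_t shortfall = size - pool_tail_free();
    if (kernel_sbrk((intptr_t)shortfall) == (void *)-1)
        return NULL;                      /* the heap could not grow   */
    pool_absorb(shortfall);               /* tail free block + new     */
    return pool_take(size);               /* now it fits               */
}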

Obviously, the problem comes when the space immediately after the heap is used up by something else; then your fragmentation problem occurs (which makes the pool allocator slightly more tricky, but certainly not impossible). To avoid that, I simply ensure that the start of the heap is followed by a large enough range of free space that such a thing will never happen (i.e. a large range is reserved for the future expansion of the heap).
zeii wrote: Top secret hush hush huh?
Well, okay, not so secret. It's basically an ahead-of-time compiler/assembler (call it what you will) for CIL to x86_64 native code, with certain extensions to allow a kernel to be written in anything that targets CIL. Most of the instruction set is implemented; it's just a nightmare to test. Add that to the fact that I'm no longer a student and am working long hours, and you'll understand why it's a rather slow process.

Regards,
John.