(I understand that this can most probably be only answered by benchmarks, but it looks like the answer should be well-known...)
When implementing a page frame allocator, one has a choice between either (1) storing the data structures for managing free memory (stack, queue, several of these, a tree for buddies; the following doesn’t really apply to bitmaps) inside the free memory itself, or (2) permanently dedicating a portion of memory to them (thus “wasting” it). The first approach is clearly superior in terms of memory consumption; however, any operation will need to walk the management structures, and one would need to map free the memory into the virtual address space. Meanwhile, changing mappings (with TLB shootdowns) is rather expensive or requires ingenious concurrent programming. (Dedicated physical memory structures can also need windowing, but it’s not as severe as one page mapping per node access.) Assuming the worst, i.e. a multiprocessor with no hardware TLB consistency and a kernel virtual address space significantly smaller than the physical memory, are the memory savings of (1) really worth hitting the VM slow paths?
Is storing free memory info inside free memory worth it?
-
- Posts: 15
- Joined: Thu Aug 29, 2013 4:10 pm
-
- Member
- Posts: 5591
- Joined: Mon Mar 25, 2013 7:01 pm
Re: Is storing free memory info inside free memory worth it?
The third option is to use your allocator to allocate and free memory to store the free memory structure used by your allocator.
Initializing it is a bit tricky (since the allocator depends on itself), but it gives a pretty good compromise between "wasting space for empty free memory lists" and "wasting time constantly changing page tables to access the free memory lists".
Initializing it is a bit tricky (since the allocator depends on itself), but it gives a pretty good compromise between "wasting space for empty free memory lists" and "wasting time constantly changing page tables to access the free memory lists".
Re: Is storing free memory info inside free memory worth it?
In my opinion, it doesn't really matter at all how much memory is required to store the information about a free page. It doesn't even matter if it takes 4096 bytes to store the fact that a page is free, because, well, the page is free.
What really matters is how much space is required to store information about the page when it is in use. If you intend to support the features of an advanced mm like COW, demand loading, shared memory then for each page in use a structure will be required to store the current 'state' of that page. If every page is currently being used for something then every page will need to have a structure describing it. In my OS I have a small structure permanently allocated for each physical page. I use a part of that structure when the page is free to store it in the free list.
When I started out I spent a lot of time on my physical page allocator weighing up bitmaps, stacks and all sorts of algorithms. I'm afraid that time was wasted.
What really matters is how much space is required to store information about the page when it is in use. If you intend to support the features of an advanced mm like COW, demand loading, shared memory then for each page in use a structure will be required to store the current 'state' of that page. If every page is currently being used for something then every page will need to have a structure describing it. In my OS I have a small structure permanently allocated for each physical page. I use a part of that structure when the page is free to store it in the free list.
When I started out I spent a lot of time on my physical page allocator weighing up bitmaps, stacks and all sorts of algorithms. I'm afraid that time was wasted.
If a trainstation is where trains stop, what is a workstation ?
Re: Is storing free memory info inside free memory worth it?
Hi,
Consider this:
Now consider this:
Basically...
The first approach is clearly superior in terms of memory consumption; however, there's almost never any other difference (if it's done right).
Cheers,
Brendan
There's almost never any sane reason to allocate a physical page and not map it somewhere; so you have to map it (and do the TLB stuff) regardless of whether you use the first approach or the second approach.AlexShpilkin wrote:When implementing a page frame allocator, one has a choice between either (1) storing the data structures for managing free memory (stack, queue, several of these, a tree for buddies; the following doesn’t really apply to bitmaps) inside the free memory itself, or (2) permanently dedicating a portion of memory to them (thus “wasting” it). The first approach is clearly superior in terms of memory consumption; however, any operation will need to walk the management structures, and one would need to map free the memory into the virtual address space. Meanwhile, changing mappings (with TLB shootdowns) is rather expensive or requires ingenious concurrent programming.
Consider this:
Code: Select all
void alloc_virtual_page(void *virtual_address) {
uint64_t phys;
lock_virtual_page(virtual_address); // Make sure other CPUs can't allocate the same page at the same time
phys = pre_alloc_physical_page();
map_page(virtual_address, phys);
post_alloc_physical_page(virtual_address);
fill_page_with_zeros(virtual_address); // Needed to prevent sensitive data "leaking" from one process to another
unlock_virtual_page(virtual_address);
}
uint64_t pre_alloc_physical_page(void) {
spinlock_acquire(free_page_list_lock);
return free_page_list_top;
}
void post_alloc_physical_page(void *virtual_address) {
free_page_list_top = *(uint64_t *)virtual_address;
spinlock_release(free_page_list_lock);
}
Code: Select all
void alloc_virtual_page(void *virtual_address) {
uint64_t phys;
lock_virtual_page(virtual_address); // Make sure other CPUs can't allocate the same page at the same time
phys = alloc_physical_page();
map_page(virtual_address, phys);
fill_page_with_zeros(virtual_address); // Needed to prevent sensitive data "leaking" from one process to another
unlock_virtual_page(virtual_address);
}
The first approach is clearly superior in terms of memory consumption; however, there's almost never any other difference (if it's done right).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
-
- Posts: 15
- Joined: Thu Aug 29, 2013 4:10 pm
Re: Is storing free memory info inside free memory worth it?
Spot on as usual, thanks! Put another way: if you need to touch a free list node, you’re in the process of allocating (or freeing) the page it resides in, thus it will need to be (or is) mapped anyway. Missed that!Brendan wrote: There's almost never any sane reason to allocate a physical page and not map it somewhere; so you have to map it (and do the TLB stuff) regardless of whether you use the first approach or the second approach.
This doesn’t hold if you have a more complicated structure (buddy allocator?), but presumably you only need this to allocate consecutive physical pages, which is weird. (And AFAIU page colouring ≈ multiple free lists[?] so it’s not “complicated” in this sense.)
Last edited by AlexShpilkin on Mon Apr 04, 2016 3:51 pm, edited 5 times in total.
-
- Posts: 15
- Joined: Thu Aug 29, 2013 4:10 pm
Re: Is storing free memory info inside free memory worth it?
Well then, technically, in your interpretation a free space bitmap costs 4096 bytes + 1 bit per page, so some of your time wasn’t wasted However, given what Brendan said, my original question really is irrelevant.gerryg400 wrote:In my opinion, it doesn't really matter at all how much memory is required to store the information about a free page. It doesn't even matter if it takes 4096 bytes to store the fact that a page is free, because, well, the page is free.
Also it’s much more complicated, so I’m still afraid of it. A story for another time I guessgerryg400 wrote:What really matters is how much space is required to store information about the page when it is in use.
-
- Posts: 15
- Joined: Thu Aug 29, 2013 4:10 pm
Re: Is storing free memory info inside free memory worth it?
This is interesting, and I missed it too. It’s more complex but increases locality. However, given that page mapping is expensive in any case, I don’t think cache performance would really matter here. (Still all speculation and no benchmarks, of course.)Octocontrabass wrote:The third option is to use your allocator to allocate and free memory to store the free memory structure used by your allocator.
In case someone else worries about the self-referential initialization: an example description of this can be found in the MIT xv6 commentary starting on page 32, to be studied along with the source code.Octocontrabass wrote:Initializing it is a bit tricky (since the allocator depends on itself), but it gives a pretty good compromise between "wasting space for empty free memory lists" and "wasting time constantly changing page tables to access the free memory lists".
Re: Is storing free memory info inside free memory worth it?
Hi,
For virtual address space management, most of the time it's 1 page only and it's allocated within the page fault handler. For example; caller might ask for 100 MiB, but you just map the same page full of zeros everywhere in that 100 MiB as "read only", and don't actually allocate a physical page until/unless they write to it and trigger a page fault by writing to a page. Anything involving "copy on write shared memory" is mostly the same (allocate one page in the page fault handler when there's a write); which includes memory mapped files (where file data is loaded into VFS cache and shared with processes that mapped the file). Swap space (e.g. allocating a physical page to load data from swap space) is always one page at a time too.
Sometimes you'll need a page that has a 32-bit physical address (e.g. for a "32-bit only" PCI device); and for that reason I have different free page stacks for pages with 32-bit physical addresses and for pages with larger physical addresses.
For some devices you need physically contiguous RAM (and sometimes with more bizarre restrictions, like "must be below 16 MiB and can't cross a 64 KiB boundary"), and free page stacks aren't good for that. For these cases; I normally use a bitmap for the first ~16 MiB of RAM instead.
For page colouring and for NUMA optimisations; I just have many free page stacks. For example; with 2 NUMA domains and 256 page colours I might have 2*256=512 free page stacks; and do something like "best_free_page_stack = (requested_NUMA_domain << NUMA_shift) | requested_page_colour;". Many free page stacks (with one spinlock per stack) also reduces the chance of multiple CPUs wanting to acquire the same spinlock at the same time (or, more free page stacks means less lock contention).
I also tend to do a trick to improve performance of higher priority threads. Essentially, have a "dirty page stack" (for pages that still contain stale data from before it was freed last) and a "clean page stack" (for pages that have been filled with zeros). When a high priority thread frees a page you put it on the dirty page stack, and when a high priority thread allocates a page you try to allocate from the clean page stack; and when a low priority thread frees a page you fill it with zeros and put it on the clean page stack, and when a low priority thread allocates a page you allocate from the dirty page stack and fill it with zeros. This means that most of the "page cleaning" is done with low priority threads (where performance/overhead doesn't matter as much); and most of the time you can allocate/free pages for high priority threads faster (without the overhead of filling pages with zeros).
Note that with all of this combined you end up with a kind of "search order" when allocating normal pages. E.g. you might start by trying to allocate a clean page above 4 GiB, and if there's none try for a dirty page above 4 GiB, and if there's none try for a clean page with a 32-bit address; and so on (until you're trying for wrong page colour and/or wrong NUMA domain as a last resort).
For large pages; I simply don't bother and only use large pages for memory mapped IO (e.g. video display memory). This is partly due to the nature of my OS (many smaller processes communicating rather than large "single process" applications) which makes large pages less useful (more wasteful). You could have additional free page stacks for large pages; but the large pages tend to get split up into smaller pages while the OS is running until there are none (so you end up with 4 KiB pages and nothing else anyway); and to avoid that you need some way to recombine many free 4 KiB pages back into a free large page, which tends to be messy and requires a much more complicated/slower physical memory manager (and the extra physical memory manager overhead can easily negate any performance advantages of using large pages).
Cheers,
Brendan
For a buddy allocator I doubt it can work. However, a buddy allocator is more suited to heap (e.g. "malloc()") rather than virtual address space management anyway.AlexShpilkin wrote:Spot on as usual, thanks! Put another way: if you need to touch a free list node, you’re in the process of allocating (or freeing) the page it resides in, thus it will need to be (or is) mapped anyway. Missed that!Brendan wrote:There's almost never any sane reason to allocate a physical page and not map it somewhere; so you have to map it (and do the TLB stuff) regardless of whether you use the first approach or the second approach.
This doesn’t hold if you have a more complicated structure (buddy allocator?), but presumably you only need this to allocate consecutive physical pages, which is weird. (And AFAIU page colouring ≈ multiple free lists[?] so it’s not “complicated” in this sense.)
For virtual address space management, most of the time it's 1 page only and it's allocated within the page fault handler. For example; caller might ask for 100 MiB, but you just map the same page full of zeros everywhere in that 100 MiB as "read only", and don't actually allocate a physical page until/unless they write to it and trigger a page fault by writing to a page. Anything involving "copy on write shared memory" is mostly the same (allocate one page in the page fault handler when there's a write); which includes memory mapped files (where file data is loaded into VFS cache and shared with processes that mapped the file). Swap space (e.g. allocating a physical page to load data from swap space) is always one page at a time too.
Sometimes you'll need a page that has a 32-bit physical address (e.g. for a "32-bit only" PCI device); and for that reason I have different free page stacks for pages with 32-bit physical addresses and for pages with larger physical addresses.
For some devices you need physically contiguous RAM (and sometimes with more bizarre restrictions, like "must be below 16 MiB and can't cross a 64 KiB boundary"), and free page stacks aren't good for that. For these cases; I normally use a bitmap for the first ~16 MiB of RAM instead.
For page colouring and for NUMA optimisations; I just have many free page stacks. For example; with 2 NUMA domains and 256 page colours I might have 2*256=512 free page stacks; and do something like "best_free_page_stack = (requested_NUMA_domain << NUMA_shift) | requested_page_colour;". Many free page stacks (with one spinlock per stack) also reduces the chance of multiple CPUs wanting to acquire the same spinlock at the same time (or, more free page stacks means less lock contention).
I also tend to do a trick to improve performance of higher priority threads. Essentially, have a "dirty page stack" (for pages that still contain stale data from before it was freed last) and a "clean page stack" (for pages that have been filled with zeros). When a high priority thread frees a page you put it on the dirty page stack, and when a high priority thread allocates a page you try to allocate from the clean page stack; and when a low priority thread frees a page you fill it with zeros and put it on the clean page stack, and when a low priority thread allocates a page you allocate from the dirty page stack and fill it with zeros. This means that most of the "page cleaning" is done with low priority threads (where performance/overhead doesn't matter as much); and most of the time you can allocate/free pages for high priority threads faster (without the overhead of filling pages with zeros).
Note that with all of this combined you end up with a kind of "search order" when allocating normal pages. E.g. you might start by trying to allocate a clean page above 4 GiB, and if there's none try for a dirty page above 4 GiB, and if there's none try for a clean page with a 32-bit address; and so on (until you're trying for wrong page colour and/or wrong NUMA domain as a last resort).
For large pages; I simply don't bother and only use large pages for memory mapped IO (e.g. video display memory). This is partly due to the nature of my OS (many smaller processes communicating rather than large "single process" applications) which makes large pages less useful (more wasteful). You could have additional free page stacks for large pages; but the large pages tend to get split up into smaller pages while the OS is running until there are none (so you end up with 4 KiB pages and nothing else anyway); and to avoid that you need some way to recombine many free 4 KiB pages back into a free large page, which tends to be messy and requires a much more complicated/slower physical memory manager (and the extra physical memory manager overhead can easily negate any performance advantages of using large pages).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.