Virtual Adress Space Reservation

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
Post Reply
User avatar
zaval
Member
Member
Posts: 660
Joined: Fri Feb 17, 2017 4:01 pm
Location: Ukraine, Bachmut
Contact:

Virtual Adress Space Reservation

Post by zaval »

First, let's call a reservation the initial allocation request from the client, when no mapping, PTE creating etc is involved, just reserving a piece of Virtual Address Space (VAS) for the future use, any - stacks placement, pools, image sections, file mapping, custom process entities.
What do you think is the best approach in VAS reservation. The client asks you to reserve a block of memory in its VAS, without address specified. Especially if you should care about a simultaneous possibility, that the client might ask to reserve at the specific address too, so for the algorithm, there is no guarantee that the space is split into 2 parts - purely empty and somewhat used.
I know about efficient lookups for already reserved blocks - keyed by addresses balanced/self-adjusting trees quickly answering whether this block is already reserved or not (and more answers in fact). But I don't see an effective means to quickly find an address for the reservation request.
Maybe on 48, 64 bits it's good to just move upwords all the way, it hardly would be exhausted anytime, but it's not the case for 32 bits.
What's your choice? Is it the same for the user space and kernel paged entities? I mean in kernel, one might do VA reservation somewhat hardcoded way. Is it good? It's not good for the trivially mapped MIPS's kseg0 for sure, :D but on the architectures, where mapping is always non-trivial (involving PTs) it might or might not be good. Honestly I don't see how hardcoded reservation in the kernel is good. Anyway, for the user space, there is no such a possibility, since a process could have maybe even thousands different reserved areas, they are too dynamic.
ANT - NT-like OS for x64 and arm64.
efify - UEFI for a couple of boards (mips and arm). suspended due to lost of all the target park boards (russians destroyed our town).
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Virtual Adress Space Reservation

Post by Brendan »

Hi,
zaval wrote:What's your choice?
My choice is that a process tells the kernel what to do with each area of the process' part of a virtual address space (e.g. "change the area from <start> to <end> to <virtual_area_type>") and the kernel does what it's told.

If the process wants to reserve an area and/or dynamically allocate areas, then that's a user-space problem that should be implemented in a user-space library (kernel has no reason to know or care about it). More specifically; if user-space calls the "mmap()" function in a POSIX library without using the MAP_FIXED flag, then the library figures out what the address should be before asking the kernel to create the mapping; and the kernel's lower-level implementation always behaves as if MAP_FIXED flag was set.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
alexfru
Member
Member
Posts: 1112
Joined: Tue Mar 04, 2014 5:27 am

Re: Virtual Adress Space Reservation

Post by alexfru »

If you need a specific location or a large size, you should probably have a way to reserve early or the general-purpose reservation shouldn’t ever reserve in the special area.
Orherwise, the simplest is to keep available regions in a balanced tree, ordered by size (size being a page or multiple pages, e.g. 4 KB to 64KB). With this you can quickly find if you can satisfy reservation and where (e.g. find the region big enough and possibly split it).
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Virtual Adress Space Reservation

Post by Korona »

zaval wrote:I know about efficient lookups for already reserved blocks - keyed by addresses balanced/self-adjusting trees quickly answering whether this block is already reserved or not (and more answers in fact). But I don't see an effective means to quickly find an address for the reservation request.
Once you have the balanced tree¹ reserving regions becomes easy: Just store the size of the largest available region in the inner nodes of this tree. Then during reservation you only descend into nodes that have enough space left. This means that the kernel is able to satisfy non-fixed reservation requests at no additional overhead - in fact, if you implement the same functionality in user-space like Brendan suggested, you will have some non-zero overhead.²

Therefore, my (micro-)kernel has three reservation modes: Reserve area at fixed address P, reserve area at lowest available address above P and reserve area at highest address below P. An additional advantage of this scheme (compared to a user-space solution) is that different actors do not need to synchronize their reservation requests: For example, in my OS, the POSIX layer (that implements the VFS and manages open file descriptors) runs in a separate process. Child processes requests the POSIX process to perform mmap()s for them³ but they can still do their own reservations for non-POSIX related memory regions. For example, if a device driver wants to map a PCI BAR, it bypasses the POSIX layer without the need for any additional communication.

¹ Or a radix tree that might work better for SMP systems as it allows single-update/multiple-lookups concurrency.
² See my implementation here: Click me.
³ This is necessary to handle some cases of mmap() that would otherwise be very hard to resolve. For example, the memory map in /proc/self/maps needs to be updated synchronously with reservations. Another thing that is hard to handle in an application itself is support for userfaultfd().
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Virtual Adress Space Reservation

Post by Brendan »

Hi,
Korona wrote:Once you have the balanced tree¹ reserving regions becomes easy: Just store the size of the largest available region in the inner nodes of this tree. Then during reservation you only descend into nodes that have enough space left. This means that the kernel is able to satisfy non-fixed reservation requests at no additional overhead - in fact, if you implement the same functionality in user-space like Brendan suggested, you will have some non-zero overhead.²
For the kernel's low level code (mainly the page fault handler) the usage is always something like "find the information corresponding to this page number" where you want to be able to modify the information for a single page without caring about corrupting the information for adjacent pages; and where that information includes things like if the page is also stored in swap space and where, if the page was modified, when the page was accessed last, if the page is part of a memory mapped device, if the page is part of a memory mapped file and which offset within the file, etc.

For the management of "higher level virtual memory zones" you don't want most of the information that kernel's low level code needs and you're mostly searching based on "zone size" (and not doing a lookup of "page number"). It's impossible to design a single data structure that is efficient for both of these completely different use cases; and if you try you're guaranteed to end up with a data structure that is inefficient for at least one of the usages.

If you do have "one data structure designed and optimised for lower-level page fault handling" and "completely different data structure designed and optimised for managing higher level virtual memory zones" (to improve performance); then there's no benefit in putting the latter in the kernel at all, and by putting it in the process (e.g. in a library) you're able to avoid the overhead of a system call in some cases (e.g. if it needs to return an error like EINVAL or ENONMEM, if an area already uses the type that you're changing it to, etc) and it increases the flexibility (e.g. a process could use a completely different allocator that's custom designed and optimised for the process rather than using a "jack of all trades" generic allocator). In other words, by improving performance (by using different data structures designed for different use cases) it creates additional possibilities to improve performance.

Of course this also follows micro-kernel principles (e.g. shifting things out of the kernel to reduce the risk of kernel bugs and improve security) - performance isn't necessarily the only goal.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Virtual Adress Space Reservation

Post by Korona »

Well, what you say is partially true but not the whole story. Yes, a multi-layer mechanism in the kernel is probably more costly than a data structure that can only handle page faults at maximum efficiency. However, this is only a concern if you do not need the multi-layer mechanism anyway to implement richer features. Handling stuff like swapping, CoW, memory mapped files, userfaultfd() and transparent memory on accelerator cards (e.g. GPUs) needs some support in the kernel anyway and requires the multi-layer approach independent of the ability to reserve non-fixed ranges.

We could actually have a discussion on this, but it is a bit off-topic in this thread (maybe split it into another thread?). In my microkernel, virtual memory is not managed in a single layer but in four different layers. The layers are:
  • Page bundle (grouping physical pages). Those objects are basically just a linear list of physical pages. For system RAM, this list is implemented as a tree that allows for efficient lookup. Each file gets its own bundle. Each GPU buffer gets its own page bundle on the GPU and in system RAM. Anonymous memory gets their own bundles. Swapping is handled on this layer, by marking each page as present or non-present.
  • Bundle view (remapping physical pages to virtual areas). Resolves addresses into (page bundle object, bundle offset) pairs. For files and anonymous memory, this mapping is 1:1. Bundle views can be shared among multiple mappings inside multiple address spaces. We handle transparent memory on accelerators here: For example, a buffer on a GPU can have a bundle in system RAM and a bundle on the GPU and we can transparently let the view point to either one.
  • Mapping (tracking per-address-space access). This layer handles CoW. Mappings are local to an address space and point to a bundle view. userfaultfd() is handled at this layer.
  • Address space (handling page tables). This layer manages the actual page tables. It stores a tree of mappings.
Page bundles, bundle views and address spaces are visible to user-space and can be manipulated by system calls. Mappings are not visible - thus it is not possible to form CoW loops. This architecture gives user-space enough features to implement page caches, userfaultfd() and GPU drivers efficiently. However, it does not give user-space the tools to construct arbitrarily nested virtual memory configurations. It is flexible enough to implement all features that we want (e.g. all of POSIX and Linux) but not too flexible - in particular, it cannot be used for denial-of-service against the kernel. After all, almost all page faults can be resolved by performing a address space -> mapping and a page bundle -> physical page lookup in radix trees.

I'm open about improvements to this architecture and about how the number of layers can be improved here. I found it quite hard to come up with something that works efficiently for all use cases and this architecture does that for me. Note that it can of course be simplified if you drop some features here: For example, if you do not want to support transparent memory on GPUs you do not need a remapping layer. The goal here, however, is to support all features of existing APIs without dropping too much in performance.
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Virtual Adress Space Reservation

Post by Brendan »

Hi,
Korona wrote:Well, what you say is partially true but not the whole story. Yes, a multi-layer mechanism in the kernel is probably more costly than a data structure that can only handle page faults at maximum efficiency. However, this is only a concern if you do not need the multi-layer mechanism anyway to implement richer features. Handling stuff like swapping, CoW, memory mapped files, userfaultfd() and transparent memory on accelerator cards (e.g. GPUs) needs some support in the kernel anyway and requires the multi-layer approach independent of the ability to reserve non-fixed ranges.

We could actually have a discussion on this, but it is a bit off-topic in this thread (maybe split it into another thread?). In my microkernel, virtual memory is not managed in a single layer but in four different layers. The layers are:
  • Page bundle (grouping physical pages). Those objects are basically just a linear list of physical pages. For system RAM, this list is implemented as a tree that allows for efficient lookup. Each file gets its own bundle. Each GPU buffer gets its own page bundle on the GPU and in system RAM. Anonymous memory gets their own bundles. Swapping is handled on this layer, by marking each page as present or non-present.
  • Bundle view (remapping physical pages to virtual areas). Resolves addresses into (page bundle object, bundle offset) pairs. For files and anonymous memory, this mapping is 1:1. Bundle views can be shared among multiple mappings inside multiple address spaces. We handle transparent memory on accelerators here: For example, a buffer on a GPU can have a bundle in system RAM and a bundle on the GPU and we can transparently let the view point to either one.
  • Mapping (tracking per-address-space access). This layer handles CoW. Mappings are local to an address space and point to a bundle view. userfaultfd() is handled at this layer.
  • Address space (handling page tables). This layer manages the actual page tables. It stores a tree of mappings.
Page bundles, bundle views and address spaces are visible to user-space and can be manipulated by system calls. Mappings are not visible - thus it is not possible to form CoW loops. This architecture gives user-space enough features to implement page caches, userfaultfd() and GPU drivers efficiently. However, it does not give user-space the tools to construct arbitrarily nested virtual memory configurations. It is flexible enough to implement all features that we want (e.g. all of POSIX and Linux) but not too flexible - in particular, it cannot be used for denial-of-service against the kernel. After all, almost all page faults can be resolved by performing a address space -> mapping and a page bundle -> physical page lookup in radix trees.

I'm open about improvements to this architecture and about how the number of layers can be improved here. I found it quite hard to come up with something that works efficiently for all use cases and this architecture does that for me. Note that it can of course be simplified if you drop some features here: For example, if you do not want to support transparent memory on GPUs you do not need a remapping layer. The goal here, however, is to support all features of existing APIs without dropping too much in performance.
Mine has different design goals - specifically there will never be any POSIX compatibility, shared memory (it's horrible for distributed systms), "read/write" memory mapped files (it's a versioning file system where you can't modify an existing file and can only create a new version of it), or "fork()" (it's just incredibly stupid); and user-space is split into "process space" and "thread space" (for multiple reasons); and for kernel code there's no portability at all (instead it's more like multiple "not portable at all" kernels where each kernel is custom designed for a specific CPU but all kernels provide the same API). For both of the 80x86 kernels (32-bit and 64-bit) I pack as much as possible into the "available" flags of page table entries to maximise the chance that the page fault handler will be able to handle a page fault with no cache misses whatsoever. Additional data structures are only needed for memory mapped files, swap space management and permission checking for memory mapped IO areas; but because there's no "read/write memory mapped files" I only need to care about "not present memory mapped file pages" and don't need to care about "already present memory mapped file pages" which means that I can store a (30-bit or larger) "memory mapped file number" in the ("not present") page table entry and use a (per process) array of memory mapped file entries; and because most devices only have a small number of memory mapped IO areas (and because this permission checking is only needed when mapping a memory mapped IO area into a driver's virtual address space) that's not performance critical (and can be incredibly simple, like a list with a linear search, without any reason to care). The swap space management is done in user-space via. a "swap space manager" process and doesn't exist in the kernel at all - the kernel only needs a hint to tell it whether it should/shouldn't consult or update the swap space manager.

Note: When there isn't much free physical memory some OSs do a bunch of work to keep track of when each page was accessed last (to figure out which page is the "best" page to sent to swap space). I don't bother with this (scanning all the "accessed" flags for all pages is slow). I only track when threads were given CPU time last and then send randomly selected pages (from least recently executed threads) to swap space. However; I use tiered swap space where the first level is "(compressed) RAM as swap space". This means that frequently accessed pages might bounce between "not in swap space" and "RAM as swap space" a few times (which allows the swap space manager to track LRU itself) but frequently used pages won't make it to lower/slower tiers of swap space (e.g. "swap space on disk"). For this reason my kernel doesn't need to store any data to keep track of when each page was accessed last.

For user-space; I expect each process to be custom designed for a specific well defined purpose, and I expect this to include whatever it does with its virtual memory management. For most processes "process space" and "thread space" are split into zones at compile time and there's no need for the process to manages zones at run-time at all.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
User avatar
zaval
Member
Member
Posts: 660
Joined: Fri Feb 17, 2017 4:01 pm
Location: Ukraine, Bachmut
Contact:

Re: Virtual Adress Space Reservation

Post by zaval »

Thank you for all your replies.
If the process wants to reserve an area and/or dynamically allocate areas, then that's a user-space problem that should be implemented in a user-space library
No matter where, it still should have been done. How do you think is the best way to do it? And also, is this library involved in process creation, when you need to reserve blocks for its image sections, stacks, pools etc?
If you need a specific location or a large size, you should probably have a way to reserve early or the general-purpose reservation shouldn’t ever reserve in the special area.
Orherwise, the simplest is to keep available regions in a balanced tree, ordered by size (size being a page or multiple pages, e.g. 4 KB to 64KB). With this you can quickly find if you can satisfy reservation and where (e.g. find the region big enough and possibly split it).
That was tempting, but merging free blocks becomes a performance disaster.
First, as this is not an in-place representation (not where free blocks reside), you cannot check adjacent blocks on their status directly. Only through page table entries.
So you end up with something like this sequence (the lower block check example):
1) check its last pte for some flag indicating used/unused. this pte is right before our new block pte.
2) if it is "not used", go to the previous pte until pte is used, this gives an address and size of the previous block.
3) find the item describing the block in a list/tree. for the list it's just a traversal linear by the number of blocks, bad, for the tree, its relatively quick since you find by the size and then find next until the address matches.
4) now finally modify this item increasing its size (key) and make increase-key operation restoring BST property if it's a tree.
1 is constant, 2 depends linearly on the previous block size in pages, might be huge (it's a free block) or much worse - might even not have page tables allocated, resulting in at least a great wastage of time. 3 depends linearly on the number of blocks with the same size and logarithmical on the number of blocks. 4 is just O(log(n)), n - number of blocks.
2 looks kind of not exactly inspiring. :)
Once you have the balanced tree¹ reserving regions becomes easy: Just store the size of the largest available region in the inner nodes of this tree. Then during reservation you only descend into nodes that have enough space left. This means that the kernel is able to satisfy non-fixed reservation requests at no additional overhead - in fact, if you implement the same functionality in user-space like Brendan suggested, you will have some non-zero overhead.²
This sounds interesting and honestly I need to clarify it yet. As well as learn what radix trees are. :D Thank you for the link. Still, the key moment is availability of the address P at the moment of the call to the reservation algorithm. But how do you find P?
ANT - NT-like OS for x64 and arm64.
efify - UEFI for a couple of boards (mips and arm). suspended due to lost of all the target park boards (russians destroyed our town).
alexfru
Member
Member
Posts: 1112
Joined: Tue Mar 04, 2014 5:27 am

Re: Virtual Adress Space Reservation

Post by alexfru »

You can have two trees. One to hold all regions (reserved and available), ordered by address. The other to hold available only, ordered by size. An available region will appear in both. When you unreserve a reserved region, you can immediately find its available neigbors to coalsce with.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Virtual Adress Space Reservation

Post by Brendan »

Hi,
zaval wrote:Thank you for all your replies.
If the process wants to reserve an area and/or dynamically allocate areas, then that's a user-space problem that should be implemented in a user-space library
No matter where, it still should have been done. How do you think is the best way to do it? And also, is this library involved in process creation, when you need to reserve blocks for its image sections, stacks, pools etc?
The library wouldn't be involved in process creation - you'd have an executable loader that loads/maps sections at "link time defined" virtual addresses (followed by doing similar for shared libraries if you want that). When the library is initialised it'd figure things out from "already set by static or dynamic linking symbols" (but mostly only really needs to find a single huge unused space between "end of .bss" and "start of where dynamically linked libraries might be").
zaval wrote:
Once you have the balanced tree¹ reserving regions becomes easy: Just store the size of the largest available region in the inner nodes of this tree. Then during reservation you only descend into nodes that have enough space left. This means that the kernel is able to satisfy non-fixed reservation requests at no additional overhead - in fact, if you implement the same functionality in user-space like Brendan suggested, you will have some non-zero overhead.²
This sounds interesting and honestly I need to clarify it yet. As well as learn what radix trees are. :D Thank you for the link. Still, the key moment is availability of the address P at the moment of the call to the reservation algorithm. But how do you find P?
Assume you have a binary tree (where each node has 2 pointers); but where each pointer has "prefix value" and "prefix mask" fields associated with it. To traverse the tree you'd do something like:

Code: Select all

    do {
        if(this node is what I'm looking for) {
            return node;
        } else {
            if( address & left_prefix_mask == left_prefix_value) {
                node = node->left;
            } else if( address & right_prefix_mask == right_prefix_value) {
                node = node->right;
            } else {
               return NOT_FOUND;
            }
    } while(CPU is constantly stalled due to TLB misses, cache misses and branch mispredictions);
;)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Korona
Member
Member
Posts: 1000
Joined: Thu May 17, 2007 1:27 pm
Contact:

Re: Virtual Adress Space Reservation

Post by Korona »

zaval wrote:This sounds interesting and honestly I need to clarify it yet. As well as learn what radix trees are. :D Thank you for the link. Still, the key moment is availability of the address P at the moment of the call to the reservation algorithm. But how do you find P?
As I said, I use a tree (could be a (balanced) binary tree or a radix tree) to store all memory regions, ordered by address. In each (inner) vertex I store the size of the largest "hole" (i.e. the largest available region) that is a descendant from the vertex. This number can easily be updated during insertion and deletion (e.g. when you do tree rotations). Then I do a traversal from the root to a leaf of the tree but I only descend into vertices that have large enough holes at descendants.

Code: Select all

Let s be the size of the region that we want to reserve
if(root does not have a hole of size >= s)
    return reservation cannot be satisfied
v ← root
while(v is not a leaf)
    if(left(v) has a hole of size >= s)
        v ← left(v)
    else
        // By construction, right(v) has a hole of size >= s
        v ← right(v)
// By construction, v is not reserved yet
Cut a part u of size s from region v
Mark u as reserved
Insert u into the tree
return u
@Brendan. Yes, trees are not very cache-friendly. However, assuming that you use a radix tree with a branching factor of 256, you'll have at most 8 cache misses on 64-bit architectures (for allocation; of course, you'll have to add some more if you take changing the page tables into account). For most programs, assuming that they only use the highest and lowest 512 GiB of address space and factoring in that the first level will always be cached, you're down to 4. That number should not really be a problem considering that (a) reservation is a rare operation (and if it is not, more levels are likely to be cached) and (b) 4 L2 misses is already in the order of magnitude of a nop syscall. Furthermore, if you move reservation to user-space the user-space either still has to deal with the cache misses, needs a data structure that introduces less misses or needs to give up features (like forbidding dynamic region allocation or deallocation).
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].
User avatar
zaval
Member
Member
Posts: 660
Joined: Fri Feb 17, 2017 4:01 pm
Location: Ukraine, Bachmut
Contact:

Re: Virtual Adress Space Reservation

Post by zaval »

Thank you all, that was really helpful and interesting. This idea with holding information about holes inside an address keyed tree is attractive and I need to learn it deeply. :)
The library wouldn't be involved in process creation - you'd have an executable loader that loads/maps sections at "link time defined" virtual addresses (followed by doing similar for shared libraries if you want that). When the library is initialised it'd figure things out from "already set by static or dynamic linking symbols" (but mostly only really needs to find a single huge unused space between "end of .bss" and "start of where dynamically linked libraries might be").
Ok, process creation relies only on link-time defined addresses. However it's hard to get how all the next dynamic reservation requests can be satisfied with such an approach. For example stack creation for every (new) thread? additional pools (heaps), for synchronization reasons, mapped files? But I see, you don't like mapped files.
I mean it doesn't seem possible to limit with hard-fixed reservation. And the library should have some sofistication inside. You don't like trees. What do you see a better approach if not secret? :)
ANT - NT-like OS for x64 and arm64.
efify - UEFI for a couple of boards (mips and arm). suspended due to lost of all the target park boards (russians destroyed our town).
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Virtual Adress Space Reservation

Post by Brendan »

Hi,
zaval wrote:
The library wouldn't be involved in process creation - you'd have an executable loader that loads/maps sections at "link time defined" virtual addresses (followed by doing similar for shared libraries if you want that). When the library is initialised it'd figure things out from "already set by static or dynamic linking symbols" (but mostly only really needs to find a single huge unused space between "end of .bss" and "start of where dynamically linked libraries might be").
Ok, process creation relies only on link-time defined addresses. However it's hard to get how all the next dynamic reservation requests can be satisfied with such an approach. For example stack creation for every (new) thread?
Typically at the top of user-space you might have environment variables, command line arguments then the initial thread's stack. When a thread is spawned (e.g. using pthread library or whatever) the library figures out a whole bunch of stuff (where to put the new thread's thread local storage, where to put the new thread's stack, the thread's attributes, etc), where these allocations might (or might not) just use plain old "malloc()".
zaval wrote:additional pools (heaps), for synchronization reasons, mapped files?
A "generic C/C++/POSIX library" won't support additional pools or multiple separate heaps (or NUMA, or ...). For anything "better than bad" a process would need the flexibility to replace the generic allocator/s, which is easy when they're implemented in user-space (just don't use the generic "sys/mman.h" and/or "stdlib.h" and provide your own alternative) and hard when it's built directly in the kernel.

After the user-space allocator/s are initialised , memory mapped files are no problem at all. Before the user-space allocator/s are initialised the executable loader (and shared library loader) don't need a complex allocator (e.g. the executable can be mapped at the virtual address that the executable's headers say it must go, and shared libraries can use a trivial "memory_top -= size" approach; and the actual addresses can be randomised for ASLR if you want that).
zaval wrote:But I see, you don't like mapped files.
In general, memory mapped files are fine (beneficial in most cases). What I said was that my OS doesn't allow files to be modified (because it's a versioning file system) so "read/write memory mapped files" don't make any sense for my OS, and therefore my OS only has to care about "read only" memory mapped files.
zaval wrote:I mean it doesn't seem possible to limit with hard-fixed reservation. And the library should have some sofistication inside.
I don't really see how this is hard to figure out. If the kernel has "low-level, per page data" (designed and optimised to improve performance for things like page faults, and not designed for things like allocating and freeing zones), then you're going to have a have something else ("higher-level, per zone data") that is designed and optimised for allocating and freeing zones; and in that case (where everything is designed and optimised for its specific purpose) there's no reason to have that "something else" (the "higher-level, per zone data") in the kernel at all and there are multiple reasons (flexibility, performance, security) to make sure it is not in the kernel.
zaval wrote:You don't like trees. What do you see a better approach if not secret? :)
For better approaches:
  • If the process doesn't use more than one zone (e.g. one for heap and nothing else), then "no allocator at all" is better
  • If the process doesn't use more than maybe 16 zones, then maybe "simple array with 16 entries" (and linear searching) is better
  • If the process doesn't use more than maybe 256 zones and is running on something with a large enough virtual address space, then maybe "one bit for each of 256 slots" (and linear searching of the bitmap) is better
  • If the process happens to be single-threaded, then you don't need the overhead of acquiring/releasing locks (or over-complicated lockless or block free algorithms)
  • If the process allocates and frees lots of relatively small zones very often then maybe you can do some kind of "queue of zones for each power of 2" allocator, maybe combined with some kind of "don't both telling kernel when the area is free and just recycle when it's used again" behaviour to reduce overhead.
  • If the process happens to be something like a Java virtual machine then maybe the allocator should be build into object creation ("new") and needs to support the virtual machine's garbage collector
Note that I am NOT saying that trees aren't the best solution for SOME cases. All I am saying is that "generic code to suit all processes" is never ideal for any process - the only thing "generic code to suit everyone" is good for is to make it easy for lazy developers to churn out poor quality trash. That is the problem with "generic trees to suit all processes hard-coded in the kernel".


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Post Reply