Fully asynchronous IPC howto

Discussions on more advanced topics such as monolithic vs micro-kernels, transactional memory models, and paging vs segmentation should go here. Use this forum to expand and improve the wiki!
jal
Member
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Fully asynchronous IPC howto

Post by jal »

One of the design goals of my current OS project is to make it fully asynchronous with regards to IPC, so sending a message is never blocking. Obvioulsy, this causes a few problems, and though I can solve these (and have, to a certain extend), I'm not really satisfied yet. So I though I'd ask you for your opinion on the subject. The two main problems are:

1) Replies. I send five messages to a service. Then I get replies (either by explicitly waiting for them, or by having some interface function called, or whatever). How do I know which reply is a reply to which request?
2) Synchronization: I could ask the OS to enter a critical section, but since it's all asynchronous, I'll have to wait for the (asynchronous) answer. How to handle this without spinning or the like?

I'm not asking for any solution. I'm asking for what you would find elegant or efficient etc.


JAL
User avatar
AJ
Member
Member
Posts: 2646
Joined: Sun Oct 22, 2006 7:01 am
Location: Devon, UK
Contact:

Re: Fully asynchronous IPC howto

Post by AJ »

Hi,

I'm not the most educated person when it comes to ipc, but to my simple mind, the following solutions would be my initial thoughts:
1) Replies. I send five messages to a service. Then I get replies (either by explicitly waiting for them, or by having some interface function called, or whatever). How do I know which reply is a reply to which request?
I'd probably make an application-defined (or user library-defined) ID a part of each message? This could be done in one of two ways: Either make the user lib auto-increment the number and make that ID the return value to the "callService()" function. Alternatively, the app itself picks a message ID number and uses that.

As an extension to this, you could use a complex ID to let your app-space program know how to handle the message:

Code: Select all

struct MessageId
{
   uint32_t type:8; // An app-defined code telling the app how to handle the response when it comes back.
   uint32_t id:24;   // A unique ID - auto incremented?
}
2) Synchronization: I could ask the OS to enter a critical section, but since it's all asynchronous, I'll have to wait for the (asynchronous) answer. How to handle this without spinning or the like?
Is it neat enough to simply add a sleep-type function which sleeps the application until a message is available? Again, if you used the ID system, you could even ask the system to sleep your app until a specific message ID gets passed back.

Cheers,
Adam
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Fully asynchronous IPC howto

Post by Combuster »

You can also use addresses to a similar end - if you send messages to X, then the answers will come from X, if you send messages to X Y and Z, and you receive messages from Z X and Y, you'll still know what message is a response to which request. Similarly, if you know a server responds to messages in the original order, then you can just count the answers and match them to the requests.

It also solves potential problems if you ask multiple similar devices for input concurrently (sharing the same interface, return the same message ID, now which one's player 1's joystick?)
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
mybura
Posts: 18
Joined: Wed Sep 02, 2009 12:59 am

Re: Fully asynchronous IPC howto

Post by mybura »

). How do I know which reply is a reply to which request?
A request will be some form of structure that will be passed to a generic routine to action the request (i.e. DispatchRequest(&RequestDetails)). The address of the structure alone is enough to uniquely identify a request in the system given that the reply from the service will be stored in some kind of structure inside the request structure:

Code: Select all


  struct Request 
  { ...
    int RequestCode;
    int RequestParameter;
    int Response;
    ...
   }

This way, when the requesting function wants to know which reply belongs to which request, it simply looks at the Response field of the requesting structure.

As for synchronization, using the same structure, lets add two fields:

Code: Select all


  struct Request 
  { ...
    int RequestCode;
    int RequestParameter;
    int Response;

    int Owner;
    int CallingThread;
    ...
   }

When you create the original request structure, set the Owner to 0. Calling DispatchRequest will set Owner to 1. When the request has been serviced, the driver or service will set the Owner back to 0. DispatchRequest will return immediatly to the calling code, eventhough the driver or service might take some time to complete (hence keeping Owner at 0). While waiting for Owner to return back to 1, the calling code can suspend the thread and

1. Periodically wake up and check if Owner is 1 or
2. Get the driver/service servicing the request to wake up the thread using details from the CallingThread field.

Incidently, IPC in Windows seems to work like this.
Craze Frog
Member
Member
Posts: 368
Joined: Sun Sep 23, 2007 4:52 am

Re: Fully asynchronous IPC howto

Post by Craze Frog »

2) Synchronization: I could ask the OS to enter a critical section, but since it's all asynchronous, I'll have to wait for the (asynchronous) answer. How to handle this without spinning or the like?
This makes no sense, think about it.
jal
Member
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: Fully asynchronous IPC howto

Post by jal »

Craze Frog wrote:
2) Synchronization: I could ask the OS to enter a critical section, but since it's all asynchronous, I'll have to wait for the (asynchronous) answer. How to handle this without spinning or the like?
This makes no sense, think about it.
Of course this makes sense. IPC is the only way to ask any component of the system to do anything. This includes the kernel. "SendMessage" is the only primitive going via INT or SYSENTER. So requesting to enter a critical section would require to send the OS services part of the kernel a message via IPC. However, the thread is not blocked on sending a message, so the traditional approach of "request a critical section and if you're not allowed to enter, you're blocked" is not possible. But please explain why that doesn't make sense, as I have thought about it quite some hours already, before posting this.


JAL
User avatar
Combuster
Member
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance
Contact:

Re: Fully asynchronous IPC howto

Post by Combuster »

You don't need locking with pure message passing IPC, ever. There is no shared state that you would want to lock.
"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
User avatar
NickJohnson
Member
Member
Posts: 1249
Joined: Tue Mar 24, 2009 8:11 pm
Location: Sunnyvale, California

Re: Fully asynchronous IPC howto

Post by NickJohnson »

True, but he never said it was purely message based, just purely asynchronous. I too would recommend a pure message passing system, but on the other hand, there is nothing faster than shared memory.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fully asynchronous IPC howto

Post by Brendan »

Hi,
jal wrote:1) Replies. I send five messages to a service. Then I get replies (either by explicitly waiting for them, or by having some interface function called, or whatever). How do I know which reply is a reply to which request?
For my messaging, each message has 3 standard fields (followed by zero or more bytes of additional data, depending on what the message is). The 3 standard fields are the length of the message, the sender ID and the message type. Message types 0xFF000000 to 0xFFFFFFFF are restricted - they have standard meanings and may only be sent by the kernel. Message types 0x00000000 to 0xFEFFFFFF are "agreed upon" by the sender and receiver. For example, a service might define a "messaging protocol" which determines which messages it responds to and which messages it sends as replies.

Now, if you send 5 messages to a service, and if the message type and sender ID are the same, and if you need to be able to tell the difference between the replies, then the design of the "messaging protocol" being used needs to include some other form of identification. This might be an object number, a file name, a reference ID, etc.

For a simple example, for file I/O there's an "open_file_request" message type where the extra data contains the name of the file you want to open. The reply message also contains the name of the file you want to open (so you can figure out which request the reply is for), and a file handle. All subsequent file I/O operations (read, write, seek, flush, close, etc) use the file handle in the requests and the replies. This means you can have any number of file I/O requests/replies and know exactly what each message is for.
jal wrote:IPC is the only way to ask any component of the system to do anything. This includes the kernel. "SendMessage" is the only primitive going via INT or SYSENTER.
My advice here is: don't use messaging for the kernel API. Instead use a normal API and normal system calls (e.g. call gate, software interrupt, SYSCALL/SYSENTER, etc) for all things the micro-kernel does itself. This includes "send_message()" and "get_message()"; but it may also include things like "alloc_page()", "spawn_thread()", "get_time()", etc if these things aren't done by user-level processes/services.
Combuster wrote:You don't need locking with pure message passing IPC, ever. There is no shared state that you would want to lock.
You're correct for "pure message passing IPC". However, "pure message passing IPC" is a bit of a nightmare. IMHO it's easier to have multi-threaded processes, where there's "process space" and "thread space" (and where "process space" is shared state, that's hopefully not over-used). In this case, careful use of user-level locks can make it easier to write multi-threaded processes (even though it's theoretically possible to design the process so that user-level locks aren't needed). The important thing is that if you do implement user-level locks, they aren't any different to user-level locks in any other OS (for e.g. futexes).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jal
Member
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: Fully asynchronous IPC howto

Post by jal »

Thanks everyone for the discussion so far, you are being quite helpful.

As for a "pure message passing IPC", from your posts I understand this to mean a system that has no shared state among objects, basically one thread per process? That is, at least, a system I could see that locking would normally not be needed, but perhaps not very practical in terms of both performance and memory usage.
All subsequent file I/O operations (read, write, seek, flush, close, etc) use the file handle in the requests and the replies. This means you can have any number of file I/O requests/replies and know exactly what each message is for.
How would you handle a series of independent read operations, e.g. "read 500 bytes from offset 1024", "read 8412 bytes from offset 23556", "read 20 bytes from offset 53847"? Of course you could include all parameters in the reply, but this may not always be practical.
My advice here is: don't use messaging for the kernel API. Instead use a normal API and normal system calls (e.g. call gate, software interrupt, SYSCALL/SYSENTER, etc) for all things the micro-kernel does itself. This includes "send_message()" and "get_message()"; but it may also include things like "alloc_page()", "spawn_thread()", "get_time()", etc if these things aren't done by user-level processes/services.
Well, I'm striving for having a really, really small kernel, so basically "send_message" is the only think the kernel API offers. There's a "services" component that runs in ring 0, but that's just for the lower-level user space services like the memory manager and scheduler. It'd probably be impractical to use messages for those things, I'm still not certain how to handle.
it's easier to have multi-threaded processes, where there's "process space" and "thread space" (and where "process space" is shared state, that's hopefully not over-used).
That's at least currently the design I have.
The important thing is that if you do implement user-level locks, they aren't any different to user-level locks in any other OS (for e.g. futexes).
Didn't know about futexes, but they indeed seem something worth looking at. The first question that comes to mind is what a user level thread would do when failing to obtain the lock: spinning until it can get it? The ease with kernel locks is that the kernel can just put the thread to sleep when it cannot get the lock, and it is transparent for the thread.


JAL
Craze Frog
Member
Member
Posts: 368
Joined: Sun Sep 23, 2007 4:52 am

Re: Fully asynchronous IPC howto

Post by Craze Frog »

jal wrote:
Craze Frog wrote:
2) Synchronization: I could ask the OS to enter a critical section, but since it's all asynchronous, I'll have to wait for the (asynchronous) answer. How to handle this without spinning or the like?
This makes no sense, think about it.
Of course this makes sense. IPC is the only way to ask any component of the system to do anything. This includes the kernel.
BEEEEP! You didn't think.

IPC = Inter-Process Communication. Any method of passing data between processes running in separate address spaces.

The kernel isn't a process. It's a kernel. You don't use inter-process communication to interact with the kernel. You use system calls to interact with the kernel. Do you want asynchronous system calls? Because that's not the same as asynchronous IPC.
User avatar
AndrewAPrice
Member
Member
Posts: 2305
Joined: Mon Jun 05, 2006 11:00 pm
Location: USA (and Australia)

Re: Fully asynchronous IPC howto

Post by AndrewAPrice »

Craze Frog wrote:The kernel isn't a process. It's a kernel. You don't use inter-process communication to interact with the kernel. You use system calls to interact with the kernel. Do you want asynchronous system calls? Because that's not the same as asynchronous IPC.
That line is blurred even further in my OS where all applications exist in a single namespace and calling a function inside of a library may be executing local code or another program's code (without a significant difference in performance).
My OS is Perception.
User avatar
Brendan
Member
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!
Contact:

Re: Fully asynchronous IPC howto

Post by Brendan »

Hi,
jal wrote:As for a "pure message passing IPC", from your posts I understand this to mean a system that has no shared state among objects, basically one thread per process? That is, at least, a system I could see that locking would normally not be needed, but perhaps not very practical in terms of both performance and memory usage.
Slow code that scales better can be faster than fast code that doesn't scale!

Imagine a small motorbike that's capable of doing a 200 Km trip in 90 minutes using 1 litre of fuel. Now imagine a massive truck that can do the same 200 Km trip in 120 minutes using 50 litres of fuel. Which vehicle is faster, and which vehicle is the most fuel efficient?

If you want to visit a friend the small motorbike would be faster and more fuel efficient, and the truck would be a waste of time and a huge waste of fuel. However, what if you need to deliver 1000 computers? With the motorbike you'd only be able to take one computer at a time and you'd need to make 1000 trips - it'd take 180000 minutes and use 2000 litres of fuel (including return trips). The "slow and inefficient" truck might be able to take 100 computers at a time, and might take 24000 minutes and 1000 litres of fuel (including return trips). The "fast and efficient" motorbike would be a huge waste of time and a waste of fuel.

For a complex process with many threads, traditional locking is a nightmare - it's easy to get locking wrong and end up with random (timing dependant) faults that are extremely difficult to find/fix, and extremely hard to write software that scales well.

For "pure messaging passing" you'd break the complex process into many separate objects. You can't get the locking wrong (there is none) and it's relatively easy to make it scale (most large super-computers with thousands of CPUs use message passing because of this).

For RAM usage, there isn't too much difference - "pure messaging passing" uses more RAM for code, etc, but that's mostly insignificant because most RAM is used by data anyway. For example, for a specific piece of software, a single threaded process might have 100 KiB of code and use 100 MiB of data, traditional locking with multiple threads might have 120 KiB of code and use the same 100 MiB of data, and "pure message passing" might have 150 KiB of code and the same 100 MiB of data.

Basically (to put it in very general and over-simplified terms) traditional locking is good until you've got more than 4 or 8 CPUs, and then it starts to suck due to scalability problems; while "pure message passing" sucks when there isn't many CPUs but doesn't have scalability problems - it's better for 8 or 16 CPUs, and for hundreds or thousands of CPUs it's a clear winner.

Now, look back in history. In the past, most (80x86) computers have had less than 4 or 8 CPUs, and in the past traditional locking has been better. Now, look into your crystal ball and tell me what you see! Most new computers have 4 or 8 (or 6) CPUs now, both Intel and AMD are planning "8 cores wth SMT" (16 logical CPUs) for their next chips, and Intel was showing off a 48-core prototype a few weeks ago. My crystal ball says that era of traditional locking is coming to an end, and a major change in the way software is written/designed is on the way. :D
jal wrote:
All subsequent file I/O operations (read, write, seek, flush, close, etc) use the file handle in the requests and the replies. This means you can have any number of file I/O requests/replies and know exactly what each message is for.
How would you handle a series of independent read operations, e.g. "read 500 bytes from offset 1024", "read 8412 bytes from offset 23556", "read 20 bytes from offset 53847"? Of course you could include all parameters in the reply, but this may not always be practical.
Sounds entirely practical to me.. :)
jal wrote:
My advice here is: don't use messaging for the kernel API. Instead use a normal API and normal system calls (e.g. call gate, software interrupt, SYSCALL/SYSENTER, etc) for all things the micro-kernel does itself. This includes "send_message()" and "get_message()"; but it may also include things like "alloc_page()", "spawn_thread()", "get_time()", etc if these things aren't done by user-level processes/services.
Well, I'm striving for having a really, really small kernel, so basically "send_message" is the only think the kernel API offers. There's a "services" component that runs in ring 0, but that's just for the lower-level user space services like the memory manager and scheduler. It'd probably be impractical to use messages for those things, I'm still not certain how to handle.
With asynchronous messaging, the sender creates a message, the message is stored somewhere, then (later on) the receiver gets the message. How is the RAM needed to create/store the message allocated? If you use messaging to allocate this RAM, then how is the RAM needed to create/store the "allocate some RAM" message going to be allocated?

A similar problem occurs with "scheduler runs as a separate task" - you can't send a message to the scheduler telling it to switch tasks, because the scheduler won't get CPU time until you do a task switch.

Both of these things do work with synchronous messaging (where sending a message implies an immediate task switch to the message receiver) - the message isn't stored for any length of time (and the scheduler doesn't have much control over when task switches occur).
jal wrote:
The important thing is that if you do implement user-level locks, they aren't any different to user-level locks in any other OS (for e.g. futexes).
Didn't know about futexes, but they indeed seem something worth looking at. The first question that comes to mind is what a user level thread would do when failing to obtain the lock: spinning until it can get it?
There's only 3 choices: spin until the lock becomes free, put the thread into a "waiting for lock" state (where the scheduler does a task switch and doesn't give the thread any CPU time until the lock becomes free), or a combination of both (e.g. if the conditions are right then spin, else enter a "waiting for lock" state).

Spinning is faster if the lock is held by another thread that's currently running on another CPU, and if the other thread will release the lock "soon". Otherwise it's a waste of time (and can lead to live-locks and lock contention problems - e.g. all CPUs spinning on lock/s that won't be freed until other tasks get CPU time).

Entering a "waiting for lock" state sounds like lots of overhead (switching to CPL=0, doing task switches, etc), but it's a little bit like the "motorbike vs. truck" thing: "lots of overhead" might be a lot less overhead.


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
jal
Member
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: Fully asynchronous IPC howto

Post by jal »

Craze Frog wrote:The kernel isn't a process. It's a kernel. You don't use inter-process communication to interact with the kernel. You use system calls to interact with the kernel. Do you want asynchronous system calls? Because that's not the same as asynchronous IPC.
For the sake of discussion, I will resist scolding you for accusing me of not thinking. In my current design the actual kernel, i.e. the part that's mapped in the address space of every process, only does one thing: message passing (and then only parts of it). The only system call is "send message". All other things a traditional kernel does, even a typical micro kernel, is handled by user land processes (e.g. memory mangement, scheduling) or a supervisor process (OS services like changing page directories). So, in the current design, asking the OS for a critical section means requesting the OS services for it, by sending a message. I'm not saying that's a good design, or the final design, but it is the current one.


JAL
jal
Member
Member
Posts: 1385
Joined: Wed Oct 31, 2007 9:09 am

Re: Fully asynchronous IPC howto

Post by jal »

Brendan wrote:For RAM usage, there isn't too much difference - "pure messaging passing" uses more RAM for code, etc, but that's mostly insignificant because most RAM is used by data anyway. For example, for a specific piece of software, a single threaded process might have 100 KiB of code and use 100 MiB of data, traditional locking with multiple threads might have 120 KiB of code and use the same 100 MiB of data, and "pure message passing" might have 150 KiB of code and the same 100 MiB of data.
Maybe (probably) I'm missing something, but if one needs locking for access to data, wouldn't (part of) the data need duplication as well to avoid simultaneous access? And you'd need some kind of data synchronisation mechanism.

Also, if you do not share data, you'd have to have a page directory for each thread, which may be a bit costly as well.
Now, look into your crystal ball and tell me what you see!
I think we have the same crystal ball :).
With asynchronous messaging, the sender creates a message, the message is stored somewhere, then (later on) the receiver gets the message. How is the RAM needed to create/store the message allocated? If you use messaging to allocate this RAM, then how is the RAM needed to create/store the "allocate some RAM" message going to be allocated?
In my current design, each new process or thread (haven't decided yet whether I want this shared) gets at least some free memory, at least one page, for creating a message to send. Upon sending, the page with the message will be swapped out, and a new one swapped in (TBD whether that's a copy-on-write zero page or a newly allocated one). I haven't decided whether I want dedicated message memory or that any heap page is good for that.
A similar problem occurs with "scheduler runs as a separate task" - you can't send a message to the scheduler telling it to switch tasks, because the scheduler won't get CPU time until you do a task switch.
Tell me about it :). Not to mention sending a message to the memory manager, as the memory manager must map the message into the address space of the receiving process, in this case itself. Also, having to schedule the memory manager for every message is a bit on the ridiculous side. So in the current design, this is all done synchronously, although I don't like it.
Spinning is faster if the lock is held by another thread that's currently running on another CPU, and if the other thread will release the lock "soon". Otherwise it's a waste of time (and can lead to live-locks and lock contention problems - e.g. all CPUs spinning on lock/s that won't be freed until other tasks get CPU time).
Exactly. Waiting for someone to be scheduled on the same CPU while you are spinning away doesn't make much sense. On the other hand, if that would happen relatively sporadic, the overhead of needless spinning may be outweighed by the lack of entering a "wait for lock" state.
Entering a "waiting for lock" state sounds like lots of overhead (switching to CPL=0, doing task switches, etc), but it's a little bit like the "motorbike vs. truck" thing: "lots of overhead" might be a lot less overhead.
And, like the bike vs. truck thing, it depends on how often it is needed. Food for thought.

Thanks for your enlightening reply!


JAL
Post Reply