Generic support for hardware data paths (CPU bypass)

immibis
Posts: 19
Joined: Fri Dec 18, 2009 12:38 am

Post by immibis »

While looking at my PinePhone schematic I noticed that both the cellular modem and the Bluetooth chip have direct audio connections to the CPU chip (which has some kind of built-in audio peripheral).

Presumably each chip can be configured to send or receive audio directly over that connection instead of forming it into packets and sending over USB/SDIO - which can reduce latency and possibly save some power.

I also recalled that in some other phones the modem and Bluetooth chip have direct audio connections to each other, allowing the CPU to remain off during a phone call with a Bluetooth headset! Then I wondered how an operating system would have to be designed to allow this kind of feature to be used effectively.

---

Suppose I'm writing a generic application that makes phone calls, on some hypothetical operating system that runs on many different phones. By default I'll write something equivalent to

Code:

cat < /dev/bluetooth/headset0/microphone > /dev/phonecall &
cat < /dev/phonecall > /dev/bluetooth/headset0/speaker

On devices with this hardware fast-path, this is inefficient. Even assuming the best hypothetical OS that uses DMA directly to and from user-space buffers with no context switching, the fact the app uses the CPU at all is a waste of battery power. It should be activate_fastpath /dev/phonecall /dev/bluetooth/headset0; sleep forever. But my app is supposed to be generic, so it can't rely on this activate_fastpath command and it can't rely on this feature existing.

So what kind of interface would an OS have to have, to solve this kind of problem? I suspect something like a dataflow graph API: the app would have to do something like

Code:

find_node N1 phonecall_input
find_node N2 bluetooth/headset0/speaker
link_node N1 N2
find_node N3 bluetooth/headset0/microphone
find_node N4 phonecall_output
link_node N3 N4
commit_graph
sleep forever

(I think this is also how atomic modeset works on Linux)

Now the CPU isn't even pretending to see the bits. The OS/driver system is able to process this flow graph and (via some magic algorithm) find the most efficient way to implement it. If the data can be passed directly then it does so; if zero-copy DMA is available it uses that; if not, then it starts a loop to read and write data. The API works the same way on every device, although the performance can differ and even the quality of the data stream may differ.
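
The "magic algorithm" could be as simple as a per-link preference order. Below is a minimal Python sketch of that idea, not a real OS interface: `Path`, `Platform`, `Graph` and `build_call_graph` are hypothetical names, and the ranking (hardware link, then DMA, then CPU copy loop) is just the order described above. Node names are the ones from the pseudo-commands.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Path(IntEnum):
    """Ways to move data for one link, cheapest first (hypothetical ranking)."""
    HARDWARE = 0   # chips wired together; the CPU can sleep
    DMA = 1        # zero-copy DMA; the CPU only sets it up
    CPU_COPY = 2   # fallback read/write loop, always available


@dataclass
class Platform:
    """Hypothetical description of what one device's hardware can do."""
    hardware_links: set = field(default_factory=set)  # {(src, dst), ...}
    dma_links: set = field(default_factory=set)

    def best_path(self, src, dst):
        if (src, dst) in self.hardware_links:
            return Path.HARDWARE
        if (src, dst) in self.dma_links:
            return Path.DMA
        return Path.CPU_COPY


class Graph:
    """The app only declares links; it never names a transport."""

    def __init__(self):
        self.links = []

    def link(self, src, dst):
        self.links.append((src, dst))

    def commit(self, platform):
        """Return the path chosen for every declared link."""
        return {link: platform.best_path(*link) for link in self.links}


def build_call_graph():
    g = Graph()
    g.link("phonecall_input", "bluetooth/headset0/speaker")
    g.link("bluetooth/headset0/microphone", "phonecall_output")
    return g
```

The same `build_call_graph()` runs unchanged on every device; only the `Platform` passed to `commit` differs, which is where the performance difference between devices would come from.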

If you do it this way you have a new problem: what if some processing is supposed to happen in the middle? I can't actually think of a case where an app would want to inject processing into phone call audio, other than perhaps background noise reduction, which you'd want to be enabled all the time in every scenario. But maybe there is one. Then what if the hardware supports fast-path processing that isn't quite identical to what the application asks for? How do you communicate to the application what it can ask for and get?
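
One possible shape for an answer, purely as an assumption: the app asks for a range it would accept rather than an exact setting, and the OS reports back what it chose and whether the fast path survived. The sketch below is Python with hypothetical names (`NoiseReduction`, `negotiate`) and a made-up unit (dB of reduction); it does not describe any existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseReduction:
    """Hypothetical request: any strength in [min_db, max_db] is acceptable."""
    min_db: int
    max_db: int


def negotiate(request, hardware_levels_db):
    """Pick a hardware-supported level inside the requested range, if any.

    Returns (level_db, on_fast_path). When no hardware level fits, the
    request is still honoured, but in software on the CPU.
    """
    usable = [lvl for lvl in hardware_levels_db
              if request.min_db <= lvl <= request.max_db]
    if usable:
        return max(usable), True
    return request.max_db, False
```

For example, `negotiate(NoiseReduction(6, 12), [3, 9, 20])` stays on the fast path at level 9, while the same request against hardware offering only `[3, 20]` falls back to software. The returned tuple is how the app learns what it actually got.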

---

Similar scenarios also apply to video data (some chips can probably forward camera data straight to the screen, or at least from the camera to the GPU), network traffic (there are definitely multi-port NICs with built-in switches), and giant supercomputers (which shuffle data between NIC and GPU, although that one is probably just plain old cross-device DMA).

Any interesting thoughts? This is just nerd sniping, not something I really need help with. There's probably no good answer.
nullplan
Member
Posts: 1801
Joined: Wed Aug 30, 2017 8:24 am

Post by nullplan »

Honestly, having the bus as part of the device path seems pretty horrible to me. The normal way is to have a collection of sound sources and a collection of sound sinks, and then maybe additional information about them on request. And then you ask the user to make a connection. If your application is supposed to be generic, it really should not hardcode that you wish to use a Bluetooth headset.
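
A minimal Python sketch of that arrangement, with hypothetical names (`Endpoint`, `AudioRegistry`): the app lists sources and sinks by label, and the bus only shows up if it explicitly asks for details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str        # human-readable label shown to the user
    direction: str   # "source" or "sink"
    bus: str         # extra detail, only reported on request


class AudioRegistry:
    """Flat collections of sources and sinks; no bus in the lookup path."""

    def __init__(self, endpoints):
        self._endpoints = list(endpoints)

    def sources(self):
        return [e.name for e in self._endpoints if e.direction == "source"]

    def sinks(self):
        return [e.name for e in self._endpoints if e.direction == "sink"]

    def describe(self, name):
        """Additional information about one endpoint, on request."""
        for e in self._endpoints:
            if e.name == name:
                return e
        raise KeyError(name)
```

The app would present `sources()` and `sinks()` to the user and connect whatever pair they pick, so nothing in the app mentions Bluetooth at all.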

In any case, I already see a problem with an attempted direct connection between sound source and sink: Audio format. It is unlikely that the phone device would have high-quality sound available (don't they usually have 8-bit sound on one channel at an embarrassing sample rate?). But for a high-quality audio device, there is little reason to support anything other than two-channel, 16-bit audio at 44100 or 48000 Hz. And so, you cannot directly connect sound source and sink, because you must convert formats. Changing sample-rate (which typically involves interpolation), sample width, and number of channels is no mean feat.
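
To make the three conversions concrete, here is a small Python sketch that widens unsigned 8-bit samples to signed 16-bit, resamples by linear interpolation, and duplicates mono into two channels. The function name `convert` and the choice of unsigned 8-bit input are assumptions for illustration. Plain linear interpolation is the crudest option; a real resampler would also filter, which is where most of the work goes.

```python
def convert(samples_u8, src_rate, dst_rate):
    """Convert unsigned 8-bit mono to signed 16-bit stereo.

    Resampling uses linear interpolation between neighbouring input
    samples. Returns a list of (left, right) tuples.
    """
    if not samples_u8:
        return []
    # Sample width: unsigned 8-bit (0..255) to signed 16-bit (-32768..32512).
    wide = [(s - 128) << 8 for s in samples_u8]
    n_out = len(wide) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        # Position of output sample i on the input timeline, in integer math:
        # pos is the input index, rem / dst_rate is the fraction past it.
        pos, rem = divmod(i * src_rate, dst_rate)
        a = wide[pos]
        b = wide[pos + 1] if pos + 1 < len(wide) else a  # hold the last sample
        value = a + (b - a) * rem // dst_rate
        out.append((value, value))  # channels: duplicate mono into both
    return out
```

Going from 8000 Hz to 48000 Hz this way produces six output frames per input sample, five of them interpolated.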

But if that wasn't a problem, if, for example, source and sink could be made to produce compatible sounds, then the OS could have an API to connect them together directly. However, that would also kill any possibility for error handling. What if the source cannot produce sounds fast enough? How will the sink handle underflow? If you really just set both peripherals to DMA the same buffer, then a source underrun would sound like a game crash-ash-ash-ash-ash.
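
The stuck-buffer effect comes from the sink re-reading stale data. A sketch of the alternative, in Python with a hypothetical `RingBuffer` class: the reader tracks how much has really been written, pads with silence when it runs dry, and counts the underrun so that something (CPU or firmware) can react.

```python
class RingBuffer:
    """Fixed-size sample queue shared by one producer and one consumer."""

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.capacity = capacity
        self.read_pos = 0    # both positions count total samples and
        self.write_pos = 0   # are only wrapped when indexing self.data
        self.underruns = 0

    def write(self, samples):
        """Store as many samples as fit; return how many were accepted."""
        free = self.capacity - (self.write_pos - self.read_pos)
        accepted = samples[:free]
        for s in accepted:
            self.data[self.write_pos % self.capacity] = s
            self.write_pos += 1
        return len(accepted)

    def read(self, count, silence=0):
        """Return exactly `count` samples, padding with silence on underrun."""
        available = self.write_pos - self.read_pos
        take = min(count, available)
        out = [self.data[(self.read_pos + i) % self.capacity]
               for i in range(take)]
        self.read_pos += take
        if take < count:
            self.underruns += 1  # record it so the error can be handled
            out += [silence] * (count - take)
        return out
```

Two peripherals blindly DMAing the same memory have no equivalent of `write_pos`, which is why the source underrun goes unnoticed in that setup.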

So all in all, it is unlikely two audio devices could be made to work together so completely that no CPU intervention was at all necessary, and it is far more likely that constant CPU intervention is necessary. But even if not, CPU intervention for error cases (or even just situations approaching errors) is likely necessary or at least beneficial.
xeyes
Member
Posts: 212
Joined: Mon Dec 07, 2020 8:09 am

Post by xeyes »

immibis wrote: So what kind of interface would an OS have to have, to solve this kind of problem?
Take a look at the HDA spec, esp. the part about the CODEC.

In short, even cheap Realtek CODECs can do digital and analog loopback either internally or externally. The former means feeding your output to your input (record what you're playing back), the latter means feeding your input to your output (direct monitoring).

Even HDA (without using 'vendor specific extension') can support much more than this. Specific extensions or custom interfaces can do even more.

immibis wrote: the fact the app uses the CPU at all is a waste of battery power.
Modern audio CODECs are not just hardware. They might have more CPU cores and way more complex firmware than "the app". They can be and usually are more energy efficient than the AP (application processor), though.
immibis
Posts: 19
Joined: Fri Dec 18, 2009 12:38 am

Post by immibis »

nullplan wrote: In any case, I already see a problem with an attempted direct connection between sound source and sink: Audio format. It is unlikely that the phone device would have high-quality sound available (don't they usually have 8-bit sound on one channel at an embarrassing sample rate?). But for a high-quality audio device, there is little reason to support anything other than two-channel, 16-bit audio at 44100 or 48000 Hz. And so, you cannot directly connect sound source and sink, because you must convert formats. Changing sample-rate (which typically involves interpolation), sample width, and number of channels is no mean feat.
Presumably the hardware designers thought about that before directly connecting two chips, and made sure they are compatible. Or it could be an analog signal.

If you're talking about the software path, actually I think auto-conversion makes sense. The goal is not "copy these bytes", it's "copy this sound". If you want to send the sound from the phone call to the Bluetooth headset, and they have incompatible formats, then obviously you want to do the conversion.

nullplan wrote: What if the source cannot produce sounds fast enough? How will the sink handle underflow? If you really just set both peripherals to DMA the same buffer, then a source underrun would sound like a game crash-ash-ash-ash-ash.
Then it's failed to act as a sound source.

nullplan wrote: So all in all, it is unlikely two audio devices could be made to work together so completely that no CPU intervention was at all necessary, and it is far more likely that constant CPU intervention is necessary.
Remember one of the scenarios is a direct connection between two chips, installed intentionally so that no CPU intervention is necessary. You could even imagine a direct analog audio connection. Some of these chips do have analog inputs and outputs.

xeyes wrote: Take a look at HDA spec, esp. the part about the CODEC.

In short, even cheap Realtek CODECs can do digital and analog loopback either internally or externally. The former means feeding your output to your input (record what you're playing back), the latter means feeding your input to your output (direct monitoring).

Even HDA (without using 'vendor specific extension') can support much more than this. Specific extensions or custom interfaces can do even more.
That's the hardware interface, but we don't expose raw registers to userspace.

Direct monitoring is a good scenario where this problem applies. I assume it's currently set using something like a hardware-specific ioctl (which toggles the register), and that's not very good.