NVMe driver

rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

NVMe driver

Post by rdos »

I'm on to the NVMe driver again, and this time it's more serious.

The first priority is to decide what to discard in the enormously complex specification. I have two NVMe discs: one low-cost and low-performance, and another very fast one that can handle 7 GB/s in read speed. I can read out the PCI BAR, and the two discs have a few commonalities:

1. Both of them support only 4k blocks, so it seems safe to refuse to work with discs that cannot handle 4k blocks.
2. Both of them support two interrupts on MSI. I think this can be used to assign the I/O queue to the second interrupt.
3. Both of them support more than one 4k page for the admin submit and complete queues. Thus, allocating one 4k page per admin queue seems reasonable.

I can successfully send the identify command, but I have not yet analysed the commonalities there. I would suspect namespaces and all the rest only have a single entry, and this is something that probably can be relied on too.

Another issue I have is sector size. The controller only seems to handle 4k requests, and FAT has 512-byte sectors. How does Windows format FAT on NVMe? Will it write 512 bytes per sector in the boot record?
devc1
Member
Member
Posts: 439
Joined: Fri Feb 11, 2022 4:55 am
Location: behind the keyboard

Re: NVMe driver

Post by devc1 »

How does Windows format FAT on NVMe? Will it write 512 bytes per sector in the boot record?
Yes, the boot sector is always 512 bytes, so you just have to read the first 512 bytes of the sector.

You should look for the sector size value in the boot sector, as it should match the sector size of the drive.

Also, instead of asking and waiting a long day for a response, you could have just tested the values and seen how they point to different sectors. For example, I would look for the FAT sysinfo LBA, read that LBA using both 512-byte sectors and 4096-byte sectors, and see which one points to the correct sector. That would show me how things actually work in a filesystem.
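The check suggested above can be sketched in C. This is a minimal sketch, not code from any driver in this thread: the function names are made up for illustration, and it assumes the standard FAT BPB layout, where the 16-bit little-endian bytes-per-sector field sits at offset 11 and the 0x55 0xAA signature sits at offsets 510 and 511.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the bytes-per-sector value stored in a FAT
 * boot sector, or 0 if the buffer does not look like a valid boot sector. */
static uint16_t fat_bytes_per_sector(const uint8_t *boot, size_t len)
{
    /* The BPB and the boot signature both live in the first 512 bytes. */
    if (len < 512 || boot[510] != 0x55 || boot[511] != 0xAA)
        return 0;

    /* BPB bytes-per-sector: 16-bit little-endian value at offset 11. */
    uint16_t bps = (uint16_t)(boot[11] | (boot[12] << 8));

    /* FAT only allows these four values. */
    switch (bps) {
    case 512: case 1024: case 2048: case 4096:
        return bps;
    default:
        return 0;
    }
}

/* Hypothetical helper: nonzero if the boot sector's sector size equals the
 * logical block size reported by the drive. */
static int fat_sector_size_matches(const uint8_t *boot, size_t len,
                                   uint32_t drive_lba_bytes)
{
    uint16_t bps = fat_bytes_per_sector(boot, len);
    return bps != 0 && bps == drive_lba_bytes;
}
```

A driver would read LBA 0 with the block size the drive reports and pass the buffer to `fat_sector_size_matches`; a mismatch means the volume was formatted with a different logical sector size.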
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: NVMe driver

Post by Octocontrabass »

rdos wrote:1. Both of them support only 4k blocks, so it seems safe to refuse to work with discs that cannot handle 4k blocks.
Many NVMe SSDs support 512-byte blocks, and some of them are factory-formatted with 512-byte blocks.
rdos wrote:2. Both of them support two interrupts on MSI. I think this can be used to assign the I/O queue to the second interrupt.
Most NVMe drives support more than two interrupts, but that may only be through MSI-X. NVMe is meant to be used with one set of I/O queues per logical CPU to avoid IPC.
rdos wrote:3. Both of them support more than one 4k page for the admin submit and complete queues. Thus, allocating one 4k page per admin queue seems reasonable.
That should be more than enough for the admin queues.
rdos wrote:I would suspect namespaces and all the rest only have a single entry, and this is something that probably can be relied on too.
Support for multiple namespaces is rare, but they do exist.
rdos wrote:Another issue I have is sector size. The controller only seems to handle 4k requests, and FAT has 512-byte sectors. How does Windows format FAT on NVMe? Will it write 512 bytes per sector in the boot record?
...I'll have to dig up a 4k-sectored disk and get back to you on that.
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

The "slow" & cheap controller only supports two IO queues, while the fast one supports 129 IO queues. Both only have a single namespace. I think I might want to use a read queue and a write queue. The read queue often involves threads waiting for results, and can only have a single active entry. The write queue doesn't need any notification and all writes can be queued on the IO queue.

The identify command (type 1) returns hundreds of parameters, but only three are of any interest. These are the minimum and maximum IO queue entry sizes (however, both discs report the minimum and maximum as the structure size given in the specification) and the number of namespaces.

I now realize there are three specifications, and the third also describes identify command type 0, which takes an NSID and reports disc parameters like sector size and number of sectors.
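For reference, the three controller parameters mentioned above can be pulled out of the 4096-byte Identify Controller data roughly like this. It is only a sketch: the struct and function names are invented here, and the byte offsets (SQES at 512, CQES at 513, NN at 516..519) are taken from the NVMe base specification as I understand it, so verify them against the revision you target.

```c
#include <stdint.h>

/* Hypothetical container for the few Identify Controller fields of interest. */
struct nvme_ctrl_limits {
    uint32_t sqe_min_bytes;   /* required submission queue entry size */
    uint32_t sqe_max_bytes;   /* maximum submission queue entry size  */
    uint32_t cqe_min_bytes;   /* required completion queue entry size */
    uint32_t cqe_max_bytes;   /* maximum completion queue entry size  */
    uint32_t num_namespaces;  /* NN: number of namespaces             */
};

static void nvme_parse_identify_controller(const uint8_t id[4096],
                                           struct nvme_ctrl_limits *out)
{
    uint8_t sqes = id[512];   /* low nibble: required, high nibble: maximum */
    uint8_t cqes = id[513];

    /* Both fields store sizes as powers of two. */
    out->sqe_min_bytes = 1u << (sqes & 0x0F);
    out->sqe_max_bytes = 1u << (sqes >> 4);
    out->cqe_min_bytes = 1u << (cqes & 0x0F);
    out->cqe_max_bytes = 1u << (cqes >> 4);

    /* NN is a 32-bit little-endian value. */
    out->num_namespaces = (uint32_t)id[516]
                        | ((uint32_t)id[517] << 8)
                        | ((uint32_t)id[518] << 16)
                        | ((uint32_t)id[519] << 24);
}
```

A controller that reports 0x66 and 0x44 in those two bytes has 64-byte submission entries and 16-byte completion entries, which matches the structure sizes in the specification.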
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

Octocontrabass wrote: Most NVMe drives support more than two interrupts, but that may only be through MSI-X. NVMe is meant to be used with one set of I/O queues per logical CPU to avoid IPC.
That won't fit with my VFS model. I only have one server thread per disc device, and it will run on a suitable CPU. I think I will use a read queue and a write queue only, provided the NVMe hardware supports at least two IO queues.
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

devc1 wrote:
How does Windows format FAT on NVMe? Will it write 512 bytes per sector in the boot record?
Yes, the boot sector is always 512 bytes, so you just have to read the first 512 bytes of the sector.

You should look for the sector size value in the boot sector, as it should match the sector size of the drive.

Also, instead of asking and waiting a long day for a response, you could have just tested the values and seen how they point to different sectors. For example, I would look for the FAT sysinfo LBA, read that LBA using both 512-byte sectors and 4096-byte sectors, and see which one points to the correct sector. That would show me how things actually work in a filesystem.
I'm still a bit away from being able to read anything from the disc, but things are advancing. :-)
devc1
Member
Member
Posts: 439
Joined: Fri Feb 11, 2022 4:55 am
Location: behind the keyboard

Re: NVMe driver

Post by devc1 »

rdos wrote:
I'm still a bit away from being able to read anything from the disc, but things are advancing. :-)
I would rather open the disk in a hex editor; it will be much easier to read and to see how the file system works.
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: NVMe driver

Post by Octocontrabass »

rdos wrote:That won't fit with my VFS model. I only have one server thread per disc device, and it will run on a suitable CPU.
It sounds like your VFS model might be a bottleneck for NVMe.
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

Octocontrabass wrote:
rdos wrote:That won't fit with my VFS model. I only have one server thread per disc device, and it will run on a suitable CPU.
It sounds like your VFS model might be a bottleneck for NVMe.
Not necessarily. The important parameters are no copying of data, long requests and as little overhead as possible. Also, it is only file read and write that are performance-sensitive. The read operation has no copy, and the driver will get an array of 4k pages that it links to the queues. The 4k pages are then mapped in userspace where the read operation is done. When a file is accessed sequentially, the read operation will notice this and send ahead-of-time reads to the disc driver. This means that everything works in parallel. The write operation is not yet done, but the sensitive stage is increasing file size. File data writes will be done by queueing them on the write queue, and writes will be detected by the dirty page bit. There is no wait for write data.

A design where many CPU cores queue things on NVMe seems like a design with no cache where userspace sends reads directly to the disc driver. Given that the low-end NVMe disc doesn't support this, I don't think popular OSes use this scheme.
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

The namespace identify (type 0) now works, and just as with the other identify results, only a few parameters are useful: the capacity of the drive (number of LBA sectors) and the bytes per sector. Bytes per sector is hidden in an array with one mandatory element, and I really fail to see why it's designed like that. Both my discs have 512 bytes per sector. The set ID is also needed, but it's zero on both discs. A strange thing is that on the high-end disc, NSIZE and NCAP both have the same value, but NUSE is zero. I'm unsure why this is so, and if it's important. I initially used NUSE as the size of the disc, but switched to NCAP instead.

Next, I created the IO completion queue (one per namespace), and two IO submission queues.
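As an illustration of that step, here is a rough sketch of how the two admin commands could be built. The struct and function names are hypothetical. The opcodes (05h for Create I/O Completion Queue, 01h for Create I/O Submission Queue) and the CDW10/CDW11 bit layouts follow the NVMe base specification as I read it. The sketch assumes a little-endian host and physically contiguous queue memory.

```c
#include <stdint.h>
#include <string.h>

/* 64-byte submission queue entry, common layout (little-endian host assumed). */
struct nvme_admin_sqe {
    uint32_t cdw0;            /* bits 7:0 opcode, bits 31:16 command ID */
    uint32_t nsid;
    uint32_t cdw2, cdw3;
    uint64_t mptr;
    uint64_t prp1, prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};
_Static_assert(sizeof(struct nvme_admin_sqe) == 64, "SQE must be 64 bytes");

enum { NVME_ADMIN_CREATE_IO_SQ = 0x01, NVME_ADMIN_CREATE_IO_CQ = 0x05 };

/* Build "Create I/O Completion Queue": contiguous, interrupts enabled. */
static void nvme_build_create_io_cq(struct nvme_admin_sqe *c, uint16_t cid,
                                    uint16_t qid, uint16_t entries,
                                    uint64_t queue_phys, uint16_t vector)
{
    memset(c, 0, sizeof *c);
    c->cdw0  = NVME_ADMIN_CREATE_IO_CQ | ((uint32_t)cid << 16);
    c->prp1  = queue_phys;
    c->cdw10 = ((uint32_t)(entries - 1u) << 16) | qid;  /* QSIZE is 0-based */
    c->cdw11 = ((uint32_t)vector << 16)                 /* interrupt vector */
             | (1u << 1)                                /* IEN: interrupts on */
             | (1u << 0);                               /* PC: contiguous */
}

/* Build "Create I/O Submission Queue" bound to completion queue `cqid`. */
static void nvme_build_create_io_sq(struct nvme_admin_sqe *c, uint16_t cid,
                                    uint16_t qid, uint16_t entries,
                                    uint64_t queue_phys, uint16_t cqid)
{
    memset(c, 0, sizeof *c);
    c->cdw0  = NVME_ADMIN_CREATE_IO_SQ | ((uint32_t)cid << 16);
    c->prp1  = queue_phys;
    c->cdw10 = ((uint32_t)(entries - 1u) << 16) | qid;
    c->cdw11 = ((uint32_t)cqid << 16) | (1u << 0);      /* PC=1, QPRIO=0 */
}
```

With one completion queue and a read and a write submission queue, that would be one `nvme_build_create_io_cq` call for queue 1 followed by two `nvme_build_create_io_sq` calls for queues 1 and 2, both pointing at completion queue 1. The completion queue has to exist before the submission queues that reference it, and `entries` must not exceed what the controller advertises in CAP.MQES.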

Once these queues were functional, I could read data from the discs. It turns out they both have the boot sector zeroed (although one had 4 bytes of data at offset 0x1B8).

The next step is to integrate with the VFS and see if I can initialize the disc for MBR and create partitions. I have this working with the USB disc driver, and I can see how it performs with my USB analyser.
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

devc1 wrote: I would rather open the disk in a hex editor; it will be much easier to read and to see how the file system works.
The USB disc driver already has the correct structure, and when I initialize a disc with my OS and create partitions (and files), I can then move it to a Windows machine and verify that it's correct. I even use the repair tool to check for errors, but there are none. Given that the NVMe has 512 bytes per sector too, the structure should be the same.

My USB analyser also allows me to see exactly what data is read/written and to which sectors. It's a very nice tool.
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: NVMe driver

Post by Octocontrabass »

rdos wrote:A design where many CPU cores queue things on NVMe seems like a design with no cache where userspace sends reads directly to the disc driver.
Or the cache is thread-safe. But now that you mention it, high-end NVMe is fast enough that some userspace applications (e.g. large databases) might prefer to bypass the OS cache.
rdos wrote:Given that the low-end NVMe disc doesn't support this, I don't think popular OSes use this scheme.
The Linux NVMe driver prefers to assign one set of I/O queues per logical CPU, but it still works when there aren't enough queues.
rdos wrote:Bytes per sector is hidden in an array with one mandatory element, and I really fail to see why it's designed like that.
It gives you a list of all supported sector sizes. You can format the namespace to change the sector size.
rdos wrote:A strange thing is that on the high-end disc, NSIZE and NCAP both have the same value, but NUSE is zero. I'm unsure why this is so, and if it's important. I initially used NUSE as size of the disc, but switched to NCAP instead.
NSIZE is the size of the disk. NCAP is how much data the disk can actually hold. NUSE is how much data the disk currently holds. You should see NUSE go up as you write data and go back down as you erase it.
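To make the array lookup concrete, here is a sketch of decoding the Identify Namespace data. The names are hypothetical. The offsets come from the NVMe specification as I understand it: the size field (called NSZE in the spec, NSIZE in this thread) at bytes 0..7, NCAP at 8..15, NUSE at 16..23, FLBAS at byte 26, and the LBA format array starting at byte 128 with 4 bytes per entry. It only uses the low four bits of FLBAS as the format index, which is an assumption that holds for drives with at most 16 formats.

```c
#include <stdint.h>

/* Hypothetical container for the namespace fields discussed in this thread. */
struct nvme_ns_geometry {
    uint64_t nsze;            /* namespace size in logical blocks     */
    uint64_t ncap;            /* namespace capacity in logical blocks */
    uint64_t nuse;            /* namespace utilization in blocks      */
    uint32_t lba_bytes;       /* bytes per logical block              */
    uint16_t metadata_bytes;  /* metadata bytes per block (usually 0) */
};

/* Read a 64-bit little-endian value from a byte buffer. */
static uint64_t nvme_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Returns 0 on success, -1 if the current LBA format looks invalid. */
static int nvme_parse_identify_namespace(const uint8_t id[4096],
                                         struct nvme_ns_geometry *out)
{
    out->nsze = nvme_le64(id + 0);
    out->ncap = nvme_le64(id + 8);
    out->nuse = nvme_le64(id + 16);

    /* FLBAS bits 3:0 select which entry of the format array is in use. */
    unsigned fmt = id[26] & 0x0F;
    const uint8_t *lbaf = id + 128 + 4 * fmt;

    /* Entry layout: bits 15:0 metadata size, bits 23:16 LBADS (log2 size). */
    uint8_t lbads = lbaf[2];
    if (lbads < 9 || lbads > 31)
        return -1;            /* smaller than 512 bytes is not allowed */

    out->metadata_bytes = (uint16_t)(lbaf[0] | (lbaf[1] << 8));
    out->lba_bytes = 1u << lbads;
    return 0;
}
```

The other entries in the array are the formats the namespace could be reformatted to, which is why the sector size is reached through an index rather than stored as a single field.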
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

Octocontrabass wrote:But now that you mention it, high-end NVMe is fast enough that some userspace applications (e.g. large databases) might prefer to bypass the OS cache.
Right. A database would likely want to bypass the OS cache since it wants the cache in userspace where it has direct access instead of in kernel space.

Octocontrabass wrote:It gives you a list of all supported sector sizes. You can format the namespace to change the sector size.
If the device supports formatting, and I'm not sure that any of my drives do.
Octocontrabass wrote: NSIZE is the size of the disk. NCAP is how much data the disk can actually hold. NUSE is how much data the disk currently holds. You should see NUSE go up as you write data and go back down as you erase it.
So I should use NSIZE for the highest sector number (less one) that the drive supports reading and writing?
Octocontrabass
Member
Member
Posts: 5568
Joined: Mon Mar 25, 2013 7:01 pm

Re: NVMe driver

Post by Octocontrabass »

rdos wrote:If the device supports formatting, and I'm not sure that any of my drives do.
Any device that supports multiple sector sizes will support formatting, but you're right, it's an optional command.
rdos wrote:So I should use NSIZE for the highest sector number (less one) that the drive supports reading and writing?
Yes. Some drives support namespace management commands, so it's possible for the namespace to be smaller than the actual highest sector number the drive supports, but that should be pretty rare.
rdos
Member
Member
Posts: 3297
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

OK, so the driver is linked to the VFS, and the partition server reads sector 0 and concludes there are no partitions. That's pretty good progress.

I decided that I will use int 0 for the admin commands, and int 1-n for the namespaces. This means the NVMe function must support at least 2 interrupts for a single namespace, which both of my devices do. At the moment, I only check for MSI and discard MSI-X, which is more complex to set up. The high-speed device works with interrupts now, while I'm not sure about the low-speed one. It didn't work with the admin interrupt earlier, but I'll see if I can fix it tomorrow.
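The "at least 2 interrupts" check can be sketched against the MSI capability's Message Control register. This is an illustration only: the helper names are made up, and reading or writing PCI configuration space is left out. The bit layout assumed is the standard one: bit 0 is MSI enable, bits 3:1 are Multiple Message Capable and bits 6:4 are Multiple Message Enable, both encoded as log2 of the vector count.

```c
#include <stdint.h>

/* Number of vectors the function says it can use (1, 2, 4, ... 32). */
static unsigned msi_vectors_supported(uint16_t msg_ctl)
{
    return 1u << ((msg_ctl >> 1) & 0x7);
}

/* Hypothetical helper: compute a new Message Control value that enables MSI
 * with at least `wanted` vectors. MSI only grants power-of-two counts, so the
 * request is rounded up. Returns 0 on success, -1 if the device cannot
 * provide that many vectors. */
static int msi_request_vectors(uint16_t msg_ctl, unsigned wanted,
                               uint16_t *new_ctl)
{
    unsigned capable_log2 = (msg_ctl >> 1) & 0x7;
    if (capable_log2 > 5)
        return -1;                      /* encodings 6 and 7 are reserved */

    unsigned want_log2 = 0;
    while ((1u << want_log2) < wanted)
        want_log2++;
    if (want_log2 > capable_log2)
        return -1;

    *new_ctl = (uint16_t)((msg_ctl & ~(0x7u << 4))  /* clear old MME   */
                          | (want_log2 << 4)        /* set new MME     */
                          | 1u);                    /* MSI enable bit  */
    return 0;
}
```

With two vectors granted, the message data written to the device must have its low bit clear, since the function replaces the low bits of the data value with the vector index (0 for admin, 1 for the I/O completion queue in the scheme above).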

I still have some issues with PRP lists, which will be needed for longer requests, but they should be relatively easy to fix. I embedded the namespace function with the small block allocator that has fast linear-to-physical translation. The problem otherwise is that if more than two 4k pages are requested, the driver must allocate a PRP list and put the physical addresses there. I don't want to do this with "malloc", so I use the same function as I use for schedules for USB devices.
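The rule for when a PRP list is needed can be sketched as follows. This is a simplified illustration, not the driver's code: the function name is hypothetical, it assumes a 4k controller page size, page-aligned buffers, and a caller-supplied list page whose physical address is already known. It does not chain lists, so it handles at most 513 pages per command.

```c
#include <stddef.h>
#include <stdint.h>

#define NVME_PAGE_SIZE 4096u

/* Fill PRP1/PRP2 for a transfer covering `count` page-aligned 4k pages.
 * `pages` holds the physical address of each page. `list` is the virtual
 * address of a 4k page reserved for the PRP list and `list_phys` is its
 * physical address; both are only used when count > 2.
 * Returns 0 on success, -1 if the request cannot be described. */
static int nvme_fill_prps(const uint64_t *pages, size_t count,
                          uint64_t *list, uint64_t list_phys,
                          uint64_t *prp1, uint64_t *prp2)
{
    if (count == 0)
        return -1;

    *prp1 = pages[0];

    if (count == 1) {          /* one page: PRP2 is unused */
        *prp2 = 0;
        return 0;
    }
    if (count == 2) {          /* two pages: PRP2 is the second page */
        *prp2 = pages[1];
        return 0;
    }

    /* Three or more pages: PRP2 points to a list of the remaining pages. */
    size_t entries = count - 1;
    if (list == NULL || entries > NVME_PAGE_SIZE / sizeof(uint64_t))
        return -1;             /* would need a chained list; not handled */

    for (size_t i = 0; i < entries; i++)
        list[i] = pages[i + 1];

    *prp2 = list_phys;
    return 0;
}
```

Because the list page must stay valid until the command completes, taking it from a per-namespace block allocator with a cheap linear-to-physical lookup fits this pattern well: the driver gets both `list` and `list_phys` without a page-table walk.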

I put the basic data & allocator, the doorbell, the read & write submission queues and the completion queue in a 20k linear address area with paging. That way everything that is needed is in this area, and a selector can be defined which will make sure only this area is accessed and nothing else.
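For the doorbell page in particular, the register offsets inside BAR0 depend on the doorbell stride in the CAP register. A small sketch, with hypothetical function names, assuming the layout from the NVMe specification (doorbells start at offset 0x1000 and CAP.DSTRD is in bits 35:32):

```c
#include <stdint.h>

/* Distance in bytes between consecutive doorbell registers. */
static uint32_t nvme_doorbell_stride(uint64_t cap)
{
    return 4u << ((cap >> 32) & 0xF);   /* CAP.DSTRD, bits 35:32 */
}

/* BAR0 offset of the tail doorbell of submission queue `qid`. */
static uint32_t nvme_sq_tail_doorbell(uint64_t cap, uint16_t qid)
{
    return 0x1000u + (2u * qid) * nvme_doorbell_stride(cap);
}

/* BAR0 offset of the head doorbell of completion queue `qid`. */
static uint32_t nvme_cq_head_doorbell(uint64_t cap, uint16_t qid)
{
    return 0x1000u + (2u * qid + 1u) * nvme_doorbell_stride(cap);
}
```

With a stride of 4 bytes, the admin queue pair plus one completion queue and two submission queues use offsets 0x1000 through 0x1013, so mapping the single 4k page at BAR0 + 0x1000 covers all doorbells this design needs.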

The read code is very simple and fits perfectly with the VFS interface.

Another optimization I will do is to write all static fields in the submission queues at creation time, and then only update the dynamic fields. This should make a difference as the queue entries are 64 bytes, and I only need to write around 20 bytes.
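A sketch of what that split between static and dynamic fields might look like for a read or write queue. The names are hypothetical and the field positions follow the NVM command set as I understand it (opcode 02h for Read and 01h for Write, starting LBA in CDW10/CDW11, zero-based block count in CDW12 bits 15:0). It assumes a little-endian host and leaves all optional bits in CDW12 to CDW15 at zero.

```c
#include <stdint.h>
#include <string.h>

/* 64-byte I/O submission queue entry (little-endian host assumed). */
struct nvme_io_sqe {
    uint32_t cdw0;            /* bits 7:0 opcode, bits 31:16 command ID */
    uint32_t nsid;
    uint32_t cdw2, cdw3;
    uint64_t mptr;
    uint64_t prp1, prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};

enum { NVME_CMD_WRITE = 0x01, NVME_CMD_READ = 0x02 };

/* Done once at queue creation: zero every entry and set the fields that
 * never change for this queue (opcode and namespace ID). */
static void nvme_prefill_queue(struct nvme_io_sqe *ring, unsigned entries,
                               uint8_t opcode, uint32_t nsid)
{
    for (unsigned i = 0; i < entries; i++) {
        memset(&ring[i], 0, sizeof ring[i]);
        ring[i].cdw0 = opcode;
        ring[i].nsid = nsid;
    }
}

/* Done per request: only the command ID, data pointers, starting LBA and
 * block count are rewritten. `blocks` must be at least 1. */
static void nvme_update_entry(struct nvme_io_sqe *e, uint16_t cid,
                              uint64_t lba, uint16_t blocks,
                              uint64_t prp1, uint64_t prp2)
{
    e->cdw0  = (e->cdw0 & 0xFFFFu) | ((uint32_t)cid << 16);
    e->prp1  = prp1;
    e->prp2  = prp2;
    e->cdw10 = (uint32_t)lba;
    e->cdw11 = (uint32_t)(lba >> 32);
    e->cdw12 = (uint32_t)(blocks - 1u);   /* NLB is zero-based */
}
```

In this sketch the per-request writes add up to 28 bytes (2 for the command ID, 16 for the PRPs, 8 for the LBA, 2 for the count), and PRP2 could be skipped for single-page requests if the entry was last used for one as well.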