NVMe driver

Octocontrabass
Member
Posts: 5501
Joined: Mon Mar 25, 2013 7:01 pm

Re: NVMe driver

Post by Octocontrabass »

rdos wrote:This means the NVMe function must support at least 2 interrupts for a single namespace, which both of my devices do.
I don't think the NVMe spec requires devices to support more than a single interrupt, even though it's strongly recommended.
rdos
Member
Posts: 3268
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

Octocontrabass wrote:
rdos wrote:This means the NVMe function must support at least 2 interrupts for a single namespace, which both of my devices do.
I don't think the NVMe spec requires devices to support more than a single interrupt, even though it's strongly recommended.
It would be possible to use the first interrupt for the IO schedule too, if required, particularly since I don't use interrupts for the admin schedule; I poll for completion instead. That's because interrupts didn't work on the "slow" disc, but I might check whether they work now. I actually don't know why it suddenly worked on the IO schedule. I did write to the clear mask register, but it read back as zeros, so the interrupts should already have been unmasked.

I changed the logic again to use only a single submission queue (and completion queue) per namespace. I now post only one entry and then wait for its completion. This reduces parallelism a bit, but it's simple and probably good enough for the moment.

The PRP lists are a bit poorly explained too, which is problematic since the logic is implicit. However, my understanding now is that the first PRP entry is always filled in with the first physical address (which can have a non-zero offset into the page). Next, the second PRP entry is filled in either with another physical address pointing to data, or with the address of a PRP list, in which case the rest of the physical addresses go in the PRP list.

Since I now have only a single entry active at a time, I decided to preallocate one 4k page for the PRP list. A 4k page holds 512 eight-byte entries, each pointing to a 4k page, which means I can work with request sizes up to 2MB.
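As a rough illustration of the PRP rules described above, here is a minimal sketch in C. It assumes a 4 KiB controller page size and a single preallocated PRP list page with no chaining to further lists. The names build_prps, struct prp_setup, and the parameters are hypothetical and not taken from any real driver; the caller is assumed to supply the physical address of each page backing the buffer.

```c
#include <stddef.h>
#include <stdint.h>

#define NVME_PAGE_SIZE   4096u
#define PRP_LIST_ENTRIES (NVME_PAGE_SIZE / sizeof(uint64_t)) /* 512 */

/* Hypothetical holder for the two PRP fields of a command. */
struct prp_setup {
    uint64_t prp1;
    uint64_t prp2;
};

/*
 * pages[]       physical base address of every page backing the buffer
 * page_count    number of entries in pages[]
 * first_offset  byte offset of the data within pages[0]
 * prp_list      virtual address of the preallocated PRP list page
 * prp_list_phys physical address of that same page
 * Returns 0 on success, -1 if the request does not fit (assumed policy).
 */
int build_prps(const uint64_t *pages, size_t page_count, uint32_t first_offset,
               uint64_t *prp_list, uint64_t prp_list_phys,
               struct prp_setup *out)
{
    if (page_count == 0 || first_offset >= NVME_PAGE_SIZE)
        return -1;

    /* PRP1 always points at the first byte of data, offset included. */
    out->prp1 = pages[0] + first_offset;

    if (page_count == 1) {
        out->prp2 = 0;                 /* everything fits in one page */
    } else if (page_count == 2) {
        out->prp2 = pages[1];          /* PRP2 is a plain data pointer */
    } else {
        /* PRP2 points at the list; the list holds pages 1..n-1. */
        if (page_count - 1 > PRP_LIST_ENTRIES)
            return -1;
        for (size_t i = 1; i < page_count; i++)
            prp_list[i - 1] = pages[i];
        out->prp2 = prp_list_phys;
    }
    return 0;
}
```

With one list page this covers 512 list entries in addition to the page addressed by PRP1, which is consistent with the 2MB limit mentioned above. Larger transfers would need chained PRP lists, which this sketch deliberately leaves out.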

The toggle bit in the completion queue (the spec calls it the phase tag) is also a bit odd. There is a need to keep a toggle bit in the namespace config and initialize it to zero. Every time the completion queue wraps around, the saved toggle bit is inverted. Completion of an entry can thus be checked by XORing the saved toggle bit with the toggle bit in the queue entry. If the result is 1, the entry is done.
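The toggle-bit check described above could look roughly like the following sketch. It assumes the standard 16-byte completion entry layout with the toggle bit in bit 0 of the last 16-bit word, and a completion queue that was zeroed before the controller was enabled. The names struct cq_state and cq_poll are hypothetical, and the doorbell write that tells the controller the new head is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* 16-byte completion queue entry. */
struct nvme_cqe {
    uint32_t dw0;
    uint32_t dw1;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cid;
    uint16_t status;   /* bit 0 = toggle (phase) bit, bits 15:1 = status */
};

/* Hypothetical per-queue bookkeeping kept by the driver. */
struct cq_state {
    volatile struct nvme_cqe *entries;
    uint16_t size;          /* number of entries in the queue */
    uint16_t head;          /* next entry to inspect */
    uint8_t  saved_toggle;  /* starts at 0, inverted on every wrap */
};

/*
 * Returns true and fills *cid / *status if the entry at the head is done.
 * The caller still has to write the new head to the doorbell register.
 */
bool cq_poll(struct cq_state *cq, uint16_t *cid, uint16_t *status)
{
    volatile struct nvme_cqe *e = &cq->entries[cq->head];
    uint16_t raw = e->status;

    /* XOR of 0 means the controller has not written this slot yet. */
    if (((raw & 1u) ^ cq->saved_toggle) == 0)
        return false;

    *cid = e->cid;
    *status = (uint16_t)(raw >> 1);

    /* Advance the head; invert the saved bit when the queue wraps. */
    if (++cq->head == cq->size) {
        cq->head = 0;
        cq->saved_toggle ^= 1u;
    }
    return true;
}
```

Reading the status word before the rest of the entry matters on real hardware, and a read barrier may be needed between the two reads depending on the architecture; the sketch only relies on volatile accesses.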

Anyway, after these fixes, I initialized the drive for MBR, created a 200MB FAT16 partition and a couple of directory entries, and it all worked fine. I then rebooted and could see the new drive and the directories just as I had created them.
rdos
Member
Posts: 3268
Joined: Wed Oct 01, 2008 1:55 pm

Re: NVMe driver

Post by rdos »

I think a better algorithm for NVMe, and possibly AHCI as well, is to add requests to the schedule when they are added to the disc cache. Then I can have a dedicated thread per namespace that waits for the completion interrupt and notifies the disc cache, which for reads will wake up the requester and for writes will do nothing. This requires that writes in particular are sent in batches rather than one by one. I plan to handle file writes by having a kernel thread per process with open files; the thread scans pages for modification and then sends write requests to the disc cache per page. For updating FAT cluster chains, I have somewhat advanced logic that typically issues only a single write when adding or removing entries in a cluster chain. I plan to handle updating directory entries with a timer rather than directly (and by checking for differences). So, I think I have write logic that should allow posting writes directly to the NVMe disc schedule.

I normally have a C-scan algorithm in the disc cache which checks for active requests. This works well for mechanical discs (IDE), but not for modern memory-card-type discs, where it wastes time finding active requests. OTOH, I still need to visit cache entries so I can discard those that are not used. Each cache has a maximum setting for used physical memory, which can be adapted to the currently free physical memory.