Using FUSE as a VFS


Post by bzt »

rdos wrote:Nothing can solve this if the implementation isn't reentrant, other than only allowing one operation at a time.
Please define what you mean by "reentrant", because it's definitely not what the term normally means. Allowing one operation at a time is about concurrency (or rather the lack thereof), sometimes also called exclusive access, and has nothing to do with re-entrancy, which is about a function or library being called into again while a previous call is still in progress. For example, multiple tasks are allowed to call the fs task/subsystem, so the syscalls must be written in a way that allows re-entrancy: there's no guarantee that a previous write call made by one task has finished when another task calls the same write function with different parameters (hence the function must be re-entrant). This is independent of the issue of exclusive access to file system meta data (it might be that all the concurrent tasks are writing files on different mount points).
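To illustrate the distinction, a minimal sketch (hypothetical function names, not taken from any real driver):

Code: Select all

#include <stddef.h>
#include <string.h>

static char scratch[512];               /* shared state: NOT re-entrant */

/* If two tasks call this with different parameters, the second call can
 * overwrite scratch while the first one is still using it. */
size_t fs_write_unsafe(const void *buf, size_t len)
{
    if (len > sizeof(scratch))
        len = sizeof(scratch);
    memcpy(scratch, buf, len);
    /* ... submit scratch to the device ... */
    return len;
}

/* Re-entrant variant: all state lives in locals and caller-supplied memory,
 * so overlapping calls with different parameters cannot interfere. */
size_t fs_write_reentrant(const void *buf, size_t len, char *scratch_local)
{
    memcpy(scratch_local, buf, len);    /* caller provides >= len bytes */
    /* ... submit scratch_local to the device ... */
    return len;
}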
rdos wrote:Sure, you can create a queue of commands, but that basically is like putting a semaphore on the whole fuse module.
Most definitely not the same. A semaphore would block the caller, while a queue allows non-blocking async operations. There's a huge performance difference.
rdos wrote:
bzt wrote:
rdos wrote:Easy. You open the file in your application and let read/write sector use the file handle.
Now how would that be any different to the POSIX file abstraction, API-wise? There you open the file and you use read/write on the file handle (which then are translated into sector read/writes in the kernel if the handle is for a block device, but that's transparent to you).
Because it allows me to connect the read-write operations to anything I like, not just stuff that can be connected to file handles.
I'm confused. You wrote "let read/write sector use the file handle" and now you write "not just stuff that can be connected to file handles". Which one is it then? BTW, anything can be connected to file handles, that's the whole point of the UNIX philosophy. That's exactly what they mean by "everything is a file".
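As a minimal sketch of what that abstraction buys you (the device path is just an example):

Code: Select all

#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

/* Reads one 512-byte sector. The same open()/pread()/close() calls work
 * whether "dev" is a regular file, an image file, or a block device such
 * as /dev/sda; the kernel does the sector translation transparently. */
int read_sector(const char *dev, uint64_t lba, void *buf)
{
    int fd = open(dev, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, buf, 512, (off_t)(lba * 512));
    close(fd);
    return n == 512 ? 0 : -1;
}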
rdos wrote:Additionally, by using this interface you will always need to copy stuff between the kernel and a user mode buffer.
Nope, that's a totally independent issue, fuse or not. Look up mmap and munmap.
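A minimal sketch of that zero-copy direction (standard POSIX calls, error handling trimmed):

Code: Select all

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Maps a whole file read-only. The application and the page cache share the
 * same physical pages, so no extra copy between kernel and user buffers is
 * needed. Release the mapping later with munmap(ptr, len). */
void *map_whole_file(const char *path, size_t *len_out)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}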
rdos wrote:In rar, the i-node would be the file position of the object.
Okay, now I'm sure you have problems with your i-node definition. A file position is not what an i-node is. It wouldn't even map to i-node numbers either, because file position is unique, and multiple objects in the rar can't have the same position.
rdos wrote:Wrong. I-nodes are cluster numbers in FAT.
How could they be? One is a scalar, the other is a structure. Cluster numbers are just a simple index to an allocation table. They do not identify the file on the disk, they do not tell you if the pointed object is a directory or a file for example. Plus you can't assign the same cluster to multiple paths, that would be a FAT error (with the notable exception of "." and ".."). With i-nodes, it's perfectly fine for multiple paths to share the same i-node (these are called hard links), by having multiple directory entries store the same i-node number.
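A quick way to see that difference in practice (plain POSIX stat(), nothing fs specific):

Code: Select all

#include <sys/stat.h>

/* Returns 1 if the two paths are hard links to the same i-node,
 * 0 if not, -1 on error. Note that the comparison uses the i-node
 * number (plus the device id), never the paths themselves. */
int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino) ? 1 : 0;
}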
thewrongchristian wrote:But I wouldn't call it an i-node. It doesn't contain what an i-node in a UNIX like OS would contain. FAT has a combined directory entry/i-node structure, meaning there is a 1 to 1 mapping between a directory entry and a file.
Exactly.
thewrongchristian wrote:Identifying files. The FUSE operations tend to identify files by name.
I don't know about that. I actually haven't seen any fuse driver that wouldn't check the superblock for magic bytes, but you're right that fuse lacks a common way to identify which driver to call. Right now your only option is to call all drivers one after another and see which one doesn't fail. Not very optimal, however it only needs to be done once when mounting, then you can cache the driver's pointer. (Alternatively you could port libmagic and assign a magic definition to each and every ported fuse driver. Then you could use libmagic to select the appropriate fuse driver.)
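A sketch of the superblock-probing idea (only two well-known signatures shown, a real probe validates more fields; offsets assume 512-byte sectors):

Code: Select all

#include <stdint.h>

enum fs_type { FS_UNKNOWN, FS_EXT, FS_FAT };

/* sector0 and sector2 are the first and third 512-byte sectors of the
 * partition. ext2/3/4 keep the 0xEF53 magic at byte 56 of the superblock
 * (which starts at byte offset 1024); a FAT boot sector ends with 0x55 0xAA
 * (a weak signal on its own, real drivers check the BPB fields too). */
enum fs_type probe_fs(const uint8_t *sector0, const uint8_t *sector2)
{
    if (sector2[56] == 0x53 && sector2[57] == 0xEF)
        return FS_EXT;
    if (sector0[510] == 0x55 && sector0[511] == 0xAA)
        return FS_FAT;
    return FS_UNKNOWN;
}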
thewrongchristian wrote:Related to above, I want my FUSE interface to be stateless
Well, this isn't something that fuse could do anything about. Whether a fuse driver is stateless or not is totally up to the driver; fuse has no say in that matter.
thewrongchristian wrote:I'd have to check, but I'm not sure this handle goes back and forth over the protocol/API in FUSE.
It does not. How a fuse driver caches paths is totally up to the fuse driver. A FAT driver could use directory entry offsets as a "handle" if it wishes to do so. Fuse drivers with better performance tend to utilize the underlying storage device's bio_io_vec feature in the Linux kernel (so the fuse driver essentially creates a command list of what to read/write and from/to, and it sends that list to the kernel, not the data, nor the fuse driver's internal "handle" or id).
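As a sketch of the kind of driver-internal caching meant here (a hypothetical FAT-style driver where the "handle" is just a directory entry offset the driver keeps for itself; nothing below crosses the fuse protocol):

Code: Select all

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical driver-private cache: maps a path to the byte offset of its
 * directory entry on the volume. */
struct handle_cache_entry {
    char     path[256];
    uint64_t dirent_offset;
    int      used;
};

static struct handle_cache_entry cache[64];

uint64_t *cache_lookup(const char *path)
{
    for (int i = 0; i < 64; i++)
        if (cache[i].used && strcmp(cache[i].path, path) == 0)
            return &cache[i].dirent_offset;
    return NULL;
}

void cache_store(const char *path, uint64_t dirent_offset)
{
    for (int i = 0; i < 64; i++) {
        if (!cache[i].used) {
            strncpy(cache[i].path, path, sizeof(cache[i].path) - 1);
            cache[i].path[sizeof(cache[i].path) - 1] = '\0';
            cache[i].dirent_offset = dirent_offset;
            cache[i].used = 1;
            return;
        }
    }
    /* cache full: a real driver would evict an entry here */
}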
thewrongchristian wrote:Buffer sharing. I think the FUSE protocol sends buffer contents over the FUSE channel, rather than using some shared memory mechanism.
No, it's configurable. It's just a VFS layer, so it just proxies the read/write commands and tells the kernel where to access the data, but does not copy it. See direct_io. But you could copy the data from a fuse driver if you want to (for example, reading compressed contents from a zip archive would require the fuse driver to uncompress the data and return that uncompressed buffer rather than telling the kernel where the compressed data can be found). Again, this is totally up to the fuse driver.
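The switch mentioned here is literally a per-file flag the driver can set in its open handler; a minimal sketch, assuming libfuse 3:

Code: Select all

#define FUSE_USE_VERSION 31
#include <fuse.h>

/* Setting direct_io makes the kernel bypass its page cache for this file and
 * hand read/write requests straight to the driver; leaving it at 0 lets the
 * kernel cache and merge pages as usual. */
static int open_with_direct_io(const char *path, struct fuse_file_info *fi)
{
    (void)path;
    fi->direct_io = 1;
    return 0;
}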
thewrongchristian wrote:Licensing. I've not decided what license to use for my kernel, as and when I "release" it.
Doesn't matter. Fuse uses LGPL, which means you can link that with your code no matter your code's license.
rdos wrote:I'll create my own interface and then provide a FUSE layer on top of it.
Even though you're considering this for the wrong reasons, it is generally a very good idea.

Cheers,
bzt

Post by rdos »

bzt wrote:Please define what you mean by "reentrant", because it's definitely not what the term normally means. Allowing one operation at a time is about concurrency (or rather the lack thereof), sometimes also called exclusive access, and has nothing to do with re-entrancy, which is about a function or library being called into again while a previous call is still in progress. For example, multiple tasks are allowed to call the fs task/subsystem, so the syscalls must be written in a way that allows re-entrancy: there's no guarantee that a previous write call made by one task has finished when another task calls the same write function with different parameters (hence the function must be re-entrant). This is independent of the issue of exclusive access to file system meta data (it might be that all the concurrent tasks are writing files on different mount points).
So, can you explain how calling the fatfuse driver from multiple threads, which uses no synchronization primitives at all, would be safe and not potentially corrupt cluster chains or even the filesystem itself? And what's the difference between queuing commands and doing them one at a time & putting a huge semaphore on the fuse library, other than the problem that the caller wouldn't be blocked? There surely would be no parallel operations in the filesystem with any of those solutions.

Perhaps more important, how can I be sure that the NTFS and ext2/3/4 fuse implementations can safely be called from multiple threads without potentially crashing the filesystem?
bzt wrote:Now how would that be any different to the POSIX file abstraction, API-wise? There you open the file and you use read/write on the file handle (which then are translated into sector read/writes in the kernel if the handle is for a block device, but that's transparent to you).
I'll just conclude that I despise the whole concept. :-)

Having strange "files" in the filesystems that are devices or whatever simply is a horrible concept. I believe in the idea of defining proper APIs for device-types, and that users (programmers) should not be allowed to mix handles in arbitrary ways. Besides, nobody except experts with inside knowledge knows what is supported and what is not.
bzt wrote:Okay, now I'm sure you have problems with your i-node definition. A file position is not what an i-node is. It wouldn't even map to i-node numbers either, because file position is unique, and multiple objects in the rar can't have the same position.
I never said it was. I claimed it could be considered as a number that the filesystem is free to use in any way it wants to as long as it can identify a file or directory. For a rar it might be the position of the header, for FAT the start of the cluster chain and for a "file handle based" implementation the file handle.
bzt wrote: How could they be? One is a scalar, the other is a structure.
Ever heard of pointers? :-)

RAX = a scalar
RAX = cluster number
RAX = pointer to a file structure.

The x86 assembler wouldn't mind what RAX refers to. :-)
bzt wrote: Cluster numbers are just a simple index to an allocation table. They do not identify the file on the disk, they do not tell you if the pointed object is a directory or a file for example.
In that case you can put the start cluster number + directory entry in a memory object and pass a pointer to this memory object as the i-node.

Post by bzt »

rdos wrote:So, can you explain how calling the fatfuse driver from multiple threads, which uses no synchronization primitives at all, would be safe and not potentially corrupt cluster chains or even the filesystem itself?
Simple. Because it uses the file abstraction, the kernel does this for it. Go on, give it a try! Use fuse to mount an image, and then create and modify files inside that mounted path from multiple concurrent threads, and see for yourself that there will be no corruption.
rdos wrote:And what's the difference between queuing commands and doing them one at a time & putting a huge semaphore on the fuse library
A lot. Because the queue isn't implemented in the fuse library, rather in the kernel, so it's totally transparent to the fuse driver.
rdos wrote:There surely would be no parallel operations in the filesystem with any of those solutions.
Do a websearch on queues, or see this for example. But it would work anyway, because this is how the scheme looks:

Code: Select all

process1 \
process2 - [VFS inside kernel #1] - [fuse driver] - [VFS inside kernel #2] - [storage device]
process3 /
It doesn't matter how many processes are using the file system, because it's the "VFS inside kernel #1"'s job to take care of the concurrency, and "VFS inside kernel #2" makes sure that sectors are written to the storage device in a nicely queued and duplication-free manner. For example, if both process1 and process2 create a file in the root directory, then in the end there will be only one sector write on the storage, because "VFS inside kernel #2" can merge the writes in the queue before the queue is flushed to the actual hardware sectors (and if the queue is a priority queue separating data sector and meta data sector writes, that's when you get soft updates).
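A minimal sketch of the duplication-free part, with hypothetical structures (a real implementation would also order the flush):

Code: Select all

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pending_write {
    uint64_t lba;
    uint8_t  data[512];
    struct pending_write *next;
};

static struct pending_write *write_queue;

/* Queue one sector write; if the same LBA is already queued, the new
 * contents simply replace it, so two processes touching the same root
 * directory sector end up as a single write when the queue is flushed. */
int queue_write(uint64_t lba, const void *data)
{
    struct pending_write *p;
    for (p = write_queue; p; p = p->next) {
        if (p->lba == lba) {
            memcpy(p->data, data, sizeof(p->data));
            return 0;
        }
    }
    p = malloc(sizeof(*p));
    if (!p)
        return -1;
    p->lba = lba;
    memcpy(p->data, data, sizeof(p->data));
    p->next = write_queue;
    write_queue = p;
    return 0;
}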
rdos wrote:Perhaps more important, how can I be sure that the NTFS and ext2/3/4 fuse implementations can safely be called from multiple threads without potentially crashing the filesystem?
Because unlike with your direct sector accessing idea, the current fuse drivers don't have to care about concurrency, as that's the job of the kernel (where the file descriptors are handled). The fuse driver just implements file system hooks for the kernel, but it does not implement a VFS nor an exclusive-access block device mechanism at all. That's a job for the file abstraction: the kernel decides when to call the fuse driver, for which opened files, and how to handle access to the underlying storage to avoid corruption (the fuse driver should have exclusive access to the storage).
rdos wrote:Having strange "files" in the filesystems that are devices or whatever simply is a horrible concept.
You misunderstood. "Everything is a file" means everything can be accessed through a file descriptor and with the same file API. There doesn't have to be an actual file in the file system associated with it. For example, pipes only have a file descriptor but no file name (unless they are a named FIFO), sockets don't have file names either (URLs are parsed from strings, there's no need for an actual "https://forum.osdev.org" file on the filesystem in order to read this site), and with devfs there are no device files on the file system either (they are generated dynamically in memory instead).
rdos wrote:I believe in the idea of defining proper APIs for device-types, and that users (programmers) should not be allowed to mix handles in arbitrary ways.
And then how would you implement something like gzip -d that's supposed to read from stdin, from files and from network sockets as well? You would have to write a significant amount of code in each and every application, for each and every combination of input and output (stdin to stdout, stdin to file, stdin to network, file to stdout, file to file, file to network, network to file etc.). That's a huge overhead, and a huge risk of having a bug in one of your applications.
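A sketch of why the shared API matters here: one copy loop covers every combination, because all the descriptors behave the same way:

Code: Select all

#include <unistd.h>

/* Works unchanged whether in_fd/out_fd refer to a pipe (stdin/stdout),
 * a regular file or a connected socket. */
int copy_stream(int in_fd, int out_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
        if (write(out_fd, buf, (size_t)n) != n)
            return -1;          /* short write / error */
    }
    return n < 0 ? -1 : 0;
}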
rdos wrote:
bzt wrote: Cluster numbers are just a simple index to an allocation table. They do not identify the file on the disk, they do not tell you if the pointed object is a directory or a file for example.
In that case you can put the start cluster number + directory entry in a memory object and pass a pointer to this memory object as the i-node.
Which leads us back to what I've said: cluster numbers do not tell you whether the pointed object is a directory or a file. BTW, you're mixing up two concepts:
* i-node number: a scalar index, stored in directory entries (also defined globally in sys/types.h as "ino_t")
* i-node: a structure containing all relevant information such as type and allocation (everything except the file name), selected by the i-node number (defined differently for each file system, for example "struct ext2_inode_large")

So in your example, a FAT directory entry would be the i-node, and the pointer to that would be the i-node number.
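In code the distinction looks roughly like this (illustrative layouts only, not any particular file system's on-disk format):

Code: Select all

#include <sys/types.h>   /* ino_t */

/* What a directory stores: just the name and the i-node *number*. */
struct dirent_like {
    ino_t ino;                    /* scalar key, e.g. 12345 */
    char  name[256];
};

/* What that number selects: the i-node itself, everything about the file
 * except its name(s). Several directory entries may carry the same ino. */
struct inode_like {
    unsigned int       mode;      /* type (dir/file/...) and permissions */
    unsigned int       nlink;     /* how many directory entries point here */
    unsigned long long size;
    unsigned long long block[12]; /* allocation info, fs specific */
};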

Cheers,
bzt

Post by rdos »

bzt wrote:
rdos wrote:So, can you explain how calling the fatfuse driver from multiple threads, which uses no synchronization primitives at all, would be safe and not potentially corrupt cluster chains or even the filesystem itself?
Simple. Because it uses the file abstraction, the kernel does this for it. Go on, give it a try! Use fuse to mount an image, and then create and modify files inside that mounted path from multiple concurrent threads, and see for yourself that there will be no corruption.
Simply won't do. The VFS in the kernel doesn't know that if you create /somefile1/test.c and /somefile2/test.c at the same time, you might corrupt the cluster chain on FAT given that it is shared between every object on the filesystem. It can also be a file delete or extending the size of a file, which would also affect the cluster chain.

It's a bit like arguing that my memory allocator or linked list doesn't need synchronization since the kernel does some magic for me.

There is no magic that can solve this. A kernel cannot relieve users from using synchronization primitives if the code is supposed to be thread safe. You will need to do it yourself.
bzt wrote:
rdos wrote:And what's the difference between queuing commands and doing them one at a time & putting a huge semaphore on the fuse library
A lot. Because the queue isn't implemented in the fuse library, rather in the kernel, so it's totally transparent to the fuse driver.
Doesn't matter. The kernel cannot solve this any better than the fuse library. Queues and synchronization primitives behave the same way in user land as in kernel, and are used for the same reasons.
bzt wrote:
rdos wrote:There surely would be no parallel operations in the filesystem with any of those solutions.
Do a websearch on queues, or see this for example. But it would work anyway, because this is how the scheme looks:

Code: Select all

process1 \
process2 - [VFS inside kernel #1] - [fuse driver] - [VFS inside kernel #2] - [storage device]
process3 /
It doesn't matter how many processes are using the file system, because it's the "VFS inside kernel #1"'s job to take care of the concurrency, and "VFS inside kernel #2" makes sure that sectors are written to the storage device in a nicely queued and duplication-free manner. For example, if both process1 and process2 create a file in the root directory, then in the end there will be only one sector write on the storage, because "VFS inside kernel #2" can merge the writes in the queue before the queue is flushed to the actual hardware sectors (and if the queue is a priority queue separating data sector and meta data sector writes, that's when you get soft updates).
Which means that only one filesystem operation can run at a time since you use a queue to serialize things. This results in poor performance since the disc driver will typically work with only one request at a time, causing long delays from seek times & setups. For instance, a read operation will need to queue a request to the disc and then wait for it to complete before it can finish the operation. If there is also a write operation, or another read operation close to the original, these will not result in parallel work for the disc driver since file operations are serialized.

Also, since the fuse high-level interface uses paths, the filesystem needs to traverse the path, potentially creating many different disc read operations that cannot overlap with the parsing of other paths.
bzt wrote:
rdos wrote:Perhaps more important, how can I be sure that the NTFS and ext2/3/4 fuse implementations can safely be called from multiple threads without potentially crashing the filesystem?
Because unlike with your direct sector accessing idea, the current fuse drivers don't have to care about concurrency, as that's the job of the kernel (where the file descriptors are handled). The fuse driver just implements file system hooks for the kernel, but it does not implement a VFS nor an exclusive-access block device mechanism at all. That's a job for the file abstraction: the kernel decides when to call the fuse driver, for which opened files, and how to handle access to the underlying storage to avoid corruption (the fuse driver should have exclusive access to the storage).
My direct disc interface will be in kernel, but I don't see how it can fix concurrency issues in file systems. It can of course handle the synchronization of the disc operations so those are thread-safe, but it cannot fix concurrency issues with cluster chains in the file system simply because it doesn't know about that problem.
bzt wrote:
rdos wrote:I believe in the idea of defining proper APIs for device-types, and that users (programmers) should not be allowed to mix handles in arbitrary ways.
And then how would you implement something like gzip -d that's supposed to read from stdin, from files and from network sockets as well? You would have to write a significant amount of code in each and every application, for each and every combination of input and output (stdin to stdout, stdin to file, stdin to network, file to stdout, file to file, file to network, network to file etc.). That's a huge overhead, and a huge risk of having a bug in one of your applications.
I provide input & output redirection between applications, but that's just for legacy reasons. The native API is not structured in that way.

Post by bzt »

rdos wrote:Simply won't do. The VFS in the kernel doesn't know that if you create /somefile1/test.c and /somefile2/test.c at the same time, you might corrupt the cluster chain on FAT given that it is shared between every object on the filesystem. It can also be a file delete or extending the size of a file, which would also affect the cluster chain.
But it doesn't have to. To ensure the file system is always consistent on disk, all the kernel needs to know is the ordering in which the sectors must be written.
rdos wrote:It's a bit like arguing that my memory allocator or linked list doesn't need synchronization since the kernel does some magic for me.
There is no magic that can solve this. A kernel cannot relieve users from using synchronization primitives if the code is supposed to be thread safe. You will need to do it yourself.
Nope, that's incorrect. If you try to solve it yourself, you're just asking for trouble. You see, the whole concept of fuse is that the VFS solves this by multiplexing users' requests to the fuse driver (so you don't need to write it thread-safe, just use the session object), and the VFS also ensures that the storage device is driven by only one fuse driver at a time (by using exclusive access), so there can be no corruption on the device.
rdos wrote:Doesn't matter. The kernel cannot solve this any better than the fuse library.
Most definitely it can. Users calling open()/read()/write() on mounted files use the kernel. It's the kernel's job to handle these, and provide a unique, concurrency-free interface towards the driver instead of many competing apps (this is what the fuse kernel module does).
rdos wrote:Which means that only one filesystem operation can run at a time since you use a queue to serialize things.
Nope, it only means the fuse driver must use the context stored in the session. If two open files do not share the same i-node number (yet another use case where the file descriptor abstraction is handy ;-)), like "get_open_count(session, fd) == 1", then no further checks or locking are needed, which speeds things up (one fuse driver instance operates on one file system device only). Writing the sectors with formatted file system data must be serialized by LBA, but that must happen no matter what solution you choose. The nice thing about priority queues is that the fuse driver could assign different priorities to data, meta-data and allocation-data sector writes, and in return the kernel would take care of the soft update mechanism in a generalized way for free. The kernel doesn't need to know what meta-data looks like, it is enough if it knows that the sectors must be written in a specific order to keep the file system always consistent on the disk. Read more on "soft updates". The same works for journaling too: log data sectors have a higher priority than normal sectors, so the journal will be updated first, without the kernel even knowing that it's updating a journaled file system.
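A sketch of how little the kernel needs to know for that, with hypothetical types (the fs driver only tags each queued write with a class):

Code: Select all

#include <stdint.h>

enum write_class { WR_JOURNAL = 0, WR_METADATA = 1, WR_DATA = 2 };

struct queued_write {
    enum write_class cls;   /* assigned by the fs driver, opaque to the kernel */
    uint64_t         lba;
    const void      *buf;
};

/* qsort()-style comparator for the flusher: lower class first (journal,
 * then meta-data, then data), ascending LBA within a class. The kernel
 * enforces the ordering without understanding the fs layout at all. */
int flush_order(const void *pa, const void *pb)
{
    const struct queued_write *a = pa, *b = pb;
    if (a->cls != b->cls)
        return (int)a->cls - (int)b->cls;
    return (a->lba > b->lba) - (a->lba < b->lba);
}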
rdos wrote:This results in poor performance since the disc driver will typically work with only one request at a time
Not true! In fact, using priority queues is the only way to implement fast I/O (the queue is serialized, so there's no locking and therefore low latency, and contiguous sections can be merged into a single command in the queue before it's sent to the controller, which means high throughput). Read up on the subject a bit.
rdos wrote:For instance, a read operation will need to queue a request to the disc and then wait for it to complete before it can finish the operation.
This is no different from normal operation. The difference is that the queue can make sure a seek is immediately followed by (possibly multiple) read command(s), which minimizes latency. For example, if there are three tasks reading the same sector (the superblock or the root dir for example), then with locking two tasks must wait until the first seeks and reads, then one task has to wait until the second seeks and reads. With queues there would be only one seek and read, and all tasks could be served in parallel.
rdos wrote:Also, since the fuse high-level interface uses paths, the filesystem needs to traverse the path
This is no different from any other implementation. It is always the file system driver's task to traverse the path (it couldn't be otherwise, because nobody knows how a path is represented on the storage (i-node? direntry? headers?) except the driver itself).

Cheers,
bzt

Post by rdos »

bzt wrote:
rdos wrote:Simply won't do. The VFS in the kernel doesn't know that if you create /somefile1/test.c and /somefile2/test.c at the same time, you might corrupt the cluster chain on FAT given that it is shared between every object on the filesystem. It can also be a file delete or extending the size of a file, which would also affect the cluster chain.
But it doesn't have to. To ensure the file system is always consistent on disk, all the kernel needs to know is the ordering in which the sectors must be written.
I don't think you understand the problem. The problem is not sector ordering, rather that the FAT file system must allocate clusters from the cluster table in the filesystem when it creates new files or directories, and when it changes size on files. It's a bit like you have a bitmap for free physical memory, and you do allocations from multiple threads, and if the allocator isn't lock-free or multi-thread safe, threads might get the same physical addresses which will cause nasty bugs. In a filesystem, it leads to corruption. The VFS or kernel cannot solve this since it doesn't know when new clusters are needed or when clusters are freed. The only solution if the FAT implementation isn't multithread safe is to give it one command at a time, or put a huge lock around it.
bzt wrote:Nope, that's incorrect. If you try to solve it yourself, you're just asking for trouble. You see, the whole concept of fuse is that the VFS solves this by multiplexing users' requests to the fuse driver (so you don't need to write it thread-safe, just use the session object), and the VFS also ensures that the storage device is driven by only one fuse driver at a time (by using exclusive access), so there can be no corruption on the device.
The thing is that I'm not a Linux user of fuse. I'm writing a complete VFS + fuse interface, and I don't want bad design from Linux to sabotage performance. I don't want my VFS to do advanced analysis of requests so fuse drivers don't need to be thread-safe. I want the VFS to be simple & efficient, and I require file systems to handle their own synchronization so they can be called from multiple threads.
bzt wrote:
rdos wrote:This results in poor performance since the disc driver will typically work with only one request at a time
Not true! In fact, using priority queues is the only way to implement fast I/O (the queue is serialized, so there's no locking and therefore low latency, and contiguous sections can be merged into a single command in the queue before it's sent to the controller, which means high throughput). Read up on the subject a bit.
I don't like that idea at all. I don't think queues are the solution to good performance. I think multiple server threads are.

Besides, what the article talks about is disc buffering (OUTPUT), and not INPUT queues to fuse. In fact, with a serialized input queue concept, throughput will be poor since this results in a random disc access pattern. I assume that the "merging" you are talking about happens in the disc controller, which will request large IO blocks; however, this has nothing to do with priority queues either, and it actually requires parallel filesystem code to happen.
bzt wrote: With queues there would be only one seek and read, and all tasks could be served in parallel.
I don't understand. If I have one queue entry that wants to open /file1 and another that wants to write /file2, then they clearly cannot be combined in a queue. Those are two different fuse calls and cannot be merged into anything that allows them to operate in parallel.
bzt wrote: The same works for journaling too: log data sectors have a higher priority than normal sectors, so the journal will be updated first, without the kernel even knowing that it's updating a journaled file system.
Now you are talking about prioritizing OUTPUT. As we have already concluded, output consists of writing through a filehandle, which has nothing to do with the INPUT queue to fuse. I also fail to see how file IO could be prioritized. As you pointed out before, this "everything is a file" concept means that you use read/write with a file handle, and AFAIK, there is no prioritizing support in that API.

Post by bzt »

rdos wrote:I don't think you understand the problem. The problem is not sector ordering, rather that the FAT file system must allocate clusters from the cluster table in the filesystem when it creates new files or directories, and when it changes size on files. It's a bit like you have a bitmap for free physical memory, and you do allocations from multiple threads, and if the allocator isn't lock-free or multi-thread safe, threads might get the same physical addresses which will cause nasty bugs.
Yes I understand, take a look at my previous "figure". If you do as I said, then the FAT file system driver will get serialized commands from the kernel, so there is no issue with cluster allocation. If you want to solve that in a multi-threaded fuse driver, well, I'm afraid you'll have to pay to M$, because the proper solution to do that is through transactions, and transaction-safe FAT manipulation is patented by them. US patent no. 7174420
rdos wrote:The thing is that I'm not a Linux user of fuse. I'm writing a complete VFS + fuse interface, and I don't want bad design from Linux to sabotage performance.
Yeah, by implementing it yourself I meant implementing it in each and every fuse driver instead of implementing it in the kernel in a generalized way.
rdos wrote:I don't want my VFS to do advanced analysis of requests so fuse drivers don't need to be thread-safe.
But you should. Not only for fuse drivers, but for any other file system drivers. Implementing it in one place in the VFS might sound like a big task, but believe me, it's a lot more viable than implementing multi-threading safety in each and every file system driver, whatever API they're using.
rdos wrote:I want the VFS to be simple & efficient, and I require file systems to handle their own synchronization so they can be called from multiple threads.
If you do so, then you won't be able to handle file locking in a simple and bullet-proof way. You'll have to implement a significant portion of VFS functionality in each and every file system driver, which increases the chance of a bug significantly. But sure, you could do that.
rdos wrote:I don't like that idea at all. I don't think queues are the solution to good performance.
This isn't an idea, it is an empirically proven fact. All the fast performance solutions are using priority queues. Starting from IBM Mainframe's IO channels to ZFS vdevs. They all keep the commands in memory, prioritize them, and only flush once in a while in a specific order.
rdos wrote:Besides, what the article talks about is disc buffering (OUTPUT), and not INPUT queues to fuse.
Take a look at my "figure" again! All calls are going through the VFS twice: once for the input, and once for the output. Obviously those require two different approaches to handle (because on input the access is not restricted, while on output it is exclusive), but regardless, both are implemented in the VFS and not in the file system driver.

Code: Select all

process1 \
process2 - [VFS inside kernel #1] - [fuse driver] - [VFS inside kernel #2] - [storage device]
process3 /
It doesn't matter how many processes are using the file system, because it's the "VFS inside kernel #1"'s job to take care of the concurrency, and "VFS inside kernel #2" makes sure that sectors are written to the storage device in a nicely queued and duplication-free manner.
Cheers,
bzt

Post by rdos »

bzt wrote:Yes I understand, take a look at my previous "figure". If you do as I said, then the FAT file system driver will get serialized commands from the kernel, so there is no issue with cluster allocation. If you want to solve that in a multi-threaded fuse driver, well, I'm afraid you'll have to pay to M$, because the proper solution to do that is through transactions, and transaction-safe FAT manipulation is patented by them. US patent no. 7174420
Their patent only covers how to recover cluster chains after sudden power failures, and so doesn't relate to cluster allocation in multithreaded environments. They also have a patent regarding long file names in FAT. I don't think these patents can be used to sue hobby (or semi-hobby) OS developers.
bzt wrote:
rdos wrote:I don't want my VFS to do advanced analysis of requests so fuse drivers don't need to be thread-safe.
But you should. Not only for fuse drivers, but for any other file system drivers. Implementing it in one place in the VFS might sound like a big task, but believe me, it's a lot more viable than implementing multi-threading safety in each and every file system driver, whatever API they're using.
I think every function must solve its own multitasking issues. Once upon a time, Unix couldn't even handle kernel threads or multithreading in the kernel, something that sucks big time.
bzt wrote:
rdos wrote:I want the VFS to be simple & efficient, and I require file systems to handle their own synchronization so they can be called from multiple threads.
If you do so, then you won't be able to handle file locking in a simple and bullet-proof way. You'll have to implement a significant portion of VFS functionality in each and every file system driver, which increases the chance of a bug significantly. But sure, you could do that.
I do plan to put most of the code in the VFS, but I don't plan to serialize commands for the fuse driver. Rather, I plan to have multiple server threads on the fuse side. Actually, this is an option in the fuse library that can use Posix threads to dynamically create server threads. However, file systems must support this.

And I don't want the fuse drivers to parse paths, rather I want to handle paths in generic code and so I will use the low-level interface that works with i-nodes instead. This makes the fuse driver simpler and locks on paths don't need to be handled by the fuse driver.
bzt wrote:
rdos wrote:I don't like that idea at all. I don't think queues are the solution to good performance.
This isn't an idea, it is an empirically proven fact. All the fast performance solutions are using priority queues. Starting from IBM Mainframe's IO channels to ZFS vdevs. They all keep the commands in memory, prioritize them, and only flush once in a while in a specific order.
The problem of finding consecutive sectors for read/write operations is similar to allocating more than 4k of physical memory. Lists & queues are horrible at this task, and bitmaps are superior. Particularly when this needs to be combined with buffering a large number of sectors in RAM.

Post by bzt »

rdos wrote:I don't think these patents can be used to sue hobby (or semi-hobby) OS developers.
Yes, they can be (unlikely, but they could be). And it doesn't even matter if you're right, because the justice system is extremely dysfunctional, and isn't about justice at all. The only thing that matters is who can pay the lawyers more. That's a battle that you, as an individual, can't win against an IT giant with huge profits. For good money they will prove that your multithreaded transaction-safe implementation violates the transaction-safe patents, even if that's not true. It's enough if your code does something a little bit similar to what the patent is about. (Just think about the LFN issue in the Linux kernel, and how difficult it was to come up with a solution that is compatible but different enough that no lawyers can get a hold on it. The original code did not violate the patent (it did not use exactly the same algorithm described in the patent, just did something similar), yet nobody cared, it was enough that lawyers said it does.)
rdos wrote:I do plan to put most of the code in the VFS, but I don't plan to serialize commands for the fuse driver. Rather, I plan to have multiple server threads on the fuse side. Actually, this is an option in the fuse library that can use Posix threads to dynamically create server threads. However, file systems must support this.
Yes, you can do that as I've said, but then you won't be able to use the existing fuse drivers as-is unless you make big modifications on them, which kinda defeats the purpose of having fuse in the first place.
rdos wrote:And I don't want the fuse drivers to parse paths, rather I want to handle paths in generic code and so I will use the low-level interface that works with i-nodes instead. This makes the fuse driver simpler and locks on paths don't need to be handled by the fuse driver.
Now I don't think that's possible. Let's say, a fuse driver doesn't handle the paths at all. You open /dev/sda1 (or whatever) with that driver. How would you know what to show at the mount point, what filenames and i-nodes are there if the fuse driver doesn't parse the paths in the image? Who is going to tell your kernel what i-node needs to be assigned for a certain path in the first place? And what about codepages? What if the file system uses a different codepage than your kernel? Who is going to translate the path, if not the fuse driver?
rdos wrote:The problem of finding consecutive sectors for read/write operations is similar to allocating more than 4k of physical memory. Lists & queues are horrible at this task, and bitmaps are superior. Particularly when this needs to be combined with buffering a large number of sectors in RAM.
That's totally different. Detecting consecutive sectors in a queue is nothing like allocation, it's more like implementing run-length encoding (which could easily be implemented as an adaptive algorithm, making it O(1)). And deduplication also makes a huge impact on the overall performance (and depends on buffering sectors in RAM, it can't be done without that).
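A sketch of that run-length-style pass over a queue already sorted by LBA (hypothetical request type):

Code: Select all

#include <stddef.h>
#include <stdint.h>

struct req {
    uint64_t lba;
    uint32_t count;      /* number of consecutive sectors */
};

/* Merges contiguous requests in place and returns the new element count.
 * One linear pass; no searching, no allocation. */
size_t coalesce(struct req *r, size_t n)   /* r must be sorted by lba */
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        if (out > 0 && r[out - 1].lba + r[out - 1].count == r[i].lba)
            r[out - 1].count += r[i].count;   /* extend the previous run */
        else
            r[out++] = r[i];
    }
    return out;
}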

Cheers,
bzt

Post by rdos »

bzt wrote:
rdos wrote:I do plan to put most of the code in the VFS, but I don't plan to serialize commands for the fuse driver. Rather, I plan to have multiple server threads on the fuse side. Actually, this is an option in the fuse library that can use Posix threads to dynamically create server threads. However, file systems must support this.
Yes, you can do that as I've said, but then you won't be able to use the existing fuse drivers as-is unless you make big modifications on them, which kinda defeats the purpose of having fuse in the first place.
I think it might be possible to do both. For important drivers, like FAT, EXT and NTFS I will probably redo them to fit my interface. For others, I might still be able to support them with poorer performance through the "real/legacy" fuse interface.
bzt wrote:
rdos wrote:And I don't want the fuse drivers to parse paths, rather I want to handle paths in generic code and so I will use the low-level interface that works with i-nodes instead. This makes the fuse driver simpler and locks on paths don't need to be handled by the fuse driver.
Now I don't think that's possible. Let's say, a fuse driver doesn't handle the paths at all. You open /dev/sda1 (or whatever) with that driver. How would you know what to show at the mount point, what filenames and i-nodes are there if the fuse driver doesn't parse the paths in the image? Who is going to tell your kernel what i-node needs to be assigned for a certain path in the first place?
Maybe I was imprecise. I don't want fuse to parse paths. Let's say a file ../somepath/test/x.c is opened. The VFS will know that .. refers to the parent directory of the current directory. It will then ask fuse for "somepath", "test" and the file "x.c", in that sequence, using the result from the previous parse as the new starting point. If it already has part of the path cached, it can skip those steps. For instance, if it already knows about ../somepath, it only needs to ask for "test" and "x.c". If another thread decides to delete part of the path, then the VFS can arbitrate this and avoid sending invalid requests to fuse.
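Roughly, the split I have in mind looks like this (a sketch with a hypothetical per-component driver hook; the fuse low-level lookup(parent, name) call works along the same lines):

Code: Select all

#include <stdint.h>
#include <stddef.h>

typedef uint64_t node_id;

/* Hypothetical driver hook: resolve a single name inside directory "parent". */
extern int drv_lookup(node_id parent, const char *name, node_id *out);

/* VFS-side walk: the driver only ever sees one component at a time, and the
 * caller can start from a cached node instead of the root. */
int resolve_path(node_id start_dir, const char *path, node_id *out)
{
    char comp[256];
    node_id cur = start_dir;

    while (*path) {
        while (*path == '/')
            path++;                          /* skip separators */
        size_t i = 0;
        while (*path && *path != '/' && i < sizeof(comp) - 1)
            comp[i++] = *path++;
        comp[i] = '\0';
        if (i == 0)
            break;
        if (drv_lookup(cur, comp, &cur) != 0)  /* one component per call */
            return -1;
    }
    *out = cur;
    return 0;
}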
bzt wrote: And what about codepages? What if the file system uses a different codepage than your kernel? Who is going to translate the path, if not the fuse driver?
I don't support code pages. I only support UTF-8. At least for LFN and FAT, there are rather simple translations between those. The fusefat actually assumes that paths are UTF-8, and I will do the same. I don't know how this is handled in NTFS and EXT, but I'm pretty sure that Microsoft uses Unicode in NTFS.
bzt wrote:
rdos wrote:The problem of finding consecutive sectors for read/write operations is similar to allocating more than 4k of physical memory. Lists & queues are horrible at this task, and bitmaps are superior. Particularly when this needs to be combined with buffering a large number of sectors in RAM.
That's totally different. Detecting consecutive sectors in a queue is nothing like allocation, it's more like implementing run-length encoding (which could easily be implemented as an adaptive algorithm, making it O(1)). And deduplication also makes a huge impact on the overall performance (and depends on buffering sectors in RAM, it can't be done without that).
That's certainly a possibility, but when you read & write large files, this will create a huge number of sector requests, which risks creating too complicated codings or lists. In fact, I do use lists in my current implementation, but I think that's why it scales poorly to large requests. Scanning a bitmap is independent of the number of active requests, and is particularly efficient if requests are in similar regions of the disc. Also, by using 32-bit reads I can check 32 4k disc blocks per step.
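As a sketch, the 32-blocks-per-step scan I mean looks like this:

Code: Select all

#include <stdint.h>

/* Finds the index of the first set bit in a bitmap of nwords 32-bit words,
 * or -1 if none is set. A whole word (32 4k blocks) is rejected with a
 * single compare, independent of how many requests are outstanding. */
long find_first_set(const uint32_t *map, unsigned long nwords)
{
    for (unsigned long w = 0; w < nwords; w++) {
        if (map[w] == 0)
            continue;
        for (int b = 0; b < 32; b++) {
            if (map[w] & (1u << b))
                return (long)(w * 32 + b);
        }
    }
    return -1;
}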

Unlike the current fuse interface which use "file handles" to read and write from a block device, I will have a lock sector(s) function, a modify sector(s) function and an unlock sector(s) function. The lock sector(s) will return physical addresses to buffers. For meta-data, the physical address will be mapped in the file system linear memory and then the file system driver can use the linear address just like results from read/write, except that the buffer will only cover a single sector.

For files, I plan to intercept the normal file-IO from an application and map file contents in application memory space. Thus, the application will not ask kernel to read x bytes from file position y, rather will ask the VFS for a 4k page that covers position y, and then will map it into the application's linear memory and update the list of available disc blocks so file-IO can be done completely in user-space as long as positions are within mapped ranges. For sequential access, the kernel might even decide to create a "read-ahead" thread that fetches future positions in advance before the application tries to access them. Another possibility is to simply queue these requests for the filesystem and when they are done let the file system update the application buffers in kernel space. This would be more like an event-driven system than the normal queue based system used for fuse.

Intercepting is much easier in my design given that ALL file-IO must go through syscalls that only handle file-IO. C handles are translated into file-IO in libc, and so even using that interface always results in using the file-API for file-IO.

Post by bzt »

rdos wrote:I think it might be possible to do both. For important drivers, like FAT, EXT and NTFS I will probably redo them to fit my interface. For others, I might still be able to support them with poorer performance through the "real/legacy" fuse interface.
Yes, that's possible. As I've said, it is a great idea to have a native interface and only use fuse on top of that for compatibility. No argument here.
rdos wrote:Maybe I was imprecise. I don't want fuse to parse paths. Let's say a file ../somepath/test/x.c is opened. The VFS will know that .. refers to the parent directory of the current directory. It will then ask fuse for "somepath", "test" and the file "x.c", in that sequence, using the result from the previous parse as the new starting point. If it already has part of the path cached, it can skip those steps. For instance, if it already knows about ../somepath, it only needs to ask for "test" and "x.c". If another thread decides to delete part of the path, then the VFS can arbitrate this and avoid sending invalid requests to fuse.
Now this is a totally different thing, and I agree that once the fuse driver has parsed the path, the result should be cached in the VFS. That's exactly what all the VFS implementations I know do, and that's what my VFS does too. But you'll still need the fuse driver to parse paths for the first time (before you could cache the result), you can't get away without it.
rdos wrote:I don't support code pages. I only support UTF-8. At least for LFN and FAT, there are rather simple translations between those.
But not for exFAT. Microsoft in their infinite idiocracy thought it was fun to waste storage space on a UNICODE conversion table on each and every disk. Now if you ask why the hell something that's the property of UNICODE and not of the file system is being stored on each and every file system instance, you're absolutely right, and it shouldn't be. But it is, which means a certain fs instance might deviate from the standard UNICODE code points by using custom values in that table, therefore you must run every filename through that conversion table, otherwise you won't get valid UNICODE code points to convert into UTF-8. Pure madness if you ask me. (And their reasoning that without it you couldn't convert between uppercase and lowercase is entirely and totally false, because UNICODE does, and always did, provide that conversion information in a common, universal, and non-filename-specific way; see the last two columns of UnicodeData.txt. Those store exactly that information.)
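For reference, a sketch of what the comparison ends up looking like (the table here is already expanded to a flat 65536-entry array; on disk exFAT stores it in a compressed form):

Code: Select all

#include <stdint.h>
#include <stddef.h>

/* Case-insensitive exFAT-style name compare: every UTF-16 code unit must be
 * folded through the *volume's own* up-case table before comparing, because
 * the table on this particular volume may deviate from standard UNICODE. */
int exfat_name_equal(const uint16_t *a, const uint16_t *b, size_t len,
                     const uint16_t upcase[65536])
{
    for (size_t i = 0; i < len; i++) {
        if (upcase[a[i]] != upcase[b[i]])
            return 0;
    }
    return 1;
}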
rdos wrote:Unlike the current fuse interface which use "file handles" to read and write from a block device, I will have a lock sector(s) function, a modify sector(s) function and an unlock sector(s) function.
Using file locks on an opened device file is just the same. But as I have said, you can do that, you only need to replace unix_io.c (or win32_io.c) in the drivers (this isn't a standard, just a common practice to place the device code in these files, so in most fuse drivers you'll find it there; for example, look for pread and pwrite, which work exactly like multi-sector reads and writes).
rdos wrote:For files, I plan to intercept the normal file-IO from an application
No need for that. Application will talk to your kernel, not to the fuse driver. Simple application will just call open()/read()/write()/close() (or their equivalent on your OS), without knowing to which mount point and file system (and hence which fuse driver) a certain path is mapped to. I think your problem (multithreads and all) originates from the fact that you think applications will talk to the fuse driver directly. No, they won't, they will use syscalls that your kernel must handle, and it is your kernel that will translate your native API into fuse API for the driver (and it is up to your kernel if you create a shared memory between the application and the fuse driver or if you copy the read/write buffers between the processes. Neither the application nor the fuse driver needs to know, they just need a pointer and the size of the buffer).
rdos wrote:Intercepting is much easier in my design given that ALL file-IO must go through syscalls that only handle file-IO. C handles are translated into file-IO in libc, and so even using that interface always results in using the file-API for file-IO.
Exactly. The big advantage of "everything is a file" is that you only need one API.

Cheers,
bzt

Post by rdos »

bzt wrote:Now this is a totally different thing, and I agree that once the fuse driver has parsed the path, the result should be cached in the VFS. That's exactly what all the VFS implementations I know do, and that's what my VFS does too. But you'll still need the fuse driver to parse paths for the first time (before you could cache the result), you can't get away without it.
Right, but you don't need to let it parse composite paths, only single elements. However, this requires that you can tell fuse what the path relates to, like some inode. Otherwise, fuse will think it relates to the filesystem root.
bzt wrote:
rdos wrote:I don't support code pages. I only support UTF-8. At least for LFN and FAT, there are rather simple translations between those.
But not for exFAT. Microsoft in their infinite idiocracy thought it was fun to waste storage space on a UNICODE conversion table on each and every disk. Now if you ask why the hell something that's the property of UNICODE and not of the file system is being stored on each and every file system instance, you're absolutely right, and it shouldn't be. But it is, which means a certain fs instance might deviate from the standard UNICODE code points by using custom values in that table, therefore you must run every filename through that conversion table, otherwise you won't get valid UNICODE code points to convert into UTF-8. Pure madness if you ask me. (And their reasoning that without it you couldn't convert between uppercase and lowercase is entirely and totally false, because UNICODE does, and always did, provide that conversion information in a common, universal, and non-filename-specific way; see the last two columns of UnicodeData.txt. Those store exactly that information.)
It never stops amazing me how much junk M$ is behind.

Still, is there any reason to support exFat at all? It won't be compatible with a large number of devices like FAT is, and if I want a file system that can handle 64-bit file sizes, using NTFS or Ext4 seems like a much better idea.
bzt wrote:
rdos wrote:For files, I plan to intercept the normal file-IO from an application
No need for that. Application will talk to your kernel, not to the fuse driver. Simple application will just call open()/read()/write()/close() (or their equivalent on your OS), without knowing to which mount point and file system (and hence which fuse driver) a certain path is mapped to. I think your problem (multithreads and all) originates from the fact that you think applications will talk to the fuse driver directly. No, they won't, they will use syscalls that your kernel must handle, and it is your kernel that will translate your native API into fuse API for the driver (and it is up to your kernel if you create a shared memory between the application and the fuse driver or if you copy the read/write buffers between the processes. Neither the application nor the fuse driver needs to know, they just need a pointer and the size of the buffer).
There are alignment issues when using physical addresses rather than arbitrary buffers. For instance, sector 0 on a disc with 512 bytes per sector will be at the start of a 4k page, the next sector will be at offset 0x200, and so on. I will have to enforce this in the disc buffers, and I will always read at least 4k at a time regardless of sector size. When I map these addresses into linear address space, the sectors will start at the same offsets there. Now, a file might be stored on consecutive sectors, in which case you could give the application a 4k buffer covering more than one sector of the file. This depends on alignment in the filesystem and on cluster sizes (in FAT). Anyway, if the file is stored on consecutive clusters on FAT, then every 4k buffer except possibly the first one will contain 8 sectors. If it's not contiguous, some 4k buffers might contain fewer than 8 sectors. This creates a bit of complexity on the user side, but I think this is offset by the possibility of doing file IO without syscalls (as long as the relevant buffers are mapped).

This is why the preferred API is to ask for a 4k buffer that has a specific file offset within it. The kernel will then return a mapped linear address, the starting position in the file, and the number of bytes it covers. When this point is read past, another buffer at the returned start position + number of bytes is requested. This is very different from passing a user-level buffer with arbitrary alignment to the kernel and asking for y bytes of the file, where the kernel simply has to copy the contents from the file into that buffer.

So, the user will not talk directly to fuse, and it cannot even given that they reside in different address spaces. All the communication will need to go through shared kernel memory.

In order to achieve a no-copy interface, the same rules need to be applied to the disc interface. If I ask for sector 0, I will get a physical buffer where sector 0 is at offset 0, sector 1 is at offset 0x200 (assuming 512-byte sectors), and so on. If I ask for sector 1 only, I will still get a physical address where sector 1 starts at offset 0x200. I cannot pass the disc handler an arbitrary buffer into which I want the data to be loaded. When I map this address in the file system, offsets within mapped pages need to be preserved.
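The alignment rule boils down to a bit of arithmetic; a sketch assuming 512-byte sectors and 4k pages:

Code: Select all

#include <stdint.h>

#define SECTOR_SIZE   512u
#define PAGE_SIZE4K   4096u
#define SECT_PER_PAGE (PAGE_SIZE4K / SECTOR_SIZE)   /* 8 */

/* Which 4k disc block a sector belongs to... */
static inline uint64_t lba_to_block(uint64_t lba)
{
    return lba / SECT_PER_PAGE;
}

/* ...and the offset that sector keeps inside the 4k buffer, no matter where
 * the buffer is mapped (sector 1 always lands at offset 0x200, and so on). */
static inline uint32_t lba_to_offset(uint64_t lba)
{
    return (uint32_t)((lba % SECT_PER_PAGE) * SECTOR_SIZE);
}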

Post by Octocontrabass »

bzt wrote:(And their reasoning that without it you couldn't convert between uppercase and lowercase is entirely and totally false, because UNICODE does, and always did, provide that conversion information in a common, universal, and non-filename-specific way; see the last two columns of UnicodeData.txt. Those store exactly that information.)
While I agree that the exFAT up-case table is ridiculous, the reasoning is not: you can't use the Unicode data because the correct conversion depends on the user's language.
rdos wrote:Still, is there any reason to support exFat at all?
It's mandatory for SDXC, and cards may misbehave if you don't format them according to the SD standard.

Post by bzt »

Octocontrabass wrote:While I agree that the exFAT up-case table is ridiculous, the reasoning is not: you can't use the Unicode data because the correct conversion depends on the user's language.
No, their reasoning is invalid, because when you're checking file names in a case-insensitive way, you must convert all i variants (İ i I ı) to the same uppercase dotless I. And when the file name is displayed on a GUI, then only the user's language preference matters, which is independent of the file system.

Actually the current solution introduces a pretty nasty bug. Let's assume you have an application that wants to open a filename with an İ in it from an exFAT formatted image. If you compile that and create the image on a machine with the user's language set to Turkish, it will work, but in every other case the app will return "File not found". Good luck debugging that!
Octocontrabass wrote:
rdos wrote:Still, is there any reason to support exFat at all?
It's mandatory for SDXC, and cards may misbehave if you don't format them according to the SD standard.
WARNING: fact check: FAKE
Oh, not this bullshit again, please. SDXC cards are perfectly capable of storing other filesystems, and nobody has ever reported a problem with that, never. Plus the SD standard even defines a register to indicate that the card isn't using exFAT, which must be supported by all compliant drivers. So no, it's not mandatory, for several reasons, period.

No, the correct answer is that there are many devices (mp3 players, cameras, video recorders etc.) which can only use exFAT because their firmware understands that file system and nothing else, so if you ever want to save the pictures from your camera under your OS, then you must support exFAT. If you don't have such a device (because you use a smartphone), then you'll probably never need exFAT.

Cheers,
bzt

Post by rdos »

Octocontrabass wrote:
bzt wrote:(And their reasoning that without it you couldn't convert between uppercase and lowercase is entirely and totally false, because UNICODE does, and always did, provide that conversion information in a common, universal, and non-filename-specific way; see the last two columns of UnicodeData.txt. Those store exactly that information.)
While I agree that the exFAT up-case table is ridiculous, the reasoning is not: you can't use the Unicode data because the correct conversion depends on the user's language.
I find upper-case filenames esthetically questionable, and so for FAT I create upper-case 8.3 names in the filesystem, but then lower case them when I present them to the user. I've not quite decided if I will go the Unix way and make case important in filename comparisons or the DOS/Windows way and convert to uppercase for comparisons. exFAT has given me more arguments for doing it the Unix way.
Octocontrabass wrote:
rdos wrote:Still, is there any reason to support exFat at all?
It's mandatory for SDXC, and cards may misbehave if you don't format them according to the SD standard.
I've formatted many SD cards, both with MBR and ordinary FAT and with GPT and ordinary FAT. It works just fine. They can be formatted with Ext4 too.