Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 3:52 am
by CWood
I had an idea, ages ago, which I posted on my blog, about a distributed file hosting system. The post is still up there, however the idea is incredibly unpolished. I found myself thinking about it this morning in the shower, however, and wanted to get you guys' opinions.
So, the idea is this: a file hosting system, a la DropBox et al, but instead of a central hosting company storing the files, the people who use the service host them. For example, if you purchase a 2TB drive and sign up to the service, you can put that 2TB drive on there, and get 2TB of hosting space in return, somewhere else in the world. Now, this sounds pretty much like a waste of time, but hear me out. It can be used to access your files anywhere the service is running (which, if it becomes popular enough, could be anywhere...), by typing in your ID#/username (not decided which to go for yet). Furthermore, redundant backups will be made, with CRCs generated. Chances are I'd go for something similar to a RAID array, so that it doesn't impact disk space too much.
Obviously it would have to be encrypted, for security reasons, and I would likely do it with a key file, similar to SSH. Furthermore, CRC hashes would be taken of both the encrypted and unencrypted files. So, when the user creates a new file, the system generates a CRC of the file, encrypts it using the key file, downloads a list of (for example) 20 hosts with enough space to store the file, and sends it to two of them along with the CRC. Those hosts then generate the same CRC on the encrypted file and, if the two match, store it; if they don't match, they request that the file be resent. The user can then view all of his/her files stored on the network via a mountpoint (likely I will write a driver that can match this in the fstab; Windows might come later, but I'm a Linux developer when not OSDevving). When a file is requested, one of the machines storing the file is selected at random and asked to send the encrypted file back. That machine checks the user's credentials to make sure the file really belongs to them (a custom file system is probably in order, likely a modification of EXT4 or BtrFS), regenerates a CRC hash, and checks that it matches the one on disk. If it does, it sends the hash and the encrypted file to the user's machine, which then generates another hash. If the on-disk check fails, the user is told and the other machine is used. If the user-generated hash doesn't match the one sent down the wire, a retransmit is requested. The user's machine then decrypts the file with the key file and checks the unencrypted hash as well.
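To make that flow concrete, here is a rough Python sketch of the store-and-verify steps above. Everything in it is illustrative: encrypt() is only a stand-in (repeating-key XOR, not real encryption), Host is a toy stand-in for a remote machine, and none of the names are a real API.
Code:
import hashlib
import itertools
import zlib

def crc(data: bytes) -> int:
    return zlib.crc32(data) & 0xFFFFFFFF

def encrypt(data: bytes, key: bytes) -> bytes:
    # Placeholder cipher (repeating-key XOR over a hashed key) so the sketch
    # runs end to end; the real system would use a proper cipher keyed by the
    # user's key file. This is NOT real encryption.
    stream = itertools.cycle(hashlib.sha256(key).digest())
    return bytes(b ^ k for b, k in zip(data, stream))

class Host:
    # Toy stand-in for one of the candidate hosts; it re-checks the CRC
    # of the encrypted blob before agreeing to store it.
    def __init__(self):
        self.blobs = []
    def store(self, cipher: bytes, claimed_crc: int) -> bool:
        if crc(cipher) != claimed_crc:
            return False                  # mismatch: ask the client to resend
        self.blobs.append((claimed_crc, cipher))
        return True

def store_file(plain: bytes, key: bytes, hosts, copies: int = 2):
    plain_crc = crc(plain)                # CRC of the unencrypted file
    cipher = encrypt(plain, key)          # encrypt with the key file
    cipher_crc = crc(cipher)              # CRC of the encrypted file
    for host in hosts[:copies]:           # send to two of the candidate hosts
        while not host.store(cipher, cipher_crc):
            pass                          # host reported a mismatch: resend
    return plain_crc, cipher_crc

store_file(b"hello world", b"contents of the key file", [Host(), Host()])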
I know it sounds complicated (at least, to me it does), but I think this could be quite useful for storage management and backups: you donate a drive/partition to the system and get the equivalent storage space in return, automatically backed up (as I said, I don't know if it will be a full backup, or parity files/a RAID-esque system), for better security and better disaster-proofing.
Obviously, a list of hosts will have to be generated, so I intend to tackle that problem as follows: when the user downloads the source/binaries (I intend to open source the whole thing, as closed source defeats the whole idea), a list of servers will also be downloaded. These servers maintain a list of machine id -> IP address mappings, which are consulted. When a user stores a file, an entry is generated on the server tracking which machine stores the file, which then propagates to all other servers. When a machine changes IP address, it contacts the server, which updates the mapping. The user also gets the choice of becoming a hosts server themselves, which would also grant them some storage. Obviously, as the network grows, these servers would become more distributed; each server would store a subsection of hosts and a subsection of files, and the system itself would ensure that each entry is stored at least 2-3 times for redundancy.
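Purely as an illustration, the directory one of those hosts servers keeps could be as simple as two maps; all the names below are made up, not a decided format.
Code:
# Illustrative sketch of the directory a hosts server might keep.
class HostsServer:
    def __init__(self):
        self.machines = {}   # machine id -> current IP address
        self.files = {}      # file id -> list of machine ids holding a copy

    def register(self, machine_id: str, ip: str):
        # Called at sign-up and whenever a machine's IP address changes.
        self.machines[machine_id] = ip

    def record_store(self, file_id: str, machine_id: str):
        # Called when a user stores a file; would be propagated to peer servers.
        self.files.setdefault(file_id, []).append(machine_id)

    def lookup(self, file_id: str):
        # Return (machine id, IP) pairs for every known copy of the file.
        return [(m, self.machines[m])
                for m in self.files.get(file_id, []) if m in self.machines]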
There would also be precautions against physical machines being too close to each other, both physically and virtually, by disallowing a file from being stored on any machine less than ten /8 blocks away. For example, if my IP is 55.55.55.55 (hint: it isn't), I could store on 45.55.55.55 and below, and 65.55.55.55 and above, but not in between. This is a naive way of doing it, but it largely protects against two hosts being on the same ISP, which means they are separated virtually, and probably (though not definitely) separated physically as well.
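In code form, that naive first-octet rule would be something like the sketch below (illustrative only; a real check would want to look at ASes or ISPs rather than raw addresses).
Code:
# Naive "at least ten /8 blocks apart" placement check described above.
def far_enough(ip_a: str, ip_b: str, min_octet_gap: int = 10) -> bool:
    return abs(int(ip_a.split(".")[0]) - int(ip_b.split(".")[0])) >= min_octet_gap

# The example from the post: 55.x.x.x may store on 45.x.x.x and below,
# or 65.x.x.x and above, but nothing in between.
assert far_enough("55.55.55.55", "45.1.2.3")
assert far_enough("55.55.55.55", "65.1.2.3")
assert not far_enough("55.55.55.55", "60.1.2.3")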
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 5:40 am
by Combuster
The core problem I see in this is that the net space allows for exactly one remote backup. Of course this gets better as the majority won't be using all their hosting space, but the number of backups won't be that large. Also, with most people running desktops or smart devices these days, their time online will be relatively small, and even if you have several copies out there, the statistical chance that a file is unavailable is a function of the number of copies and the average time online: for instance, if the average user is online for 5 hours a day, you need 11 copies to have a 90% chance of being able to get to your file.
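Spelling that estimate out (under the simplifying assumption that each copy's holder is online independently for the same fraction of the day):
Code:
# Chance that at least one of n copies is reachable, assuming each holder is
# online independently for the same fraction of the day (a simplification).
def availability(hours_online_per_day: float, copies: int) -> float:
    p = hours_online_per_day / 24.0
    return 1.0 - (1.0 - p) ** copies

print(availability(5, 11))   # ~0.92 with 11 copies at 5 hours/day online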
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 6:32 am
by dozniak
That's quite naive indeed, there are much better ways to do it (DHT with redundancy comes to mind).
EDIT: And, of course, the idea isn't really new.
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 7:43 am
by CWood
Hi guys, thanks for the fast feedback.
Combuster:
Indeed, this is an issue; to combat the lack-of-backups problem, compression should be built into the system, transparently. The user should never have to know the details of this; however, looking at recent statistics, the XZ algorithm looks to be the most promising. Whether this will happen before or after encryption, I don't know, but I don't think it matters (I could be wrong). Consequently, more copies of the file can be stored, and the end user will be none the wiser. If anything, this will make the whole system faster, due to bandwidth considerations.
As for the issue of potentially not having a copy online, that is also a very large issue; however, your model assumes a random distribution of times, and that is generally not how computer usage patterns work. There are several ways of combating this problem. Firstly, it is not too difficult to build up a model of each machine's uptime profile. For example, my main machine usually gets turned on at 9:00 and stays on until around 21:30, GMT. My server stays on 24/7, aside from the occasional reboot, usually due to upgrades. Even a simple worst-case system will work fairly well in this situation, and a Bayesian system would work to within (I would anticipate, I'm no statistician) at least 2 standard deviations. Indeed, if it is precise enough, it could even go so far as to detect dates when the system will be down each year, for things such as family holidays, etc.
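As a sketch of what even the simplest per-machine uptime profile could look like (nothing Bayesian yet, just an hourly frequency table; the class and method names are made up):
Code:
# Minimal per-machine uptime profile: an hourly frequency table built from
# observed samples, used to estimate how likely the machine is to be online
# at a given hour of day. Purely an illustrative sketch.
from collections import defaultdict

class UptimeProfile:
    def __init__(self):
        self.online = defaultdict(int)   # hour (0-23) -> times seen online
        self.samples = defaultdict(int)  # hour (0-23) -> total observations

    def observe(self, hour: int, was_online: bool):
        self.samples[hour] += 1
        if was_online:
            self.online[hour] += 1

    def p_online(self, hour: int) -> float:
        if self.samples[hour] == 0:
            return 0.0                   # no data yet: assume the worst case
        return self.online[hour] / self.samples[hour]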
Furthermore, time zones must be factored in as well. It is more likely that two users in the same time zone will be online at the same time than it is for two people in different time zones. And indeed, if we assume a fairly even distribution of uptimes, say 5 hours, it is safe to assume that those 5 hours will probably fall within the times I quoted above for my uptime profile, give or take. As a result, by confining uploads to, say, a maximum of 1 or 2 time zones away, the probabilities you quoted would be greatly improved.
dozniak:
I will admit, I've never heard of a DHT. I've done some preliminary reading now that you've brought it to my attention, and it does indeed look promising. I shall continue to research.
As for this idea not being a new one, I already assumed as much; however, I haven't seen it implemented anywhere. That may or may not be because it is a bad idea, but even if it is a bad idea and it won't work, implementing it will still be a learning experience, and indeed should be fun.
Thanks,
Connor
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 8:28 am
by dozniak
CWood wrote:implementing it will still be a learning experience, and indeed should be fun.
Sure. I'm working on such an implementation too.
The biggest problem so far is the initial bootstrapping of a new node, which as yet knows nothing of other available nodes. But my system is completely decentralized, so there's no initial server to log on to.
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 8:36 am
by CWood
Great minds think alike, what more can I say? Any advice for when I start the project? (It might be a while yet; Real Life is still in the way of my projects right now.)
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 12:46 pm
by Brendan
Hi,
The first problem will be bandwidth. For most home users the download speed of their internet connection is nowhere near as good as hard disk bandwidth, and the upload bandwidth is typically half the download speed. This means that uploading data to other computers whenever a file is stored or modified is going to be a massive bottleneck.
In addition to this, a lot of users have data caps on their internet connection (e.g. "1 GiB per month", where bandwidth is severely crippled when you exceed the allowance, and some people even have a flat fee per MiB/GiB, where you have to pay more if you run out). For these people, your service would translate into much higher internet connection costs.
The next problem is space - it doesn't add up. First, you'd want to use some of a user's disk as a local cache of that user's data to reduce the bandwidth problems (to avoid downloading when the data is accessed from the same computer it was uploaded from, and to buffer data while it's being uploaded). Of course Combuster is correct - for each byte cached locally you're going to want about 20 copies of it elsewhere to improve the chance of availability. This means that for a 2100 MiB drive you'd have 100 MiB of data stored locally plus 20 copies of that 100 MiB elsewhere (on the internet), and 2000 MiB of the 2100 MiB drive would be used for storing other people's data (in return for them storing your 20 copies).
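Putting those numbers into a quick back-of-the-envelope calculation (same assumptions: one local cache copy plus N remote copies, traded one-for-one against other people's data):
Code:
# With N remote copies per byte, only capacity / (N + 1) of a donated drive
# ends up holding the owner's own data; the rest holds other people's copies.
def usable_share(capacity_mib: float, copies: int) -> float:
    return capacity_mib / (copies + 1)

cap = 2100.0
print(usable_share(cap, 20))        # 100.0 MiB of your own data locally...
print(cap - usable_share(cap, 20))  # ...and 2000.0 MiB storing other people's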
Also note that compression only really helps for files that aren't already compressed in some way. Most of the user's files will be backups (already compressed), application installers (already compressed), tarballs of open source projects (already compressed), JPEG (already compressed), MPEG videos (already compressed), MPEG sound files (already compressed), etc. Compressing the data a second time might save a tiny bit of space, but won't halve the amount of data (and could just make files larger).
The third problem is the perception of security. This has nothing to do with security - it's about making users believe that the system is secure (regardless of whether it is or not). Most people are sceptical, and it won't be like (e.g.) BitTorrent where people are downloading other people's data rather than storing their own data.
If you combine the bandwidth problems, the space problems and the (perceived, not real) security problems; I don't think it'll sound good to most users - lots of internet bandwidth and most of their disk space gone, in return for remote access and backup for less data than will fit on a USB flash stick (that happens to be cheaper than a 2 TiB drive, will hold more of the user's data, will be faster than their internet, and is still very portable).
Now don't get me wrong here - I'm not saying it's necessarily a bad idea. What I'm saying is that you'd need to find ways of solving those problems, or find some niche market that would want it despite those problems.
For one example, maybe you can do "data matching". If someone uploads a file, you could check if that same file is already stored in the distributed file system by a different user and do the equivalent of a symbolic link. If 100 people all upload the same Rick Astley video then you might only need to store 20 copies rather than 100*20 copies (and maybe you can store nothing, because you know that data is in 100 users' local caches anyway). You might even be able to do this with pieces of files (rather than just whole files) - for example, if 20 people store slightly different copies of the same picture (e.g. the same picture of a grumpy cat, just with different words added to the bottom), maybe the first half of the file is the same for all users and only the last half of the file needs to be stored for each user.
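A sketch of that data-matching idea, using a content hash as the file identity so identical uploads collapse into one stored copy plus a set of referencing users. Note this glosses over how deduplication interacts with per-user encryption; the names are illustrative.
Code:
# Whole-file deduplication: identical uploads are detected by content hash
# and stored once, with the set of referencing users tracked per blob.
import hashlib

class DedupStore:
    def __init__(self):
        self.blobs = {}    # content hash -> stored blob
        self.owners = {}   # content hash -> set of user ids referencing it

    def put(self, user_id: str, blob: bytes) -> str:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in self.blobs:        # first time this content is seen
            self.blobs[digest] = blob
        self.owners.setdefault(digest, set()).add(user_id)
        return digest                       # acts as the "symbolic link"

store = DedupStore()
ref1 = store.put("alice", b"never gonna give you up")
ref2 = store.put("bob",   b"never gonna give you up")
assert ref1 == ref2 and len(store.blobs) == 1   # stored once, referenced twice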
Cheers,
Brendan
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 1:28 pm
by bluemoon
Besides the technical problems mentioned above, you can also think about the use cases. For example: 1) a frequent traveller may use such a system to retrieve files from the storage system (it may or may not involve a centralized server, or user nodes may take some of the workload off a main server); 2) it can serve for file sharing among friends; or 3) it can be used for publishing files to the public, etc.
Note that for all the above use cases there are already commercial or free solutions.
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 3:27 pm
by CWood
Brendan:
Bandwidth is indeed an issue; however, bear in mind that this will be running chiefly on Linux systems (until, if ever, I port it, or someone else does). Consequently, it won't know the difference between /dev/sdc3 and /dev/sda, despite /dev/sda obviously being an entire disk, not just one partition. It could very easily be set up to use a whole disk, or just (say) /home/User/Documents, which could be created as a partition, say 4GB large. The project will factor this in. My primary objective here is to create something similar to existing file storage solutions, but avoiding caveats such as "if DropBox goes bust, I lose my data", which we saw with MegaUpload a few years back.
As for the space issue, I have yet to toy with that, and I anticipate that it would probably take much longer than I have at this stage. You are right that most data is already compressed, and further compression would be next to pointless; however, I would anticipate that most people would see no need to (for example) make redundant backups of their MP3 collection. There are two types of people who have MP3 collections: those who already have the CDs, in which case they could merrily rip the files again, and those who pirate, in which case they could quite happily download them again. Photos are a potential caveat, and videos too. As I said, I'll have to experiment.
As for the perceived security issues, this is largely a matter of PR, and currently I don't even know if it will be completed, it is so far off into the future. I anticipate at least 6 more months until I can work on this.
Your idea about reducing duplicates on the system is incredibly interesting, however, and I must confess, I love it. Many a storage requirement could be reduced this way. Perhaps, assuming I decide to go the way of modifying EXT4, instead of specifying user:group ownership, I could specify a linked list, similar to the inode format, listing user:permissions for all users. If a user modifies the file, store either a diff, or split the file: store the common data as above, and keep the differing data as separate files that the daemon then stitches back together seamlessly at runtime.
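A toy version of what that per-file user:permissions chain plus diff storage might hold, purely as a data-structure sketch (this is in no way actual EXT4 on-disk format):
Code:
# Toy per-file record: a common base blob, optional per-user diffs, and an
# access list of user:permissions entries. A data-structure sketch only.
from dataclasses import dataclass, field

@dataclass
class AccessEntry:
    user: str
    perms: str                              # e.g. "rw", "r"

@dataclass
class SharedFile:
    base_blob: str                          # content hash of the common data
    per_user_diff: dict = field(default_factory=dict)  # user -> diff blob hash
    acl: list = field(default_factory=list)            # AccessEntry list

    def allow(self, user: str, perms: str):
        self.acl.append(AccessEntry(user, perms))

    def can_read(self, user: str) -> bool:
        return any(e.user == user and "r" in e.perms for e in self.acl)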
bluemoon:
Largely, the idea behind this project is the fact that backups should be distributed, across geographical, political, and company borders. This can already be done, but in the case of the latter, is very time consuming and difficult. The smart people would script it, but this is still a PITA. These are the problems I intend to solve. Bitcoin is already distributed, and does not rely on a central entity, at least from what I understand of it, so it is possible to do this kind of thing.
Cheers,
Connor
Re: Idea for a distributed file hosting system
Posted: Wed Mar 20, 2013 4:00 pm
by Mikemk
I have a few things to say here.
Firstly, giving a 2 TB hard drive creates 2 TB of space for that user. Why would anybody want to trade a hard drive for 2 TB of cloud storage, the same size as the hard drive?
Secondly, Brendan's data matching idea is worth a try if you do go ahead with it.
Re: Idea for a distributed file hosting system
Posted: Thu Mar 21, 2013 2:53 am
by dozniak
Brendan pretty much nailed the issues.
The issue of trust: right now, the file is only distributed across a range of devices you manually allow to access your data. This doesn't solve the problem per se, but just makes it easier to tackle for the initial implementation. The data and metadata could be encrypted with asymmetric schemes (private keys), but that doesn't give full security.
The issue of overhead: use automatic deduplication at the block level. If people share the same file using the same block size, chances are all the blocks will match up, and hence need to be stored only once. If there are minor modifications, then only some blocks will mismatch while others are perfectly in sync, which means much less storage overhead.
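A block-level version of the same idea, again only a sketch (fixed-size blocks; real systems often use content-defined chunking so an insertion doesn't shift every following block):
Code:
# Fixed-size block deduplication: a file becomes a list of block hashes, and
# each unique block is stored once. Minor edits only add the changed blocks.
import hashlib

BLOCK_SIZE = 4096
block_store = {}   # block hash -> block bytes

def store_blocks(data: bytes) -> list:
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        block_store.setdefault(h, block)   # only new blocks consume space
        recipe.append(h)
    return recipe                          # enough to reassemble the file

def reassemble(recipe: list) -> bytes:
    return b"".join(block_store[h] for h in recipe)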
Redundancy: this also gives the possibility of spreading the file blocks out to other nodes more evenly, and with an encoding scheme allowing error correction, a file may be reconstructed even if some of its blocks are lost completely.
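As the simplest possible stand-in for such an encoding (a real system would use a proper erasure code such as Reed-Solomon), a single XOR parity block already lets you rebuild any one lost block:
Code:
# RAID-5-style single parity over equally sized blocks: any one missing block
# can be rebuilt by XOR-ing the parity with the surviving blocks.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Lose data[1]; rebuild it from the parity plus the remaining blocks.
assert xor_blocks([parity, data[0], data[2]]) == b"BBBB"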
Plausible deniability: if your file is not stored in a single place as a single blob, it becomes much harder to prove you have it.
File metadata (name, attributes, custom labels) is also stored in a block, usually much smaller in size, which can be left unencrypted to allow indexing, but could also be encrypted if you do not want to expose this metadata. In my design, metadata is a key-value store with a lot of different attributes, ranging from UNIX_PATH=/bin/sh to DESCRIPTION[en]="Bourne Shell executable" to UNIX_PERMISSIONS=u=rwx,g=rx,o=rx and so on. This format is not fixed, although it follows a certain schema/ontology. It allows "intelligent agents" or bots to crawl this data and enrich it with suggestions and links, e.g. a bot crawling an mp3 collection and suggesting proper tags - it could also find higher-quality versions of a file.
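For illustration, such a metadata block could be as simple as a serialized key-value map; the keys below are the ones mentioned above, but the JSON encoding is just an example, not the actual format.
Code:
# Example metadata block as a key-value map, serialized for storage.
import json

metadata = {
    "UNIX_PATH": "/bin/sh",
    "DESCRIPTION[en]": "Bourne Shell executable",
    "UNIX_PERMISSIONS": "u=rwx,g=rx,o=rx",
}

block = json.dumps(metadata, sort_keys=True).encode()  # store (or encrypt) this
print(json.loads(block)["UNIX_PATH"])                  # -> /bin/sh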
All this revolves around the ideas of DHTs, darknets, netsukuku and zeroconf. It's still too early in the implementation to uncover all the details - they might change.
Re: Idea for a distributed file hosting system
Posted: Thu Mar 21, 2013 3:02 am
by dozniak
m12 wrote:Firstly, giving a 2 TB hard drive creates 2 TB of space for that user. Why would anybody want to trade a hard drive for 2 TB of cloud storage, the same size as the hard drive?
Because if that 2TB hard drive goes down, it will most probably take all the data down with it, not counting the extremely costly recovery procedures.
Whereas with data stored across multiple nodes with redundancy, you just throw that 2TB hard drive in the trash and connect a shiny new 4TB hard drive - see how it populates with your data again? Magic.
Re: Idea for a distributed file hosting system
Posted: Thu Mar 21, 2013 3:05 am
by dozniak
Oh, CWood, if your primary purpose is backups you might want to check out existing "p2p backup solutions"; there are some interesting user reviews of those - it may be insightful to get data from the field.
Re: Idea for a distributed file hosting system
Posted: Thu Mar 21, 2013 5:58 am
by CWood
I will most certainly check out that website; any ideas I can incorporate, I consider a success. (Though only features users will actually use; I may opt for a modular/plugin-style system, to avoid the unnecessary cruft a lot of software has these days.)
As for file metadata, I think the name/file type should have the possibility of being encrypted as well, at the user's discretion. For example, if the government uses it, then obvious file names will make files high-profile targets, particularly for botnets, which becomes a huge issue. Of course, not all users will want or need encrypted file names; for instance, my /home/venos/projects/xero/source/kernel/main.s is not particularly interesting to anyone but me, and since it will probably be open sourced anyway, assuming I get that far, encrypting the file name is just silly.
Grades of protection would be interesting to pursue as well. For instance, using the government example above, they'll need very good encryption, and chances are they won't mind waiting for it. However, if you want to store your family photos on there, sure, you don't want anyone looking at them, but the people you don't want looking at them won't go to too much trouble to get them, so a lower grade of encryption can be used, which would be faster.
My thoughts are to compress after encryption, as this may (or may not) dodge the issue raised earlier about pointless double compression. I'm not an expert on this; it will be tested later, with some test code. I don't want to start a flame war, and I stress the fact that this is untested speculation on my part. Hence, I'm wary of posting this.
It would be interesting to pursue this as a file sharing mechanism as well (note: for legal purposes; anyone who abuses this is stupid, and you're the reason we've got torrents restricted at the ISP). This would reduce traffic overhead for the people sharing, and would be no different for the people downloading, except a faster download. No more leaving the machine on all night.
This complicates storage, however. If 3 users put 2GB each on the system, and they all share a 1GB file between each other, who gets the overhead? Do we count the 1GB 3 times, which seems like something of a waste of space (though it could be a way of recovering the space lost to redundancy)? Do we count each of them 1/3 GB each? In which case, what if a 4th user comes along? This also makes it easy to see how many people have a file, which could be a privacy issue... Or do we only charge the original uploader for the space, which isn't fair on them, because the other two get the data for free?
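One way to make the "who pays for the shared gigabyte" question concrete is to charge each referencing user size/refcount, recomputed whenever someone joins or leaves the share. Just one possible policy, sketched below.
Code:
# Possible accounting policy: split the cost of a shared file evenly among
# everyone referencing it, recomputed whenever the set of sharers changes.
def charges(file_size_gb: float, sharers: list) -> dict:
    share = file_size_gb / len(sharers)
    return {user: share for user in sharers}

print(charges(1.0, ["a", "b", "c"]))        # 1/3 GB charged to each of 3 users
print(charges(1.0, ["a", "b", "c", "d"]))   # a 4th user joins: 1/4 GB each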
Still lots of issues to iron out, but I'm definitely liking the way the system is evolving.
Cheers,
Connor
Re: Idea for a distributed file hosting system
Posted: Thu Mar 21, 2013 7:14 am
by bluemoon
As with any resource pool system (including a money pool), it's practical for the service provider to seed a significant percentage of the pool at start-up, and progressively shift the load to users. This way you can assure service quality when it launches, while reducing/limiting cost as the user base grows.