Compressing VENDORS.TXT
Posted: Sun May 20, 2012 4:34 pm
Hi, I had a look at VENDORS.TXT (PCI Vendor and Device Lists) from http://www.pcidatabase.com/ and the tab-separated copy is almost 300 KB; the CSV is even bigger, IIRC.
Since my complete OS is about 20 KB, this seems ridiculous, so I thought it must be possible to use some easy yet effective compression on it. I like the "challenge", and I would also like to keep my OS within 1.44 MB for as long as possible.
My first thought was to eliminate all the duplicate words, like "ATI", "RADEON", "PCI" and so on, which appear almost all over the place, hoping that a lookup table for at least the most common words could be used instead of the full string every time.
Feeding the text to http://textalyser.net gave me some nice word-count stats: ATI, for example, is found 518 times, RADEON 427 times, PCI 144 times, SECONDARY 124 times and so on. But I'm afraid the counts drop off rapidly, and this would not be as useful as I first hoped, since many words, probably more than 50%, are used only once.
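To put a rough number on the lookup-table idea, here is a minimal Python sketch of the estimate. It assumes, purely for illustration, that each table word is replaced by a one-byte token and that the table stores each word once with a one-byte terminator; the names `gain` and `dictionary_savings` are made up for this sketch.

```python
from collections import Counter

def gain(word, count):
    """Bytes saved by tokenizing one word (assumption: 1-byte token,
    word stored once in the table with a 1-byte terminator)."""
    saved_in_text = count * (len(word) - 1)   # each occurrence shrinks to 1 byte
    table_cost = len(word) + 1                # word plus terminator in the table
    return saved_in_text - table_cost

def dictionary_savings(text, top_n=128):
    """Sum the gains of the top_n most profitable words, ignoring losers."""
    counts = Counter(text.split())
    gains = sorted((gain(w, n) for w, n in counts.items()), reverse=True)
    return sum(g for g in gains[:top_n] if g > 0)
```

Under those assumptions, RADEON at 427 occurrences would save 2128 bytes and ATI at 518 occurrences 1032 bytes, while a word that appears only once always costs more than it saves, which is the drop-off I was worried about.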
So scrap that idea. But what if I keep the vendor and device IDs as hex and use some kind of special encoding for the strings? After all, it's mostly A-Z and 0-9 characters, not much extra. Any ideas on how to encode the strings in a way that is easy to extract and yet compresses as much as possible? I was thinking of still having a lookup table for the most common words (maybe all of them, if that makes extraction easier), combined with another character encoding to get the best of both, if I gain anything from it.
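One simple character encoding along these lines is 6 bits per character, so four characters fit in three bytes. The Python sketch below only illustrates the idea: the alphabet, its ordering, the use of code 0 as the end-of-string marker, and the names `pack6`/`unpack6` are all assumptions, and it assumes the names can be stored in uppercase.

```python
# Hypothetical 6-bit alphabet (must stay at 64 symbols or fewer).
# Index 0 is reserved as the end-of-string marker.
ALPHABET = "\0 ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-/.,()&+"

def pack6(s):
    """Pack a string into 6-bit codes, MSB first, zero-padded to a byte."""
    bits, nbits, out = 0, 0, bytearray()
    for ch in s.upper() + "\0":
        # ALPHABET.index raises ValueError for characters outside the alphabet.
        bits = (bits << 6) | ALPHABET.index(ch)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1              # keep only the unwritten bits
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)

def unpack6(data):
    """Read 6-bit codes until the end-of-string marker (code 0)."""
    bits, nbits, chars = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 6:
            nbits -= 6
            code = (bits >> nbits) & 0x3F
            if code == 0:
                return "".join(chars)
            chars.append(ALPHABET[code])
        bits &= (1 << nbits) - 1
    return "".join(chars)
```

For example, `pack6("ATI RADEON 9600")` takes 12 bytes instead of 16 for the zero-terminated ASCII string, a fixed 25% saving, and the decoder is just shifts and masks, which should translate to assembly without much trouble.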
My reasons against using a common compression format like plain .zip are that I want extraction to be as simple as possible in pure assembly, and that I think something targeted at the file format and the strings/characters used in this very specific case could give a better outcome anyway. Well, perhaps not better, but at least close to as good as ZIP.
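To have a number to compare a custom scheme against, a quick baseline can be taken with Python's `zlib` module, which implements DEFLATE, the method ZIP archives normally use. This is only a measuring aid on a development machine, not something that would run in the OS; `deflate_ratio` is a made-up helper name.

```python
import zlib

def deflate_ratio(data):
    """Compressed size divided by original size at the highest DEFLATE level."""
    return len(zlib.compress(data, 9)) / len(data)

# Usage sketch (file name as in this post):
# with open("VENDORS.TXT", "rb") as f:
#     print(deflate_ratio(f.read()))
```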
Any ideas are welcome.