Page 1 of 1

Compressing VENDORS.TXT

Posted: Sun May 20, 2012 4:34 pm
by bubach
Hi, I had a look at VENDORS.TXT (PCI Vendor and Device Lists) from http://www.pcidatabase.com/ and it's almost at 300kb with the tab separated copy, the CVS is even bigger IIRC.

So, since my complete OS is about 20kb, this seems ridiculous. So I thought it must be possible to use some easy, yet effective compression on it. I like the "challenge" and also I would hope to keep my OS contained to 1.44mb for as long as possible.

My first thought was to eliminate all the douplicate words, like "ATI", "RADEON", "PCI" and so on, which are present almost all over the place. Hoping that maybe a lookup table for at least the most common phrases could be used instead of the full string every time.

Feeding the text to http://textalyser.net gave me some nice stats about word count, ATI for example is found 518 times, RADEON 427 times, PCI 144 times, SECONDARY 124 times and so on. But I'm afraid that this is sinking rapidly and would not be as useful as I first had hoped since many words, probably more than 50% is only used once.

So scrap that idea. But if I keep the vendor and device ID's as hex but use some kind of special encoding for the string. After all it's mostly A-Z and 0-9 characters, not much extra. Any ideas on how to encode the strings in an easy to extract way and yet get as much compression as possible? i was thinking of still having look-up for the most common words, maybe all of them if that makes extracting easier combined with another character encoding to get the best from both. If I gain from it.

Reasons against using common compression formats like simple .zip for example would be that I want as simple extraction as possible in pure assembly and also because I think that there could be something more targeted towards the file format and strings/characters used in this very specific case which would give better outcome anyway.. Well, perhaps not better. At least close to as good as for example ZIP.

Any ideas are welcome :)

Re: Compressing VENDORS.TXT

Posted: Sun May 20, 2012 5:08 pm
by gravaera
Yo:

You needn't have the kernel or any of the drivers keep a list of mappings between device IDs and their names -- the kernel should only care about being to load a driver which matches the device ID.

In other words, your kernel sees that there is a device with VENDOR:DEVICE ID foo. It simply uses its own driver API and linkage standard to search the drivers in the initrd or disk for a driver which advertises itself as providing functions for controlling the device with ID foo. If a user ever wants to know what device specifically pertains to a certain VENDOR:ID combination (such as the case where a user is about to search for a driver for an "unknown device" on his system), they would have been using a userspace tool to look up the detected devices in the kernel device tree anyway;

The userspace tool is an excellent place to put the "lookup device name by ID" code, so it can display that information along with each PCI device in the tree. So what I'm saying is that there is no need to worry about adding the PCI vendor list to your kernel image since the kernel really doesn't need to care about the names of devices. Naturally your drivers would have a string identifying themselves which they would pass to the kernel, but that's at least only for drivers which have been loaded, as opposed to including a huge list of strings for devices that your kernel will probably not even support for a long time, much less need to know about them.

--Peace out
gravaera

Re: Compressing VENDORS.TXT

Posted: Sun May 20, 2012 5:24 pm
by bubach
I want to include this as default not because I have to, but because i feel like it. So it will probably be a freestanding executable or something.

Re: Compressing VENDORS.TXT

Posted: Sun May 20, 2012 5:44 pm
by Owen
Simple LZ77 + Huffman Coding?

(Also known as Deflate, or zlib. If you were feeling adventurous you could increase efficiency slightly by using LZ77 + Arithmetic coding)

Re: Compressing VENDORS.TXT

Posted: Mon May 21, 2012 12:53 am
by bluemoon
I suggest to feed vendor.txt into a filter script and retain only supported driver entry. That way you should have a relatively small file for distribution.

Re: Compressing VENDORS.TXT

Posted: Mon May 21, 2012 3:47 am
by bubach
Since I haven't got any good suggestions on home-brew/specialized compressions I might just use something like they do in PCIDev:
http://www.coolthemes.narod.ru/pcidev.html

Just show device type based on PCI class and subclass, it's slightly better than "unkown" on everything and yet not as space demanding. Could probably optimize the vendor name much better than they have in that program too. ;)