Compressing VENDORS.TXT

All off topic discussions go here. Everything from the funny thing your cat did to your favorite tv shows. Non-programming computer questions are ok too.
Post Reply
User avatar
bubach
Member
Member
Posts: 1223
Joined: Sat Oct 23, 2004 11:00 pm
Location: Sweden
Contact:

Compressing VENDORS.TXT

Post by bubach »

Hi, I had a look at VENDORS.TXT (PCI Vendor and Device Lists) from http://www.pcidatabase.com/ and it's almost at 300kb with the tab separated copy, the CVS is even bigger IIRC.

So, since my complete OS is about 20kb, this seems ridiculous. So I thought it must be possible to use some easy, yet effective compression on it. I like the "challenge" and also I would hope to keep my OS contained to 1.44mb for as long as possible.

My first thought was to eliminate all the douplicate words, like "ATI", "RADEON", "PCI" and so on, which are present almost all over the place. Hoping that maybe a lookup table for at least the most common phrases could be used instead of the full string every time.

Feeding the text to http://textalyser.net gave me some nice stats about word count, ATI for example is found 518 times, RADEON 427 times, PCI 144 times, SECONDARY 124 times and so on. But I'm afraid that this is sinking rapidly and would not be as useful as I first had hoped since many words, probably more than 50% is only used once.

So scrap that idea. But if I keep the vendor and device ID's as hex but use some kind of special encoding for the string. After all it's mostly A-Z and 0-9 characters, not much extra. Any ideas on how to encode the strings in an easy to extract way and yet get as much compression as possible? i was thinking of still having look-up for the most common words, maybe all of them if that makes extracting easier combined with another character encoding to get the best from both. If I gain from it.

Reasons against using common compression formats like simple .zip for example would be that I want as simple extraction as possible in pure assembly and also because I think that there could be something more targeted towards the file format and strings/characters used in this very specific case which would give better outcome anyway.. Well, perhaps not better. At least close to as good as for example ZIP.

Any ideas are welcome :)
Last edited by bubach on Sun May 20, 2012 7:04 pm, edited 1 time in total.
"Simplicity is the ultimate sophistication."
http://bos.asmhackers.net/ - GitHub
User avatar
gravaera
Member
Member
Posts: 737
Joined: Tue Jun 02, 2009 4:35 pm
Location: Supporting the cause: Use \tabs to indent code. NOT \x20 spaces.

Re: Compressing VENDORS.TXT

Post by gravaera »

Yo:

You needn't have the kernel or any of the drivers keep a list of mappings between device IDs and their names -- the kernel should only care about being to load a driver which matches the device ID.

In other words, your kernel sees that there is a device with VENDOR:DEVICE ID foo. It simply uses its own driver API and linkage standard to search the drivers in the initrd or disk for a driver which advertises itself as providing functions for controlling the device with ID foo. If a user ever wants to know what device specifically pertains to a certain VENDOR:ID combination (such as the case where a user is about to search for a driver for an "unknown device" on his system), they would have been using a userspace tool to look up the detected devices in the kernel device tree anyway;

The userspace tool is an excellent place to put the "lookup device name by ID" code, so it can display that information along with each PCI device in the tree. So what I'm saying is that there is no need to worry about adding the PCI vendor list to your kernel image since the kernel really doesn't need to care about the names of devices. Naturally your drivers would have a string identifying themselves which they would pass to the kernel, but that's at least only for drivers which have been loaded, as opposed to including a huge list of strings for devices that your kernel will probably not even support for a long time, much less need to know about them.

--Peace out
gravaera
17:56 < sortie> Paging is called paging because you need to draw it on pages in your notebook to succeed at it.
User avatar
bubach
Member
Member
Posts: 1223
Joined: Sat Oct 23, 2004 11:00 pm
Location: Sweden
Contact:

Re: Compressing VENDORS.TXT

Post by bubach »

I want to include this as default not because I have to, but because i feel like it. So it will probably be a freestanding executable or something.
"Simplicity is the ultimate sophistication."
http://bos.asmhackers.net/ - GitHub
User avatar
Owen
Member
Member
Posts: 1700
Joined: Fri Jun 13, 2008 3:21 pm
Location: Cambridge, United Kingdom
Contact:

Re: Compressing VENDORS.TXT

Post by Owen »

Simple LZ77 + Huffman Coding?

(Also known as Deflate, or zlib. If you were feeling adventurous you could increase efficiency slightly by using LZ77 + Arithmetic coding)
User avatar
bluemoon
Member
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: Compressing VENDORS.TXT

Post by bluemoon »

I suggest to feed vendor.txt into a filter script and retain only supported driver entry. That way you should have a relatively small file for distribution.
User avatar
bubach
Member
Member
Posts: 1223
Joined: Sat Oct 23, 2004 11:00 pm
Location: Sweden
Contact:

Re: Compressing VENDORS.TXT

Post by bubach »

Since I haven't got any good suggestions on home-brew/specialized compressions I might just use something like they do in PCIDev:
http://www.coolthemes.narod.ru/pcidev.html

Just show device type based on PCI class and subclass, it's slightly better than "unkown" on everything and yet not as space demanding. Could probably optimize the vendor name much better than they have in that program too. ;)
"Simplicity is the ultimate sophistication."
http://bos.asmhackers.net/ - GitHub
Post Reply