Creating Sample Data

AJ · Post by AJ » Wed Apr 25, 2007 3:37 am

Hi All,

As well as OS dev, I'm also creating a large database-based Windows app in C#. A big part of this system is contact management.

Does anyone know of any programs which generate a large amount of (sensible) random data (Forename, Surname, Title, Address etc...) which I can use to test my system on a larger scale? (I'm talking about 10,000-20,000 records.) I can import data from most database (and formatted text file) types.

If such a program was biased towards producing UK-type postal codes it would be useful.

Cheers,
Adam

Combuster · Post by **Combuster** » Wed Apr 25, 2007 4:18 am

Tried the telephone dictionary?

AJ · Post by AJ » Wed Apr 25, 2007 5:21 am

Combuster wrote:Tried the telephone dictionary?

I didn't mention - I'm not a professional developer so have no funds to buy *real* data or employ an office monkey, and don't have time to input 20k names from a paper directory! I'm aware of online directories where I can search for individual names for free, but not where I can download that quantity of data.

Solar · Post by **Solar** » Wed Apr 25, 2007 5:57 am

Well, either you're doing it professionally (and honestly, a phone directory on CD-ROM doesn't cost the world), or you're whipping up a script that takes some 20 forenames, 20 surnames, 20 postal codes etc. etc. and combines them at random...
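
A rough sketch of such a script, in Python, might look like the following. Everything in it is an assumption made for illustration: the seed lists, the field names, and the function names `generate_contacts`, `random_postcode` and `write_csv` are all hypothetical, and the postcodes only have the right shape (area letters, district number, then a digit and two letters); they are not checked against real addresses.

```python
import csv
import random
import sys

# Hypothetical seed lists; extend each to 20 or so entries for more variety.
TITLES = ["Mr", "Mrs", "Ms", "Dr"]
FORENAMES = ["Adam", "Sarah", "James", "Emma", "David", "Laura"]
SURNAMES = ["Smith", "Jones", "Taylor", "Brown", "Wilson", "Evans"]
STREETS = ["High Street", "Church Lane", "Station Road", "Mill Lane"]
# Town paired with its postcode area so the two at least agree.
PLACES = [("Leeds", "LS"), ("Bristol", "BS"), ("Norwich", "NR"), ("Cardiff", "CF")]
# Letters used for the last two characters (assumed set; C, I, K, M, O, V left out).
INWARD_LETTERS = "ABDEFGHJLNPQRSTUWXYZ"

FIELDS = ["title", "forename", "surname", "address", "town", "postcode"]


def random_postcode(rng, area):
    """Return a string shaped like a UK postcode; it need not exist."""
    district = rng.randint(1, 20)
    sector = rng.randint(0, 9)
    unit = rng.choice(INWARD_LETTERS) + rng.choice(INWARD_LETTERS)
    return f"{area}{district} {sector}{unit}"


def generate_contacts(count, seed=None):
    """Build `count` contact records by combining the seed lists at random."""
    rng = random.Random(seed)  # a fixed seed gives repeatable test data
    rows = []
    for _ in range(count):
        town, area = rng.choice(PLACES)
        rows.append({
            "title": rng.choice(TITLES),
            "forename": rng.choice(FORENAMES),
            "surname": rng.choice(SURNAMES),
            "address": f"{rng.randint(1, 200)} {rng.choice(STREETS)}",
            "town": town,
            "postcode": random_postcode(rng, area),
        })
    return rows


def write_csv(rows, fileobj):
    """Write the records as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Print a small sample; raise the count to 20000 and redirect to a file
    # to get an importable data set.
    write_csv(generate_contacts(5, seed=1), sys.stdout)
```

Titles and forenames are picked independently here, so the odd "Mr Sarah" will turn up; pairing them the way towns and postcode areas are paired would fix that.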

mystran · Post by **mystran** » Wed Apr 25, 2007 11:13 am

AJ wrote: I didn't mention - I'm not a professional developer so have no funds to buy *real* data or employ an office monkey, and don't have time to input 20k names from a paper directory! I'm aware of online directories where I can search for individual names for free, but not where I can download that quantity of data.

Actually, I guess this is the type of thing you could solve without really testing, if you do a little analysis. In any sane implementation, the bottleneck will be the database, and unless you are really into database design, you should probably be using some off-the-shelf SQL database. Then your performance is pretty much dependent on two things: how many queries you need to do, and whether those queries can be optimized by the database using indexes.

You don't need 20k tuples in your database to figure such things out. You need around 20. Then look at the number of queries you do, and ask your DBMS to explain (often this is indeed the command "EXPLAIN") how it does those queries. Alternatively, you could just feed in some 20 sensible entries, and then generate (with a small program in whatever scripting language) any number of not-so-sensible entries, which just happen to have the right format. This is a good strategy if your DBMS is too intelligent to use an index when tables are small enough that it's faster to just read through them.
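
As a sketch of what asking the DBMS looks like in practice, here is a small Python example using SQLite as a stand-in (the thread does not say which database is in use, and other engines spell the command and its output differently). The `contacts` table, its columns, the index name and the helper functions are all made up for the illustration. It loads 20 rows and compares the plan for a surname lookup before and after adding an index.

```python
import sqlite3

# Hypothetical lookup a contact manager might run often.
LOOKUP = "SELECT * FROM contacts WHERE surname = ?"


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for `sql` as one string (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


def build_demo():
    """Create an in-memory database with 20 placeholder contacts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts ("
        "id INTEGER PRIMARY KEY, forename TEXT, surname TEXT)"
    )
    conn.executemany(
        "INSERT INTO contacts (forename, surname) VALUES (?, ?)",
        [(f"Forename{i}", f"Surname{i}") for i in range(20)],
    )
    return conn


def compare_plans():
    """Return the plan text without and with an index on surname."""
    conn = build_demo()
    before = query_plan(conn, LOOKUP, ("Surname7",))
    conn.execute("CREATE INDEX idx_contacts_surname ON contacts (surname)")
    after = query_plan(conn, LOOKUP, ("Surname7",))
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = compare_plans()
    print("without index:", before)
    print("with index:   ", after)
```

The first plan should describe a scan of the whole table and the second a search that names the index; the exact wording depends on the SQLite version. An engine that skips indexes on tiny tables would need the larger generated data set before the second plan changes.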

edit: oh and if you're unable to offload most of the work to the database engine, consider a redesign.