file check
Christopher Jones
c.jones at ucl.ac.uk
Tue Jul 15 17:40:10 BST 2008
On 15 Jul 2008, at 16:34, Frank v Waveren wrote:
> On Tue, Jul 15, 2008 at 04:05:24PM +0100, Christopher Jones wrote:
>> Anyone got any better ideas for checking the integrity of my new
>> database?
>
> Maatkit contains a number of very useful MySQL tools, including
> mk-table-checksum, which compares tables. You'll have to load both
> your dumps into MySQL though.
>
> http://maatkit.sourceforge.net/doc/mk-table-checksum.html
>
> F.
That looks like a neat tool - definitely worth having! But looking at
the docs, it seems it will probably fail for the same reason cmp failed
- presumably (as someone else pointed out) subtle differences in the
dump layout due to optimisations like varchar -> char.
I figure Michael is right to point out that this is a job for Perl...
and this is a Perl forum after all!
Assuming I don't have the time to have Perl cross-check every single
value (any guesses on how long it would take Perl to compare ~1
billion values from MySQL?), how do these approaches sound for a
quicker comparison (rough sketches after the list):
1) For each table, return each field separately, ordered by the
primary key.
2) For int and double fields, calculate some summary statistics (like
the mean and stdev) of the field, and compare those values between
the databases. (It would be an incredible coincidence if 30,000 int
values gave the same mean and stdev in both databases without the
data actually being identical.)
3) join() the returned values and compare them. If comparing huge
strings is an issue, then compare checksums of the strings.
4) As Michael suggested, directly compare some random values as a
double check.
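To make that concrete, here's the sort of thing I have in mind for
(1)-(3) using DBI - a rough sketch only, and the connection details
and the table/column/key names ('samples', 'value', 'name', 'id') are
all made up:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);

    # Placeholder connection details - substitute your own.
    my $old = DBI->connect('dbi:mysql:database=olddb', 'user', 'pass',
                           { RaiseError => 1 });
    my $new = DBI->connect('dbi:mysql:database=newdb', 'user', 'pass',
                           { RaiseError => 1 });

    # (2) For a numeric column, let MySQL compute the aggregates and
    # compare just those.
    sub numeric_summary {
        my ($dbh, $table, $col) = @_;
        return join '/', map { defined $_ ? $_ : 'NULL' }
            $dbh->selectrow_array(
                "SELECT COUNT($col), AVG($col), STDDEV($col) FROM $table");
    }

    # (1) + (3) Fetch a column ordered by the primary key, join the
    # values (NUL-separated, so 'ab','c' doesn't match 'a','bc') and
    # compare an MD5 of the result rather than the huge string itself.
    sub column_md5 {
        my ($dbh, $table, $col, $pk) = @_;
        my $vals = $dbh->selectcol_arrayref(
            "SELECT $col FROM $table ORDER BY $pk");
        return md5_hex(join "\0",
                       map { defined $_ ? $_ : 'NULL' } @$vals);
    }

    for my $table ('samples') {
        my $s_old = numeric_summary($old, $table, 'value');
        my $s_new = numeric_summary($new, $table, 'value');
        warn "$table.value stats differ: $s_old vs $s_new\n"
            if $s_old ne $s_new;

        my $m_old = column_md5($old, $table, 'name', 'id');
        my $m_new = column_md5($new, $table, 'name', 'id');
        warn "$table.name checksum differs\n" if $m_old ne $m_new;
    }

For the really big tables you'd want to feed the rows one at a time
into Digest::MD5->new->add via a prepared statement rather than
slurping a whole column with selectcol_arrayref, or memory will hurt.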
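And a quick stab at (4), re-using the same handles - again, 'samples'
and 'id' are invented names:

    # (4) Spot-check a few hundred random rows end to end.
    my ($max_id) = $old->selectrow_array('SELECT MAX(id) FROM samples');
    for (1 .. 200) {
        my $id = 1 + int rand $max_id;
        my $a  = $old->selectrow_arrayref(
            'SELECT * FROM samples WHERE id = ?', undef, $id);
        my $b  = $new->selectrow_arrayref(
            'SELECT * FROM samples WHERE id = ?', undef, $id);
        if (!$a or !$b) {
            warn "row $id missing from one side\n" if $a or $b;
            next;    # id absent from both - nothing to compare
        }
        warn "row $id differs\n"
            if join("\0", map { defined $_ ? $_ : 'NULL' } @$a)
            ne join("\0", map { defined $_ ? $_ : 'NULL' } @$b);
    }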
If anyone has any other ideas on how to extend this to make it a bit
better, I'd be interested.
Chris.
Gynaecological Cancer Research Laboratories,
UCL EGA Institute for Women's Health,
University College London,
Paul O'Gorman Building,
72 Huntley Street,
London WC1E 6DD
United Kingdom
Telephone: 020 3108 2007