abigail at abigail.be
Tue Jul 15 20:50:08 BST 2008
On Tue, Jul 15, 2008 at 05:40:10PM +0100, Christopher Jones wrote:
> Assuming I don't have the time to have Perl cross-check every single
> value (any guesses on how long it would take Perl to compare ~1 billion
> values from MySQL?), how do these approaches sound for a quicker
I'd think that with comparing tables in the database, the bottleneck
is going to be the transfer of the data (disk to memory, and possibly
the network as well), not the comparison of the data.
> 1) For each table, return each field separately, ordered by the primary key
> 2) For int and double fields, calculate some statistical functions (like
> the mean and stdev) of the field, and compare those values between the
> databases. (It would be an incredible coincidence if 30,000 int values
> gave the same results if the databases were not the same).
> 3) join() the returned values and compare them. If comparing huge
> strings is an issue, then compare checksums of the strings.
> 4) As Michael suggested, directly compare some random values as a double check
> If anyone has any other ideas how to extend this to make it a bit
> better, I'd be interested.
I'd simply return the rows of both tables, in an order that doesn't
require the database to sort them (so, if you have InnoDB tables, primary
key order), and compare them row by row. You can stop as soon as you've
found a row that differs (unlike your case 2 above, which requires reading
all rows before any comparison can be made).
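The row-by-row comparison with early exit might look like this (a hedged sketch, again in Python; `first_difference` is an illustrative name, and the two arguments stand in for cursors fetching each table in primary-key order):

```python
from itertools import zip_longest

def first_difference(rows_a, rows_b):
    """Compare two row streams (assumed already in primary-key order)
    and return the index of the first differing row, or None if the
    streams are identical.  Stops reading at the first mismatch, so a
    difference near the start costs almost nothing."""
    missing = object()  # sentinel marking a stream that ran out early
    pairs = zip_longest(rows_a, rows_b, fillvalue=missing)
    for i, (row_a, row_b) in enumerate(pairs):
        if row_a != row_b:  # also catches one table having fewer rows
            return i
    return None
```

Because the streams are consumed lazily, this never needs to hold more than one row per table in memory, which matters at the ~1 billion value scale mentioned above.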
More information about the london.pm mailing list