Unicode data cleaning in databases

Paul Makepeace paulm at paulm.com
Sun Aug 27 16:01:07 BST 2006

I have a database containing a mix of Latin-1, ASCII, and UTF-8. I have
some code that catches this on the way out, but ideally I'd rather just
clean it up, since the incoming sources are now all UTF-8.

I have in my head an app that does schema discovery (à la the various
::Schema::Loader modules), finds all the TEXT/(VAR)?CHAR/etc. text
columns, pulls the data in, checks each value's encoding, and updates it
to UTF-8 where necessary.
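A minimal sketch of that cleanup pass, using DBI's `column_info` to walk the schema. The `CLEANUP_DSN`/`CLEANUP_USER`/`CLEANUP_PASS` environment variables, the set of text types, and the assumption that every table has an `id` primary key are all hypothetical — adjust for your database:

```perl
use strict;
use warnings;
use Encode ();

# Column types we treat as text -- an assumption; extend for your schema.
sub is_text_type {
    my ($type) = @_;
    return $type =~ /^(?:(?:TINY|MEDIUM|LONG)?TEXT|(?:VAR)?CHAR)$/i ? 1 : 0;
}

# True if the byte string already decodes as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() may modify its input on failure
    return eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Walk every text column and re-encode Latin-1 stragglers in place.
if (my $dsn = $ENV{CLEANUP_DSN}) {
    require DBI;
    my $dbh = DBI->connect($dsn, $ENV{CLEANUP_USER}, $ENV{CLEANUP_PASS},
                           { RaiseError => 1 });
    my $cols = $dbh->column_info(undef, undef, '%', '%');
    while (my $col = $cols->fetchrow_hashref) {
        next unless is_text_type($col->{TYPE_NAME});
        my ($table, $column) = @{$col}{qw(TABLE_NAME COLUMN_NAME)};
        my $rows = $dbh->selectall_arrayref(
            "SELECT id, $column FROM $table");   # assumes an `id` key
        for my $row (@$rows) {
            my ($id, $value) = @$row;
            next if !defined $value || looks_like_utf8($value);
            my $fixed = Encode::encode("UTF-8",
                            Encode::decode("ISO-8859-1", $value));
            $dbh->do("UPDATE $table SET $column = ? WHERE id = ?",
                     undef, $fixed, $id);
        }
    }
}
```

The Latin-1 → UTF-8 step is safe to run repeatedly because already-valid UTF-8 rows are skipped, so the pass is idempotent.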

Does anything like this exist?


For the curious, right now I'm doing something like:

use Encode qw(from_to);

sub to_utf8 {
	foreach my $d (@_) {
		from_to($d, "iso-8859-1" => "utf8") if has_bad_utf8($d);
		#$d = CGI::Enurl::enurl($d);
	}
}

sub has_bad_utf8 {
	# Returns the 1-based offset of the first byte that breaks a
	# well-formed UTF-8 sequence, or 0 if the string is clean.
	return unless defined $_[0];
	$_[0] =~ m/^((
		[\x00-\x7f]                 |
		[\xc0-\xdf][\x80-\xbf]      |
		[\xe0-\xef][\x80-\xbf]{2}   |
		[\xf0-\xf7][\x80-\xbf]{3}
	)*)(.*)$/sx;
	length $3 ? $-[3] + 1 : 0;
}
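If you'd rather not hand-roll the byte-sequence regex, Encode can do the validity check for you: decode with FB_CROAK and see whether it dies. A sketch (note that Encode's strict "UTF-8" encoding name validates more tightly than the lax "utf8"):

```perl
use strict;
use warnings;
use Encode ();

# True if the byte string is well-formed UTF-8. decode() with FB_CROAK
# dies on the first malformed sequence, so we just trap that.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return 0 unless defined $bytes;
    my $copy = $bytes;   # decode() may modify its input on failure
    return eval { Encode::decode("UTF-8", $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}
```

For example, `is_valid_utf8("caf\xc3\xa9")` is true, while `is_valid_utf8("caf\xe9")` — a bare Latin-1 é — is false.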

More information about the london.pm mailing list