Sharding, and all that

Fri Dec 19 11:46:34 GMT 2008

Richard Huxton wrote:
> Yep - that's what "sharding" is all about - separate disconnected silos
> of data. 

I thought sharding specifically related to horizontal partitioning.  i.e.
splitting one table across several databases,  e.g. records with even row
ids in one DB, odd in another.  Apologies if my terminology is wrong.

I was thinking more specifically about vertical partitioning along the
functional boundaries which wouldn't be sharding by my (possibly incorrect)
definition.  Apologies for being off-topic, too  :-)

> Then you're not maintaining referential integrity. There's no point in
> having a user-id that doesn't *mean* anything. 

I'm not suggesting that the user id doesn't mean anything.  It means
exactly the same thing as it does in any other database - a unique
identifier to a user record.

I'm saying that the user record in the message DB doesn't need to
store email address, username, password, favourite colour, challenge
question/response, and all the other cruft that goes with user administration.
All it needs is a unique id, an authentication realm (identifying the
authentication service) and authentication token (identifying the user
at that service).  And perhaps any message-board specific niceties such
as your local nick and display preferences.

Similarly in the authentication DB there's no need to store information
in the users table relating to message board preferences.

I can see that trivially splitting one user table into two leads to all
sorts of integrity problem.  But I'm thinking of them as two separate
user databases from the outset and accepting that they're potentially
disjoint.

The best (but poor) example I can think of right now is how my gravatar
popped up automatically when I signed up at github.  Not because the github
user database is referentially at one with the gravatar database, but because
I used the same public identifier (my email address) on both systems.

So it could be argued that there is *a* point in having a user id (such
as email address) that doesn't *mean* anything to the *current* database,
because it might have meaning to *other* databases.  It's the closest
thing we've got to referential integrity in a distributed world.

> Primary keys, foreign
> keys and all the other bits and pieces of RI in a SQL database are there
> to maintain the *meaning* of your data.

Sure, I recognise the fact that you lose referential integrity at the
boundary between your db and the "outside world".  But internally,
the DB remains referentially intact.  The message board still has its
own user records for messages to reference.  The fact that the authentication
realm/token may at some point in the future become invalid is really no
different to the current situation where a user's email address changes
and they can no longer login or get a password reminder/reset.

> Before that though, make sure you do have a problem. Pick the right tool
> for the job - if high concurrency/complex queries/procedural code for
> constraints is a requirement then it's probably not MySQL. Always
> consider what an extra couple of grand on hardware will gain you.

For this particular project we don't really have a database problem that
can't be solved with a bit of replication and a whack of the hardware hammer.
The majority of the load will be searches that we're reading from a single
read-only table.  So we can replicate that easily across as many web heads as
required for performance and it gives us simple redundancy in case of machine
failure, etc.

All the read/write functionality (mostly CMS-like stuff) will happen
(initially) on a single server with master read/write database.  It'll
be fairly low-load and it's not mission critical (in the sense that having
the CMS offline for a few hours is an inconvenience not a hanging offence).

The impetus was more about turning one large and tangled CMS-like system into
several smaller, simpler systems.  90% of the tables fall neatly into one
sub-system or the other (front-end search and content delivery, back-end
content management, image management, accounting and billing, and so on).
It's the last few  tables (mostly relating to users, unsurprisingly) that
straddle spread-eagled across the database dangling their tools in the works.

So I'm really coming at this from the perspective of someone thinking about
building distributed apps and wondering how the database is going to work
behind the scenes, rather than someone with a database performance problem
considering partitioning.

Cheers
A