Sharding, and all that

Fri Dec 19 09:52:40 GMT 2008

Andy Wardley wrote:
> My take on this is that you would have to (re-)build your application and
> underlying DB to be distributed from the start.  Your user DB/app would
> provide an authentication service, perhaps something similar to OpenID,
> that
> your message board would use to authenticate users.

Yep - that's what "sharding" is all about - separate disconnected silos
of data. You know, like the ones that were all the rage in the 60s that
drove people to invent RDBMS. The good thing is, if your application is
successful and is still in use a couple of years from now you get to
either spend all your time fire-fighting or re-implementing integrity
constraints.

> The problem you mention is that the message.user foreign key then goes out
> the window.
> 
> One approach would be to have a local users table in the messaging DB which
> maps internal user IDs to some kind of external user identifier.  That
> could
> be their email address (in the simplest case), a URL that identifies the
> user
> at your authentication app (e.g. http://auth.your-app.com/joe90) or perhaps
> a GUUID.

Beside the point. Nothing stopping you running separate services talking
to one big database. Nothing stopping you replicating the user table to
various slaves and having separate message tables on each. That's not
sharding though - sharding (if it means anything) means to have
*disconnected* databases. Which works fine for Google since they
effectively need only one table for web-searching.

> Either way, you're explicitly decoupling the internal requirement that a
> message must have a "valid" user id (i.e. a record in the users table) from
> the assumption that you actually know anything valuable about that user
> that
> doesn't relate explicitly to the message board.
> 
> You can then maintain referential integrity in your message DB
> regardless of
> what happens in the user DB.  It would be possible to delete a user from
> the
> user system, for example, without having any *internal* effect on the
> message
> board DB (no more cascading delete woes).

Then you're not maintaining referential integrity. There's no point in
having a user-id that doesn't *mean* anything. Primary keys, foreign
keys and all the other bits and pieces of RI in a SQL database are there
to maintain the *meaning* of your data.

> I appreciate that "decoupling" is a fancy way of saying "broken", but I'm
> beginning to see it as a feature rather than a liability in this context.

I can only think of two contexts:
  1. Don't care
  2. Last resort - the project is pushing the boundaries of what is
     possible.
Now Google were lucky in that they scored on both.

> I should point out that I haven't implemented anything like this yet so I
> could be way off course. But I'm about to implement something like it for a
> $work project so any pointers on this would be welcome if I'm sailing in
> the
> wrong direction.

It's not uncommon to find a project with one very large table (sensor
readings, apache log lines, messages, text from documents). It's a lot
less common to have several very large tables. If you're in the first
camp then you can replicate all the (comparatively) small tables and
partition the single large one. That lets you maintain the view of a
single database from the outside.

Before that though, make sure you do have a problem. Pick the right tool
for the job - if high concurrency/complex queries/procedural code for
constraints is a requirement then it's probably not MySQL. Always
consider what an extra couple of grand on hardware will gain you.

The article you linked to and the one Nigel Hamilton mentions further
down the thread are both being "clever". In neither case (from the small
amount of detail available) does it look justified.

-- 
  Richard Huxton
  Archonet Ltd