BigTable-ish performance on MySQL (or any relational database) – lessons from Google, Amazon, YouTube, Twitter, and Flickr

Given my mainframe history it has taken some effort to wrap my mind around the new “database” architectures, but it turns out that all the differences boil down to a few key findings that can be reused in a more “traditional” application environment.

Warning: these personal conclusions have not yet been substantiated by any serious testing or even modeling. Take them with a grain of salt.

Warning 2: this is NOT a cookbook (although I will be a very happy camper indeed the day I can write one) – these are just (hopefully) useful observations.

First, get this:

Google BigTable and, to a slightly lesser extent, Amazon SimpleDB are not really databases, not the way we used to define them. The term “query-able attribute store” fits really well: instead of orderly tables with predictable rows and columns, they store key-value pairs. Think of this data as a collection of rather simplistic XML documents, or a bunch of arrays handled by memcached.

I suspect that what happened was a rather natural evolution: instead of neatly defined tables, the architects succumbed to the classic “we do not know what we will really need, so why don’t we just define one simple big table with ‘key’ and ‘value’ columns and stick everything in it.”

Take this big table, and then

  • make the columns of truly variable size (whatever your db engine calls it),
  • distribute it across servers for redundancy,
  • optimize it for data retrieval,
  • add some serious indexing (reverse indexes have been speculated),
  • give up normalization (it does not make much sense anymore, does it?),
  • chop all the unnecessary fancy features (like JOINs),

and voilà: you have BigTable (as opposed to the plain big table we started with).
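The “big table with ‘key’ and ‘value’ columns” idea can be sketched as an entity-attribute-value layout. This is only an illustration of the shape of the data, using SQLite in place of a real distributed store; the table and column names are made up:

```python
import sqlite3

# A minimal sketch of the "one big table of key-value pairs" idea
# (an entity-attribute-value layout). Names here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attribute_store (
        entity_id TEXT NOT NULL,
        attr_key  TEXT NOT NULL,
        attr_val  TEXT,          -- the variable-size "value" column
        PRIMARY KEY (entity_id, attr_key)
    )
""")

# Any entity can carry any set of attributes -- no fixed schema.
rows = [
    ("user:42", "name",  "Alice"),
    ("user:42", "email", "alice@example.com"),
    ("user:99", "name",  "Bob"),
    ("user:99", "plan",  "free"),   # an attribute user:42 does not have
]
conn.executemany("INSERT INTO attribute_store VALUES (?, ?, ?)", rows)

# Retrieval is a simple indexed lookup on the primary key, not a JOIN.
cur = conn.execute(
    "SELECT attr_key, attr_val FROM attribute_store WHERE entity_id = ?",
    ("user:42",),
)
print(dict(cur.fetchall()))  # -> {'name': 'Alice', 'email': 'alice@example.com'}
```

Everything the real systems add on top – distribution, replication, serious indexing – lives outside this schema; the point is only that the schema itself never has to change when a new attribute shows up.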

Lesson One: Understand Your Priorities

…when designing the database. BigTable sucks at writes. But that is fine with Google. “Fast, cheap, and good – pick any TWO.” There is always a price for any improvement. Figure out what you are willing to give up.

Yes, it’s painful, but, quoting Filip Verhaeghe: “It takes tremendous effort to scale databases dynamically on physical servers and real hard disks, and find things fast. If you’ve ever had databases that needed partitioning because they were simply too large, you know what I’m talking about.”

Lesson Two: Denormalize!

Twelve years ago a good friend of mine started a company that did everything Twitter does today and much more (just this once I am willing to give DFJ the credit for being visionary). When I saw the database chart, it covered a whole wall: a truly beautiful design that would make any DBA weep. It also required a lot of expensive hardware and a lot of redesign and ongoing optimization. Before they could sort it all out the first Internet bubble burst and the zero-revenue business model went out of fashion. The beautiful multi-join queries punished the company with high fixed costs, and it took some really drastic steps for the business to survive.

Twitter threw normalization out the window as soon as it collapsed under the traffic load, and now it is scaling up nicely with just one database server.
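What denormalizing buys you can be shown in miniature: copy the author’s display name into every message row, pay a little extra storage and a write-time update cost, and a timeline read becomes one indexed scan of one table instead of a JOIN. A sketch, again with SQLite and made-up table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized timeline: author_name is duplicated from a users table,
# so reading a timeline never needs a JOIN. (Illustrative schema only.)
conn.execute("""
    CREATE TABLE timeline (
        user_id     INTEGER,
        author_name TEXT,      -- duplicated on purpose
        body        TEXT,
        created_at  INTEGER
    )
""")
conn.execute("CREATE INDEX idx_timeline_user ON timeline (user_id, created_at)")
conn.executemany(
    "INSERT INTO timeline VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "first post", 100),
        (1, "bob",   "hi alice",   101),
    ],
)

# One table, one index, no JOIN:
cur = conn.execute(
    "SELECT author_name, body FROM timeline "
    "WHERE user_id = ? ORDER BY created_at",
    (1,),
)
print(cur.fetchall())  # -> [('alice', 'first post'), ('bob', 'hi alice')]
```

The trade-off is explicit: if alice renames herself, every copied row has to be touched. You accept that rare, slow write to make the common read cheap.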

Lesson Three: More Asynchronous Processing!

Not everything has to be done right away. Some transactions affect the immediate user experience, while others bear fruit at an undetermined later date. So, if they are not that critical, why not initiate them later via a reliable queuing/messaging mechanism? Flickr does exactly that, and most of these “underdog” transactions still complete within 15 seconds, while the initiating user enjoys near-instant response time.
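The pattern is simple: the request handler enqueues the slow work and returns immediately, and a background worker drains the queue. A minimal in-process sketch (a production setup would use a durable broker rather than `queue.Queue`, and the function names here are made up):

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for a real message queue
done = []               # record of completed background work

def worker():
    """Drain the queue; a None task is the shutdown sentinel."""
    while True:
        task = tasks.get()
        if task is None:
            break
        # e.g. resize a photo, fan out a post, update counters...
        done.append(f"processed {task}")

t = threading.Thread(target=worker, daemon=True)
t.start()

def handle_request(upload_id: str) -> str:
    tasks.put(upload_id)   # hand off the slow part
    return "ok"            # respond to the user right away

print(handle_request("photo-123"))  # -> ok
tasks.put(None)                     # tell the worker to stop
t.join()
print(done)                         # -> ['processed photo-123']
```

The user sees “ok” before the work is done; the queue guarantees the work still happens.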

A lot of old-school enterprise database applications are all about perfect data integrity, but even there, if you really think it through… there are opportunities.

Lesson Four: Pick the Right Infrastructure

Since the dawn of mainframes it has been easier to run the database engine on a single box. And so we have ended up with the “one server, lots and lots of storage” architecture.

Database clusters and proper partitioning require a whole new level of expertise, and geeks are not really good at selling the executives on “yes, we could do without all this effort right now, but you will thank me in the future”.

Google by necessity has gone with “lots of cheap, basic hardware”. It makes their infrastructure up to 33x cheaper. It also means that the data is split between lots of servers, each with 4GB to 8GB of RAM and about 20GB of actual data. Oh, and each piece of data is replicated on three separate servers. This suggests that about 70% of all Google’s “databases” are cached in RAM at all times (roughly 6GB of RAM against 20GB of data is about 30% per server; with three independently caching replicas, the odds that a given piece is in RAM somewhere work out to about 1 − 0.7³ ≈ 66%).

OK, maybe a bit less – we have to throw in something for those indexes. On the other hand, not all data is critical, and BigTable is really good at managing that distribution automatically.

So, consider the I/O of a pre-queried database that is already cached in RAM. And now consider the data-to-CPU ratio. No wonder Google compresses the data so that it moves faster through the network – they have processor cycles to spare! No wonder Google’s famous MapReduce engine pushes data processing onto the “database” servers: closer to the data, and with processing power to spare!
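The compress-before-shipping trade is easy to demonstrate: spend CPU cycles to shrink the bytes you push over the network or to disk. A tiny sketch with zlib (the sample row is made up, and real systems pick faster codecs, but the trade-off is the same):

```python
import zlib

# A repetitive "row" standing in for real record data.
row = b"user_id=42;name=Alice;bio=" + b"lorem ipsum " * 200

# Spend CPU here...
compressed = zlib.compress(row, level=6)

# ...to move far fewer bytes over the wire.
print(len(row), len(compressed))

# The round trip is lossless.
assert zlib.decompress(compressed) == row
```

When the bottleneck is the network or the disk and the CPUs are idle, this is close to free speed.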

There is a lot to learn here that does not require building a full replica of Google: cache the tables, consider the benefits of a properly partitioned database cluster early, and throw brains, not money, at the performance problem – and do it early.
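“Cache the tables” can start as small as a read-through cache in front of a query. A sketch of the idea with `functools.lru_cache` (the table and helper names are invented; a shared cache like memcached plays this role across multiple app servers):

```python
from functools import lru_cache
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO settings VALUES ('theme', 'dark')")

calls = 0  # count actual database hits

@lru_cache(maxsize=1024)
def get_setting(key: str) -> str:
    """Read-through cache: only the first lookup per key hits the database."""
    global calls
    calls += 1
    row = conn.execute("SELECT v FROM settings WHERE k = ?", (key,)).fetchone()
    return row[0] if row else ""

get_setting("theme")
get_setting("theme")   # served from RAM, no second query
print(calls)  # -> 1
```

The catch, as always with caching, is invalidation: a write to `settings` must also evict or refresh the cached entry, or readers will see stale data.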

I think I am looking forward to the next time I have to set up the architecture from scratch: this time I will do it right. And the best part is: I can do it right without giving up the benefits of a proper db engine.
