Scalable databases and skewed access patterns
October 21, 2009
One thing I found remarkable with scalable databases like Dynamo and BigTable is that while they spread the data over many servers, there is just one server responsible for handling requests to a certain record (= key). In the case of HBase, there is only one server for handling a specific tablet (= set of records) at a time. In the case of Dynamo, requests are handled by one coordinator but sent to all replicas, waiting until a certain number responded (= the quorum thing). So for reads of one specific record, there is no load balancing over multiple servers (AFAIU).
Now the Dynamo paper notes that “even where there is a significant skew in the access distribution there are enough keys in the popular end of the distribution so that the load of handling popular keys can be spread across the nodes uniformly through partitioning.”
The BigTable paper writes about a relatively small table (~500 GB) in Google Earth that “must serve tens of thousands of queries per second per datacenter with low latency. As a result, this table is hosted across hundreds of tablet servers and contains in-memory column families.”
Apparently in real-world scenarios things turn out OK. Still, what would happen when you have a very small data set accessed by a very large amount of users? Probably it would be a bit crazy to touch the database for every read, and this could be solved by caching in the application layer.
Applications where you have the same problem for writes seem much less common to me, and so a specific solution can be engineered for that.