Virtual nodes

October 9, 2009

A recurring pattern noticed when partitioning (sharding) data over multiple systems is that of creating much more partitions than there are actual systems.

Let’s call the systems nodes and the partitions virtual nodes.

This is something which comes back in:

  • Consistent hashing: each node gets assigned many of the hash buckets
  • Katta (distributed Lucene): each search server can handle a variety of index shards
  • Bigtable or HBase: each region server can handle a variety of regions

In these examples, the hash bucket, the index shard and the region are the virtual nodes.

The advantages include:

  • being able to handle nodes with different capabilities (a heterogenous set of nodes): more powerful nodes get assigned more of the virtual nodes
  • when a node goes down:
    • the virtual nodes that it was responsible for can be re-assigned to multiple other nodes
    • or from another point of view, the replica’s of the virtual nodes managed by this node will be on a variety of other nodes, so the load of the killed node will be automatically spread over multiple other nodes
  • when a new node is added, it can take over the responsibility for some of the virtual nodes, and this can be done one by one to allow for warm up

Speaking of recurring patterns, there is also clear similarity in the way Lucene handles index updates and the way HBase & Bigtable handle updates:

  • Lucene first buffers index updates in memory, and when certain limits are reached, flushed them to a new file on disk called an index segment. This is instead of trying to update the existing index in-place. Searching is done by searching over all index segments. When the number of segments gets large, some of them are merged together in the background.
  • HBase/Bigtable does something similar with MapFiles/SSTables: new data is occasionally flushed to a new immutable MapFile (basically a file of key-value pairs sorted by key with an additional index containing the positions in the file of a subset of the keys, to avoid a full sequential scan). When looking for something, all flushed MapFiles are consulted until the item is found. A background process merges the MapFiles when there are too many of them.

Leave a Reply