Scalable databases and skewed access patterns
October 21, 2009
One thing I found remarkable with scalable databases like Dynamo and BigTable is that while they spread the data over many servers, there is just one server responsible for handling requests to a certain record (= key). In the case of HBase, there is only one server for handling a specific tablet (= set of records) at a time. In the case of Dynamo, requests are handled by one coordinator but sent to all replicas, waiting until a certain number responded (= the quorum thing). So for reads of one specific record, there is no load balancing over multiple servers (AFAIU).
Now the Dynamo paper notes that “even where there is a significant skew in the access distribution there are enough keys in the popular end of the distribution so that the load of handling popular keys can be spread across the nodes uniformly through partitioning.”
The BigTable paper writes about a relatively small table (~500 GB) in Google Earth that “must serve tens of thousands of queries per second per datacenter with low latency. As a result, this table is hosted across hundreds of tablet servers and contains in-memory column families.”
Apparently in real-world scenarios things turn out OK. Still, what would happen when you have a very small data set accessed by a very large amount of users? Probably it would be a bit crazy to touch the database for every read, and this could be solved by caching in the application layer.
Applications where you have the same problem for writes seem much less common to me, and so a specific solution can be engineered for that.
A thought on client-server versus P2P
October 19, 2009
Client-server and P2P (peer-to-peer) are two opposing architectures for distributed systems. However, in a client-server architecture, it is perfectly possible that the server is internally organized as a P2P system. This is the case for highly-available distributed storage systems like Dynamo and Cassandra.
A thought on the relation between RDMBSes and distributed stores
October 19, 2009
As it is often misunderstood, I think it might make sense to review the relation of RDBMSes and NoSQL-style distibuted stores.
Scaling the relational model across machine boundaries does not work very well, but in a distributed storage system scenario, the local storage can still be provided by an RDBMS.
A highly available distributed store uses a set of nodes with a set of techniques (replication, consistent hashing, vector clocks, hash trees, hinted handoff, …). But at the node-local level, the data still needs to be stored somehow.
Some (Dynamo depending on config, PNUTS) store the data in an RDBMS (MySQL), which then serves as a btree-based key-value store. In this case, they are simply used because they exist, are ready-to-use and known-to-work.
Others (Cassandra, HBase/BigTable) use a simple better-performing solution based on SSTables (or similar), log files and memtables.
Cassandra and HBase are also examples of how it still makes sense to provide node-local, non-scalable finer-grained structures, namely the columns within column families. Similarly, within a distributed store the relational features could still be exploited at the local level.
Virtual nodes
October 9, 2009
A recurring pattern noticed when partitioning (sharding) data over multiple systems is that of creating much more partitions than there are actual systems.
Let’s call the systems nodes and the partitions virtual nodes.
This is something which comes back in:
- Consistent hashing: each node gets assigned many of the hash buckets
- Katta (distributed Lucene): each search server can handle a variety of index shards
- Bigtable or HBase: each region server can handle a variety of regions
In these examples, the hash bucket, the index shard and the region are the virtual nodes.
The advantages include:
- being able to handle nodes with different capabilities (a heterogenous set of nodes): more powerful nodes get assigned more of the virtual nodes
- when a node goes down:
- the virtual nodes that it was responsible for can be re-assigned to multiple other nodes
- or from another point of view, the replica’s of the virtual nodes managed by this node will be on a variety of other nodes, so the load of the killed node will be automatically spread over multiple other nodes
- when a new node is added, it can take over the responsibility for some of the virtual nodes, and this can be done one by one to allow for warm up
Speaking of recurring patterns, there is also clear similarity in the way Lucene handles index updates and the way HBase & Bigtable handle updates:
- Lucene first buffers index updates in memory, and when certain limits are reached, flushed them to a new file on disk called an index segment. This is instead of trying to update the existing index in-place. Searching is done by searching over all index segments. When the number of segments gets large, some of them are merged together in the background.
- HBase/Bigtable does something similar with MapFiles/SSTables: new data is occasionally flushed to a new immutable MapFile (basically a file of key-value pairs sorted by key with an additional index containing the positions in the file of a subset of the keys, to avoid a full sequential scan). When looking for something, all flushed MapFiles are consulted until the item is found. A background process merges the MapFiles when there are too many of them.
Memory, disk & network speed
July 17, 2009
This acmqueue article by Adam Jacobs contains some interesting thoughts:
- in modern systems, [...] random access to memory is typically slower than sequential access to disk. Note that random reads from disk are more than 150,000 times slower than sequential access; SSD improves on this ratio by less than one order of magnitude.
- [...] the highest-speed local network technologies have now surpassed most locally attached disk systems with respect to bandwidth, and network latency is naturally much lower than disk latency. As a result, the performance cost of storing and retrieving data on other nodes in a network is comparable to (and in the case of random access, potentially far less than) the cost of using disk. Once a large dataset has been distributed to multiple nodes in this way, however, a huge advantage can be obtained by distributing the processing as well—so long as the analysis is amenable to parallel processing.
Web applications: Java vs Javascript
February 2, 2009
We position Kauri as a REST-centric Java web application framework, and ourselves as Java developers, but the reality is that we are doing more and more with Javascript.
Web-applications moved from being “thin client” to “fat client“: every web-page can be a complete application downloaded into your browser. Web development is now about exposing server-side resources in a RESTful way, and developing fat clients using Javascript.
Previously web-frameworks focused on the thin client model. They offered solutions for managing transient application state: sessions, conversations and the like. Often they helped developers by hiding HTTP, URIs and HTML (not a bad thing per se). All UI-logic was on the server, sometimes enhanced with some optional client-side Javascript.
But this is more and more a thing of the past. Kauri counts on the browser to keep interaction state. As a consequence, for UI work the balance is moving from Java to Javascript. Major parts of application front-ends are now developed as Javascript running in the browser (cfr. Kauri Forms). Interestingly, this also means there is more shared ground between web-frameworks that employ different server-side base technologies.
For completeness, I should mention that there is still a lot of Web-centric server-side work. Neither does the focus on Javascript mean that I’m talking about Gmail-style single-URI applications.
While we can now develop richer applications, it did not necessarily become easier. But if you are interested in working on a platform focusing on RESTful web application development, with a broad scope, and rich browser-based clients, check out Kauri.
Daisy 2.3 preview
February 1, 2009
A lot has been happening in Daisy-land, but since I have only been involved from a long distance, I have neglected to report on it. With Daisy 2.3 on its way, it is high time summarize some of the major new features.
Search & replace
Search & replace is a new feature allowing to perform a search & replace across an arbitrary set of documents.
You start by making a selection of documents through a query or the document basket, and then specify the parameters:

The sensible case setting will perform a case-unsensitive search, and smartly replace the word keeping the original word casing.
After pressing search, the system searches through all documents (keeping you updated on progress), and displays what it has found. You can then review the found occurrences and confirm what needs to be replaced:

Both the ’search’ and the ‘replace’ actions are implemented as document tasks. Document tasks have existed for a long while in Daisy, and allow to reliably execute a task across a set of documents and track the progress.
Upload control
Uploading binary files in Daisy, authored with other applications, is easy. However it is rather complicated if you want to edit the uploaded content: you need to download it somewhere, open that file in your editor, when you’re done go in the Daisy document editor, and upload the new file, which means you’ll have to browse for it again in the file system.
Now there is the upload control, a Java applet, which removes all these steps. When editing a document with uploaded content, you’ll see a new edit button:

Pressing that button will launch the applet:

You simply need to press a button to open the file in the native application, and when you’re done press Save to upload it again to Daisy. That’s all there is to it.
Inline editing
Next to the normal Daisy document editor, there is now a new inline editor. The inline editor allows to edit documents right where they are displayed, without losing the (navigation) context. The document is styled as usual, with parts and fields arranged as defined by your own styling XSL, but the parts and fields become editable. As a consequence, parts and fields can be edited on the same page, without switching tabs.
The way it works is that in the document-styling XSL, you add some attributes on HTML elements which contain a part or field, to indicate that in edit mode, the element should be replaced with an editor for that part or field. When you’re viewing the document, everything is as before, except there will be an edit button. When you press the edit button, the inline editor is launched, which will execute the same XSL, but elements having the inline editing attributes will be replaced by the respective field or part editors. The field and part editors share the same code as with those in the normal Daisy editor.

Partial write access
In Daisy 2.2, partial read access was introduced. This made it possible that someone can read a document without necessarily being able to read all fields and parts of it.
Now, we have also added partial write access. Using the ACL, you can control which document properties a user can update. This is how the ACL write detail permissions screen looks like:

The document editor will of course only allow you to edit those things for which you have write permission.
Partial write access can be combined with partial read access. Previously, having write access on a document implied that you have full read access.
Customizable document browser
When creating a link in Daisy to other Daisy documents, you use the document browser dialog to look up the document you want to link to. This dialog is used in various locations: in the HTML editor, in the navigation tree editor, for link fields, in the workflow task screens, etc.
For Daisy 2.3, the document browser gained a lot of new functionality, and became completely configurable. You can for example create a specific document browser configuration for editing a certain type of link field. For multi-value link fields, the document browser is now used for editing all links, rather than only adding new links.
You can configure which search fields are visible and which columns are displayed in the result table. You can show and hide the document preview. New query features include the ability to embed the faceted browser and to create predefined named queries.

Chunked query styling
The same query result browser as used in the document browser, is also available as a query styling. This means that by adding a simple style_hint option to a query, the query results will be displayed in a table with chunk navigation.
In the editor:

In the rendered document:

Other
We replaced the proprietary PDF formatter used for the books with FOP, which we already used for on-the-fly single document PDFs, but at the time did not work good enough for the books. The full-text search screen is now configurable (you can hide certain fields from the form) and gained new possibilities (e.g. selection of arbitrary collections). Improvements were made to repository namespaces, workflow (among others, the possibility to edit parts and fields inline on the workflow task screens), the document editor (editor height is adjusted automatically to the window height, no “Change editor height” button anymore). The access rights of the document owner can now be defined in the ACL, instead of being hard-coded to ‘full access’.
Within a few weeks, the 2.3 release candidate will be available. We hope you will give it a try and are looking forward to your feedback.
Security through obscurity
April 25, 2008
A while ago I changed the title of this blog to include the new project I’m working on, Kauri, a web application platform.
From time to time, I look around what other people are doing and so stumbled across this Wicket in Action book. You can download a draft of the first chapter for free. As I started reading, my attention sharpened when I saw the statement “There are some problems with this REST approach”. Which, you wonder? Well, for example:
Say for instance that you are authorized to view only part of the product database. With the product ids being passed around in the URLs what stops you from just trying a code or writing down one you saw your colleague from department Z use (you noticed the id in the URL when getting some coffee)? You’ll have to explicitly check whether the user is allowed to see a certain result. This can mean quite a lot of work and chances are you forget something.
Certainly a valid, but not very strong approach to security!
Regardless of the above quote, Wicket is probably a great framework, but it doesn’t give a good impression to read such things.
Global search and replace
April 24, 2008
Karel has been working on a great new Daisy feature: the ability to perform a search-and-replace across a selection of documents. A nice tour of this feature is in the documentation.
On the background, it is implemented as special kind of document task. There’s all sorts of nifty stuff involved, like handling of markup during search and replace, being able to keep the original casing of words, etc.
Unrelated to this search-and-replace, other recent improvements include:
- The fulltext search screen became more powerful, for example it is possible to freely select document collections, and it is customizable through configuration.
- Some workflow related improvements happened, especially the ability to embed workflow queries in publisher requests, so that it is possible to show workflow information along with documents.
- PDF renderer unification: with the new generation of the FOP XSL-FO processor available, we have been able to remove the dependency on the commercial IBEX XSL-FO engine for the books publishing, so that the same engine is now used everywhere for the creation of PDFs.
Daisy 2.2 released
March 9, 2008
Daisy 2.2 is now available. Besides some small fixes and translation updates, this is the same as the 2.2-RC release.