Column stores
October 7, 2007
In applications which need to manage entities with a flexible set of attributes, like Daisy’s documents, a typical problem is how to store this data in a relational database.
The most straightforward approach is using a big table containing tripples {entity ID, attribute ID, value}. This is the approach currently used in Daisy, although the ‘tripples’ are somewhat more complex: there are multiple ‘value’-columns for the different data types, indexes for multi-value and hierarchical fields, and some more. The problem with the tripple-table-approach is that searching on multiple fields requires lots of self-joins of this table.
Since documents in Daisy follow a document type, and each document type has a certain number of fields, another approach might be to dynamically create a database table corresponding to each document type. There are however a few reasons that would make this approach in Daisy complex: the type of a document in Daisy can be dynamically changed (and history of documents needs to be kept), fields can be multi-valued and hierarchical, the same Daisy field type can be shared between multiple document types, …
So using a traditional RDBMS has always felt a bit wrong. Some time ago I came across MonetDB, which uses “a storage model based on vertical fragmentation” (also called a decomposed storage model). The concept is easy to understand: a traditional database stores data per row, while this database stores data per column. In each column, only non-null values are stored, so it is ideal for sparse data. The cost of adding a new column is unrelated to the current number of rows or columns (5 or 5 million columns, it doesn’t matter). Adding a new record might be a bit more expensive, but query performance is better. MonetDB can also be used as an XML database and shows superior XQuery performance.
Recently I found a number of other papers on these topics:
- Column-Stores For Wide and Sparse Data
- Scalable Semantic Web Data Management Using Vertical Partitoning
- The End of an Architectural Era (It’s Time for a Complete Rewrite) (less about column stores but nonetheless interesting)
The second paper shows that using a column-oriented approach in a classical database (using one table per property) already has a good query-advantage over the traditional tripple-approach: [...] average query times go from around 100 seconds to around 40 seconds However: [...] by using a column-oriented DBMS [...] queries now run in an average of 3 seconds.
Up to now, our current field-storage approach is still working fine, but as datasets grow and queries become more complex, it is interesting to know there are others (especially the RDF crowd) thinking about these problems.
Next to all this, an interesting observation is that Alfresco and Jackrabbit are using Lucene for all their searching needs. I have not yet thought much about this approach, so I don’t know how well Daisy’s current query language can be implemented on this basis, but it’s sure an interesting path to investigate too, especially as it would unite full-text and metadata searches on a low level.
October 8, 2007 at 8:31 am
Interesting topic! I’ve often thought about Daisy’s storage model in this context too. You may also be interested in the column store work at the Hadoop project, which includes an RDF api through Hbase.
http://wiki.apache.org/lucene-hadoop/Hbase
This recent write up on Amazon’s Dynamo project may be of interest too: http://allthingsdistributed.com/2007/10/amazons_dynamo.html
October 9, 2007 at 2:39 pm
Given Daisy’s document-centric approach, something like CouchDB might also be appropriate. Although it’s probably not quite ready for “prime time” yet, it might be worthwhile to continue watching this space.
I’m very interested to see where you go with this.
October 9, 2007 at 5:31 pm
Thanks for the links.
BTW, currently these are just some random musings, no actions planned.
October 28, 2007 at 5:15 pm
Check out LucidDB also.
November 4, 2007 at 11:14 am
Your post set prompted me to look into several solutions. I ended up surveying the bulk of the xml database field. The end result that I wanted to pass by you, is, why not use IBM’s DB2 Express-C 9.5 (a free edition, a mature platform on various OSs) and transition the Daisy repository to a pure xml db? DB2 would allow a transition path from the relational model, using the current Daisy relational db schema, to a pure xml-based repository solution, in the same database.
See page 55, Fig 2.2 of “http://www.redbooks.ibm.com/abstracts/sg247301.html?Open” for where I’m coming from, along with the surrounding explanation.
The end solution would allow managing entities with a flexible set of attributes (Daisy documents) using a non-relational, XML databaase with Daisy Query mapped to XQuery with its performance attributes.
This has been fun. Daisy is great!