Just in time for the weekend, Daisy 2.2-RC has been released.

The major new features in this release revolve around translation management, partial document read access and custom field editors. This release is interesting for any Daisy user though, since we’ve also added a bunch of other features and improvements.

As stated in my announcement mail on the list, we need your feedback, so don’t wait till tomorrow to install it :-)

Links:

Downloads

Changes since 2.1

Paul committed changes to trunk that make it possible to implement custom field editors, similar to how you can already implement custom part editors. This lets you plug in your own widgets instead of the normal entry for a field. As an example, there’s an article on the wiki showing how to embed Google Maps to select a location that will be stored in a field.

Meanwhile Karel has announced his work on a new kind of hierarchical selection list which will make it possible to manage very large hierarchies. This will also include a GUI to easily browse and edit such hierarchies. The hierarchical selection list data is user-specified (rather than derived from queries), so it is a bit like the current static selection list, but the selection list data is stored in Daisy documents for more flexibility. There are also search extensions planned to allow hierarchical searches even when the field type is non-hierarchical, that is, when the hierarchy path is not stored as part of the field.

And last but not least, as you can read on our company blog, we are starting a project to develop a new web application framework which will, among other things, serve as a basis for the next generation of the Daisy front-end.

Translation management

January 4, 2008

With the holidays and such, I hadn’t yet found time to blog about the new translation management features my colleague Karel and I have been adding to Daisy, so let’s have a look at them now.

There are lots of aspects to managing multi-lingual content. Here are some of them; not everything mentioned already exists in Daisy:

  • repository and document structure related: storing multiple language variants of a document (supported in Daisy since version 1.3), content which is shared by all language variants such as non-translatable fields (not yet in Daisy), language-specific fulltext-indexing behavior (not yet in Daisy)
  • management related: keeping track of the translation status, being able to query outstanding translation work (this is what we added now).
  • content related: markup for indicating non-translatable content or other hints towards translators (not specifically present in Daisy, though could be based on HTML element classes). See also ITS.
  • editing related: updating translation status (this is supported in Daisy, remainder not yet), split-screen editing, language-aware spell-checking, integration with terminology systems and translation memories, segment-based editing, …
  • publishing related: deciding what language to show
  • workflow related: external workflow, where content is exported and provided to a translation agency (this we added now), and internal workflow, using the Daisy-integrated jBPM workflow.

The main areas we have currently worked on are keeping track of the translation status, and extending the export/import tool with a format suitable for exchange with translation agencies (also known as localization providers).

Document model changes for translation management

A document in Daisy can exist in a number of language variants. We call the variant in which the original content is written the reference variant; the translated variants are translations of the reference variant.

To keep track of which version of a translated variant corresponds to which version of the reference variant, we added a ‘synced with’ property to versions. The ‘synced with’ property is a link to a version of another language variant. If new versions have been added to the reference variant, then the translated variants based on that reference variant will likely also require updating.

synced-with.png

However, not all changes to the reference variant require updating the translated variants, for example when you only fix a small typo in the reference variant. For this purpose, we’ve added a property ‘change type’ to versions, which can have the value major or minor. Major changes invalidate the translations, while minor changes do not. This major/minor distinction is, for translation management, only important on the reference variant. On other variants, or when you’re not making use of translation management, you can use this property as you desire.

major-minor-change.png

To indicate what the reference variant is for a document, a property ‘reference variant’ has been added to documents. Setting the reference variant for a document is also an indication that the document is considered to be under translation management.

While adding the ‘synced with’ and ‘change type’ properties to versions, we’ve also added a ‘change comment’ property, which can contain a short description of the changes in that version. This is somewhat similar to the commit message in SVN/CVS.

All these version properties are still editable after version creation, in contrast with the actual content of a version, which cannot be modified after version creation.

Here is how the new properties are presented in the editor. The ‘synced with’ property is only shown when relevant: when the reference language is set, and when the current variant is not the reference language.

editor_tm_bottom_edit.png

While at it, we also improved the top part of the document editor, to take less space, and show the document ID, type, branch and language:

editor_tm_top_edit.png

Querying for translation status

With the new document model properties, especially the ‘synced with’ field, we can now do various translation status searches. There are some new query language constructs for this, and to make it easy for users, we’ve made the commonly needed queries available via a new page accessible through Tools -> Translation Management.

The most useful is the overview query, which shows the translation status for all variants of a set of documents, as shown in the screenshot below.

tm_overview.png

In case you’re wondering: green means the translation is up to date, white means the variant does not exist, “not in sync” means the ‘synced with’ link does not point to the last version with major changes of the reference language, “not synced” means the ‘synced with’ link is not set at all, “not synced with ref variant” means the ‘synced with’ link points to a variant which is not the reference variant.
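To make those status rules a bit more concrete, here is a rough Python sketch. It is not how Daisy implements this: the `Version` class, its field names and the `translation_status` function are all made up for illustration. It also makes two assumptions of my own: the status is judged on the latest version of the translated variant, and a ‘synced with’ link to a version at or after the last major change of the reference variant counts as up to date.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Version:
    """Hypothetical stand-in for a document version (not Daisy's API)."""
    number: int
    change_type: str = "major"  # "major" or "minor"
    # (language, version number) of the version this one is synced with
    synced_with: Optional[Tuple[str, int]] = None


def last_major_version(versions: Sequence[Version]) -> Optional[int]:
    """Number of the most recent version marked as a major change, if any."""
    majors = [v.number for v in versions if v.change_type == "major"]
    return max(majors) if majors else None


def translation_status(reference_language: str,
                       reference_versions: Sequence[Version],
                       translated_versions: Sequence[Version]) -> str:
    """Classify a translated variant using the rules from the overview page."""
    if not translated_versions:
        return "does not exist"
    # Assumption: look at the latest version of the translated variant.
    latest = max(translated_versions, key=lambda v: v.number)
    if latest.synced_with is None:
        return "not synced"
    language, version_number = latest.synced_with
    if language != reference_language:
        return "not synced with ref variant"
    last_major = last_major_version(reference_versions)
    # Assumption: being synced with a later minor version is still fine.
    if last_major is not None and version_number < last_major:
        return "not in sync"
    return "up to date"


reference = [Version(1), Version(2, "minor"), Version(3, "major")]
dutch = [Version(1, synced_with=("en", 1))]
print(translation_status("en", reference, dutch))  # not in sync
```

Note how the minor version 2 on its own would not have invalidated the Dutch translation; it is the major version 3 that does.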

Translation import/export

A common need is to provide content stored in Daisy to localization providers. These people usually won’t edit content in Daisy, but have their own tool chain in place with content segmentation, translation memories, etc. So we only need a way to provide the content to them and load the returned result into Daisy. We already have an import/export tool to exchange content between repositories, but its format is not very well suited for this purpose: the content of one document is spread over multiple files (one for the meta data, one for each part), there is one directory per document rather than one file per document, … Also, for translation purposes, the content exported from one language variant needs to be imported into another language variant after translation, rather than the same language variant.

Therefore, we’ve extended the current import/export tool with the ability to import/export a new format which is better suited. All content of a document is embedded into one XML file (of course, this only includes text-based formats). There’s a lot more to say about this, if interested just go read the documentation.

Other changes

While at it, we’ve introduced some other improvements:

  • the daisy-wiki-add-site tool now has the ability to do a multi-language setup: it will automatically set up multiple sites for each desired language, with the navigation documents and home page documents being variants of the same documents.
  • the version list screen now allows editing the new version properties, as well as allows easy diffing between arbitrary versions.
  • the document-info popup has been redesigned to show more information, and is more quickly accessible through an icon in the menu.

We’re thinking of a 2.2 Milestone release in the coming weeks, so that those who don’t dare to build Daisy from source will be able to try out this new stuff.

More information on the multi-language features can be found in the documentation.

Column stores

October 7, 2007

In applications which need to manage entities with a flexible set of attributes, like Daisy’s documents, a typical problem is how to store this data in a relational database.

The most straightforward approach is using a big table containing triples {entity ID, attribute ID, value}. This is the approach currently used in Daisy, although the ‘triples’ are somewhat more complex: there are multiple ‘value’ columns for the different data types, indexes for multi-value and hierarchical fields, and some more. The problem with the triple-table approach is that searching on multiple fields requires lots of self-joins of this table.
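To show what those self-joins look like, here is a small sketch using Python’s built-in sqlite3 module. The table and column names (`field_values`, `doc_id`, `field`, `value`) and the sample data are invented for the example and much simpler than Daisy’s actual schema: a single text column stands in for the multiple typed value columns, and the field name is used directly instead of an attribute ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical triple table: one row per (document, field, value).
conn.execute("CREATE TABLE field_values (doc_id INTEGER, field TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO field_values VALUES (?, ?, ?)",
    [
        (1, "author", "alice"), (1, "status", "published"), (1, "category", "news"),
        (2, "author", "alice"), (2, "status", "draft"), (2, "category", "news"),
        (3, "author", "bob"), (3, "status", "published"), (3, "category", "news"),
    ],
)

# Searching on three fields needs the same table three times:
# every extra field in the condition adds one more self-join.
rows = conn.execute(
    """
    SELECT a.doc_id
    FROM field_values a
    JOIN field_values s ON s.doc_id = a.doc_id
    JOIN field_values c ON c.doc_id = a.doc_id
    WHERE a.field = 'author' AND a.value = 'alice'
      AND s.field = 'status' AND s.value = 'published'
      AND c.field = 'category' AND c.value = 'news'
    """
).fetchall()
matching_ids = [row[0] for row in rows]
print(matching_ids)  # [1]
```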

Since documents in Daisy follow a document type, and each document type has a certain number of fields, another approach might be to dynamically create a database table corresponding to each document type. There are however a few reasons that would make this approach in Daisy complex: the type of a document in Daisy can be dynamically changed (and history of documents needs to be kept), fields can be multi-valued and hierarchical, the same Daisy field type can be shared between multiple document types, …

So using a traditional RDBMS has always felt a bit wrong. Some time ago I came across MonetDB, which uses “a storage model based on vertical fragmentation” (also called a decomposed storage model). The concept is easy to understand: a traditional database stores data per row, while this database stores data per column. In each column, only non-null values are stored, so it is ideal for sparse data. The cost of adding a new column is unrelated to the current number of rows or columns (5 or 5 million columns, it doesn’t matter). Adding a new record might be a bit more expensive, but query performance is better. MonetDB can also be used as an XML database and shows superior XQuery performance.
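A toy illustration of the per-column idea, in Python. This is only a sketch of the concept and has nothing to do with how MonetDB is actually built; the `ColumnStore` class and its methods are hypothetical names. Each column is kept as its own mapping from row ID to value, and nulls are simply never stored.

```python
from collections import defaultdict


class ColumnStore:
    """Toy decomposed storage: one {row id -> value} mapping per column."""

    def __init__(self):
        # Only non-null values are stored, so sparse data costs nothing extra.
        self.columns = defaultdict(dict)

    def insert(self, row_id, **values):
        # A record is split up over the columns it has values for.
        # A column that was never seen before is created on first use,
        # independent of how many rows or columns already exist.
        for column, value in values.items():
            if value is not None:
                self.columns[column][row_id] = value

    def select(self, **conditions):
        """Return sorted ids of rows matching all column == value conditions.

        With no conditions at all this returns an empty list.
        """
        result = None
        for column, wanted in conditions.items():
            ids = {rid for rid, v in self.columns.get(column, {}).items() if v == wanted}
            result = ids if result is None else result & ids
        return sorted(result) if result is not None else []


store = ColumnStore()
store.insert(1, author="alice", status="published")
store.insert(2, author="alice")
store.insert(3, author="bob", status="published", reviewer="carol")
print(store.select(author="alice", status="published"))  # [1]
```

A multi-field search becomes an intersection of row-ID sets, one per column, rather than a chain of self-joins over one big table.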

Recently I found a number of other papers on these topics:

The second paper shows that using a column-oriented approach in a classical database (using one table per property) already has a good query advantage over the traditional triple approach: “[...] average query times go from around 100 seconds to around 40 seconds”. However: “[...] by using a column-oriented DBMS [...] queries now run in an average of 3 seconds.”

Up to now, our current field-storage approach is still working fine, but as datasets grow and queries become more complex, it is interesting to know there are others (especially the RDF crowd) thinking about these problems.

Next to all this, an interesting observation is that Alfresco and Jackrabbit are using Lucene for all their searching needs. I have not yet thought much about this approach, so I don’t know how well Daisy’s current query language can be implemented on this basis, but it’s certainly an interesting path to investigate too, especially as it would unite full-text and metadata searches on a low level.

I need to concentrate on other things, but before it is old news I should announce Daisy’s new partial read permission feature.

Once upon a time, in the early days of Daisy, you could either grant or deny read permission to a document. Later on, we added a read-live permission next to the read permission. When you grant read-live permission but deny read permission, users will be able to read a document, but only the data from its live version. This is a quite essential feature once you start using Daisy for things like public websites.

Now we took it a step further. The read permission can now be refined using some access details (I’m starting to think that ‘permission details’ might be a better name?). These details can specify:

  • whether you can read non-live versions (this replaces the earlier read-live permission)
  • whether you can read all fields, if not, you can list a few exceptions
  • same for parts
  • whether the fulltext index for this document can be read. If denied, the document will be removed from result sets of queries containing a fulltext condition.
  • whether the document summary (= first 300 characters of the document) can be read
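As a rough picture of what such access details amount to, here is a Python sketch. The `AccessDetails` class, its attribute names and the `filter_fields` helper are hypothetical and not Daisy’s API. I am also assuming that the listed exceptions are the fields and parts that remain readable when “all” is denied.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class AccessDetails:
    """Hypothetical refinement of a granted read permission."""
    non_live_versions: bool = True
    all_fields: bool = True
    # Assumption: fields that stay readable when all_fields is False.
    field_exceptions: FrozenSet[str] = frozenset()
    all_parts: bool = True
    # Assumption: parts that stay readable when all_parts is False.
    part_exceptions: FrozenSet[str] = frozenset()
    fulltext_index: bool = True
    summary: bool = True

    def can_read_field(self, name: str) -> bool:
        return self.all_fields or name in self.field_exceptions

    def can_read_part(self, name: str) -> bool:
        return self.all_parts or name in self.part_exceptions


def filter_fields(fields: Dict[str, str], details: AccessDetails) -> Dict[str, str]:
    """Keep only the fields the access details allow the user to read."""
    return {name: value for name, value in fields.items() if details.can_read_field(name)}


# A user who may see that the document exists, plus its title and summary.
details = AccessDetails(
    non_live_versions=False,
    all_fields=False,
    field_exceptions=frozenset({"Title"}),
    all_parts=False,
    fulltext_index=False,
)
visible = filter_fields({"Title": "Roadmap", "Budget": "secret"}, details)
print(visible)  # {'Title': 'Roadmap'}
```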

The below screenshot shows the new administration dialog. The interface for specifying accessible fields and parts is still a bit rough.

accessdetails.png

The immediate use-case we had for this is to allow users to see a document exists, but without being able to read its content, except maybe for some fields and a summary part.

More technical details on the partial read permission can be found in this mail and a small update.

Daisy 2.1 released

September 4, 2007

I’ll keep this short, since nothing much changed since the 2.1-RC release, except for diff-related improvements.

Quick links:

Release announcement

Downloads

Changes since 2.0

Demo site

GSoC wrap-up

August 27, 2007

Our second participation in the Google Summer of Code (GSoC) has been small (one student) but very productive. In case you’re new around here, our student this year is Guy, who has been working on a comparison engine for HTML documents. His code is integrated into Daisy trunk and will be part of the 2.1 release.

While it is still fresh on my mind, I’ll write down what I should not forget for next year’s GSoC:

  • we need to improve on the project ideas:
    • put focus on ideas which we actually want to see accomplished (and want to integrate into Daisy). For example, we got many proposals for the “integrate Daisy in non-Java environment” idea, but we had those last year already, so I wasn’t interested in doing that again (with the exception maybe for a browser-hosted Javascript API to access the repository). We also had some nice ideas for which we got no proposals at all…
    • state clearly at the start of the ideas page that students can do interesting projects even if they never heard of Daisy before (and of course, live up to that promise).
  • integrating Guy’s work early through the project has made sure that now, at the end, his work is actually put to use, and has had enough testing and polishing.
  • have chat or a daisy-gsoc-irc channel available for small talk. Usually it is beneficial to have all communication and decision making in the mailing list archives, but irc might help for small quick questions the students might have.
  • Daisy itself needs to be accessible enough to newcomers and developers. The source build improvements in Daisy 2.1 are a first step, but we need to continue working on this.

I hope we’ll be able to participate again next year; I’m looking forward to it already. Lots of thanks to Google for the initiative, to Steven for getting us in, to Marc since his message brought Guy here, and last but not least to Guy for being a fabulous student.

Daisy 2.1-RC is now available (download here). This release follows 2.0 a bit sooner than usual; the idea is to quickly provide users with a large number of smaller improvements. Some of the more notable items:

  • Various enhancements to the query language, the navigation manager, the publisher and the faceted browser which will be noticed by people customizing and building on Daisy.
  • New functionality: variables in documents and the ability to shift headings when including documents.
  • A new Spring-based Daisy Runtime platform for the repository server. Creating and deploying repository plug-ins has become much easier and is now documented.
  • A first version of the HTML diff library of our GSOC student, Guy, has been integrated (example, example).
  • Doing a first-time build in pre-2.1 Daisy was a rather labor-intensive process. This made building Daisy from source unattractive. Daisy 2.1 corrects this situation: the build is now a lot simpler.

The final Daisy 2.1 release should arrive in about 3 weeks. Everyone’s welcome to provide feedback on the release candidate on the Daisy mailing list.

Variables in documents

July 27, 2007

Announcing a new Daisy feature…

Background (skip this if you know Daisy)

Various tools are available for documentation authoring and publishing. One set of them are tools oriented towards technical writers such as Docbook, DITA, and lots of others of which I’m mostly ignorant.

Daisy offers an interesting alternative by providing an online, easy-as-a-wiki tool, yet with a focus on structured markup. It’s backed by a repository allowing for searching, browsing, access control, etc. On the publishing side, features like recursive document includes, embedded queries, and a ‘book’ publication engine are available. Some of the features of the book publication engine are TOC generation, lists of figures and tables, (print) indexes, footnotes, shifting of headers with respect to their position in the book, various header numbering possibilities, complete control over styling through XSLT, and chunked HTML output where the chunks are unrelated to the original document boundaries.

Variables

Daisy has now given birth to a new feature especially useful for documentation applications: variables in documents. A common use-case for variables is to avoid hardcoding product names in documentation, for example for when the product is marketed under different names.

The actual variable resolution is configured at the site level (a “site” in Daisy is a website providing a particular view on the repository, there can be many of them). So depending on the site through which you view the document, you will see the same text with the variables resolved to different values. As you can expect from XML fairies like us, variable values can contain mixed content (mixed text & tags). And of course, the variables work in the book publishing too.
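To illustrate the idea of site-level resolution, a minimal Python sketch follows. The `${name}` placeholder syntax, the site names and the `resolve_variables` function are all invented for this example and say nothing about how Daisy stores or resolves variables. It also only handles plain-text values, not mixed content.

```python
import re

# Hypothetical placeholder syntax: ${variableName}
VARIABLE = re.compile(r"\$\{(\w+)\}")


def resolve_variables(text, site_variables):
    """Replace each placeholder with the value configured for a site."""
    def replace(match):
        name = match.group(1)
        # Unknown variables are left untouched rather than dropped.
        return site_variables.get(name, match.group(0))
    return VARIABLE.sub(replace, text)


# The same stored text, viewed through two differently configured sites.
sites = {
    "acme": {"productName": "Acme Writer"},
    "globex": {"productName": "Globex Docs"},
}
stored = "Welcome to ${productName}."
for site, variables in sites.items():
    print(site, "->", resolve_variables(stored, variables))
```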

insert_variable.png


Variable-challenges

Besides the fluffy feature announcement, it is interesting to look at a particular difficulty of document variables.

Putting variables in documents basically turns the documents into simple templates. They are not displayed as-is, but can generate different content depending on the context. To some extent this difference between “stored document” and “displayed webpage” already exists, due to features such as includes, information aggregation, etc. The variables are a different class though, since they are really part of the document text.

When a reader sees a particular word on a page, and enters it in the search box, then the expected behaviour is to see that page turn up in the search results.

(sidenote: authors on the other hand might want to search where a particular variable is used. This special use-case could be served by maintaining a dedicated variable-usage index, much like the document-links index.)

As a solution, one approach might be to recognize that there’s just too much difference between documents stored in the repository and pages published on a website, and that therefore we need a separate index of the published pages. While this simplifies things conceptually, it will make the system setup more involved, and still has the disadvantage that the index will need rebuilding as soon as e.g. a variable changes its value (one could track the dependencies of each page to know when it needs to be reindexed).

Another thought is to first perform the search on the variable values to find which variables match, and then add the corresponding variable names to the search condition. A nice idea at first sight, but when you think about it, the actual implementation of such a thing, working in a completely transparent way, is not simple either.
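A naive version of that second idea could look like the Python sketch below. Everything in it is hypothetical: the `${name}` placeholder syntax, the `expand_query` and `search` functions, and the plain substring matching standing in for a real fulltext index.

```python
def expand_query(term, site_variables):
    """Return the original term plus a placeholder for every variable
    whose value contains the term (case-insensitive)."""
    needle = term.lower()
    placeholders = sorted(
        "${%s}" % name
        for name, value in site_variables.items()
        if needle in value.lower()
    )
    return [term] + placeholders


def search(term, documents, site_variables):
    """Find stored documents containing the term or a matching variable."""
    terms = [t.lower() for t in expand_query(term, site_variables)]
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(t in text.lower() for t in terms)
    )


documents = {
    "d1": "Install ${productName} first.",
    "d2": "Unrelated text.",
}
variables = {"productName": "Acme Writer"}
print(expand_query("Acme", variables))       # ['Acme', '${productName}']
print(search("Acme", documents, variables))  # ['d1']
```

Even this toy shows where it gets hard: a phrase that spans the boundary between a variable value and the surrounding text, such as “Install Acme”, would not be found this way.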

For now I have not treated this problem at all, since [other work is waiting, and] the current use-case for variables is mainly in the documentation area, where things like the book publication (or the maven plugin) will be used for generating publications.

The Packt Open Source Content Management System Award is collecting nominations.

pakt_nominate_2007.jpg

If you like Daisy, please take a moment to submit your nomination.

For the Daisy website URL you can enter http://www.daisycms.org/

Thanks!