XML Dev Day Tokyo 2007

Today (21 December 2007) I am attending the 10th XML Developers Day in Tokyo. This is an annual event, held in Japanese, with the latest news from the Japanese XML developer community. The event is organized by Murata-san. The presentations vary widely, covering both technical and rather “political” aspects of XML trends. Below is a summary of the talks, focusing on the technical aspects.

The World of Genji Monogatari – a new Edition

(slides) Miyawaki-san presents a project that aims to relate additional information to the main text of Genji Monogatari: his own notes or those of others, images, audio of a reading of the novel etc. In a previous version he enriched an HTML file (the main text) with that additional information. Later he switched to an XML-based format. That allows a user to easily add new information, and multiple users to add content without conflicts. Miyawaki-san reports some problems, such as debugging XSLT, the need to process JavaScript (necessary for the HTML output) in XSLT, and error handling in XSLT. The latter even motivated him to rewrite some XSLT code in JavaScript …

An interesting aspect of the generated HTML output is that Internet Explorer presents it in vertical layout (see this screenshot). For a Japanese audience that is used to reading novels vertically, this is an important feature, though the layout still needs some improvement.

Parallel Narratology

Kobayashi-san from Justsystems presents so-called parallel narratology, realized with XML. His main example is the famous Japanese murder story Rashomon. In that story, the same event is narrated several times from different perspectives. In an XML representation of the text, users can create links between the different narrations of the same scene (e.g. a murder scene). Yamaguchi-san from Justsystems demonstrates the XFY editing and development tool, which was used to create the links and to present the parallel texts.

The linkage allows readers to take a new perspective on the relations between narrations. Not only the author, but everybody can add such links: the ordinary “story readers” become “story writers”. A problem raised during the Q&A is how to express overlap, i.e. non-hierarchical markup structures. Such structures may be necessary, e.g. to enable different users to mark up overlapping passages of the same text.

Character Encoding (again)

Kobayashi-san presents again, mainly about the “gaiji” problem. Many ideographic characters are so-called glyph variants of characters in the Unicode character set, but these variants are not encoded as distinct characters. Kobayashi-san introduces two solutions to this issue: briefly the Japanese standard JIS X 4166, and in more detail Unicode variation selectors. A variation selector follows a base character and selects one of its glyph variants. The selectors are used to register such variants in the Unicode Ideographic Variation Database.

Yamamoto-san demonstrates how Adobe is using the variation selector approach. Adobe fonts contain glyphs for many more characters than the Unicode character set encompasses. Adobe uses variation selectors to create stable identifiers for these characters. The Unicode Ideographic Variation Database then serves as a registry for these identifiers.
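As a minimal sketch of the variation selector mechanism described above: a variation sequence is a base character followed by one selector code point. The example below is not taken from the talks. The base character U+845B is only an illustrative choice, and whether a given font or registry entry covers this particular pair is an assumption. The helper function names are hypothetical.

```python
import unicodedata

# Illustrative pair (an assumption, not from the talks):
# base ideograph U+845B followed by VARIATION SELECTOR-17 (U+E0100).
BASE = "\u845b"
VS17 = "\U000e0100"


def is_ideographic_variation_selector(ch: str) -> bool:
    """True for VS17..VS256 (U+E0100..U+E01EF), used in ideographic sequences."""
    return 0xE0100 <= ord(ch) <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Fall back to the plain base characters by dropping the selectors."""
    return "".join(ch for ch in text if not is_ideographic_variation_selector(ch))


sequence = BASE + VS17

print(len(sequence))            # number of code points in the sequence
print(unicodedata.name(VS17))   # official Unicode name of the selector
print(strip_variation_selectors(sequence) == BASE)
```

Software that does not know a particular sequence can ignore the selector and still show the base character, which is why the approach degrades gracefully.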

Input Methods in the Age of XML

Masu-san from Justsystems presents on input methods. These are tools that ease the input of scripts with a large set of characters (as in Japanese). Masu-san then demonstrates enhanced input methods, whose functionality is useful beyond large character set input. For example, while the user enters a date, the input method recognizes the date and creates appropriate markup, hyperlinks to a calendar etc. Using such enhanced input methods, users with no knowledge of XML produce XML code (e.g. a marked-up date) without noticing.
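The date example can be sketched roughly as follows. This is not the Justsystems implementation: the element name `date`, the attribute `value`, and the function `mark_up_dates` are hypothetical, and the pattern only covers English dates of the form "21 December 2007" without checking that the day is valid for the month.

```python
import re
from xml.sax.saxutils import escape

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Matches e.g. "21 December 2007" (day, English month name, four-digit year).
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTH_NAMES) + r") (\d{4})\b")


def mark_up_dates(text: str) -> str:
    """Wrap recognized dates in a hypothetical <date> element with an ISO value."""

    def replace(match: re.Match) -> str:
        day = int(match.group(1))
        month = MONTHS[match.group(2)]
        year = int(match.group(3))
        iso = f"{year:04d}-{month:02d}-{day:02d}"
        return f'<date value="{iso}">{match.group(0)}</date>'

    # Escape XML special characters first so the result stays well-formed.
    return DATE_RE.sub(replace, escape(text))


print(mark_up_dates("Today (21 December 2007) I am attending the event."))
```

The user only types the plain sentence; the markup is produced as a side effect of the recognition step.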

Kumaya-san from Tokyo University presents on how an enhanced input method framework can be used. In the presented implementation, schedule data is currently represented via Microformats, and location information is taken from Google Maps. One question is how to describe the meaning of related, similar information, e.g. hCalendar versus Google Calendar. The answer is that humans have to make the relation explicit; there is no automatic way to achieve this.

A Prediction of the OOXML Vote Result and the Ballot Resolution Meeting

Murata-san provides insights into the rather political aspects of the development of the OOXML and ODF office formats. The Ecma TC45 committee has already approved OOXML. On 25 February 2008, there will be a Ballot Resolution Meeting (BRM) held by JTC1 about OOXML. Murata-san predicts that the BRM will not discuss fundamental criticism of the format, but only minor technical changes. The reason is that contentious issues are outside the scope of the BRM.

He also presents work on ODF undertaken in ISO SC34 and a proposal for the standardization of a round-trip conversion between ODF and OOXML. The discussion about the round-trip conversion is taking place in DIN (Germany), not in SC34 (yet). The DIN committee might want to submit the result to JTC1 using the fast-track procedure.

Murata-san admits that in the past, he thought the standardization of an office format would be impossible. Nowadays there are two on their way, which seems to be better than zero …

Outline of AtomPub and Interoperability Testing Results

Asakura-san from NTT Communications presents AtomPub, the Atom Publishing Protocol (RFC 5023). This is a protocol for publishing and editing web resources like Atom feeds via HTTP and XML 1.0. AtomPub defines the notions of collections (like Atom “feeds”) and members (like Atom “entries”). Using the HTTP methods GET, POST, PUT and DELETE, collections and members can be retrieved, created, changed or deleted.
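The mapping of operations to HTTP methods can be sketched like this. The example only builds request objects and sends nothing over the network. The URIs and the helper function names are hypothetical, and the entry is deliberately minimal: a complete Atom entry also carries elements such as id, updated and author.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ATOM_ENTRY_TYPE = "application/atom+xml;type=entry"
ET.register_namespace("", ATOM_NS)


def build_entry(title: str, content: str) -> bytes:
    """Serialize a minimal Atom entry (id, updated, author omitted for brevity)."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}content", {"type": "text"}).text = content
    return ET.tostring(entry, encoding="utf-8")


def retrieve(uri: str) -> urllib.request.Request:
    """GET retrieves a collection or a member."""
    return urllib.request.Request(uri, method="GET")


def create_member(collection_uri: str, entry: bytes) -> urllib.request.Request:
    """POST to the collection URI creates a new member."""
    return urllib.request.Request(
        collection_uri, data=entry,
        headers={"Content-Type": ATOM_ENTRY_TYPE}, method="POST")


def update_member(member_uri: str, entry: bytes) -> urllib.request.Request:
    """PUT to the member URI changes an existing member."""
    return urllib.request.Request(
        member_uri, data=entry,
        headers={"Content-Type": ATOM_ENTRY_TYPE}, method="PUT")


def delete_member(member_uri: str) -> urllib.request.Request:
    """DELETE on the member URI removes the member."""
    return urllib.request.Request(member_uri, method="DELETE")


# Hypothetical collection URI, used only for illustration.
request = create_member("http://example.org/blog/entries",
                        build_entry("XML Dev Day", "Notes from Tokyo"))
print(request.get_method(), request.full_url)
```

Passing a request to `urllib.request.urlopen` would actually send it; a real client would also need authentication and would read the `Location` header of the response to a successful POST to learn the URI of the new member.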

Asakura-san also provides insights into his experiences during the IETF standardization process, and reports from the AtomPub interoperability event in Tokyo. Six companies participated successfully in that event.

Schemas for validating Atom and its Extensions

Murata-san again talks about Atom. Atom uses extensions from many namespaces, e.g. GData, Google Calendar, Atom Bidi, GeoRSS, Dublin Core etc. Murata-san mainly gives examples from Google Calendar, which has a basic set of extensions. Others can be added.

Unfortunately, schemas for validating the extended Atom data are often not available. To solve this issue, Murata-san first created a RELAX NG schema. However, with more than three or four extension schemas, the task became really hard even for RELAX NG experts. He then switched to a different approach: NVDL (Namespace-based Validation Dispatching Language), which is used for namespace-based validation of compound documents. This makes it easier to create schemas for validating Atom plus extensions. A remaining issue is that NVDL does not support validation of ID/IDREF-based types.

Using NVDL for Document Analysis and Synthesis

Miyashita-san from IBM also talks about NVDL. NVDL relies on analyzing a document, i.e. separating it into parts for the purpose of validation. Miyashita-san’s aim is to provide a mechanism for applying various other kinds of processing (e.g. editing) to the separated document parts, and then putting them together again. He gives an example of Atom which contains XHTML. First the input document is separated into the Atom and the XHTML parts. Then format-specific processing is applied, and the parts are put together again.
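The separate, process and reassemble idea can be illustrated with a small sketch. This is neither NVDL nor Miyashita-san's framework: it simply cuts the tree wherever a child element is in a different namespace than its parent, and the function names are hypothetical. Real NVDL sectioning follows more detailed rules.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"

DOC = f"""<entry xmlns="{ATOM}">
  <title>Example</title>
  <content type="xhtml">
    <div xmlns="{XHTML}"><p>Hello</p></div>
  </content>
</entry>"""


def namespace_of(elem: ET.Element) -> str:
    """Return the namespace URI of an element, or '' if it has none."""
    tag = elem.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def split_sections(root: ET.Element) -> list:
    """Detach every subtree whose namespace differs from its parent's.

    Returns (namespace, element, parent, index) tuples; the root comes first.
    """
    sections = [(namespace_of(root), root, None, None)]
    stack = [root]
    while stack:
        parent = stack.pop()
        for index, child in enumerate(list(parent)):
            if namespace_of(child) != namespace_of(parent):
                parent.remove(child)
                sections.append((namespace_of(child), child, parent, index))
            stack.append(child)  # look for nested foreign sections as well
    return sections


def reassemble(sections: list) -> None:
    """Put the detached sections back at their original positions."""
    for _namespace, elem, parent, index in sections:
        if parent is not None:
            parent.insert(index, elem)


root = ET.fromstring(DOC)
sections = split_sections(root)

# Format-specific processing: here, only the XHTML section is edited.
for namespace, elem, _parent, _index in sections:
    if namespace == XHTML:
        for p in elem.iter(f"{{{XHTML}}}p"):
            p.text = p.text.upper()

reassemble(sections)
print(ET.tostring(root, encoding="unicode"))
```

Each section could equally be handed to a validator or an editor for its own vocabulary before the document is put back together.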

Miyashita-san demonstrates some testing with the Google Calendar data provided by Murata-san. One of his next goals for this framework is to use the XProc XML Pipeline Language as a means to define processing pipelines for the separated document parts.