Enhancement of the Amaya Web browser/editor for Internationalization and XML

Project acronym: QUESTION-HOW
Project Full Title: Quality Engineering Solutions via Tools, Information and Outreach for the New Highly-enriched Offerings from W3C: Evolving the Web in Europe
Project/Contract No. IST-2000-28767
Workpackage 1, Deliverable D1.5

Project Manager: Daniel Dardailler <danield@w3.org>
Author of this document: Vincent Quint <quint@w3.org>

Created: 28 August 2002. Last updated: 28 August 2002.


Introduction

This document is a summary of the achievements of the project "Enhancement of the Amaya Web browser/editor for Internationalization and XML".

The goal of the project was to develop Amaya, W3C's Web browser and editor, in two directions simultaneously. One development track is Internationalization, the other is XMLization. The motivation is to allow Web users to create and display pages in many different languages, including those using non-Latin scripts. This will broaden the use of Amaya to new geographic areas. As the coverage increases, the technologies implemented in Amaya are also extended to support and promote the latest developments of XML technologies.


Background on Amaya

Amaya is a Web client that acts both as a browser and as an authoring tool, the two features being seamlessly integrated. Web pages are edited in WYSIWYG mode, i.e. the user interacts with a formatted document, as in most word processors. The editor maintains a structured representation of the document (a DOM tree). Every user command is first performed on this tree, and the part of the tree that has been modified is reformatted and redisplayed. Several views of each document may be open simultaneously, to provide a more complete representation of the document being edited. In addition to the formatted view, Amaya can display the source code and/or the DOM tree. All views can be edited. The formatted view and the tree view always reflect the current status of the document while the source view is refreshed on request.

To allow users to really edit the Web, Amaya provides direct access to remote Web sites through the HTTP 1.1 protocol, both for reading and writing Web pages (the HTTP Get and Put methods are used). Thus users can edit pages that are stored on remote servers exactly in the same way they work on local files.
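The read/write cycle described above can be sketched with Python's standard library. This is only an illustration of the two HTTP methods, not Amaya's own code (Amaya is not written in Python); the function names and the URL are assumptions made for the example, and the requests are built but not sent.

```python
import urllib.request


def build_get(url):
    """Build the request used to read a page from a remote server."""
    return urllib.request.Request(url, method="GET")


def build_put(url, html_text):
    """Build the request used to write an edited page back to the server."""
    body = html_text.encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/html; charset=utf-8"},
    )


# Hypothetical address, used only for illustration.
read_request = build_get("http://example.org/page.html")
write_request = build_put("http://example.org/page.html", "<p>Edited</p>")
```

Passing either request to `urllib.request.urlopen` would actually send it; the server must of course be configured to accept PUT for the write to succeed.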

Several document formats are supported natively by Amaya: HTML, XHTML, MathML, and SVG. This allows authors to edit various types of Web pages, including scientific documents with many mathematical expressions, and structured graphics.

CSS style sheets are also supported. Not only are documents formatted according to their style sheets, but users can also create and edit style sheets in WYSIWYG mode.

Amaya is available on several platforms: Windows (95, 98, NT, 2000, XP), Unix (Linux, Solaris, AIX, etc.) and Mac OS X. It is an Open Source endeavour.


Internationalization

Developments in the Internationalization area cover several aspects:

User interface
Users may freely change the dialogue language, i.e. the language used in menus, dialogue boxes, messages, etc. In the current version (Amaya 6.2) several European languages are supported: English, French, German, Spanish, Italian, and Portuguese.

On-line help is provided to assist users while they are working. Help pages are available in French, English, German, and Spanish (not complete yet).

Document display
A wide range of scripts is available for displaying various languages in documents. Version 6.2 can display Latin (including several variants), Cyrillic, Greek, Arabic, Hebrew and Japanese.

In addition to scripts, other aspects of languages are handled, such as hyphenation: words are hyphenated according to the language. Version 6.2 can hyphenate Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish, and Swedish.

The editor is able to spell check document contents in English and French.

Multilingualism
Amaya can handle several documents in different languages simultaneously, including multilingual documents with multiple scripts. Languages and scripts may be mixed freely, even in the same paragraph, with different writing directions. However, in version 6.2, only horizontal text is supported (left-to-right and right-to-left).

Language negotiation
To take advantage of servers that offer several versions of a document in different languages, users may configure Amaya with a list of preferred languages. Amaya will then download these pages in the chosen languages.

Addressing
Amaya can also handle internationalized URIs, to access Web sites with addresses expressed in any language.
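
Two of the items above, language negotiation and addressing, can be illustrated with a short Python sketch. It shows the general techniques (an HTTP Accept-Language value built from an ordered preference list, and UTF-8 percent-encoding of non-ASCII characters in an address); the function names, the q-value step of 0.1 and the example address are assumptions, and the report does not say how Amaya implements either feature.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def accept_language(preferred):
    """Build an Accept-Language value from an ordered list of language tags.

    The first tag is sent without a quality value; each later tag gets a
    quality value that decreases in steps of 0.1 (an assumed policy).
    """
    parts = []
    for rank, tag in enumerate(preferred):
        q = max(1.0 - 0.1 * rank, 0.1)
        parts.append(tag if rank == 0 else f"{tag};q={q:.1f}")
    return ", ".join(parts)


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters of path, query and fragment as UTF-8.

    The host part is left untouched in this sketch.
    """
    scheme, host, path, query, fragment = urlsplit(iri)
    return urlunsplit((
        scheme,
        host,
        quote(path, safe="/%"),
        quote(query, safe="=&%"),
        quote(fragment, safe="%"),
    ))


header_value = accept_language(["fr", "en", "de"])
# Hypothetical address containing accented characters.
encoded = iri_to_uri("http://example.org/caf\u00e9/r\u00e9sum\u00e9.html")
```

With the preference list above, a server that holds French and English versions of a page would be asked for French first.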

All these features are available on all supported platforms.

The international structure of W3C was a great help in these developments. The Team from Keio University has greatly contributed to the support of Japanese. This joint effort is now continuing, with the aim of introducing other Asian languages in Amaya.

Unicode support

At the specification level, Unicode is used by XML to allow XML files to contain text written in any script. This feature applies to all XML languages, including XHTML, MathML and SVG.

In Amaya, internationalization has been implemented in the spirit of XML. The common ground provided by XML to these document formats has been adapted to Unicode and, as a consequence, all formats (XHTML, MathML, SVG) have immediately benefited from Unicode.

Encoding

An important part of the internationalization effort was dedicated to the support of Unicode. All documents handled by Amaya are now represented internally in UTF-16. This makes it possible to represent any document, even the most complex multilingual documents, in a uniform way. It also makes processing these documents simpler.

The initial version of Amaya supported only 8-bit characters. All internal structures and functions have been changed to handle multi-byte characters.
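
To make the difference between 8-bit characters and UTF-16 concrete, the following Python sketch lists the 16-bit code units that UTF-16 uses for a few characters. It illustrates the encoding itself, not Amaya's internal data structures; the function name is an assumption.

```python
import struct


def utf16_code_units(text):
    """Return the list of 16-bit code units that represent text in UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


latin = utf16_code_units("A")            # one code unit
arabic = utf16_code_units("\u0634")      # ARABIC LETTER SHEEN, one code unit
clef = utf16_code_units("\U0001D11E")    # MUSICAL SYMBOL G CLEF, a surrogate pair
```

Characters outside the Basic Multilingual Plane, such as the last one, take two code units, so code that handles UTF-16 text cannot assume one unit per character.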

Although it uses UTF-16 internally, Amaya does not require documents to be encoded in UTF-16. It provides conversion to and from several other encodings:

Fonts

To display all these characters, the support of fonts has been considerably extended. In particular, to get access to a large variety of characters in many different styles and sizes, TrueType fonts have been introduced. This also makes it possible to use the same fonts on different platforms and makes font management easier.

Layout

Handling several writing directions in the same block of text has required the layout process to be redesigned. The main change was to implement the Unicode bidi algorithm. This makes it possible to order characters properly within a block of text, such as a paragraph with several chunks of text in different writing directions (for instance English, Arabic and Hebrew).
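
The first step of such processing, splitting a paragraph into chunks of uniform writing direction, can be sketched in Python with the bidirectional character classes from the Unicode database. This is a deliberate simplification and not the full Unicode bidi algorithm: it only looks at strong characters, attaches every neutral or weak character to the preceding run, and ignores embedding levels. The function name is an assumption.

```python
import unicodedata


def directional_runs(text, base="ltr"):
    """Split text into (direction, substring) runs using strong bidi classes.

    Simplification: neutral and weak characters (spaces, digits, punctuation)
    stay in the run that precedes them.
    """
    runs = []
    current = base
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            direction = "ltr"
        elif bidi_class in ("R", "AL"):
            direction = "rtl"
        else:
            direction = current
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
        current = direction
    return [(direction, chunk) for direction, chunk in runs]


# English, then Hebrew (alef, bet), then English again.
mixed = directional_runs("abc \u05d0\u05d1 def")
```

A layout engine would then place each right-to-left run with its characters in reverse visual order; the real algorithm also resolves the position of the neutral characters between runs.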

More work was required to align blocks correctly in a page according to the writing direction. With this new approach to page formatting, complex structures such as tables or nested lists can now be displayed equally well in Arabic, Hebrew or Latin documents.

Editing

The above-mentioned extensions were required to display multilingual documents, but some more changes had to be made to also allow editing. Cursor movements now follow the logical order of characters. For instance, when the cursor is at the end of a chunk of text in English followed by some text in Arabic, moving to the next character actually puts the cursor at the other end of the Arabic text.

Various input methods have to be supported to allow users to enter characters from different scripts. These methods are usually provided by the operating system, but Amaya was extended to allow the user to select among the methods available. This feature has been used to enter Japanese characters.

HTTP 1.1

The internationalization effort concerns two aspects of HTTP 1.1:

XHTML

Multi-lingual XHTML pages can be displayed and edited freely. In fact, with the extensions mentioned previously for the support of Unicode, this was quite easy to achieve. But XHTML has a few internationalization features that have been implemented:

CSS

The CSS style sheet language can be used to specify the style of any document, whatever the format used ((X)HTML, XML, SVG, MathML). Like XHTML, CSS has a few internationalization features. In particular, the properties direction and unicode-bidi have been implemented to control the bidi algorithm.

SVG

The SVG graphics format provides full support for internationalized text. A large part of the implementation was covered by the general support of Unicode, but some SVG-specific features also had to be implemented:

MathML

Unicode provides MathML with a very wide variety of characters and symbols. Again, the basic Unicode support in Amaya makes it possible to handle all these characters, but more effort was required to cope with the special needs of mathematical expressions. While most platforms offer a number of fonts for displaying common scripts (Latin, Greek, Arabic, etc.), they are very limited regarding mathematical symbols. To address this issue, the support of the ESSTIX fonts was added to Amaya.

ESSTIX offers a consistent set of 17 fonts that are freely available in several formats, including TrueType. With these fonts, Amaya can display almost every mathematical expression represented in MathML. The support of these fonts required mapping tables to be created, as the original encoding was not Unicode.

Annotea

Annotea is an application built on top of Amaya to allow users to create and share annotations attached to any part of any Web page.

Annotea was extended to fully support UTF-8 in all the RDF data that describes an annotation. Users can now annotate any kind of document, regardless of its encoding. The name of the annotation author is also stored in UTF-8.


XML

Amaya natively supported some XML languages (XHTML, MathML, SVG), but it did not know what to do with documents using other XML languages: it just displayed their source code and allowed the user to edit the code as plain text. The goal of the XML activity in Amaya was to add support for generic XML, i.e. to allow users to see any XML document properly formatted and to edit it in the same way they edit XHTML pages.

Generic XML

The first step in supporting generic XML was to revisit the whole sequence of document processing from downloading to publishing, and to adapt each step to generic XML documents.

Generic XML documents can be loaded either locally or remotely (through HTTP), like any other document. They are then parsed by the same parser (Expat), but parsing does not check elements and attributes; it only makes sure that the document is well-formed (in the XML sense). It simply builds a DOM tree that matches the source file. A formatted representation of the document is then created, relying only on the DOM tree and the CSS style sheets attached to the document. This is different from documents in the natively supported formats: Amaya knows how to format them and style sheets are used only to make changes to the default layout and style. For generic XML documents, if there is no style sheet, Amaya uses its own heuristics to format the document as a sequence of blocks. This provides a very simple layout, but it is enough to read and edit the document. Two other views can be displayed: one showing the DOM tree, the other showing the source code. The three views are handled in the same way as for XHTML, MathML or SVG documents. They can be used to edit the document. Finally, generic XML documents can be saved locally or remotely, like any other document.
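
The parsing step described above (well-formedness checking only, with a tree built directly from the source) can be sketched with the Expat binding in Python's standard library. The tree here is a plain nested dictionary, which is an assumption made for brevity and not Amaya's DOM; the function name and the sample document are also invented for the example.

```python
import xml.parsers.expat


def build_tree(source):
    """Parse any well-formed XML text and return a simple tree of dicts.

    No DTD or schema is consulted: element and attribute names are accepted
    as they come. Expat raises ExpatError if the text is not well-formed.
    """
    root = None
    stack = []

    def start(name, attrs):
        nonlocal root
        node = {"name": name, "attrs": attrs, "children": []}
        if stack:
            stack[-1]["children"].append(node)
        else:
            root = node
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        # Keep text nodes, but drop whitespace-only ones in this sketch.
        if data.strip():
            stack[-1]["children"].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one call
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(source, True)
    return root


# A document in a made-up vocabulary that no built-in format knows about.
tree = build_tree("<memo priority='high'><to>Ana</to><body>Hello</body></memo>")
```

A formatter working from such a tree and a CSS style sheet needs no knowledge of the vocabulary, which is the point of the generic XML support.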

Namespaces

XML namespaces allow several XML languages to be used simultaneously in a single document. This feature was already available in the initial version of Amaya, which allowed MathML expressions or SVG graphics to be included within XHTML documents. With the support of generic XML, this was extended to any XML language. Also, the natively supported languages can now be combined in more ways. It is now possible to include MathML expressions or fragments of XHTML text within SVG graphics, even when graphics are themselves included within XHTML pages.

Another possibility is to have islands of MathML, SVG, or XHTML in generic XML documents. Those islands are then processed with the full semantics (formatting, editing) of their language.

Finally, a generic XML document can use several namespaces, some of them being known (XHTML, SVG, MathML), others being unknown (generic XML). The tree view shows clearly where namespaces change in the DOM tree.
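
Finding the places where the namespace changes can be sketched in Python with the standard ElementTree module. The three namespace URIs are the standard ones for XHTML, MathML and SVG; the function names, the `http://example.org/report` namespace and the sample document are assumptions, and the sketch says nothing about how Amaya's tree view is implemented.

```python
import xml.etree.ElementTree as ET

KNOWN = {
    "http://www.w3.org/1999/xhtml": "XHTML",
    "http://www.w3.org/1998/Math/MathML": "MathML",
    "http://www.w3.org/2000/svg": "SVG",
}


def split_tag(tag):
    """Split an ElementTree tag of the form '{uri}local' into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


def namespace_changes(root):
    """List (element name, language) for each element that starts a new namespace."""
    changes = []

    def walk(elem, parent_uri):
        uri, local = split_tag(elem.tag)
        if uri != parent_uri:
            changes.append((local, KNOWN.get(uri, "generic XML")))
        for child in elem:
            walk(child, uri)

    walk(root, None)
    return changes


document = ET.fromstring(
    '<report xmlns="http://example.org/report"'
    ' xmlns:h="http://www.w3.org/1999/xhtml"'
    ' xmlns:m="http://www.w3.org/1998/Math/MathML">'
    "<section><h:p>Text</h:p><m:math><m:mi>x</m:mi></m:math></section>"
    "</report>"
)
islands = namespace_changes(document)
```

Each reported element is the root of an island that could be handed to the language-specific formatting and editing code, while the rest is treated as generic XML.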

Editing generic XML

Most editors for XML documents handle only source code and use the document DTD or schema to constrain the user. In Amaya a different approach is taken. Although the source code is available and can be edited, the emphasis is put on WYSIWYG editing. The user interacts mainly with the formatted representation of the document. Also, the notion of well-formedness introduced by XML is used to allow the user to manipulate the document freely, without any DTD or schema.

Editing several documents simultaneously is a basic feature of Amaya. This is especially useful for copying some parts of a local or remote document into another document. This also works well for generic XML. Given the multi-namespace feature, it is easy to copy or move pieces of a document into another while preserving their structure.

The current version of Amaya allows the user to perform a number of operations on the formatted view. He/she can edit the content of any generic XML document, either directly or through a search/replace command. He/she can edit attributes (create or remove attributes, change their value), but the creation of new elements is currently limited and needs further work. Obviously, any modification can be done using the source view and such changes can be reflected in the other views at the user's request, but that was not the main goal of these developments.

All editing commands for generic XML documents are built on top of the same basic commands used for other types of documents. Therefore they benefit from the same advantages, which include an unlimited undo/redo mechanism.
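
One common way to obtain an unlimited undo/redo mechanism is to record every basic command with the information needed to reverse it. The Python sketch below applies this idea to the attribute operations mentioned earlier (create, change, remove). It is an assumed design shown for illustration, not a description of Amaya's internals; the class and method names are invented, and an element is modelled as a plain dictionary of attributes.

```python
class AttributeEditor:
    """Attribute editing with two unbounded history stacks (assumed design)."""

    def __init__(self):
        self._undo = []
        self._redo = []

    @staticmethod
    def _apply(element, name, value):
        # A value of None means "the attribute does not exist".
        if value is None:
            element.pop(name, None)
        else:
            element[name] = value

    def set_attribute(self, element, name, value):
        """Create, change or (with value=None) remove an attribute."""
        old = element.get(name)
        self._apply(element, name, value)
        self._undo.append((element, name, old, value))
        self._redo.clear()  # a new command invalidates the redo history

    def undo(self):
        if not self._undo:
            return False
        element, name, old, new = self._undo.pop()
        self._apply(element, name, old)
        self._redo.append((element, name, old, new))
        return True

    def redo(self):
        if not self._redo:
            return False
        element, name, old, new = self._redo.pop()
        self._apply(element, name, new)
        self._undo.append((element, name, old, new))
        return True


editor = AttributeEditor()
element = {"id": "a"}
editor.set_attribute(element, "lang", "fr")   # create
editor.set_attribute(element, "id", None)     # remove
```

Because every higher-level command goes through the same recording step, commands added later for a new document type are undoable without extra work.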

Other technologies

When implementing XML, it makes sense to also implement some closely related technologies. For instance, XML relies on XLink to add hyperlinking semantics to structured documents. XLink was already implemented in Amaya for handling links in SVG and MathML, and to relate documents and annotations in Annotea. This implementation was extended to allow generic XML documents to use it. It is then possible to create new links in any XML document and to use them as easily as in XHTML pages.
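
What creating a link in a generic XML document amounts to can be shown with a short Python sketch: an XLink simple link is just two attributes from the XLink namespace added to an arbitrary element. The namespace URI is the standard one; the function name, the `citation` element and the target address are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
# Ask ElementTree to serialize this namespace with the usual "xlink" prefix.
ET.register_namespace("xlink", XLINK_NS)


def make_simple_link(element, href):
    """Turn any element into an XLink simple link pointing at href."""
    element.set(f"{{{XLINK_NS}}}type", "simple")
    element.set(f"{{{XLINK_NS}}}href", href)
    return element


# An element from a made-up vocabulary becomes a link.
citation = ET.fromstring("<citation>See the specification</citation>")
make_simple_link(citation, "http://www.w3.org/TR/xlink/")
serialized = ET.tostring(citation, encoding="unicode")
```

Since the link semantics are carried by the attributes and not by the element name, a browser that understands XLink can follow the link without knowing the vocabulary of the document.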

CSS is another technology used in conjunction with XML. As explained above, it is used to format generic XML documents, but it would be very useful to be able to create or update CSS style sheets while editing an XML document. Work is in progress to allow that, by reusing the CSS editing feature which is already available for XHTML, MathML and SVG. This will be especially useful to improve the layout of documents that come without any CSS style sheet.

Deviations from Plan

None. The work is done.