This document is also available in these non-normative formats: XHTML+MathML version.
Copyright ©2006 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This Note is a self-contained discussion of Arabic mathematical notation in MathML. It provides guidelines for the handling of Arabic mathematical presentation using the MathML 2 Recommendation (2nd Edition) [MathML22e] and suggests extensions for a future revision.
This Note has been written by participants in the Math Interest Group (W3C members only), which is part of the W3C Math activity. Please direct comments and report errors in this document to www-math@w3.org, a mailing list with a public archive.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
1 Introduction
2 Some Features of Arabic Script
2.1 Text Direction
2.2 Glyph Shaping
2.3 Mirroring
2.4 Number Systems
3 Comparison of Mathematical Notations
3.1 Arabic Notation; Moroccan Style
3.2 Arabic Notation; Maghreb Style
3.3 Arabic Notation; Machrek Style
3.4 Additional Arabic Notations
3.5 Persian
4 Proposals and Clarifications
4.1 Clarification of bidirectional Algorithm for MathML
4.2 Glyph Shaping
4.3 Additional Mathvariants
4.4 Mirroring
4.5 Horizontal Stretchiness
4.6 Additional Constructs
5 Conclusions and Future Work
6 Acknowledgments
7 Production Notes
A Localization Issues
A.1 Number Systems
A.2 Symbols Choice
B Implementation Issues
B.1 Character Encoding
B.2 Mathematical Fonts
B.3 Symbol Stretching
B.4 Software Tools
C Bibliography
As the World Wide Web becomes more world wide, inclusion of the world's many languages, scripts and cultures becomes critical. Although the development of the Mathematical Markup Language (MathML) [MathML22e] was neither intentionally nor explicitly exclusive of non-European languages and scripts, the focus was on the notational schema used with European languages. Indeed, most of these notations are used unchanged in many other contexts. However, some languages introduce variations, either for historical reasons or to fit within various writing systems, which MathML should accommodate for improved international support (in particular for educational material requiring these variations, or for historical documents).
While European languages are written left to right (LTR), Arabic, among others, is written right to left (RTL). We will see that in Arabic mathematical texts many of the same notational constructs are used, but may be reversed or mirrored, depending on the cultural context; we will call this the mathematical directionality. The mathematical directionality is not necessarily the same as the text directionality. Moreover, since the mathematical material may commonly contain text and symbols coming from both Arabic and European languages, the question arises of how the Unicode bidirectional algorithm [UnicodeBiDi] should be applied. Finally, several additional symbols and writing styles may be used in special ways.
Arabic Calligraphy is enriched by a variety of writing styles, as European writing benefits from a variety of fonts. The graphic above illustrates a variety of Arabic calligraphic styles; each word is the name of the corresponding style. In the same way that European mathematics broadens the set of distinct symbols available by using bold face, Fraktur or other styles, so does Arabic mathematics but typically by varying strokes, adding tails or other extensions.
A given piece of mathematics marked up in Content MathML ([MathML22e], chapter 4) is generally language-neutral (although the choices for variable names may imply a cultural context): it is intended to represent the universal meaning of the mathematics. A given piece of mathematics marked up in Presentation MathML ([MathML22e], chapter 3), on the other hand, conveys the visual appearance of the expression. That appearance necessarily targets a specific language and its notational conventions, and even those of the scientific discipline involved. In this Note, we amplify and formalize this separation of concerns: Presentation MathML should be a fairly literal representation of the visual notation to be used.
We relegate all localization issues — which symbol to use for summation, which name to use for tangent, what format to use for numbers — to the generator of the Presentation MathML, rather than the renderer. This avoids guessing, perhaps wrongly, what number is intended while deciding whether to replace periods by commas, for example. Thus, localization entails the choice of what text content to place within MathML's token elements, but that choice is already fixed within a given piece of Presentation MathML.
In this Note, we have attempted to examine all notational conventions in current use with Arabic and languages written using Arabic script, without giving preference to one form over another. We aim to clarify the specification of MathML, proposing extensions where needed, so that MathML has the broadest coverage possible. Nevertheless, an in-depth analysis of issues affecting other languages, particularly those written top to bottom, is a topic for future study. The emphasis on Arabic languages is partly a reflection of an increased interest in, and usage of, MathML in Arabic language contexts that have highlighted the issues described here. Another topic for future study is how Content MathML might best support the transformation to appropriately localized Presentation MathML.
When a mixture of LTR and RTL characters appears in text (i.e., bidirectional or BiDi text, such as an English text that includes Arabic words), Unicode's bidirectional algorithm [UnicodeBiDi] describes the order in which the characters will be displayed. All adjacent strongly-typed RTL characters (such as those in a single Arabic word) will be presented in right-to-left order, and vice versa for strongly-typed LTR characters. A cluster of characters with the same directionality is called a directional run.
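The notion of a directional run can be illustrated with a minimal Python sketch. It is not an implementation of the bidirectional algorithm: it only groups characters by their strong bidirectional class, and, as a simplifying assumption, attaches digits, spaces and punctuation to the preceding run. The function names `strong_direction` and `directional_runs` are hypothetical.

```python
import unicodedata

# Bidi classes that Unicode treats as strongly typed.
STRONG_RTL = {"R", "AL"}   # Hebrew-like and Arabic letters
STRONG_LTR = {"L"}         # e.g. Latin letters


def strong_direction(ch):
    """Return 'rtl', 'ltr', or None for a character without a strong type."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG_RTL:
        return "rtl"
    if bidi_class in STRONG_LTR:
        return "ltr"
    return None


def directional_runs(text):
    """Group text into (direction, substring) runs, in logical order.

    Simplification (an assumption of this sketch, not the real algorithm):
    weak and neutral characters join whatever run precedes them.
    """
    runs = []
    for ch in text:
        direction = strong_direction(ch)
        if runs and (direction is None or direction == runs[-1][0]):
            runs[-1] = (runs[-1][0], runs[-1][1] + ch)
        else:
            runs.append((direction, ch))
    return runs


# English text containing the Arabic word for "with" (U+0645 U+0639).
print(directional_runs("abc \u0645\u0639 xyz"))
```

For this input the sketch yields three runs: an LTR run, an RTL run holding the Arabic word, and a final LTR run. The full algorithm resolves neutral characters such as spaces from their surrounding context rather than attaching them to the preceding run.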
Within any given "paragraph", directional runs are then ordered according to the overall directional context. The bidirectional algorithm allows for higher-level protocols to determine which segments of a structured text constitute "paragraphs" in this sense. For example, in HTML, block-level elements are taken as the paragraph segments. The top-level html tag determines the directional context, which can be changed on lower-level elements using the dir attribute.
For a gentle introduction to bidirectional text, see [UnicodeBiDiIntro].
Note that Unicode does not require mirrored symbols to be literally the exact mirror image (see Mirroring in [UnicodeBiDi], section 6). Indeed, Arabic calligraphy considers it important that they are not: the nib of the pen (kalam) is a flat rectangle. The writer holds the pen so that the largest side makes an angle of approximately 70° with the baseline, and keeps this orientation throughout the drawing of the character. Furthermore, as Arabic writing goes from right to left, strokes running from top left toward bottom right come out bold, while strokes running from top right to bottom left come out thin. Thus, the Arabic sum symbol, for example, is not simply the mirror image of sigma.
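Whether Unicode marks a character as mirrored can be checked from its character properties. The following is a small sketch using Python's standard `unicodedata` module; the helper names `is_bidi_mirrored` and `mirrored_characters` are hypothetical. It reports only the Bidi_Mirrored property and says nothing about how a font should draw the mirrored glyph.

```python
import unicodedata


def is_bidi_mirrored(ch):
    """True if Unicode assigns the Bidi_Mirrored property to the character."""
    return unicodedata.mirrored(ch) == 1


def mirrored_characters(text):
    """List the characters of text that a renderer may need to mirror in RTL."""
    return [ch for ch in text if is_bidi_mirrored(ch)]


# Parentheses and the less-than sign are mirrored; letters and '+' are not.
print(mirrored_characters("(a+b)<c"))
# U+2211 N-ARY SUMMATION also carries the property.
print(is_bidi_mirrored("\u2211"))
```

A renderer that honours the property still has to pick a suitable glyph; as noted above, a calligraphically correct Arabic sum is not a geometric reflection of the European one.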
There are several decimal numeral systems in use in Arabic:
System  Unicode  Digits  Regions
European  U+0030..U+0039  0 1 2 3 4 5 6 7 8 9  Maghreb Arab (e.g. Morocco), as well as European
Arabic-Indic  U+0660..U+0669  ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩  Machrek Arab (e.g. Egypt)
Eastern Arabic-Indic  U+06F0..U+06F9  ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹  Iran
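Since this Note assigns the choice of digits to the generator of the Presentation MathML, a generator might map European digits to another system before writing the content of an mn element. The sketch below shows one way to do so in Python; the names `localize_digits`, `ARABIC_INDIC` and `EASTERN_ARABIC_INDIC` are hypothetical, and the decimal separator is deliberately left untouched because its choice is a separate localization decision.

```python
EUROPEAN = "0123456789"
# Ten consecutive code points starting at U+0660 and U+06F0 respectively.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
EASTERN_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))


def localize_digits(text, target_digits):
    """Replace European digits in text by the corresponding target digits.

    Only the ten digit characters are mapped; separators and signs are kept.
    """
    return text.translate(str.maketrans(EUROPEAN, target_digits))


print(localize_digits("3.141", ARABIC_INDIC))
print(localize_digits("2006", EASTERN_ARABIC_INDIC))
```

Because all three systems are decimal and positional, the mapping is one-to-one on digits and can be inverted in the same way.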
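[Table: example expression in English and French styles (columns Style, Image, MathML); images and markup not preserved.]
[Table: example expression in Moroccan style (columns Style, Image, MathML); image and markup not preserved.]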
Although the mathematics would be embedded within an RTL language (Arabic), its directionality is still LTR. The connecting words and phrases within the math, however, are RTL Arabic, and should be subject to glyph shaping (although some current MathML renderers do not do this). Thus these phrases should appear as "إذا كان" (for "if"), "غير ذلك" (for "otherwise") and "مع" (for "with").
This also suggests that the bidirectional algorithm [UnicodeBiDi] should be applied to individual text and token elements, rather than at a higher level as in HTML; that is, the token elements act as paragraph segments.
Even with these considerations, the ordering of phrases within the last clause (for "otherwise (with pi=3.141)") is problematic. The obvious markup, sandwiching an mrow for "pi=3.141" between two mtext elements for "otherwise (with" and ")" respectively, would yield an incorrect ordering. A correct rendering seems to require the possibility of embedding math within mtext, which is not possible in MathML 2.0. But even then, the desired ordering would need to be marked up as two separate mtext elements: one for "otherwise", and one for "(with pi=3.141)". The Math Interest Group is currently considering the possibilities of such embedding. The example above was marked up by artificially placing the Arabic word for "with" after the "pi=3.141".
Given such issues, it is sometimes advantageous to minimize the use of connecting phrases, with preference to simple punctuation, such as:
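[Table: Moroccan style example using punctuation instead of connecting phrases (columns Style, Image, MathML); image and markup not preserved.]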
The Maghreb style of notation is widely used in North Africa:
[Table: Maghreb style example (columns Style, Image, MathML); image not preserved; MathML not yet attempted.]
As the final Arabic example, we consider the Machrek style generally used in the Middle East.
[Table: Machrek style example (columns Style, Image, MathML); image not preserved; MathML not yet attempted.]
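[Table: example in English and Arabic styles (columns Style, Image, MathML); images and markup not preserved.]
[Table: second example in English and Arabic styles (columns Style, Image, MathML); images and markup not preserved.]
[Table: example in English and Persian styles (columns Style, Image, MathML); images and markup not preserved.]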
While the overall notation is similar to the Moroccan model (LTR), it uses the Eastern Arabic-Indic digits. The word "حد" (for "limit") is used; this word should not only be affected by glyph shaping, but should also be stretched horizontally to match the length of the underscript.
The following summarizes how directionality should be applied to MathML and, in particular, describes how the bidirectional algorithm should be applied (it falls into class HL4; see Higher Level Protocols: HL4 in [UnicodeBiDi], section 4.3).
The overall mathematical directionality should be determined by a (new) dir attribute on the outermost math element, which takes one of the values ltr or rtl; the default is ltr. If this attribute is rtl, the layout of all Layout, Script, Limit, Table and Matrix schemata should proceed from right to left. This includes such effects as the surd of an mroot starting from the right. When the mathematical directionality is ltr, the layout should conform to the current MathML specification.
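As a sketch of how a processing tool might read the proposed attribute, the following Python function parses a MathML fragment and returns its mathematical directionality, falling back to ltr when the attribute is absent. The function name `math_directionality` is hypothetical, and the dir attribute itself is the extension proposed in this Note rather than part of MathML 2.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def math_directionality(mathml_source):
    """Return 'ltr' or 'rtl' for the outermost math element.

    Assumes the proposed 'dir' attribute; the default is 'ltr'.
    """
    root = ET.fromstring(mathml_source)
    if root.tag != "{%s}math" % MATHML_NS:
        raise ValueError("outermost element must be a MathML math element")
    direction = root.get("dir", "ltr")
    if direction not in ("ltr", "rtl"):
        raise ValueError("dir must be 'ltr' or 'rtl', got %r" % direction)
    return direction


rtl_source = '<math xmlns="%s" dir="rtl"><mn>1</mn></math>' % MATHML_NS
default_source = '<math xmlns="%s"><mn>1</mn></math>' % MATHML_NS
print(math_directionality(rtl_source))
print(math_directionality(default_source))
```

Rejecting unknown values is a design choice of this sketch; a renderer might instead prefer to fall back silently to ltr.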
The text content of each Token element should be treated as a separate directional segment, and the bidirectional algorithm should be applied to each independently. The initial directional context for each Token element is determined by the mathematical directionality. This latter property should ensure that individual mirrored symbols are treated correctly.
As an example, consider the MathML fragment:
<mn>1</mn> <mo>+</mo> <mi> </mi> <mo></mo> <mn>2</mn>
Some browsers misapply the bidirectional algorithm to the expression as a whole, as in HTML. Applying the HTML algorithm would set the first two items LTR, but then switch directions upon encountering the Arabic letter; thus the last three items are reversed.
[Table: right and wrong renderings of the fragment above (columns Style, Image, MathML); images and markup not preserved.]
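The rule that sibling tokens are ordered by the mathematical directionality alone, whatever characters they contain, can be sketched as follows. The function name `visual_order` is hypothetical, and the Arabic letter used in the usage example (U+0628) and the minus sign are stand-ins chosen for illustration only.

```python
def visual_order(tokens, math_dir="ltr"):
    """Return sibling tokens in left-to-right visual order.

    Each token is an independent directional segment, so an RTL letter
    inside one token never reorders its neighbours; only math_dir does.
    """
    if math_dir not in ("ltr", "rtl"):
        raise ValueError("math_dir must be 'ltr' or 'rtl'")
    tokens = list(tokens)
    return tokens if math_dir == "ltr" else tokens[::-1]


# Logical order of an expression containing one Arabic identifier.
row = ["1", "+", "\u0628", "-", "2"]
print(visual_order(row, "ltr"))
print(visual_order(row, "rtl"))
```

With ltr the tokens keep their logical order even though one of them holds an RTL letter, which is the behaviour this Note asks for; with rtl the whole row is laid out from the right.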
[Table: dotted and undotted glyph samples for the forms isolated, initial, tailed, looped, stretched and double-struck; images not preserved.]
It is not expected to be meaningful to apply the "bold", "italic", "fraktur", "script", "sans-serif" or "monospace" mathvariants (or combinations) to Arabic (although there is some sentiment for allowing "bold" and "italic"). Nor is it meaningful to apply any mathvariant other than "normal" to multi-character tokens, which should have glyph shaping applied. The current MathML specification points out that the only combinations of characters and mathvariant that have an unambiguous interpretation are those that correspond to the SMP Math Alphanumeric Symbols. An analogous argument can be made for Arabic and the proposed Arabic Math Alphabetic Symbols [UnicodeProposition] (not yet part of Unicode).
Both dotted and undotted alphabetic symbols are encountered in this Note. The choice of which type to use is up to local preferences; however, documents use either dotted or undotted symbols, but not a mixture, and in particular, the dots are not used to indicate semantic distinctions. Thus, dotting is not felt to be a good candidate for a mathvariant value; it should rather be accommodated by the choice of symbol fonts available to the user's browser, or possibly through CSS.
In Arabic mathematics, the sum, product and limit are commonly stretched horizontally to the same width as the limits (over or under) that apply to them. Such stretching does occasionally appear in European mathematics, but is rare. In Horizontal Stretching Rules of MathML ([MathML22e], section 3.2.5.8.3), the standard allows for such horizontal stretching of some symbols at the discretion of the rendering agent. In this Note, we simply encourage developers to implement this feature for the appropriate Arabic symbols.
The successful use of mathematics in Arabic texts will also require, in addition to the extensions proposed here, that the appropriate codepoints are included in Unicode, and that those codepoints are correctly marked as mirrored. Some proposals ([UnicodeProposition],[ArabicMathUnicode]) have already been made.
The images of Arabic and Persian expressions were composed using the RyDArab system [RyDArab], and the FarsiTeX system [FarsiTeX], respectively.
There are two kinds of symbols, literal and mirrored, used according to the local area:
the limit operator is presented in two ways [images not preserved]; the second notation is used in Persian.
the factorial operator is presented in two ways: [image not preserved] and !12.
These stretched operators can be compared to the mathematical stretchy accents, only the roles are reversed. We can also think of something similar to the square root construction.
Even though some local symbols used in mathematics written in an Arabic notation can be obtained by mirroring existing symbols, many symbols found in Arabic mathematical handbooks are not yet part of the Unicode Standard and cannot be obtained through simple mirroring [ArabicMathUnicode]. Some of these special characters have been submitted for inclusion in the Unicode Standard [UnicodeProposition].
Some font families are designed to meet the requirements of typesetting mathematical documents in an Arabic notation. The RamzArab Arabic mathematical font [RamzArab] aims to provide a complete and homogeneous Arabic font family, in the OpenType format, respecting Arabic calligraphy rules.
Although letters in "tailed" and "stretched" forms are semantically distinct from the "initial" forms, they can be simulated by connecting with a particular final form of HEH and the final form of ALEF, respectively, and applying glyph shaping. This technique may be useful when an insufficient variety of fonts is available.
Implementors are encouraged to make it feasible for users to choose dotted or undotted mathematical symbol fonts easily in accord with local tastes.
[Table: good and bad examples of stretched symbols; images not preserved.]
These curvilinear extensible symbols were generated by the CurExt application for the TeX system with a PostScript font generator [RamzArab].
Horizontal stretching of sum and product operators is rare in European mathematics, but it is more common, and more desired, in Arabic mathematics. [Example images of both not preserved.]
[Note: the broken corner in these symbols is a known flaw to be repaired in a future version of RyDArab [RyDArab]].