Re: Arabic letters separated by markup

Andreas Prilop wrote on the Unicode mailing list[1]:
> Does the Unicode standard only deal with plain text or
> does it also deal with text in markup languages like SGML/HTML?
> 
> I wonder whether Arabic letters should join when they are
> separated by markup. Here's an example:
> 
>  http://www.unics.uni-hannover.de/nhtcapri/temp/nastaliq.html
> 
> Current programs display the letters separated by markup
> differently: Internet Explorer 6 and StarOffice 7 join the
> letters, but Mozilla 1.7 does not.
> 
> Is it left to the rules of SGML/HTML to decide or
> has the Unicode standard any opinion about this?

In semantic markup languages like HTML, this is really the domain of the
formatting system used to process the markup, not of the markup language
itself. [2] So, for web pages, this behavior would be governed by the
Unicode and CSS specs. I haven't read the Unicode book cover to cover,
but since there's an argument here, I'm guessing it's not covered by
Unicode quite yet. :)

Like many other people here, I think that the goal should be to make
the text as readable as possible, even if it means ignoring some of
the styling.

Therefore, these are the rules I suggest:

  For characters within the same inline sequence:

   1. Shaping and joining behavior MUST NOT be affected by element
      boundaries.
   2. Ligatures, including obligatory ligatures, MUST be broken if
      the formatting rules introduce extra space between the affected
      characters (e.g. by putting a border and margin around one of
      the characters).
   3. Optional ligatures SHOULD be broken if the formatting rules
      cannot otherwise be accommodated.
   4. Obligatory ligatures MUST NOT be broken if the formatting rules
      introduce no extra space between the affected characters, even
      if this means some of the characters are rendered in the wrong
      font or as part of the wrong visual element.
   5. Combining characters MUST be rendered as the combined grapheme
      cluster if the system is capable of rendering the combination,
      even if this means some of the characters are rendered in the
      wrong font or as part of the wrong visual element. The combined
      grapheme cluster SHOULD be rendered as part of the base
      character's element, or, in the case of combining jamos, the
      initial character's element.
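
To make rules 1-4 a bit more concrete, here is a rough Python sketch. It
is only a toy model under my own assumptions, not a real shaping engine
and not anything taken from the Unicode or CSS specs: the joining table
covers just four letters, and the names joining_forms, ligature_action
and formatting_fits are hypothetical, made up for illustration.

```python
# Illustrative sketch only: a toy model of rules 1-4, not a real shaper.
# The joining table is deliberately tiny and every name is hypothetical.

# Simplified Arabic joining types: "D" = dual-joining (joins on both
# sides), "R" = right-joining (joins only to the preceding letter).
JOINING_TYPE = {
    "\u0639": "D",  # ARABIC LETTER AIN
    "\u0631": "R",  # ARABIC LETTER REH
    "\u0628": "D",  # ARABIC LETTER BEH
    "\u064A": "D",  # ARABIC LETTER YEH
}


def joining_forms(runs):
    """Rule 1: choose contextual forms over the whole inline sequence.

    `runs` is a list of strings, one per element. The boundaries between
    them are deliberately ignored by concatenating before shaping.
    """
    text = "".join(runs)
    # Anything missing from the table is treated as non-joining ("U").
    types = [JOINING_TYPE.get(ch, "U") for ch in text]
    forms = []
    for i, t in enumerate(types):
        joins_prev = i > 0 and types[i - 1] == "D" and t in ("D", "R")
        joins_next = (
            t == "D" and i + 1 < len(types) and types[i + 1] in ("D", "R")
        )
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


def ligature_action(obligatory, extra_space, formatting_fits=True):
    """Rules 2-4: decide whether a ligature is kept or broken."""
    if extra_space:
        return "break"  # rule 2: extra space always breaks the ligature
    if obligatory:
        return "keep"   # rule 4: no extra space, obligatory ligature stays
    # rule 3: an optional ligature gives way if the formatting needs it to
    return "keep" if formatting_fits else "break"
```

With this model, joining_forms(["\u0639\u0631", "\u0628\u064A"]) and
joining_forms(["\u0639", "\u0631\u0628", "\u064A"]) give the same forms
as the unsplit word, which is the point of rule 1: where the element
boundaries fall makes no difference to shaping.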

I'm quite certain of #1, but as I don't have extensive background
in this stuff, I am not so certain of the others. Comments are
appreciated. I can ask the CSS Working Group to consider adding a
recommendation to the next revision of CSS2.1 if there seems to
be a consensus around a particular set of rules, and/or to refer
to relevant parts of the Unicode standard.

~fantasai

[1] http://www.unicode.org/mail-arch/unicode-ml/y2005-m06/0110.html
     username: unicode-ml ; pass: unicode

[2] CSS determines whether an element visually behaves as a
     block or an inline or a table cell. Given the CSS rule
       * { display: inline; }
     both
       <div>ARA</div><div>BIC</div>
     and
       <span>ARA</span><span>BIC</span>
     would result in the exact same rendering.

Received on Saturday, 11 June 2005 20:29:51 UTC