[css-text 3 and 4] hyphenation & interop

It's a mistake to define hyphenation as in CSS Text Level 3 without
saying how hyphenated words behave. Luckily, some of the necessary text
has already been written for CSS Text Level 4.

Some things that need to be clarified:

. hyphenation is a property of rendering, not of the DOM (disregarding
  shadow DOMs for a moment) - a search for "barefoot" must work even
  if the word has been hyphenated as bare-
  foot, and if the text is reflowed, e.g. because of a change in
  viewport size, the word may be hyphenated differently or not at
  all in subsequent renderings.

. soft hyphen characters must not affect search: they are to be
  ignored in both search strings and document text.

. ASCII hyphen ("-") can be used as a break character, as can
  the soft hyphen. Breaking at a soft hyphen (U+00AD) must insert
  "-" from the current font.
    Note: level 4 proposes a custom hyphenation character.
    I think a selector approach might be better, as then colour
    and/or an image could be used, e.g. a picture of a curved
    arrow in code listings, offset from the text.

. A renderer is never required to hyphenate, even if a single
  "word" is longer than the available space. Existing overflow
  strategies can be used.

. A user agent or renderer MAY add a preference to allow users to
  enable hyphenation by default for any text in their language, or
  any text not specifically marked for language; there
  should also be an option to disable hyphenation altogether.
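
To illustrate the two points about search above, here is a minimal Python
sketch of matching that ignores soft hyphens in both the search string and
the document text. The function names are hypothetical, not taken from any
browser or specification, and it assumes plain substring matching with
U+00AD as the only character to ignore.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def strip_soft_hyphens(text: str) -> str:
    """Return text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, "")


def find_ignoring_soft_hyphens(document: str, query: str) -> int:
    """Find query in document, ignoring soft hyphens on both sides.

    Returns the offset of the match in the stripped document text,
    or -1 if there is no match.
    """
    return strip_soft_hyphens(document).find(strip_soft_hyphens(query))
```

Note that the offset refers to the text with soft hyphens removed; a real
implementation would have to map it back to a position in the original
text in order to highlight the match.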

The next step (level 4) should include a hyphenation exception
mechanism.

A way to use TeX pattern files may also be useful, but today hyphen.js,
hyphenate.js etc. can add soft hyphens at every break point for many
languages (not German!), and can work around incompatible browser
behaviour with respect to soft hyphens, searching, reflowing text etc.
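
As a rough sketch of what such a script does once it knows where a word
may break, the following Python inserts U+00AD at a given list of
character offsets. The function name and the idea of passing offsets in
directly are assumptions for illustration; finding the break points
(patterns, dictionary, or both) is the hard part and is not shown.

```python
from typing import Iterable

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def insert_soft_hyphens(word: str, break_points: Iterable[int]) -> str:
    """Insert a soft hyphen before each offset listed in break_points.

    Offsets at or outside the ends of the word are ignored, so a
    break is never placed before the first or after the last character.
    """
    pieces = []
    previous = 0
    for offset in sorted(set(break_points)):
        if 0 < offset < len(word):
            pieces.append(word[previous:offset])
            previous = offset
    pieces.append(word[previous:])
    return SOFT_HYPHEN.join(pieces)
```

For example, insert_soft_hyphens("barefoot", [4]) gives "bare", U+00AD,
"foot"; removing the soft hyphens again gives back the original word.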

What I'm trying to do here is (1) push for higher quality, and (2) push
for higher interoperability. Right now hyphenation tends to break stuff
or to behave too differently across browsers for even the JavaScript
shims to be acceptable. Accessibility can also suffer, e.g. soft hyphens
are said to be (incorrectly) rendered as spaces in some browsers. So we
need to give more guidance (unless my experiments and the research I did
are out of date, which is always possible since I blinked in the
meantime!)

Liam

PS:  TeX's hyphenation algorithm is not the best (as even the TeXbook
acknowledges). TeX is not considered a "high end" formatter by people
who do large amounts of batch/unattended high-quality formatting, and
its poor hyphenation algorithm and its unacceptable treatment of corner
cases are a large part of the reason. TeX is fine, and even excels, in
semi-automated formatting, e.g. for research papers, where the author
will correct problems.  The advantage of the TeX pattern algorithm and
interchange format (which seems to be closely modelled on the
older troff algorithm) is that it's widely described and is much more
compact than the dictionary-based systems.  The best results for most
Western languages come from a mix of an algorithm and a dictionary; some
languages, such as Thai and German, are much harder than others.

PPS: I wrote a lot more detailed comments on hyphenation and line
breaking but am guessing that I need to save them until the WG has
cycles to process them for level 4.

But the comments here apply to level 3.

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml

Received on Saturday, 23 March 2013 04:22:20 UTC