17967 – Parsing algorithm should not preclude Complex Ruby

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	Other other

Importance:	P3 normal
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

URL:	http://fantasai.inkedblade.net/weblog...

Reported:	2012-07-18 07:25 UTC by contributor
Modified:	2012-08-21 19:29 UTC
CC List:	11 users

Description contributor 2012-07-18 07:25:05 UTC

This was cloned from bug 13113 as part of operation convergence.
Originally filed: 2011-07-01 13:33:00 +0000
Original reporter: Henri Sivonen <hsivonen@iki.fi>

================================================================================
 #0   Henri Sivonen                                   2011-07-01 13:33:34 +0000 
--------------------------------------------------------------------------------
Continuing from bug 12935.

I have spent some more time implementing variations and experimenting
with them. I'm now ready to request specific edits.

Please make the following spec edits:

 1) Please add rb, rbc and rtc to the list of elements that get closed
by "generate implied end tags" at
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#generate-implied-end-tags

 2) Please replace the "in body" entry for 'A start tag whose tag name
is one of: "rp", "rt"' with these three entries:

A start tag whose tag name is one of: "rbc", "rtc"

    If the stack of open elements has a ruby element in scope, then
generate implied end tags.

    Insert an HTML element for the token.

A start tag whose tag name is one of: "rb"

    If the stack of open elements has a ruby element in scope, then
generate implied end tags, except for elements with the name "rbc".

    Insert an HTML element for the token.

A start tag whose tag name is one of: "rt", "rp"

    If the stack of open elements has a ruby element in scope, then
generate implied end tags, except for elements with the name "rtc".

    Insert an HTML element for the token.


Note that the "If the stack of open elements has a ruby element in
scope, then" parts are just copying the current spec text. I don't see
the value of that bit and would be OK with omitting the scope check.
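
The three proposed "in body" entries can be sketched as follows. This is a minimal illustration, not the spec's algorithm: the stack of open elements is reduced to a list of tag names, the scope-marker list is abbreviated, and every name here (IMPLIED_END_TAGS, has_in_scope, start_tag_in_body, parents_for) is hypothetical.

```python
# Elements closed by "generate implied end tags" in the spec text of the time,
# plus the three additions requested in edit 1 (rb, rbc, rtc).
IMPLIED_END_TAGS = {"dd", "dt", "li", "option", "optgroup", "p", "rp", "rt",
                    "rb", "rbc", "rtc"}

# Abbreviated scope boundaries (assumption: MathML and SVG entries omitted).
SCOPE_MARKERS = {"applet", "caption", "html", "table", "td", "th",
                 "marquee", "object"}


def has_in_scope(stack, target):
    """Walk from the current node outward, stopping at a scope marker."""
    for name in reversed(stack):
        if name == target:
            return True
        if name in SCOPE_MARKERS:
            return False
    return False


def generate_implied_end_tags(stack, except_for=None):
    """Pop current nodes while they are in the implied-end-tag list."""
    while stack and stack[-1] in IMPLIED_END_TAGS and stack[-1] != except_for:
        stack.pop()


def start_tag_in_body(stack, name):
    """Apply the proposed entries; return the parent of the new element."""
    if has_in_scope(stack, "ruby"):
        if name in ("rbc", "rtc"):
            generate_implied_end_tags(stack)
        elif name == "rb":
            generate_implied_end_tags(stack, except_for="rbc")
        elif name in ("rt", "rp"):
            generate_implied_end_tags(stack, except_for="rtc")
    parent = stack[-1]
    stack.append(name)  # "Insert an HTML element for the token."
    return parent


def parents_for(tags):
    """Feed start tags to a fresh stack and list (tag, parent) pairs."""
    stack = ["html", "body"]
    return [(tag, start_tag_in_body(stack, tag)) for tag in tags]
```

Under these assumptions, feeding the start tags ruby, rbc, rb, rb, rtc, rt, rt places both rb elements under rbc and both rt elements under rtc, while the plain sequence ruby, rb, rt, rb, rt keeps every rb and rt as a direct child of ruby.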

Rationale:

We shouldn't paint ourselves into a corner with the parsing algorithm, such that Complex Ruby can't be introduced in the future without causing ungraceful behavior in browsers implementing an earlier snapshot of the parsing spec.

The changes proposed above assume a design where rp goes as a child of rtc (if rp is used at all) in the Complex Ruby case. This allows UAs that implement Simple Ruby to be forward-compatible by having
   rp { display: none; }
   rtc > rt { display: inline; }
   rtc > rp { display: inline; }
in the UA style sheet while UAs supporting both Simple and Complex Ruby would have
   rp { display: none; }
in the UA style sheet without the two other rules.
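
The effect of the three-rule style sheet on Complex Ruby markup can be modeled roughly as below. This is only a sketch under stated assumptions: the tuple-based tree, the function name inline_fallback, and the sample markup are all hypothetical, and the model covers only content inside rbc and rtc (an rt placed directly under ruby would be rendered as an annotation, which is not modeled here).

```python
def inline_fallback(node, parent_tag=None):
    """Text a Simple-Ruby-only UA would show inline for complex ruby markup.

    A node is either a string or a (tag, children) tuple.
    """
    if isinstance(node, str):
        return node
    tag, children = node
    if tag == "rp" and parent_tag != "rtc":
        return ""  # rp { display: none; }
    # rtc > rt and rtc > rp are inline; unknown elements (rbc, rtc)
    # are treated as inline as well.
    return "".join(inline_fallback(child, tag) for child in children)


# Hypothetical sample: two bases in rbc, their readings in rtc,
# with rp parentheses as children of rtc.
complex_ruby = ("ruby", [
    ("rbc", [("rb", ["東"]), ("rb", ["京"])]),
    ("rtc", [("rp", ["("]), ("rt", ["とう"]), ("rt", ["きょう"]), ("rp", [")"])]),
])
```

In this model the rtc > rp rule wins over the bare rp rule because it is the more specific selector, so the parentheses stay visible around the inlined readings.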
================================================================================
 #1   Ian 'Hixie' Hickson                             2011-07-01 22:20:54 +0000 
--------------------------------------------------------------------------------
This only makes sense if we think complex ruby makes sense. If it does not, then we should design the parser to be the best thing ignoring complex ruby.

I'm not at all convinced that the use cases for complex ruby are compelling. Sure, as with anything, there are use cases that need finer-grained semantics than HTML can provide. But we're not designing DocBook here; the rare use cases are _by design_ not handled. We don't have a way to semantically mark up Scandinavian arroword crosswords (or indeed even simpler "regular" crosswords), and that's ok. We don't have a way to mark up bibliographic entries in a manner sufficiently semantic-rich to work as well as BibTeX, and that's ok.

Note that I'm not arguing here that we shouldn't add this _yet_; that it might make sense one day but not today. I'm arguing that it will never make sense for HTML to support complex ruby, because the use cases of complex ruby are too obscure to deserve being supported as first-class primitives in HTML.

Am I wrong?

If I _am_ wrong, what other features might we one day add that we should support in the parser today? Crosswords in particular might need particularly painful changes to the table parsing model; should we add new elements to table parsing rules to support potential future extensions there?
================================================================================
 #2   fantasai                                        2011-07-05 22:13:53 +0000 
--------------------------------------------------------------------------------
Yes, I think you are wrong.

  * *Most* ruby in Japanese should be marked up with Level 2 markup. This
    isn't a rare use case by any means.

  * Pretty much all of the semantics of BibTeX and DocBook can be captured
    extending HTML elements with a microformat. That's not the case for the
    structures of complex ruby.

  * Crossword puzzles and sudoku are handled better by table markup than most
    other games are handled by HTML markup, and for how common they are compared
    to other use cases for HTML, strike me as adequately supported by HTML
    already.  But if you think it's insufficient, file a separate bug.

  * You don't know how the needs of HTML will evolve over time. All you can
    anticipate is what's appropriate for it to include right now. I'm sure that
    12 years ago, many of the things included in HTML5 would be considered scope
    creep from what was supposed to be a simple document markup language.

The top complaint the CSSWG got from the publishing industry in Japan, btw, was that the way ruby influences the line height is wrong. I think that's a fair indicator that correct support for ruby is important to them as they move more of their content to HTML.
================================================================================
 #3   fantasai                                        2011-07-05 22:21:40 +0000 
--------------------------------------------------------------------------------
[Sorry for the broken wrapping. I didn't realize the text box was bigger than the line limit.]
================================================================================
 #4   Ian 'Hixie' Hickson                             2011-07-08 23:26:15 +0000 
--------------------------------------------------------------------------------
>   * *Most* ruby in Japanese should be marked up with Level 2 markup. This
>     isn't a rare use case by any means.

What data do we have on this? I find this hard to believe. All the examples I've seen of complex ruby have seemed rather contrived.
================================================================================
 #5   fantasai                                        2011-07-18 02:56:53 +0000 
--------------------------------------------------------------------------------
> What data do we have on this?

The fact that most Japanese words are compound words with that structure? Level 1 markup only applies to
  a) single-kanji words--which are reasonably common but not overwhelmingly so
  b) multi-kanji words whose pronunciation cannot be broken down (which are
     very noticeably a small minority of such words)

> All the examples I've seen of complex ruby have seemed rather contrived.

The term "complex ruby" covers a lot of things and is imho a misleading distinction. Let's use instead the levels I outlined in my writeup. To which level(s) do the contrived examples you have seen belong and why do they seem contrived?
================================================================================
 #6   Sam Ruby                                        2011-08-02 22:13:07 +0000 
--------------------------------------------------------------------------------
This bug was marked as P1 over 30 days ago, and still hasn't been RESOLVED.

Editor (and editor assistants): please RESOLVE it ASAP.  NEEDSINFO and WONTFIX are valid resolutions for this part of the process.  We simply want to get this bug to a state where we are prepared to accept change proposals should anybody be inclined to produce such.
================================================================================
 #7   Anne                                            2011-08-03 05:41:51 +0000 
--------------------------------------------------------------------------------
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document: <http://dev.w3.org/html5/decision-policy/decision-policy.html>.

Status: Rejected
Change Description: no spec change
Rationale: Resolving as WONTFIX to address comment 6 and because comment 5 does not really have any convincing data.
================================================================================
 #8   Boris Zbarsky                                   2011-08-03 14:57:10 +0000 
--------------------------------------------------------------------------------
Reopening.  What data did you expect exactly?  A list of Japanese words or phrases that can't be usefully marked up without <rb>?
================================================================================
 #9   Anne                                            2011-08-03 15:00:43 +0000 
--------------------------------------------------------------------------------
<rb> is implicit in the current model and has nothing to do with complex ruby.
================================================================================
 #10  Boris Zbarsky                                   2011-08-03 15:50:41 +0000 
--------------------------------------------------------------------------------
Making it implicit makes fallback and inlining not work.  Did you even read the document linked to in the url field?  Did you read the second part of comment 5?
================================================================================
 #11  Anne                                            2011-08-03 16:07:46 +0000 
--------------------------------------------------------------------------------
My bad, last time I spoke to Japanese developers what IE had was sufficient and I assumed nothing much had changed. Having said that, I'm not sure allowing UAs to not support ruby markup makes sense and sort of wonder how often one would use ruby markup to then have it inlined.
================================================================================
 #13  Ian 'Hixie' Hickson                             2011-08-04 06:57:21 +0000 
--------------------------------------------------------------------------------
Status: Did Not Understand Request
Change Description: no spec change
Rationale: I still haven't seen data on this. Make a random selection of books, magazines, web pages, or whatever, and tabulate how many of each kind of ruby these texts have, ideally with examples of each for my own education. It's possible that what's in the spec is insufficient, but I am highly skeptical that the level of complexity being proposed is necessary to solve real-world use cases.
================================================================================
 #14  Ian 'Hixie' Hickson                             2011-08-06 03:33:49 +0000 
--------------------------------------------------------------------------------
*** Bug 10830 has been marked as a duplicate of this bug. ***
================================================================================
 #15  fantasai                                        2011-10-07 23:53:22 +0000 
--------------------------------------------------------------------------------
Since I neither have access to a Japanese library, nor the time and patience necessary to tabulate the kind of data set you're requesting, you're getting the next best thing: scans from a magazine lent me by someone I randomly met on the BART. The magazine is Mangajin issue 53, published March 1995, and the tagline is "Japanese Pop Culture & Language Learning". Here are two representative pages and diagrammed extracts from them.

Several articles have furigana over the kanji. Example:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-54
They are formatted using jukugo ruby. (Jukugo ruby formats like a word-to-word association, but line-breaks differently: the associated kana must be kept with their kanji base.) This colorized extract shows the association of kana to kanji:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-jukugo-ruby
The ratio of compound words to simple words is 2:1. The rest of the page holds close to this ratio.

Other parts of the magazine use double-annotated ruby. Example:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-35
Notice the line-breaking behavior and the word associations.
Here is a diagrammed excerpt. The ruby base is in red. The first annotation (romaji) is in blue. The second annotation (English transliteration) is green:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-double-annotation

Here is real-world use of complex ruby. You can of course continue to argue that the use case is unimportant, but it exists.
================================================================================
 #16  Ian 'Hixie' Hickson                             2011-10-10 19:44:22 +0000 
--------------------------------------------------------------------------------
The first one does not seem to require anything the spec doesn't already provide.

The second is consistent with what I wrote in comment 1. I make no argument that there are no use cases. My argument is that the use cases are obscure. The second example here is not common text, it's a very specialised case where the language itself is being taught.

There are lots of examples of how we don't currently support that kind of thing. For example, we don't have markup for grammar annotation (no <verb>, <subject>, <adverbial-clause> elements) which would be very useful for people teaching French or English. We don't have anything for marking up family trees or molecular structures, even though that means HTML is deficient for supporting those use cases (I get at least one person who asks me whether we can add markup for genealogy every few months, because right now they're stuck with using bitmaps or abusing tables to convey their data, and that sucks). I gave other examples in comment 1.
================================================================================
 #17  Ian 'Hixie' Hickson                             2011-11-02 19:30:21 +0000 
--------------------------------------------------------------------------------
I spoke with the i18n group about this yesterday, and it seems that we don't really need to add any elements to handle the important use cases here.

Multiple annotations can be handled pretty easily if we just define that nested ruby is semantically equivalent to two annotations; picking which side the annotations appear on is a stylistic issue for CSS. Monoruby and group ruby are both handled already; the only difference is how much is put in the ruby base before the annotation. Jukugo is a stylistic variant of group ruby, again to be handled in CSS.

Fallback if we rely on this simple pattern is suboptimal, but that doesn't seem to be a big deal. It's time for implementations to just implement ruby. AT fallback is not impossible in any of these cases and is unaffected by how we mark it up.

The last remaining case is what to do with multiple annotation if there is word-pairing for each component. Not supporting this doesn't seem like a big deal, but if we do want to support it, we could do it with multiple <rt>s for each ruby base.

In conclusion, the spec should be changed to limit ruby nesting to two levels, defining the outer level as a phrase-level annotation; and we should consider supporting multiple <rt>s per base, if there is reason to believe that multiple monoruby annotations at the ends of lines are common.
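
As a rough illustration of the nested-ruby idea in this comment, the sketch below builds two-level markup in which the inner ruby carries the first annotation and the outer ruby carries the second. The function name nested_ruby and the sample strings are hypothetical, and the exact markup shape is an assumption based on the description above rather than on spec text.

```python
from html import escape


def nested_ruby(base, inner, outer):
    """Wrap a base in two ruby levels: inner annotation, then outer annotation."""
    inner_ruby = f"<ruby>{escape(base)}<rt>{escape(inner)}</rt></ruby>"
    # The inner <ruby> acts as the base of the outer one.
    return f"<ruby>{inner_ruby}<rt>{escape(outer)}</rt></ruby>"


print(nested_ruby("東京", "とうきょう", "Tokyo"))
```

Which side each annotation is drawn on would, per the comment, be left to CSS; nothing in the markup itself records it.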
================================================================================
 #18  Michael[tm] Smith                               2011-11-20 17:26:17 +0000 
--------------------------------------------------------------------------------
Henri, any response to comment #17 from Hixie?
================================================================================
 #19  Henri Sivonen                                   2011-11-21 07:36:25 +0000 
--------------------------------------------------------------------------------
I'm not competent to disagree with the i18n group on this topic. fantasai, bz?
================================================================================
 #20  fantasai                                        2011-11-29 00:49:46 +0000 
--------------------------------------------------------------------------------
> Jukugo is a stylistic variant of group ruby,

This is not true.

> again to be handled in CSS.

CSS can handle jukugo vs. mono rendering at the stylistic level *iff* both the pairing and the word-boundary information is recorded in the HTML. Group ruby doesn't record any sub-word pairing information, because there isn't any, so you'll have to explain better what you mean by this sentence.

> defining the outer level as a phrase-level annotation

This doesn't make sense for, e.g. double annotating kanji with both kana and romaji.

>  we should consider supporting multiple <rt>s per base, if there is reason to
> believe that multiple monoruby annotations at the ends of lines are common.

I have no idea what this is referring to.
================================================================================
 #21  Koji Ishii                                      2011-12-04 05:57:25 +0000 
--------------------------------------------------------------------------------
> Fallback if we rely on this simple pattern is suboptimal, but that doesn't seem
> to be a big deal. It's time for implementations to just implement ruby. AT
> fallback is not impossible in any of these cases and is unaffected by how we
> mark it up.

Fallback isn't only for browsers without ruby support. One UA vendor I know considered using fallback when ruby is too small to read, but gave up due to the text quality issue fantasai pointed out.
================================================================================
 #22  fantasai                                        2012-02-20 17:18:42 +0000 
--------------------------------------------------------------------------------
http://lists.w3.org/Archives/Public/public-i18n-cjk/2012JanMar/0063.html
================================================================================

Comment 1 Ian 'Hixie' Hickson 2012-07-20 04:19:20 UTC

I intend to address this as described in the final paragraph of note 17 above; this will not impact the parser.

Comment 2 fantasai 2012-07-20 15:37:36 UTC

This bug is very specifically about the parser, and about HTML's forwards-compatibility in the case that a future version defines complex ruby to be in scope. It's not about whether or not to support complex ruby; that's a separate debate, where several approaches are being discussed and there isn't a clear conclusion yet. I'd rather not complicate this bug by discussing approaches to ruby, please let it suffice to say that at some point in the next 20 years it's entirely possible for it (or some subset of it) to be in scope and the current parsing algorithm has no benefits over one that is more forwards-compatible.

Comment 3 fantasai 2012-08-21 18:15:54 UTC

I assume the "fix" was this checkin:
  http://html5.org/tools/web-apps-tracker?from=7243&to=7244

That does not fix this bug, which is summarized in comment #1. It fixes perhaps another one. Most of your comments and questions here seem to be addressing this phantom bug (which I assume is summarized "support xxx type of ruby"), rather than the actual one hsivonen filed, so I can understand the confusion. But the bug hsivonen filed is not fixed by this checkin.

Comment 4 Ian 'Hixie' Hickson 2012-08-21 19:29:07 UTC

There's no need to change the parser, so if you want me to address the original bug report and not the reasoning behind the original bug report then this is just a WONTFIX.