ISSUE-12: Create a (Sentence) Segmentation Markup System compatible with the proposed Unicode segmentation characters

Segmentation Markup

State:
CLOSED
Product:
MLW-LT Requirements Document
Raised by:
Arle Lommel
Opened on:
2012-05-09
Description:
The Unicode Technical Committee (UTC) is considering a proposal to encode two plain-text characters that would improve the output of UAX #29-based sentence segmentation by letting text carry override characters that correct the default results. For example, given the following string:

“Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment.”

UAX #29 would incorrectly treat it as four segments:

1. Mrs.
2. Smith and Mr.
3. Jones ate lunch at Mme.
4. Flaubert’s apartment.
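The over-segmentation above can be reproduced with a deliberately simplified rule. The sketch below is not an implementation of UAX #29; it assumes only that a break occurs after a full stop followed by whitespace and an uppercase letter, which is enough to show the problem. The names `BREAK` and `naive_segments` are hypothetical.

```python
import re

# Simplified stand-in for the default rule (an assumption, not UAX #29 itself):
# break at whitespace that follows a full stop and precedes an uppercase letter.
BREAK = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def naive_segments(text):
    """Split text into sentence candidates using the simplified rule."""
    return [part for part in BREAK.split(text) if part]


sample = "Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment."

for number, part in enumerate(naive_segments(sample), start=1):
    print(f"{number}. {part}")
```

Each abbreviation ends in a full stop and is followed by a capitalized name, so the rule breaks after every one of them.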

Introducing a SENTENCE JOINER (SJ) and a corresponding SENTENCE NON-JOINER (SNJ) would allow processes to override UAX #29 behavior explicitly, e.g., by rendering the example as:

“Mrs.<SJ> Smith and Mr.<SJ> Jones ate lunch at Mme.<SJ> Flaubert’s apartment.”

Here the SENTENCE JOINER (<SJ>) overrides the default UAX #29 rule at each marked position.
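A minimal sketch of how a segmenter might honor the override characters follows. Several things here are assumptions: the proposed characters have no assigned code points, so two private-use code points (U+E000 and U+E001) stand in for them; the break rule is the same simplified stand-in for UAX #29 (full stop, whitespace, uppercase letter); and SNJ is interpreted as forcing a break where the default rule would find none. The names `SJ`, `SNJ`, and `segment` are hypothetical.

```python
import re

# Hypothetical placeholders: the proposed characters have no code points yet.
SJ = "\ue000"   # stands in for SENTENCE JOINER
SNJ = "\ue001"  # stands in for SENTENCE NON-JOINER

# Simplified default rule, plus an unconditional break at SNJ.
# An SJ placed directly after the full stop defeats the lookbehind,
# so no break is found at that position.
BREAK = re.compile(r"(?<=\.)\s+(?=[A-Z])|" + re.escape(SNJ))


def segment(text):
    """Split text into segments, honoring the SJ and SNJ placeholders."""
    parts = BREAK.split(text)
    # Drop the joiner placeholders from the output and skip empty pieces.
    return [p.replace(SJ, "").strip() for p in parts if p.strip()]


marked = (
    f"Mrs.{SJ} Smith and Mr.{SJ} Jones ate lunch at Mme.{SJ} Flaubert’s apartment. "
    "They had filet of herring and boiled potatoes with a cream sauce."
)

for part in segment(marked):
    print(part)
```

With the joiners in place, the first sentence stays whole and the only break falls before "They".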

The UTC requested that, if this proposal moves forward, “somebody” also work on a standard markup-oriented equivalent.

I see two ways to handle this request (interpreting it pretty broadly):

==1. Directly compatible with the character model==

Add two empty elements, e.g., <sj/> and <snj/>, that can substitute directly for the proposed characters.

Pros: directly equivalent to UTC proposal. Light-weight and minimally intrusive in the document structure. Can be added only where needed, making implementation simple.

Cons: limited utility for addressing individual segments (elements are not text-containing nodes); embeds process-dependent information in the document (i.e., the segmentation characters are useful only in a UAX #29-compliant process and segmentation thus requires continual reprocessing).
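To illustrate the direct substitution, the sketch below flattens a paragraph containing the empty elements into plain text that carries the override characters. It assumes well-formed XML input and, because the proposed characters have no code points, uses private-use placeholders (U+E000, U+E001); `to_plain` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholders for the proposed characters.
SJ = "\ue000"
SNJ = "\ue001"
MARKERS = {"sj": SJ, "snj": SNJ}


def to_plain(elem):
    """Flatten an element to text, replacing <sj/> and <snj/> with placeholders."""
    out = [elem.text or ""]
    for child in elem:
        if child.tag in MARKERS:
            out.append(MARKERS[child.tag])
        else:
            # Any other element contributes its flattened content.
            out.append(to_plain(child))
        # Text that follows the child belongs to the parent's content.
        out.append(child.tail or "")
    return "".join(out)


para = ET.fromstring(
    "<p>Mrs.<sj/> Smith and Mr.<sj/> Jones ate lunch at Mme.<sj/> Flaubert’s apartment.</p>"
)
plain = to_plain(para)
print(plain.count(SJ))
```

The empty elements mark positions only. The segment text stays in the parent node, which is why this option gives no handle for addressing an individual segment.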


==2. Segments as text nodes==

Add a new non-empty element, <segment>, e.g., (in HTML5)

<p><segment>Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment. </segment><segment>They had filet of herring and boiled potatoes with a cream sauce.</segment></p>

Pros: allows referencing of contents directly. Equivalent to <span> but with defined semantics. Definite boundaries to segments that do not require reprocessing. No dependence on UAX #29. More flexible than option 1.

Cons: Not equivalent to the UTC proposal. Would be best implemented as an element in HTML5, which will be harder to obtain than an attribute. Heavier than the alternative and requires explicit marking of all boundaries; because it cannot rely on UAX #29, it is an all-or-nothing model.

Note: This could be implemented using <span> plus attributes:

<p><span type="segment">Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment. </span><span type="segment">They had filet of herring and boiled potatoes with a cream sauce.</span></p>
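Either form of option 2 lets a consumer read segments without running a segmentation algorithm. The sketch below collects segment text from both the `<segment>` element and the `<span type="segment">` variant using Python's standard `html.parser`. It assumes segments are not nested inside one another; `SegmentExtractor` and `extract_segments` are hypothetical names.

```python
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect the text of <segment> and <span type="segment"> elements."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._open = []    # one flag per open <segment>/<span>: is it a segment?
        self._buf = None   # text of the segment currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("segment", "span"):
            is_seg = tag == "segment" or dict(attrs).get("type") == "segment"
            self._open.append(is_seg)
            if is_seg:
                self._buf = []

    def handle_endtag(self, tag):
        if tag in ("segment", "span") and self._open:
            if self._open.pop() and self._buf is not None:
                self.segments.append("".join(self._buf).strip())
                self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def extract_segments(html):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


element_form = (
    "<p><segment>Mrs. Smith and Mr. Jones ate lunch at Mme. Flaubert’s apartment. </segment>"
    "<segment>They had filet of herring and boiled potatoes with a cream sauce.</segment></p>"
)
span_form = element_form.replace("<segment>", '<span type="segment">').replace(
    "</segment>", "</span>"
)

print(extract_segments(element_form) == extract_segments(span_form))
```

Both forms yield the same two segments, and the abbreviations cause no trouble because the boundaries come from the markup rather than from punctuation.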



I do not have a sense of which is better, since the two options serve different needs. Option 1 provides a fix for UAX #29 and matches the proposal there. Option 2 is better where segments need to be addressable in the document. Perhaps using <span> elements already solves the problem of addressability, in which case option 1 may be attractive.

We need to discuss this topic further to determine what the use requirements are.
Related Action Items:
Related emails:
  1. [All updated agenda] Re: [All] AGENDA MLW-LT Call 28 June, 2 p.m. UTC (from fsasaki@w3.org on 2012-06-28)
  2. [ISSUE 12] Segmentation Markup System (from arle.lommel@dfki.de on 2012-06-27)
  3. [All] AGENDA MLW-LT Call 28 June, 2 p.m. UTC (from fsasaki@w3.org on 2012-06-27)
  4. [All] AGENDA MLW-LT Call 28 June, 2 p.m. UTC (from fsasaki@w3.org on 2012-06-27)
  5. Update: MLW workshop Dublin Overview of issues to be discussed on 12-13 June (Tuesday-Wednesday) (from fsasaki@w3.org on 2012-06-10)
  6. MLW workshop Dublin: Overview of issues to be discussed on 12-13 June (Tuesday-Wednesday) (from fsasaki@w3.org on 2012-06-07)
  7. Re: [all] AGENDA MLW-LT WG, 31 May 2012 2 p.m. UTC (from arle.lommel@dfki.de on 2012-05-31)
  8. [all] AGENDA MLW-LT WG, 31 May 2012 2 p.m. UTC (from fsasaki@w3.org on 2012-05-30)

Related notes:

Closed with
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0156.html

Felix Sasaki, 27 Jun 2012, 09:16:09
