ITS WG Collaborative editing page


Status: Initial Draft (Req Doc), i.e. please focus on technical content rather than wordsmithing at this stage.

Elements and Segmentation

Summary

[R025] Methods must exist, independent of the semantics of the elements, to provide hints on how to break down document content into meaningful runs of text.

Challenges

Many applications that process content for language-related tasks need to perform basic segmentation, and they need to do so without knowing the semantics of the elements. The elements marking up the document content should therefore provide generic clues to help such processes.

Two types of information are needed:

1- A way to distinguish elements that may hold text content (a), from elements that never have text content (b).

Example for (a): The element <p> may hold text:

<p>
 <b>This is bold.</b>
 <i>This is italic.</i>
</p>

Example for (b): The element <ul> never has text content of its own:

<ul>
 <li>This is the first item.</li>
 <li>This is the second item.</li>
</ul>
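
A minimal sketch of how a processor might use this first distinction, assuming a hypothetical declaration (the set NO_TEXT_ELEMENTS below) that lists the elements which never hold text. Neither the name nor the mechanism is defined by this requirement; the sketch only illustrates the idea in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration: elements that never hold text directly.
# In practice this information would come from the method this
# requirement asks for, not from a hard-coded set.
NO_TEXT_ELEMENTS = {"ul"}


def text_bearing_elements(xml_source, no_text=NO_TEXT_ELEMENTS):
    """Return, in document order, the tags of elements that may hold text."""
    root = ET.fromstring(xml_source)
    return [el.tag for el in root.iter() if el.tag not in no_text]


sample = (
    "<ul>\n"
    " <li>This is the first item.</li>\n"
    " <li>This is the second item.</li>\n"
    "</ul>"
)
print(text_bearing_elements(sample))
```

With such a declaration, a processor can skip <ul> itself and look for text only in the <li> elements, without knowing what either element means.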

2- A way to distinguish independent text content that is nested within other content (c) from text content that is part of its parent element's content (d).

Examples for (c):

<p>Palouse horses<fn callout="#">A Palouse horse is 
the same as an Appaloosa.</fn> have spotted coats.</p>

Or, a more complex case: a footnote in OpenDocument:

...
<text:p text:style-name="Standard">
 Palouse horses
 <text:note text:id="ftn1" text:note-class="footnote">
  <text:note-citation>1</text:note-citation>
  <text:note-body>
   <text:p text:style-name="Footnote">
A Palouse horse is the same as an Appaloosa.</text:p>
  </text:note-body>
 </text:note>
 have spotted coats.</text:p>
...

Both examples correspond to two distinct text runs:

Palouse horses have spotted coats.
A Palouse horse is the same as an Appaloosa.

Example for (d):

<p><term>Palouse horses</term>
have spotted coats.</p>

This corresponds to a single text run:

Palouse horses have spotted coats.

A processor should be able to obtain this information through an explicit method or infer it from the context.
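
The second distinction can be sketched in the same way. The Python below assumes a hypothetical declaration (the set NESTED_ELEMENTS) naming the elements whose content forms an independent text run. The set, the function name and the whitespace normalization are illustrative assumptions, not part of this requirement.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration: elements whose content is an independent
# text run nested inside another run, as in case (c).
NESTED_ELEMENTS = {"fn"}


def text_runs(xml_source, nested=NESTED_ELEMENTS):
    """Split a document into text runs, outer run first."""
    runs = []

    def start_run(elem):
        slot = len(runs)
        runs.append("")  # reserve the slot so the outer run comes first
        parts = []
        gather(elem, parts)
        # Collapse line breaks and repeated spaces into single spaces.
        runs[slot] = " ".join("".join(parts).split())

    def gather(elem, parts):
        parts.append(elem.text or "")
        for child in elem:
            if child.tag in nested:
                start_run(child)      # case (c): separate run
            else:
                gather(child, parts)  # case (d): part of the parent run
            parts.append(child.tail or "")

    start_run(ET.fromstring(xml_source))
    return runs


footnote = (
    '<p>Palouse horses<fn callout="#">A Palouse horse is \n'
    "the same as an Appaloosa.</fn> have spotted coats.</p>"
)
term = "<p><term>Palouse horses</term>\nhave spotted coats.</p>"

print(text_runs(footnote))
print(text_runs(term))
```

Under these assumptions the footnote example yields two runs and the <term> example yields one. The only element-specific knowledge the processor needs is whether an element is nested in this sense.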