2878 – "Element within text" data category

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Summary: "Element within text" data category

Status:	CLOSED FIXED

Alias:	None

Product:	ITS
Classification:	Unclassified
Component:	ITS tagset
Version:	WorkingDraft
Hardware:	PC Windows XP

Importance:	P2 normal
Target Milestone:	LastCall20May
Assignee:	Felix Sasaki
QA Contact:	ITS mailing-list

URL:
Whiteboard:
Keywords:

Duplicates (1):	2968
Depends on:
Blocks:

Reported:	2006-02-15 08:24 UTC by Felix Sasaki
Modified:	2006-07-21 17:49 UTC
CC List:	1 user

Description Felix Sasaki 2006-02-15 08:24:23 UTC

TBD until last call.

Comment 1 Christian Lieske 2006-02-17 11:20:43 UTC

Continuing the Yomigana question ...

Should we recommend the use of Ruby instead? If so, do we recommend a specific
type of nesting?

a.
 
<ruby>
   <rubyBase><term>PlateBroiler</term></rubyBase>
   <rubyText>ssss</rubyText>
</ruby>

b.

<ruby>
   <rubyBase>PlateBroiler</rubyBase>
   <term>PlateBroiler</term>
   <rubyText>ssss</rubyText>
</ruby>

Comment 2 Felix Sasaki 2006-02-23 12:06:26 UTC

Hiroki Sato made an interesting proposal at
http://lists.w3.org/Archives/Public/public-i18n-its/2006JanMar/0094.html about
this topic. It might be useful to take this into account during the segmentation
discussion.

Comment 3 Yves Savourel 2006-03-10 22:26:12 UTC

This is a note from Andrzej and Yves:

We discussed the "segmentation/inliness" topic today and we came up with a proposal for it. Here it is:


===1: The name

Since we wanted to avoid 'inline' for its other meanings in the domain of representation/rendering and 'segment' for its meaning in localization, we came up with "Elements within text" as the name for this data category.

===2: The aim

The aim of this data category is to identify the elements that are within text content and do not contain a text node that belongs to a different text unit.

Knowing these elements allows linguistics-related tools to break down the text of the document into meaningful text units. No schema information or programmatic method can detect all cases of such elements.

For example, in the following code:

<p><b>Palouse</b> horses<fn callout="#">A Palouse horse is 
the same as an <b>Appaloosa</b>.</fn> have spotted coats.</p>

The element <b> is the only one to be defined as "within text".

In the following OpenDocument code:

<text:p text:style-name="Standard">
 Palouse horses
 <text:note text:id="ftn1" text:note-class="footnote">
  <text:note-citation>1</text:note-citation>
  <text:note-body>
   <text:p text:style-name="Footnote"> A Palouse horse is the same as 
an Appaloosa.</text:p>
  </text:note-body>
 </text:note>
 have spotted coats.</text:p>

None of the elements is to be defined as "within text".

The processing expectation for this data category is to break down the text of a document into separate text units where: a) any element identified as 'within text' remains with its enclosing text, and b) any other element is removed or left in the form of a placeholder.
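
The processing expectation above can be sketched roughly as follows. This is only an illustration, not part of the proposal: the function name text_units, the "{tag}" placeholder notation, and the choice to emit the content of a non-within-text element as its own later unit are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def text_units(root, within_text):
    """Split the text under `root` into text units (hypothetical helper).

    Elements whose tag is in `within_text` stay inline in the enclosing
    unit; any other element is replaced by a "{tag}" placeholder and its
    own content is emitted afterwards as a separate unit.
    """
    units = []

    def content(elem, pending):
        # Build the text of one unit, collecting elements that start new units.
        parts = [elem.text or ""]
        for child in elem:
            if child.tag in within_text:
                inner = content(child, pending)
                parts.append("<%s>%s</%s>" % (child.tag, inner, child.tag))
            else:
                parts.append("{%s}" % child.tag)  # assumed placeholder notation
                pending.append(child)
            parts.append(child.tail or "")
        return "".join(parts)

    def walk(elem):
        pending = []
        unit = " ".join(content(elem, pending).split())  # normalize whitespace
        if unit:
            units.append(unit)
        for child in pending:
            walk(child)

    walk(root)
    return units

doc = ET.fromstring(
    '<p><b>Palouse</b> horses<fn callout="#">A Palouse horse is '
    'the same as an <b>Appaloosa</b>.</fn> have spotted coats.</p>'
)
units = text_units(doc, {"b"})
for unit in units:
    print(unit)
```

With only <b> declared as within text, the paragraph and the footnote come out as two units, and the footnote leaves a placeholder in the paragraph.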

===3: ITS Markup

We came up with two possible ways to encode this information in ITS: one using XPath expressions, the other using a list of element names.

With XPath:

<its:documentRules>
 <its:withinTextRule its:selector="//em" its:withinText="yes" />
 <its:withinTextRule its:selector="//strong" its:withinText="yes" />
 ...
</its:documentRules>

With list:

<its:documentRules>
 <its:withinTextRule its:list="em strong..." its:withinText="yes" /> 
</its:documentRules>


-- Yves favors using the list (but could live with the selector):

Using XPath would force an implementation (at least a DOM-based one) to decorate the document in order to know whether an element is "within text" or not when traversing the document tree. There is no easy or inexpensive way to know whether a given element matches an XPath expression when accessing the tree directly. Since we have not so far been able to come up with cases where an element would be "within text" or not depending on its context, using XPath does not seem as justified here as it is in other data categories.

-- Andrzej prefers XPath:

It provides more control and might well be required in certain conditions. One can imagine that there could well be situations where an element is 'within text' in one context, and not in another, so XPath provides the maximum flexibility.
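
To make the trade-off between the two options concrete, here is a rough sketch, assuming Python's xml.etree.ElementTree. The helper name decorate is hypothetical, and ElementTree only supports a subset of XPath, so "//em" has to be written ".//em" relative to the root element. With selectors, the matches are computed up front and looked up during traversal; with a list, each element is tested by name as it is visited.

```python
import xml.etree.ElementTree as ET

def decorate(root, selectors):
    """Pre-compute the set of elements matched by any selector (hypothetical helper)."""
    matched = set()
    for selector in selectors:
        matched.update(root.findall(selector))
    return matched

doc = ET.fromstring(
    "<doc><p>Some <em>text</em> and <strong>more</strong>.</p></doc>"
)

# Selector approach: one pass per rule to "decorate", then a cheap lookup.
within = decorate(doc, [".//em", ".//strong"])
tags = [el.tag for el in doc.iter() if el in within]

# List approach: no decoration step, just a name test while traversing.
names = {"em", "strong"}
by_list = [el.tag for el in doc.iter() if el.tag in names]

print(tags, by_list)
```

Both approaches agree for this document; they only diverge once a selector depends on context, which a plain list of names cannot express.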

-Andrzej and Yves

Comment 4 Yves Savourel 2006-03-11 02:06:58 UTC

OK, I'm revising my opinion on not needing XPath. I've just found some DocBook elements that are sometimes within text and sometimes not: <firstname>, <surname>, <lineage>, and <othername> are both in:

<personname> which has the following content model:
personname ::=
((honorific|firstname|surname|lineage|othername)+)

and are also in <address>, which has the following content model:
address ::=
(#PCDATA|personname|honorific|firstname|surname|lineage|othername|
affiliation|authorblurb|contrib|street|pob|postcode|city|state|
country|phone|fax|email|otheraddr)*

And I suspect there are many more like this in DocBook. So I suppose that makes the use of its:selector a must in <its:withinTextRule> (and DocBook a good candidate to test any implementation to the limits).

Comment 5 Felix Sasaki 2006-03-13 04:52:42 UTC

Just a proposal for the simplification of the rules: how about
<its:documentRules>
 <its:withinTextRule its:selector="//em"/>
 <its:withinTextRule its:selector="//strong"/>
 ...
</its:documentRules>
instead of
<its:documentRules>
 <its:withinTextRule its:selector="//em" its:withinText="yes" />
 <its:withinTextRule its:selector="//strong" its:withinText="yes" />
 ...
</its:documentRules>
The name of the element <its:withinTextRule> already denotes that everything selected has the value "yes". A similar simplification might be useful for the global rule of the terminology data category.

Comment 6 Yves Savourel 2006-03-13 05:10:31 UTC

I thought about such a simplification at some point, but then I tried to get a list of 'element within text' for DocBook... There are cases where some elements can be within text or not depending on the context, which (I think) means one may have to overwrite a previous rule when going through the declarations (a bit like for translatability). So I'm not sure we could simplify it.

Maybe the terminology rule could be simplified like that?

Comment 7 Felix Sasaki 2006-03-13 05:19:54 UTC

(In reply to comment #6)
> I thought about such simplification at some point, but then I tried to get a
> list of 'element within text' for DocBook... There are cases where some
> elements can be within text or not depending on the context, which (I think)
> means one may have to overwrite a previous rule when going through declaration
> (a bit like for translatability). So I'm not sure if we could simplified.

Of course, for processing you need to assume assigned values, e.g. withinText="yes". But since there is only one value for this data category (unlike translatability, which has "yes" versus "no"), I thought there was no need for a user to write an attribute. A processor can "assume" this value just from the name of the <its:withinText> element. That is, the override you are describing can work relying only on this name.

> 
> Maybe the terminology rule could be simplified like that?
> 
I think so.

Comment 8 Yves Savourel 2006-03-13 22:08:41 UTC

> But since there is only one value for this data category
> (which is different from translatability "yes" versus "no"),

I'm not sure about this. I think we might need two values: "yes" and "no".
For example, in DocBook the element <firstname> is within text when its parent is <address> but not within text when its parent is <personname>. So we would have something like:

<its:withinText its:selector="//address/*" its:withinText="yes"/>
<its:withinText its:selector="//personname/*" its:withinText="no"/>

To be sure to set the right value to <firstname> (and the many elements like <firstname>).

One could do it one by one obviously:

<its:withinText its:selector="//address/firstname" its:withinText="yes"/>
etc... (omitting the parents with mixed content)

but it would make things more complicated (I think).
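
A rough sketch of how the two wildcard rules might be applied, assuming Python's xml.etree.ElementTree (so the selectors are written relative to the root, with a leading "."). The function name apply_rules, the sample document, and the convention that a later rule overrides an earlier one for the same element are assumptions for illustration, modeled on the translatability-style override mentioned earlier in this thread.

```python
import xml.etree.ElementTree as ET

# The two rules from the comment above, as (selector, value) pairs.
RULES = [
    (".//address/*", "yes"),
    (".//personname/*", "no"),
]

def apply_rules(root, rules):
    """Map each selected element to its withinText value (hypothetical helper).

    Rules are applied in order, so a later rule overrides an earlier one
    whenever both select the same element (an assumed precedence).
    """
    values = {}
    for selector, value in rules:
        for el in root.findall(selector):
            values[el] = value
    return values

# Made-up DocBook-like fragment: <firstname> appears under both parents.
doc = ET.fromstring(
    "<doc><address>Contact <firstname>Ann</firstname> "
    "<surname>Lee</surname>, care of "
    "<personname><firstname>Bob</firstname><surname>Ray</surname></personname>"
    "</address></doc>"
)

values = apply_rules(doc, RULES)
summary = {(el.tag, el.text): value for el, value in values.items()}
for key, value in summary.items():
    print(key, value)
```

The same element name ends up with different values depending on its parent, which is the case a plain list of element names cannot cover.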

Comment 9 Felix Sasaki 2006-03-14 02:22:44 UTC

(In reply to comment #8)
> > But since there is only one value for this data category
> > (which is different from translatability "yes" versus "no"),
> 
> I'm not sure about this. I think we might need two values: "yes' and "no".
> for example in DocBook the element <firstname> is within text when its parent
> is <address> but not within text when its parent is <personname>. So we would
> have something like:
> 
> <its:withinText its:selector="//address/*" its:withinText="yes"/>
> <its:withinText its:selector="//personname/*" its:withinText="no"/>

It took some time, but now I understand.

> 
> To be sure to set the right value to <firstname> (and the many elements like
> <firstname>).
> 
> One could do it one by one obviously:
> 
> <its:withinText its:selector="//address/firstname" its:withinText="yes"/>
> etc... (omitting the parents with mixed content)
> 
> but it would make things more complicated (I think).

I think your proposal with "yes" and "no" looks good. Btw., is this only a "global" data category? Or do we also need a local attribute its:withinText?
>

Comment 10 Yves Savourel 2006-03-14 03:36:17 UTC

Andrzej and I were thinking it would be a global data category only.

Comment 11 Yves Savourel 2006-03-15 16:21:01 UTC

*** Bug 2968 has been marked as a duplicate of this bug. ***

Comment 12 Yves Savourel 2006-03-15 16:24:29 UTC

Noting Andrzej's comments at the beginning of the teleconference today:

It seems we are missing an attribute in <its:withinTextRule>: one that would specify whether the element contains a text run separate from the parent text run (like a footnote inside a paragraph).

Comment 13 Yves Savourel 2006-03-15 23:31:56 UTC

Hi Andrzej,

I agree with you that ITS should be able to specify the "subflow" elements (elements that contain an independent text run, like a footnote).

But my understanding was that it was addressed by not listing such elements in the withinText list and relying on the process to identify them.

As I see it (so far) we can do this because when an element can be "subflow" it does not matter whether it is inside a text run or outside, from a processing view point there is little or no difference.

Maybe examples would make my thoughts clearer. I start with these principles:

- If an element is set as being "within text" it goes (along with its content) with the text unit where it occurs.

- If an element is not set as being "within text", it is either an element with no text content, or an element with text content that is a subflow. And in both cases you have to do (almost) the same.

My thought is that an its:subflow attribute would work, but also would force us to have more rules, while that same information could be gathered during the process.

For example, in XHTML an <li> element can contain PCDATA and %Flow; which includes things like <p> and <b>.
That means we can have:

<li>Some <b>text</b> and some more. <p>And some more text</p></li>

<b> is withinText, no issue. But <p> is also found outside of <li> and should be treated as not inline then. And when inside <li> it should be treated as subflow. So we certainly could do something like:

<its:documentRules>
 <its:withinText its:selector="//b" its:withinText="yes"/>
 <its:withinText its:selector="//p" its:withinText="no"/>
 <its:withinText its:selector="//li/p" its:withinText="yes" its:subflow="yes"/>
 ...
</its:documentRules>

But the number of rules and overrides you have to write gets out of control very quickly: the case of <p> also applies to <h1> and many more elements (and it can be worse if you move to formats such as DocBook).

On the other hand, the difference between handling "//p" and "//li/p" is minimal: in both cases you start a new text unit. The only difference is whether it's a subflow text unit or not, and that can be detected by checking whether the parent of <p> has text nodes.

Thinking about all the normal withinText cases like <b>, someone is going to ask me:

"Then why can't you replace withinText by simply checking if the parent has text nodes? Like for <p>? No need for that data category then."

That is because there are cases where it cannot be detected:

<li><b>Some text </b><i>Some text</i></li>

Then, Andrzej you are going to say:

"Then it means in cases like that <p> as a subflow cannot be detected either!"

And I'm going to say:

Yes, like here:

<li><p>Some text </p><p>Some text</p></li>

But it is OK because if such a case is treated as a normal text run instead of a subflow, it does not matter. For <b> and <i>, on the other hand, the order of the elements may need to be changed during translation, and therefore it is important that <b> is detected as withinText in that case.

In other words, in the case of an "inline" element that is "subflow": I think the "subflowness" information can be obtained by simply not listing the element as withinText, and the fact that such an element is actually "inline" can be detected by simply knowing whether or not you are already within a text run for that element.

All this obviously needs to be validated by implementations...
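
The detection described above might look roughly like this, assuming Python's xml.etree.ElementTree. The helper names parent_has_text and classify and the three labels are made up for the sketch; it only inspects the direct children of one parent.

```python
import xml.etree.ElementTree as ET

def parent_has_text(parent):
    """True if `parent` has a non-whitespace text node of its own."""
    if (parent.text or "").strip():
        return True
    return any((child.tail or "").strip() for child in parent)

def classify(parent, within_text):
    """Label each child of `parent` (hypothetical helper and labels)."""
    in_text_run = parent_has_text(parent)
    result = []
    for child in parent:
        if child.tag in within_text:
            kind = "within text"
        elif in_text_run:
            kind = "subflow"        # new unit nested inside a text run
        else:
            kind = "separate unit"  # new unit, no enclosing text run
        result.append((child.tag, kind))
    return result

WITHIN = {"b", "i"}
mixed = ET.fromstring(
    "<li>Some <b>text</b> and some more. <p>And some more text</p></li>"
)
inline_only = ET.fromstring("<li><b>Some text </b><i>Some text</i></li>")
paras_only = ET.fromstring("<li><p>Some text </p><p>Some text</p></li>")

print(classify(mixed, WITHIN))
print(classify(inline_only, WITHIN))
print(classify(paras_only, WITHIN))
```

In the second case <li> has no text nodes of its own, so only the declared list identifies <b> and <i> as within text; in the third, the two <p> elements are simply treated as ordinary text units rather than subflows.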
Working on it.

Cheers,
-yves

Comment 14 Felix Sasaki 2006-06-07 05:55:52 UTC

This bug was discussed before the last call publication, so I'm closing it now. Please do not reopen it, even if you have something specific on "element within text". Instead, I propose opening new bugs, so that we can separate last call comments from earlier discussions.