20272 – Word Boundaries (Hyphenation) in Indian languages (UAX#29) Text Segmentation

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	NEW

Alias:	None

Product:	CSS
Classification:	Unclassified
Component:	Text
Version:	unspecified
Hardware:	PC Windows XP

Importance:	P2 major
Target Milestone:	---
Assignee:	fantasai
QA Contact:	public-css-bugzilla

URL:	http://w3cindia.in/ABNFValidSegmentat...
Whiteboard:
Keywords:	needsAction

Depends on:
Blocks:

Reported:	2012-12-06 12:08 UTC by Naitik Tyagi
Modified:	2012-12-06 12:08 UTC
CC List:	4 users

See Also:

Attachments
complete description of this issues (946.51 KB, application/pdf) 2012-12-06 12:08 UTC, Naitik Tyagi

Description Naitik Tyagi 2012-12-06 12:08:37 UTC

Created attachment 1261
complete description of this issues

Word Boundaries (Hyphenation): 
Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection, or “move to next word” control-arrow keys), and “Whole Word Search” for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another.

Recommended solution: ABNF Valid Segmentation and hyphenation dictionary (if available)
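To illustrate the "Whole Word Search" use case described above, here is a minimal Python sketch. It does not implement the ABNF Valid Segmentation from the attachment. It only assumes that a word character is any letter, combining mark, or decimal digit, so that Indic vowel signs and the virama are not mistaken for word boundaries. The names is_word_char and find_whole_word are hypothetical.

```python
import unicodedata


def is_word_char(ch: str) -> bool:
    # Assumption: letters (L*), combining marks (M*) and decimal digits (Nd)
    # are word content. Vowel signs and the virama fall under M*.
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd"


def find_whole_word(text: str, word: str) -> list[int]:
    """Return the start offsets where `word` occurs as a whole word."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits


if __name__ == "__main__":
    # "राम" on its own, followed by "रामायण", which only starts with "राम".
    sample = "\u0930\u093E\u092E \u0930\u093E\u092E\u093E\u092F\u0923"
    print(find_whole_word(sample, "\u0930\u093E\u092E"))
```

A match is rejected when the character right after it is a vowel sign, because that sign continues the same word.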

Sentence Boundaries
Recommended solution: Treat some special marks as sentence boundaries, such as the double poorna virama, possibly accompanied by numbers (as in Sanskrit text, shlokas, etc.).
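The sentence-boundary suggestion above can be sketched in Python as follows. This is only an illustration under stated assumptions: the double poorna virama is taken to be U+0965 DEVANAGARI DOUBLE DANDA, the single danda U+0964 is also treated as a boundary, and a verse number is assumed to appear between two double dandas (as in ॥१॥). The function name split_sentences is hypothetical.

```python
import re

# U+0964 DEVANAGARI DANDA, U+0965 DEVANAGARI DOUBLE DANDA,
# U+0966..U+096F DEVANAGARI DIGITS (ASCII digits are accepted too).
SENTENCE_END = re.compile(
    r"(?:\u0965\s*[\u0966-\u096F0-9]+\s*\u0965"  # verse number between double dandas
    r"|\u0965"                                   # bare double danda
    r"|\u0964)"                                  # single danda
)


def split_sentences(text: str) -> list[str]:
    """Split text after each danda-based terminator (assumed convention)."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        piece = text[start:match.end()].strip()
        if piece:
            sentences.append(piece)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    # Two placeholder verses, each closed by a numbered double danda.
    for sentence in split_sentences("\u0915 \u0965\u0967\u0965 \u0916 \u0965\u0968\u0965"):
        print(sentence)
```

The numbered form is listed first in the pattern so that the number stays attached to the verse it closes instead of starting the next sentence.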
A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user. 

Solution

Grapheme Cluster Boundaries: Based on ABNF Valid Segmentation, with a possible extension for handling some cases (?)
Deletion and backspace: By code point as well as by ABNF Valid Segmentation unit.
Mouse Selection: At the ABNF Valid Segmentation level and at the code point level.
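The three items above can be illustrated with a rough Python sketch. The ABNF grammar itself is in the attachment and is not reproduced here, so the pattern below is a simplified, Devanagari-only stand-in and an assumption: a segment is a consonant cluster joined by viramas with an optional vowel sign or final virama, or an independent vowel, followed by optional modifiers such as anusvara. All names (SEGMENT, segments, boundaries, backspace_code_point, backspace_segment) are hypothetical.

```python
import re

# Simplified character classes for Devanagari (assumed, not the attachment's ABNF).
CONSONANT = r"[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant plus optional nukta
VIRAMA = "\u094D"
VOWEL_SIGN = r"[\u093E-\u094C\u0962\u0963]"
INDEPENDENT_VOWEL = r"[\u0904-\u0914]"
MODIFIER = r"[\u0900-\u0903]"                        # candrabindu, anusvara, visarga

SEGMENT = re.compile(
    rf"(?:{CONSONANT}(?:{VIRAMA}{CONSONANT})*(?:{VIRAMA}|{VOWEL_SIGN})?"
    rf"|{INDEPENDENT_VOWEL}){MODIFIER}*"
    r"|.",                                           # fallback: one code point
    re.DOTALL,
)


def segments(text: str) -> list[str]:
    """Split text into syllable-like segments."""
    return [m.group(0) for m in SEGMENT.finditer(text)]


def boundaries(text: str) -> list[int]:
    """Code point offsets where a selection may start or end."""
    offsets = [0]
    for seg in segments(text):
        offsets.append(offsets[-1] + len(seg))
    return offsets


def backspace_code_point(text: str) -> str:
    # Code-point-wise deletion: removes only the last code point.
    return text[:-1]


def backspace_segment(text: str) -> str:
    # Segment-wise deletion: removes the whole last segment.
    return "".join(segments(text)[:-1])


if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # क्षत्रिय
    print(segments(word))
    print(boundaries(word))
```

Offering both backspace functions mirrors the proposal that deletion should work at the code point level as well as at the segment level. Which one an editor binds to the backspace key is a design choice this sketch leaves open.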