17258 – can range object support text/word/sentence based range setting?

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	NEW

Alias:	None

Product:	WebAppsWG
Classification:	Unclassified
Component:	HISTORICAL - HTML Editing APIs
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Aryeh Gregor
QA Contact:	HTML Editing APIs spec bugbot

Reported:	2012-05-31 02:05 UTC by Yang Sun
Modified:	2012-12-04 03:11 UTC

Description Yang Sun 2012-05-31 02:05:32 UTC

Hi, we are developing a web-based programming tool.
Syntax highlighting and auto-complete are features we want in the tool.
But we find it hard and inflexible to implement them with the existing node-based Range methods setStart(node, offset) and setEnd(node, offset).

Can we add another style of method, one that is already supported in IE, similar to moveStart("character", 1) or moveStart("word", 2)? It would give us more flexibility in our development.

What's more, can we let the Range object support an execCommand method to set the font, color, background, etc. directly, without a separate call to document.execCommand?

Looking forward to your feedback.
Thanks.
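
To illustrate the bookkeeping that a character-based method such as moveStart("character", n) would hide from authors, here is a minimal sketch. It assumes the text nodes under the editor root have already been collected in document order (for example with a TreeWalker); the helper name locateCharOffset and the plain string array standing in for the text nodes are illustrative assumptions, not part of any Range API.

```javascript
// Hypothetical helper: map a character offset within the concatenated text
// to the index of the chunk that contains it and an offset inside that chunk.
// `chunks` stands in for the data of the text nodes in document order.
function locateCharOffset(chunks, charOffset) {
  if (charOffset < 0) {
    throw new RangeError("charOffset must be non-negative");
  }
  let remaining = charOffset;
  for (let i = 0; i < chunks.length; i++) {
    const length = chunks[i].length;
    // An offset equal to the chunk length is the position just after
    // the chunk's last character, which is still a valid boundary.
    if (remaining <= length) {
      return { index: i, offset: remaining };
    }
    remaining -= length;
  }
  throw new RangeError("charOffset is past the end of the text");
}

// Text of three adjacent text nodes: "int ", "main", "() {}".
const chunks = ["int ", "main", "() {}"];
console.log(locateCharOffset(chunks, 5)); // → { index: 1, offset: 1 }
console.log(locateCharOffset(chunks, 9)); // → { index: 2, offset: 1 }
```

In a browser, the returned index would pick a text node and the pair would be passed to range.setStart(node, offset) or range.setEnd(node, offset). The sketch ignores everything that makes the real problem hard, such as collapsed whitespace, line breaks implied by block elements, and invisible text.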

Comment 1 Anne 2012-05-31 08:14:12 UTC

Any particular reason why this cannot be done using selection? http://dvcs.w3.org/hg/editing/raw-file/tip/editing.html

Comment 2 Aryeh Gregor 2012-05-31 11:45:03 UTC

You can only have one Selection per page, which has to correspond to the user-visible selection.  So it's not that great for this sort of thing.  Moving this kind of thing to Range is probably a good idea in principle -- although it means Range would no longer be a pure, simple DOM thing.  We definitely want to be able to call execCommand() on arbitrary Ranges.

Since this is more or less doable right now using Selection, though, I wouldn't say it's a high priority.

Comment 3 Anne 2012-05-31 12:15:18 UTC

I suggest we add such extensions to Editing then at some point in the future.

Comment 4 Tim Down 2012-05-31 12:32:21 UTC

(In reply to comment #2)
> You can only have one Selection per page, which has to correspond to the
> user-visible selection.  So it's not that great for this sort of thing.  Moving
> this kind of thing to Range is probably a good idea in principle -- although it
> means Range would no longer be a pure, simple DOM thing.  We definitely want to
> be able to call execCommand() on arbitrary Ranges.
> 
> Since this is more or less doable right now using Selection, though, I wouldn't
> say it's a high priority.

Presumably you're referring just to the execCommand() part? I agree that isn't high priority. However, I think the TextRange stuff is important. There is nothing that comes close to doing the job in non-IE browsers.

For what it's worth, I've recently implemented character and word-based TextRange-like functionality for my Rangy library. See http://code.google.com/p/rangy/wiki/TextRangeModule and http://rangy.googlecode.com/svn/trunk/demos/textrange.html

Comment 5 Aryeh Gregor 2012-05-31 14:56:04 UTC

(In reply to comment #4)
> Presumably you're referring just to the execCommand() part? I agree that isn't
> high priority. However, I think the TextRange stuff is important. There is
> nothing that comes close to doing the job in non-IE browsers.

The same general idea is covered by the non-standard Selection.modify, isn't it?  That's not high-priority for me because it's a nightmare to spec properly, although I agree it would be valuable.

Comment 6 Tim Down 2012-05-31 15:30:52 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > Presumably you're referring just to the execCommand() part? I agree that isn't
> > high priority. However, I think the TextRange stuff is important. There is
> > nothing that comes close to doing the job in non-IE browsers.
> 
> The same general idea is covered by the non-standard Selection.modify, isn't
> it?  That's not high-priority for me because it's a nightmare to spec properly,
> although I agree it would be valuable.

Selection.modify is hopeless. It works differently in WebKit and Mozilla (definitions of which collections of characters constitute a word, which granularities are supported, whether a word includes its trailing space) and is awkward and unintuitive to use.

For example, it's next to impossible to expand a selection to encompass whole words because it only works with the selection's focus and forces you to do tedious tests to work out if a selection boundary is already at a word boundary. This seems to me exactly the kind of thing it should do easily but doesn't. Here's a floundering and broken attempt of mine: http://stackoverflow.com/questions/7380190/select-whole-word-with-getselection. I reckon I could make it work but not without another significant chunk of code.
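
For comparison, the whole-word expansion described above is short once the text is available as a plain string with start and end offsets. The following is a minimal sketch under that assumption; expandToWords and isInsideWord are hypothetical names, and the word definition (ASCII letters, digits, underscore, apostrophe) is an arbitrary choice for English-like text, not what any browser uses.

```javascript
// Assumed word definition for this sketch only. Real engines use far more
// elaborate, locale-aware rules.
const WORD_CHAR = /[A-Za-z0-9_']/;

// True when the offset lies strictly inside a word, i.e. the characters on
// both sides of it are word characters.
function isInsideWord(text, offset) {
  return (
    offset > 0 &&
    offset < text.length &&
    WORD_CHAR.test(text[offset - 1]) &&
    WORD_CHAR.test(text[offset])
  );
}

// Hypothetical helper: move each boundary outwards until it no longer sits
// inside a word. Boundaries already at a word edge are left alone.
function expandToWords(text, start, end) {
  while (isInsideWord(text, start)) start--;
  while (isInsideWord(text, end)) end++;
  return { start, end };
}

const text = "Don't ask me";
const expanded = expandToWords(text, 2, 8); // the selection "n't as"
console.log(text.slice(expanded.start, expanded.end)); // → "Don't ask"
```

Both boundaries are handled symmetrically here, which is the part that is awkward with an API that only moves the selection's focus.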

TextRange's move(), moveStart(), moveEnd() and expand() methods are a miracle of clear thinking in comparison.

Comment 7 Ryosuke Niwa 2012-05-31 16:38:09 UTC

(In reply to comment #6)
> Selection.modify is hopeless. It works differently in WebKit and Mozilla
> (definitions of which collections of characters constitute a word, which
> granularities are supported, whether a word includes its trailing space) and is
> awkward and unintuitive to use.

This is from necessity. Different operating systems have different conventions here, and we have to follow the platform. If we're spec'ing something to be consistent here, then we're breaking the platform convention.

Also, including/not including whitespace after/before a word is a somewhat dubious concept because some languages such as CJK don't use spaces as a word delimiter. We can only heuristically determine word boundary in those languages so this whole idea about agreeing on one and exactly one behavior is flawed at least in those languages.

Comment 8 Tim Down 2012-05-31 19:53:54 UTC

(In reply to comment #7)
> (In reply to comment #6)
> > Selection.modify is hopeless. It works differently in WebKit and Mozilla
> > (definitions of which collections of characters constitute a word, which
> > granularities are supported, whether a word includes its trailing space) and is
> > awkward and unintuitive to use.
> 
> This is from necessity. Different operating systems have different conventions
> here, and we have to follow the platform. If we're spec'ing something to be
> consistent here, then we're breaking the platform convention.
> 
> Also, including/not including whitespace after/before a word is a somewhat
> dubious concept because some languages such as CJK don't use spaces as a word
> delimiter. We can only heuristically determine word boundary in those languages
> so this whole idea about agreeing on one and exactly one behavior is flawed at
> least in those languages.

I understand and agree with all of that. However, cross-browser differences in simple cases in English, which I do appreciate is just one of many languages but is my native tongue, make Selection.modify a frustrating thing to use (in addition to the other problems I've mentioned). Further, it hasn't been spec'ed yet so Microsoft won't be implementing it in IE, and last time I looked Opera didn't have it either, so it's not really an option for authors at the moment. It seems that it would have to be spec'ed for it to find its way into browsers and become usable, in which case I'd much rather see the effort go into TextRange-y extensions to Range.

Comment 9 Ryosuke Niwa 2012-05-31 20:12:29 UTC

(In reply to comment #8)
> I understand and agree with all of that. However, cross-browser differences in
> simple cases in English, which I do appreciate is just one of many languages
> but is my native tongue, make Selection.modify a frustrating thing to use (in
> addition to the other problems I've mentioned).

As I mentioned earlier, this is from necessity. Windows/Linux and Mac have different conventions regarding whether whitespace should be included or not when moving to the left or to the right across a word IN ENGLISH.

Comment 10 Tim Down 2012-05-31 21:05:07 UTC

(In reply to comment #9)
> (In reply to comment #8)
> > I understand and agree with all of that. However, cross-browser differences in
> > simple cases in English, which I do appreciate is just one of many languages
> > but is my native tongue, make Selection.modify a frustrating thing to use (in
> > addition to the other problems I've mentioned).
> 
> As I mentioned earlier, this is from necessity. Windows/Linux and Mac have
> different conventions regarding whether whitespace should be included or not
> when moving to the left or to the right across a word IN ENGLISH.

Yes, I know, and I understand. I'm talking about differences between browsers, not platforms.

Comment 11 Tim Down 2012-05-31 21:18:32 UTC

(In reply to comment #10)
> (In reply to comment #9)
> > (In reply to comment #8)
> > > I understand and agree with all of that. However, cross-browser differences in
> > > simple cases in English, which I do appreciate is just one of many languages
> > > but is my native tongue, make Selection.modify a frustrating thing to use (in
> > > addition to the other problems I've mentioned).
> > 
> > As I mentioned earlier, this is from necessity. Windows/Linux and Mac have
> > different conventions regarding whether whitespace should be included or not
> > when moving to the left or to the right across a word IN ENGLISH.
> 
> Yes, I know, and I understand. I'm talking about differences between browsers,
> not platforms.

Also, I'm not just talking about the white space issue. A simple example of different implementations is apostrophes, which terminate the word in Mozilla but not in WebKit.

<div id="editable" contenteditable="true">Don't ask</div>

// Place the caret after the "D", then extend the selection forward by one word.
var textNode = document.getElementById("editable").firstChild;
var sel = window.getSelection();
sel.collapse(textNode, 1);
sel.modify("extend", "forward", "word");

In WebKit in Windows, "on't" is selected. In Firefox in Windows, only "on'" is selected.

Comment 12 Ryosuke Niwa 2012-05-31 21:34:18 UTC

(In reply to comment #11)
> Also, I'm not just talking about the white space issue. A simple example of
> different implementations is apostrophes, which terminate the word in Mozilla
> but not in WebKit.

This is analogous to the problem that word boundaries are not well defined in CJK. In WebKit's case, we rely on ICU or some other i18n layer available in each port (or platform) to detect word boundaries. In fact, different WebKit-based browsers use different versions of ICU or an equivalent library. So for example, Safari, which uses OS X's ICU, treats each character in "日本語は" as a word (4 words) on Snow Leopard, whereas Chrome on Mac treats "日本語" and "は" as words (2 words in total). As far as I know, Firefox (Gecko) uses its own library (doesn't even use ICU) to do this kind of word-boundary detection. And I don't think we can agree on a single word boundary detection algorithm here.

Comment 13 Ehsan Akhgari [:ehsan] 2012-05-31 23:09:21 UTC

Exposing the platform conventions to web APIs is a bad idea for almost all cases.  The only use case that I've heard to make selection.modify() behave differently on different platforms is to make it possible for a web page to imitate user actions, which I do not find compelling.

We should spec something that is platform agnostic, and well defined.  Relying on random platform conventions (which may be non-existent, e.g. on Android, among others), different versions of ICU, etc. is an extremely bad idea.

FWIW, Gecko used to have a platform dependent behavior for selection.modify(), and I changed it to act consistently across platforms (which matched WebKit back then), and then later on WebKit changed their behavior to be platform dependent.  I find that a regression in the quality of the API, and I have no intention to make selection.modify behave differently on different platforms again in Gecko.  The same goes for any possible future APIs for this kind of thing.

Comment 14 Ryosuke Niwa 2012-05-31 23:25:05 UTC

(In reply to comment #13)
> Exposing the platform conventions to web APIs is a bad idea for almost all
> cases.  The only use case that I've heard to make selection.modify() behave
> differently on different platforms is to make it possible for a web page to
> imitate user actions, which I do not find compelling.

This is perhaps a difference in philosophy. While Gecko tries to have one behavior on all platforms for various things (usually favoring Windows convention), WebKit tries to match the platform convention as much as possible in various features such as hit testing and selection API.

While I agree that we should spec API to be platform agnostic as much as possible, certain features are inherently platform dependent and we can't force users to follow the "Web" convention. That's confusing at best for users who are used to a certain platform convention. e.g. selection made by a mouse drag on Mac is directionless. Making it not directionless will be inconsistent with the platform, and will worsen the user experience.

> We should spec something that is platform agnostic, and well defined.  Relying
> on random platform conventions (which may be non-existent, e.g. on Android,
> among others), different versions of ICU, etc. is an extremely bad idea.

I don't think it's realistic to agree on one behavior here, particularly for CJK. Most text segmentation algorithms use heuristics in CJK and depend largely on the corpus size (which can be adjusted in the case of ICU). Also note that we can't just improve the algorithm for English because detecting that the user is typing English is itself hard.

Comment 15 Ryosuke Niwa 2012-05-31 23:33:58 UTC

I'll further note that it's common for English words or phrases to appear within a Japanese sentence. e.g.

"World Wide Web Consortium の設立は、今日のインターネットの基礎技術を確立しそれを無償で公開したティム・バーナーズ＝リーの努力によるところが大きい。彼は、欧州原子核研究機構(CERN)における中心的な活動にも係わってきた。" (source: http://ja.wikipedia.org/wiki/W3C)

Of course, if someone can come up with a text segmentation algorithm that correctly detects word boundaries for all languages on the Web, I'm more than happy to accept it as the standard algorithm, but I believe that's impossible because doing so in CJK appears to require complete knowledge of all vocabularies in existence.

Comment 16 Ehsan Akhgari [:ehsan] 2012-05-31 23:34:45 UTC

(In reply to comment #14)
> (In reply to comment #13)
> > Exposing the platform conventions to web APIs is a bad idea for almost all
> > cases.  The only use case that I've heard to make selection.modify() behave
> > differently on different platforms is to make it possible for a web page to
> > imitate user actions, which I do not find compelling.
> 
> This is perhaps a difference in philosophy. While Gecko tries to have one
> behavior on all platforms for various things (usually favoring Windows
> convention), WebKit tries to match the platform convention as much as possible
> in various features such as hit testing and selection API.

That is not true at all.  We do our best to match platform conventions as far as user visible behavior is concerned, except for where we have bugs.  A notable example is painting selection background, where we fail to imitate the platform convention on Mac (and WebKit fails to do that on non-Mac platforms ;-)

Now, let's remember that we're talking about APIs on the web.  I can't think of any specced web API which has platform dependent behavior.  So I would argue that this is about more than just a difference in philosophy.

> While I agree that we should spec API to be platform agnostic as much as
> possible, certain features are inherently platform dependent and we can't force
> users to follow the "Web" convention. That's confusing at best for users who
> are used to a certain platform convention. e.g. selection made by a mouse drag
> on Mac is directionless. Making it not directionless will be inconsistent with
> the platform, and will worsen the user experience.

The directionality of the selection is a bad example here.  Given <span>|foo bar</span>, what should the result of getSelection().modify("move", "forward", "word") be?  Asking the user to check the user agent string for hints of the platform (which is pretty much the only way for a web app to answer this question with the current WebKit implementation) is tough to argue for.  ;-)

However, for selection directionality, one could expose an API which will tell the web app whether the selection is directionless by default.

> > We should spec something that is platform agnostic, and well defined.  Relying
> > on random platform conventions (which may be non-existent, e.g. on Android,
> > among others), different versions of ICU, etc. is an extremely bad idea.
> 
> I don't think it's realistic to agree on one behavior here particularly for
> > CJK. Most text segmentation algorithms use heuristics in CJK and depend
> largely on the corpus size (which can be adjusted in the case of ICU). Also
> note that we can't just improve the algorithm for English because detecting
> that the user is typing English itself is hard.

Maybe we can settle for something easy and fool-proof?  We have already done that for dir=auto at least.  Predictability is much more valuable than preciseness, especially for the types of problems which don't have an optimum solution anyways, such as determining word boundaries.

Comment 17 Ehsan Akhgari [:ehsan] 2012-05-31 23:35:33 UTC

(In reply to comment #15)
> I'll further note that it's common for English words or phrases to appear
> within a Japanese sentence. e.g.
> 
> "World Wide Web Consortium
> の設立は、今日のインターネットの基礎技術を確立しそれを無償で公開したティム・バーナーズ＝リーの努力によるところが大きい。彼は、欧州原子核研究機構(CERN)における中心的な活動にも係わってきた。"
> (source: http://ja.wikipedia.org/wiki/W3C)
> 
> Of course, if someone can come up with a text segmentation algorithm that
> correctly detects word boundaries for all languages on the Web, I'm more than
> happy to accept it as the standard algorithm but I believe that's impossible
> because doing so in CJK appears to require the complete knowledge of all
> vocabularies in existence.

We could also say the same thing for dir=auto and other similar problems, but that won't get anyone anywhere.  :-)

Comment 18 Ehsan Akhgari [:ehsan] 2012-05-31 23:36:07 UTC

(Note that I don't know a lot about CJK languages myself, so I don't know what a simple and fool proof algorithm would look like...)

Comment 19 Ryosuke Niwa 2012-06-01 00:02:55 UTC

(In reply to comment #16)
> Now, let's remember that we're talking about APIs on the web.  I can't think of
> any specced web API which has platform dependent behavior.  So I would argue
> that this is about more than just a difference in philosophy.

Focus, form controls, etc. all have similar platform-dependent behaviors:
http://www.whatwg.org/specs/web-apps/current-work/multipage/editing.html#focus
User agents may track focus for each browsing context or Document individually, or may support only one focused element per top-level browsing context — user agents should follow platform conventions in this regard.

So things like document.activeElement are Web-facing platform-dependent API.

> > While I agree that we should spec API to be platform agnostic as much as
> > possible, certain features are inherently platform dependent and we can't force
> > users to follow the "Web" convention. That's confusing at best for users who
> > are used to a certain platform convention. e.g. selection made by a mouse drag
> > on Mac is directionless. Making it not directionless will be inconsistent with
> > the platform, and will worsen the user experience.
> 
> The directionality of the selection is a bad example here.  Given <span>|foo
> bar</span>, what should the result of getSelection().modify("move", "forward",
> "word") be?

It depends.

> However, for selection directionality, one could expose an API which will tell
> the web app whether the selection is directionless by default.

I'm all for exposing some property on DOMSelection that tells the author whether selection is directionless or not. We've already done this for input/textarea with selectionDirection. I'd imagine adding the same property to DOMSelection won't be controversial. But this is a topic for another bug :)

> > I don't think it's realistic to agree on one behavior here particularly for
> > CJK. Most text segmentation algorithms use heuristics in CJK and depend
> > largely on the corpus size (which can be adjusted in the case of ICU). Also
> > note that we can't just improve the algorithm for English because detecting
> > that the user is typing English itself is hard.
> 
> Maybe we can settle for something easy and fool-proof?

I don't think we can come up with a fool-proof text segmentation algorithm for CJK because there aren't any clear rules to segment text into words as far as I know. e.g. the only reason I know 欧州原子核研究機構 contains 4 words: 欧州 原子核 研究 機構 is because I know 欧州, 原子核, 研究, and 機構 are known words. Similarly, ありがとう (thank you) is a single word whereas ありがいる (there is an ant / there are ants) contains three words: あり, が, いる (note first 3 characters are identical in both phrases).
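
The dependence on vocabulary can be seen in a toy greedy longest-match segmenter. This is only a sketch: the dictionary holds just the words from the examples above, the names DICTIONARY and segment are hypothetical, and it is not the algorithm used by ICU, WebKit, or Gecko.

```javascript
// Toy dictionary containing only the words mentioned in the examples.
const DICTIONARY = new Set([
  "欧州", "原子核", "研究", "機構", "ありがとう", "あり", "が", "いる",
]);
const MAX_WORD_LENGTH = Math.max(...Array.from(DICTIONARY, (w) => w.length));

// Greedy longest match: at each position take the longest dictionary entry
// that matches, and fall back to a single character when nothing matches.
// Assumes every character is one UTF-16 code unit, which holds for these examples.
function segment(text) {
  const words = [];
  let pos = 0;
  while (pos < text.length) {
    let matched = text[pos];
    const longest = Math.min(MAX_WORD_LENGTH, text.length - pos);
    for (let len = longest; len > 1; len--) {
      const candidate = text.slice(pos, pos + len);
      if (DICTIONARY.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    words.push(matched);
    pos += matched.length;
  }
  return words;
}

console.log(segment("欧州原子核研究機構")); // → ["欧州", "原子核", "研究", "機構"]
console.log(segment("ありがとう"));         // → ["ありがとう"]
console.log(segment("ありがいる"));         // → ["あり", "が", "いる"]
```

Any word missing from the dictionary degrades to one segment per character, so the output is only as good as the vocabulary supplied.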

> We have already done that for dir=auto at least.

I wasn't aware of this. Where is this spec'ed?

Comment 20 Tim Down 2012-06-01 00:09:56 UTC

The bug is about extra range functionality. We got onto Selection.modify because Aryeh suggested that it could do the job of TextRange's moveStart() etc. The word boundary identification issue and possibly the white space issue would also exist in TextRange-like extensions to Range; my larger concern is that Selection.modify is hard to use for common use cases and is needlessly bound to the selection rather than ranges.

For an API, I think you could do a lot worse than TextRange's moveStart(), moveEnd() and expand(). expand() in particular is hard to imitate using Selection.modify(). While you're at it, perhaps you could spec findText() as well and kill off window.find(), but that's really a separate issue.

Comment 21 Ehsan Akhgari [:ehsan] 2012-06-01 00:23:56 UTC

Right, I won't talk about selection.modify any more here, sorry for hijacking the bug.  :)  (But I do have more to say if we decide to spec it in the future.)

Comment 22 Yang Sun 2012-06-01 02:13:41 UTC

We can do it using this, but it is more complex; we have tested both methods.
And we are desperate to serve our different users on IE and Firefox (together with WebKit).


(In reply to comment #1)
> Any particular reason why this cannot be done using selection?
> http://dvcs.w3.org/hg/editing/raw-file/tip/editing.html

Comment 23 Yang Sun 2012-06-01 02:18:38 UTC

We are targeting programming languages such as C++ and Python; a web-based programming tool for such languages seems to need to cope only with English-style words.

What we want is just another method on the Range object, not a replacement for the existing node-based mechanism.

I know the node-based mechanism has its advantages...



(In reply to comment #7)
> (In reply to comment #6)
> > Selection.modify is hopeless. It works differently in WebKit and Mozilla
> > (definitions of which collections of characters constitute a word, which
> > granularities are supported, whether a word includes its trailing space) and is
> > awkward and unintuitive to use.
> 
> This is from necessity. Different operating systems have different conventions
> here, and we have to follow the platform. If we're spec'ing something to be
> consistent here, then we're breaking the platform convention.
> 
> Also, including/not including whitespace after/before a word is a somewhat
> dubious concept because some languages such as CJK don't use spaces as a word
> delimiter. We can only heuristically determine word boundary in those languages
> so this whole idea about agreeing on one and exactly one behavior is flawed at
> least in those languages.

Comment 24 Aryeh Gregor 2012-06-01 10:57:15 UTC

(In reply to comment #20)
> For an API, I think you could do a lot worse than TextRange's moveStart(),
> moveEnd() and expand().

The problem is speccing them.  I haven't looked at them in detail, but my bet would be they just hook into whatever code IE happens to use for selection manipulation, which is probably whatever code Windows uses, which probably varies by Windows version/installed languages/etc.  It's easy for any given browser to write such an API, because they have to support concepts of "select a word" and so on anyway, but actually speccing it and getting it interoperable is a totally different story.

Comment 25 Tim Down 2012-06-01 11:14:41 UTC

(In reply to comment #24)
> (In reply to comment #20)
> > For an API, I think you could do a lot worse than TextRange's moveStart(),
> > moveEnd() and expand().
> 
> The problem is speccing them.  I haven't looked at them in detail, but my bet
> would be they just hook into whatever code IE happens to use for selection
> manipulation, which is probably whatever code Windows uses, which probably
> varies by Windows version/installed languages/etc.  It's easy for any given
> browser to write such an API, because they have to support concepts of "select
> a word" and so on anyway, but actually speccing it and getting it interoperable
> is a totally different story.

Yes, I can see that. You'd also have to deal with the dreaded innerText / selection stringifier issue. For some use cases, interoperability is less important than just having an API. Just some way for script to hook into the browser's existing ability to identify visible text and to split it into words and sentences.

Comment 26 Alexey Proskuryakov 2012-06-01 17:39:20 UTC

I agree with Ryosuke's concerns.

The point of having an API would be to enable applications that build compound algorithms on top of it. In WebKit, we have some experience with building editing algorithms on top of slightly different and platform specific building blocks, and it's not pretty. Minor differences get exaggerated to serious bugs when the definition of "word boundary" changes even a little.

And if the API is cross-browser and cross-platform, then I don't see a use case for it. What's good about a function that manipulates words in a manner that disagrees with the rest of the browser engine, and with user expectations as well?

Returning to the original bug report, a web based programming tool needs custom linguistic algorithms anyway. Matching default platform behavior is not helpful, because programming languages have a different structure and conventions. For example, a period generally creates a word boundary in programming languages ("window.navigator" is two words), but not in English ("U.S.A." is one word for double-clicking). IDEs also tend to have unique concepts that do not exist in natural language editors (e.g. sub-word boundaries in CamelCase or unix_style identifiers for both spell checking and selection).
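
As a rough sketch of such custom rules, the two helpers below split code into identifiers and then into sub-words. The names splitIdentifiers and splitSubwords are hypothetical, and the rules (ASCII identifiers, boundaries at "." and other punctuation, at underscores, and at lower-to-upper case transitions) are assumptions chosen for illustration rather than the behavior of any particular IDE.

```javascript
// Hypothetical helper: extract identifier "words" from source text. Any
// character that cannot be part of an ASCII identifier, including ".",
// acts as a word boundary.
function splitIdentifiers(code) {
  return code.match(/[A-Za-z_][A-Za-z0-9_]*/g) || [];
}

// Hypothetical helper: split one identifier into sub-words at underscores
// and at lower-to-upper CamelCase transitions. Runs of capitals such as
// "XML" are not split from the word that follows them.
function splitSubwords(identifier) {
  return identifier
    .split("_")
    .flatMap((part) => part.split(/(?<=[a-z0-9])(?=[A-Z])/))
    .filter((part) => part.length > 0);
}

console.log(splitIdentifiers("window.navigator")); // → ["window", "navigator"]
console.log(splitSubwords("getElementById"));      // → ["get", "Element", "By", "Id"]
console.log(splitSubwords("unix_style"));          // → ["unix", "style"]
```

Under these rules "U.S.A." would come out as three separate words, which is the opposite of what double-clicking in English text does.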