26278 – getElementText - no info about U+200E, U+200F

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	Browser Test/Tools WG
Classification:	Unclassified
Component:	WebDriver
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Browser Testing and Tools WG
QA Contact:	Browser Testing and Tools WG

Blocks:	20860

Reported:	2014-07-07 20:48 UTC by Andrey Botalov
Modified:	2014-07-16 20:09 UTC
CC List:	2 users

Description Andrey Botalov 2014-07-07 20:48:01 UTC

The Selenium atoms contain some handling for these characters: https://github.com/SeleniumHQ/selenium/blob/master/javascript/atoms/dom.js#L1208

This change seems to have been made in August 2013, after the algorithm for the WebDriver W3C spec was written. So I think the README.md of the WebDriver project should have a note that anyone who changes the API of the remote end should also file a bug against the WebDriver spec (or even file it before making the change), so that the WebDriver spec and Selenium don't get out of sync.

P.S.: Also I haven't noticed in lines near L1208 code that removes \f, \v

Comment 1 David Burns :automatedtester 2014-07-16 13:32:28 UTC

(In reply to Andrey Botalov from comment #0)
> 
> P.S.: Also I haven't noticed in lines near L1208 code that removes \f, \v

Step 2 -> 1 -> 2nd bullet -> 1 handles this scenario

Comment 2 David Burns :automatedtester 2014-07-16 13:53:18 UTC

https://dvcs.w3.org/hg/webdriver/rev/4e8c789c7f54

Comment 3 Andrey Botalov 2014-07-16 20:02:26 UTC

There are other whitespace and BiDi characters in http://www.unicode.org/Public/6.3.0/ucd/PropList.txt and http://en.wikipedia.org/wiki/Space_(punctuation)#Spaces_in_Unicode.

I think that if only \u200b, \u200e, \u200f, \v, \f should be removed by getElementText() from the string, then the spec should also contain an explanation (note) about what makes those characters special and why other invisible "spaces" shouldn't be removed.

I don't know much about Unicode, but IMO these "spaces" also look zero-width:
U+180E
U+200C
U+2060
U+061C
etc.
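
To make the gap concrete, here is a minimal sketch of what removing only the five characters named in this bug would look like. It is not the Selenium atom or the spec algorithm: `STRIPPED`, `stripNamedCharacters` and `OTHER_ZERO_WIDTH` are hypothetical names used for illustration, and the pattern is assumed from the characters discussed here.

```javascript
// Hypothetical helper, not the Selenium atom: removes only the five
// characters named in this bug (U+200B, U+200E, U+200F, \v, \f).
const STRIPPED = /[\u200b\u200e\u200f\v\f]/g;

function stripNamedCharacters(text) {
  return text.replace(STRIPPED, '');
}

// Characters from the list above that the pattern does not cover.
const OTHER_ZERO_WIDTH = ['\u180e', '\u200c', '\u2060', '\u061c'];

// One occurrence of each stripped character between visible letters.
const sample = 'a\u200bb\u200ec\u200fd\ve\ff';
const stripped = stripNamedCharacters(sample);

// Collect the listed characters that are still present after stripping.
const survivors = OTHER_ZERO_WIDTH.filter(
  (ch) => stripNamedCharacters('x' + ch + 'y').includes(ch)
);
```

With this pattern, `stripped` contains only the visible letters, while all four characters in `OTHER_ZERO_WIDTH` pass through unchanged. That is the inconsistency a note in the spec would need to explain.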

I also found this line in the gecko-dev repository:
https://github.com/mozilla/gecko-dev/blob/master/browser/base/content/browser.js#L2205

> value = value.replace(/[\u00ad\u034f\u061c\u115f-\u1160\u17b4-\u17b5\u180b-\u180d\u200b\u200e-\u200f\u202a-\u202e\u2060-\u206f\u3164\ufe00-\ufe0f\ufeff\uffa0\ufff0-\ufff8]|\ud834[\udd73-\udd7a]|[\udb40-\udb43][\udc00-\udfff]/g, encodeURIComponent);

It seems that the implementation in Firefox is a bit more complicated.
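
Note that the quoted line percent-encodes the matched characters rather than removing them. A small sketch of that behaviour, using the pattern exactly as quoted above: `INVISIBLE` and `escapeInvisible` are hypothetical names, and the replacer is wrapped in an arrow function here (an assumption made for clarity) so that only the matched text is passed to `encodeURIComponent`.

```javascript
// Pattern copied from the browser.js line quoted above.
const INVISIBLE = /[\u00ad\u034f\u061c\u115f-\u1160\u17b4-\u17b5\u180b-\u180d\u200b\u200e-\u200f\u202a-\u202e\u2060-\u206f\u3164\ufe00-\ufe0f\ufeff\uffa0\ufff0-\ufff8]|\ud834[\udd73-\udd7a]|[\udb40-\udb43][\udc00-\udfff]/g;

// Hypothetical wrapper: replaces each match with its percent-encoded form.
function escapeInvisible(value) {
  return value.replace(INVISIBLE, (match) => encodeURIComponent(match));
}

// U+200E and U+2060 are covered by the pattern and become visible escapes.
const escaped = escapeInvisible('a\u200eb\u2060c');
// escaped === 'a%E2%80%8Eb%E2%81%A0c'

// U+200C is not in the pattern, so it is left as-is.
const untouched = escapeInvisible('a\u200cb');
```

So this pattern covers a wider range than the five characters in the atoms, but it still skips U+200C, and its purpose (making invisible characters visible) differs from what getElementText() does.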

Comment 4 David Burns :automatedtester 2014-07-16 20:09:16 UTC

I would rather have this mimic the current implementation, which is what this bug was initially about.

Adding other spaces would need to be in a new bug with specific use cases so that we can discuss them. I suggest bringing this issue up on the mailing list once you have a bug.