See also: IRC log
<azaroth> RangeFinder draft: http://w3c.github.io/web-annotation/api/rangefinder/
<azaroth> Takeshi's character length research: https://gist.github.com/tkanai/e2984cfa14cf099baa94
[people introducing themselves] ... i18n for annotation
... and in particular (brought up at the April F2F): code point issues
... and how that would affect the counts for the anchoring of an annotation
... second issue: RangeFinder API
https://gist.github.com/tkanai/e2984cfa14cf099baa94
azaroth: the issue is
... in the annotation model, there are different selectors
... we need a consistent way for implementations to compute ranges by character count
... text length is counted differently in different languages
... (see gist)
addison: programming languages that use UTF-16 use code points
... languages that use UTF-8 hide that
... code points make the most sense; most languages can find out code points
... other issues: the 'user perceived boundaries'
... e.g., boundary between colors, emojis, etc.
... that is harder from a spec point of view
azaroth: so Python uses code points, and JavaScript can get to the code points?
addison: yes
azaroth: easiest is thus to count code points, and note that JavaScript implementations will be able to do this, but currently can't
addison: and help for e.g. unicode controls
azaroth: e.g. e + acute character?
addison: yes, and also emojis
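A minimal JavaScript sketch of the distinction under discussion: `.length` counts UTF-16 code units, spreading a string iterates code points, and neither matches the user-perceived character count for combining sequences:

```javascript
// "é" written as e + combining acute (U+0301): one user-perceived
// character, but two code points and two UTF-16 code units.
const eAcute = "e\u0301";
console.log(eAcute.length);        // 2 (UTF-16 code units)
console.log([...eAcute].length);   // 2 (code points)

// An emoji outside the Basic Multilingual Plane: one perceived
// character, one code point, two code units (a surrogate pair).
const emoji = "\u{1F600}";
console.log(emoji.length);         // 2 (code units)
console.log([...emoji].length);    // 1 (code point)
```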
TimCole: thinking about use cases: user can
highlight part of text
... so perceived characters is what the user thinks he/she is annotating
... e.g., perceived characters might count differently on different devices:
... is there a difference between laptop visualization and smartphone visualization?
addison: that won't be a problem
... e.g. scripted selection or programmatic manipulation is tricky for
user perceived text selection
... boundaries will always be in the same places (code points), but scripting languages might translate those code points to different bytes (e.g. UTF-8 vs UTF-16)
r12a: there are other control characters,
e.g., bidirectional controls
... if one of those appears at boundary of selection, that might be an
issue
... also: if a user counts the number of characters they want, that will be more problematic
... third thing: in JavaScript it is possible to detect characters outside the 'normal range'
takeshi: to my understanding, it is not necessary to find the accurate character length; it is enough if we can count letters with Unicode code points
addison: invisible code points can change the visualization of other code points. If you stop the boundary too soon, you might not pick up an important modifier
takeshi: an invisible character might become visible as a square
<Zakim> azaroth, you wanted to ask about polyfill possibility
addison: programmers can detect when a length is 2 instead of 1, because those values will lie in a certain range
azaroth: is it possible to polyfill that? That probably needs some hardcore JavaScript programming
r12a: the Basic Multilingual Plane will always return a single code unit
... beyond that (e.g. some Chinese and Japanese characters), two units will always need to be combined; they are surrogates
... so you can look for the second 'character' to combine them
... However, the question is: why would you need to count characters? The user selects text via a highlight, so they are in control of the range
... what gets picked up is a range of text, so you don't need to count
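r12a's surrogate-pair point can be sketched in JavaScript: a high surrogate lies in the range 0xD800–0xDBFF, and combining it with the following low surrogate (0xDC00–0xDFFF) recovers the full code point (the decoding below is the standard UTF-16 formula):

```javascript
// Walk a string by UTF-16 code units, combining surrogate pairs
// into single code points.
function codePointsOf(s) {
  const points = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const low = s.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        // Standard UTF-16 decoding of a surrogate pair.
        points.push((unit - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000);
        i++; // skip the low surrogate
        continue;
      }
    }
    points.push(unit);
  }
  return points;
}

console.log(codePointsOf("a\u{1F600}b")); // [97, 128512, 98]
```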
addison: you need to remember what the
original annotated content was.
... When you try to compute where the annotation should go, that is a different use case than manual selection (e.g., RangeFinder)
azaroth: there are offset-based and string-based selectors; one of the issues is copyright and IP
... if annotations are created that each record 100 characters, then with enough annotations you could reconstruct the entire copyrighted text
... with offsets, you cannot reconstruct the entire text
... so recording only the exact text string is not enough
addison: offsets are also more efficient: you don't want to ship the entire selection (if that selection is large) together with the annotation
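As a rough sketch, the two selector styles being contrasted correspond to the Web Annotation model's TextQuoteSelector and TextPositionSelector; the field values below are invented for illustration:

```javascript
// String-based: records the annotated text plus surrounding context,
// so the source text leaks into the annotation itself.
const quoteSelector = {
  type: "TextQuoteSelector",
  exact: "code points",
  prefix: "count ",
  suffix: " instead",
};

// Offset-based: records only positions, so the source text cannot be
// reconstructed from the annotations alone.
const positionSelector = {
  type: "TextPositionSelector",
  start: 412,
  end: 423,
};
```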
azaroth: RangeFinder API is a browser-level API to discover ranges, using input that describes the range of text
... basic constructor has inputs, e.g., prefix, suffix, text, character
start, character length, xpath
... or case folding (is case important or not)
... or Unicode folding (is e+accent vs plain e important or not)
... or word boundaries (should 'tooth' also match 'toothpaste')
... Addison had a look from an i18n point of view
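A hypothetical sketch of what such a constructor-style search could reduce to in plain JavaScript; the function and option names (`findTextRange`, `foldCase`) are invented here, not taken from the RangeFinder draft:

```javascript
// Find `text`, optionally anchored by prefix/suffix, with an optional
// case-folding flag; returns character offsets or null if not found.
function findTextRange(doc, { text, prefix = "", suffix = "", foldCase = false }) {
  const fold = (s) => (foldCase ? s.toLowerCase() : s);
  const haystack = fold(doc);
  const needle = fold(prefix + text + suffix);
  const hit = haystack.indexOf(needle);
  if (hit === -1) return null;
  const start = hit + prefix.length;
  return { start, end: start + text.length };
}

console.log(findTextRange("The Tooth, the whole tooth.", { text: "tooth", foldCase: true }));
// → { start: 4, end: 9 }  (matches "Tooth" case-insensitively)
```

Note that real folding is language-sensitive and more subtle than `toLowerCase()`, which is exactly the concern raised below.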
<aphillip> https://lists.w3.org/Archives/Public/www-international/2015AprJun/0136.html
addison: we've been working on a character
string model
... first reaction: there are a bunch of options, and they might not all make sense
<aphillip> http://w3c.github.io/charmod-norm/#searching
azaroth: ways forward: start as simple as possible?
addison: no, you need to think about these problems (of other languages) from early on
... e.g., word boundaries are easy to implement for Latin-based languages, and make sense for Japanese, but Japanese does not use spaces
... that would need a dictionary
r12a: if you double-click on a word of any language, browsers can handle word selection for Japanese and Thai...
... other languages use special characters to split words
addison: word boundary detection (browser-based) is not perfect
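The dictionary-based word segmentation described here is what engines now expose via `Intl.Segmenter` (not yet available at the time of this discussion; requires e.g. Node.js 16+), and the exact split depends on the engine's ICU dictionary:

```javascript
// Dictionary-based word segmentation for a language without spaces.
const segmenter = new Intl.Segmenter("ja", { granularity: "word" });
const words = [...segmenter.segment("これはペンです")].map((s) => s.segment);
console.log(words); // e.g. [ 'これ', 'は', 'ペン', 'です' ], dictionary-dependent
```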
azaroth: documentation should include those
links, together with charmod
... and keep RangeFinder in sync
addison: there is a section about case folding, which is language sensitive (the spec might need to take that into account)
... the current text in RangeFinder about Unicode normalization has some major challenges
... you can do harm to text if you remove all combining characters; that might break some languages
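A short JavaScript sketch of why stripping combining characters is destructive: after NFD decomposition, removing all marks (`\p{M}`) merely "unaccents" Latin text, but it deletes meaningful vowel signs in scripts such as Devanagari:

```javascript
// NFD decomposes "é" (U+00E9) into e + U+0301 combining acute.
const eAcute = "\u00e9".normalize("NFD");
console.log(eAcute.length); // 2 after decomposition

// Stripping all combining marks "folds" é to e ...
console.log(eAcute.replace(/\p{M}/gu, "")); // "e"

// ... but the same operation mangles scripts where marks carry meaning:
// Hindi "हिन्दी" loses its vowel signs and virama entirely.
console.log("हिन्दी".replace(/\p{M}/gu, "")); // "हनद"
```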
TimCole: Anno WG needs to think more about use cases and requirements, in order to say:
<aphillip> see also: http://www.unicode.org/reports/tr10/#Searching
TimCole: in languages X and Y, we will not support word folding
... to make implementations more realistic for browsers
azaroth: stronger i18n-based use cases are needed
addison: the collation algorithm is more tuned to sorting tokens than to searching text
... computing time is larger
azaroth: i18n for the body of the annotation, not just the target document segment, is a great point
... encoding of the annotations themselves if the body or target has i18n characters
... is that a problem for JSON(-LD) or Turtle that we need to take into account?
addison: both formats use sequences of Unicode characters, so it shouldn't be an issue
bigbluehat: [will continue about this on mail]
addison: we are very interested for input on how best to document this, and getting reviews for our documents
azaroth: takeshi, and others, can you have a look at the character model document?
... thank you Addison and Richard for your input
addison: [about RangeFinder API review] you
can use the mailing list
... of i18n working group
... If there are comments about the character model (@Takeshi), you can add issues to the GitHub repo
<ivan> trackbot, end telcon