See also: IRC log
<azaroth> RangeFinder draft: http://w3c.github.io/web-annotation/api/rangefinder/
<azaroth> Takeshi's character length research: https://gist.github.com/tkanai/e2984cfa14cf099baa94
[people introducing themselves] ... i18n for annotation
... and in particular (brought up at the April F2F): code point issues
... and how that would affect the counts for the anchoring of an annotation
... second issue: RangeFinder API
https://gist.github.com/tkanai/e2984cfa14cf099baa94
azaroth: the issue is
... in the annotation model, there are different selectors
... we need a consistent way for implementations to compute ranges by character count
... text length is counted differently in different languages
... (see gist)
addison: programming languages that use UTF-16 use code points
... languages that use UTF-8 hide that
... code points make the most sense; most languages can find out code points
... other issues: the 'user perceived boundaries'
... e.g., boundary between colors, emojis, etc.
... that is harder from a spec point of view
azaroth: so Python uses code points, and JavaScript can get to the code points?
addison: yes
azaroth: easiest is thus to count code points, and note that JavaScript implementations will be able to do this, but currently can't
addison: and help for e.g. unicode controls
azaroth: e.g. e + acute character?
addison: yes, and also emojis
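A minimal JavaScript sketch of the distinction under discussion: `.length` counts UTF-16 code units, spreading a string iterates code points, and neither matches the user-perceived character count for combining sequences:

```javascript
// "é" written as e + combining acute (U+0301): one user-perceived
// character, but two code points and two UTF-16 code units.
const eAcute = "e\u0301";
console.log(eAcute.length);        // 2 (UTF-16 code units)
console.log([...eAcute].length);   // 2 (code points)

// An emoji outside the Basic Multilingual Plane: one perceived
// character, one code point, two code units (a surrogate pair).
const emoji = "\u{1F600}";
console.log(emoji.length);         // 2 (code units)
console.log([...emoji].length);    // 1 (code point)
```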
TimCole: thinking about use cases: user can
highlight part of text
... so perceived characters is what the user thinks he/she is annotating
... e.g., perceived characters might count differently on different devices:
... is there a difference between laptop visualization and smartphone visualization?
addison: that won't be a problem
... e.g. scripted selection or programmatic manipulation is tricky for
user perceived text selection
... boundaries will always be in the same places (code points), but scripting languages might translate those code points to different bytes (e.g. UTF-8 vs UTF-16)
r12a: there are other control characters,
e.g., bidirectional controls
... if one of those appears at boundary of selection, that might be an
issue
... also: if a user counts the number of characters they want, that will be more problematic
... third thing: in JavaScript it is possible to detect characters outside the 'normal range'
takeshi: to my understanding, it is not necessary to find the accurate character length; it is enough if we can count letters with Unicode code points
addison: invisible code points can change the visualization of other code points. If you stop the boundary too soon, you might not pick up an important modifier
takeshi: an invisible character might become visible as a square
<Zakim> azaroth, you wanted to ask about polyfill possibility
addison: programmers can detect when a length is 2 instead of 1, because those values will lie in a certain range
azaroth: is it possible to polyfill that? That probably needs some hardcore JavaScript programming
r12a: the Basic Multilingual Plane will always return a single code unit
... beyond that (e.g. some Chinese and Japanese characters), two units will always need to be combined; they are surrogates
... so you can look for the second 'character' to combine them
... However, the question is: why would you need to count characters? The user selects text via a highlight, so they are in control of the range
... what gets picked up is a range of text, so you don't need to count
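r12a's surrogate-pair point can be sketched in JavaScript: a high surrogate lies in the range 0xD800–0xDBFF, and combining it with the following low surrogate (0xDC00–0xDFFF) recovers the full code point (the decoding below is the standard UTF-16 formula):

```javascript
// Walk a string by UTF-16 code units, combining surrogate pairs
// into single code points.
function codePointsOf(s) {
  const points = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const low = s.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        // Standard UTF-16 decoding of a surrogate pair.
        points.push((unit - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000);
        i++; // skip the low surrogate
        continue;
      }
    }
    points.push(unit);
  }
  return points;
}

console.log(codePointsOf("a\u{1F600}b")); // [97, 128512, 98]
```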
addison: you need to remember what the
original annotated content was.
... When you try to compute where the annotation should go, that is a different use case than manual selection (e.g., RangeFinder)
azaroth: there are offset-based and string-based selectors; one of the issues is copyright and IP
... if annotations are created that each record 100 characters, then with enough annotations you could reconstruct the entire copyrighted text
... with offsets, you cannot reconstruct the entire text
... so recording only the exact text string is not enough
addison: offsets are also more efficient: you don't want to ship the entire selection (if that selection is large) together with the annotation
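As a rough sketch, the two selector styles being contrasted correspond to the Web Annotation model's TextQuoteSelector and TextPositionSelector; the field values below are invented for illustration:

```javascript
// String-based: records the annotated text plus surrounding context,
// so the source text leaks into the annotation itself.
const quoteSelector = {
  type: "TextQuoteSelector",
  exact: "code points",
  prefix: "count ",
  suffix: " instead",
};

// Offset-based: records only positions, so the source text cannot be
// reconstructed from the annotations alone.
const positionSelector = {
  type: "TextPositionSelector",
  start: 412,
  end: 423,
};
```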
azaroth: RangeFinder API is a browser-level API to discover ranges, using input that describes the range of text
... basic constructor has inputs, e.g., prefix, suffix, text, character
start, character length, xpath
... or case folding (is case important or not)
... or Unicode folding (is e+accent vs plain e important or not)
... or word boundaries (should 'tooth' also match 'toothpaste')
... Addison had a look from an i18n point of view
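A hypothetical sketch of what such a constructor-style search could reduce to in plain JavaScript; the function and option names (`findTextRange`, `foldCase`) are invented here, not taken from the RangeFinder draft:

```javascript
// Find `text`, optionally anchored by prefix/suffix, with an optional
// case-folding flag; returns character offsets or null if not found.
function findTextRange(doc, { text, prefix = "", suffix = "", foldCase = false }) {
  const fold = (s) => (foldCase ? s.toLowerCase() : s);
  const haystack = fold(doc);
  const needle = fold(prefix + text + suffix);
  const hit = haystack.indexOf(needle);
  if (hit === -1) return null;
  const start = hit + prefix.length;
  return { start, end: start + text.length };
}

console.log(findTextRange("The Tooth, the whole tooth.", { text: "tooth", foldCase: true }));
// → { start: 4, end: 9 }  (matches "Tooth" case-insensitively)
```

Note that real folding is language-sensitive and more subtle than `toLowerCase()`, which is exactly the concern raised below.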
<aphillip> https://lists.w3.org/Archives/Public/www-international/2015AprJun/0136.html
addison: we've been working on a character
string model
... first reaction: there are a bunch of options, and they might not all make sense
<aphillip> http://w3c.github.io/charmod-norm/#searching
azaroth: ways forward: start as simple as possible?
addison: no, you need to think about these problems (of other languages) from early on
... e.g., word boundaries are easy to implement for Latin-based languages, and make sense for Japanese, but Japanese does not use spaces
... that would need a dictionary
r12a: if you double-click on a word of any language, browsers can handle word selection for Japanese and Thai...
... other languages use special characters to split words
addison: word boundary detection (browser-based) is not perfect
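The dictionary-based word segmentation described here is what engines now expose via `Intl.Segmenter` (not yet available at the time of this discussion; requires e.g. Node.js 16+), and the exact split depends on the engine's ICU dictionary:

```javascript
// Dictionary-based word segmentation for a language without spaces.
const segmenter = new Intl.Segmenter("ja", { granularity: "word" });
const words = [...segmenter.segment("これはペンです")].map((s) => s.segment);
console.log(words); // e.g. [ 'これ', 'は', 'ペン', 'です' ], dictionary-dependent
```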
azaroth: documentation should include those
links, together with charmod
... and keep RangeFinder in sync
addison: there is a section about case folding, which is language sensitive (the spec might need to take that into account)
... the current text in RangeFinder about Unicode normalization has some major challenges
... you can do harm to text if you remove all combining characters; that might break some languages
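A short JavaScript sketch of why stripping combining characters is destructive: after NFD decomposition, removing all marks (`\p{M}`) merely "unaccents" Latin text, but it deletes meaningful vowel signs in scripts such as Devanagari:

```javascript
// NFD decomposes "é" (U+00E9) into e + U+0301 combining acute.
const eAcute = "\u00e9".normalize("NFD");
console.log(eAcute.length); // 2 after decomposition

// Stripping all combining marks "folds" é to e ...
console.log(eAcute.replace(/\p{M}/gu, "")); // "e"

// ... but the same operation mangles scripts where marks carry meaning:
// Hindi "हिन्दी" loses its vowel signs and virama entirely.
console.log("हिन्दी".replace(/\p{M}/gu, "")); // "हनद"
```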
TimCole: Anno WG needs to think more about use cases and requirements, in order to say:
<aphillip> see also: http://www.unicode.org/reports/tr10/#Searching
TimCole: in languages X and Y, we will not support word folding
... to make implementations more realistic for browsers
azaroth: stronger i18n-based use cases are needed
addison: the collation algorithm is more tuned to sorting tokens than to searching text
... computing time is larger
azaroth: i18n for the body of the annotation, not just the target document segment, is a great point
... encoding of the annotations themselves if the body or target has i18n characters
... is that a problem for JSON(-LD) or Turtle that we need to take into account?
addison: both formats use sequences of Unicode characters, so it shouldn't be an issue
bigbluehat: [will continue about this on mail]
addison: we are very interested for input on how best to document this, and getting reviews for our documents
azaroth: takeshi, and others, can you have a look at the character model document?
... thank you Addison and Richard for your input
addison: [about RangeFinder API review] you
can use the mailing list
... of i18n working group
... If there are comments about the character model (@Takeshi), you can add issues to the GitHub repo
<ivan> trackbot, end telcon