Internationalization Working Group Teleconference -- 04 Aug 2016

Agenda

Discussion of ActivityStreams with Social WG

r12a: One of the most urgent questions at the moment is how to go about ensuring that directionality works. I think we should not talk about language yet, but focus on text direction. There are significant differences between how language and direction work

<aaronpk> +1

r12a: The key question is do we need a direction property to capture the base direction of the text
... There are two aspects to specifying the base direction
... The overall default base for a paragraph or sequence of paragraphs
... The other is directional changes inline

<Francesco> finally I can hear you - sorry

r12a: They're slightly different. Certainly if we had a direction property it would only be capable of describing the default base direction for the paragraph as a whole, and you'd still need other mechanisms to indicate inline changes in direction
... The AS2 spec, and micropub and webmention have the same problem
... What we say for AS2 is probably relevant for them as well
... We'll talk about AS2 to keep it simple
... Allows two types of message. One can take html markup and one cannot
... There is another question, whether anything can be done about that
... I think we should talk about that after the direction thing
... Should we capture the default base direction for the paragraph in a separate property, or should we just rely on the data in order to obtain that information?
... One way that you can do that by relying on the data is by testing the first strong character in the text
... You skip any weak or neutral characters until you find a strong one, and if it's a rtl character you say the overall direction for the paragraph is rtl
... That works a lot of the time and if you look at people's twitter streams you see that most of the time it works okay, because they tend to have just arabic text, or they have arabic with some embedded latin, which either is handled specially by twitter or is a simple embedding which doesn't produce any problems
... But there are situations where that first-strong rule can be duped
... Which is when you need to say explicitly no this is not actually a ltr phrase even though it starts with a ltr strong character
... Twitter actually doesn't use the first strong character to determine the direction
... It looks at the number of characters in one direction and the number in another
... And the results due to that are unpredictable
... But that's the basic idea. If you use the text you can most of the time figure out the direction, and the rest you need to find a way to indicate it
... If you're dealing with markup you can add it there
... If you're dealing with name, you can add an lrm or rlm character at the beginning of the text
... (control characters)
... Unfortunately things are not quite so simple as that. The main question here is whether there is a value in having a separate direction property
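
Illustration (not part of the call): a minimal Python sketch of the two data-driven heuristics mentioned above, first-strong detection and a character-count comparison. The function names are hypothetical, and the counting rule is only a guess at the general idea; the exact rule Twitter applies is not given in the discussion.

```python
import unicodedata

# Bidi classes treated as "strong" by the Unicode Bidirectional Algorithm.
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def first_strong_direction(text, default="ltr"):
    """Return the direction of the first strong character, skipping weak and neutral ones."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "ltr"
        if bidi_class in STRONG_RTL:
            return "rtl"
    return default  # no strong character found at all


def majority_direction(text, default="ltr"):
    """Return whichever direction has more strong characters (a counting heuristic)."""
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) in STRONG_LTR)
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in STRONG_RTL)
    if ltr == rtl:
        return default
    return "ltr" if ltr > rtl else "rtl"


# An Arabic message that happens to start with a Latin handle:
message = "@w3c مرحبا بالعالم"
print(first_strong_direction(message))  # first-strong is fooled by the handle
print(majority_direction(message))      # counting sees mostly Arabic letters
```

The two heuristics disagree on this message, which is the kind of case where the text alone is not enough and the direction has to be indicated some other way.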

<Zakim> sandro, you wanted to ask about why direction first, since it seems derived

sandro: Does direction have to be managed separately from language? I would naively assume that if I knew the primary language of a text I'd know the primary direction of the text?

aphillip: direction has a weak relation to the language. And language information isn't always available or authoritative

sandro: The order of solving these is surprising to me. If we solve the language problem we solve the direction problem?

various: no

<cwebber2> are there any languages that are both rtl and ltr?

aphillip: Sometimes you can use language information to help infer the direction, but the direction is something you need in order to process the text for display. It has its own structure and needs to be managed in a particular way. Language does have some impact on display, but that generally involves processes that are done separately from the bidi
... With language you're only inferring what the direction is likely to be for a particular paragraph

<jasnell> there are fairly successful heuristic approaches to guessing the directionality from language, but it's not foolproof by any means

r12a: there are many languages written in both ltr and rtl scripts

sandro: that makes sense

<aaronpk> https://www.w3.org/International/questions/qa-bidi-unicode-controls

<aaronpk> "how these control characters"

aaronpk: I was reading the w3 guide on unicode controls and from this, unfortunately there's no anchor, but if you search for ^ you'll see the paragraph
... Specifically about the title attribute in html
... There's obviously no mechanism for base direction of an attribute in html

Addison: You can set the base direction. You can't put markup inside the title

aaronpk: The example given here is the example of mixed rtl and ltr text where neither is dominant because the text is so short
... This seems to solve it
... I'm wondering why we can't just use this to solve the problem everywhere?

aphillip: If you want proper complete bidi layout then you need to do other things in order to make that happen. That can include using control characters
... A challenge is that in annotation cases you're going to be taking text that doesn't necessarily include that, or includes markup for directionality, and trying to get things to do the right thing

jasnell: There are a number of considerations. If you have a name that's plain text and have these control characters, not every implementation is going to understand these control characters

aaronpk: Having an extra property will need to be understood too

jasnell: Control characters can work they just add extra complexity

aaronpk: They have to support control characters anyway to support bidi?

aphillip: They have to support control characters to support unicode
... But again the question is where you get information in order to do an implementation when you're constructing text, we find that mostly markup generally works better than the invisible controls for helping people with authoring content

aaronpk: Examples of those?

aphillip: Some in some articles.. the challenge is the controls are invisible
... Whereas markup is visible to people trying to get the right direction
... When somebody is authoring a tweet or an annotation on a document, they're not authoring markup; the control characters are generated on the fly to get the display to look correct
... What the text direction property is to do is to capture the context of the text that's entered or selected
... If you snippet a piece of text from an html document, the base direction might be declared as far back as the html element on a webpage
... The DOM structure knows what the base direction is and could populate a text direction property, even though there's no markup nearby on the text that's being snippeted

aaronpk: the browser could also embed the control character in the text?

aphillip: That's possible, but a more likely or simpler implementation is to take a piece of data you already have and apply it as metadata rather than having to mutate the text that's being clipped
... Or similarly if you write an android application you can know that your runtime environment is set to rtl and therefore the input control used to enter the text is a rtl base direction context
... not to say that you couldn't try to manage the control characters for the user, but you're interfering with their text by inserting or removing control characters based on the runtime context
... That's a reason why we might want to have a separate property
... That isn't to say that we couldn't solve it by writing instructions instead that say your implementation must or must not include control characters in a particular way

jasnell: Just to provide more context wrt the property approach. AS2 is a JSON based format. It is written to be compatible with or aligned with JSON-LD. While we can have objects, embedded and nested objects, there really is no concept of inheritance
... outside of the JSON-LD context within a document
... So if you have an object nested 3 or 4 levels deep it doesn't actually inherit the properties of its parent
... And these individual objects can be from different sources, different authors
... In those cases declaring a base document level direction may not necessarily work, and we'd have to put the direction metadata at each object within that document
... So you potentially end up with multiple default direction properties throughout a single document

aphillip: our best practice is to recommend that language and base direction information is associated with each object that could contain it, so each can be set separately
... And it's also useful to have a document level way of saying the default, to have a fallback so you don't have to set it on every single thing
... JSON-LD itself doesn't provide any of this structuring
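
Illustration (not part of the call): a minimal Python sketch of the per-object recommendation with a document-level fallback. The `dir` property name and the `collect_names` helper are hypothetical; AS2 defined no such property at the time of this discussion. Note that a nested object does not inherit from its parent: it either carries its own value or falls back to the document default.

```python
def collect_names(node, document_default, found=None):
    """Walk nested objects and pair every name with its own base direction.

    An object that lacks a (hypothetical) "dir" property falls back to the
    document-level default, never to the value set on its parent object.
    """
    if found is None:
        found = []
    if isinstance(node, dict):
        if "name" in node:
            found.append((node["name"], node.get("dir", document_default)))
        for value in node.values():
            collect_names(value, document_default, found)
    elif isinstance(node, list):
        for item in node:
            collect_names(item, document_default, found)
    return found


activity = {
    "type": "Create",
    "name": "A note was created",
    "object": {
        "type": "Note",
        "name": "ملاحظة",
        "dir": "rtl",  # hypothetical per-object base direction
        "inReplyTo": {"type": "Note", "name": "Earlier note"},
    },
}

for name, direction in collect_names(activity, document_default="ltr"):
    print(direction, name)
```

Here the innermost note comes out as ltr even though its parent is rtl, which reflects the point that objects from different authors cannot safely inherit direction from one another.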

jasnell: The point there is that adding.. I have no quarrel with adding this information as properties. The tradeoff though is that it does add fairly significant complexity for implementors. There's also a backwards compat concern with as1
... Existing implementations, display name is a simple string as plain text without any language tagging or directionality
... We have made breaking changes from as1, so less of a concern now than it was before, but if we are going to provide this metadata we need to do it in a way that causes as little disruption and complexity as possible

aphillip: I think these are all optional properties, that's less intrusive than requiring implementations to do control character insertion?

jasnell: It is less, we just need to be careful with the wording
... If we strongly recommend, it sends a signal. Implicit MUST

r12a: I'm for the idea of putting the information in the text itself, and I've been trying hard to think of scenarios where having a separate direction property would be advantageous and I haven't come up with a lot, however there are a couple of situations that are worth mentioning
... james, you mentioned increased implementor complexity
... If you had typed the text in a field, and it knew it's rtl because of the context from the html, the user wouldn't type in any information to say this is rtl
... If you're working with first-strong heuristics you wouldn't need that
... But if you had started with @mention which is in latin, then unless you have some very special handling in the target to say that's a twitter handle and you should ignore it, then you're going to get a situation where the first-strong char is ltr when the rest of the message is rtl
... I don't think you could expect the users to say 'this is going to be wrong if it goes somewhere else'
... The user isn't going to think about or want to add control characters to do that
... You want to get the data that the DOM knows about and apply that in some way to the text so it comes out appropriately
... Whether you do that by putting the data into a property value or by changing the text I'm not sure which is best, but they both would involve some additional complexity in terms of making sure that when that piece of text finally comes out somewhere there's information about what the default directionality is expected to be
... The other issue that we have in AS2 which we didn't have in WA: in WA each leaf in the object only has one text property, and therefore we had a direction property and a text property which were closely related
... in AS2 you can have a map property with translations, summary and content in the same object with only one direction property, which would give a default for all of those strings which may be wrong
... that's an additional problem with having a direction property
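
Illustration (not part of the call): a minimal Python sketch of the "change the text" option for the @mention case. It assumes the authoring tool already knows the intended base direction (for example from the DOM) and prepends RLM (U+200F) or LRM (U+200E) only when first-strong detection would otherwise give the wrong answer. The function names are hypothetical.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, a strong ltr character with no visible glyph
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, a strong rtl character with no visible glyph


def first_strong_direction(text):
    """Return 'ltr', 'rtl', or None when the text has no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def apply_base_direction(text, base_dir):
    """Prepend a directional mark only if first-strong detection disagrees with base_dir."""
    if first_strong_direction(text) == base_dir:
        return text  # the text already speaks for itself
    return (RLM if base_dir == "rtl" else LRM) + text


message = "@aaronpk مرحبا"  # typed in an rtl field, but starts with a Latin handle
fixed = apply_base_direction(message, "rtl")
print(first_strong_direction(message))  # ltr
print(first_strong_direction(fixed))    # rtl
```

The same known direction could equally be stored alongside the string as metadata; this sketch only shows what the text-mutating alternative involves.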

cwebber: You were just saying that you weren't sure where it would be a difficulty to have markup
... I definitely want to support i18n. The clear case where it's problematic is titles, which are supposed to be just text, and possible to be rendered out of band, but very simply rendered. We don't want bold, we don't want links... just text
... We have one language to parse which is JSON and then you have another to parse which is HTML
... If you put HTML in a title element, it's difficult to parse in the first place. But it's also broad. If we permit links and bold and CSS in there, that's a lot more stuff to be concerned about than ... maybe we could reduce and say it's just <span> and that's all you're allowed to have
... Maybe that would work. It would reduce the scope, but would still be much more complex
... I have seen myself, sometimes people embed in RSS and atom readers you end up looking at blog entries and there are angle brackets rendered on the title, and I'm pretty sure that that's what will happen in our implementations
... I would like to support rtl stuff correctly, but that's why I feel this inkling that having the control characters would be nicer
... But I have this itching feeling that we're going to end up with a lot of trouble if we permit html in this element

tantek: +1 to what chris said. The only experience we have with formats that are not html but then try to do embedded markup have basically all been failures in terms of implementation support, interop, and dependability by anyone trying to use those
... the hypothesis that using nested html markup in json works: the only data we have from when that hypothesis has been tested has shown that it is false
... That that solution does not work
... we have zero examples of that working
... I would go so far as to say we MUST NOT add markup in these elements
... the control character approach I'm not as familiar with
... but that seems to be a simpler solution to try
... Has anyone tried that and what are the results? I would defer to i18n for research on that

r12a: Most people just want to type the text. Even fairly technical people who write in arabic or hebrew hate the control characters. They are hard to use. One of the problems is that they're invisible and you can never quite know whether you got it right
... If you try to edit something with the embedding, and you need start and end, and it gets really complicated

tantek: my understanding is the same tools the user is using to input text would be generating the embedded markup
... No user would ever type in or see any control codes
... Nobody is advocating users typing control codes

<r12a> https://www.w3.org/International/wiki/Bidi_in_social_media

r12a: most people don't have access to these control characters on their keyboard either. I did some testing ^
... If you put the rlm at the beginning and then try to make that work on twitter or facebook it doesn't actually work. They strip them out before posting the message

<aphillip> most is probably too strong, but certainly mobile users

r12a: There are all those disadvantages with control codes. What I wanted to understand was that there are properties like summary and content that can hold html. Where does that html come from? How do they end up with html in them?
... Maybe one answer is what you just said tantek, maybe it's created during the process of creating the text
... I was trying to understand.. people are not going to type in html either

jasnell: If you look at like blog software, for the authoring UI they provide a plain text title field and a rich text or markup editor that allows the user to format the content
... The editing tool itself is providing the markup for those values
... The title tends to be plain text, and that's what would end up in the name property
... Whereas a rich text editor would provide the values for summary and content

r12a: I wonder how we would manage direction in that sort of context

jasnell: I'm not aware of any rich text editors that have directionality as default option. If they do they would be markup oriented not control characters

aphillip: there are a number. In Arabic and Hebrew contexts. Yahoo mail has controls for that
... Not necessarily obvious
... particularly to non-users of them

jasnell: And they operate in terms of markup, setting the directional spans rather than using control characters

aphillip: that's my understanding

<KevinMarks> Hebrew and Arabic keyboards often have the relevant chars

<aaronpk> https://github.com/w3c/activitystreams/issues/338#issuecomment-237570361

tantek: I left a long but clear comment on the AS2 github
... that's my last point, I have to leave

jasnell: on the point of markup in name, and I made this in 338 too, one of the primary points in use cases, the whole semantic of the name property, is to provide a reliably readable label for the object
... If some implementation for instance doesn't understand the object type, it would still have a reliable fallback to use the label
... Allowing markup of any kind makes it more problematic and complicated
... We have to retain that ability in order for the open extensibility model to continue working as it has been
... That's something we cannot lose
... That was the point, part of the earlier discussion

aphillip: It's very hard to only permit limited forms of markup as well
... Once you kind of let some html in then you're kind of inviting a whole bunch of other html
... I don't think there's a lot of success in trying to limit what markup is applied
... It's not just bs and is and ems and strongs

cwebber: I think, building off what James said, and what you just said, we have to assume that it's not possible to embed html in that name element. So what can we do given that it's really not possible?
... There's a real semantic need to have a plain text name for that object which won't work if we have markup
... It seems the options are the control characters, or an additional property. Are there any other options?

<KevinMarks_> ‏the vreating user agent can embed the control chars

cwebber: We definitely want to support that, everybody wants this to work
... If we assume that markup is not possible, what can we do at this point?
... Can we simplify the conversation if we acknowledge that?

aphillip: A property is supplying a base direction, I made that distinction early
... The base direction is not the same as providing inline controls to fix.. Richard has a whole bunch of examples.. text that needs help with multiple directions
... That's why we'd additionally need to look for control characters inside the text
... If you're going to have a plaintext string, you're still going to need control characters for perfect bidi

jasnell: if we're not going to allow markup, to properly support bidi the only way is to support control characters
... We do have the option right now in the json format to say name is an object, as an option, that has a direction and language property, and a value
... It's more complicated for implementors and consumers, but it does give us the option of declaring on a per-field basis without having to rely on markup
... What is the complexity tradeoff?
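
Illustration (not part of the call): a minimal Python sketch of what a consumer would have to do if name could be either a plain string or an object carrying a value, a direction and a language. The key names `value`, `dir` and `lang` and the `normalize_name` helper are hypothetical; the discussion does not fix them.

```python
def normalize_name(name, default_dir=None, default_lang=None):
    """Accept either a plain string or a {value, dir, lang} object and return one shape."""
    if isinstance(name, str):
        # Legacy form: nothing is known beyond the text itself.
        return {"value": name, "dir": default_dir, "lang": default_lang}
    return {
        "value": name["value"],
        "dir": name.get("dir", default_dir),
        "lang": name.get("lang", default_lang),
    }


plain = normalize_name("A simple title")
rich = normalize_name({"value": "عنوان", "dir": "rtl", "lang": "ar"})
print(plain)
print(rich)
```

Every consumer that reads name would need a branch like this one, which is the complexity being weighed against per-field precision.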

<cwebber2> it would be possible, but a big headache to add that so late to all our activitystreams libraries

aphillip: Can we describe rules for insertion and removal of control codes for the bidi
... Properties of the field... just the base direction that would be a property there... vs inline metadata

<Zakim> aaronpk, you wanted to say I completely agree with tantek, and was never advocating that users type control characters themselves

aaronpk: I'm not sure about the comment r12a made about me, I want to echo tantek earlier, I fully expect that the tools would be the ones adding the appropriate characters to the string, I'd never expect users to add that themselves
... My understanding is that the main reason html has a base dir property for elements is not so much so that the string itself is in the correct order, but that html elements can flow in the correct order

aphillip: that's not correct
... It doesn't change the order in which the elements flow
... What it has to do with is how the text is processed for unicode base direction, but it doesn't change what order the elements are presented in

aaronpk: One reason that html needs the attribute is if you imagine a full width element, setting the base direction on that element means the text will appear on the right side of the screen. That won't happen in control characters..

aphillip: that's not necessarily true

aaronpk: html is describing the layout. In most of these json formats we're not describing the layout, just the string
... we don't know what format it will be presented in
... html is specifically describing the presentation

aphillip: I think that's an invalid reading of the use of dir

<Steve_Atkin> I have to drop the call now.

aphillip: It's the case that the dir attribute causes that kind of rtl display that a rtl user would expect. But it's also an inherent property of the text. The reason it doesn't live in CSS is because it's an inherent property of the text

aaronpk: that's absolutely my point
... outside of the context of html, the text does not have an inherent presentation

aphillip: we're not talking about presentation
... We're talking about if I get a piece of text, I'm going to assume a base direction generally of ltr, and that will cause rtl text to display incorrectly

r12a: aaron, there are two aspects of rendering
... One aspect is that if you know the base direction is rtl and you have "{arabic} w3c" that would determine where "w3c" goes in relation to the arabic text
... And another aspect is where the entire line of text appears on the page, against the left margin or against the right
... Sometimes you might want to sequence things rtl but keep them on the lefthand side
... If you look at twitter and facebook, they are dealing simply with strings, and they detect rtl direction and move the text to the right side of the box. That's some processing their application does

aaronpk: what I'm actually trying to say is that while html is describing the presentation of the whole rendering of the page, AS2 does not talk about presentation at all. The presentation is left up to the consuming application. It feels wrong to use a mechanism that exists in a presentation format in a spec that does not talk about presentation

aphillip: I think you're missing the point. There are two kinds of presentation
... One is what you're talking about, layouts and that sort of thing
... What html is concerned with
... But the data itself has a direction.. the example Richard gave is which side of the arabic the letters "w3c" appear on; that depends on the base direction of that text regardless of where you present it
... That's a property of the text, not a property of the presentation of the text
... same on a teletype, html, etc

aaronpk: that's why I'm so interested in it being actually in the text, not as a property on the text

r12a: aaron, I wanted to get some background information out. The problems we have with control characters may be something we have to deal with in applications rather than AS2
... I wanted to go back to the question chris said, what are the options here
... It seems to me that the options we are looking at currently are either if we know that the thing should be rtl that we stick a control char at the beginning of the string, or we stick it in an extra field
... I'm not sure that we're saying you should necessarily have a direction property, partly because it's not specific enough when we have multiple strings within one object
... I'm just saying I think we have two options
... We change the string, or we put some metadata alongside each specific string where needed

aphillip: I think that's what's necessary
... you can't have one text direction property that applies to six strings

r12a: so which of those is the better approach

cwebber: The control character at the start of the string will be fine, but having the additional metadata as a separate property... instead of having say name : "text" having name: { object } I think is going to screw up implementations just as much as having html in there
... Most of the fields in this can have html

<cwebber2> {"name": "This is LTR", "nameDir": "ltr"}

cwebber: The vast majority of the fields in which this applies is kind of a non-issue. Only name you can't
... So what if ^

<KevinMarks_> ‮ inline works in reverse without implementers knowing

cwebber: Just solve this for name or a few small fields where html is not permitted
... if an implementation doesn't know how to pay attention to nameDir they were going to fail anyway
... It will maybe hit the best middle ground
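
Illustration (not part of the call): a minimal Python sketch of one way a consumer might use the `nameDir` property from the IRC example above when showing a name as plain text. Wrapping the string in Unicode directional isolates (LRI U+2066, RLI U+2067, FSI U+2068, closed by PDI U+2069) is an assumption made for this sketch, not something proposed in the discussion; `display_name` is a hypothetical helper.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE: the renderer falls back to first-strong detection
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: closes any of the three above


def display_name(obj):
    """Wrap the plain-text name in an isolate chosen from the sibling nameDir property."""
    name = obj.get("name", "")
    opener = {"ltr": LRI, "rtl": RLI}.get(obj.get("nameDir"), FSI)
    return opener + name + PDI


print(repr(display_name({"name": "This is LTR", "nameDir": "ltr"})))
print(repr(display_name({"name": "@w3c مرحبا", "nameDir": "rtl"})))
print(repr(display_name({"name": "No direction given"})))
```

An implementation that ignores `nameDir` still gets a usable plain-text name, which matches the middle-ground argument made above.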

<rhiaro> doesn't solve multiple directions in one name value though

scribe: or stick a control character at the start

<aaronpk> again you *need* to support control characters in strings in order to properly support bidirectional text (a string with text in both directions)

KevinMarks: The advantage of doing it with injected control characters is that it should work for anyone who is correctly using utf8
... whereas with an extra property we're creating extra work for anyone creating and displaying
... in terms of most likely preservation of intent, putting it directly in the utf8 seems to be the strongest way to do that
... Maybe adding a note that creating user agents should do that

<r12a> https://www.w3.org/International/wiki/Bidi_in_social_media

r12a: The additional wrinkle here, third thing at the bottom of that url
... There's a two line text input
... The top line needs to be treated ltr and the second line is rtl
... If you don't do that then text is in the wrong place
... The rest of the stuff there shows that twitter and facebook don't manage this very well
... If the name property has multiple lines in it (haven't seen examples of that yet) then it's not just a question of sticking a control character at the beginning of the string, it's putting it at the beginning of each line
... Same applies with summary and content where you have html

<aphillip> line == paragraph

r12a: Perhaps it's more likely, where you have multiple paragraphs
... You probably ought to establish the basedir for each paragraph
... Or you could put a wrapper around the whole thing like <div dir="rtl">
... There are intricacies in there I'm not terribly clear about
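
Illustration (not part of the call): a minimal Python sketch of the multi-line case, where a mark is needed at the start of every line (paragraph) rather than only at the start of the string. It assumes the creating tool knows the intended direction of each line; `mark_paragraphs` is a hypothetical helper.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def mark_paragraphs(text, directions):
    """Prefix each newline-separated paragraph with the mark for its own base direction."""
    lines = text.split("\n")
    if len(lines) != len(directions):
        raise ValueError("need exactly one direction per paragraph")
    marks = {"ltr": LRM, "rtl": RLM}
    return "\n".join(marks[d] + line for d, line in zip(directions, lines))


two_lines = "W3C standards\nW3C معايير"
marked = mark_paragraphs(two_lines, ["ltr", "rtl"])
print(repr(marked))
```

A single mark at the start of the whole string would only affect the first paragraph, so the second line would still be laid out from its own first strong character.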

jasnell: Whatever we do with the metadata, however we indicate this base direction, there is definitely a tradeoff cost
... We already have some complexity of name and nameMap
... I'm suspecting that the property approach is probably going to be the most reliable for the base direction. Some combination of this property and the control codes
... But we need to take that time to balance the approach against existing complexity of name vs nameMap
... We should take our time, put together a proposal

r12a: I'll try to provide some tests you can use

jasnell: appreciate that

aphillip: do you all want to come back next week? How shall we proceed?

jasnell: works for me

aphillip: I will reserve time next week to discuss language
... If there are proposals for how to discuss direction further, do we want to use a particular list or github issue for that discussion?
... Preferences?

jasnell: if we can get a proposal in place by then we can discuss it then

aphillip: it's taken years of our lives, so don't be surprised..

AOB?

aphillip: thanks social, I'll reserve time next week
