OWL + Internationalization

21 Jul 2008

See also: IRC log


Scribe: Felix, Sandro




<aphillip_> Axel, could you send a message to everyone saying that we'll use this channel? I have to disconnect my email to use IRC :-(

<AxelPolleres> yup

<aphillip_> hi, are we public or member-only here?

<sandro> public on a # channel

<fsasaki> public

<aphillip_> i18n is public... I'm actually asking if we should make minutes public

<sandro> Ah.

<sandro> public.

<bmotik> What is the conference code?

<aphillip_> 4878

<AxelPolleres> Jie had collected a list of issues... (which I extend a bit)

<AxelPolleres> Open issues for further discussion include:

<AxelPolleres> * The choice of name space. Alternatives include "rif", "owl", "rdf" or "xsd". Note that the RIF Working Group [7] did not put "rif:text" into the xsd (XML Schema) namespace because such a datatype is not considered primitive.

<AxelPolleres> * The construct's name, e.g., "text" or "internationalizedString".

<AxelPolleres> * In language tag pattern matching, whether to allow case-insensitive matching [8].

<AxelPolleres> * Whether to supersede RFC 3066 with RFC 4646 (Tags for Identifying Languages)

<AxelPolleres> * Should we define our own datatype hierarchy?

<AxelPolleres> * Should the subtag hierarchy have semantic implications?

<baojie> Axel: could you paste the link to your extended issue list here?

<AxelPolleres> I didn't put it online yet, took your list as a basis:

<AxelPolleres> http://www.w3.org/2007/OWL/wiki/InternationalizedString#The_Proposal_owl:langPattern_.28OWL_Working_Group.29

<fsasaki> scribe: Felix

<fsasaki> scribeNick: fsasaki

<scribe> meeting: OWL / i18n meeting

axel: two proposals:
... how to deal with internationalized text
... one proposal to have one data type, or a hierarchy of data types
... this has different implications depending on where we go
... sub typing would have semantic implication
... but not sure if the semantic implication is really wanted
... would "en" vs "en-us" mean that if we query for "en" we also get "en-us"?
... if we would have a type hierarchy we would get that
... even if we do that we need to make clear: how to define the value spaces and lexical spaces
... if we have just one data type, as a lexical space we have strings with the "@" sign in the language tag
... with one data type, we would have pairs of language tags and string parts
... another question whether we go for RFC 4646 or RFC 3066 (which seems to be obsolete)
... opinions on what I said?
... I think data type hierarchy is feasible, but not sure if it can be done easily
... not sure how we could have semantic implication of the data type

<AxelPolleres> who is speaking???

addison: language tag structure is important for operations like matching
... most W3C technologies are not designed for dealing with structured strings
... if I ask for let's say an i18n string in English, how do I get all English "en", "en-us" etc. with one request
... if you construct the hierarchy of tags, as with sub tags
... there is a lot of machinery for a straightforward kind of thing

<AxelPolleres> Also, the sub-tags are not *only* about sub-strings, right?

addison: matching algorithm is very simple string matching
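
The simple string matching mentioned here can be sketched as follows. This is a minimal illustration, assuming the basic filtering rule of RFC 4647: a range matches a tag when the two are equal, or when the range is a prefix of the tag that ends right before a "-", compared case-insensitively. The names `BasicFilter` and `matches` are hypothetical and appear in no specification.

```java
import java.util.Locale;

class BasicFilter {
    // Basic filtering in the style of RFC 4647: the range matches the tag when
    // it equals the tag, or is a prefix of the tag followed by a hyphen.
    static boolean matches(String range, String tag) {
        if (range.equals("*")) {
            return true; // the wildcard range matches every tag
        }
        // Language tags are compared case-insensitively.
        String r = range.toLowerCase(Locale.ROOT);
        String t = tag.toLowerCase(Locale.ROOT);
        return t.equals(r) || t.startsWith(r + "-");
    }
}
```

With this rule, a request for "en" returns both "en" and "en-us" in one pass, but not an unrelated tag such as "eng", because the prefix must end at a subtag boundary.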

boris: axel said if we don't go for the data type hierarchy we can't have implications
... I'm not sure about that
... currently OWL says we need a value space consisting of pairs
... I think even if we have a hierarchy these strings need to be different
... if we agree that the value space is a set of pairs (string, string) I don't see a lot of differences
... we could provide a regex based on the 2nd element of the pair
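
A rough sketch of this idea, assuming the value space is a set of (string, language tag) pairs and that selection is done by a regular expression applied only to the second element. The class names `TextValue` and `PairFilter`, and the example pattern in the usage note, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// One element of the value space: a pair (string, language tag).
final class TextValue {
    final String text;    // the string part
    final String langTag; // the language tag; "" when there is none

    TextValue(String text, String langTag) {
        this.text = text;
        this.langTag = langTag;
    }
}

class PairFilter {
    // Keep only the pairs whose second element (the tag) matches the regex.
    static List<TextValue> byTagRegex(List<TextValue> values, String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<TextValue> out = new ArrayList<>();
        for (TextValue v : values) {
            if (p.matcher(v.langTag).matches()) {
                out.add(v);
            }
        }
        return out;
    }
}
```

For example, the pattern `en(-.*)?` selects the pairs tagged "en" and "en-US" while leaving ("chat", "en") and ("chat", "en-US") as two distinct values.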

Addison: concern: regexes are good but limiting; they don't understand what language tags are about

sandro: Addison said using heavy semantics is overkill, but now you say regex is not enough

Addison: regex is not enough, they could work, but the ones for language tags are a bit complicated

<aphillip_> static final String langtag_ex =

<aphillip_> "(\\A[xX]([\\x2d]\\p{Alnum}{1,8})*\\z)"

<aphillip_> + "|(((\\A\\p{Alpha}{2,8}(?=\\x2d|\\z)){1}"

<aphillip_> + "(([\\x2d]\\p{Alpha}{3})(?=\\x2d|\\z)){0,3}"

<aphillip_> + "([\\x2d]\\p{Alpha}{4}(?=\\x2d|\\z))?"

<aphillip_> + "([\\x2d](\\p{Alpha}{2}|\\d{3})(?=\\x2d|\\z))?"

<aphillip_> + "([\\x2d](\\d\\p{Alnum}{3}|\\p{Alnum}{5,8})(?=\\x2d|\\z))*)"

<aphillip_> + "(([\\x2d]([a-wyzA-WYZ](?=\\x2d))([\\x2d](\\p{Alnum}{2,8})+)*))*"

<aphillip_> + "([\\x2d][xX]([\\x2d]\\p{Alnum}{1,8})*)?)\\z";

Addison: the structure of language tags with "-" makes it sometimes difficult to use regex

boris: if we can agree on value space I don't see an issue

<AxelPolleres> addison, did I understand correctly that your "exceptions" from the typical subtag pattern (reg-expressions) could be more or less "read off" from http://www.iana.org/assignments/language-subtag-registry

<AxelPolleres> ?

sandro: think that the pair is the right way to go

axel: in the pair would "en" and "en-us" be disjoint?

<aphillip_> axel, exceptions are a very small list in the registry (or in rfc)

boris: we are dealing with language of content here
... if you talk about values you have to distinguish "en" and "en-us"
... you could have a data type which is called lang-en which includes all values of "en", but that's a "class" thing
... first question is whether we deal with one or two values, hierarchy is secondary question
... you can apply a function saying "give me all Chinese tags"
... there is no need to put that into the semantics of the types, but have this in the built-in functions

axel: do you deal in RIF with data types and / or facets?

boris: what do you mean by facets here?

axel: we use facets e.g. from XML Schema, to create facets for e.g. integer
... how do you do that in RIF?
... in RIF so far we only define a basic set of data types
... we did not consider data type restrictions, facets at all yet

boris: something to start working from: we could agree on one data type (both working groups)
... e.g. "internationalized string"

<AxelPolleres> felix, it was the other way boris <-> axel

boris: value set is a set of pairs (string, string)

Addison: question is how to deal with semantics of 2nd string
... XML Schema has a type "xs:language"
... that is a string, it can be used to represent xml:lang

<AxelPolleres> please paste the link

<AxelPolleres> (just for completeness)

<aphillip_> http://www.w3.org/TR/xmlschema11-2/#language

boris: agree, would be better to refer to the language tag standard, RFC 3066

addison: better BCP 47

axel: afraid of referring to a moving target

addison: we (i18n core) have dealt with this elsewhere
... see e.g. XML Schema saying "RFC 3066 or its successor"

sandro: sounds OK to me

boris: is an empty language tag valid?

addison: it's not a valid language tag, but can be used e.g. in xml:lang

boris: Don't think that rdf land is affected here


<sandro> boris: the only thing rdf needs is that (x, "") be distinct from all (x, *)

<sandro> (where * is not "")

<AxelPolleres> boris' proposal to include xs:string as a special case sounds fine to me.

addison: can what you want be described as a standard form of language tag matching?

boris: yes

addison: BCP 47 has three algorithms for matching
... these include how to provide a list

boris: as long as you can use some regex we could be fine

<aphillip_> http://www.ietf.org/rfc/rfc4647.txt : extended filtering

addison: your approach is similar to extended filtering

<aphillip_> http://www.inter-locale.com/ID/draft-ietf-ltru-matching-15.html#extMatching
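
The extended filtering referenced here can be sketched as below. This is an illustrative reading of the extended filtering steps in RFC 4647, not a vetted implementation: subtags are compared case-insensitively, "*" in the range matches any run of subtags, and a single-character subtag (a singleton such as "x") in the tag stops the skipping. The names `ExtendedFilter` and `matches` are hypothetical.

```java
import java.util.Locale;

class ExtendedFilter {
    static boolean matches(String range, String tag) {
        // Split both sides into subtags; comparison is case-insensitive.
        String[] r = range.toLowerCase(Locale.ROOT).split("-");
        String[] t = tag.toLowerCase(Locale.ROOT).split("-");

        // The first subtags must match, unless the range starts with "*".
        if (!r[0].equals("*") && !r[0].equals(t[0])) {
            return false;
        }
        int ri = 1;
        int ti = 1;
        while (ri < r.length) {
            if (r[ri].equals("*")) {
                ri++;                      // wildcard: consume it and go on
            } else if (ti >= t.length) {
                return false;              // range needs more than the tag has
            } else if (r[ri].equals(t[ti])) {
                ri++;                      // subtags agree: advance both
                ti++;
            } else if (t[ti].length() == 1) {
                return false;              // singleton in the tag blocks skipping
            } else {
                ti++;                      // skip this tag subtag, try the next
            }
        }
        return true;                       // every range subtag was satisfied
    }
}
```

Under this reading, the range "de-*-DE" matches "de-DE" and "de-Latn-DE" but not "de" or "de-x-DE".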

<AxelPolleres> boris, you mean e.g. that lang:en would be a subtype of (internationalized) string that covers all those with langtag en plus its subtags?

boris: basic internationalized string data type would allow implementing something like that on top of this
... in OWL it may be not so easy since you are quantifying over ranges
... that might not be decidable; need to check expressibility

addison: yes, and language tags provide ways to deal with that
... it's more complicated than other types like integer

axel: confirm boris' proposal: have one type "i18n string"
... if you want to have a type that covers all English strings, that would be a sub type?

boris: yes
... that would be a sub set of the general value space

sandro: we get the same functionality, no matter if we do one data type or one per language tag?
... the lexical spaces would be different

boris: the value spaces are sub sets, that is most interesting

<AxelPolleres> basically, if i understand correctly, boris says, we can do the single datatype on top, and specify the type hierarchy below it afterwards.

axel: about boris' earlier proposal to include "string" in here:
... not sure how we could distinguish "string" from language tag "en" string

boris: for OWL "i18n" string, to be able to embed it into RDF
... we need a unique lexical representation
... proposal from OWL WG is:

<AxelPolleres> in (simple) RDF, btw "blabla" is different from "blabla"^^xs:string

boris: lexical space of i18n string: text of string, "@", language tag

<bmotik> String "abc" without langTag is "abc@"^^owl:internationalizedString

<bmotik> "abc"^^xsd:string

<bmotik> is equivalent to the previous one

<AxelPolleres> and we'd define xs:string as a subtype of internat.string which has exactly that value space? ... yes

<bmotik> "abc"@en

boris: obviously the lexical representation is not equivalent

<bmotik> What is "abc"@en?

boris: OWL WG says this is a syntactic shortcut for:

<bmotik> It is a syntactic shortcut for "abc@en"^^owl:internationalizedString

<AxelPolleres> Yup, we had talked about this shortcut in RIF, BTW, but not yet approved it.

boris: this is to be compatible with RDF and the representation syntax

<bmotik> You define "abc" as a syntactic shortcut for "abc@"^^owl:internationalizedString

boris: internally you can say that all literals have this structure
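
The mapping between the proposed lexical form (text, "@", language tag) and the value-space pair can be sketched as follows. One assumption is added here that the minutes do not state: the form is split at the last "@", which keeps any "@" inside the text itself, since language tags contain only letters, digits and hyphens. The names `TextLexicalForm`, `parse` and `toLexical` are hypothetical.

```java
class TextLexicalForm {
    // Returns {text, langTag}. Splits at the LAST "@" (an assumption of this
    // sketch), so "abc@" yields the pair ("abc", "").
    static String[] parse(String lexical) {
        int at = lexical.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("missing '@' separator: " + lexical);
        }
        return new String[] { lexical.substring(0, at), lexical.substring(at + 1) };
    }

    // Builds the lexical form from a pair; an empty tag gives "text@".
    static String toLexical(String text, String langTag) {
        return text + "@" + langTag;
    }
}
```

In this sketch the shortcut "abc"@en corresponds to `parse("abc@en")`, and a plain "abc" corresponds to `parse("abc@")` with an empty language tag.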

sandro: having the "@" sign in the string is kind of a hack
... it technically works but I'm worried that it is pretty ugly

boris: that is the reason why we have the syntactic shortcut

sandro: in the examples we use the shortcut, but the tools may or may not use the shortcut

<AxelPolleres> people aren't supposed to use it (just like people aren't supposed to use "a"^^rif:iri ... or no?)

<bmotik> "abc"^^lang:en

boris: otherwise you would always have to write the above

<sandro> "chat"@en ==="chat"^^lang:en

<sandro> instead of "chat@en"^^owl:internationalizedString

boris: do you agree that we still define the whole value space without the lexical space

sandro: seems fine by me, sounds like something which will not be serialized

boris: you could use owl:internationalizedString
... we could call it "text"

addison: better name
... using "internationalized" suggests that other strings are not internationalized

<AxelPolleres> "text" is simpler, agreement, it seems.

Addison: better not to introduce an artificial distinction of strings if it's not necessary

sandro: I'm OK with the @ sign, the other proposal looked nicer

axel: boris convinced me that we need the value space to treat all values differently
... we would not get that by type hierarchy with "lang"

<AxelPolleres> basically... from before: "basically, if i understand correctly, boris says, we can do the single datatype on top, and specify the type hierarchy below it afterwards."

<AxelPolleres> ..." boris, you mean e.g. that lang:en would be a subtype of (internationalized) string that covers all those with langtag en plus its subtags?"

<AxelPolleres> lang:en *could* be defined by a reg exp.

boris: using lang "en" in the lexical space might be confusing

addison: yes, in the matching document RFC 4647 you have two names: the "sub tag" and the "range" which says "the thing a tag starts with"

boris: lang data type is rather used for querying
... by not allowing a particular lexical representation we would make this clear

addison: that gets you out of the problem that in language ranges you can have "*", but not in language tags

<AxelPolleres> The other extreme would be that "abc"^^lang:en is indeed the same as "abc"^^lang:en-us... just like "1.0"^^xs:decimal is indeed the same as "1"^^xs:integer ... this seems not to be wanted, yes?

boris: we could put some strong wording to the spec to make this clear
... we need to specify what the allowed items are

<aphillip_> extended-language-range = (1*8ALPHA / "*")

<aphillip_> *("-" (1*8alphanum / "*"))

addison: I would go to extended language range from RFC 4647, see above
... reference those for matching
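
The extended-language-range grammar pasted above translates fairly directly into a regular expression. The sketch below checks syntax only and does not consult the subtag registry. The names `ExtendedLanguageRange` and `isWellFormed` are hypothetical.

```java
import java.util.regex.Pattern;

class ExtendedLanguageRange {
    // extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
    private static final Pattern SYNTAX = Pattern.compile(
        "(?:[A-Za-z]{1,8}|\\*)"             // first subtag: 1-8 letters or "*"
        + "(?:-(?:[A-Za-z0-9]{1,8}|\\*))*"  // further subtags: 1-8 alphanumerics or "*"
    );

    // True when the whole string fits the grammar above.
    static boolean isWellFormed(String range) {
        return SYNTAX.matcher(range).matches();
    }
}
```

Unlike a language tag, a range accepted by this check may contain "*" in any position, for example "*-CH" or "de-*-DE".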

<sandro> scribe: Sandro

<scribe> scribenick: sandro

Addison: If you just say it's a string, then you're going to not be helping people very much. There is infrastructure for this, and it would be good to reference that.

<AxelPolleres> What I mean is "lang:en" is then no longer a datatype, but a built-in function.

Boris: Axel is saying that *matching* is a more appropriate operation on this data type -- a built-in in RIF, a facet in OWL, that refers to RFC 4647.
... In OWL, these facets are relatively easy. For strings you have regexp pattern facets, restricting the string. We could easily introduce a language-range facet, which is a query in this RFC 4647 language. All pairs which match this query. RIF could have a similar built-in function.

Axel: Our conclusion -- a real datatype hierarchy is not practical. Given that, this all sounds fine.

sandro: My inclination is rdf: as the prefix.

<AxelPolleres> rdf:text?

<AxelPolleres> +1

boris: I don't care.

<baojie> I will summarize the recent emails and the discussion in an update document

Addison: sounds good to me.

<AxelPolleres> boris said: I'm fine, I don't care (slight difference)

<AxelPolleres> :-)

<baojie> http://www.w3.org/2007/OWL/wiki/InternationalizedString#

<baojie> http://www.w3.org/2007/OWL/wiki/InternationalizedStringSpec

<baojie> ok

Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.133 (CVS log)
$Date: 2008/07/21 18:08:40 $
