InternationalizedString


Document title:
OWL 2 Web Ontology Language
Proposals and Issues of Internationalized Strings in OWL (Second Edition)
Authors
Jie Bao, Rensselaer Polytechnic Institute
Abstract
This document summarizes the efforts to specify internationalized strings in OWL 2.
Status of this Document
This is an editors' draft.

1 Introduction

An internationalized string is a string paired with a tag that identifies the natural language in which the string is written. The OWL Working Group has considered introducing a new language construct to support this feature, namely owl:internationalizedString. The RIF Working Group has a parallel effort, rif:text, with closely related functionality, and it has been suggested that the two working groups jointly choose the language construct for internationalized strings [1][2]. This document summarizes the existing proposals and related open issues.

2 Existing Proposals and Specifications

2.1 Internationalized Strings in RDF

Resource Description Framework (RDF): Concepts and Abstract Syntax has a syntax for plain literals with language tags, namely "<string>"@<tag>:

"Plain literals have a lexical form and optionally a language tag as defined by [RFC-3066], normalized to lowercase."

Note: RFC 3066 is "Tags for the Identification of Languages".
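
As a minimal sketch of the literal model quoted above (an illustration, not part of the RDF specification), a plain literal can be held as a lexical form plus an optional language tag that is lowercased on construction. The class name PlainLiteral and the method to_syntax are hypothetical, and no escaping or tag validation is attempted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlainLiteral:
    """Hypothetical model of an RDF plain literal: lexical form plus optional tag."""
    lexical_form: str
    language: Optional[str] = None

    def __post_init__(self):
        # The quoted text describes the language tag as normalized to lowercase.
        if self.language is not None:
            object.__setattr__(self, "language", self.language.lower())

    def to_syntax(self) -> str:
        # Render in the "<string>"@<tag> form; escaping is deliberately omitted.
        quoted = f'"{self.lexical_form}"'
        return quoted if self.language is None else f"{quoted}@{self.language}"
```

With this normalization, PlainLiteral("Hello", "en-US") and PlainLiteral("Hello", "en-us") compare equal.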

2.2 The Proposal rif:text (RIF Working Group)

The RIF Data Types and Built-Ins document specifies that:

"rif:text (for text strings with language tags attached).

This symbol space represents text strings with a language tag attached. The lexical space of rif:text is the set of all Unicode strings of the form ...@LANG, i.e., strings that end with @LANG where LANG is a language identifier as defined in [RFC-3066]."

2.3 The Proposal owl:internationalizedString (OWL Working Group)

The OWL 2 Structural Specification and Functional-Style Syntax provides the owl:internationalizedString construct:

"The unary datatype owl:internationalizedString represents pairs of strings and language tags, and thus represents plain RDF literals with a language tag. The lexical space of this datatype is a string of the form "text@languageTag"; thus, the text of each lexical value before last @ sign is the actual text, and the text after the last @ sign is the language tag of the constant. Each such lexical value is interpreted as a pair <"text",languageTag>."

and

"The constants of datatype owl:internationalizedString can be written simply as "lexical value"@language." [3]

"Padre de familia"@es is an abbreviation for the internationalized constant "Padre de familia@es"^^owl:internationalizedString, that is, a pair consisting of the string "Padre de familia" and the language tag es denoting the Spanish language. Note that the lexical values of owl:internationalizedString constants are strings that contain the actual string value, the @ sign, and the language tag, without any spaces between them.

OWL Model-Theoretic Semantics specifies that:

"The value space for owl:internationalizedString consists of all pairs <"text",languageTag>, where "text" is a string and languageTag is a language tag."

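The lexical form described in this section can be illustrated with a short sketch that splits a lexical value at its last @ sign to obtain the pair <"text",languageTag>. The function names are hypothetical, and the sketch assumes nothing about which language tags are valid; it only rejects values that contain no @ sign at all.

```python
from typing import Tuple


def parse_lexical_value(lexical: str) -> Tuple[str, str]:
    """Split 'text@languageTag' at the last @ sign into (text, languageTag)."""
    text, sep, tag = lexical.rpartition("@")
    if not sep:
        raise ValueError("lexical value contains no @ sign")
    return text, tag


def expand_abbreviation(text: str, tag: str) -> str:
    """Build the lexical value that "text"@tag abbreviates (no spaces around @)."""
    return f"{text}@{tag}"
```

Because the split happens at the last @ sign, a text that itself contains @ (for example an e-mail address) is kept intact: "user@example.org@en" yields the text "user@example.org" and the tag "en".
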
2.4 The Proposal owl:langPattern (OWL Working Group)

Another proposal in the OWL Working Group from Peter F. Patel-Schneider [4] specifies how to match language tags:

"...Using the already-existing facilities for datatype restrictions to select on language tags.

This would add a new datatype facet, langPattern - owl:langPattern in RDF - that would be applicable only to owl:internationalizedString. The meaning of this facet would be to match the *value* of the language tag against the pattern, using the same algorithm as in XML Schema Datatypes.

As well, owl:internationalizedString would admit the length, minLength, maxLength, and pattern facets, which would be applied to the string part of the literal.

So, strings in English or dialects of English would be

DatatypeRestriction(owl:internationalizedString langPattern "en*")"

Follow-up comments about language pattern matching: Addison Phillips Jul 8, Axel Polleres Jul 10, Axel Polleres Jul 14, Felix Sasaki Jul 14, Addison Phillips Jul 14
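
To make the matching question concrete, the sketch below reads a langPattern value literally as a regular expression that must match the whole language tag, which is how pattern facets behave in XML Schema Datatypes. Whether the proposal intended this literal reading is not stated in the quoted text, so treat it as an assumption. Python's re module stands in for the XML Schema regular-expression dialect, which is only an approximation, and the function name and the ignore_case parameter are hypothetical.

```python
import re


def lang_pattern_matches(tag: str, pattern: str, ignore_case: bool = False) -> bool:
    """Return True if the whole language tag matches the pattern.

    Assumption: the pattern is an anchored regular expression, approximated
    here with Python's re.fullmatch rather than a real XML Schema regex engine.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, tag, flags) is not None
```

Under this reading, "en*" means an "e" followed by any number of "n" characters, so it matches "en" but not "en-US"; a pattern such as "en(-.*)?" would be needed to cover dialects of English. The ignore_case parameter corresponds to the open question of whether matching should be case insensitive.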

2.5 The Proposal of a String Datatype Hierarchy

From Axel Polleres:

"A probably more feasible solution would be to do a real type hierarchy, for language tags and - instead of a datatype owl:internationalizedString or rif:text which has pairs of strings and language tags as lexical space - define separate datatypes and (subtypes) for each lang-tag, ie.

use:

message("Hello"^^lang:en-US)

where e.g. lang:en-US is a subtype of lang:en, i.e. that would also imply

message("Hello"^^lang:en)"

In a follow-up comment, Boris Motik expressed concerns about the semantic effects of the proposal.
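
The subtype relation in this proposal can be sketched by dropping trailing subtags from a language tag. The quoted text only states that lang:en-US is a subtype of lang:en; generalizing this to tags with more than two subtags is an assumption of the sketch, as are the function names. Comparison is exact, so case differences are not reconciled.

```python
from typing import List


def supertypes(tag: str) -> List[str]:
    """List the ancestors of a language tag, nearest first, by dropping trailing subtags."""
    subtags = tag.split("-")
    return ["-".join(subtags[:i]) for i in range(len(subtags) - 1, 0, -1)]


def is_subtype(tag: str, other: str) -> bool:
    """True if tag equals other or has other among its ancestors."""
    return tag == other or other in supertypes(tag)
```

For example, is_subtype("en-US", "en") holds, which mirrors the implication from message("Hello"^^lang:en-US) to message("Hello"^^lang:en), while is_subtype("en", "en-US") does not.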

3 Open Issues

Because the proposals above are closely related in intended functionality, it has been suggested [5] [6] that the RIF and OWL Working Groups work together to choose a language construct.

Open issues for further discussion include:

  • The choice of namespace. Alternatives include "rif", "owl", "rdf" or "xsd".
    • Note that the RIF Working Group [7] did not put "rif:text" into the xsd (XML Schema) namespace because such a datatype is not considered primitive.
    • Sandro Hawke: One odd thing about using the RDF namespace is that the rdf:text datatype will never be used in (existing) RDF serializations, because they already have a way to serialize such data. Happily, this lets us avoid worrying about the constraint in RDF Syntax, "Any other names [in the RDF namespace] are not defined and SHOULD generate a warning when encountered". We should note this in the spec, I think. Note also that future RDF serializations might choose to use this, so they don't have to special-case language-tagged strings.
  • The construct's name, e.g., "text" or "internationalizedString".
    • Addison Phillips: I'm not sure I like the name "internationalizedString". I realize that this is an expansion on xsd:string and thus needs a different name. However, it implies that other strings are somehow "not internationalized". Perhaps something along the lines of "languageString", "nlString" (nl for natural language), or similar.
  • In language tag pattern matching, whether to allow case-insensitive matching [8].
  • Whether to supersede RFC 3066 with RFC 4646 (Tags for Identifying Languages) or BCP 47.
  • Whether to define a dedicated datatype hierarchy.
  • Whether the subtag hierarchy should have semantic implications.
  • Whether to use Unicode as the standard for strings, since it has composed characters. [Boris Motik]
  • Comments from Addison Phillips about OWL i18n issues that are not about language tags.

The corresponding sections of The OWL 2 Structural Specification and Functional-Style Syntax and OWL Model-Theoretic Semantics will be updated once agreement is reached.

4 Meetings

4.1 Joint meeting of OWL, RIF and I18N WGs, July 21, 2008

IRC Log

Some conclusions:

  • a datatype hierarchy may not be practical
  • name: "text" is preferred in the meeting (simpler)
  • namespace: "rdf" is preferred in the meeting


4.2 Joint meeting of OWL, RIF and I18N WGs, August 11, 2008

Meeting Summary

5 References

[RFC-4646]
RFC 4646 - Tags for Identifying Languages. A. Phillips and M. Davis. IETF, September 2006, http://www.ietf.org/rfc/rfc4646.txt. Latest version is available as BCP 47.
[UNICODE]
The Unicode Standard, Version 5.1.0. The Unicode Consortium.