Difference between revisions of "PlainLiteral"

From OWL
(Syntax)
(Semantics)
(11 intermediate revisions by one user not shown)
Line 36: Line 36:
 
=== Syntax ===
 
=== Syntax ===
  
<p>The lexical space of <tt>rdf:text</tt> is the set of all Unicode strings of the form '<i>text</i>@<i>lang</i>', i.e., Unicode strings that end with '@<i>lang</i>' where '<i>lang</i>' is a language identifier as defined in [BCP-47] or the empty string. Thus, the <i>text</i> before the last '@' sign represents the actual text, and the text after the last '@' sign is the (possibly empty) language tag of the constant. Each such lexical value is to be interpreted as a pair ("<i>text</i>","<i>lang</i>").</p>
+
<p>The lexical space of <tt>rdf:text</tt> is the set of all strings of the form '<i>text</i>@<i>lang</i>', i.e., strings that end with '@<i>lang</i>' where '<i>lang</i>' is a language identifier as defined in [BCP-47] or the empty string. Thus, the <i>text</i> before the last '@' sign represents the actual text, and the text after the last '@' sign is the (possibly empty) language tag of the constant. The character set of '<i>text</i>' is assumed to be isomorphic with the set of integers, that is, infinite; however, a <tt>rdf:text</tt> string is valid only if its <i>text</i> part is a Unicode string.</p>
 +
 
 +
<!--
 +
Each such lexical value is to be interpreted as a pair ("<i>text</i>","<i>lang</i>").
 
{{Review|[[User:Baojie|Baojie]] 10:49, 27 August 2008 (UTC)|I think the last sentence can be removed, since it is about semantics}}
 
{{Review|[[User:Baojie|Baojie]] 10:49, 27 August 2008 (UTC)|I think the last sentence can be removed, since it is about semantics}}
 +
-->
  
 
<p>For presentation syntaxes that use the form</p>
 
<p>For presentation syntaxes that use the form</p>
Line 46: Line 50:
  
 
<div class="grammar">
 
<div class="grammar">
TextConstant        ::= '"' <i>UNICODESTRING</i> LangTag '"^^<<nowiki>http://</nowiki>www.w3.org/1999/02/22-rdf-syntax-ns#text>'  | TextShort
+
<span class="nonterminal">TextConstant</span>         ::= '"' <i>UNICODESTRING</i> LangTag '"^^<<nowiki>http://</nowiki>www.w3.org/1999/02/22-rdf-syntax-ns#text>'   
LangTag              ::= '@' <i>LANGTAG</i> | /* Empty */
+
| <span class="nonterminal">TextShort</span><br/>
TextShort            ::= '"' <i>UNICODESTRING</i> '"' LangTag
+
<span class="nonterminal">LangTag</span>             ::= '@' <i>LANGTAG</i> | /* Empty */ <br/>
 
+
<span class="nonterminal">TextShort</span>           ::= '"' <i>UNICODESTRING</i> '"' <span class="nonterminal">LangTag</span>
 
</div>
 
</div>
  
Line 61: Line 65:
  
 
The value space <tt>rdf:text</tt> is the set of all pairs of the form ("<i>text</i>","<i>lang</i>"), where <i>text</i> is a Unicode character sequence and <i>lang</i> is a lowercase Unicode character sequence which is a natural language identifier as defined by [BCP-47]. The lexical-to-value-space mapping of <tt>rdf:text</tt>, denoted L<sub><tt>rdf:text</tt></sub>, maps each symbol "<i>text</i>@<i>lang</i>" in the lexical space of <tt>rdf:text</tt> to ("<i>text</i>",lower-case("<i>lang</i>")), where lower-case("<i>lang</i>") is "<i>lang</i>" written in all-lowercase letters.
 
The value space <tt>rdf:text</tt> is the set of all pairs of the form ("<i>text</i>","<i>lang</i>"), where <i>text</i> is a Unicode character sequence and <i>lang</i> is a lowercase Unicode character sequence which is a natural language identifier as defined by [BCP-47]. The lexical-to-value-space mapping of <tt>rdf:text</tt>, denoted L<sub><tt>rdf:text</tt></sub>, maps each symbol "<i>text</i>@<i>lang</i>" in the lexical space of <tt>rdf:text</tt> to ("<i>text</i>",lower-case("<i>lang</i>")), where lower-case("<i>lang</i>") is "<i>lang</i>" written in all-lowercase letters.
 +
 +
'''Note''': Several Unicode points might denote one logical character. This may affect some operations, e.g., length counting and string comparison, on <tt>rdf:text</tt> strings. An implementation of this specification SHOULD provide such operations on Unicode code point level. However, there are also other alternatives that can be adopted optionally by applications, e.g., defining those operations on Unicode normalized form [http://www.w3.org/TR/2004/WD-charmod-norm-20040225/#sec-UnicodeNormalized].
  
 
==== <tt>xs:string</tt> as a restriction of <tt>rdf:text</tt> ====
 
==== <tt>xs:string</tt> as a restriction of <tt>rdf:text</tt> ====
Line 80: Line 86:
 
: <cite>[http://www.ietf.org/rfc/rfc4646.txt RFC 4646 - Tags for Identifying Languages]</cite>. M. Phillips and A. Davis. IETF, September 2006, http://www.ietf.org/rfc/rfc4646.txt. Latest version is available as BCP 47, ([http://www.w3.org/International/core/langtags/rfc3066bis.html details]) .
 
: <cite>[http://www.ietf.org/rfc/rfc4646.txt RFC 4646 - Tags for Identifying Languages]</cite>. M. Phillips and A. Davis. IETF, September 2006, http://www.ietf.org/rfc/rfc4646.txt. Latest version is available as BCP 47, ([http://www.w3.org/International/core/langtags/rfc3066bis.html details]) .
 
; <span id="ref-unicode">[UNICODE]</span>
 
; <span id="ref-unicode">[UNICODE]</span>
: <cite>[http://www.unicode.org/unicode/standard/versions/ The Unicode Standard, Version 5.1.0]</cite>. The Unicode Consortium.
+
: <cite>[http://www.unicode.org/unicode/standard/versions/ The Unicode Standard]</cite>. The Unicode Consortium.
 +
; <span id="ref-UnicodeNormalized">[Unicode Normalized]</span>
 +
: <cite>[http://www.w3.org/TR/2004/WD-charmod-norm-20040225/#sec-UnicodeNormalized Character Model for the World Wide Web 1.0: Normalization]</cite>. W3C.
 
; <span id="ref-unicode">[IRC Log July 21, 2008]</span>
 
; <span id="ref-unicode">[IRC Log July 21, 2008]</span>
 
: <cite>[http://www.w3.org/2008/07/21-i18n-irc Joint meeting of OWL, RIF and I18N WGs.]</cite>.  
 
: <cite>[http://www.w3.org/2008/07/21-i18n-irc Joint meeting of OWL, RIF and I18N WGs.]</cite>.  

Revision as of 16:56, 3 September 2008



Document title:
OWL 2 Web Ontology Language
Internationalized Strings in RIF and OWL (Second Edition)
Authors
Jie Bao, Rensselaer Polytechnic Institute
Axel Polleres, DERI Galway at the National University of Ireland, Galway, Ireland
Abstract
This document presents the specification for a primitive datatype representing internationalized text that is used both in the RIF and OWL languages.
Status of this Document
This is an editors' draft being developed jointly for RIF and OWL. Please send comments and questions to public-rdf-text@w3.org (public archive).

1 Introduction

Internationalized texts, that is, texts that additionally carry a language tag, are used in several existing W3C specifications, such as RDF, XML, OWL, and RIF. In order for specifications like RDF, RIF, or OWL to refer to such internationalized text literals, those specifications require a special datatype, which the present document provides. Several W3C working groups, including the OWL WG and the RIF WG, have made parallel efforts to support internationalized strings. Collaboration between these two working groups on the choice of language constructs for internationalized strings has led to the present specification [1][2]. This document gives an initial specification for both groups to discuss.

Parts of this document are based on the current work on rif:text[3] (RIF WG) and owl:internationalizedString[4] (OWL WG); for more details, see a summary.

2 Specification (Draft)

Throughout this document we use the following namespace prefixes:

  • the xs: prefix stands for the XML Schema namespace URI http://www.w3.org/2001/XMLSchema#
  • the rdf: prefix stands for http://www.w3.org/1999/02/22-rdf-syntax-ns#

Internationalized texts defined in this document form a primitive datatype. The datatype is identified by the URI rdf:text (http://www.w3.org/1999/02/22-rdf-syntax-ns#text) and represents pairs of strings and language tags. Beyond its use in the RIF and OWL specifications, this datatype is expected to supersede RDF's plain literals with language tags, cf. [5], which is why a URI with the rdf: prefix was chosen to identify the datatype.

In the following, we define the syntactic representation (also referred to as the lexical space), together with several shortcuts to be used in abstract syntaxes such as OWL's or RIF's presentation syntaxes or RDF's TURTLE syntax, and then the value space and the lexical-to-value mapping for rdf:text.

2.1 Syntax

The lexical space of rdf:text is the set of all strings of the form 'text@lang', i.e., strings that end with '@lang' where 'lang' is a language identifier as defined in [BCP-47] or the empty string. Thus, the text before the last '@' sign represents the actual text, and the text after the last '@' sign is the (possibly empty) language tag of the constant. The character set of 'text' is assumed to be isomorphic with the set of integers, that is, infinite; however, an rdf:text string is valid only if its text part is a Unicode string.
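
As a minimal sketch of the rule above, the following Python function splits a lexical form at its last '@' sign. The function name split_lexical_form is hypothetical and not part of this specification, and the sketch does not check that the tag is a well-formed [BCP-47] language identifier.

```python
from typing import Tuple


def split_lexical_form(lexical: str) -> Tuple[str, str]:
    """Split an rdf:text lexical form into (text, lang) at the last '@'."""
    # rpartition searches from the right, so any '@' inside the text part
    # stays in the text and only the last '@' acts as the separator.
    text, separator, lang = lexical.rpartition("@")
    if not separator:
        # Every rdf:text lexical form contains at least one '@' sign.
        raise ValueError("not an rdf:text lexical form: no '@' sign found")
    return text, lang
```

For instance, split_lexical_form("user@example.org@") keeps the first '@' as part of the text and yields an empty language tag.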


For presentation syntaxes that use the form

"literal"^^<datatype-identifier>

for constants that belong to a certain datatype, such as RIF's presentation syntax [6], OWL's functional-style syntax [7], or RDF's TURTLE syntax [8], we define some convenient shortcuts for rdf:text typed constants. The allowed shortcuts for rdf:text constants are defined by the following EBNF.

TextConstant  ::= '"' UNICODESTRING LangTag '"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#text>' | TextShort
LangTag  ::= '@' LANGTAG | /* Empty */
TextShort  ::= '"' UNICODESTRING '"' LangTag

Here, LANGTAG is a valid language tag according to [BCP-47], and UNICODESTRING is a Unicode string in which quotes are escaped and which may additionally use all the other escape sequences defined in [9] and [10].
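
To illustrate how the TextShort form relates to the full typed form in the grammar, the following Python sketch builds the full constant from a text and an optional language tag. The function name expand_text_short and the use of the prefixed name rdf:text in the output are assumptions made for illustration; the sketch performs no escaping and does not validate the tag.

```python
# Prefixed name for http://www.w3.org/1999/02/22-rdf-syntax-ns#text
RDF_TEXT = "rdf:text"


def expand_text_short(text: str, lang: str = "") -> str:
    """Return the full typed constant that the shortcut "text"@lang stands for."""
    # The '@' sign is always part of the lexical form, even when the
    # language tag is empty.
    return f'"{text}@{lang}"^^{RDF_TEXT}'
```

Under these assumptions, expand_text_short("Padre de familia", "es") gives "Padre de familia@es"^^rdf:text, and calling it without a tag gives "Padre de familia@"^^rdf:text.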

"Padre de familia"@es is an abbreviation for the internationalized text constant "Padre de familia@es"^^rdf:text, that is, a pair consisting of the string "Padre de familia" and the language tag es denoting the Spanish language. Note that the lexical values of rdf:text constants are strings that contain the actual string value, the @ sign, and the language tag, without any spaces between them. In turn, "Padre de familia", i.e., the string without a language tag, is an abbreviation for the internationalized text constant "Padre de familia@"^^rdf:text. The latter abbreviation is usable for plain literals without a language tag in RDF, or as an alternative for constants in the xs:string datatype in RIF, cf. the section on the semantics of rdf:text below.

2.2 Semantics

The value space of rdf:text is the set of all pairs of the form ("text","lang"), where text is a Unicode character sequence and lang is a lowercase Unicode character sequence which is a natural language identifier as defined by [BCP-47]. The lexical-to-value-space mapping of rdf:text, denoted Lrdf:text, maps each symbol "text@lang" in the lexical space of rdf:text to ("text",lower-case("lang")), where lower-case("lang") is "lang" written in all-lowercase letters.
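
The lexical-to-value mapping described above can be sketched in Python as follows. The function name lexical_to_value is hypothetical, and the sketch assumes that Python's str.lower() is an acceptable stand-in for lower-case; it does not validate the tag against [BCP-47].

```python
from typing import Tuple


def lexical_to_value(lexical: str) -> Tuple[str, str]:
    """Map an rdf:text lexical form "text@lang" to the pair (text, lower-case(lang))."""
    # Split at the last '@' sign: everything before it is the text,
    # everything after it is the (possibly empty) language tag.
    text, separator, lang = lexical.rpartition("@")
    if not separator:
        raise ValueError("not an rdf:text lexical form: no '@' sign found")
    # Only the language tag is lowercased; the text is left untouched.
    return text, lang.lower()
```

One consequence of the mapping is that lexical forms differing only in the case of the language tag, such as "Hello@en-US" and "Hello@en-us", denote the same value.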

Note: Several Unicode code points might together denote one logical character. This may affect some operations on rdf:text strings, e.g., length counting and string comparison. An implementation of this specification SHOULD provide such operations at the Unicode code point level. However, applications can optionally adopt other alternatives, e.g., defining those operations on the Unicode normalized form [11].
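
The difference described in the note can be illustrated with a short Python sketch using the standard unicodedata module. The helper names are hypothetical, and the choice of the NFC normalization form is an assumption made for this example only; the note above does not prescribe a particular form.

```python
import unicodedata

# The same logical character written in two ways.
COMPOSED = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
DECOMPOSED = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT, two code points


def code_point_length(text: str) -> int:
    """Length at the code point level (Python 3 strings are code point sequences)."""
    return len(text)


def equal_after_nfc(left: str, right: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)
```

At the code point level the two spellings have lengths 1 and 2 and compare as unequal, whereas equal_after_nfc(COMPOSED, DECOMPOSED) treats them as the same string.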

2.2.1 xs:string as a restriction of rdf:text

xs:string can be viewed as a datatype derived from rdf:text in the following sense: the lexical space of xs:string, which is the set of all Unicode character sequences, maps one-to-one to the restriction of the lexical space of rdf:text to those Unicode character sequences that are terminated by '@'. Likewise, the value space of xs:string maps one-to-one to the pairs of Unicode character sequences with the empty language tag. That is, specifications like RIF or OWL may treat rdf:text values with an empty language tag and ordinary xs:string values equivalently, by equating the respective value spaces, that is, the unary value space of Unicode strings used for xs:string and the pairs of Unicode strings with an empty language tag.
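
The one-to-one correspondence between xs:string values and rdf:text values with an empty language tag might be sketched as a pair of inverse Python functions. Both function names are hypothetical and serve only to illustrate the correspondence.

```python
from typing import Tuple


def string_to_text(value: str) -> Tuple[str, str]:
    """Embed an xs:string value as an rdf:text value with the empty language tag."""
    return value, ""


def text_to_string(pair: Tuple[str, str]) -> str:
    """Inverse of string_to_text, defined only for pairs with an empty tag."""
    text, lang = pair
    if lang != "":
        # Pairs with a non-empty tag have no xs:string counterpart.
        raise ValueError("only rdf:text values with an empty language tag map to xs:string")
    return text
```

Round-tripping a string through string_to_text and text_to_string returns the original string, which is the sense in which the two value spaces can be equated.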

3 Open Issues

The corresponding sections in The OWL 2 Structural Specification and Functional-Style Syntax, OWL Model-Theoretic Semantics and RIF Data Types and Built-Ins will be updated once agreement is reached.

4 References

[RFC-4646]
RFC 4646 - Tags for Identifying Languages. M. Phillips and A. Davis. IETF, September 2006, http://www.ietf.org/rfc/rfc4646.txt. The latest version is available as BCP 47 (details).
[UNICODE]
The Unicode Standard. The Unicode Consortium.
[Unicode Normalized]
Character Model for the World Wide Web 1.0: Normalization. W3C.
[IRC Log July 21, 2008]
Joint meeting of OWL, RIF and I18N WGs.