Re: mlw-lt-track-ISSUE-131: URI scheme used in NIF conversion [MLW-LT Standard Draft] from Felix Sasaki on 2013-09-02 (public-multilingualweb-lt@w3.org from September 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 02 Sep 2013 23:09:17 +0200
To: public-multilingualweb-lt@w3.org
Message-ID: <5224FE7D.1040100@w3.org>

Hi all again,

Today I talked to Sebastian briefly about Dave's point:

[
If the URL used in the RDF for the NIF string subclass does not actually
need the char 'attributes' because we have nif:beginIndex and
nif:endIndex then is the rest of the URL redundant as we have that
information also (sort-of) explicitly in nif:wasConvertedFrom? If so why
even attempt to encode this information in the URL of the String
instance - could we just use any otherwise meaningful unique identifier
right?
]

Sebastian said that the purpose of the RDF class
http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#RFC5147String
is to express that the URI itself contains the offset information. So
dropping the offsets from the URI does not make sense.

This means that we still have the two options that the RDF WG gave us:
registering a fragment ID, or using URIs that encode the offset
information in the URI query part.

For this second option I'd like to propose one variant: we do not force
everybody to use the same service, but say that an "RFC5147String" URI
needs to encode the offsets in the query part, and that it is up to the
implementers how. So the following would be fine:

http://www.w3.org/its?resource=http://example.com/exampldoc.html&char=0,29
http://www.w3.org/its?resource=http://example.com/exampldoc.html&xpath=/html/body[1]/h2[1] 


but so would these:

http://example.com/myitsservice?input=http://example.com/exampldoc.html&char=0,29 

http://example.com/myitsservice?input=http://example.com/exampldoc.html&xpath=/html/body[1]/h2[1]
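
To make the variant concrete, here is a minimal Python sketch of how an
implementer might build such a query-based URI and read the offsets back
out of it. The function names and the idea of passing the service base
and the resource parameter name as arguments are assumptions for
illustration only; nothing here is defined by ITS 2.0 or NIF. Note also
that urlencode percent-encodes the embedded resource URL and the comma,
whereas the examples above are written unencoded; the reader below
accepts both forms.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_offset_uri(service, resource_param, resource, begin, end):
    """Build a URI that carries the character offsets in its query part.

    `service` and `resource_param` are chosen by the implementer
    (hypothetical names, e.g. ".../myitsservice" and "input").
    """
    query = urlencode({resource_param: resource, "char": f"{begin},{end}"})
    return f"{service}?{query}"


def read_offsets(uri):
    """Return (begin, end) taken from the "char" query parameter."""
    params = parse_qs(urlsplit(uri).query)
    begin, end = params["char"][0].split(",")
    return int(begin), int(end)


uri = build_offset_uri(
    "http://example.com/myitsservice",
    "input",
    "http://example.com/exampldoc.html",
    0,
    29,
)
print(uri)
print(read_offsets(uri))
```

The point of the sketch is only that the offsets stay recoverable from
the URI regardless of which service base or parameter name is used.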

The nice thing about this variant is that it is aligned with what
Sebastian did for his demos, see e.g.
http://nlp2rdf.lod2.eu/demo/NIFStanfordCore?input=Welcome+to+Dublin+in+Ireland&input-type=text
(which still uses "offset" instead of the RFC5147 "char=" syntax).

Again, please state your opinions in this thread and/or attend
Wednesday's call. Even if not many people implement RDF, we need to form
a WG response.

Thanks,

Felix

On 30.08.13 19:51, Felix Sasaki wrote:
> Hi all,
>
> to keep things together for tracker and the planned PR transition 
> request I am replying to Phil's
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Aug/0066.html 
>
> and Dave's mails
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Aug/0067.html 
>
> in this thread.
>
> Phil says in the above mail
> "I like option 1. of registering the char fragment id."
> My co-chair response to this is: we need to take into account that the 
> process of registering the char fragment id is not clear at all. The 
> guidelines that the RDF WG is citing
> http://www.w3.org/TR/fragid-best-practices/
> are *guidelines* - there is no identification of clear steps, no 
> timeline to expect, etc. So with our plan to finalize ITS2 this year, I 
> would advise against this option if there is no clear and "safe" 
> estimate of how long this would take.
>
> Dave is saying in his mail
>
> [
> If the URL used in the RDF for the NIF string subclass does not actually
> need the char 'attributes' because we have nif:beginIndex and
> nif:endIndex then is the rest of the URL redundant as we have that
> information also (sort-of) explicitly in nif:wasConvertedFrom? If so why
> even attempt to encode this information in the URL of the String
> instance - could we just use any otherwise meaningful unique identifier
> right?
>
> I only ask because that latter option might avoid any further confusion
> over the NIF examples in the spec, e.g. the query string option might
> still tempt the question of how its used, but there might be other NIF
> related implications I'm not aware of.
> ]
>
> Not a co-chair opinion, but my personal one: indeed, the NIF 
> conversion itself and also the testing we use on top of it 
> (validate.jar in github,
> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/nif-conversion/sparqltests 
>
> provided by Sebastian) do not rely on using "char=" as part of the URI 
> - and not at all on "#". Here is an example of converting a copy of a 
> wikipedia page
> http://sasakiatcf.com/felix/diverse/Biology
> with this service
> https://github.com/fsasaki/its20-extractor/tree/master/wikipedia-extractor 
>
> you can transform it to NIF and have "char" with "#" or anything 
> else, see e.g.
>
> http://tinyurl.com/plhk9qz
>
> That more or less proves that the "#" in the URI is not relevant for 
> the conversion at all.
>
> Now, one suggestion behind the RDF WG proposal is probably "make sure 
> that each URI resolves to something". With both solutions 1) and 2) we 
> can achieve that. But with 2) it is actually up to the implementer 
> what the URI resolves to: 2) says "The WG uses a different URI scheme, 
> trying to avoid fragment ids", but a URI with a query part and 
> "&"-separated parameters can also be generated by the above service. 
> So if we use the URIs just as identifiers (as Dave suggested), without 
> a new URI scheme, we can still meet the "linked data" requirement that 
> they can be resolved. It is just up to the implementers to realize 
> that resolution.
>
> best,
>
> Felix
>
> On 28.08.13 18:10, MultilingualWeb-LT Working Group Issue 
> Tracker wrote:
>> mlw-lt-track-ISSUE-131: URI scheme used in NIF conversion [MLW-LT 
>> Standard Draft]
>>
>> http://www.w3.org/International/multilingualweb/lt/track/issues/131
>>
>> Raised by: Felix Sasaki
>> On product: MLW-LT Standard Draft
>>
>> Copied from
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Aug/0057.html 
>>
>>
>> Felix,
>>
>> this is the official review of the RDF WG on the ITS Draft, more 
>> exactly the NIF conversion section[1]. The RDF WG discussed the issue 
>> and took a resolution on this response[2]
>>
>> The problem we see in the conversion algorithm is the URI-s that the 
>> algorithm generates, namely the URI-s of the form
>>
>> <http://example.com/exampledoc.html#char=0,29>
>> <http://example.com/exampledoc.html#xpath(/html/body[1]/h2[1])>
>>
>> although it is quite obvious what these are for, we do sense a 
>> problem with these nevertheless. Indeed
>>
>> - RDF Concepts 1.1 Last Call document[3] refers to IRI-s: RFC3987[4]
>> - IRI-s map to URI-s: RFC3986[5]
>> - What RFC3986 says about fragments is:
>>
>> [[[
>> The fragment's format and resolution is therefore dependent on the 
>> media type [RFC2046] of a potentially retrieved representation, even 
>> though such a retrieval is only performed if the URI is 
>> dereferenced.  If no such representation exists, then the semantics 
>> of the fragment are considered unknown and are effectively 
>> unconstrained.
>> ]]]
>>
>> Looking at the URI-s above:
>>
>> - The 'char' fragment id is defined by rfc 5147[6], but is defined 
>> for text/plain only. ITS talks about XML and HTML, ie, talks about 
>> resources whose media types are definitely _not_ text/plain
>> - The 'xpath' fragment id is fine for XML. But it is not defined for 
>> text/html
>>
>> In view of this, we do not feel comfortable with the choice of the 
>> mapping; the resulting RDF triples will not be entirely correct 
>> because these URI-s are not correct. Additionally, although that is 
>> not an RDF requirement per se, the URI-s are not dereferenceable 
>> (because they are incorrect) which is also in contradiction with 
>> Linked Data Principles which are also prevalent in the community.
>>
>> We do see two ways around this issue
>>
>> 1. The WG registers the 'char' fragment id-s (see also [7] for 
>> guidelines) through IETF for HTML and XML. (Actually, extending the 
>> usage of 'char' to XML/HTML would be generally very useful). Also, 
>> the WG registers 'xpath' for HTML (although we realize that this may 
>> be difficult because it might not be acceptable for the HTML WG which 
>> 'owns' the text/html media type)
>>
>> 2. The WG uses a different URI scheme, trying to avoid fragment ids. 
>> Something like:
>>
>> http://www.w3.org/its?resource=http://example.com/exampldoc.html&char=0,29 
>>
>> http://www.w3.org/its?resource=http://example.com/exampldoc.html&xpath=/html/body[1]/h2[1] 
>>
>>
>> where, of course, the www.w3.org/its part can be some other URI and, 
>> ideally, would refer to a service returning something feasible and 
>> intelligent on the request there.
>>
>> However. We also recognize that the mapping in the ITS document is 
>> _not_ normative. As a consequence, the ITS WG is perfectly in its 
>> right to go ahead and not to follow the comments of the RDF Working 
>> Group. In other words, the ITS Working Group does not have to ask 
>> again for a formal approval of the RDF Working Group on any decision 
>> it may take (although I would be interested by the decision:-)
>>
>> I hope this was helpful to you
>>
>> Sincerely, in the name of the RDF Working Group
>>
>> Ivan Herman (staff contact for the RDF WG)
>>
>> P.S. Note that there are similar efforts elsewhere, like the 
>> string-range fragment id[8] or the work IDPF did for ebooks[9], but 
>> we recognize none of these offer an alternative.
>>
>>
>> [1] http://www.w3.org/TR/2013/WD-its20-20130820/#conversion-to-nif
>> [2] https://www.w3.org/2013/meeting/rdf-wg/2013-08-28#resolution_1
>> [3] http://www.w3.org/TR/2013/WD-rdf11-concepts-20130723/
>> [4] http://tools.ietf.org/html/rfc3987
>> [5] http://tools.ietf.org/html/rfc3986
>> [6] http://tools.ietf.org/html/rfc5147
>> [7] http://www.w3.org/TR/fragid-best-practices/
>>
>>
>>
>
>
Received on Monday, 2 September 2013 21:09:47 UTC