Re: Adding a datatype for HTML literals to RDF (ISSUE-63)

On May 2, 2012, at 21:29 , Richard Cyganiak wrote:

> On 2 May 2012, at 19:15, Andy Seaborne wrote:
>> I think I'm saying, start simple, prove a need for more complicated.
>> 
>> We can define a value space that is all character sequences (and is disjoint from xsd:string).  Do we need to be more complicated?  What's the use case?
> 
> One use case might be RDFa parsers with HTML literal support.
> 
> Let's say you have @datatype="rdf:HTMLLiteral" on some element, and the element contains text with markup, and the desire is that the resulting HTML literal contains the text with markup intact.
> 
> Now the RDFa parser may not have access to the actual HTML string, but only to a representation that has already been parsed into a DOM tree.

That is certainly the case for pyRdfa.

> 
> So the parser may have to serialize the DOM into a string, which would probably be different from the original string.
> 

Indeed.

That being said, there is an issue here. HTML5 parsers transform invalid HTML5 into a DOM tree that does not reflect the original. In such a case, an RDFa parser may have no choice but to output the transformed DOM tree, serialized as HTML5. But that is an RDFa problem and not an RDF one. Note that if we follow the official HTML5 algorithm in defining a value space, then the parser would emit an HTML5 literal that differs from the original in lexical space but is identical in value space. Which is, sort of, all right.
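
To make the lexical space vs. value space point concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the function l2v and the class _EventCollector are hypothetical names, and the tokenizer from the standard library's html.parser stands in for the official HTML5 algorithm, which it does not implement. The "value" here is just a tuple of parse events, not a real DOM.

```python
from html.parser import HTMLParser


class _EventCollector(HTMLParser):
    """Collects a flat, comparable list of parse events for a fragment.

    Hypothetical helper: a stand-in for a real HTML5 tree builder.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; sorting by
        # attribute name makes the attribute order irrelevant.
        ordered = tuple(sorted(attrs, key=lambda pair: pair[0]))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def l2v(lexical_form):
    """Hypothetical lexical-to-value mapping: fragment string -> event tuple."""
    collector = _EventCollector()
    collector.feed(lexical_form)
    collector.close()
    return tuple(collector.events)


# The same markup as an author might write it, and as a parser might
# write it back out after a round trip through a tree.
original = '<P CLASS=note>Hello <B>world</B></P>'
reserialized = '<p class="note">Hello <b>world</b></p>'

print(original == reserialized)            # compares the lexical forms
print(l2v(original) == l2v(reserialized))  # compares the mapped values
```

With a mapping of this kind, a parser that can only re-serialize its tree still produces a literal that compares equal to the one carrying the original string.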

> (Or is this nonsense and the parser could always just do myDOMElement.innerHTML to get the original HTML?)
> 

I am not sure whether this is available in all HTML5 parsers. I do not see it in the Python html5lib parser that I use, for example (but I may have missed it).
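
Where no innerHTML is on offer, something similar can be emulated by serializing the children of the element one by one. A rough sketch, assuming (for the sake of illustration only) that the tree is an xml.dom.minidom one and that the input is well-formed XML; inner_markup is a hypothetical helper name, not part of any parser's API:

```python
from xml.dom import minidom


def inner_markup(element):
    """Concatenate the serialization of every child node of `element`."""
    return "".join(child.toxml() for child in element.childNodes)


# Well-formed input, so that minidom can parse it for this illustration.
source = "<div>Some <em>marked up</em> text<br></br></div>"
root = minidom.parseString(source).documentElement

print(inner_markup(root))
```

The result is a serialization of the tree and not the original string: the empty br element, for instance, is written back in its short form.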


> Anyways, the advantage of having a value space that is isomorphic to the DOM is that you can parse and re-serialize the HTML and still get the same value.
> 

Yes, see above.


>> (Not all RDF systems have access to info set support code now that we are standardising Turtle and N-triples.)
> 
> Yeah and that's why we're trying to change rdf:XMLLiteral to make it optional and to relax its lexical space.
> 
> I imagine that rdf:HTMLLiteral would be optional too, and the lexical space should certainly be as unrestrictive as possible.
> 
> Only those who want to compare HTML literals, or those who *need* to parse and re-serialize HTML literals, need to care what the value space is. (And yeah, if we can't come up with evidence that some systems need to do one of those, then there's little point in defining anything more complicated than a 1:1 L2V mapping.)
> 
> Best,
> Richard
> 
> 
> 
>> 
>> 	Andy
>> 
>>> 
>>> Ivan
>>> 
>>>> Best,
>>>> Richard
>>>> 
>>>> 
>>>> 
>>>>>> And I guess in theory, DOMs and XML Infosets should be isomorphic, no?
>>>>> 
>>>>> In theory:-) To be checked. There may be corner cases.
>>>>> 
>>>>>> 
>>>>>> Between all these transformations, there should be something that works for us. The devil is in the details of course.
>>>>> 
>>>>> Exactly...
>>>>> 
>>>>>> 
>>>>>> Or we could just avoid all of that trouble and simply define the value space of the HTML datatype as identical to the lexical space.
>>>>> 
>>>>> And then we are back to the same issue as we had with XML Literals. Except that... there is no such thing as a formal canonical HTML5
>>>>> 
>>>>> Ivan
>>>>> 
>>>>>> 
>>>>>> Best,
>>>>>> Richard
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Just some food for thought...
>>>>>>> 
>>>>>>> Ivan
>>>>>>> 
>>>>>>> 
>>>>>>> On May 1, 2012, at 18:41 , Gavin Carothers wrote:
>>>>>>> 
>>>>>>>> On Tue, May 1, 2012 at 6:46 AM, Richard Cyganiak<richard@cyganiak.de>  wrote:
>>>>>>>>> All,
>>>>>>>>> 
>>>>>>>>> The 2004 WG worked under the assumption that the future of HTML was XHTML, and that the use case of shipping HTML markup fragments as RDF payloads would be addressed by rdf:XMLLiteral. But in 2012, shipping HTML fragments really means HTML5. Is rdf:XMLLiteral still adequate for this task? Is a new datatype with a lexical space consisting of HTML5 fragments needed? This question is ISSUE-63.
>>>>>>>>> 
>>>>>>>>> I think it would be useful to have a straw poll sometime soon on this question:
>>>>>>>>> 
>>>>>>>>> PROPOSAL: RDF-WG will work on an HTML datatype that would be defined in RDF Concepts.
>>>>>>>> 
>>>>>>>> +1, and for internationalization it should be a required datatype. It might
>>>>>>>> also have a simple syntax in Turtle (though that would likely require a new
>>>>>>>> last call, but a Web format that doesn't understand HTML doesn't
>>>>>>>> seem like much of a web format)
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> If there is general support for this, then we could start work on the details of the datatype definition (lexical space, value space, L2V mapping and so on).
>>>>>>>>> 
>>>>>>>>> All the best,
>>>>>>>>> Richard
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ----
>>>>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>>>>> Home: http://www.w3.org/People/Ivan/
>>>>>>> mobile: +31-641044153
>>>>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf

Received on Thursday, 3 May 2012 07:33:55 UTC