RE: RDF-ISSUE-129 Re: json-ld-api: change proposal for handling of xs:integer

On Monday, May 13, 2013 11:25 AM, Gregg Kellogg wrote:
> On May 13, 2013, at 4:36 AM, Sandro Hawke <sandro@w3.org> wrote:
> 
> > [this is really two related issues -- one about xs:integer, the
> > other about xs:double, in JSON-LD.]
> >
> > On 05/12/2013 09:45 PM, Manu Sporny wrote:
> >> On 05/10/2013 06:31 PM, Sandro Hawke wrote:
> >>> I believe, by saying in situations where there might be a loss, one
> >>> MUST NOT convert to a number.
> >> We didn't do this because the range for a JSON number isn't defined
> >> anywhere.

Right. JSON-LD the data format doesn't have this issue as it has an
unlimited value space. So it's really only a problem for systems converting
those strings (even the things without quotes are strings on the wire) to
native numbers.
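
To illustrate (a hedged sketch in TypeScript; any JSON parser backed by
IEEE 754 doubles behaves the same way): the wire format carries the digits
just fine, it's the native conversion that silently loses them.

    // 2^53 = 9007199254740992 is the point beyond which a double can
    // no longer represent every integer exactly.
    const parsed = JSON.parse('{"id": 9007199254740993}');
    console.log(parsed.id); // 9007199254740992 -- the last digit is gone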


> >>> It's true we don't know exactly when there might be a loss, but
> >>> after talking with Markus, I'm pretty confident that using the
> >>> range of 32-bit integers will work well.
> >> ... except that most systems support 64-bit numbers, and we'd be
> >> hobbling those systems. :/

And the problem is still there for 16-bit or 8-bit systems. That might not
matter much in practice, but in a couple of years the 32-bit limit won't
matter anymore either - just as the 16-bit and 8-bit limits don't matter
much today.


> > Yes, but I'm not sure the demand is *that* great for efficient
> handling of integers outside the range of 32-bits.      We're hobbling
> their handling of numbers in the range of +- (2^31...2^53), for the
> most part.
> >
> > But yes, there is a tradeoff of efficiency against correctness.
> >
> > I can't help wondering how the JSON standards community thinks about
> this.  It seems like a huge problem when transmitting JSON to not know
> if bits will be dropped from your numbers because the receiving system
> is using a different-from-expected representation of numbers.

Typically, large numbers are represented as strings. Twitter ran into that
problem when their Tweet IDs crossed 53 bits a couple of years ago. They
now serialize each ID as both a number and a string (id and id_str), see:

https://dev.twitter.com/docs/twitter-ids-json-and-snowflake

In JSON-LD we have a way to add a type to such a string-number, so that
shouldn't be a big problem.
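
For instance (a sketch using JSON-LD's expanded form; the property IRI and
the ID value are made up), an ID that doesn't fit into a native number can
be kept as a typed string:

    const tweet = {
      "http://example.com/vocab#id": {
        "@value": "514568131655122945",
        "@type": "http://www.w3.org/2001/XMLSchema#integer"
      }
    };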


> The point of being able to use native numbers in JSON is that this is
> much more convenient for JSON developers to use than strings, which
> might still need to be evaluated. But it is impossible to do this for
> every possible integer. I think that restricting this to 32 bits is a
> reasonable restriction, given the limitations of important JSON
> parsers, but requiring the use of BigInteger-like libraries should be
> considered.

We need to distinguish between the data format (the thing on the wire) and
processors. On the wire, range and precision are unlimited. Processors
converting that to some native type of course have limitations, but as
Gregg said, that limit can be stretched quite far these days... even though
it makes implementations much more complicated, as off-the-shelf JSON
parsers don't do this (yet). PHP, for example, allows parsing large numbers
into strings so that nothing is lost (except the information that the value
was a number and not a string).
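
A rough TypeScript equivalent of that PHP behavior (a naive sketch: the
regex assumes such digit runs never occur inside string values; a real
implementation would need a proper tokenizer):

    // Quote integer literals of 16+ digits before parsing so they
    // survive as strings instead of being rounded to doubles.
    function parseBigIntsAsStrings(json: string): unknown {
      const quoted = json.replace(
        /([:\[,\s])(-?\d{16,})(?=\s*[,\]\}])/g, '$1"$2"');
      return JSON.parse(quoted);
    }

    parseBigIntsAsStrings('{"id": 9007199254740993}');
    // => { id: "9007199254740993" }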


> >> We might want to put in guidance that moves the decision to the
> >> processor (it can detect when a conversion would result in data
> loss).
> >> Perhaps it should be up to the implementation to determine when data
> >> could be lost.

That would be my preferred solution.
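
A sketch of what that could look like (hedged: Number.isSafeInteger bounds
the check at +-(2^53 - 1), and the round-trip comparison doubles as a
canonical-form test):

    const XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer";

    // Emit a native number only when the conversion is provably
    // lossless on this platform; otherwise keep the expanded form.
    function integerToJson(lexical: string):
        number | { "@value": string; "@type": string } {
      const n = Number(lexical);
      if (Number.isSafeInteger(n) && String(n) === lexical) {
        return n;
      }
      return { "@value": lexical, "@type": XSD_INTEGER };
    }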


> > The problem is:
> >
> > step 1:  64-bit server pulls data out of its quadstore and serializes
> it as JSON-LD
> > step 2:  Server sends that JSON-LD to client
> > step 3:  32-bit client uses that data.
> >
> > If the server is using native json numbers, and some number is in the
> 2^31...2^53 range, then the client will silently parse out the wrong
> number.    That's a pretty bad failure mode.    I'm not sure whether
> people will react by:
> >
> >  - not using native json numbers for that range (as I'm suggesting)
> >  - insisting that clients handle json numbers the same as the server
> does (somehow)
> >  - not using native json numbers at all
> >  - not using json-ld at all
> >
> > I suspect if we give no guidance, then we'll find ourselves at the
> latter options.

I don't agree with that reasoning. Plain JSON behaves exactly the same way,
and I haven't heard of people abandoning it because of that. Yes, in some
cases it might be better to serialize numbers as strings, but in contrast
to JSON, JSON-LD allows you to attach a datatype - so the value won't be an
opaque string as it would be in plain JSON.


> Prefer the second option, but could live with the first.
> 
> >>> I'd also add:
> >>>
> >>> "1"^^xs:int              // not native since it's 'int' not
> >>> 'integer' "01"^^xs:integer     // not native since it's not in
> >>> canonical form
> >> +1

So we would only convert numbers whose lexical form is *canonical*? I'd be
fine with that.
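
The canonical form is easy to test for directly (a sketch; per XML Schema,
no leading zeros, no plus sign, and zero is written "0"):

    function isCanonicalInteger(lexical: string): boolean {
      return /^(0|-?[1-9][0-9]*)$/.test(lexical);
    }

    isCanonicalInteger("1");   // true
    isCanonicalInteger("01");  // false -- not canonical, keep expanded
    isCanonicalInteger("+1");  // false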


> >>> These rules will make xs:integer data round tripping through JSON-
> >>> LD perfectly lossless, I believe, on systems that can handle at
> >>> least 32 bit integers.
> >> Yeah, but I'm still concerned about the downsides of limiting the
> >> number to 32-bits, especially since most of the world will be using
> >> 64-bit machines from now on.

Me too... and in a couple of years the same will be true of 64-bit.


> > Another option is to say JSON LD processors MUST retain at least 53
> bits of precision on numbers (my second option above), but Markus tells
> me PHP compiled for 32-bit hardware, and some C JSON parsers, won't do
> that.

-1, that will make it impossible to implement conformant JSON-LD processors
on certain platforms.


> Likely, languages with these limitations have some kind of BigInteger
> implementation; if so, we could consider using the 64-bit space.
> 
> >> I do agree that we might be able to change the text to ensure that
> >> precision loss isn't an issue, and I do agree with you that it's
> >> definitely worth trying to prevent data loss.
> >>
> >> Tracking the issue here:
> >>
> >> http://lists.w3.org/Archives/Public/public-rdf-wg/2013May/0136.html
> >>
> >>> On a related topic, there's still the problem of xs:double.  I
> >>> don't have a good solution there.   I think the only way to
> >>> prevent datatype corruption there is to say don't use a native
> >>> number when the value happens to be an integer.
> >> I don't quite understand, can you elaborate a bit more? Do you mean,
> >> this would be an issue?
> >>
> >> "234.0"^^xsd:double --- fromRDF() ---> JsonNumber(234)
> >
> > Yes.

"234.0"^^xsd:double --- fromRDF() ---> JsonNumber(234) --> toRDF
"234"^^xsd:integer


> >  Option 0: leave as-is.   RDF data cannot be faithfully transmitted
> through JSON-LD if 'use native numbers' is turned on.

That's what the flag is for. I'm wondering how other RDF libraries handle
that. For example, what happens if you call Jena's getInt() on an integer
larger than 32 bits? Will it throw an exception?

http://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Literal.html


> >  Option 1: in converting RDF to JSON-LD, processors MUST NOT use
> native json numbers for xs:double literals whose values happen to be
> integers.  Leave them in expanded form.

That would be a very weird and surprising behavior for most users.


> >  Option 2: in converting between RDF and JSON-LD, processors SHOULD
> handle the JSON content as a *string* not an object.  When they
> serialize as double, they SHOULD make sure the representation includes
> a decimal point.  When they parse, they should map numbers with a
> decimal point back to xs:double.   Also, when they parse, they should
> notice numbers that are too big for the local integer representation
> and keep them in a string form.

Isn't that exactly what useNativeTypes = false does?


> > FWIW, I hate all of these options.   I can't even decide which I hate
> the least.   Seriously hoping someone has a better idea....
> 
> The point of having the useNativeTypes flag is to address these issues;
> hobbling all implementations to guarantee no data loss goes against the
> whole point of using a JSON representation in the first place; the
> format is optimized for applications,

I think we should keep in mind that we are primarily designing a data
format. The data itself has none of these issues, as numbers can be of
arbitrary size and precision. The problem manifests itself only when those
numbers are converted to some native representation. You have the same
problem everywhere else: plain old JSON, XML, etc. I think we should just
add a note or something highlighting the problem and explaining the
various approaches to avoid it.


> Any JSON-LD processor can faithfully transform from other RDF formats
> by turning off the useNativeTypes option; the only thing to consider is
> if this guidance needs to be made more prominent and if we should
> consider changing the default for that option.

+1... I don't care much about the default value.


> Option 0 preserves the intent of the format the best, but, for the sake
> of convenience and utility, developers should recognize the possibility
> of round-tripping errors.

+1, that's how JSON has been successfully used for years.


> Option 1 is much more inconvenient for developers, as their code now
> needs to branch depending on whether the value is a string or a hash,
> rather than just counting on it being a number.

-1, very unintuitive behavior


> Option 2 places more of a burden on processor developers. In Ruby, I'd
> need to always use custom datatypes for numbers to carry around the
> original lexical representation, but this could be easily lost through
> intermediate operations. I'd also need a custom JSON parser and
> serializer to ensure that the serialized form is represented properly,
> not worth it IMO.

Just use useNativeTypes = false if you want that behavior. Requiring
implementers to write their own JSON parsers is not an option in my opinion.
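
For reference, this is roughly what useNativeTypes = false produces (a
hedged illustration of the expanded form; the property IRI is made up):
every literal keeps its full lexical representation and datatype, so
nothing can be lost in parsing:

    const expanded = {
      "http://example.com/vocab#temp": {
        "@value": "2.34E2",
        "@type": "http://www.w3.org/2001/XMLSchema#double"
      }
    };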



--
Markus Lanthaler
@markuslanthaler

Received on Monday, 13 May 2013 17:21:14 UTC