Re: change proposal for issue-86, was: ISSUE-86 - atom-id-stability - Chairs Solicit Proposals

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 15 Apr 2010 11:03:55 +0200
To: Maciej Stachowiak <mjs@apple.com>
CC: Sam Ruby <rubys@intertwingly.net>, Ian Hickson <ian@hixie.ch>, "public-html@w3.org WG" <public-html@w3.org>
Message-ID: <4BC6D67B.6040106@gmx.de>
On 15.04.2010 01:20, Maciej Stachowiak wrote:
> ...
> Speaking solely as an amateur spec lawyer:
>
> Here is what the Atom spec says about uniqueness of IDs:
>
> "When an Atom Document is relocated, migrated, syndicated, republished,
> exported, or imported, the content of its atom:id element MUST NOT
> change. Put another way, an atom:id element pertains to all
> instantiations of a particular Atom entry or feed; revisions retain the
> same content in their atom:id elements. It is suggested that the atom:id
> element be stored along with the associated resource."
>
> <http://www.ietf.org/rfc/rfc4287.txt>
>
> Is converting an HTML file to Atom an example of one of the covered
> actions that MUST NOT change the atom:id? It doesn't seem like it to me
> - it's definitely not a relocation, migration, syndication, republication
> or export. Is it an import? It's not clear to me if "import" means from
> one Atom system to another, or from any non-Atom format.

I think "republished" applies here.

From the spec text you quoted, it should be clear that the authors
intended the requirement to apply to *any* way of producing Atom.

> Personally, I would read it as import between Atom publishing systems.

Nope.

> If you require import from external formats to give fixed IDs, it gives
> potentially nonsensical results. For example, creating an Atom post from
> plain text that does not contain an ID would be nonconforming, since you
> can't guarantee another Atom system would give the same ID if you
> imported it there. Or if there was a globally defined algorithm for

That would be a problem if you considered the imported piece of text to 
be the "same" once it's imported into different feeds.

But when you do things like that, you usually put that piece of 
information into a context. You can't simply assume that anybody else 
who takes the same text is producing the "same" Atom entry from it. 
For instance, given plain text as input, how do you produce titles and 
timestamps?

So in general, I would expect somebody importing plain text to assign a 
unique ID upon import, and keep it.
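
A minimal sketch of "assign on import, then keep", in Python. The class name, the in-memory dict, and the choice of urn:uuid: identifiers are assumptions made for illustration; nothing in RFC 4287 or the HTML draft prescribes them, and a real importer would persist the mapping somewhere durable.

```python
import uuid


class ImportedEntries:
    """Hypothetical store that remembers the atom:id assigned to each import."""

    def __init__(self):
        # Maps an importer-local key (a path, a row id, ...) to its atom:id.
        self._ids = {}

    def atom_id_for(self, local_key):
        # Mint an ID only the first time this item is imported;
        # every later export reuses the stored value.
        if local_key not in self._ids:
            self._ids[local_key] = "urn:uuid:" + str(uuid.uuid4())
        return self._ids[local_key]


store = ImportedEntries()
first = store.atom_id_for("notes/2010-04-15.txt")
again = store.atom_id_for("notes/2010-04-15.txt")

# A different importer taking the same text mints its own, different ID.
other = ImportedEntries().atom_id_for("notes/2010-04-15.txt")
```

Here `first` and `again` are equal, while `other` differs, which matches the point that two feeds importing the same text are not producing the "same" entry.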

> creating the ID when importing from plain text, it would have to be
> based on something like a hash of the text, and therefore could not
> preserve ID across edits of the text.

Yes, that would be a poor ID (as poor as none from a usability point of 
view, but still conforming with respect to the XML syntax).

> Therefore, it seems to me that "import" can't possibly refer to
> converting from another format, or it would effectively be nonconforming
> for anyone to convert from another format ever (except formats that
> already include unique IDs).
>
> But let's assume "import" or one of the other verbs does include
> conversion from other formats. Applying this to HTML:
>
> (1) Should converting the exact same document to Atom multiple times
> with the same converter give the same atom:ids? That seems like a
> practical requirement to add, an implementation could just compute a
> hash of the whole document and append a sequence number for each entry.

That would have the problem you mentioned above. The ID would be unique, 
but it would change more frequently than it should.
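
To illustrate, here is a small Python sketch of the hash-plus-sequence-number scheme described in (1). The function name and the tag: URI layout under example.org are assumptions for illustration only, not anything the HTML draft defines.

```python
import hashlib


def hash_based_ids(document_text, entry_count):
    """Derive one ID per entry from a hash of the whole document (illustrative)."""
    digest = hashlib.sha1(document_text.encode("utf-8")).hexdigest()
    # Append a sequence number so entries within one document stay distinct.
    return ["tag:example.org,2010:%s-%d" % (digest, n)
            for n in range(1, entry_count + 1)]


original = "First post.\nSecond post."
edited = original + " (typo fixed)"

ids_a = hash_based_ids(original, 2)
ids_b = hash_based_ids(original, 2)  # same input, same converter
ids_c = hash_based_ids(edited, 2)    # same document after a small edit
```

Repeated conversion of identical input is stable (`ids_a == ids_b`), but any edit changes every ID in the document (`ids_a != ids_c`), including the IDs of entries that were not touched.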

> (2) Should converting the exact same document to Atom multiple times
> with different converters give the same atom:ids? You could handle this
> like (1) if the spec defined an exact algorithm for generating Atom IDs,
> but that would conflict with (3) below.

I don't see that as a requirement.

> (3) Should converting an edited version of the same document to Atom
> multiple times with the same converter give the same atom:ids? It seems
> like this *might* be doable if the converter tracks what documents it
> has converted before, and has some way to identify that the edited copy
> is the same. It seems like "edited but the same" is kind of fuzzy
> though. What if you rename the file, and make extensive edits, how could
> any system tell it is "the same" without altering the HTML original? But
> it seems like if you just rename and change nothing else, that should
> count as "the same", so it can't solely be based on the filename.

I would expect the HTML document to contain a fixed atom:id, in which 
case tracking wouldn't be a problem.

I agree that all of this is hard. It is likely to be impossible if the 
HTML document doesn't contain the necessary metadata in the first place.

So a very simple answer is: if you can't, don't.
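
As a rough Python sketch of "if you can't, don't": a converter could take IDs only from the document itself and decline to produce a feed when any entry lacks one. The data-atom-id attribute is purely hypothetical, invented here to stand in for "the necessary metadata"; neither HTML5 nor Atom defines it.

```python
from html.parser import HTMLParser


class ArticleIdCollector(HTMLParser):
    """Collects the (hypothetical) data-atom-id of each <article> element."""

    def __init__(self):
        super().__init__()
        self.ids = []  # one slot per <article>; None when no ID is embedded

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.ids.append(dict(attrs).get("data-atom-id"))


def entry_ids_or_none(html_text):
    """Return the embedded IDs, or None if a stable ID cannot be guaranteed."""
    collector = ArticleIdCollector()
    collector.feed(html_text)
    collector.close()
    # No entries, or any entry without an embedded ID: do not convert.
    if not collector.ids or any(i is None for i in collector.ids):
        return None
    return collector.ids
```

A caller would treat a `None` result as "do not emit Atom for this document" rather than inventing an ID on the fly.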

> (4) Should converting an edited version of the same document to Atom
> multiple times with different converters give the same atom:ids? It
> seems that this is fundamentally impossible unless the document already
> has an embedded unique ID per entry.

See (2).

> So, if we take the strongest possible interpretation, that the Atom spec
> requires all of (1)-(4), then any conversion along the lines of what is
> in the HTML spec fundamentally conflicts with the Atom spec; converting
> from HTML to Atom would be automatically nonconforming to Atom unless
> the HTML has embedded globally unique IDs. If that's the case, and we
> care about Atom conformance

...You continued with:

> If that's the case, and we care about Atom conformance, then the only possible fix is to remove the feature.

I think you *can* generate conforming Atom if sufficient metadata is 
there in the first place.

But you can't if this is not the case, and the spec should not give the 
impression that you can.

> If we take a looser interpretation, and, say, only (1) or only (1) and
> (2) are needed, then we could add requirements to HTML5 which would
> enforce Atom's requirements.
>
> If we take an even more lenient interpretation and say that conversion
> from a foreign format is not covered by the Atom requirement to preserve
> IDs, we could leave the spec as-is without conflicting with Atom.
>
> Do any of the Atom experts here have an opinion on which of the
> interpretations is correct?
>
> Regards,
> Maciej

Best regards, Julian
Received on Thursday, 15 April 2010 09:04:35 UTC