Disposition of comments received on draft-ietf-appsawg-xml-mediatypes-05

This document contains all the comments received on draft-ietf-appsawg-xml-mediatypes-05, together with my responses. They are divided into two sections: the first for more substantive comments, the second for more editorial ones. Comments where my response has been more or less negative are sorted to the end of their respective sections and shown with a pink background. Where a comment includes a quote from the draft, the quote is shown with a green background.

Substantive/contentious comments

Comment on section 3, HST

The last para is as close as this spec gets to saying how to make sense of XML documents. In keeping with the "follow your nose" principle, there really should be something pointing to the core specs of the XML family, mostly the ones for the technologies named in this para, but without references.

Response: Added an appendix listing the Core WG products which get you started: namespaces, xml:id, xml:base, xml-stylesheet, xml-model, and several TAG refs

Comment on section 3.1, Duerst

Section 3.1, Encoding considerations (and elsewhere): The term "charset encoding" shows up. This is an unfortunate mixture of terminology. MIME has the "charset" parameter, and XML has the "encoding" pseudo-attribute, but this doesn't mean that these two words should be combined just like this. Also, this isn't used uniformly through the spec, e.g. there are things like "ASCII-compatible character sets" (see also http://www.w3.org/MarkUp/html-spec/charset-harmful.html).

I suggest, in Section 2, to shortly talk about the fact that MIME has the "charset" parameter, and XML has the "encoding" pseudo-attribute, and then use a single term. I'd personally suggest "character encoding" (see e.g. RFC 3986), but I'd be happy with any term that has been used widely already.

Response: Good catch, of a problem that's actually inherited from 3023. I've tried to look at every use of either phrase, or any related terminology, and to be sure to use "charset parameter" for what is present in the Content-Type response header and "encoding declaration" for what's in the XML or Text Declaration, as well as "MIME charset" for e.g. "utf-16le" and "encoding [form]" for e.g. UTF-16 as such. See also the extended rewrite, per email discussion, in (new sub-)section 2.2.
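
To make the distinction concrete, here is a minimal Python sketch (not text from the draft) that pulls the two values apart: the charset parameter from a Content-Type header value, and the encoding declaration from the start of an entity. The function names and the simplified regular expression are assumptions made for illustration. The expression only handles declarations whose first bytes are ASCII-compatible, and it is not a conforming XML parser.

```python
import re
from email.message import Message

# Simplified pattern for an XML or text declaration that carries an
# encoding declaration; assumes the first bytes are ASCII-compatible.
ENCODING_DECL = re.compile(
    rb'''^<\?xml[^>]*?\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1'''
)


def charset_parameter(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def encoding_declaration(entity):
    """Return the encoding named in the XML/text declaration, or None."""
    match = ENCODING_DECL.match(entity)
    return match.group(2).decode("ascii") if match else None


# Hypothetical header and entity in which the two sources disagree.
header = 'application/xml; charset="utf-16le"'
entity = b'<?xml version="1.0" encoding="iso-8859-1"?><doc/>'
print(charset_parameter(header))
print(encoding_declaration(entity))
```

The two functions deliberately return separate values, since the terminology fix is about keeping the MIME-level label and the in-document declaration apart.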

Comment on section 3.6, Bjoern Hoehrmann

Under the rules of RFC 3023 and the XML 1.0 Recommendation, including its 5th Edition, [external] parsed entities could be encoded in an encoding like ISO-8859-1 without a corresponding text declaration so long as they are labeled with a charset parameter indicating the encoding. Under the revised rules of this specification the character data of such entities can be interpreted as a byte order mark or Unicode signature. Implementations might see elements and other markup in such entities they did not see when interpreting them under the previous rules. Attackers may be able to exploit this difference in interpretation to bypass security systems.

Response: External parsed entities with the necessary characteristics (i.e. non-UTF encoded, without a text declaration, not wrapped in any markup, and beginning with, if for example encoded in iso-8859-1, 'þÿ' or 'ÿþ' or 'ï»¿') are at best unlikely to occur. They are not well-formed XML outside a MIME-labelled context (where non-UTF-encoded entities must include text declarations, i.e. must begin with '<?'). And, for better or worse, the only browsers that read external entities at all (Chrome and Safari) both treat such entities as UTF today.

Response: Added a "MUST use text declaration" fix, as well as brief discussion in Changes from RFC 3023 section.
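
The byte-level coincidence behind this comment can be checked directly. The following Python sketch (illustrative only; the helper name is an assumption) encodes the three character sequences in question as iso-8859-1 and compares the result with the UTF-16 and UTF-8 BOM octets from the standard codecs module.

```python
import codecs

# Leading characters of a hypothetical iso-8859-1 entity, written as escapes:
# U+00FE U+00FF, U+00FF U+00FE and U+00EF U+00BB U+00BF.
PREFIXES = {
    "\u00fe\u00ff": codecs.BOM_UTF16_BE,
    "\u00ff\u00fe": codecs.BOM_UTF16_LE,
    "\u00ef\u00bb\u00bf": codecs.BOM_UTF8,
}


def collides_with_bom(text):
    """True if text, encoded as iso-8859-1, starts with a UTF-8/UTF-16 BOM."""
    data = text.encode("iso-8859-1")
    return any(data.startswith(bom) for bom in PREFIXES.values())


# Each character sequence encodes to exactly the octets of one BOM.
for chars, bom in PREFIXES.items():
    assert chars.encode("iso-8859-1") == bom
```

An entity that begins with markup, as any entity with a text declaration does, cannot collide in this way, which is what the "MUST use text declaration" fix relies on.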

Comment on section 3.6, Bjoern Hoehrmann

XML implementations [must now] treat data:application/xml;charset=utf-32,%FF%FE%00%00... as malformed UTF-16 encoded document

Response: This was not an intentional consequence of the changes with respect to BOMs, but I see how it might be thought to follow. Although the XML spec itself allows for UTF-32 BOMs, most if not all browsers either have never supported it (e.g. IE, I think) or are removing their support (e.g. Firefox, I think). So rather than add a lot of complexity by including UTF-32 throughout as a possibility, I've chosen to deprecate it instead, and mention this in the Changes from RFC 3023 section.
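
The overlap that produces this reading is easy to reproduce. In the Python sketch below (illustrative only, variable names are assumptions), the octets from the data: URI in the comment are the UTF-32LE BOM, whose first two octets are also the UTF-16LE BOM. A consumer that only knows UTF-8 and UTF-16 BOMs therefore sees UTF-16LE followed by U+0000, which is not a legal XML character.

```python
import codecs
from urllib.parse import unquote_to_bytes

# The percent-encoded octets quoted in the comment.
payload = unquote_to_bytes("%FF%FE%00%00")

# The payload is exactly the UTF-32LE BOM ...
is_utf32le_bom = payload == codecs.BOM_UTF32_LE
# ... and it also begins with the UTF-16LE BOM.
starts_with_utf16le_bom = payload.startswith(codecs.BOM_UTF16_LE)

# What remains after stripping a UTF-16LE BOM decodes to U+0000.
after_bom = payload[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
print(is_utf32le_bom, starts_with_utf16le_bom, repr(after_bom))
```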

Comment on section 3, Duerst

Last paragraph before Section 3.1: It was unclear what exactly the spec tried to say here. I suggest to add a sentence at the end, e.g. "Such processing is not specified in this document."

Response: Discussed this with the TAG and the XML Core WG. See HST's comment above

Minor/editorial/overtaken/duplicate comments

Comment on section Meta, Duerst

Copyright notice: Given the long history of this draft, I'd guess that this document needs the following addition in the copyright:

This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.

(This can easily be produced with a setting on some attribute in the XML source.)

Response: Done, silently

Comment on section 3, 8, Duerst

Section 3 and Section 8: These sections have more than a page of text before the first subsection. I suggest to add one or more additional subsection titles at the start or very close to the start of the section for better structuring.

Response: Done, silently

Comment on section 3, Duerst

Section 3 says:

document entities The media types application/xml or text/xml MAY be used.

First, it would be good to have some syntactic delimiter (colon maybe) between "document entities" and the rest. Same for the other items.

Second, RFC 2119 defines MAY as follows: "This word, or the adjective "OPTIONAL", mean that an item is truly optional." This is quite a bit misleading in the sentence above. Using application/xml or text/xml for XML document entities is the default case, not just an optional option. I suggest something like "The media types application/xml or text/xml, or a more specific media type, SHOULD be used." (A should without additional qualification is probably too strong.)

Response: 1) Done, silently; 2) Done, with a cross-ref to section 8.

Comment on section 3, Duerst

external parsed entities The media types application/xml-external-parsed-entity or text/xml-external-parsed-entity SHOULD be used. The media types application/xml and text/xml MUST NOT be used unless the parsed entities are also well-formed "document entities" and are referenced as such.

The last clause ("and are referenced as such") is confusing. Stuff is just served or sent; on the server side, it's unclear how it's being referenced, and so such a condition does not make sense operationally.

Response: Removed

Comment on section 3, Duerst

Note starting with

Note that [RFC3023] (which this specification obsoletes) recommended the use of text/xml and text/xml-external-parsed-entity for document entities and external parsed entities,

Because of the indenting, it looks as if this note only applies to the immediately preceding item ("external parameter entities"), but content-wise, it seems to apply more generally. The note should be outdented (if that's possible) or should be moved to another place where it's less confusing to the reader as to what it applies to.

Response: Both notes here brought out to normal level, silently

Comment on section 3, Duerst

Para starting with:

Compared to [RFC2376] or [RFC3023], this specification alters the charset handling of text/xml and text/xml-external-parsed-entity,

This is very long, in particular the first sentence. A very easy first step towards improvement would be to use a period before the "however", and change "however" to "However". Any additional untangling would be appreciated, too.

Response: Split and untangled, as requested.

Comment on section 3, Duerst

"for the text/xml... types" should be changed to "for types with a top-level media type text". (several instances)

Response: Changed to "for the two text/ types" here, 9.2 reworded and 9.4 done as you suggest. I thought your proposed change unnecessarily verbose in the first two cases.

Comment on section 3.1, Duerst

Also, we have "7bit or 8bit data, for example data with charset encoding UTF-8 or US-ASCII". In general speech, a chiasmus is something nice, but it's generally only confusing in specs, so I'd change this to "7bit or 8bit data, for example data with charset encoding US-ASCII or UTF-8".

Response: Done, silently

Comment on section 3.1, Duerst

Section 3.1, Applications that use this media type: There is a missing "and" but a superfluous comma : "is supported by a wide range of generic XML tools (editors, parsers, Web agents, ...)*,* *and* generic and task-specific applications." Probably, reordering makes this easier to read: "is supported by generic and task-specific applications and a wide range of generic XML tools (editors, parsers, Web agents, ...)."

Response: Done, silently

Comment on section 3.2, Duerst

Section 3.2: Text/xml Registration This is defined as an "alias", but the Media Type registry (e.g. in contrast to the charset registry) doesn't know the concept of an alias. So this should be reworded, e.g. saying that the registration information is the same. This also applies to Section 3.4.

Response: Done

Comment on section 3.3, Duerst

Section 3.3, Encoding considerations (and other items): There are two "as" prepositions in short succession. What about "Same as application/xml, see Section 3.1." or some such?

Response: Done, silently, passim

Comment on section 3.3, Duerst

Section 3.3, Interoperability considerations

Identifying XML external parsed entities with their own content type should enhance interoperability of both XML documents and XML external parsed entities.

Lowercase "should" SHOULD be avoided! (there are other cases, too) I suggest to change to "will", or just say "enhances".

Response: Done, passim, except in section 9 examples, which is non-normative

Comment on section 3.6, Duerst

XML MIME producers are RECOMMENDED to provide means for XML MIME entity authors to control the supply of charset parameters for their entities, for example by enabling user-level configuration of filename-to-Content-Type-header mappings on a file-by-file or suffix basis.

"control the supply" reads as if these charset parameters were in ample or short supply. I suggest to replace "supply" with "presence" or "presence or absence".

Response: Reworded

Comment on section 3.6, Duerst

Section 3.6: It may be helpful to create (sub)subsections for producers and consumers, because that's what many readers of the spec will look for.

Response: Done, silently

Comment on section 3.6, Duerst

Section 3.6, para starting with (and the following citation and para)

When a charset parameter is specified for an XML MIME entity, then

This is way too lengthy and complicated. The first sentence is almost six lines long. This is another case of mixing history and justification with the hard facts, and should be untangled.

Response: The 'history' is actually hard facts about the XML specification, but I've tried to simplify the structure here.

Comment on section 3.6, Duerst

"Section 4.3.3 of the [XML] specification": Please fix this to "Section 4.3.3 of [XML]" (as in other locations) or "Section 4.3.3 of the XML specification [XML]".

Response: Done

Comment on section 3.6, Duerst

"When MIME producers conform to the requirements on them stated above," "on them" is redundant and should be removed.

Response: Done, and a reference to the newly created sub-sub-section added

Comment on section 4, Duerst

Section 4: I'm not sure why this is a separate section, as the content is tightly related to Section 3.6. At the minimum, I suggest moving the stuff about BOMs from 3.6 to 4. A better solution would be to promote 3.6 to a section, and include the current section 4 in there (with appropriate additional subsections as suggested above).

Response: Done

Comment on section 4, Duerst

byte order mark (BOM), which is a hexadecimal octet sequence 0xFE 0xFF (or 0xFF 0xFE, depending on endianness)

A byte order mark is a character, not an octet sequence. Also, better say which endianness is which. This would result in

byte order mark (BOM), which appears as the hexadecimal octet sequence 0xFE 0xFF (big-endian) or 0xFF 0xFE (little-endian)

The change from "is" to "appears as" is also needed for UTF-8.

Response: Done
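
For reference, the octet sequences discussed here can be recognised with a few lines of Python. This is a minimal sketch under the assumption that only the UTF-8 and UTF-16 BOMs are of interest; the function name and the returned labels are illustrative and do not come from the draft.

```python
import codecs

# BOM octets as they appear at the start of an entity, with the
# encoding each one signals.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),         # 0xEF 0xBB 0xBF
    (codecs.BOM_UTF16_BE, "utf-16be"),  # 0xFE 0xFF (big-endian)
    (codecs.BOM_UTF16_LE, "utf-16le"),  # 0xFF 0xFE (little-endian)
]


def sniff_bom(entity):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if entity.startswith(bom):
            return name, len(bom)
    return None, 0


print(sniff_bom(b"\xfe\xff\x00<"))
print(sniff_bom(b"<doc/>"))
```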

Comment on section 4, Duerst

Applications which convert XML into "utf-8" SHOULD add a BOM after conversion is complete.

There are two problems there: 1) "after conversion is complete", if taken literally, would lead to very inefficient implementations (adding three bytes at the start of a long file). This clause should therefore be removed. 2) There is absolutely no need for a SHOULD. SHOULDs are only used when there would otherwise be interoperability problems, but XML in UTF-8 without a BOM 'should' not have any such problems. MAY seems much more appropriate here.

Response: Done, both

Comment on section 5, Duerst

Section 5: "IRI" is mentioned without a reference. The reference was dropped between -04 and -05 because it looked as if it wasn't needed, but it should be put back in (with "Dueerst" fixed to "Duerst").

Response: Done

Comment on section 5, Duerst

When a URI has a fragment identifier, it is encoded by a limited subset of the repertoire of US-ASCII [ASCII] characters, as defined in [RFC3986].

I'm not sure what this helps here. A pointer to the relevant parts of the XPointer spec(s) would be better, because some issues with respect to XPointer encoding in URIs (and IRIs) can be rather tricky.

Response: Done

Comment on section 6, Duerst

Section 6: Again very complicated language. I'd shorten the background information drastically, e.g. as follows (replacing the first *two* paragraphs of Section 6)

An XML MIME entity of type application/xml, text/xml, application/xml-external-parsed-entity or text/xml-external-parsed-entity MAY use the xml:base attribute, as described in [XMLBase], to establish a base URI for that entity (see Section 5.1 of [RFC3986]).

Response: Done, with a slight expansion/correction
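
As an illustration of what establishing a base URI means in practice, here is a small Python sketch using only the standard library. The document, the element and attribute names other than xml:base, and the example.org URIs are all hypothetical. Only an xml:base on the root element is considered, so this is not a full implementation of [XMLBase].

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical entity whose root element carries an xml:base attribute.
DOC = (
    '<doc xml:base="http://example.org/today/">'
    '<link href="new.xml"/>'
    '</doc>'
)

root = ET.fromstring(DOC)
# ElementTree exposes xml:base under the XML namespace URI.
base = root.get(f"{{{XML_NS}}}base")
href = root.find("link").get("href")
# Resolve the relative reference against the base URI (RFC 3986 rules).
resolved = urljoin(base, href)
print(resolved)
```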

Comment on section 8, Duerst

Section 8: A Naming Convention... I suggest removing the "a". In RFC 3023, this was in many ways just a trial, and so "a" was appropriate. Today, this doesn't have to be stressed anymore, and there are no other naming conventions for XML-Based Media Types.

Response: Done

Comment on section 8, Duerst

"pattern '*/*+xml'": This is shell notation applied to something else than file names. It's close to the syntax allowed in an HTTP Accept: header, but (as correctly noted in the draft) not the same. It should be obvious to many readers, but it would be better if it were clearly explained.

Response: Simplified
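
Spelled out as code rather than as shell-style notation, the convention is a test on the subtype suffix. The Python sketch below is an illustration only; the function name is an assumption, and any parameters after the media type are ignored.

```python
def is_plus_xml(media_type):
    """True if the subtype of media_type ends with the '+xml' suffix."""
    # Drop any parameters (e.g. "; charset=utf-8") and normalise case.
    essence = media_type.split(";", 1)[0].strip().lower()
    top, sep, sub = essence.partition("/")
    # Require a type, a slash, and something before the "+xml" suffix.
    return bool(top) and bool(sep) and sub.endswith("+xml") and len(sub) > len("+xml")


print(is_plus_xml("image/svg+xml"))
print(is_plus_xml("application/xml"))
```

Note that application/xml and text/xml themselves do not match, since the convention is about the suffix rather than about XML-based types in general.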

Comment on section 8, Duerst

When an XML-based media type is restricted to UTF-8, it is not necessary to introduce the charset parameter. "UTF-8 only" is a generic principle and UTF-8 is the default of XML.

I'm not sure what ""UTF-8 only" is a generic principle" is referring to. I guess it refers to the idea that using UTF-8 only for certain use cases on the Internet simplifies things a lot and is therefore a good idea. I fully agree. But a) this should be clearer, and b) it should be separated from "UTF-8 is the default of XML", because the former is a justification for the antecedent of the previous sentence (which I don't think is actually necessary), whereas the latter is a justification for the conclusion made in that sentence. So in order of decreasing preference: 1) Remove ""UTF-8 only" is a generic principle and" 2) Split the second sentence into two, explaining the "generic principle" in slightly more detail.

Response: (1) Done

Comment on section 8, Duerst

"Similarly, media subtypes that do not represent XML MIME...": I don't see any similarity to what comes before. If a connective is really needed, I'd use "Conversely", but the best solution would be to do without connectives altogether.

Response: Removed

Comment on section 8.1, Duerst

8.1: "Referencing": Please use a slightly longer subsection title to make it easier for readers to understand what this subsection is talking about. Maybe "Registration Template Details"?

Response: Done, using "Registration guidelines for XML-based media types not using '+xml'"---see below

Comment on section 8.1, Duerst

"Registrations for new XML-based media types under top-level types" Please remove "under top-level types". It doesn't add any information.

Response: I think this is meant to be quite separate from the +xml case. I've pinged Chris Lilley to check. . . For now, I've moved it to after the +xml template for that reason.

Comment on section 8.2, Duerst

For fragment identifiers matching the syntax defined in [XPointerFramework], where the fragment identifier does _not_ resolve per the rules specified there, then process as specified in "xxx/yyy+xml";

Is this the case of an unregistered XPointer scheme? If yes, it would be good to mention here (not with MUSTard) that this is a bad idea. If not, I don't understand what case this addresses.

Response: This is pretty much copied from RFC 6838. It's about failure to resolve in an instance. So, for example ...#foo, pointing to application/rdf+xml, can be interpreted, per the application/rdf+xml registration, as identifying an individual described by the RDF/XML fragment <rdf:Description rdf:ID='foo'>..., precisely because rdf:ID is not an XML ID, so ...#foo does not resolve per XPointer.
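
A rough Python illustration of the two-step behaviour: the lookup by xml:id below stands in for XPointer barename resolution (an assumption, since real resolution depends on ID-typed attributes, which may also come from a DTD or schema), and the lookup by rdf:ID stands in for the media-type-specific fallback. The document and helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical RDF/XML instance with an rdf:ID but no XML ID.
DOC = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:ID="foo"/>'
    '</rdf:RDF>'
)


def find_by_attribute(root, qualified_name, value):
    """Return the first element whose named attribute equals value."""
    for element in root.iter():
        if element.get(qualified_name) == value:
            return element
    return None


root = ET.fromstring(DOC)
# Stand-in for XPointer resolution: nothing carries xml:id="foo".
by_xml_id = find_by_attribute(root, f"{{{XML_NS}}}id", "foo")
# Fallback per the more specific type: rdf:ID="foo" is found.
by_rdf_id = find_by_attribute(root, f"{{{RDF_NS}}}ID", "foo")
print(by_xml_id, by_rdf_id.tag)
```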

Comment on section 9, Duerst

Section 9: The last para before subsection 9.1 should be moved to the start of this section.

Response: Done, silently

Comment on section 9, Duerst

the charset portion, if any, of the value of the MIME Content-type header

I'd prefer to keep the full Content-type header, as in the examples in RFC 3023. Why? I think something like 'Content-type: application/xml; charset="UTF-8"' is easier to read than 'Content-type charset: charset="UTF-8"'. Either this can use different types in the different examples, or use application/xml throughout. I'd personally prefer the latter.

Response: Done, with an explanatory note in the intro

Comment on section 9, Duerst

and the XML MIME entity may contain other data in addition to the XML declaration;

The 'may' here is misleading a) because lower-case 'may' SHOULD be avoided and b) because (except for an XML declaration) there 'may' indeed be no data, but that would be exceedingly rare. So I would change this to something like "and the XML MIME entity will contain other data in addition to the XML declaration (or might be empty);", where the parenthetical in my opinion isn't even necessary.

Response: Fixed, more simply

Comment on section 9.1, Duerst

Why is there no <?xml version="1.0">, similar to 9.2? Why are there no cases without any XML declaration (difficult to represent in the current way but very realistic)?

Response: Disjunct added

Comment on section 9.1, Duerst

If sent using a 7-bit transport (e.g., SMTP [RFC0821]), the XML MIME entity MUST use a content-transfer-encoding of either quoted-printable or base64.

I may be wrong here, but in my understanding, it would be possible to send pure US-ASCII labeled as UTF-8 through 7-bit transport (i.e. what defines the transport is the actual data and not the potential bit patterns allowed by the charset).

Response: Clarified to the general case
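
The point that the actual octets matter, and not the charset label, can be sketched in Python as follows. The check reflects my reading of the 7bit definition in RFC 2045 (no NUL, no octets above 127, lines of at most 998 octets). It omits the rule about bare CR and LF, and the function name is an assumption.

```python
MAX_LINE_OCTETS = 998  # line length limit, excluding the CRLF


def fits_7bit(entity):
    """True if the octets of entity could travel as 7bit data unencoded."""
    if any(octet == 0 or octet > 127 for octet in entity):
        return False
    return all(len(line) <= MAX_LINE_OCTETS for line in entity.split(b"\r\n"))


# Labelled utf-8 but containing only US-ASCII characters: no
# quoted-printable or base64 needed on a 7-bit transport.
ascii_only = '<?xml version="1.0" encoding="utf-8"?><doc/>'.encode("utf-8")
# The same label with a non-ASCII character produces octets above 127.
non_ascii = "<doc>\u00e9</doc>".encode("utf-8")
print(fits_7bit(ascii_only), fits_7bit(non_ascii))
```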

Comment on section 9.5, Duerst

9.5: This is lengthy. Remove "and UTF-8 Entity" from the title, and simply say that this is interpreted as UTF-8 because that's the default for XML.

Response: Done, perhaps more than you intended

Comment on section 9.6, Duerst

9.6 "Observe that the BOM does not exist." -> "Observe that the BOM isn't present." or "Observe that there is no BOM." (the BOM exists as well as any other character :-)

Response: Heh. Fixed.

Comment on section 9.8, HST

UCS-4 is now deprecated, and this one is confusing anyway. Get rid of it.

Response: Done

Comment on section 5, Duerst

A registry of XPointer schemes [XPtrReg] is maintained at the W3C. Document authors SHOULD NOT use unregistered schemes. Scheme authors SHOULD register their schemes ([XPtrRegPolicy] describes requirements and procedures for doing so).

I fully agree with the SHOULDs here, but they don't belong in this spec.

Response: Surely the normative specification for the interpretation of fragment identifiers can place constraints on their use? Or would you prefer this to read "Fragment identifiers matching XPointer syntax SHOULD NOT involve unregistered schemes. [XPtrRegPolicy] describes requirements and procedures for registering schemes."?

Comment on section 8.2, Duerst

8.2, Reference: Replace "This specification" with "RFC XXXX". This makes the template more portable. There are more occasions that would benefit from the same change.

Response: I don't agree. This is not a template, it is not here to be copied, except in the case of the "MUST make reference" bit of frag-id considerations, which a) is effectively a template and b) already uses XXXX

Comment on section 8.2, Duerst

8.2, Fragment identifier considerations: The two provisions "they MAY restrict the syntax to a specified subset of schemes" and "They MAY further require support for other registered schemes" look okay, but they leave open the question of what's the default. As far as I understand, barenames and element scheme pointers are the (bare :-) minimum, and so the first provision seems unnecessary. Removing that provision (and removing the "further" in the second provision) should make it clear that the minimum is the default.

Response: I don't think your argument is correct. A foo+xml registration which states no restrictions allows any registered scheme to be used. So it's perfectly coherent for it to restrict to e.g. the minimum.

Comment on section 9.1, 9.2, Duerst

Section 9.1 and 9.2: The 7-bit/8-bit/binary considerations repeat what's already said in 3.1, so they should be replaced with a pointer to that section.

Response: The examples repeat a lot from higher up -- it's intentional (and carried forward from 3023)

Comment on section 9.3, Duerst

9.3: The title includes "and 8-bit MIME entity", but the considerations apply equally well e.g. to encoding="iso-2022-jp", which is 7-bit.

Response: 8859-1 is an 8-bit encoding, that's what this is an example of. Changing the title would be confusing, IMO. Unchanged from 3023.