27973 – Easy identification of XHTML5

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	All (OS: All)

Importance:	P2 minor
Target Milestone:	---
Assignee:	This bug has no owner yet - up for the taking
QA Contact:	HTML WG Bugzilla archive list

Reported:	2015-02-07 13:20 UTC by Stephan Kreutzer
Modified:	2015-02-16 01:45 UTC
CC List:	4 users

Description Stephan Kreutzer 2015-02-07 13:20:32 UTC

How does an XHTML5 document identify itself, in terms of self-descriptiveness, when there's no identifier in the DOCTYPE declaration, or even no DOCTYPE declaration at all? Will simple processing applications or general-purpose XML tools be required to make elaborate guesses by analyzing large portions of the document, if not all of it, before they can determine which schema to validate against? What happens if a future version of XHTML is needed, which may or may not be backward compatible?

Seen in the larger XML context, and not only in the browser context, I would really like to know how to distinguish XHTML5 easily from other XHTML versions and from self-identifying XML formats. I couldn't find any information about it, except that XHTML5 is “the one without identification” (my impression).

Comment 1 Jirka Kosek 2015-02-10 09:19:04 UTC

(In reply to Stephan Kreutzer from comment #0)
> How does a XHTML5 document identify itself in terms of self-descriptiveness,
> as there's no identifier in the DOCTYPE declaration or even no DOCTYPE
> declaration at all? Will simple processing applications or general purpose

An XHTML document must have its elements in the XHTML namespace (http://www.w3.org/1999/xhtml). So checking the namespace of the root element (html) should be sufficient to identify the content as XHTML from an XML point of view.
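
As a rough illustration of the root-element check described above, here is a minimal Python sketch. The function name `is_xhtml` is hypothetical, and the approach (stopping at the first start tag via the standard library's `iterparse`) is only one way to do it. It inspects the root element alone, so it says nothing about which XHTML version the document uses.

```python
import io
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_xhtml(data: bytes) -> bool:
    """Return True if the root element is <html> in the XHTML namespace.

    Hypothetical helper: only the first start tag is examined, so the
    rest of the document is never inspected.
    """
    try:
        # The first "start" event always belongs to the root element.
        _event, root = next(ET.iterparse(io.BytesIO(data), events=("start",)))
    except (ET.ParseError, StopIteration):
        return False
    # ElementTree spells namespaced tags as "{namespace-uri}localname".
    return root.tag == "{%s}html" % XHTML_NS
```

For example, `is_xhtml(b'<html xmlns="http://www.w3.org/1999/xhtml"/>')` is true, while the same root element without the namespace declaration is rejected.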

Comment 2 Stephan Kreutzer 2015-02-11 00:42:33 UTC

So XHTML5 won't introduce new tags to the XHTML 1.1 namespace, and an XHTML5 document will therefore be valid according to, for instance, the W3C XHTML 1.1 Schema? If XHTML 1.1 and XHTML5 documents differ in their vocabulary but can only be identified by the same namespace, then such documents can't easily be distinguished from each other without analyzing them in detail, can they?

Comment 3 Jirka Kosek 2015-02-11 08:24:01 UTC

(In reply to Stephan Kreutzer from comment #2)
> So XHTML5 won't introduce new tags to the XHTML 1.1 namespace,

It will definitely introduce new elements over time, as XHTML5 is just the XML-based serialization of HTML5 content.

> so a XHTML5
> document will be valid according to, for instance, the W3C XHTML 1.1 Schema?
> If XHTML 1.1 and XHTML5 documents differ in their vocabulary, but can only
> be identified by the same namespace, then such documents can't easily be
> distinguished from each other without analyzing them in detail, can they?

Why do you need to differentiate between those two? What's the use case?

Also please note that many constraints imposed on HTML5 (and thus on XHTML5) can't be checked by grammar-based schema languages. For better validation I would suggest using the vnu tool:

http://validator.github.io/validator/

Comment 4 Stephan Kreutzer 2015-02-16 01:45:45 UTC

If a program encounters an XML file, it can check for unique identifiers of the format to determine whether it is capable of processing the file. Namespaces, some special identifier within the XML or, for previous XHTML versions, the DOCTYPE declaration would tell the program in which format, and which version of it, the file is composed. The program may be able to handle some custom XML formats and a particular XHTML version, but not others, and would select the corresponding validation mechanisms and transformations based on this information.

To give a concrete example from my own situation: I've written a small program to package XHTML 1.0 Strict to EPUB2 (which uses XHTML 1.1 internally). If I want to implement EPUB3 support (which uses HTML5 internally), it will become difficult, because XHTML 1.0 Strict input might need a transformation, but a different one than XHTML5 input would. The user, however, can't be expected to care about HTML version numbers and probably won't provide information about them.

It seems that the Markup Validation Service implements its "Doctype: detect automatically" feature for HTML5 by looking for <!DOCTYPE html>, which differs from the HTML4/XHTML 1.0/XHTML 1.1 DOCTYPEs in that it carries no public identifier, while the presence of the XHTML namespace distinguishes between HTML and XHTML. And a future HTML version might distinguish itself from the previous versions by the absence of the DOCTYPE altogether.
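
The detection idea above (public identifier plus root namespace) could be sketched in Python roughly as follows. This is only a heuristic under stated assumptions, not something any specification defines: the function name `guess_xhtml_flavor`, the returned labels, and the choice to treat "XHTML namespace without a public identifier" as XHTML5 are all hypothetical. In particular, an older XHTML document that simply omits its DOCTYPE would be misreported as XHTML5.

```python
import xml.parsers.expat

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Public identifiers of the older XHTML DOCTYPEs (not an exhaustive list).
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML 1.1//EN": "XHTML 1.1",
}


class _RootSeen(Exception):
    """Raised internally to stop parsing once the root element is seen."""


def guess_xhtml_flavor(data: bytes) -> str:
    """Guess the XHTML flavor from the DOCTYPE and root element only."""
    info = {"public_id": None, "root": None}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["public_id"] = public_id

    def on_start(name, attrs):
        info["root"] = name
        raise _RootSeen  # nothing after the root start tag is needed

    # With a namespace separator, names arrive as "namespace-uri localname".
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    try:
        parser.Parse(data, True)
    except _RootSeen:
        pass
    except xml.parsers.expat.ExpatError:
        return "not well-formed XML"

    if info["root"] != XHTML_NS + " html":
        return "not XHTML"
    if info["public_id"] in KNOWN_PUBLIC_IDS:
        return KNOWN_PUBLIC_IDS[info["public_id"]]
    if info["public_id"] is None:
        # Assumption: <!DOCTYPE html> or no DOCTYPE at all means XHTML5.
        return "XHTML5 (assumed)"
    return "unknown XHTML"
```

A packaging tool like the one described could use the returned label to pick a transformation, falling back to asking the user when the result is "unknown XHTML".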