Re: [HTML5] 2.8 Character encodings

Ian Hickson:
> On Sun, 16 Aug 2009, Dr. Olaf Hoffmann wrote:
> > Ian Hickson:
> > > On Wed, 12 Aug 2009, Dr. Olaf Hoffmann wrote:
> > > > The meaning of some elements is different in 'HTML5' as well or is
> > > > defined in a more restrictive way, what excludes some use cases
> > > > possible in HTML4.
> > >
> > > Yes, but in practice that's not an issue since HTML5 describes how
> > > HTML4 UAs actually did things.
> >
> > User agents present the elements somehow, often this does not directly
> > imply a meaning.
>
> I agree, only the spec can imply a meaning.
>
> > And if we take (again, I already discussed this with Anne) the sample
> > of the element small, the presentation implicates no specific meaning,
> > what is ok for HTML4, because the definition does not imply a specific
> > meaning either. The audience has to derive this from relations to the
> > content around it. 'HTML5' defines a meaning for small.
> >
> > Therefore the 'HTML5' definition does not apply for all use cases
> > in HTML4 documents, just to a subset.
>
> By that reasoning, the 'HTML4' definition does not apply for all instances
> of HTML4 documents either. After all, most HTML4 documents have some level
> of conformance error, and an overwhelming number of HTML4 documents use
> elements incorrectly. I don't think this is a useful line of reasoning.

'Not really sure what you're saying here.'
I never had a problem understanding the specified meaning of most
HTML4 elements, so I cannot see errors in the definitions
of the meaning of those elements, especially because for several
of them the semantic meaning is vague, which often corresponds
to vague use cases in many documents. In general HTML does not
have many elements, so to cover every possible type
of text the meaning of the elements has to be very broad.
Another approach would of course be to have a larger collection
of elements to mark up text (which is partly available in other
formats for more or less specific use cases).
That some authors use some elements for unintended purposes
is not directly a problem of the specification; it can be a social
problem, a problem of limited intellectual capabilities, of indifference
or of ignorance. Such things cannot be changed with another version
of a language. Maybe they cannot be changed at all.

>
> > However, because user agents do not need to care about the meaning,
> > the presentation may not differ. Authors have to care about the meaning
> > and cannot use the element in 'HTML5' for some use cases.
> > This is not necessarily a problem for authors, if there are other
> > elements intented in 'HTML5' for their use case and because HTML4 and
> > 'HTML5' are different versions and looking at the version indication
> > (doctype) one can at least indentify, when authors use the HTML4
> > definitions. As long as 'HTML5' has no version indication, there is no
> > simple way to indicate, that the definition in 'HTML5' applies.
>
> Just always assume the HTML5 definition applies. 

This would surely be wrong for several use cases that 'HTML5' excludes.
It is no problem for me that 'HTML5' excludes them for 'HTML5'
documents. However, for HTML4 documents they are still possible;
this does not change over time and does not depend on other people
specifying other versions of the language.
Applying 'HTML5' to such documents would imply that some
constructions have no meaning at all, because the elements are
used for the wrong purpose.
Or, if I define an HTML version of my own, exchanging the
definitions of the elements p and blockquote, and declaring
that this new HTML version supersedes any other version,
I think this does not really change the meaning of p and
blockquote in HTML4 or in 'HTML5'; it changes only the
meaning for documents written in my own HTML version.
Or, more related to the discussion: in 'my version' of
'HTML5' I could define a version attribute. How does
this change the meaning of the W3C draft?
It changes the meaning and usability of my draft, but not
that of the W3C draft. And the changes are not
necessarily relevant for the presentation in a user
agent.


> As far as I can tell, 
> that won't cause any problems. Can you point to a page where doing this
> causes a practical problem with an software package?
>

Well, authors are not software packages, therefore:
'Not really sure what you're saying here.'

> > Or the meaning of cite is defined more precisely, what is ok for
> > a new version, but not applicable for the usage in HTML4 documents.
>
> HTML4's description was apparently vague enough that different people
> consider it to mean different things. My interpretation of HTML4's text is
> that HTML5's definition of <cite> is a superset of HTML4's.
>
> > Ok - what really a proper content is, depends on several things,
> > if simply a private communication or a dictum is quoted, it has no
> > title and one has to note the name of the person.
> > If a work has specific authors, both title and authors and maybe
> > a source or a unique identifier may belong to the citation information.
>
> Not really sure what you're saying here.
>

I just compare what is often included in citations with what
the current draft says about the content of cite.
I think you mean that the draft definition is a subset of
what is possible as content in HTML4?
It is no problem for me that the draft defines in more
detail what to put in cite in 'HTML5' documents; however,
HTML4 does not do so, and therefore this element can
contain other content in HTML4 documents.

For example:
<blockquote>
Not really sure what you're saying here.<br>
<cite>Ian</cite>
</blockquote>
looks OK to me in HTML4, but not in 'HTML5'.

Or
<p>
It was demonstrated, that the simultaneous
optical excitation of atoms and molecules 
within collisions can be observed with differential
detection in a beam experiment. <br> 
<cite id="ref1">V. A. Alekseev, J. Grosser, O. Hoffmann, F. Rebentrost in 
JCP <b>129</b> 201102 (2008)</cite>
</p>

Alternatively, the cite could contain an a element with
a reference:
<cite><a href="#ref1">[1]</a></cite>

There are different citation styles and different kinds of
information (who, what and which resource), not
only the 'title of a work'.

For 'HTML5' one has to write something like this:
<cite>Simultaneous optical excitation of Na electronic and
CF<sub><small>4</small></sub> vibrational modes in
Na+CF<sub><small>4</small></sub> collisions</cite>

This is quite different information.
This sample of course has a problem with the element
small, so it is again OK only in
HTML4, not in 'HTML5'; there, I think, one has
to use MathML to mark up the molecule.




> > acronym - non conforming feature in the current draft, well
> > defined in HTML4.
> > I think, with the instead recommended element abbr there is a problem
> > with other (legacy?) versions of MSIE.
> > Obviously here the 'HTML5' draft does not include an explanation of
> > the meaning of HTML4 documents and does not necessarily
> > do a better job concerning the description of the interpretation of
> > legacy viewers.
>
> I don't understand what you're asking for here. HTML5 says that <acronym>
> should be handled as a synonym for <abbr>.

It is noted under:
'12.2 Non-conforming features'
with:
"acronym
Use abbr instead.
"

Because every acronym is an abbreviation too,
there is no problem in doing this in 'HTML5'; one can
use microdata/RDFa to specify it in more detail, if
required.
However, in HTML4 it is not a 'non-conforming feature'
and can have a slightly different and more specific meaning
than abbr.
Therefore this is surely another example of something
an author can use meaningfully in HTML4, but should
not use in 'HTML5', because it is listed as a
'non-conforming feature'. Obviously, this definition does
not apply to acronym within HTML4 documents.
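
A minimal sketch of the difference, assuming an arbitrary acronym and
expansion text chosen only for illustration (neither is taken from
HTML4 or from the draft):

```html
<!-- HTML4: acronym is conforming and marks the text specifically as an acronym -->
<acronym title="North Atlantic Treaty Organization">NATO</acronym>

<!-- 'HTML5' draft: acronym is listed as non-conforming, abbr is used instead -->
<abbr title="North Atlantic Treaty Organization">NATO</abbr>
```

The second form only says that the text is an abbreviation; any more
specific meaning would have to be added by other means.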



>
> > The content model of dl is more restrictive in 'HTML5' - surely
> > it cannot describe uses of the less restrictive model of HTML4.
>
> <dl> hasn't changed as far as I can tell.
>

HTML4:
<!ELEMENT DL - - (DT|DD)+              -- definition list -->

'HTML5':
Content model:
Zero or more groups each consisting of one or more dt elements 
followed by one or more dd elements.
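
A minimal sketch of what the two content models quoted above accept,
as far as the quoted definitions can be read literally (the text
content is made up for illustration):

```html
<!-- Allowed by both: one group of dt followed by dd -->
<dl>
  <dt>term</dt>
  <dd>description</dd>
</dl>

<!-- Allowed by the HTML4 DTD, (DT|DD)+, but not by the 'HTML5' model:
     dd elements without a preceding dt -->
<dl>
  <dd>first entry</dd>
  <dd>second entry</dd>
</dl>

<!-- Allowed by the 'HTML5' model (zero groups), but not by the HTML4 DTD,
     which requires at least one dt or dd -->
<dl></dl>
```

The second and third fragments each fit only one of the two
definitions, so neither content model contains the other.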




> > And viewers have no problem to present such uses, therefore
> > 'HTML5' may have a better definition of definition lists, excluding
> > some not very nice use cases, but it does not describe several
> > really existing HTML4 documents or how they are presented
> > by current viewers.
>
> I don't follow. Could you include some examples maybe?

These are, for example, the nasty 'poetry' samples discussed
some time ago. Because HTML still has no elements for poetry,
at least in HTML4 documents one has to work around this.
In XHTML one can use related elements from other languages,
or in XHTML+RDFa one can indicate the meaning with RDFa;
in 'HTML5' this may work with microdata/RDFa as well.
However, for the last two variants one has to use suitable
elements with a sufficient structure model for strophes (stanzas)
and strophe lines, for example dl/dd (excluded in 'HTML5'),
div/div or maybe section/div.
Because 'HTML5' has other elements and dl/dd is not
applicable, the best possible solution in 'HTML5' looks different
from that in HTML4 or XHTML1.x.
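
A sketch of how such workarounds might look, assuming one dl per
strophe and one dd per strophe line. The class names in the second
variant are hypothetical and the verse text is placeholder content:

```html
<!-- HTML4 workaround: dl as strophe, dd as strophe line (dd-only content) -->
<dl>
  <dd>first line of the strophe</dd>
  <dd>second line of the strophe</dd>
</dl>

<!-- Possible 'HTML5' variant: div/div, with the meaning indicated by
     other means (class names here, or microdata/RDFa) -->
<div class="strophe">
  <div class="line">first line of the strophe</div>
  <div class="line">second line of the strophe</div>
</div>
```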

Maybe some use dl for recipes, bills or some structures
in the Bible, for example (this is already discussed in the related
wiki). 'HTML5' does not describe this, which is no problem
for documents of previous HTML versions, and authors
of 'HTML5' documents can find other (maybe better)
solutions. This is why 'HTML5' is different from HTML4
and why it does not define the meaning of some structures
in HTML4 documents.
This happens mainly because there is no concept in
'HTML5' either to just simplify the element collection
and use something like RDFa to provide the semantic
meaning, or to define a more complete collection of
semantic elements for specific use cases.
'HTML5' is more a matter-of-taste mixture
of changes and improvements; it is therefore clearly
different from HTML4, and neither of them is a
true subset of the other. They share many common
or similar features.

>
> > I think, for object some attributes are missing.
> > Well, some authors used some of them wrong and
> > something like declare was not widely implemented.
> > Both does not indicate directly a problem with the HTML4
> > definition.
>
> I think that's exactly what it indicates, actually.
>
> > I think, there is still no declarative method in
> > the draft to start some time dependent content of object,
> > therefore declare is really missing in 'HTML5', not only
> > for object. However, if an author uses it in a HTML4
> > document, one cannot expect that the behaviour of
> > a browser ignoring this attribute is that, what was
> > intented by the author ;o)
> > The implementation gap simply excludes some use
> > cases of object in practice - maybe one of the reason,
> > why there is currently a lot of strange content around,
> > trying to simulate such functionality somehow to work
> > around the gap.
>
> I'm not sure what you're asking for here.

Well, this is not related to the discussion here, but
SMIL and SVG have declarative methods to begin
and to end, for example, video and audio.
HTML4 at least had declare to
begin such objects. 'HTML5' does not have it.
Maybe the best approach for authors is still to
embed SVG or Flash to do the job. 'HTML5'
clearly fails to provide a simple declarative method
that allows authors to specify buttons or selection
tools to begin and end such media.
Being able to begin, for example, a video or
audio after a user interaction is often
important, to allow the user to select between different
options.

>
> > I think, there are several more samples, all of them show, that
> > 'HTML5' does not describe all 'valid' HTML4 documents properly.
>
> Could you list them? I should fix them, if so.
>

From my point of view 'HTML5' is a new version of HTML and
can therefore be different. There is no need to fix something or to
list it. If an author indicates that the version 'HTML5' is used,
the new definitions and meanings apply.
If HTML4 is indicated, the old meanings apply. With a version
indication there is no problem at all, and no need to spend time
comparing and listing differences (which can have good
reasons, of course).



> > I do not think, 'HTML5' has to do this, because it is a new version
> > of the language.
>
> I think HTML5 must do this, because it is a new version of the language.
>
> > It is just pretty useless to disclaim such simple
> > facts and incompatibilities.
>
> Not sure what you mean.

;o)

>
> > > > And has far as I have seen, those changes are not mentioned
> > > > in the current draft (as well as maybe some missing attributes).
> > > > If we take the sample of the version attribute itself, it does not
> > > > define what it means, HTML4 for example does.
> > >
> > > HTML4's statements on the matter are inconsistent with actual
> > > implementations and legacy content.
> >
> > I cannot see, what is inconsistent here:
> >
> > "version = cdata [CN]
> > Deprecated. The value of this attribute specifies which HTML DTD version
> > governs the current document. This attribute has been deprecated because
> > it is redundant with version information provided by the document type
> > declaration.
> > "
>
> This is inconsistent, e.g., with the following text in HTML4:
>
> # The document type declaration names the document type definition (DTD)
> # in use for the document [...]
>
> Which is it? The DOCTYPE or the version="" attribute?
>

Not really sure what you're saying here.
If there is no doctype (as in XHTML+RDFa), the indication in
version applies. If there is a doctype, that applies.
If doctype and version information are incompatible, the
version seems to be undefined, because I think there is no
information about which takes precedence. Therefore authors have
to avoid such conflicts, as they should in general.
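
A sketch of the two unambiguous cases, shown as two separate document
skeletons (not one file), with the document content elided:

```html
<!-- HTML4: the doctype indicates the version -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<html>
...
</html>

<!-- XHTML+RDFa without a doctype: the version attribute indicates the version -->
<html xmlns="http://www.w3.org/1999/xhtml" version="XHTML+RDFa 1.0">
...
</html>
```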


> > This does not even suggest a specific use of the attribute or that
> > the interpretation or presentation of a simple browser must depend
> > on such an information.
>
> Indeed, the text you quoted is completely empty of normative conformence
> criteria. It doesn't define anything; the spec would lose nothing if that
> text was removed. This is typical of much of HTML4.
>


Another variant of (X)HTML is more specific about this.
In XHTML+RDFa it is noted:

'There SHOULD be a @version attribute on the html element with the 
value "XHTML+RDFa 1.0"'

And this is more relevant, because HTML4 documents have the
doctype to indicate the version, whereas this XHTML variant does not
necessarily have a doctype.
'HTML5' currently has no other version indication, but the
XHTML namespace has one. Therefore, at least for the XHTML/XML
variant of 'HTML5', one can use a version attribute belonging
to the XHTML namespace. But because 'HTML5' still does not
say how to indicate the version, the value of the attribute is
still an open question. The best choice might be the URI of the
recommendation, because no two
versions have the same URI if nothing went wrong.



> Anyway, I'm not interested in arguing about the flaws of HTML4. It's a
> decade too late for that.
>
> > [...]
>
> I don't really understand what you want me to do, at this point. If you
> could concisely state what problem exists in the HTML5 spec that you
> believe should be addressed, I can try to address it (if it really is a
> problem). 

Well, that is simple. Allow authors to indicate 'HTML5' as a version,
for example with
<html version="http://www.w3.org/TR/html5/" ...>
This would already be better than the XHTML+RDFa approach.
This is mainly meta information about the semantic meaning
of the document content, not a requirement for user agents to
do something specific. It is similar to microdata information.
For simple presentation one need not care about it, but if
someone tries to find out the relation between the
current document and the meaning of the language used and
its elements, this is interesting information.
HTML documents often have meta information that is not relevant
for any user agent, for example meta elements containing
descriptions or keywords, or encoding information when the
server has already sent the encoding information.
However, it is not completely useless just because it is not relevant
for some user agents or some situations.


> However, this conversation at this point is meandering 
> apparently aimlessly and I'm not sure that it will lead to a productive
> conclusion.
>

My personal impression is that this happens quite often with
discussions in the 'HTML5 WG' when semantic issues or issues
of interest to authors are discussed.
To change this, maybe one has to find out what the collective
problem of the WG with such issues is ;o)

> > > > A current draft cannot change the meaning of a previous
> > > > specification/recommendation and it does not change the meaning of
> > > > documents written in this previous language version.
> > >
> > > Actually, it can, when the older specification was incorrect.
> >
> > How can it be incorrect, if the semantical meaning of the content of an
> > element is defined?
>
> The spec isn't the final word on the meaning of the language. The use of
> the language is the final word on the meaning of the language.
>

This applies more to spoken languages and dictionaries.
Dictionaries only describe how the words of a language are
currently used.
With a specified language this is different. It is a technical terminology
with fixed meanings. This is one of the main reasons to
use something like markup languages at all.


> If everyone uses <embed> to embed a plugin, then that's what <embed>
> means. If everyone uses <object codebase=""> to specify the source of the
> plugin, then that's what that attribute means, even if HTML4 says that the
> attribute gives the base URL for the classid="" attribute.

In HTML4 documents 'embed' means nothing. And at least on my
Linux computers not even all browsers interpret it (for example,
I think Opera still ignores it, at least in combination with SVG ;o)

And even if millions of people believe that 1+1=1, this does not
mean that this is the meaning, or that it is true, or that it
implies changing the convention that '+' typically means
addition and not multiplication. It mainly indicates
that millions of people are wrong. This is not surprising. And one
of the advantages of well-defined technical terminologies is that
a minority is still able to express and share relevant information,
even if the majority is not able to understand or use such information
at all. And if they refer to a specification of their terminology, everyone
else can at least learn it and understand what was intended.
On the other hand, it is quite simple to check whether an assertion
within this terminology is meaningful or not.
If the meaning were adjusted to the majority, this would
mainly result in more stupidity.


Olaf

Received on Friday, 28 August 2009 15:32:26 UTC