21439 – The MAY w.r.t. treating invalid longdesc URLs as text, is harmful. Remove the harm - or remove the MAY.

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.


Status:	RESOLVED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML Image Description Extension
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Charles McCathieNevile
QA Contact:	HTML WG Bugzilla archive list

URL:	http://www.w3.org/TR/2013/WD-html-lon...
Whiteboard:
Keywords:	a11y, a11y_text-alt

Depends on:
Blocks:	21678

Reported:	2013-03-29 21:55 UTC by Leif Halvard Silli
Modified:	2013-05-09 17:54 UTC
CC List:	3 users

See Also:

Attachments

Description Leif Halvard Silli 2013-03-29 21:55:16 UTC

The current permission to interpret the longdesc attribute content as text when it is an invalid URL does not say how to do this.

]] user agents may make that content available to the user. This is because a common authoring error is to include the text of a description, instead of the URL of a description, as the value of the attribute. [[

I believe this should follow a specified algorithm which FIRST checks whether the string can reasonably be interpreted as text (e.g. because it contains many unescaped spaces) and SECOND runs a transformation which turns the text into a data URI, which the user agent then opens as an external page (external URL - URL to external page).

When generating the data URI, the data URI should contain an HTML document which should be UTF-8 encoded and which should inherit the language tags of the current page, if any. The title element of the data URI document should say 'long image description' or something like that.

Justification: To define the MAY option like this would underscore that longdesc is nevertheless a URL. It would also point out that longdesc can contain a data URI. On the opposite side, as the spec stands, I find that the fact that longdesc is a URL is undermined - it is "kindness" in the wrong way.
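
A minimal sketch of the two steps described above, for illustration only. The function names (`looksLikeText`, `escapeHtml`, `textToDataUrl`) and the "at least two unescaped spaces and no scheme" heuristic are assumptions made for this example; neither the spec nor this bug defines them.

```javascript
// Hypothetical heuristic: guess whether a longdesc value is prose rather
// than a URL. The threshold of two spaces is an assumption, not spec text.
function looksLikeText(value) {
  const spaces = (value.match(/ /g) || []).length;
  const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(value);
  return spaces >= 2 && !hasScheme;
}

// Escape characters that are significant in HTML markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Wrap the text in a small UTF-8 HTML document and return it as a data URL.
// `lang` stands for the language inherited from the current page, if any.
function textToDataUrl(text, lang) {
  const langAttr = lang ? ' lang="' + escapeHtml(lang) + '"' : '';
  const html =
    '<!DOCTYPE html><html' + langAttr + '><head><meta charset="utf-8">' +
    '<title>Long image description</title></head><body><p>' +
    escapeHtml(text) + '</p></body></html>';
  // encodeURIComponent always emits UTF-8 percent-escapes.
  return 'data:text/html;charset=utf-8,' + encodeURIComponent(html);
}
```

A user agent following this idea could then hand the resulting data URL to whatever mechanism it already uses for ordinary longdesc URLs.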

Comment 1 Charles McCathieNevile 2013-04-01 13:41:43 UTC

I'm prepared to offer an example of using a data URL in the examples, and suggest that in prose as a possible repair strategy.

User agents are not required to open the longdesc in a new window or tab and I am not prepared to make that requirement. There are implementation strategies which seem perfectly valid but do not do this. We should avoid introducing restrictions whose effects we don't really understand, IMHO.

Comment 2 Leif Halvard Silli 2013-04-01 15:54:05 UTC

(In reply to comment #1)
> I'm prepared to offer an example of using a data URL in the examples, and
> suggest that in prose as a possible repair strategy.

Super! Perhaps you could add, in that regard, that the purpose of turning it into a data URI is an economical one: 

 * FIRSTLY, it allows the UA to reuse its usual mechanism for handling longdesc URIs (as opposed to implementing two different ways to handle the content of the longdesc attribute).
 * SECONDLY, it allows users to interact with invalid longdesc attribute content in the same way they interact with valid content.
 
> User agents are not required to open the longdesc in a new window or tab and
> I am not prepared to make that requirement.

I’m fine with not telling how the data URI should be handled. My sole point was really only the economical point I made above. Sorry for my lack of clarity.

> There are implementation
> strategies which seem perfectly valid but do not do this. We should avoid
> introducing restrictions whose effects we don't really understand, IMHO.

I’m not sure what you mean here. But in the HTML5 spec, it is sometimes/somewhere said that the purpose of describing algorithms is not that implementations follow the algorithms 100% to the letter, but rather that they perform a (possibly optimized) activity that has the same outcome as the described algorithm. And perhaps it would be possible to say something like that?

That said, I am a bit sceptical about giving invalid data URIs too much attention - to the extent that I am not 100% sure it is a good thing to have it in the spec. I mean, I am not aware of UAs or AT, today, that do anything with such content.

Comment 3 Charles McCathieNevile 2013-04-03 12:36:21 UTC

I've partially resolved this, by adding examples (including one of doing the repair for invalid content), in https://dvcs.w3.org/hg/html-proposals/raw-file/0dd2e510d4e1/longdesc1/longdesc.html

I propose not to do more - i.e. I don't plan to add to the implementations section.

Comment 4 Leif Halvard Silli 2013-04-23 00:41:14 UTC

(In reply to comment #3)

Executive summary:  As even invalid URLs are URLs, the proposed script is flawed. The longer explanation is that the script demonstrates the technical and conceptual flaws of the permission to interpret @longdesc as text:

1) It wrongly assumes that invalid URLs do not work.
   EXAMPLE: <a href="the invalid URL">This link works</a>!

2) It would cause invalid URLs to be presented to users as text.
   EXAMPLE: longdesc="http://example.org/the invalid URL"
   would become <body>http://example.org/the invalid URL</body>

3) It would cause working (a.k.a. non-dead), invalid URLs to
   stop working. EXAMPLE:
   a) iCab and Opera today repair longdesc URLs that are invalid
      due to space characters.[1] And thus, that class of invalid
      longdesc URL nevertheless works well in said browsers.
            [1] Code: longdesc="the invalid URL"
   b) But if the MAY option/the script is applied, then, instead
      of repairing the URL and serving the file [2] to the user,
            [2] Filename: 'the invalid URL.html'
   c) the user would be served a piece of content, whose body
      would say <body>the invalid URL</body>.
      Which would be a loss/degradation of content.

4) It would be a layer violation: An invalid URL should eventually
   be repaired in such a way that it becomes a valid URL. Invalid
   URLs can be - and are - commonly repaired via simple repair
   techniques that almost every Web oriented URL consumer performs.

5) It takes away the attention from the real issue: dead URLs.
   Your script says: "//assumes some URL validating function."
   However, as shown above, URL *validation* would not be enough.
   To be sure that the "repair" did not in fact *destroy* things
   for the user, the UAs would have to run the following steps:

     First: Repair the URL (if necessary). (UAs already do this.)
    Second: Test whether the repaired URL is dead or working.
            That is: Do some sniffing. (Which in turn implies
            that we would have to step into the *formats issue*
            again, no?)
     Third: If dead, then check whether content is likely to
            make sense as text. (E.g. if the @longdesc content
            begins with the string "http://", then the content
            is unlikely to be useful as text.)
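
The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed normative algorithm: `repairUrl`, `plausibleAsText` and `resolveLongdesc` are hypothetical names, `isDead` is a caller-supplied check (kept synchronous here for brevity, though a real liveness probe would be asynchronous), and the repair step simply relies on the standard `URL` constructor available in browsers and Node.js, which percent-encodes spaces in the path.

```javascript
// First: repair the URL. The URL constructor percent-encodes spaces in the
// path; it throws on parse failure, which is mapped to null here.
function repairUrl(value, base) {
  try {
    return new URL(value, base).href;
  } catch (e) {
    return null;
  }
}

// Third-step heuristic (an assumption): content that begins with a scheme
// followed by "//", such as "http://", is unlikely to be useful as text.
function plausibleAsText(value) {
  return !/^[a-z][a-z0-9+.-]*:\/\//i.test(value.trim());
}

// `isDead` is a hypothetical caller-supplied function returning true when
// the repaired URL does not lead to a usable resource.
function resolveLongdesc(value, base, isDead) {
  const repaired = repairUrl(value, base);
  // Second: only a repaired URL that is not dead is served as a URL.
  if (repaired !== null && !isDead(repaired)) {
    return { kind: 'url', value: repaired };
  }
  // Third: fall back to text only if the content plausibly is text.
  if (plausibleAsText(value)) {
    return { kind: 'text', value: value };
  }
  return { kind: 'none', value: null };
}
```

With this ordering, `longdesc="the invalid URL.html"` is repaired and followed whenever the target exists, and only a dead target whose attribute value does not look like a URL is ever shown as text.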

I am glad that you created this script, as it allowed me to understand that the idea (at least as currently formulated) is flawed.

Comment 5 Leif Halvard Silli 2013-04-23 01:02:52 UTC

Updated the title of this bug, to reflect the last sentence in comment #0 (Quote: "I find that the fact that longdesc is a URL is undermined - it is "kindness" in the wrong way.")

Comment 6 Leif Halvard Silli 2013-04-23 03:25:13 UTC

6) The MAY option complicates things for authors. 

   EXAMPLE:

      <img src="jpeg.jpg" alt="Text." longdesc="foo bar.html"/>

   EXPLANATION: 

   If the longdesc URL happens to contain a space character, then the URL would be invalid. And thus, per the current spec text’s MAY option, a conforming user agent could choose to present the longdesc attribute’s content as the long description itself.

   As a result, some users would be presented with the content of the very longdesc attribute, while users of user agents that do not implement the MAY option would get the content of the file "foo bar.html".

   And all this only because the author forgot to escape a space character.

   CONCLUSION:

   The result of the MAY option is that the longdesc attribute would become less robust - both when compared with other attributes that take a URL, and also compared with the status quo.

Comment 7 Laura Carlson 2013-04-24 11:41:24 UTC

(In reply to comment #6)

> The MAY option complicates things for authors. 
> 
>    EXAMPLE:
> 
>       <img src="jpeg.jpg" alt="Text." longdesc="foo bar.html"/>
> 
>    EXPLANATION: 
> 
>    If the longdesc URL happens to contain a space character, then the
> URL would be invalid. And thus, per the current spec text's MAY 
> option, a conforming user agent could choose to present the longdesc 
> attribute's content as the long description itself.
>
>  As a result, some users would be presented with the content of 
> the very longdesc attribute, while users of user agents that do not 
> implement the MAY option, would get the content of the file "foo 
> bar.html".

That is a good point, Leif. Let's not inadvertently introduce anything into the spec that complicates longdesc for authors. For this to work a user agent would need to differentiate between a space in a text string and a non-escaped space in a longdesc URL (as well as performing other checks to ascertain if a link is dead or not).

Charles, here is an idea: perhaps have the spec incorporate Leif's algorithm from Comment 4. Say something like: if a user agent is following the MAY normative repair statement it MUST do Leif's Point 5 First, Second, and Third.

In addition when a user agent detects a non-escaped space it "MUST" check for file extensions (i.e., ".html",  ".htm" , ".pdf", ".txt", ".php", etc, etc, etc). If a file extension is found then consider it a URL. If no file extension exists AND no document fragment exists either, have the data URI repair kick in. 
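
A rough sketch of that file-extension check, for illustration only. `classifyLongdesc` and `KNOWN_EXTENSIONS` are hypothetical names, and the extension list just repeats the examples given above; a real list would have to be much longer.

```javascript
// Extensions taken from the examples above; illustrative, not exhaustive.
const KNOWN_EXTENSIONS = ['.html', '.htm', '.pdf', '.txt', '.php'];

// Returns 'url' or 'text' for a longdesc attribute value.
function classifyLongdesc(value) {
  // Without an unescaped space there is nothing to disambiguate.
  if (!value.includes(' ')) {
    return 'url';
  }
  const hasFragment = value.includes('#');
  // Ignore any fragment and query string when looking for the extension.
  const path = value.split('#')[0].split('?')[0].toLowerCase();
  const hasExtension = KNOWN_EXTENSIONS.some((ext) => path.endsWith(ext));
  // Only when neither an extension nor a fragment is present does the
  // data URI repair kick in.
  return hasExtension || hasFragment ? 'url' : 'text';
}
```

The sketch also shows how fragile such a heuristic is: a description such as "Chart #3 shows sales" contains a `#` and would be classified as a URL.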

User agents programmatically detecting all of this *correctly* could certainly help longdesc. If they get it wrong it could/would be a disaster.

To do it right will require a solid UA repair algorithm. Right now we don't have that and a lot can and would be overlooked if the spec is left as vague as it is currently. 

Chaals, if you would rather not provide a repair algorithm in this spec, seriously consider removing the following normative [1] and informative [2] statements entirely from the longdesc spec and get UAAG to add one. Then you could add something like "If a longdesc attribute has invalid content, user agents MAY make that content available to the user. If they do, then they MUST follow the algorithm as detailed in UAAG."

I would love to see a good algorithm in this spec or in UAAG. But until that happens the longdesc spec is better off without [1] and [2].

[1] "If a longdesc attribute has invalid content, user agents MAY make that content available to the user. This is because a common authoring error is to include the text of a description, instead of the URL of a description, as the value of the attribute."

[2] "One of the most common mistakes authors make that is easily repaired by user agents is to use a description, instead of a URL that links to a description. This means there is often plain text description in the content of an invalid longdesc attribute. Converting such attributes to data URLs is a simple repair strategy that can help recover from cases where authors have made this mistake."

Comment 8 Charles McCathieNevile 2013-04-24 14:35:59 UTC

Please check the latest Editor's draft (from Saturday).

The statement has been removed, in line with a resolution taken last week in the TF.

Comment 9 Leif Halvard Silli 2013-04-24 14:45:27 UTC

(In reply to comment #8)
> Please check the latest Editor's draft (from Saturday).
> 
> The statement has been removed, in line with a resolution taken last week in
> the TF.

Hm. When I scanned the TF report, I wondered if you had come to the opposite conclusion. Maybe I did not read the entire text ...

Anyway, the draft from Saturday still says:

]]
One of the most common mistakes authors make that is easily repaired by user agents is to use a description, instead of a URL that links to a description. This means there is often plain text description in the content of an invalid longdesc attribute. Converting such attributes to data URLs is a simple repair strategy that can help recover from cases where authors have made this mistake. 
[[

Plus as well:

]]
//Tries to repair errors where the longdesc isn't a URI

var describedImages = document.querySelectorAll('img[longdesc]');

for (i in describedImages) {
  if (i.longdesc && !(validURL(i.longdesc)) { //assumes some URL validating function
    var theData = encodeURIComponent(i.longdesc);
    i.longdesc = "data:text/plain;charset=";
    i.longdesc += document.charset;
    i.longdesc += theData;
  }
}
[[
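
Apart from the conceptual objections raised in this bug, the snippet as quoted would not run as written: the `if` line lacks a closing parenthesis, `for (i in ...)` iterates over index keys rather than over the elements, the data URL lacks the comma that separates the media type from the data, and `encodeURIComponent` always produces UTF-8 escapes regardless of the document's charset. A corrected sketch of the same idea might look like this; `validURL` is still an assumed, caller-supplied function as in the draft, and `toPlainTextDataUrl` and `repairLongdescs` are names made up for this example.

```javascript
// Build a text/plain data URL. The charset is fixed to UTF-8 because
// encodeURIComponent always emits UTF-8 percent-escapes.
function toPlainTextDataUrl(text) {
  return 'data:text/plain;charset=UTF-8,' + encodeURIComponent(text);
}

// Tries to repair errors where the longdesc isn't a URL.
// `images` is any iterable of elements; `validURL` is an assumed,
// caller-supplied validating function, as in the draft.
function repairLongdescs(images, validURL) {
  for (const img of images) {
    const value = img.getAttribute('longdesc');
    if (value && !validURL(value)) {
      img.setAttribute('longdesc', toPlainTextDataUrl(value));
    }
  }
}

// In a browser this could be called as:
// repairLongdescs(document.querySelectorAll('img[longdesc]'), validURL);
```

The sketch only fixes the mechanics; whether "not a valid URL" is the right trigger for the repair is exactly what this bug disputes.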

Comment 10 Charles McCathieNevile 2013-05-01 10:08:17 UTC

The may statement in the conformance requirements has been removed.

Since it is true that a common mistake is to include the description directly in the attribute, the repair example and the informative statement about it are still there, based on the "hierarchy of needs" principle - it is more important to help users than authors...

Comment 11 Leif Halvard Silli 2013-05-02 02:23:54 UTC

(In reply to comment #10)
> The may statement in the conformance requirements has been removed.

Right. But as is, the description blurs things, I feel. (But feel free to suggest we handle this as a separate bug.)

> Since it is true that a common mistake is to include the description
> directly in the attribute, the repair example and the informative statement
> about it are still there, based on the "hierarchy of needs" principle - it
> is more important to help users than authors...

Yea. But even if every second longdesc was text, still, as long as @longdesc is supposed to be a URL container, then everyone should be able to expect that, by *default*, user agents seek to resolve it as a URL [1] before eventually seeking to consume it as text. Agree? (Note: by default.)

Note, in that regard, that the URL standard, which HTML 5.1 refers to, considers spaces inside a URL as valid,[3] (see bug 21897). But note as well that what the URL standard says about *failure* might come to our rescue:[2]

   ]]Parsing (provided it does not return _failure_) and
     serializing a URL will turn it into an absolute URL.[[

PROPOSAL:

* In the examples and informative statements, replace the references
  to '(in)valid URL' with references to 'failure' to parse URL.
* E.g. instead of this:
  //Tries to repair errors where the longdesc isn't a URI
  Say this:
  //Tries to repair where URL parsing of longdesc leads to failure
* And e.g. instead of 
  (validURL(i.longdesc)) { //assumes some URL validating function
   Say this:
  (failureURL(i.longdesc)) { //some URL parsing failure function

JUSTIFICATIONS ETC:

* A JavaScript could probably pre-check (without following it) the
  url for failure, and thus prepare it for repair.
* Note that the URL standard, for the @href attribute, requires the
  user agent to 'try more', even beyond failure.[4] And thus,
  compared with @href, the 'read as text' level would be
  lower.
* More aggressive fixing should be left to user’s configuration
  of preferences etc.

Personally, I think that the requirement that URL parsing must end in failure is such a high bar that the risk of losing the URL becomes very low. Probably so low that much content that is text would pass/parse (sic) as URLs.
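
A minimal sketch of what such a `failureURL` check could look like, assuming the standard `URL` constructor (available in browsers and Node.js), which throws a `TypeError` when parsing returns failure. The function name comes from the proposal above; the implementation and the sample values are assumptions made for illustration.

```javascript
// Returns true only when URL parsing of `value` (resolved against `base`,
// if one is given) ends in failure.
function failureURL(value, base) {
  try {
    new URL(value, base);
    return false;
  } catch (e) {
    return true;
  }
}

const base = 'http://example.org/page.html';
failureURL('foo bar.html', base);               // false: the space is repaired
failureURL('A chart of sales by region', base); // false: parses as a relative URL
failureURL('http://exa mple.org/', base);       // true: space inside the host
```

As the second call shows, plain descriptive text resolved against a base URL usually parses without failure, which illustrates how high the bar is.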

[1] http://www.w3.org/html/wg/drafts/html/master/infrastructure.html#resolving-urls
[2] http://url.spec.whatwg.org/#concept-parsed-url
[3] http://url.spec.whatwg.org/#url-code-points
[4] http://url.spec.whatwg.org/#dom-url-href

Comment 12 Charles McCathieNevile 2013-05-09 17:54:18 UTC

Spaces are not valid in URLs.