6774 – <mark> element: restrict insertion by other servers

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: <mark> element: restrict insertion by other servers

Status:	VERIFIED DUPLICATE of bug 6606

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML5 spec (editor: Ian Hickson)
Version:	unspecified
Hardware:	All All

Importance:	P2 major
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:	http://www.w3.org/TR/html5/single-page/
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-04-05 02:16 UTC by Nick Levinson
Modified:	2010-10-22 10:20 UTC
CC List:	5 users

See Also:

Attachments

Description Nick Levinson 2009-04-05 02:16:34 UTC

I understand the mark element is intended to be under a website owner's control, and that I misunderstood the draft standard as proposing that a server not under the website owner's control could insert the element in response to a user's apparent interests.

That's good. The HTML5 standard should generally conserve the website owner's rights. Otherwise, the mark element could allow security breaches.

However, the HTML5 draft standard's language seems to be somewhat ambiguous. Section 4.6.7 says, "When used in the main prose of a document, it [the mark element] indicates a part of the document that has been highlighted due to its likely relevance to the user's current activity." That suggests knowledge of the user's current activity unknown to the page author. Either the author has to anticipate a number of uses and insert mark elements for all of them or someone else is to insert a mark element.

The examples in the draft are two that are perfectly safe and one that's dangerous.

The safe ones are presentational. If quoting a source and wishing to add a quoter's indication of relevance other than by the traditional means (found in books, legal briefs, etc.) of adding italicization, the mark element is fine. If, when offering one's own prose, one wants a method to supplement the strong and em elements, mark is fine. Both are safe because they're in the control of the page author. Even when quoting, the page author or the website owner is at least vaguely known, and the mark is reasonably attributable to them and reasonably likely to be attributed to them even by nonexperts.

The main problem is with third-party insertion by an unidentified party, viewed by a user with only basic computer skills who naively and wrongly assumes the website owner produced the resulting markup. That user won't even know about the mark element or how to access source code, and may not be allowed to access source code because browser commands are dimmed by an institution.

Example: Someone runs a small business and thinks the managers are getting hung up on legalities. So they set their browser or server to apply the mark element to copyright notices, terms of use, and other bothersome stuff, and they style the mark element to be in a one-point font in white text on a white background. If anyone asks, that's just the way the website is. If the staff call the website company and debate whether the website does or does not say "x", the staff will be wrong but never know it, and if the staff are lawyers or IT managers, for example, they may commit major violations of contract or other law and never know why.

Even highly skilled computer users review source code on no more than one percent of all the pages they rely on, and that won't change once third-party insertions begin to change the look of websites.

The search engine problem is a good one. I often run a search, get a result, see the snippet, go to the page, and wonder where on the page my search terms are. A browser's Find function can be inefficient. However, I would prefer if browsers would offer a feature whereby search terms can be copied from a search engine URL and then the page auto-scrolled to their location. This could be a UA-specific implementation that could be based on agreements with search engine firms. It does not need a W3C or HTML standard. For example, the Opera 9.52 browser has a search facility that allows me to execute a search using Google, Ask, Yahoo, Amazon, Wikipedia, eBay, Yahoo Shopping, or BitTorrent, and presumably terms can be retained long enough to support a user-dismissable Opera frame pointing to their occurrence or to feed Opera's Find function.

I assume <mark> can be styled with anything available in CSS for other text elements, such as <a>. I haven't even considered the extent to which stylistic creativity can change the meaning of all sorts of marked content.

This only considers the host's and user's servers. It doesn't consider servers in between. The standard should not give permission to third-party server owners to insert <mark> and style it as they wish. The draft, as it stands now, would.

An earlier commenter elsewhere (Lachlan Hunt), in response to my concern that "[i]f the element is intended to be introduceable by servers other than the website owner's, then that should be preventable", said "No, this is a misunderstanding of the mark element's purpose. If a 3rd party server can inject markup into another site's content, then that's a major security problem, but it is independent from HTML itself. It is also not how the mark element is intended to be used." I replied: "HTML5's role in a security breach would come if it grants permission to system designers, as I saw in this statement: 'Another example of the mark element is highlighting parts of a document that are matching some search string. If someone looked at a document, and the server knew that the user was searching for the word "kitten", then the server might return the document with one paragraph modified as follows: . . . . <mark>kitten</mark> . . . .' Section 4.6.7. That looks like permission for the server to interject markup into a byte stream. Given that many people in large organizations view outside websites in a way that involves at least two servers per visit, one hosting and others not, the section seems to be permission for any nonhost server to sell advertising or comment on content as if it's the author's commentary. Thus, the security breach would be furthered by HTML as permission. However, as I didn't find any reference in the document to any server that wasn't acting on a served document somehow as authorized, e.g., by checking a certificate, if you're right that the intent was not as I feared, then we should propose rewording the HTML standard before finalization so only the site owner's server might mark the string if nonowners are to be conformant. . . . I'm not an attorney and laws vary by nation and circumstance, but if you believe there's any error in the above please let us know."

Could you please tighten the language to leave the mark element's use in the hands of the page author?

If that can't be done, can restrictions to prevent security breaches be written in? The problem with that, of course, is malicious attackers.

At the least, allow page authors to block insertion of a mark element not already in the source code. For example, a meta element in the head element might be a preventive for a page. Example:

The True value would be available but trivial, as omitting the meta element would also imply True. Yes/No would be clearer but inconsistent with practice with other meta elements.

A narrower problem is whether a website that supports internal searches might want to allow their own host to insert the mark element in response to a local user's search executed locally even though the addmark is turned off for everywhere else. This potentially applies to any CGI script and perhaps other locally-applied technology. To solve that, a second meta attribute could serve. Example:

The website designer could decide who qualifies as local and how to implement that decision technically, and could use the attribute to prevent anyone deemed nonlocal from marking content on that page. The False value would be available but trivial, as omitting the addmarklocal attribute would also imply its falsity when addmark is False. The order of the attributes should not matter. Placing the attributes in one or two separate meta tags should not matter.
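
The rules described above for addmark and addmarklocal can be sketched as a small check that a cooperating server or user agent might run before inserting a mark element. This is only an illustration in Python: the attribute-on-meta syntax (for instance <meta addmark="False" addmarklocal="True">) is an assumption, and may_insert_mark and is_local are hypothetical names; only the keyword names addmark and addmarklocal and their default values come from the proposal.

```python
from html.parser import HTMLParser


class MarkPolicyParser(HTMLParser):
    """Collects the proposed addmark / addmarklocal settings from meta elements.

    Assumption: the settings are written as attributes on <meta>.
    """

    def __init__(self):
        super().__init__()
        self.addmark = True        # omitting addmark implies True
        self.addmarklocal = False  # omitting addmarklocal implies False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute order, and whether one or two meta tags are used,
        # makes no difference: every meta element is inspected.
        for name, value in attrs:
            flag = (value or "").strip().lower() == "true"
            if name == "addmark":
                self.addmark = flag
            elif name == "addmarklocal":
                self.addmarklocal = flag


def may_insert_mark(page_html, is_local):
    """Return True if a party may add mark elements to this page.

    is_local is whatever the website designer decides counts as local.
    """
    parser = MarkPolicyParser()
    parser.feed(page_html)
    if parser.addmark:
        return True
    return is_local and parser.addmarklocal
```

Such a check would bind only parties that choose to honor it.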

The earlier discussion is in Bug 6606. This responds to <http://www.w3.org/TR/html5/single-page/>, accessed 4-4-09, Working Draft, 12 February 2009 (presumably <http://www.w3.org/TR/2009/WD-html5-20090212/>), section 4.6.7. I'll await possible comment here before considering whether to propose the meta attributes in the appropriate Wiki.

Thank you.

--
Nick

Comment 1 Nick Levinson 2009-04-20 09:14:13 UTC

I've added addmark and addmarklocal to http://wiki.whatwg.org/wiki/MetaExtensions 4-20-09.

Thank you.

-- 
Nick

Comment 2 Lachlan Hunt 2009-04-20 10:23:29 UTC

(In reply to comment #0)
> ... I replied: "HTML5's role in a security breach would come if it grants
> permission to system designers, as I saw in this statement: 'Another example
> of the mark element is highlighting parts of a document that are matching
> some search string. If someone looked at a document, and the server knew that
> the user was searching for the word "kitten", then the server might return
> the document with one paragraph modified as follows: . . .
> . <mark>kitten</mark> . . . .' Section 4.6.7. That looks like permission for
> the server to interject markup into a byte stream.

I really do not understand the source of your confusion, but HTML5 certainly does not give permission for any kind of security breach like you describe.  The technique that the spec is discussing is something that people have already implemented on their own servers, often using elements like <span> or <b>.

See, for example, this article that discusses how to obtain search terms from the HTTP Referer header and dynamically modify the page using some PHP.  This is entirely under the control of the site's developers.  There is no unauthorised access by 3rd parties.

http://www.alistapart.com/articles/searchhighlight
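
As a rough illustration of that technique, here is a minimal sketch in Python (the article itself uses PHP, and this is not its code). It assumes the search engine puts the query in a "q" URL parameter and that the site is marking up a plain-text fragment before sending it; search_terms_from_referer and mark_terms are hypothetical names.

```python
import html
import re
from urllib.parse import parse_qs, urlparse


def search_terms_from_referer(referer, param="q"):
    """Extract search words from a Referer URL such as
    http://search.example/?q=kitten+care (the parameter name is an assumption)."""
    query = parse_qs(urlparse(referer).query)
    terms = []
    for value in query.get(param, []):
        terms.extend(value.split())
    return terms


def mark_terms(text, terms):
    """Escape a plain-text fragment for HTML and wrap each
    case-insensitive match of a search term in <mark>."""
    if not terms:
        return html.escape(text)
    pattern = re.compile(
        "(" + "|".join(re.escape(t) for t in terms) + ")", re.IGNORECASE
    )
    # With one capturing group, re.split alternates non-match, match, ...
    pieces = pattern.split(text)
    out = []
    for i, piece in enumerate(pieces):
        escaped = html.escape(piece)
        out.append("<mark>" + escaped + "</mark>" if i % 2 == 1 else escaped)
    return "".join(out)
```

The code runs on the site's own server, on the site's own content, which is the point being made here.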

Besides, if a 3rd party could inject markup into a site, there are bigger problems than just being able to insert the <mark> element, like the insertion of <script> elements that many attackers already do today.

Comment 3 Nick Levinson 2009-04-24 08:10:06 UTC

Crackers often don't care about permission, but major-brand browser makers do. They'll need to offer HTML5 compliance for their business not to suffer, and that will legally be defined as general compliance with the W3C standard. Warranties, at least in the U.S., that apply to buying a computer or a lot of them include approximate compliance with the standards when the browser manufacturers offer HTML ability. Standards thus matter in law. If the standard grants permission, the major manufacturers can exploit it and may profit from it (the objection is not to profit per se but profit is a motivator likely to increase the volume of tagging). Crackers do plenty without permission, but what major-brand suppliers could also do with permission via their widely-distributed products would add massive quantities of exploitation.

Present practices in scripting are often not violations at all. Neither are the other tags that you mentioned.

The proposed standards for the <mark> element and for the <span>, <b>, and other similar elements differ in one critical aspect. None of the latter elements would depend on a user's current activity. Only the mark element would depend on that knowledge.

From the draft standards: "The span element doesn't mean anything on its own, but can be useful when used together with other attributes, e.g. class, lang, or dir." W3C Working Draft 23 April 2009, Section 4.6.18. "The b element represents a span of text to be stylistically offset from the normal prose without conveying any extra importance, such as key words in a document abstract, product names in a review, or other spans of text whose typical typographic presentation is boldened." Id., sec. 4.6.20. "The i element represents a span of text in an alternate voice or mood, or otherwise offset from the normal prose, such as a taxonomic designation, a technical term, an idiomatic phrase from another language, a thought, a ship name, or some other prose whose typical typographic presentation is italicized." Id., sec. 4.6.19. "The strong element represents strong importance for its contents." Id., sec. 4.6.5. "The em element represents stress emphasis of its contents." Id., sec. 4.6.4.

By contrast: "When used in the main prose of a document, it ["[t]he mark element"] indicates a part of the document that has been highlighted due to its likely relevance to the user's current activity." Id., sec. 4.6.7.

That requires knowing "the user's current activity."

Knowing could be by the method the article you referenced described, namely, if the user arrives at a page via a search engine the HTTP referer could be used in generating a page with any HTML content desired. That method is safe because, in order to apply it, the website owner's permission is needed in order to modify or generate a page according to the referer. No objection here.

Scripts in general cannot be on a page unless the website owner approves. As a website owner I don't object to scripts being in the standards. Some scripts are useful. If I don't want one, I don't put any on a page. And, so far, I don't, because I want my content to be just as useful to people who set their browsers to block all scripts. The same need for a site owner's permission is true of PHP and Perl. I'm planning to add Perl for a security purpose. Neither of those languages is being added without my consent.

But <mark> wouldn't limit how "the user's current activity" is to be known. It could be with permission, but it could be without permission. Nowadays, that's considered insecurity. Under HTML5, the definition of the mark element would give permission to know, and permission to apply the knowledge. The permission would not be because the mark element has to be present in a page's markup (it could be absent) but because HTML5 would specify that <mark> is for "the user's current activity." That purpose, because it is stated in HTML5, would be permission for adding <mark> to arbitrary locations within a page. The permission would not be limited to the owner or even the recipient of a page but would be permission for anyone (and the recipient should not have that permission unless they are aware that they, the recipient among typical amateur recipients, are doing it, and are not under the impression that the owner is doing it). Because it would be permission for anyone, injection of markup by arbitrary parties along the route of transmission would be permitted by HTML5. That would be a breach caused by the standards.

HTML5 should not be consent to inject anything by anyone without the owner's consent. It is. It should not be.

If the <mark> element is taken away, the span element appears quite sufficient for all purposes that the mark tag would otherwise serve legitimately. The span element can, for example, have the class attribute. Therefore, the <mark> element should be removed from HTML5 altogether or it should be redefined to be just a presentational alternative to <b>, <i>, <strong>, and <em> without knowledge of a "user's current activity."

If there's still some need for <mark> as presently defined, then the ability to defeat intermediary interjection is a necessity, and for that the addmark meta keyword is a solution. Creating addmark as an attribute directly in the mark element won't suffice while the standard would allow the mark element to be added by someone else, who presumably would never use addmark, never asking the site owner for an opinion. But a meta element can serve as a per-page block, albeit requiring more coding just to opt out from undesired effects.

Right now, getting rid of <mark> completely is the ideal solution. A good alternative is to redefine it to remove anything like "the user's current activity" from possible relevance. The only other good alternative is a metatag with addmark. One of these choices is needed.

Thank you.

--
Nick

Comment 4 Lachlan Hunt 2009-04-24 13:06:34 UTC

I really do not understand where your confusion is coming from, but you seem to be having difficulty comprehending what the spec actually means.  I'm not really sure how I can clarify the issue further to help you understand why HTML5 is not granting any permission as you have tried to describe and why there is no security issue here.

What kind of activities do you think the phrase "the user's current activity" is referring to?  It's actually meant to refer to activities the user is performing on the site or in the web application, i.e. things that the site is aware of.  It *does not* refer to activities that the site cannot possibly know.

For example, a user of an online calendar application may be checking for upcoming parties in a particular month that meet some search criteria.  In this case, the user's activity is searching and filtering calendar events (something the web application will know based on the user's actions), and the application may mark the relevant events.

Comment 5 Nick Levinson 2009-04-25 06:41:43 UTC

The calendar example is fine. It doesn't worry me and that's for exactly the reason you give.

The standard, however, goes farther in its words. What might have been agreed upon verbally in meetings won't matter once the standard is promulgated as reliable for designers of user agents. What the standard says is what matters. (An analogy in U.S. law is that the plain words of a statute are to be applied in disposing of a case on its facts and only if the plain words do not provide necessary guidance may legislative intent be examined, so only then may legislative committee reports and pre-enactment floor debates be considered, which means that the original sponsors' hopes and expectations are irrelevant until the text is found to be unclear in context. That usually means that drafters' intentions never get considered and only the official words matter.)

The words of section 4.6.7: "Another example of the mark element is highlighting parts of a document that are matching some search string. If someone looked at a document, and the server knew that the user was searching for the word 'kitten', then the server might return the document with one paragraph modified as follows: . . . . ." Insofar as the search string is only from a search created as part of a website's internal search function, including an on-site search box supplied by Yahoo or Google, then your interpretation that 4.6.7 is safe and does not provide 3d parties with adverse permission is valid.

But 4.6.7 just talks of "some" search string, i.e., pretty much any search string, and so there's no limitation that the search string must have been created at the website owner's website. It could have been created at an external search engine or anywhere else before the URL arrives at a destination website for page retrieval. It could have been created without the user realizing it. Most users have no idea what a search string is.

And "matching some search string" from unlimited points of creation is only "[a]nother example" of mark tagging. It is not the limit. The limit is defined earlier in the same section: "The mark element represents a run of text in one document marked or highlighted for reference purposes, due to its relevance in _another context_. . . . When used in the main prose of a document, it indicates a part of the document that has been highlighted due to its _likely relevance to the user's current activity_." (emphases added). There is no limit that means <mark> can be based only on activity within the same website. "[A]nother context" is any other context. The "user's current activity" has to be known or the provision is meaningless, and the standard presumption is that every provision has meaning until shown otherwise. So the "user's current activity" has to be known in "another context", wherever that may be.

As to what kind of nefarious use third-party modification would support, injection of advertising is the likeliest to be common, with the ads being not very distinct, so users think they're supplied by the website. A local, professionally-written, newspaper article the other day reported that 5 of 12 results on the first page of Google results shared a certain characteristic; the problem is that Google doesn't put 12 results on a page, they put 10 and maybe 2 ads (you can get 12 if you opt for 20 or more results but then you wouldn't have a "first" page, you'd have only one page, so that's not what happened). So the reporter didn't know the difference between a result and an ad, even though the engines label ads. Some days I have to angle my head while using an LCD with Yahoo/Google search results to figure out whether a result is really an ad, because the color differentiation has gotten fairly subtle. And I know this stuff. Most users don't.

If you are right about the drafters' intent, they need to tighten their wording in the standard. In that case, I'm not sure what <mark> would do that <span> won't. So it appears that <mark> exists specifically for third-party use, which means permission for third parties is part of the intent as well as implicit in the wording. If it's not meant to be, rewriting is required and I favor it.

Without a third-party role, <mark> seems largely presentational. If <mark> is meant only to be more easily recognized as presentational than <span>, the standard can be rewritten to say that.

Thank you.

--
Nick

Comment 6 Nick Levinson 2009-06-08 04:36:17 UTC

We should add the risks of libel, defamation of character, interference with advantageous business relationships, and other legal liabilities arising because of third-party content mistakenly attributed to the website owner, who may not have any knowledge of what is appearing in a user's browser.

This could offend and lose visitors; it could also lead to a lawsuit. Both would be misdirected, but the website owner might have to prove to a court that they didn't have the content complained of. Since the owner can't see the user's browser and since the browser's additions will likely be intermittent or on irregular rotation, the website owner may be reduced to speculating on where the content appeared, a court may not buy speculation, and the court may order the website owner to come in to testify as to what content appeared on certain days. The plaintiff will swear it was there and the defendant will deny it. I, as a website owner, don't want to be in the defendant's chair.

For you to sue whoever ran the third-party message requires your figuring out who ran it, and legally Microsoft may be off the hook if it didn't go through them.

Even without a lawsuit, you may have to explain your way out of the consequences of something you never saw and can't find. Your general disclaimer may not be enough to avoid losing some of your visitors or losing a lawsuit for monetary damages.

Since <span> can do the legitimate functions of <mark>, <mark> should be dropped or controlled with meta elements as proposed. I suggest dropping <mark> as the simpler solution.

Thanks.

--
Nick

Comment 7 Kia Kroas 2009-06-19 13:44:51 UTC

I believe I see the source of the confusion.

Take this example: I am a blogger and read something on example.com. I like it and would like to comment on it, but the original document is 500 pages long. Therefore, I can only take a snippet of it.

Without the full 500 pages of the original context, the readers of my blog do not know the main points of the document. My abstract/summary of the document would have to emphasize (what I believe are) the main points the author wanted to reach out. Currently, such emphasis is created through the various font styling elements such as <b>, <i>, <strong>, <em>, <span> ... etc or combinations of them. 

To be clear, the emphasis would only be on my blog. There is no way the <mark> element can be used to tag or vandalize someone else's content through my server. (HTML is only markup. It's not some magical scripting.)

The purpose of <mark> (to the best of my knowledge) is to add semantic reference for the browser or whatever parser is analyzing the page. As Lachlan Hunt points out, emphasis elements are already widely used. And as you noted, the <span> element is already used for these purposes. Consider <mark> as an extended, special-usage <span>.

Comment 8 Nick Levinson 2009-06-21 19:59:34 UTC

Your blog example is fine. So are some of the intentions behind mark, so far as discussed. So is the desire for a tag that has a clearer name than <span>. So is the desire not to use <b>, <i>, etc. in mixed ways in your pages.

It's possible to have both classless and classed elements properly styled; I did it in IE5.5 (b & b.test with different colors), so I assume it can be done with other elements and browsers, but I understand the convenience of having a separate element for certain purposes rather than classes for common elements, and mark is fine for that.

The blog example is fine because, from the viewpoint of page authoring, you're creating an original HTML page whether you invented every word or quoted Jeanne d'Arc, and so you could be inserting markup anyway. (What's verbatim in various emphases is not an HTML issue.) You, as the blog owner, would essentially be adding markup to your own page. That's fine. Using mark to help you is fine.

The problem is shown in the kitten example in the HTML5 draft, section 4.6.7. It says, "Another example of the mark element is highlighting parts of a document that are matching some search string. If someone looked at a document, and the server knew that the user was searching for the word 'kitten', then the server might return the document with one paragraph modified as follows: [example includes <mark>kitten</mark> twice in running text]."

Search strings come from two places: inside and outside a website. Websites that offer their own search functions produce internal search strings, and applying that search string to the text in a document on that same website is essentially doing as the website owner intended, the whole process being within the same website. That's fine.

But external search strings present a problem. A search in Bing or another search engine produces a search string that matches a string in a page. HTML5 proposes that the search string be used to highlight a matching string on a page by adding a mark element. The HTML5 proposal applies to a search string whether it is internal or external to the host server. Under HTML5, any server may apply the mark tag, not just the website owner's hosting server.

The reason the external search string is a problem is that its acceptability for determining a user's interest in kittens or anything else means that any external string can justify a server adding the mark element to an internal string. There's no way to do that except by servers not under the control of the website owner. HTML5 would give permission to do that. So Microsoft's Bing search engine could return a list of links and then through frames (as some major search engines do now in some modes) present the actual pages with mark tags added. Granted they could add markup anyway; they already do sometimes, but so far, to my knowledge, they state an explanation at the top, putting even newbie users on notice without requiring them to remember the meaning of double-underlines or other abstract symbols. And they do it by copying the page, reformatting it to their style, and then marking it up with an explanation. For that, they don't need HTML permission. HTML5 would add permission to all server owners, and, given Microsoft's past history with their browsers, someone big's likely to take advantage. And not just someone big.

Intentions, I've been assured, wouldn't allow this. Our good intentions would suffice except for one thing: Browser designers are not going to ask us or look at the bug report, and they don't have to. The specification will be all they'll need. Students of programming will almost never see intentions. They'll see standards. If the plain words of the standards grant permission, the behind-the-scenes discussion about what was really meant will be out of sight and ignored.

So far, no one's justified third-party markup as if on the original, other than to say it's being done already, and I don't think they meant that it's a good idea. Some who disagree on that point agree that users should be able to tell when markup was added by someone else besides the original site owner. If a newbie can tell the difference in ownership, I'm willing to accept third-party markup at the user's display. So those of us in this discussion essentially agree that third-party markup as if it's in the original should not be allowed. We've differed on whether the standard allows it, not on whether it should. And that's solvable.

I think all that's needed is to change "server" to "website owner's server" and permission for other servers to add markup would be gone.

Another solution would be to define "server" in a more global context so that in this section it could only mean the website owner's server. I didn't see that.

Thanks.

--
Nick

Comment 9 Ian 'Hixie' Hickson 2009-06-28 10:28:12 UTC

How are third party servers going to add <mark> elements anyway? Even if the spec gave them permission, how could they do it?

It's not clear to me why the clarification you are requesting is necessary.

Comment 10 Nick Levinson 2009-06-29 02:56:49 UTC

There's no risk to the website owner's server. The risk is to a server at the user's end that is not under the end user's control, or to the user's terminal when the user doesn't understand system details and simply is looking at a website.

Cache technology allows storing an incoming page for a second, long enough for the browser or receiving server to add the markup, and a browser may support multiple caches simultaneously (I understand some do). Proxy technology, even if no proxy today allows writing but only reading, could have writing added, and a proxy does not require a separate server if not meant for security but only for modifying inbound files. While the methods can be used for any purpose, many such purposes are illegal. This use would be legal.

Institutions could also use cache or proxy technology to render undesirable text virtually invisible to employees and to members of the public using their public terminals. A user could use the source command to see it in raw form but almost no one ever will, almost none even remember such a command, and some institutions disable some of those commands for security and support reasons.

A public library near me already removed ads from Yahoo's email inbox page. The library staff told me they do that because users get confused. I saw the inbox pages. I saw page-editing can be done at a receiving server without slowing the downloading time at all noticeably. The particular method was to block page element replacements that came from non-Yahoo domains, which is arguably a crude way to edit, but other kinds of edits appear well within technological reach at the page-receiving end of a download.
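
How the library actually did this was not explained to me, so the following is only a sketch, in Python, of what a domain-based filter at the receiving end could look like: it keeps page resources whose host belongs to an allowed domain and drops the rest. The function names and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(resource_url, allowed_domain):
    """True if the resource's host is the allowed domain or a subdomain of it."""
    host = (urlparse(resource_url).hostname or "").lower()
    allowed = allowed_domain.lower()
    return host == allowed or host.endswith("." + allowed)


def filter_resources(urls, allowed_domain):
    """Keep only the resources a filtering proxy would let through."""
    return [u for u in urls if is_allowed(u, allowed_domain)]


if __name__ == "__main__":
    page_resources = [
        "http://mail.yahoo.com/inbox.css",
        "http://ads.example.net/banner.gif",
    ]
    print(filter_resources(page_resources, "yahoo.com"))
```

Nothing in such a filter depends on the markup the originating server sends.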

I'm told Firefox already offers bad-word replacement. I assume that's via an extension or some such addition. If that's under a user's control, fine. If the user at least knows the page is being edited, that may be acceptable. But if Firefox does it, the technical means are already in use.

In Yahoo's search engine, I used a view-as-HTML link to view a document to which highlighting for my keywords was probably added in less than a second. While the URL of the HTML-equivalent document differed from the original doc's URL, that's probably for business and legal reasons, not because the technology made it impossible to claim the URL was the same, as retrieving a file through a cache or a proxy doesn't show the URL of the cache or the proxy but of the remote origin.

MS's SmartTags, as far as I know, were not much criticized for slowing page display, if at all. They had the technology and were fast enough.

The receiving server or terminal is where this can happen, and for that the techniques are feasible. And the subtlety is easy enough, too, so that most users don't know the difference between original and received by looking at it.

Thanks.

-- 
Nick

Comment 11 Ian 'Hixie' Hickson 2009-06-29 04:55:58 UTC

There is no possible way to prevent a user agent running on the user's behalf from doing whatever the user wants it to do, including adding annotations.

If the user installs software without knowing what that software does, there is nothing we can do about that. The software can't know whether the user knows about what it does or not.


> A public library near me already removed ads from Yahoo's email inbox page. The
> library staff told me they do that because users get confused. I saw the inbox
> pages. I saw page-editing can be done at a receiving server without slowing the
> downloading time at all noticeably. The particular method was to block page
> element replacements that came from non-Yahoo domains, which is arguably a
> crude way to edit, but other kinds of edits appear well within technological
> reach at the page-receiving end of a download.

There is no way to stop this.


> In Yahoo's search engine, I used a view-as-HTML link to view a document to
> which highlighting for my keywords was probably added in less than a second.
> While the URL of the HTML-equivalent document differed from the original doc's
> URL, that's probably for business and legal reasons, not because the technology
> made it impossible to claim the URL was the same, as retrieving a file through
> a cache or a proxy doesn't show the URL of the cache or the proxy but of the
> remote origin.

It is in fact for technical reasons; there is no way technically for Yahoo! to affect the contents of a page at a URL it does not control.

Only the user agent's own software, or software on the user's network (e.g. the library proxy server) can change the page, and those tools can change the page regardless of what we put in the HTML page, there is no way for the originating server to stop this (and nor should there be, since that would mean that it would prevent users from doing what they wanted to the page).

Comment 12 Nick Levinson 2009-06-29 08:59:51 UTC

There's little or no technological control, but legal permission can be denied, and, for the most part, HTML standards already deny similar permissions to users. The public library, for example, is unlikely to do what it believes to be illegal. Yahoo could have pulled the doc into a cache, written to it, and passed it to the user; the user would have known the URL not of the cache itself but of the point of origin, just as happens now when a user views a doc from a multi-day cache set up to save bandwidth and doesn't know about the cache because they typed the address into the address bar themselves and think the page came immediately from that address. Yahoo doesn't do that with the view-as-HTML files, but not for want of technology. They'd run into legal problems, viz., copyright, if they didn't attribute their changes to themselves and what wasn't changed to the original publishers.

A user may opt to annotate, of course. But when they don't know they are and sophistication is required to know that their computer is applying a tag, we're asking too much of most users. The popularity and utility of the Internet depends on widespread acceptance, which means inevitably most users won't be that sophisticated about computers. Usability is the issue here, and that's already part of HTML authoring. Accounting for users' understandings is reflected, for example, in standardizing link colors and link underlining. Many sites discard those standards, which is their right to do, but because they do the result is that viewers' understanding of links is even more tenuous, and they assume all the links belong to the site owner, incorrectly.

We're not responsible for educating consumers to that degree. But inserting a tool by which a third party can make a user's perception of what they see that much more fragile goes beyond providing a language by which site designers can offer their content and people can read it (and, selectively, interact via forms, scripts, etc., consistent with site owners' intentions (third-party markup of forms and scripts might be something else to think about, too)).

"(. . . [N]or should there be [a "way for the originating server to stop" "software on the user's network [from] . . . chang[ing] . . . the page"], since that would mean that it would prevent users from doing what they wanted to the page)." (Hixie, supra.)

They have to be limited. Many companies have customer contracts online, and I'm pretty sure they don't want users changing them without permission, e.g., by restyling them into white fonts on white backgrounds. If an online store sells an item for $123, they don't want a user's employer restyling the rightmost digit of every item price into invisibility so the unaware user proves, through, say, a screen photograph, that the item is only $12. Some courts publish case decisions and legal forms online, and doubtless want them left unchanged. A site owner cannot give up all control over their site or the Internet will be less useful to them in making offers and concluding business. If the user wants to make changes, they have to be responsible for them, and that's denied if anyone else can intervene without the end user's knowledge.

Thanks.

--
Nick

Comment 13 Ian 'Hixie' Hickson 2009-06-29 09:46:09 UTC

> There's little or no technological control, but legal permission can be denied,
> and, for the most part, HTML standards already deny similar permissions to
> users. The public library, for example, is unlikely to do what it believes to
> be illegal.

This isn't the right forum to discuss legalities; I encourage you to take this up with your legislature.


> Yahoo could have pulled the doc into a cache, written to it, and
> passed it to the user; the user would have known the URL not of the cache
> itself but of the point of origin, just as happens now when a user views a doc
> from a multi-day cache set up to save bandwidth and doesn't know about the
> cache because they typed the address into the address bar themselves and think
> the page came immediately from that address. Yahoo doesn't do that with the
> view-as-HTML files, but not for want of technology.

Actually, it _is_ for want of technology. There is no technical way to do what you describe. If there were, the Web would not be sustainable, as criminals could steal everyone's bank details overnight.


> We're not responsible for educating consumers to that degree. But inserting a
> tool by which a third party can make a user's perception of what they see that
> much more fragile goes beyond providing a language by which site designers can
> offer their content and people can read it (and, selectively, interact via
> forms, scripts, etc., consistent with site owners' intentions (third-party
> markup of forms and scripts might be something else to think about, too)).

I have no idea what you are saying here, sorry.


> "(. . . [N]or should there be [a "way for the originating server to stop"
> "software on the user's network [from] . . . chang[ing] . . . the page"], since
> that would mean that it would prevent users from doing what they wanted to the
> page)." (Hixie, supra.)
> 
> They have to be limited.

No, sorry, we are not limiting what users can do. That is counter to the entire philosophy of the Web. The user must be able to have ultimate control over the content he downloads.


> Many companies have customer contracts online, and I'm
> pretty sure they don't want users changing them without permission, e.g., by
> restyling them into white fonts on white backgrounds.

You can do that today in a multitude of ways (e.g. using Firebug, using user style sheets, using Opera's cache editor, using a rewriting proxy, using the user preference for colours, etc).

It is absolutely imperative and core to the whole principle of the Web and HTML in particular that users be able to adapt the pages they use in ways that work for them. Blind users listen to pages using speech synthesis, users with poor eyesight make the pages black-and-white only with large fonts, different computers use different fonts to view the Web, users of mobile phones shrink pages to fit their small screens, etc.


> If an online store sells
> an item for $123, they don't want a user's employer restyling the rightmost
> digit of every item price into invisibility so the unaware user proves,
> through, say, a screen photograph, that the item is only $12.

They might not want this, but that's their problem. It's possible today and HTML5 will not change this.


> Some courts
> publish case decisions and legal forms online, and doubtless want them left
> unchanged. A site owner cannot give up all control over their site or the
> Internet will be less useful to them in making offers and concluding business.

Site owners do not have control over their sites today. They don't have to give any control up, because they don't have it in the first place.


> If the user wants to make changes, they have to be responsible for them, and
> that's denied if anyone else can intervene without the end user's knowledge.

Welcome to the Web. This is how it is. Your network administrator can change what you see unless you use TLS (end-to-end encryption), and the administrator of your local machine can change anything at all regardless.

Comment 14 Nick Levinson 2009-06-30 07:31:13 UTC

That the Internet is outside of law has always been a myth. E.g., you're not free to copy from Wikipedia without limit. Most of its text is protected by licenses that reserve rights (<http://en.wikipedia.org/wiki/Wikipedia:Copyrights>). Most or all open-source licenses reserve rights. Moreover, many websites impose terms for use of their sites. The legal ground generally is that the site is a chattel and use that violates the terms is a trespass on the chattel. E.g. (without judging legal quality of notice): Google (<http://www.google.com/accounts/TOS>, e.g., sections 2.1, 5.5, & 8.2 (& 8.1)) forbids "modify[ing]" Google's content. Apple (<http://www.apple.com/legal/terms/site.html>) says ". . . make no modifications to any such information . . . .".

Your comment that "Site owners do not have control over their sites today. They don't have to give any control up, because they don't have it in the first place." is only true technologically and only in part. Owners are legally responsible for content (as for libel) and rightly so since they have major technological control with which to meet their legal obligations. I think you overreached on a few points and this is one.

Besides that no user or creator can have ultimate or complete control of any content, the issue here is third-party control. Firebug, being with Firefox, and Opera are presumptively under the user's or site creator's control. Users presumptively have control over user style sheets and users' color settings, B&W, TTS, platform choice, and platform-specific fonts, and these being in those hands is good. While many of these can be misused by third parties, HTML 4.01 does not legally support that but v5 will. Thus the relevance of law in combination with technology. Law can't be escaped.

If anyone wants new legislation from a legislature, they likely would be computer and browser firms and retailers seeking exemption from existing law, and volume contracts make that unlikely. Given the laws already in place, for the specific issue at hand the proper venue is the W3C.

Of your tech points:

Yahoo can copy a doc from example.com into its own proxy or cache, store the example.com URL with the doc, and then when the doc is copied or moved from the cache or proxy Yahoo can report it as coming from the example.com URL. That's what proxied networks and caching browsers do now. If you visit un.org and try to visit it again an hour later, with normal settings your browser will retrieve from your cache but show un.org in your address bar. Nothing from the original URL gets into the cache or proxy without technical means to retrieve from the original URL. Copying from an original URL is not made easier by caching or proxying along the way. How would a cache or proxy, without more, permit anyone to copy everyone's bank details?

Thanks for mentioning a rewriting proxy. Since it can edit a URL, that supports my argument. Neither users nor site creators usually control proxies. A proxy that can substitute a URL by one algorithm can probably do so by another, becoming a third party's technical mechanism for retrieving from a URL other than what the user thinks, such as from a third party's cache where markup is applied.

I withdraw the angle on scripts and forms. I think the effect on them is essentially the same as on the rest and no more and no less an issue. I also withdraw the $123-to-$12 type of tactic because it's most likely to succeed as a DoS attack, itself probably illegal regardless of HTML.

Thanks.

--
Nick

Comment 15 Ian 'Hixie' Hickson 2009-06-30 07:32:38 UTC

I'm confused as to what you are proposing now. Could you briefly summarise what you think should change in the spec?

Comment 16 Nick Levinson 2009-07-01 08:44:36 UTC

I propose 2 alternatives for section 4.6.7:

-- Edit the example that depends on knowing the user's search string (the "kitten" explanation) to require that the markup be performed only inside the original page owner's server. Replace "and the server knew" with "and the original page owner's server knew". The next mention of "the server" should stay as it is, as it would inherit the constraint on the first mention. Also, allow per-page opt-in by the page author via a meta tag to permit anyone else to add mark tags anywhere else. Without an opt-in on a page, other servers thus could not legally add the markup to that page. Section 4.6.7 does not need more wording to accommodate this opt-in, because sec. 4.2.5 (meta) and the MetaExtensions Wiki will be sufficient, and I can review the Wiki for relevant names.

-- Or delete the example (the line explaining it and the "kitten" text).

Either of these will preserve mark as semantically more specific and thus often more useful than span, which remains as a general fallback tag. The definition and other examples for mark thus remain intact.
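
As an illustration of the per-page opt-in idea in the first alternative, here is a minimal sketch of how a user agent or intermediary might check a page for such a meta tag before adding mark elements. The meta name "third-party-mark" and the content value "allow" are hypothetical placeholders, not registered names and not part of any spec; the check would only signal permission and would not enforce anything.

```python
from html.parser import HTMLParser

# Hypothetical opt-in name and value; neither is defined by HTML5
# or registered in the MetaExtensions Wiki.
OPT_IN_NAME = "third-party-mark"
OPT_IN_VALUE = "allow"


class _MetaOptInParser(HTMLParser):
    """Records whether the page carries the hypothetical opt-in meta tag."""

    def __init__(self):
        super().__init__()
        self.opted_in = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        content = (attributes.get("content") or "").strip().lower()
        if name == OPT_IN_NAME and content == OPT_IN_VALUE:
            self.opted_in = True


def page_opts_in(html_text: str) -> bool:
    """Return True only if the page explicitly opts in; the default is no permission."""
    parser = _MetaOptInParser()
    parser.feed(html_text)
    parser.close()
    return parser.opted_in
```

Under these assumptions, a page without the tag yields False, so the absence of an opt-in would be read as permission withheld.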

The main explanatory paragraph for sec. 4.6.7, before the examples, may remain as it is, because "likely relevance" does not convey what "knew" does. "[L]ikely relevance" can be guessed before any specific user ever approaches a computer to access any page, with the guess applied by the original page author, whereas "[knowledge]" of a user's specific search must follow a search, which is not the page author's time frame. With the kitten case neutralized or deleted, the main paragraph won't have the adverse effect.

Thank you.

--
Nick

Comment 17 Nick Levinson 2009-07-15 10:08:36 UTC

A feature of the proposal is that internal searches would still be supported with the mark tag. If a website has its own search engine, the user's search string could be applied within the website to a script which generates or regenerates a page to highlight the words being searched for.

Because the search engine is internal, supporting the script will not require giving the rest of the world similar support.
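
As a rough sketch of the internal search case described above, the site's own script could regenerate the page text with the user's search string wrapped in mark elements. The function name `highlight` is an assumption made for illustration, and plain case-insensitive substring matching is only one possible approach, not what the draft requires.

```python
import html
import re


def highlight(text: str, query: str) -> str:
    """Escape text for HTML and wrap each case-insensitive match of query in <mark>."""
    if not query:
        return html.escape(text)
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the unmatched stretch, then wrap the matched stretch.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(highlight("I also have some kittens", "kitten"))
# prints: I also have some <mark>kitten</mark>s
```

Because this runs on the page owner's server, the mark elements arrive as part of the page the owner chose to send.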

If the search function appears to be internal but is actually external, viz., run from another server by a different owner, similar script support can be agreed on between the original site owner and the search engine owner by which the contracted search engine sends the search string to the original website, the owner of which may apply the string to support a script. The agreement could even allow the engine's firm to apply mark tags itself if the original site owner agrees.

Yahoo's, Google's, and kindred search engines -- even when the search field is on the original site's pages -- would not be able to do this without the original site owner's consent, if, as I understand their operation, the search field only sends the search string to the external engine which then operates like a standard search engine, because it does not let the original site copy the search string first and generate pages for the user. And I think Google, and probably Yahoo, avoid pages that are generated after a search.

Engines offering site-based search boxes with paid offsite processing might offer such realtime data-sharing so owners can generate pages responding to searches.

Thanks.

-- 
Nick

Comment 18 Ian 'Hixie' Hickson 2009-08-09 22:03:30 UTC

> -- Edit the example that depends on knowing the user's search string (the
> "kitten" explanation) to require that the markup be performed only inside the
> original page owner's server. Replace "and the server knew" with "and the
> original page owner's server knew". The next mention of "the server" should
> stay as it is, as it would inherit the constraint on the first mention.

I think it's clear that "the server" here is the server serving the page.


> allow per-page opt-in by the page author via a meta tag to permit anyone else
> to add mark tags anywhere else. Without an opt-in on a page, other servers thus
> could not legally add the markup to that page.

Other pages already can't add any markup to the page. Servers can't just arbitrarily affect each other.

Marking WONTFIX since I think the requested fix is unnecessary in context, and would in fact lead to readers wondering if another server could have been meant if the clause wasn't present (as it isn't in many other cases in the spec).

Comment 19 Nick Levinson 2009-08-12 15:35:07 UTC

Both are solvable.

> I think it's clear that "the server" here is the server serving the page.

> . . . I think the requested fix . . .
> would in fact lead to readers wondering if another server could have been meant
> if the clause wasn't present (as it isn't in many other cases in the spec).

It should be clear not only in this thread, but in the standard itself. It arguably is by implication, but that gets tricky. One solution is a general definition in a glossary, such as: "Server means the page owner's server unless the context clearly indicates otherwise."

>> allow per-page opt-in by the page author via a meta tag to permit anyone else
>> to add mark tags anywhere else. Without an opt-in on a page, other servers thus
>> could not legally add the markup to that page.

> Other pages already can't add any markup to the page. Servers can't just
> arbitrarily affect each other.

This was never about any other server going into the page owner's server. That was noted long ago. This was about marking up the page as it travels to the user but without the user knowing that changes due to the markup were not in the original. Most users won't know. Thus, the bar.

However, a page owner may want to permit Microsoft or anyone else to add Activities or other effects of whatever flavor. A per-page opt-in provision would allow that. I don't think many designers would exercise it but it would settle any controversy that Microsoft should be allowed to improve our websites without telling us by providing page owners with a tool to permit it. If an opt-in method should allow more specificity about what is opted into, that can be developed easily enough now or later, but a simple and comprehensive opt-in is a good starting point. It would serve as legal permission only. It would not cause anything technologically. Implementation would be up to the user agent. What would come from the page owner is permission for the UA to implement, as it should, or there'd be no right for the UA to implement the unwanted mark, in the legal sense that there is a right or duty to implement b, em, strong, abbr, and other elements.

I'm open to further ideas without reopening now. Most major website owners will want their sites delivered as sent (with the usual exceptions for, e.g., the user taking responsibility for making changes or for disability-responsive access), so, if there are any better solutions, they're of interest.

Thanks.

-- 
Nick

Comment 20 Nick Levinson 2009-08-14 14:52:43 UTC

Correcting myself: <mark> has another value beyond having more specific semantic value than <span>. Under section 10.6.2, a UA should show the presence of <mark>ed content even in parts of a document not visible, such as by highlighting parts of a scroll bar. It also encourages UAs to move quickly (to cycle) from mark to mark in a document. I saw it before but hadn't acknowledged it earlier in this discussion, and it's a useful provision. This is per <http://www.w3.org/TR/html5/single-page/>, as accessed 8-12-09.

The basic need remains as before. Section 10.6.2 does not obviate the danger implied by the kitten example.

Thanks.

-- 
Nick

Comment 21 Nick Levinson 2009-08-21 15:37:07 UTC

That mark is intended to allow third-party markup of a page in transit or as served without the user knowing is supported by this from a chat log (<http://krijnhoetmer.nl/irc-logs/whatwg/20080218#l-98> & <http://krijnhoetmer.nl/irc-logs/whatwg/20080218#l-99>, respectively, both as accessed 8-21-09):

* * * * *
# [01:29] <Philip`> "I also have some <mark>kitten</mark>s" - that makes me wonder whether "<mark>kittens</mark>" would be acceptable (from a cleverer search engine that detects words with the same basic meaning, rather than doing substring matches, and it might get <mark>young cat</mark> too) 
# [01:31] <Philip`> (I believe it is perfectly acceptable, but the example makes me wonder anyway, so maybe the example could say <mark>kittens</mark> to be clear that it's not meant to be strict about anything)
* * * * *

In one of the IRC logs also was a comment about not reading what's longer than a screenful. When a bunch of people make erroneous claims, especially when they're conflicting claims, a single informative reply to all of them will usually be longer.

There also seems to be one commenter's view that the way the Internet was long ago defines how it's used today. Its growth includes website owners, probably a supermajority of whom require stability of what they send, perhaps not in the 1970s but definitely now, and the Web has to include them. Tell them that intermediaries may edit what's shown to users without users knowing and that commenter will get a surprising answer.

Thank you.

-- 
Nick

Comment 22 Maciej Stachowiak 2010-03-14 13:18:43 UTC

This bug predates the HTML Working Group Decision Policy.

If you are satisfied with the resolution of this bug, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

This bug is now being moved to VERIFIED. Please respond within two weeks. If this bug is not closed, reopened or escalated within two weeks, it may be marked as NoReply and will no longer be considered a pending comment.

Comment 23 Nick Levinson 2010-03-28 18:29:33 UTC

I'm folding this issue into Bug 6606 (by entering this one as a duplicate of 6606 even though this isn't a duplicate per se since that seems to be the closest method) because requiring that compliant browsers be reasonably accurate in rendering websites and not add, edit, or hide arbitrary content without site owners' consent is a necessity. Otherwise, they should not claim to be compliant and we site owners face significant liabilities -- both.

Thank you.

*** This bug has been marked as a duplicate of bug 6606 ***