1833 – Wrong ISO-8859-1 encoding behaviour on "Direct input"

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: Wrong ISO-8859-1 encoding behaviour on "Direct input"

Status:	CLOSED FIXED

Alias:	None

Product:	Validator
Classification:	Unclassified
Component:	Parser
Version:	0.7.0
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	Terje Bless
QA Contact:	qa-dev tracking

URL:	http://rodando.innox.com.mx
Whiteboard:
Keywords:

Duplicates (3):	1920 2275 2690
Depends on:
Blocks:

Reported:	2005-08-08 18:22 UTC by Rafael
Modified:	2018-10-29 06:12 UTC
CC List:	4 users

See Also:

Attachments

Description Rafael 2005-08-08 18:22:34 UTC

When using the "Direct input" method to specify the HTML to validate, it doesn't
respect the encoding (though it does detect it, as it shows it). This doesn't
happen when validating by URL or File upload.

* Notes:
  - the encoding was ISO-8859-1, and
  - the invalid char was "Ú" (this seems to happen only with uppercase letters)

--couldn't think of a severity other than "normal"

Comment 1 Bj 2005-08-08 19:14:04 UTC

The correct behavior here would be to have no encoding selection for direct 
input and treat the content as if it had charset=utf-8 in an HTTP header. This 
is because you submit a sequence of characters through the browser, not a 
sequence of octets. And the input form should be clear that UTF-8 is expected 
(e.g., through use of accept-charset); I don't think we should allow for 
anything else.
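
To illustrate the characters-versus-octets point, here is a minimal Python sketch (not validator code) using the "Ú" from the original report. It assumes the browser serializes the form field as UTF-8, as the comment above describes. It may also explain why the reporter only saw the problem with uppercase letters: for U+00C0 to U+00DF the second UTF-8 octet falls in 0x80 to 0x9F, which are control characters in ISO-8859-1, while for lowercase accented letters it falls in the printable range.

```python
# Illustrative only: one character, two different octet sequences.
char = "\u00da"  # "Ú", the character from the original report

latin1_octets = char.encode("iso-8859-1")  # b'\xda'
utf8_octets = char.encode("utf-8")         # b'\xc3\x9a'

# A form on a UTF-8 page submits the UTF-8 octets. Decoding them with the
# encoding named inside the pasted document (ISO-8859-1) yields two
# characters, the second of which (U+009A) is a C1 control character.
misread = utf8_octets.decode("iso-8859-1")
print([hex(ord(c)) for c in misread])

# Lowercase "ú" (U+00FA) is also misread, but as two printable characters,
# so it would not be flagged as an invalid character.
misread_lower = "\u00fa".encode("utf-8").decode("iso-8859-1")
print([hex(ord(c)) for c in misread_lower])

# Decoding as UTF-8 recovers exactly what was pasted.
roundtrip = utf8_octets.decode("utf-8")
print(roundtrip == char)
```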

Comment 2 Olivier Thereaux 2005-09-21 09:04:48 UTC

Rafael, thank you for your report.
This issue has been fixed in cvs, and will be in our next release.

Comment 3 Olivier Thereaux 2005-09-21 13:07:11 UTC

*** Bug 2275 has been marked as a duplicate of this bug. ***

Comment 4 Jeppe H 2005-09-21 13:44:02 UTC

I do not agree that the content should be treated as utf-8. In my opinion it
would seriously degrade the use of "Direct input", since you cannot validate
content containing other encoding than utf-8.

From a user perspective I don't see why the "Direct input" should be any less
usable than file upload or URL validation. The "Direct input" should of
course(!) respect the encoding of the content and the extended direct input
interface should contain the possibility of overriding the encoding type in the
same way the other two validation options do.

Comment 5 Olivier Thereaux 2005-09-21 22:09:12 UTC

(In reply to comment #4)
> I do not agree that the content should be treated as utf-8. In my opinion it
> would seriously degrade the use of "Direct input", since you cannot validate
> content containing other encoding than utf-8.

Yes, it should. Regardless of what encoding the document you are trying to validate is in, since the 
form on the validator's site is in utf-8, when you copy and paste your source into the form's text area, 
then it "becomes" utf-8.

See also Comment #1.

> From a user perspective I don't see why the "Direct input" should be any less
> usable than file upload or URL validation. 

Direct input is different from the other methods because there is no "server" sending the content along 
with HTTP information, it's just a text string being sent with a form. It means that the logic to process 
this content is slightly different from the other input methods. However, your assumption that it would 
have any effect on usability is based on a misunderstanding.

Comment 6 Jeppe H 2005-09-22 08:03:26 UTC

> Yes, it should. Regardless of what encoding the document you are trying to
> validate is in, since the form on the validator's site is in utf-8, when you
> copy and paste your source into the form's text area, then it "becomes" utf-8.
> 
> See also Comment #1.

So, basically, what you are saying is that because the form on the validator's
site is utf-8, then it should NOT be possible to have your input validated in
another encoding? Again, I would like to point out that to the user, there
should be no difference between the three different validation methods.

What I suggest you do is this:
When using "direct input" in 'simple mode' the validator should try to read the
encoding from within the content (from the meta element). If there is no meta
element specifying an encoding, the validator should use utf-8.

When using "direct input" in 'extended mode' it should be possible to specify
what encoding you want the validator to use.

This way, "direct input" would follow the methodology of the "file upload" and
"URL" validation.

> Direct input is different from the other methods because there is no "server"
> sending the content along with HTTP information, it's just a text string being
> sent with a form. It means that the logic to process this content is slightly
> different from the other input methods. However, your assumption that it would
> have any effect on usability is based on a misunderstanding.

Misunderstanding? If I try to validate content that validates with an iso-8859-1
encoding in "URL validation" or "file upload", but suddenly fails when using
"direct input" [because the form uses utf-8] I would say it does indeed have an
impact on usability! I don't see how you can say this is based on a
misunderstanding.

To sum up, I think you should give the user the choice of specifying an encoding
on "direct input", the same way it works on the extended upload and URL
validation. Furthermore I think you should specify that content on "direct
input" in 'simple mode' defaults to utf-8.

Comment 7 Olivier Thereaux 2005-09-22 09:01:38 UTC

(In reply to comment #6)
> So, basically, what you are saying is that because the form on the validator's
> site is utf-8, then it should NOT be possible to have your input validated in
> another encoding? Again, I would like to point out that to the user, there
> should be no difference between the three different validation methods.

Please read comment #1 and comment #6 carefully, 
and try the development instance of the validator at 
http://qa-dev.w3.org/wmvs/0.7/

Let me try to explain once more:
When validating by URI or file upload you are transferring a file 
(i.e. a sequence of bytes) for which it is necessary to know the 
encoding. This is done either with the HTTP headers that the 
server sends, or by trying to parse the document and find the 
<meta> information for charset.

When validating by direct input you are not transferring a file, 
you are transferring a series of characters that have been 
entered into a form field of a page in utf-8. It does NOT mean 
that the original content has to be utf-8, it means that your 
browser, automatically, will paste the content as utf-8 characters. 

Even if the original content was, say, iso-latin-1, the final string 
of characters sent to the validator will be utf-8, automatically, 
thanks to the browser. So the validator, which usually (when 
validating by URI) tries to find out what the encoding of the 
document is and transcodes it internally into utf-8 must NOT, 
in this case, believe what the <meta> says. Ultimately, what 
you do not seem to understand is that there MUST be a difference 
in logic for the validator so that there is no difference, in the end, 
for the user. Which is our goal as it is yours. 
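
The difference in logic described above can be sketched roughly as follows. This is only an illustration in Python, not the validator's actual code: the function name decode_submission, the method labels and the UTF-8 fallback are all assumptions made for the example.

```python
from typing import Optional


def decode_submission(octets: bytes, method: str,
                      declared_charset: Optional[str] = None) -> str:
    """Hypothetical helper: turn submitted octets into text.

    method is "direct", "uri" or "upload" (labels invented for this sketch).
    declared_charset is whatever the HTTP header or <meta> claims, if anything.
    """
    if method == "direct":
        # The form page is UTF-8, so the browser sent UTF-8 octets,
        # whatever the pasted document's <meta> says.
        return octets.decode("utf-8")
    # URI / file upload: these are the file's own octets, so the declared
    # encoding applies. Falling back to UTF-8 is an assumption of this sketch.
    return octets.decode(declared_charset or "utf-8")


# The same ISO-8859-1 document, submitted two ways.
source = ('<meta http-equiv="Content-Type" '
          'content="text/html; charset=iso-8859-1"><p>\u00da</p>')
pasted = source.encode("utf-8")         # what the browser sends from the form
uploaded = source.encode("iso-8859-1")  # the file as stored on disk

# Different internal logic, same text in the end for both methods.
print(decode_submission(pasted, "direct", "iso-8859-1") == source)
print(decode_submission(uploaded, "upload", "iso-8859-1") == source)
```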

I understand and sympathise with your concern, but the bug has 
been fixed in our development code, as you can see by testing 
the instance I have mentioned above (and below). So, regardless 
of whether my explanations are clear enough or not, regardless 
of whether you agree with or understand the logic, there is really 
no reason to keep arguing about it...

http://qa-dev.w3.org/wmvs/0.7/

Thank you.

Comment 8 Jeppe H 2005-09-22 10:17:05 UTC

> Let me try to explain once more:

Thanks for your explanation. I was not aware that you had fixed the bug, and it
seemed to me that you wanted to leave it as it was. However, I just tried out
your developer instance - it seems to work fine.

Also FYI, I'm not trying to argue, I'm merely trying to say, that it would be
nice to have the option of specifying an encoding.

Thanks for your time.

Comment 9 Olivier Thereaux 2005-10-18 07:31:06 UTC

*** Bug 1920 has been marked as a duplicate of this bug. ***

Comment 10 Elisabeth Freeman 2005-10-20 19:46:18 UTC

Well I have to say that this is going to be INCREDIBLY confusing for people who are trying to validate 
their files, especially those who don't read these bug reports (or for those new to HTML, who, even if 
they did read these, would be completely confused).

The issue is that the form as it is now would indicate that there are three alternatives to validating your 
code.  But these alternatives are NO LONGER EQUIVALENT.

If you are going to have different behavior for an uploaded file as you do for a copy and pasted file, 
then you need to indicate in the form that these two things cause DIFFERENT BEHAVIOR.

This is extremely confusing.  As an author who is writing a book, including this validator as an 
explanation of how to determine if your code is correct, having to explain to your MOM why her code 
validates if she pastes it in but doesn't if she uploads it... is well, pretty frickin' impossible.

Is this service for expert users?  Is it for new users?  Ask yourself - who is this service for?  Are you 
"fixing" these issues because it makes YOU feel better?  Or because it actually helps people?

I'm very upset about this.  It's caused me many hours of reworking a chapter in my book all because 
you think that pasted text should be treated differently, when in fact, it will make no sense to 95% of 
HTML writers out there.

Comment 11 Bj 2005-10-20 20:03:12 UTC

I don't think this will affect many users. First, please note that file upload 
and validating by web site address weren't equivalent in this regard. Second, 
the result will only be different if the document the user might paste, upload 
or validate by address does not correctly declare the character encoding, or if 
there are browser / system bugs on the user's client machine. I don't think 
this will be very common, and even if it is, there is little we can do about it. 
All we've done here is to align the Validator with the technical reality: having 
a choice of encoding for textarea validation did not work before, it was just a 
way to cause the Validator to produce incorrect or misleading results.

Octets versus Characters and HTML and Form I18N are indeed complex and sometimes 
confusing topics, but that's not really our fault...

Comment 12 Olivier Thereaux 2005-10-20 21:18:07 UTC

(In reply to comment #10)

> The issue is that the form as it is now would indicate that there are three alternatives to validating your 
> code.  But these alternatives are NO LONGER EQUIVALENT.

Yes, they are...

In addition to the good answer Bjoern made, there is one thing I want to point out: the recent changes are 
here *precisely* to make the behaviour equivalent for the end user. Whether the internal processing is 
different is none of the user's concern.

Comment 13 Elisabeth Freeman 2005-10-23 22:17:46 UTC

I don't understand how they are equivalent for the end user.  If I upload a file without a meta tag and try 
to validate, I will see a warning about Character Encoding.

If I paste in the *exact same file* and submit using the form, then I don't see this warning.

How is this equivalent??

Comment 14 Olivier Thereaux 2005-10-24 06:19:28 UTC

(In reply to comment #13)
> I don't understand how they are equivalent for the end user.  If I upload a file without a meta tag and try 
> to validate, I will see a warning about Character Encoding.
> 
> if I paste in the *exact same file* and submit using the form, then I don't see this warning.

This comment is certainly showing a problem in the user interface. However, it is not exactly relevant to 
this bug. This bug was about how the data sent to the direct input is handled internally by the validator, 
not about the (inevitable, see below) slight differences in messages between the different input 
methods.

When you validate by URL, the validator has a number of sources from which to draw the character 
encoding:
- the HTTP Content-Type header
- (if XHTML) the XML declaration
- the meta http-equiv element

If none of these sources gives a satisfying result, the validator throws a warning and tries to validate 
with a default character encoding.
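
As a rough sketch of that lookup, the Python below checks the three sources in the order listed and falls back with a warning. It is not the validator's implementation: the function names, the regular expressions, the precedence order and the choice of utf-8 as the default are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

DEFAULT_CHARSET = "utf-8"  # assumption: the thread does not name the default


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Extract charset=... from a Content-Type style string."""
    if not value:
        return None
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', value, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_xml_declaration(head: str) -> Optional[str]:
    """Extract encoding="..." from a leading <?xml ...?> declaration."""
    m = re.match(r'\s*<\?xml[^>]*\bencoding\s*=\s*["\']([\w.:-]+)["\']', head)
    return m.group(1).lower() if m else None


def charset_from_meta(head: str) -> Optional[str]:
    """Extract the charset from a <meta http-equiv="Content-Type"> element."""
    m = re.search(r'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*>',
                  head, re.IGNORECASE)
    return charset_from_content_type(m.group(0)) if m else None


def detect_charset(body: bytes,
                   content_type: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (charset, warning); warning is None when a source was found."""
    # Only the start of the document is inspected. ISO-8859-1 decoding never
    # fails and leaves the ASCII markup readable.
    head = body[:1024].decode("iso-8859-1")
    for candidate in (charset_from_content_type(content_type),
                      charset_from_xml_declaration(head),
                      charset_from_meta(head)):
        if candidate:
            return candidate, None
    return DEFAULT_CHARSET, ("No character encoding information found; "
                             "falling back to " + DEFAULT_CHARSET)


print(detect_charset(b"<p>x</p>", "text/html; charset=ISO-8859-1"))
print(detect_charset(b"<p>x</p>"))
```

For file upload, the same function would simply be called without a content_type argument, since no server is involved.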

Same goes for the file upload interface, except that there is no "server" serving the document.

For the direct input interface, there is no "document" being served, just a string of text being pasted in 
a form on a page in utf-8, therefore the text is pasted as utf-8, and parsed as utf-8. Since there is no 
need for character encoding detection, there will be no warning about the lack of character encoding 
information.

Perhaps there should be, with all validation results for direct input and file upload, a note explaining: 
"Your document has been validated using the validator's direct input / file upload interface. When 
online, this document will need to declare its character encoding. You can set up the declaration of 
character encoding by either setting up the Web server properly, or by declaring the character encoding 
within each document. (read more). We encourage you to check the document again when it is online."

Comment 15 Bj 2006-01-10 17:57:54 UTC

*** Bug 2690 has been marked as a duplicate of this bug. ***