This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 21577 - NFC issues reported wrong way: wrong char highlighted, wrong total amount
Summary: NFC issues reported wrong way: wrong char highlighted, wrong total amount
Status: REOPENED
Alias: None
Product: HTML Checker
Classification: Unclassified
Component: General
Version: unspecified
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Michael[tm] Smith
QA Contact: qa-dev tracking
URL: http://www.cs.tut.fi/~ jkorpela/test/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-04 08:27 UTC by Jukka K. Korpela
Modified: 2015-08-23 07:34 UTC

Description Jukka K. Korpela 2013-04-04 08:27:08 UTC
When the HTML5 validator issues warnings about deviations from Normalization Form C (NFC), it highlights the last character of the text run, not the offending character. It also reports the number of warnings as too large.

These problems are not present in validator.nu: it highlights the entire text run, and it does not report the number of warnings.

Example:

<!doctype html>
<title>Χαίρε· Hello world</title>

Excerpt from validator output, when using direct input:

QUOTE
Validation Output: 4 Warnings

Below is a list of the warning message(s) produced when checking your document.

    Warning Line 2, Column 25: Text run is not in Unicode Normalization Form C.

    <title>Χαίρε· Hello world</title>
UNQUOTE

Here the letter “d” of “world” appears in red, and the column number 25 refers to “d”, too. There are no other warnings issued, yet the total number of warnings is reported as 4. (Perhaps the validator counts informative messages as warnings, for the purposes of calculating this total? There are 3 informative messages in this case.)

Flagging the last character of a text run is more confusing in real-life situations where the run is all Greek. In the context where I originally met this issue, things were confusing since the last character of the run was ά, Greek alpha with tonos, which *could* have been in non-NFC form (but wasn’t).
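
For illustration, here is a minimal sketch of how a checker could locate the offending character instead of the end of the text run. It assumes Python 3 and its standard unicodedata module; the function name first_non_nfc_index is hypothetical and this is not the validator's actual code.

```python
import unicodedata
from typing import Optional


def first_non_nfc_index(text_run: str) -> Optional[int]:
    """Return the index of the first code point at which text_run stops
    being in NFC, or None if the whole run is already normalized."""
    for end in range(1, len(text_run) + 1):
        prefix = text_run[:end]
        # Once a prefix differs from its own NFC form, its last code point
        # is the earliest one a checker could point at.
        if unicodedata.normalize("NFC", prefix) != prefix:
            return end - 1
    return None


# The title text from the example, with U+0387 GREEK ANO TELEIA after the Greek word.
title_text = "\u03a7\u03b1\u03af\u03c1\u03b5\u0387 Hello world"
offender = first_non_nfc_index(title_text)
if offender is not None:
    print(f"offending code point U+{ord(title_text[offender]):04X} at index {offender}")
```

The prefix scan is quadratic in the length of the run, which is acceptable for a sketch but probably not what a production checker would do.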
Comment 1 Michael[tm] Smith 2013-04-20 19:16:09 UTC
I can't reproduce this either with validator.nu or the W3C validator. Neither emits any errors or other messages for your <!doctype html><title>Χαίρε· Hello world</title> case.
Comment 2 Jukka K. Korpela 2013-04-20 19:53:27 UTC
The character in my demo case is “·” U+0387 GREEK ANO TELEIA, not “·” U+00B7 MIDDLE DOT. I have added a URL of an online version of the case for clarity.

The bug is triggered by any non-NFC character and even by a character reference denoting such a character, e.g. &#x387;.
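
To make the difference between the two look-alike dots concrete, the following sketch (assuming Python 3 with the standard html and unicodedata modules; the helper name describe is hypothetical) decodes the character reference and shows that its NFC form is a different code point:

```python
import html
import unicodedata

ano_teleia = "\u0387"   # the character used in the demo case
middle_dot = "\u00b7"   # the look-alike that is already in NFC


def describe(ch: str) -> str:
    """Format a single character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# A character reference decodes to the same non-NFC character.
decoded = html.unescape("&#x387;")
normalized = unicodedata.normalize("NFC", decoded)

print(describe(decoded), "->", describe(normalized))
```

Because normalization replaces U+0387 with U+00B7, any text run containing the former is not in NFC, whether it was typed directly or written as a reference.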
Comment 3 Michael[tm] Smith 2013-04-21 01:20:05 UTC
(In reply to comment #2)
> The character in my demo case is “·” U+0387 GREEK ANO TELEIA, not “·” U+00B7
> MIDDLE DOT. I have added a URL of an online version of the case for clarity.

Ah, thanks

> The bug is triggered by any non-NFC character and even by a character
> reference denoting such a character, e.g. &#x387;.

OK

(In reply to comment #0)
> When the HTML5 validator issues warnings about deviations from Normalization
> Form C (NFC), it highlights the last character of the text run, not the
> offending character. It also reports the number of warnings as too large.

When I try the document at http://www.cs.tut.fi/~ jkorpela/test/nfc.html8 with both validator.nu and http://validator.w3.org/nu/ I get exactly the same result as far as the highlighting and number of warnings are concerned.

> These problems are not present in validator.nu: it highlights the entire
> text run, and it does not report the number of warnings.

http://validator.w3.org/nu/ also highlights the entire text run and does not report the number of warnings...

> Excerpt from validator output, when using direct input:
> 
> QUOTE
> Validation Output: 4 Warnings
> 
> Below is a list of the warning message(s) produced when checking your
> document.

Ah, so you mean you're checking the document using http://validator.w3.org/

Please use http://validator.w3.org/nu/ directly instead.

There are a number of known issues with the post-processing that the legacy validator does on output from the validator.nu backend. Those issues are not likely to ever be fixed. I don't maintain the code for it and have never even touched the code directly. It was all set up before my time. And actually, nobody is actively maintaining it now. Not for a year or two now.

Anyway, the long-term solution to this problem is that we just need to move the http://validator.w3.org/nu/ UI to become http://validator.w3.org/ and finally retire the current http://validator.w3.org/ to http://validator.w3.org/legacy or something. It's long overdue. If it were completely up to me to decide, we'd have already done that a long time ago. In the meantime, I can't commit to investigating any problems in the reports from http://validator.w3.org/ that aren't reproducible at http://validator.w3.org/nu/