This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 6915 - Encoded unicode chars errors as "broken URI fragments"
Status: RESOLVED INVALID
Alias: None
Product: LinkChecker
Classification: Unclassified
Component: checklink
Version: 4.5
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Ville Skyttä
QA Contact: qa-dev tracking
Reported: 2009-05-18 14:19 UTC by Boujin
Modified: 2011-10-17 18:32 UTC

Description Boujin 2009-05-18 14:19:51 UTC
Percent-encoded Unicode characters in a link's fragment identifier (anchor) are incorrectly reported as "broken URI fragments". (It seems the program does not recognize encoded characters in an anchor.)

Example:
http://www.domain.org/folder/file#espa%C3%B1a



Error output:
=============================================================================
    Status: 200 OK

    Some of the links to this resource point to broken URI fragments (such as index.html#fragment). 
=============================================================================



But encoded Unicode characters in other parts of the URL do not produce an error. Example:
http://www.domain.org/espa%C3%B1a/espa%C3%B1a



P.S.
I am using http://validator.w3.org/checklink, which is currently at version 4.5. Unfortunately, only version 4.4 is available in the bug report drop-down menu.
Comment 1 Ville Skyttä 2009-05-18 20:58:46 UTC
The set of allowed characters depends on the type of the document that contains the target of the fragment identifier (not the link), and on what the target is. For example, if the target is an "id" attribute, the set of characters is quite restricted, and for <a name="..."> the allowed characters differ between HTML 4.x and XHTML 1.x: http://www.w3.org/TR/xhtml1/#C_8

No working, real URL to check was provided, so it isn't possible to dig deeper into your particular case. For compatibility, though, I would suggest sticking with the allowable characters for the ID and NAME types from HTML 4: http://www.w3.org/TR/html4/types.html#type-name

It is quite possible that the link checker does have issues with some things that should be allowed, so I'm leaving this bug open for now. A working link to a real, public document with which such issues can be reproduced would be nice, though.

(Version 4.5 is in the drop-down menu now; thanks for noting its absence.)
Comment 2 Boujin 2009-05-19 08:44:37 UTC
I am using XHTML 1.0 sent as text/html and XHTML 1.1 sent as application/xhtml+xml, with content negotiation based on what the user agent accepts. I use "id" inside an "a" element.

Example 1 with "id":
<a href="/espa%C3%B1a/espa%C3%B1a" id="espa&#241;a">Espa&#241;a</a>

Example 2 with "#":
<a href="/espa%C3%B1a/espa%C3%B1a#espa%C3%B1a">Espa&#241;a</a>

(Note the different encodings of the character in the text and in the URI.)
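
To illustrate the note above, here is a minimal Python sketch showing that the two spellings decode to the same string. It assumes the fragment is percent-decoded as UTF-8 and the attribute value is decoded as an HTML character reference; the variable names are only for illustration, and this is not how the link checker itself compares fragments.

```python
from html import unescape
from urllib.parse import unquote, urlsplit

# Values taken from Example 1 and Example 2.
href = "/espa%C3%B1a/espa%C3%B1a#espa%C3%B1a"
id_attr = "espa&#241;a"

# Percent-decode the fragment (unquote assumes UTF-8 by default).
fragment = unquote(urlsplit(href).fragment)

# Resolve the numeric character reference in the attribute value.
target_id = unescape(id_attr)

# Both forms should name the same anchor once decoded.
print(fragment == target_id)
```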

Both validate using the W3C Markup Validator. Unfortunately, Example 2 fails the W3C Link Checker with the error mentioned in my previous post.



I have read the pages you linked, which state: "Note that the collection of legal values in XML 1.0 Section 2.3, production 5 is much larger than that permitted to be used in the ID and NAME types defined in HTML 4."

HTML 4's "id" (and "name") must begin with a letter (A-Z, a-z) and may then contain letters, digits, hyphens, underscores, colons, and periods.
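
As a rough sketch of the HTML 4 rule for ID and NAME tokens (a leading ASCII letter followed by letters, digits, hyphens, underscores, colons, or periods), the check below uses a hypothetical helper, is_html4_name. It is not part of the link checker.

```python
import re

# HTML 4 ID/NAME token: an ASCII letter, then letters, digits,
# hyphens, underscores, colons, or periods.
HTML4_NAME = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")

def is_html4_name(value: str) -> bool:
    """Return True if value fits the HTML 4 ID/NAME token rule."""
    return HTML4_NAME.fullmatch(value) is not None

print(is_html4_name("espana"))       # plain ASCII
print(is_html4_name("espa\u00f1a"))  # contains U+00F1 (n with tilde)
```

Under this rule the ASCII-only value passes, while the value containing ñ does not, because ñ is not an ASCII letter.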

Unfortunately, I have failed to find a list of valid characters for "id" in XHTML. Nevertheless, since XHTML's set is "much larger", I guess the previously mentioned anchor is correct?
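
For reference, a rough Python sketch of the XML 1.0 Name production (production 5) follows. It assumes the character ranges from the fifth edition of XML 1.0, which are broader than those in the edition XHTML 1.0 was written against, so treat it as an approximation. The helper is_xml_name is hypothetical and not part of any W3C tool.

```python
import re

# NameStartChar ranges, assumed from XML 1.0 (Fifth Edition).
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds hyphen, period, digits, and a few combining ranges.
NAME_REST = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"
XML_NAME = re.compile(f"[{NAME_START}][{NAME_REST}]*")

def is_xml_name(value: str) -> bool:
    """Return True if value matches the approximate Name production."""
    return XML_NAME.fullmatch(value) is not None

print(is_xml_name("espa\u00f1a"))   # decoded id value
print(is_xml_name("espa%C3%B1a"))   # raw percent-encoded form
```

Under these assumptions the decoded value is a legal Name, while the raw percent-encoded string is not, since "%" is not a name character. This says nothing about how the link checker matches fragments against targets.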