This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 1524 - propose new function fn:escape-html-uri
Summary: propose new function fn:escape-html-uri
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 1.0
Version: Last Call drafts
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Ashok Malhotra
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 1327
Reported: 2005-07-08 13:56 UTC by Joanne Tong
Modified: 2005-09-29 11:32 UTC

Description Joanne Tong 2005-07-08 13:56:15 UTC
This function would perform URI escaping as currently defined by the 
Serialization specification.  If this function is adopted, then the 
Serialization specification would reference this function definition when 
describing URI-escaping in the character expansion phase.  


--------------------------------------------------

fn:escape-html-uri($uri as xs:string?) as xs:string 

This function escapes all characters except the printable characters of the
US-ASCII coded character set, specifically the octets ranging from 32 to 126
(decimal).  The effect of the function is to escape a URI in the way that HTML
user agents handle attribute values that expect URIs.  Each character in
$uri to be escaped is replaced by an escape sequence, which is formed by
encoding the character as a sequence of octets in UTF-8, and then representing
each of these octets in the form %HH, where HH is the hexadecimal
representation of the octet.  This function must always generate hexadecimal
values using the upper-case letters A-F.

If $uri is the empty sequence, the function returns the zero-length string.

Note:

	The behavior of this function corresponds to the recommended handling 
of non-ASCII characters in URI attribute values as described in Appendix B.2.1
[HTML 4.0]

--------------------------------------------------
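
For readers who want to experiment with the rule proposed above, here is a
minimal sketch in Python. It is not part of the proposal and not a reference
implementation: the name escape_html_uri and the use of None to stand in for
the empty sequence are assumptions made for illustration only.

```python
def escape_html_uri(uri):
    """Illustrative sketch of the proposed fn:escape-html-uri rule.

    Assumption: None models the XPath empty sequence.
    """
    if uri is None:
        # Empty sequence in, zero-length string out.
        return ""
    pieces = []
    for ch in uri:
        if 32 <= ord(ch) <= 126:
            # Printable US-ASCII (including space) is left untouched.
            pieces.append(ch)
        else:
            # Encode the character as UTF-8, then write each octet as %HH
            # using upper-case hexadecimal digits.
            pieces.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(pieces)


print(escape_html_uri("http://www.example.com/~bébé"))
# prints http://www.example.com/~b%C3%A9b%C3%A9
```

Note that the space character (32) and characters such as the double quote
are inside the 32 to 126 range, so this rule leaves them as they are.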

Thanks,
Joanne
Comment 1 Felix Sasaki 2005-07-22 04:54:47 UTC
(In reply to comment #0)
> This function would perform URI escaping as currently defined by the 
> Serialization specification.  If this function is adopted, then the 
> Serialization specification would reference this function definition when 
> describing URI-escaping in the character expansion phase.  
> 
> 
> --------------------------------------------------
> 
> fn:escape-html-uri($uri as xs:string?) as xs:string 
> 
> This function escapes all characters except printable characters of the US-
> ASCII coded character set, specifically octet ranging from 32 to 126 
> (decimal).  The effect of the function is to escape a URI according to how html 
> user agents would handle attribute values that expect URIs.  Each character in 
> $uri to be escaped is replaced by an escape sequence, which is formed by 
> encoding the character as a sequence of octets in UTF-8, and then representing 
> each of these octets in the form %HH, where HH is the hexadecimal 
> representation of the octet.  This function must always generate hexadecimal 
> values using the upper-case letters A-F.
> 
> If $uri is the empty sequence, returns the zero-length string.
> 
> Note:
> 
> 	The behavior of this function corresponds to the recommended handling 
> of non-ASCII characters in URI attribute values as described in Appendix B.2.1
> [HTML 4.0]

In the serialization specification, you refer to XLink 1.0. In this
specification, you say that URI escaping is defined in terms of the
serialization specification, but you also define it in terms of HTML 4.0. I'm a
little confused by this. Could you clarify?
In a previous comment, we pointed out that XLink 1.1 defines escaping in terms
of IRIs. Would you consider referring to the IRI specification, section 3.1,
for the URI escaping?
Thank you in advance for your reply.

-- Regards, Felix Sasaki.

> 
> --------------------------------------------------
> 
> Thanks,
> Joanne
Comment 2 Michael Kay 2005-07-22 17:05:55 UTC
Here's an attempt to clarify this confusing subject.

Currently, the serialization specification, when describing URI escaping for the
HTML output method, does indeed contain a reference to XLink; but the detailed
algorithm described is actually by design identical to that described in
Appendix B.2.1 of the HTML 4.01 specification:

http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#non-ascii-chars

People have often asked why we escape non-ASCII characters rather than escaping
the characters listed in the XLink specification; it seems useful therefore to
reference the HTML algorithm rather than the XLink algorithm, since that is the
one we are using. (The practical reason for choosing this algorithm is that
using the XLink algorithm doesn't work: in particular, it breaks many Javascript
URIs in typical browsers).
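
To make the difference concrete, here is a rough Python sketch that contrasts
the two rules on a Javascript URI. It is illustrative only. The set of ASCII
characters escaped in the XLink-style function is an assumption based on one
reading of XLink 1.0 (the RFC 2396 excluded characters other than "#", "%"
and the square brackets), and the function names are hypothetical.

```python
# Assumption: ASCII characters that XLink 1.0 treats as disallowed, in
# addition to control characters and all non-ASCII characters.
XLINK_EXCLUDED_ASCII = set(' <>"{}|\\^`')


def percent_encode(ch):
    """Write each UTF-8 octet of ch as %HH with upper-case hex digits."""
    return "".join("%{:02X}".format(octet) for octet in ch.encode("utf-8"))


def escape_html_style(uri):
    """HTML 4.01 Appendix B.2.1 style: escape only what lies outside 32..126."""
    return "".join(
        ch if 32 <= ord(ch) <= 126 else percent_encode(ch) for ch in uri
    )


def escape_xlink_style(uri):
    """XLink 1.0 style (assumed character set, see the comment above)."""
    def disallowed(ch):
        code = ord(ch)
        return code < 32 or code >= 127 or ch in XLINK_EXCLUDED_ASCII

    return "".join(percent_encode(ch) if disallowed(ch) else ch for ch in uri)


script_uri = 'javascript:alert("hi there")'
print(escape_html_style(script_uri))   # prints javascript:alert("hi there")
print(escape_xlink_style(script_uri))  # prints javascript:alert(%22hi%20there%22)
```

Under the HTML-style rule the script text passes through unchanged, whereas
the XLink-style rule rewrites the quotes and the space inside it.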

This proposal (which the WGs have accepted) makes the algorithm which is
currently built-in to the serializer available as a user-callable function, so
that applications can invoke it when they need it and use a different algorithm
when they don't. As a result of this proposal, there is a new F+O function which
refers to the HTML 4.01 specification, and the serialization specification will
refer to this new F+O function to describe the default serialization behavior.

Does this make things clearer?

Michael Kay (personal response)
Comment 3 Ashok Malhotra 2005-08-10 23:34:06 UTC
On the joint WG telcon on 7/12/2005 we decided to add this function
based on the text provided by Joanne Tong.  Joanne was asked to provide examples,
which she sent to me privately.
Comment 4 Felix Sasaki 2005-08-17 04:38:47 UTC
(In reply to comment #2)
> Here's an attempt to clarify this confusing subject.
> 
> Currently, the serialization specification, when describing URI escaping for the
> HTML output method, does indeed contain a reference to XLink; but the detailed
> algorithm described is actually by design identical to that described in
> Appendix B.2.1 of the HTML 4.01 specification:
> 
> http://www.w3.org/TR/1999/REC-html401-19991224/appendix/notes.html#non-ascii-chars
> 
> People have often asked why we escape non-ASCII characters rather than escaping
> the characters listed in the XLink specification; it seems useful therefore to
> reference the HTML algorithm rather than the XLink algorithm, since that is the
> one we are using. (The practical reason for choosing this algorithm is that
> using the XLink algorithm doesn't work: in particular, it breaks many Javascript
> URIs in typical browsers).
> 
> This proposal (which the WGs have accepted) makes the algorithm which is
> currently built-in to the serializer available as a user-callable function, so
> that applications can invoke it when they need it and use a different algorithm
> when they don't. As a result of this proposal, there is a new F+O function which
> refers to the HTML 4.01 specification, and the serialization specification will
> refer to this new F+O function to describe the default serialization behavior.
> 
> Does this make things clearer?
> 
> Michael Kay (personal response)

Sorry for the late reply. Yes, this makes things clearer. Thank you very much 
for your explanation.

Felix Sasaki