This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 8310 - script block's source initialization: please honor the specified charset and type
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords: NE
Depends on:
Blocks:
 
Reported: 2009-11-16 00:32 UTC by Philippe Verdy
Modified: 2010-10-04 13:55 UTC
CC List: 5 users

Description Philippe Verdy 2009-11-16 00:32:41 UTC
About this section:

"If the load was successful
Initialize the script block's source as follows:

If the script is from an external file
The contents of that file, interpreted as string of Unicode characters, are the script source.

For each of the rows in the following table, starting with the first one and going down, if the file has as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then set the script block's character encoding to the encoding given in the cell in the second column of that row, irrespective of any previous value:

Bytes in Hexadecimal	Encoding
FE FF	UTF-16BE
FF FE	UTF-16LE
EF BB BF	UTF-8
This step looks for Unicode Byte Order Marks (BOMs).

The file must then be converted to Unicode using the character encoding given by the script block's character encoding.

If the script is inline and the script block's type is a text-based language
The value of the DOM text attribute at the time the "running a script" algorithm was first invoked is the script source.

If the script is inline and the script block's type is an XML-based language
The child nodes of the script element at the time the "running a script" algorithm was first invoked are the script source."
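
The external-file steps quoted above can be sketched roughly as follows. This is only an illustration of the quoted text, not the spec's or any browser's implementation: the function name `decode_external_script` and the `declared_encoding` parameter are hypothetical, and stripping the BOM before decoding is an assumption, since the quoted text does not say what happens to those bytes.

```python
# BOM table from the quoted text, in the order given there.
BOM_TABLE = [
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]


def decode_external_script(data: bytes, declared_encoding: str) -> str:
    """Decode an external script, letting a leading BOM override the declared encoding."""
    encoding = declared_encoding
    bom_length = 0
    # Walk every row; a match replaces the encoding "irrespective of any previous value".
    for bom, bom_encoding in BOM_TABLE:
        if data.startswith(bom):
            encoding = bom_encoding
            bom_length = len(bom)
    # Assumption: the BOM bytes are dropped rather than decoded as U+FEFF.
    return data[bom_length:].decode(encoding)
```

Note that the declared encoding is only used when no row matches, which is the behavior this report objects to.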

----

This description clearly breaks the definition of the <script> element as a way to embed or reference any external data that is not part of the document's content flow. Notably, it forces all scripts to be text-based, even though the data could just as well be a binary-encoded image (loaded from an external file or URL).

My opinion is that all these steps should be taken ONLY if the script is embedded inline in the document (not loaded separately), in which case the detection of BOMs is clearly undesirable, or if it is loaded from an external file or URL whose type is text-based (its computed MIME type starts with "text/" or maps to a text-based format such as "application/xml").

In other words, the specified value of the script's "type" attribute must still be honored if it is present, as must the specified value of the script's "charset" attribute when it is also present.

Forcing the detection of BOMs when the specified charset does not have to be "guessed" is a bug, notably because the script's content could be any kind of text data that legitimately starts with the listed bytes in the specified non-Unicode-based charset, where they legally represent one or more actual and significant characters.
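
As a small illustration of that point, the three byte sequences from the table are all valid text in ISO-8859-1. The helper name `read_as_latin1` is hypothetical and ISO-8859-1 is just one example of a non-Unicode-based charset.

```python
def read_as_latin1(raw: bytes) -> str:
    """Decode bytes as ISO-8859-1, where every byte maps to exactly one character."""
    return raw.decode("iso-8859-1")


# The BOM byte sequences from the table, read as ordinary ISO-8859-1 text.
for raw in (b"\xfe\xff", b"\xff\xfe", b"\xef\xbb\xbf"):
    print(raw.hex(" "), "->", read_as_latin1(raw))
# fe ff -> þÿ
# ff fe -> ÿþ
# ef bb bf -> ï»¿
```

A script declared as ISO-8859-1 that happens to begin with one of these character sequences would be misread if the BOM check overrode the declared charset.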

Conversely, if the specified charset is one of the suggested Unicode-based UTFs, the detection of BOMs may be used to check whether the charset can safely be changed to one of the others; in other words, BOM detection CAN be, and in fact SHOULD be, used ONLY IF:

- there is no charset specified, OR

- the specified charset is a Unicode-approved and compatible UTF, namely one of UTF-8, UTF-16, or UTF-32, and ONLY if that charset allows the presence of BOMs, so NOT if the specified charset is UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE
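
The two conditions above could be expressed as a predicate along these lines. This is a minimal sketch of the proposal in this comment, not of anything in the spec; `should_sniff_bom` and `BOM_CAPABLE` are hypothetical names, and the simple lowercase comparison is an assumption that ignores charset aliases.

```python
from typing import Optional

# Charsets whose definitions permit a leading BOM, per the list above.
BOM_CAPABLE = {"utf-8", "utf-16", "utf-32"}


def should_sniff_bom(declared_charset: Optional[str]) -> bool:
    """Return True if BOM detection would be allowed under the proposed rule."""
    # Condition 1: no charset was specified at all.
    if declared_charset is None or not declared_charset.strip():
        return True
    # Condition 2: the charset is a UTF that allows a BOM
    # (so the explicit BE/LE variants do not qualify).
    return declared_charset.strip().lower() in BOM_CAPABLE
```

For example, `should_sniff_bom("UTF-16")` is true, while `should_sniff_bom("UTF-16LE")` and `should_sniff_bom("ISO-8859-1")` are false.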

BOM detection is then also possible for compressed Unicode transforms such as BOCU-1 or SCSU (which browsers MAY optionally support, independently of their separate support for transport-layer protocols that apply deflate/gzip/compress algorithms to any document type in any charset), and possibly for some other large Asian charsets such as GB 18030 or recent versions of HKSCS or JIS X (where BOMs are also representable because these charsets can now map the UCS bijectively and may be used as if they were UTFs), provided that the encoded forms of the BOM are distinct for each charset.
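
Whether the encoded BOMs are in fact distinct can be checked mechanically. The sketch below uses the commonly published signatures for U+FEFF in the Unicode encoding forms, SCSU, BOCU-1 and GB 18030; the table is hard-coded (an assumption of this example, not derived from any codec) and `prefix_conflicts` is a hypothetical helper.

```python
from itertools import permutations

# Encoded form of U+FEFF in each encoding (hard-coded for illustration).
SIGNATURES = {
    "UTF-8": bytes.fromhex("EFBBBF"),
    "UTF-16BE": bytes.fromhex("FEFF"),
    "UTF-16LE": bytes.fromhex("FFFE"),
    "UTF-32BE": bytes.fromhex("0000FEFF"),
    "UTF-32LE": bytes.fromhex("FFFE0000"),
    "SCSU": bytes.fromhex("0EFEFF"),
    "BOCU-1": bytes.fromhex("FBEE28"),
    "GB 18030": bytes.fromhex("84319533"),
}


def prefix_conflicts(signatures):
    """List (shorter, longer) pairs where one signature is a prefix of another."""
    return [
        (a, b)
        for a, b in permutations(signatures, 2)
        if signatures[b].startswith(signatures[a])
    ]


print(prefix_conflicts(SIGNATURES))
# [('UTF-16LE', 'UTF-32LE')]
```

The signatures are pairwise distinct, but the UTF-16LE signature is a prefix of the UTF-32LE one, so a detector covering both would need to test the longer signature first.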
Comment 1 Ian 'Hixie' Hickson 2010-01-06 08:25:17 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted
Change Description: see diff given below
Rationale: I changed the spec to handle external resources the same way as internal resources, instead of treating all external resources as plain text.