Re: [css-text] Control characters from Brad Kemper on 2014-06-27 (www-style@w3.org from June 2014)

From: Brad Kemper <brad.kemper@gmail.com>
Date: Fri, 27 Jun 2014 08:18:46 -0700
To: Jonathan Kew <jfkthame@gmail.com>
Cc: Koji Ishii <kojiishi@gluesoft.co.jp>, Anne van Kesteren <annevk@annevk.nl>, Zack Weinberg <zackw@panix.com>, fantasai <fantasai.lists@inkedblade.net>, "www-style@w3.org" <www-style@w3.org>
Message-Id: <C364EADC-0AC0-4187-860C-87526DE5E274@gmail.com>

> On Jun 27, 2014, at 2:27 AM, Jonathan Kew <jfkthame@gmail.com> wrote:
> 
> On 27/6/14 09:49, Koji Ishii wrote:
> 
>>> Of course, you still need to define how those control characters
>>> are rendered, erroneous or not.
>> 
>> Yes, this is the text we have now[1]. Your quick review is invaluable
>> for us, please let us know if any.
>> 
>>> Control characters (Unicode class Cc) other than tab (U+0009), line
>>> feed (U+000A), and carriage return (U+000D) are ignored for the
>>> purpose of rendering. (As required by [UNICODE], unsupported
>>> Default_ignorable characters must also be ignored for rendering.)
> 
> IMO, it would be better to require the presence of spurious control characters (i.e. other than tab, linefeed, return) to be rendered visibly - e.g. as "hexbox" glyphs or inverse-colored ^X sequences - rather than ignored.
> 
> The presence of such characters within the text degrades functionality by interfering with operations such as search, indexing, copy/paste to other environments, etc. Their presence is typically the result of broken authoring tools/workflows, but as long as browsers ignore them for rendering, authors generally remain unaware that their data is bad, and readers will usually be unaware that their searches, etc., may be missing content they would have expected to match.
> 
> I realize that making stray control characters visible will result in some pages (containing bad text) looking "worse" from an aesthetic point of view, but I don't believe this is such a widespread and serious problem that we should give up the battle and accept that the Web will forever hide these errors and leave the problem of polluted data unaddressed. If browser vendors would agree to make the CCs visible, and include this in the relevant specs, there'll be a spate of bug reports - as we've seen when we had them rendered as hexboxes in Firefox - but these can be redirected to the sites/authors concerned, and there will be significant pressure on authors and tool vendors to fix the underlying problems.
> 
> Although there'd no doubt be some short-term discontent, I think this would be significantly better for the long-term health of the web. Our concern should not -only- be to optimize the display of (a small minority of badly-authored) web pages of today; we should also be concerned for the quality and usability of web data in the future.

I disagree with the notion that we should use ugly and confusing rendering of unintentional characters as a weapon for punishing/scolding authors. If UNICODE says the characters should be ignored, then let's ignore them and not render them. It is not our place to use the threat of bad rendering to coerce authors into fixing or preventing encoding errors. We should be forgiving of the problems, instead of trying to make them worse.
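
For concreteness, the two behaviours under discussion can be sketched roughly as below. This is only an illustration, not spec text: the function names are made up for this example, the bracketed U+XXXX output is a plain-text stand-in for a "hexbox" glyph, and handling of Default_ignorable characters is left out entirely.

```python
import unicodedata

# Controls that the quoted draft text exempts: tab, line feed, carriage return.
EXEMPT = {"\t", "\n", "\r"}


def is_stray_control(ch: str) -> bool:
    """True for Unicode class Cc characters other than the three exempt ones."""
    return unicodedata.category(ch) == "Cc" and ch not in EXEMPT


def render_ignoring(text: str) -> str:
    """Draft behaviour (hypothetical helper): drop stray controls for rendering."""
    return "".join(ch for ch in text if not is_stray_control(ch))


def render_visibly(text: str) -> str:
    """Proposed alternative (hypothetical helper): show a visible stand-in."""
    return "".join(
        f"[U+{ord(ch):04X}]" if is_stray_control(ch) else ch for ch in text
    )


sample = "foo\x01bar\tbaz\x85"
print(repr(render_ignoring(sample)))  # 'foobar\tbaz'
print(repr(render_visibly(sample)))   # 'foo[U+0001]bar\tbaz[U+0085]'
```

In the first case the reader sees clean text and never learns the stray characters are there; in the second every stray character shows up in the rendered output.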

Received on Friday, 27 June 2014 15:19:16 UTC