Re: [css-text] Control characters

On 27/6/14 16:18, Brad Kemper wrote:
>
>
>> On Jun 27, 2014, at 2:27 AM, Jonathan Kew <jfkthame@gmail.com>
>> wrote:
>>
>> On 27/6/14 09:49, Koji Ishii wrote:
>>
>>>> Of course, you still need to define how those control
>>>> characters are rendered, erroneous or not.
>>>
>>> Yes, this is the text we have now[1]. Your quick review is
>>> invaluable to us; please let us know if you have any comments.
>>>
>>>> Control characters (Unicode class Cc) other than tab (U+0009),
>>>> line feed (U+000A), and carriage return (U+000D) are ignored
>>>> for the purpose of rendering. (As required by [UNICODE],
>>>> unsupported Default_ignorable characters must also be ignored
>>>> for rendering.)
>>
>> IMO, it would be better to require spurious control characters
>> (i.e. other than tab, linefeed, return) to be rendered visibly -
>> e.g. as "hexbox" glyphs or inverse-colored ^X sequences - rather
>> than ignored.
>>
>> The presence of such characters within the text degrades
>> functionality by interfering with operations such as search,
>> indexing, copy/paste to other environments, etc. Their presence is
>> typically the result of broken authoring tools/workflows, but as
>> long as browsers ignore them for rendering, authors generally
>> remain unaware that their data is bad, and readers will usually be
>> unaware that their searches, etc., may be missing content they
>> would have expected to match.
>>
>> I realize that making stray control characters visible will result
>> in some pages (containing bad text) looking "worse" from an
>> aesthetic point of view, but I don't believe this is such a
>> widespread and serious problem that we should give up the battle
>> and accept that the Web will forever hide these errors and leave
>> the problem of polluted data unaddressed. If browser vendors would
>> agree to make the CCs visible, and include this in the relevant
>> specs, there'll be a spate of bug reports - as we've seen when we
>> had them rendered as hexboxes in Firefox - but these can be
>> redirected to the sites/authors concerned, and there will be
>> significant pressure on authors and tool vendors to fix the
>> underlying problems.
>>
>> Although there'd no doubt be some short-term discontent, I think
>> this would be significantly better for the long-term health of the
>> web. Our concern should not -only- be to optimize the display of (a
>> small minority of badly-authored) web pages of today; we should
>> also be concerned for the quality and usability of web data in the
>> future.
>
> I disagree with the notion that we should use ugly and confusing
> rendering

Then create beautiful and clear glyphs for them! :)

> of unintentional characters as a weapon for
> punishing/scolding authors.

What is "ugly and confusing", IMO, is when browsers display the data

   <U+0048 U+0001 U+0065 U+0002 U+006C U+0003 U+006C U+0004 U+006F>

such that it appears to read "Hello", yet when a user searches for the 
string "Hello" they'll fail to find it; it will be indexed separately; 
it will be mangled by screen-readers; etc., etc.
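
To make the mismatch concrete, here is a minimal Python sketch. The helper name visible_when_ignored is hypothetical; it only imitates an "ignore for rendering" rule by dropping Cc characters other than tab, line feed and carriage return, and it is not a model of any browser's actual text pipeline.

```python
import unicodedata

# Controls the draft text exempts from being ignored: tab, LF, CR.
KEPT_CONTROLS = {"\t", "\n", "\r"}

def visible_when_ignored(text: str) -> str:
    """Approximate what a reader sees when other Cc characters are dropped."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in KEPT_CONTROLS
    )

# The code point sequence from the example above.
data = "H\u0001e\u0002l\u0003l\u0004o"

shown = visible_when_ignored(data)   # what the page appears to say
found = "Hello" in data              # what a plain substring search reports

print(repr(shown), found, len(data))
```

The displayed string and the stored string differ, so a plain substring search for the displayed text does not match the underlying data.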

> If UNICODE says the characters should be
> ignored, then let's ignore them, and don't render them.

Control characters are NOT considered default-ignorable in Unicode. If 
you search for Default_Ignorable_Code_Point in 
http://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt, 
you'll see that neither C0 nor C1 controls are included.
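
For anyone who wants to check this mechanically rather than by eye, a rough Python sketch follows. The function name parse_default_ignorable and the two-line SAMPLE are illustrative assumptions: SAMPLE only imitates the file's "range ; property # comment" layout and is not the full data, so a real check should feed in the lines of the actual DerivedCoreProperties.txt instead.

```python
def parse_default_ignorable(lines):
    """Collect (start, end) ranges tagged Default_Ignorable_Code_Point."""
    ranges = []
    for line in lines:
        body = line.split("#", 1)[0].strip()   # drop trailing comments
        if not body:
            continue
        fields = [f.strip() for f in body.split(";")]
        if len(fields) < 2 or fields[1] != "Default_Ignorable_Code_Point":
            continue
        start, _, end = fields[0].partition("..")
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges

# Abbreviated stand-in for the real file (assumption: replace with its lines).
SAMPLE = [
    "# Derived Property: Default_Ignorable_Code_Point",
    "00AD          ; Default_Ignorable_Code_Point # SOFT HYPHEN",
    "200B..200F    ; Default_Ignorable_Code_Point # ZERO WIDTH SPACE..",
]

ranges = parse_default_ignorable(SAMPLE)

# C0 controls, DELETE, and C1 controls.
controls = list(range(0x00, 0x20)) + [0x7F] + list(range(0x80, 0xA0))
overlap = [cp for cp in controls if any(lo <= cp <= hi for lo, hi in ranges)]

print(ranges, overlap)
```

An empty overlap list for the full file would confirm the point; with the abbreviated sample it only demonstrates the method.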

> It is not our
> place to use the threat of bad rendering to coerce authors into fixing
> or preventing encoding errors. We should be forgiving of the
> problems, instead of trying to make them worse.

This isn't "trying to make them worse". It's trying to encourage and 
facilitate the creation of cleaner data by making irregularities visible.

JK

Received on Friday, 27 June 2014 15:50:27 UTC