This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 17994 - The list of named character references at http://www.w3.org/TR/html5/named-character-references.html (8.5 Named character references) should also be available in an easy-to-parse format (e.g. plain text or json). This will allow developers to use it with
Status: RESOLVED WORKSFORME
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-18 07:30 UTC by contributor
Modified: 2012-09-24 03:09 UTC
CC List: 4 users

See Also:


Attachments

Description contributor 2012-07-18 07:30:55 UTC
This was cloned from bug 14993 as part of operation convergence.
Originally filed: 2011-11-29 09:17:00 +0000

================================================================================
 #0   contributor@whatwg.org                          2011-11-29 09:17:44 +0000 
--------------------------------------------------------------------------------
Specification: http://www.w3.org/TR/html5/
Multipage: http://www.whatwg.org/C#top
Complete: http://www.whatwg.org/c#top

Comment:
The list of named character references at
http://www.w3.org/TR/html5/named-character-references.html (8.5 Named
character references) should also be available in an easy-to-parse format
(e.g. plain text or json).  This will allow developers to use it without
having to parse the HTML page.
For example, the Unicode consortium provides several lists as plain text:
http://www.unicode.org/Public/6.0.0/ucd/

Posted from: 86.50.68.215
User agent: Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0
================================================================================
 #1   David Carlisle                                  2011-11-29 09:44:59 +0000 
--------------------------------------------------------------------------------
Note that the list is derived from the same sources as the XML entities spec, which are at

http://www.w3.org/2003/entities/2007xml/

in particular unicode.xml in that directory has all the information (but it has a lot of other information too, so might not be quite what you are looking for)

the entities are available in dtd declaration format (so more or less plain text) as

http://www.w3.org/2003/entities/2007/htmlmathml-f.ent

although that only has the XML-compatible ones (not the special case html rules that allow some common entity names to be used without a trailing ";")

It is essentially trivial to generate the same list in other formats by modifying the XSLT that extracts htmlmathml-f.ent, which is available in the XML source directory. I had assumed people would rather do that and generate exactly the format they want (text, JSON, Python, whatever), but if there is a generally useful format that I should generate, I have no objection to adding it to the build and putting the generated files up at

http://www.w3.org/2003/entities/2007
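
For readers who would rather post-process the DTD declarations than modify the XSLT, here is a minimal sketch of that conversion. It is not the build's own code. The sample declarations are hypothetical and only imitate the general `<!ENTITY name "value" >` shape; the real htmlmathml-f.ent may differ in spacing, comments, and escaping, and the helper names `ent_to_mapping` and `ent_to_json` are made up for this example.

```python
import json
import re

# Hypothetical sample in the style of DTD entity declarations.
# The real htmlmathml-f.ent may use different spacing and trailing comments.
SAMPLE_ENT = '''\
<!ENTITY AElig "&#x000C6;" >
<!ENTITY Aacute "&#x000C1;" >
<!ENTITY bne "&#x0003D;&#x020E5;" >
'''

# One declaration: entity name, then a double-quoted replacement value.
DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>')
# A hexadecimal numeric character reference inside the value.
CHAR_REF = re.compile(r'&#x([0-9A-Fa-f]+);')


def ent_to_mapping(text):
    """Map entity names (without ";") to the characters they expand to."""
    mapping = {}
    for name, value in DECL.findall(text):
        # Anything that is not a plain hex character reference is left as-is.
        mapping[name] = CHAR_REF.sub(lambda m: chr(int(m.group(1), 16)), value)
    return mapping


def ent_to_json(text):
    """Serialise the mapping as a JSON object with sorted keys."""
    return json.dumps(ent_to_mapping(text), sort_keys=True)


if __name__ == "__main__":
    print(ent_to_json(SAMPLE_ENT))
```

Values that use other escaping conventions (for example doubly escaped references) would need extra handling that this sketch does not attempt.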

Also, I'd need to know whether the ones without a trailing ; should be listed.


(Or maybe Ian will pick this up and I can do nothing, I don't mind:-)

David
================================================================================
 #2   Ian 'Hixie' Hickson                             2011-12-07 23:05:03 +0000 
--------------------------------------------------------------------------------
Since I already have a script generating the table, I don't mind adjusting the script to also generate something else. I need to know exactly what people want, though.
================================================================================
 #3   Ian 'Hixie' Hickson                             2011-12-09 22:12:23 +0000 
--------------------------------------------------------------------------------
Status: Did Not Understand Request
Change Description: no spec change
Rationale: need to know what information people want and in what format.
================================================================================
 #4   Ezio Melotti                                    2012-02-24 11:24:38 +0000 
--------------------------------------------------------------------------------
A JSON file with the character reference names and the characters they represent would be good.

E.g.:
{"AElig;": "\u00c6",
 "Aacute;": "\u00c1",
 "AMP;": "&",
 ...}
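
A minimal sketch of how a file in this shape could be consumed, assuming keys that include the trailing ";" as in the example above. The sample string, the regular expression, and the function name `replace_references` are illustrative assumptions; this is a simple substitution, not the HTML parsing algorithm.

```python
import json
import re

# Sample data in the proposed format (keys include the trailing ";").
SAMPLE_JSON = '{"AElig;": "\\u00c6", "Aacute;": "\\u00c1", "AMP;": "&"}'

# Matches "&name;" and captures "name;" so it can be used as a key directly.
REF = re.compile(r'&([A-Za-z][A-Za-z0-9]*;)')


def replace_references(text, table):
    """Replace known "&name;" references; leave unknown ones untouched."""
    return REF.sub(lambda m: table.get(m.group(1), m.group(0)), text)


if __name__ == "__main__":
    table = json.loads(SAMPLE_JSON)
    print(replace_references("&AElig;sop &AMP; &unknown;", table))
```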
================================================================================
 #5   David Carlisle                                  2012-02-24 14:48:08 +0000 
--------------------------------------------------------------------------------
something not totally unlike this?

http://www.w3.org/2003/entities/2007/htmlmathml.json
================================================================================
 #6   Ezio Melotti                                    2012-02-28 11:10:47 +0000 
--------------------------------------------------------------------------------
That would work for me.  Note that unlike the list you linked the trailing ';' should be included where necessary in the list of HTML5 references (the list at
http://www.w3.org/TR/html5/named-character-references.html includes both references with and without the ';').
I also noticed that in your list the "DotDot" entry (and a few others) is equivalent to " \u20DC" (with a leading space), whereas the entry in the HTML5 list only mentions U+020DC (without mentioning the leading space).  This is a combining character, so the reason for the extra space might be to prevent it from combining with the previous character.  I don't know if this should be the same for the HTML5 list too though.
================================================================================
 #7   David Carlisle                                  2012-02-28 11:59:14 +0000 
--------------------------------------------------------------------------------
(In reply to comment #6)
> That would work for me.  Note that unlike the list you linked the trailing ';'
> should be included where necessary in the list of HTML5 references (the list at
> http://www.w3.org/TR/html5/named-character-references.html includes both
> references with and without the ';').

I wouldn't want to put the ; in the names (it's a syntactic feature that HTML lets you omit the ; in some cases, but the name of the entity doesn't have the ;).
It would also make it a lot harder to use that data in XML. However, there would be no problem in having an additional JSON array that listed the ones that didn't need ;. Actually I don't think unicode.xml has that information: all the rest of the HTML entity list is extracted from that file, but the additional ones without ;
are currently added during that extraction process. I should probably record that list in the source file anyway, for consistency.
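
The two representations discussed here carry the same information, which a small sketch can illustrate. The data below is hypothetical sample data, not the real list: names are stored without ";" and a separate set (`NO_SEMICOLON_OK`, a made-up name) records which names HTML also accepts without the trailing ";".

```python
# Hypothetical sample data: entity names are stored without ";".
ENTITIES = {"AElig": "\u00c6", "amp": "&", "alpha": "\u03b1"}
# Separate list of names that HTML also accepts without the trailing ";".
NO_SEMICOLON_OK = {"AElig", "amp"}


def lookup(reference):
    """Resolve the text written after "&", e.g. "amp;" or "amp"."""
    if reference.endswith(";"):
        return ENTITIES.get(reference[:-1])
    if reference in NO_SEMICOLON_OK:
        return ENTITIES[reference]
    return None


def html5_style_table():
    """Build a table keyed the HTML5 way: "name;" plus bare legacy names."""
    table = {name + ";": value for name, value in ENTITIES.items()}
    table.update({name: ENTITIES[name] for name in NO_SEMICOLON_OK})
    return table


if __name__ == "__main__":
    print(sorted(html5_style_table()))
```

Keeping the names free of ";" leaves the data usable from XML, while the HTML5-style table can still be derived whenever it is needed.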

> I also noticed that in your list the "DotDot" entry (and a few others) is
> equivalent to " \u20DC" (with a leading space), whereas the entry in the HTML5
> list only mentions U+020DC (without mentioning the leading space).  This is a
> combining character, so the reason for the extra space might be to prevent it
> from combining with the previous character.

Yes, it is to ensure the resulting documents meet the "W3C normalisation form" (from one of the charmod drafts that never progressed to Recommendation status), which said that entities should never start with a combining character, so that entity expansion and Unicode normalisation can be performed in either order.
There are 4 such cases, documented here:

http://www.w3.org/2003/entities/2007doc/Overview.html#chars_math-multiple-tables
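
The "either order" property can be illustrated with a small sketch. The entity name `hypoacute` is hypothetical and is defined here as U+0301 (combining acute accent) purely because that character composes with a preceding letter under NFC, which makes the effect visible; it is not one of the four cases in the entity set. The helper names are also made up for this example.

```python
import unicodedata


def expand(text, table):
    """Naively replace "&name;" with its value for every name in the table."""
    for name, value in table.items():
        text = text.replace("&" + name + ";", value)
    return text


def orders_agree(text, table):
    """True if expand-then-normalise equals normalise-then-expand (NFC)."""
    expand_first = unicodedata.normalize("NFC", expand(text, table))
    normalize_first = expand(unicodedata.normalize("NFC", text), table)
    return expand_first == normalize_first


# Hypothetical entity whose value starts with a combining character.
BARE = {"hypoacute": "\u0301"}
# The same entity with a leading space, as done for DotDot and friends.
SPACED = {"hypoacute": " \u0301"}

if __name__ == "__main__":
    print(orders_agree("e&hypoacute;", BARE))    # the two orders differ
    print(orders_agree("e&hypoacute;", SPACED))  # the two orders agree
    # U+20DC (the DotDot character) is itself a combining character.
    print(unicodedata.combining("\u20dc") != 0)
```

Without the leading space, expanding first lets the accent compose with the preceding "e", while normalising first leaves them as separate code points; with the space, both orders give the same string.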

> I don't know if this should be the
> same for the HTML5 list too though.

I thought that this had been raised before but I don't see it in an existing bug.
================================================================================