Characterizing ill-formed XML on the web

Liam Quin, <liam@w3.org>

What

  1. A review of the Amsterdam XML on the Web Corpus with emphasis on well-formedness
  2. Recommendations: should we change XML?

XML at W3C

Is it still a sentence when it’s?

Background

Is it a problem?

XML 1

A data object is an XML document if it is well-formed, as defined in this specification. [XML]
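A minimal sketch of what that definition means in practice, assuming Python's standard-library expat binding as the checker (the slides do not say which parser was used on the corpus). The function name `is_well_formed` is made up for illustration. A non-validating parser either accepts the whole input or raises on the first well-formedness error:

```python
import xml.parsers.expat


def is_well_formed(data: bytes) -> bool:
    """Return True if expat parses the whole input without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk, so
        # unclosed elements at end of input are reported as errors.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# One matched pair of tags, then the same text with a missing end tag.
good = b"<play><line>here shall I dye ashore.</line></play>"
bad = b"<play><line>here shall I dye ashore.</play>"
print(is_well_formed(good), is_well_formed(bad))
```

By the definition quoted above, the second input is not an XML document at all, however readable the text inside it is.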

XML 2

XML 3

bad text

I shall no more to sea, to sea, here shall I dye ashore.

screenshot of browser attempting to display a shakespeare play but requiring correct grammar

Bad XML 1

Good Steven

Bad Steven

html!

Perl

Most common WF

  58447 rss                 120 page                 52 famille
  41747 urlset              109 timetable            51 top_article
   6901 feed                100 content              51 TEXT
   4556 XML_DIZ_INFO         94 mostknowntracks      50 NewsML
   3368 OpenSearchDescr      87 MD_Metadata          50 chapter
   2026 html                 80 playlist             47 gallery
   2025 rdf:RDF              76 dblp                 47 GA
    741 ead                  73 doc                  46 topic
    686 art                  72 pagina               46 objeto
    594 xml                  70 score-partwise       46 dosier
    510 TEI.2                69 rFactorXML           45 project
    445 Module               67 META                 45 plist
    379 w:wordDocument       66 data                 45 link:linkbase
    373 article              66 codeBook             44 publications
    351 sitemapindex         65 StoreExport          44 div
    326 root                 65 object               43 noticia
    296 metadata             64 tv                   43 games
    284 document             64 posts                42 timeline
    223 custombuttons        61 udhr                 42 plantilla
    205 rsd                  61 lom                  41 XML
    171 book                 60 MEMO_NOTICES         41 news
    155 Workbook             60 cooliris-quick       40 DayExtra
    147 SearchPlugin         59 Archive              40 dataset
    145 manifest             58 language             39 PAPER
    141 hexML                57 story                39 forum
    133 TEI                  56 un:pbmstore          38 SectionTest
    130 rfc                  55 ROOT                 38 ns:educationInf
    122 ONIXMessage          55 Results              37 SIRDOC
    122 dataroot             55 phpmyfaq             37 section
    120 Production_Data      55 kitweb               37 rfcdesc
  

Most common non-WF

  11015 rss                  19 issuer_qreport        7 score-partwise
   5079 html                 17 Osejs                 7 poet2
   1387 feed                 15 TABLE                 7 playlist
    786 XML_DIZ_INFO         15 doc                   7 page
    491 urlset               15 comm                  7 items
    331 !DOCTYPE             14 link                  7 irclog
    326 rdf:RDF              14 Language              7 channel
    209 br                   14 ArticleSet            7 album
    157 OpenSearchDescr      13 TEI.2                 6 ?xml
    117 ?xml-stylesheet      12 COLUMNA               6 table
     71 HTML                 11 root                  6 sect1
     65 yml_catalog          11 INVENTOR              6 SearchPlugin
     54 OBJETO               11 bacon                 6 scenario
     51 article              10 VFPData               6 INFORME_ANUAL
     48 document             10 noticias              6 hexML
     45 rFactorXML           10 main                  6 dwsync
     43 body                 10 DOCUMENT              6 CV
     42 xml                  10 content               6 catalogue
     40 script                9 rfc                   6 book
     38 Envelope              9 Materia               6 ANUNCIO
     37 NewsML                9 fn:news               5 xbel
     36 data                  9 !doctype              5 unit
     33 rayma                 8 url                   5 tt
     31 chapter               8 tax:taxonx            5 mm:style
     28 div                   8 site                  5 IDIOMAS
     26 Module                8 price                 5 head
     25 topic                 8 news                  5 b
     23 languages             7 yuduBook              5 a
     20 record                7 title                 4 w:wordDocument
     19 tree                  7 SHOP                  4 tmx
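
Tallies like the two above could be produced along these lines. This is a sketch only: the slides do not describe the actual extraction method, and `classify`, `tally` and the fallback regular expression are assumptions made for illustration. The idea is to take the first element the parser reports and, when the parse fails before any element is seen, fall back to the first tag-like token, which would explain entries such as !DOCTYPE and ?xml-stylesheet in the second table.

```python
import re
import xml.parsers.expat
from collections import Counter
from typing import Iterable, Optional, Tuple

# Fallback (an assumption): the first "<" followed by a run of
# characters that are not whitespace, ">" or "/".
FIRST_TAG = re.compile(rb"<([^\s>/]+)")


def classify(data: bytes) -> Tuple[bool, Optional[str]]:
    """Return (well_formed, name of the first element or tag-like token)."""
    names = []

    def on_start(name, attrs):
        # Only the first start tag (the outermost element) is recorded.
        if not names:
            names.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    try:
        parser.Parse(data, True)
        ok = True
    except xml.parsers.expat.ExpatError:
        ok = False

    if names:
        return ok, names[0]
    # The parse failed before any element was reported.
    match = FIRST_TAG.search(data)
    return ok, (match.group(1).decode("utf-8", "replace") if match else None)


def tally(docs: Iterable[bytes]) -> Tuple[Counter, Counter]:
    """Count first-element names separately for WF and non-WF inputs."""
    wf, non_wf = Counter(), Counter()
    for data in docs:
        ok, name = classify(data)
        (wf if ok else non_wf)[name] += 1
    return wf, non_wf


sample = [
    b"<rss><channel/></rss>",
    b"<rss><channel></rss>",
    b"<!DOCTYPE html <html>",
]
wf, non_wf = tally(sample)
print(wf.most_common(), non_wf.most_common())
```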
  

Compare

good vs bad [svg with layers]
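
One way to put the two tables side by side, as a rough sketch: the counts below are copied from the tables for the six names that lead both lists, and the function name `non_wf_rate` is made up for illustration. Names are matched exactly as printed, so HTML and html stay separate, and names that appear in only one table are left out.

```python
# (well-formed, not well-formed) counts copied from the two tables.
COUNTS = {
    "rss": (58447, 11015),
    "urlset": (41747, 491),
    "feed": (6901, 1387),
    "XML_DIZ_INFO": (4556, 786),
    "html": (2026, 5079),
    "rdf:RDF": (2025, 326),
}


def non_wf_rate(wf: int, non_wf: int) -> float:
    """Share of documents with this first element that are not well-formed."""
    return non_wf / (wf + non_wf)


# Print the names from the highest to the lowest non-WF share.
ranked = sorted(COUNTS.items(), key=lambda kv: non_wf_rate(*kv[1]), reverse=True)
for name, (wf, non_wf) in ranked:
    print(f"{name:14s} {non_wf_rate(wf, non_wf):6.1%}")
```

On these numbers html is the only one of the six where the ill-formed documents outnumber the well-formed ones, and urlset has by far the lowest share.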

Fish or Fowl?

this is no fish, but an Islander that hath lately suffered by a Thunderbolt!

Conclusions

Thank You

The young Weisskunig instructed in the Black Arts
Made by Hans Burgkmair the Elder in 1516, based on Der Weisskunig.

Questions, Discussion, Comments

Photo Credits

1486 Printer of the Erwählung Maximillians / Electio Maximilliani
Photo by elmastudio/flickr

Triton-like & other sea-monsters - Carolingian miniature folio-124r
The Stuttgart Psalter, in the Ernest T. DeWald Collection at the Württembergische Landes-Bibliothek; image flickr/petrus.agricola

1957 Zündapp Janus car, photo Georg Schwalbach (GS1311) / flickr

Diary of a Dark Priest - photo Liam Quin / www.fromoldbooks.org

Henry VIII eyes - engraving, fromoldbooks.org

The young Weisskunig instructed in the Black Arts (XML?); image flickr/Kintzertorium