A Handy Line Breaking Algorithm

Essentially per Arjun Ray's suggestion of 20Feb2000:

  1. These are immediately followed by a newline:
    1. The generic identifier of a start-tag.
    2. The generic identifier of an end-tag.
    3. The target of a PI.
  2. Each attribute specification is on a separate line (i.e. ends with a #xA).
  3. These all start on a new line:
    1. The '>' or '/>' of a start-tag (as a consequence of Rules 1 and 2).
    2. The '>' of an end-tag (from Rule 1).
    3. The '?>' terminating a PI, usually by the insertion of an immediately preceding #xA.
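
The rules above can be sketched as a small SAX handler in Python. This is only an illustration under stated assumptions: the names LineBreakWriter and line_break are hypothetical (they are not taken from XMLWriter.py), and because a plain SAX ContentHandler never sees the XML declaration, the document type declaration, or comments, this sketch covers elements, attributes, PIs and character data only.

```python
import io
import xml.sax
from xml.sax.saxutils import escape


class LineBreakWriter(xml.sax.ContentHandler):
    """Hypothetical SAX handler; the class name is illustrative only."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        # Rule 1.1: the generic identifier is immediately followed by a newline.
        self.out.write("<" + name + "\n")
        # Rule 2: each attribute specification ends with a newline.
        for attr_name in attrs.getNames():
            value = escape(attrs.getValue(attr_name),
                           {'"': "&quot;", "\n": "&#10;"})
            self.out.write('%s="%s"\n' % (attr_name, value))
        # Rule 3.1: the closing '>' therefore starts a new line.
        self.out.write(">")

    def endElement(self, name):
        # Rules 1.2 and 3.2: newline after the name, '>' on the next line.
        self.out.write("</" + name + "\n>")

    def processingInstruction(self, target, data):
        # Rule 3.3: a newline is inserted just before '?>'.
        self.out.write("<?%s %s\n?>" % (target, data))

    def characters(self, content):
        # Newlines in data become character references, so every literal
        # newline in the output sits inside markup.
        self.out.write(escape(content, {"\n": "&#10;"}))


def line_break(xml_text):
    """Parse xml_text and return it re-serialized with the rules above."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), LineBreakWriter(out))
    return out.getvalue()


print(line_break('<e-mail type="internet">sean@digitome.com</e-mail>'))
```

Escaping newlines in character data is what makes the format grep-friendly: a line can only begin with markup, never with data.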

Example

Taking the examples from the Pyxie article:

<Person>
  <?A4TypeSetter PageBreak?>
  <Surname>McGrath</Surname>
  <Given>Sean</Given>
  <e-mail type="internet">sean@digitome.com</e-mail>
</Person>

becomes

<Person
>&#10;  <?A4TypeSetter PageBreak
?>&#10;  <Surname
>McGrath</Surname
>&#10;  <Given
>Sean</Given
>&#10;  <e-mail
type="internet"
>sean@digitome.com</e-mail
>&#10;</Person
>
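
One way to convince yourself that the line-broken form carries the same content is to parse both forms and compare the results. The following is a minimal sketch using Python's xml.etree.ElementTree on a cut-down version of the example; the helper name shape and the two constants are hypothetical, and the default ElementTree parser ignores PIs and comments, so those are not compared here.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the example, before and after line breaking.
ORIGINAL = (
    '<Person>\n'
    '  <Surname>McGrath</Surname>\n'
    '  <e-mail type="internet">sean@digitome.com</e-mail>\n'
    '</Person>'
)
BROKEN = (
    '<Person\n'
    '>&#10;  <Surname\n'
    '>McGrath</Surname\n'
    '>&#10;  <e-mail\n'
    'type="internet"\n'
    '>sean@digitome.com</e-mail\n'
    '>&#10;</Person\n'
    '>'
)


def shape(elem):
    """Return tag, attributes, text, tail and children as nested tuples."""
    return (elem.tag, sorted(elem.attrib.items()), elem.text, elem.tail,
            [shape(child) for child in elem])


# Whitespace inside tags is not data, and &#10; parses back to a newline,
# so both documents yield the same tree.
same = shape(ET.fromstring(ORIGINAL)) == shape(ET.fromstring(BROKEN))
print(same)
```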

Example 2:

<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE foo SYSTEM "http://www.digitome.com/foo.dtd">
<!-- This document has a <foo> element -->
<foo not="not">
<![CDATA[
Although this looks like another <foo> start-tag
it is not.
]]>
&#x20;Hello
</foo>

becomes

<?xml version="1.0" encoding="us-ascii"
?><!DOCTYPE foo
SYSTEM "http://www.digitome.com/foo.dtd"><!-- This document has a <foo> element 
--><foo
not="not"
>&#10;&#10;Although this looks like another &lt;foo&gt; start-tag&#10;it is not.&#10;&#10;&#x20;Hello&#10;</foo
>

Element Counting, grepping

Counting the number of foo elements:

grep -c '<foo$' my.xml

Finding 'not' as an attribute name:

grep '^not=' my.xml
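
The same line-oriented tests can be written in Python when a tally of every element and attribute name is wanted rather than one name at a time. This is a minimal sketch; the function name tally and the two regular expressions are assumptions of this example, and like the grep commands it can be fooled by comment text that happens to end a line with something like '<foo'.

```python
import re
from collections import Counter

# A start-tag's generic identifier is the last thing on its line.
START_TAG = re.compile(r"<([^\s<>/!?=]+)$")
# An attribute specification starts its line with the attribute name.
ATTRIBUTE = re.compile(r"""^([^\s<>/!?="']+)=["']""")


def tally(lines):
    """Count element and attribute names in line-broken XML."""
    elements, attributes = Counter(), Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        start = START_TAG.search(line)
        if start:
            elements[start.group(1)] += 1
        attr = ATTRIBUTE.match(line)
        if attr:
            attributes[attr.group(1)] += 1
    return elements, attributes


sample = ['<foo', 'not="not"', '>Hello</foo', '>']
elements, attributes = tally(sample)
print(elements["foo"], attributes["not"])
```

Passing an open file object instead of the sample list works the same way, since tally only iterates over lines.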

The general case, in Perl:

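while(<>){
  chomp; # literal newlines only ever occur inside markup
  if(s/^(-->|\?>|>)//){ # end of markup
    if(s/^([^<]+)//){
      my($data) = $1;
      # process data; note that it still has escaped &lt;s and &#10;s
    }
  }
  elsif(s/^SYSTEM \"([^\"]+)\">//){
    my($sysid) = $1; # may need unescaping?
  }
  elsif(s/^PUBLIC \"([^\"]+)\"//){
    my($pubid) = $1; # may need unescaping?
  }

  # Test the more specific prefixes first: the start-tag pattern
  # would otherwise also match '</', '<!--', '<?' and '<!DOCTYPE'.
  if(s,^</(\S+),,){ # end tag
    my($name) = $1;
  }
  elsif(s,^<!--(.*),,){ # comment
    my($comment) = $1;
  }
  elsif(s,^<\?(.*),,){ # processing instruction
    my($pi) = $1;
  }
  elsif(s,^<!DOCTYPE\s+(\S+),,){ # document type declaration
    my($doctype) = $1;
  }
  elsif(s/^<(\S+)//){ # start tag
    my($name) = $1;
  }
  elsif(s/^([a-zA-Z_][^=]*)=\"([^\"]*)\"//){ # attribute
    my($name, $val) = ($1, $2);
    # note that $val still has escaped &lt;s and such
  }
}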

Comparison with PYX

TODO

See also: XMLWriter.py, a SAX handler that implements a tweak on this algorithm (plus some indentation stuff).

History

Mar. 15, 2000
Pyxie in xml.com by Sean McGrath
see also: Pyxie - XML Processing Library for Python
Feb 20 2000
Comments on the WD - A proposed alternative Arjun Ray (Sun, Feb 20 2000)
(my reply of 23 Mar)
@@when?
comments from Paul G. about how arbortext stuff does line breaking
@@when?
tidy release -asxml to turn HTML into XHTML, -i flag for indented output
@@when?
Amaya is released with XHTML support

Dan Connolly, Aug 2000
$Revision: 1.11 $ of $Date: 2001/01/04 18:56:50 $ by $Author: connolly $