A Handy Line Breaking Algorithm

Essentially per Arjun Ray's suggestion of 20 Feb 2000:

  1. These are immediately followed by a newline:
    1. The generic identifier of a start-tag.
    2. The generic identifier of an end-tag.
    3. The target of a PI.
  2. Each attribute specification is on a separate line (i.e., it ends with #xA).
  3. These all start on a new line:
    1. The '>' or '/>' of a start-tag (as a consequence of Rules 1 and 2).
    2. The '>' of an end-tag (from Rule 1).
    3. The '?>' terminating a PI, usually by the insertion of an immediately preceding #xA.
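
The rules above can be sketched as a small SAX handler. This is only an illustration, not the XMLWriter.py mentioned later in this note: the names LineBreakWriter and line_break are made up here, and escaping newlines in character data as &#10; is an assumption taken from the examples in this note. It follows Rule 1.3 literally, so a PI target gets a line of its own.

```python
import io
import xml.sax
from xml.sax.saxutils import escape


class LineBreakWriter(xml.sax.ContentHandler):
    """Hypothetical handler that writes XML using Rules 1-3."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        # Rule 1.1: newline right after the generic identifier.
        self.out.write("<" + name + "\n")
        # Rule 2: one attribute specification per line.
        for attr_name, value in attrs.items():
            quoted = escape(value, {'"': "&quot;", "\n": "&#10;"})
            self.out.write(attr_name + '="' + quoted + '"\n')
        # Rule 3.1: the '>' starts a new line.
        self.out.write(">")

    def endElement(self, name):
        # Rules 1.2 and 3.2.
        self.out.write("</" + name + "\n>")

    def characters(self, content):
        # Assumption: newlines in data become &#10; so that every
        # physical line starts with markup, as in the examples.
        self.out.write(escape(content, {"\n": "&#10;"}))

    def processingInstruction(self, target, data):
        # Rules 1.3 and 3.3; PI data cannot be escaped, so newlines
        # inside it are passed through unchanged.
        self.out.write("<?" + target + "\n")
        if data:
            self.out.write(data + "\n")
        self.out.write("?>")


def line_break(xml_bytes):
    """Parse xml_bytes and return the line-broken serialization."""
    out = io.StringIO()
    xml.sax.parseString(xml_bytes, LineBreakWriter(out))
    return out.getvalue()


if __name__ == "__main__":
    print(line_break(b'<e-mail type="internet">sean@digitome.com</e-mail>'))
```

The sketch never writes '/>', the XML declaration, comments or the DOCTYPE; SAX reports empty elements as a start/end pair, so they come out as two tags.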


Taking the examples from the Pyxie article:

  <?A4TypeSetter PageBreak?>
  <e-mail type="internet">sean@digitome.com</e-mail>


>&#10;<?A4TypeSetter PageBreak
?>&#10;  <Surname
>&#10;  <Given
>&#10;  <e-mail

Example 2:

<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE foo SYSTEM "http://www.digitome.com/foo.dtd">
<!-- This document has a <foo> element -->
<foo not="not">
Although this looks like another <foo> start-tag
it is not.


<?xml version="1.0" encoding="us-ascii"
?><!DOCTYPE foo
SYSTEM "http://www.digitome.com/foo.dtd">&#10;<!-- This document has a <foo> element 
>&#10;Although this looks like another &lt;foo&gt; start-tag&#10;it is not.

Element Counting, grepping

counting the number of foo elements:

grep -c '<foo$' my.xml

finding 'not' as an attribute name:

grep '^not=' my.xml
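
The same two searches can be sketched without grep. The function names below are made up for illustration, and the input is assumed to be text already broken per the rules above.

```python
def count_elements(text, name):
    """Count lines ending in '<name', i.e. start-tags of that element."""
    suffix = "<" + name
    return sum(1 for line in text.split("\n") if line.endswith(suffix))


def lines_with_attribute(text, attr):
    """Return the lines that begin with 'attr=', i.e. uses of that attribute."""
    prefix = attr + "="
    return [line for line in text.split("\n") if line.startswith(prefix)]


if __name__ == "__main__":
    sample = ('<foo\nnot="not"\n>Although this looks like another '
              '&lt;foo&gt; start-tag&#10;it is not.</foo\n>')
    print(count_elements(sample, "foo"))
    print(lines_with_attribute(sample, "not"))
```

Because '<' in character data is escaped and each generic identifier ends its line, the end-tag '</foo' and the text '&lt;foo&gt;' are not counted.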

the general case, in perl:

  # $_ holds one line of the line-broken document
  if(s/^(?:-->|\?>|>)([^<]*)//){ # end of markup
    my($data) = $1; # character data up to the next '<'
    # process data
  }
  elsif(s/^SYSTEM \"([^\"]+)\">//){
    my($sysid) = $1; # may need unescaping?
  }
  elsif(s/^PUBLIC \"([^\"]+)\"//){
    my($pubid) = $1; # may need unescaping?
  }

  # the more specific '</', '<!--' and '<?' must be tried before '<'
  if(s,^</(\S+),,){ # end tag
    my($name) = $1;
  }
  elsif(s,^<!--(.*),,){ # comment
    my($comment) = $1;
  }
  elsif(s,^<\?(.*),,){ # processing instruction
    my($pi) = $1;
  }
  elsif(s/^<(\S+)//){ # start tag
    my($name) = $1;
  }
  elsif(s/^([a-zA-Z_][^=]*)=\"([^\"]*)\"//){ # attribute
    my($name, $val) = ($1, $2);
    # note that $val still has escaped &lt;s and such
  }
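
For comparison, here is a rough Python rendering of the same per-line dispatch. It is a sketch under assumptions, not a parser: the function name scan and the event tuples are invented here, attribute values are assumed to be double-quoted, and comments, DOCTYPE lines, PI data and '/>' are not handled (such lines come back as "other").

```python
import re
from xml.sax.saxutils import unescape

# A closing delimiter at the start of a line, then character data.
LINE_START = re.compile(r"(?:\?>|>)([^<]*)")
END_TAG = re.compile(r"</(\S+)$")
PI_TARGET = re.compile(r"<\?(\S+)$")
START_TAG = re.compile(r"<([^\s/?!]\S*)$")
ATTRIBUTE = re.compile(r'([A-Za-z_][^=]*)="([^"]*)"$')
EXTRA = {"&quot;": '"', "&#10;": "\n"}


def scan(text):
    """Turn line-broken XML text into a list of event tuples."""
    events = []
    for line in text.split("\n"):
        m = LINE_START.match(line)
        if m:
            if m.group(1):
                events.append(("data", unescape(m.group(1), EXTRA)))
            line = line[m.end():]
        if not line:
            continue
        # Try the more specific openers before the plain start-tag.
        for kind, pattern in (("end", END_TAG),
                              ("pi", PI_TARGET),
                              ("start", START_TAG)):
            m = pattern.match(line)
            if m:
                events.append((kind, m.group(1)))
                break
        else:
            m = ATTRIBUTE.match(line)
            if m:
                value = unescape(m.group(2), EXTRA)
                events.append(("attr", m.group(1), value))
            else:
                events.append(("other", line))
    return events


if __name__ == "__main__":
    doc = '<e-mail\ntype="internet"\n>sean@digitome.com</e-mail\n>'
    for event in scan(doc):
        print(event)
```

Unlike the Perl fragment, this version unescapes data and attribute values before reporting them.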


Comparison with PYX


see also: XMLWriter.py, a SAX Handler that implements a tweak on this algorithm (plus some indentation stuff).


Mar. 15, 2000
Pyxie in xml.com by Sean McGrath
see also: Pyxie - XML Processing Library for Python
Feb 20 2000
Comments on the WD - A proposed alternative Arjun Ray (Sun, Feb 20 2000)
(my reply of 23 Mar)
comments from Paul G. about how Arbortext stuff does line breaking
tidy release -asxml to turn HTML into XHTML, -i flag for indented output
Amaya is released with XHTML support

Dan Connolly, Aug 2000
$Revision: 1.11 $ of $Date: 2001/01/04 18:56:50 $ by $Author: connolly $