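Essentially per Arjun Ray's suggestion of 20Feb2000: put the line breaks inside the markup, just before the closing delimiter of each tag and before each attribute, so that every line starts with a recognizable token.
Taking the examples from the pyxie article, example 1: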
<Person> <?A4TypeSetter PageBreak?> <Surname>McGrath</Surname> <Given>Sean</Given> <e-mail type="internet">sean@digitome.com</e-mail> </Person>
becomes
<Person
> <?A4TypeSetter PageBreak
?> <Surname
>McGrath</Surname
> <Given
>Sean</Given
> <e-mail
type="internet"
>sean@digitome.com</e-mail
> </Person
>
Example 2:
<?xml version="1.0" encoding="us-ascii"?> <!DOCTYPE foo SYSTEM "http://www.digitome.com/foo.dtd"> <!-- This document has a <foo> element --> <foo not="not"> <![CDATA[ Although this looks like another <foo> start-tag it is not. ]]>  Hello </foo>
becomes
<?xml version="1.0" encoding="us-ascii"
?><!DOCTYPE foo
SYSTEM "http://www.digitome.com/foo.dtd"> <!-- This document has a <foo> element
--> <foo
not="not"
> <![CDATA[ Although this looks like another <foo> start-tag it is not. ]]>  Hello </foo
>
counting the number of foo elements:
grep -c '<foo$' my.xml
finding 'not' as an attribute name:
grep '^not=' my.xml
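The same two queries can be sketched in Python. This is only an illustration: the function names are made up here, and the sample text assumes the second example is laid out with line breaks inside the markup (after each start-tag name and before each attribute).

```python
import re

# Second pyxie example, with line breaks only inside markup.
# The exact layout is an assumption made for this sketch.
SAMPLE = """\
<?xml version="1.0" encoding="us-ascii"
?><!DOCTYPE foo
SYSTEM "http://www.digitome.com/foo.dtd"> <!-- This document has a <foo> element
--> <foo
not="not"
> <![CDATA[ Although this looks like another <foo> start-tag it is not. ]]>  Hello </foo
>
"""


def count_start_tags(text, name):
    """Count lines ending in '<name', like grep -c '<name$'."""
    pattern = re.compile(r"<" + re.escape(name) + r"$")
    return sum(1 for line in text.splitlines() if pattern.search(line))


def attribute_lines(text, name):
    """Return lines beginning with 'name=', like grep '^name='."""
    prefix = name + "="
    return [line for line in text.splitlines() if line.startswith(prefix)]


print(count_start_tags(SAMPLE, "foo"))  # prints 1
print(attribute_lines(SAMPLE, "not"))   # prints ['not="not"']
```

Neither the <foo> inside the comment nor the one inside the CDATA section is counted, because neither sits at the end of a line; likewise the word "not" in the text never starts a line followed by "=".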
the general case, in perl:
while(<>){
    chomp; # otherwise the trailing newline shows up as data
    if(s/^(-->|\?>|>)//){ # end of markup
        if(s/^([^<]+)//){
            my($data) = $1;
            # process data
        }
    }
    elsif(s/^SYSTEM \"([^\"]+)\">//){
        my($sysid) = $1; # may need unescaping?
    }
    elsif(s/^PUBLIC \"([^\"]+)\"//){
        my($pubid) = $1; # may need unescaping?
    }
    # end tag, comment and PI are tested before start tag,
    # because the start-tag pattern would match them too
    if(s,^</(\S+),,){ # end tag
        my($name) = $1;
    }
    elsif(s,^<!--(.*),,){ # comment
        my($comment) = $1;
    }
    elsif(s,^<\?(.*),,){ # processing instruction
        my($pi) = $1;
    }
    elsif(s/^<(\S+)//){ # start tag (also catches <!DOCTYPE)
        my($name) = $1;
    }
    elsif(s/^([a-zA-Z_][^=]*)=\"([^\"]*)\"//){ #attribute
        my($name, $val) = ($1, $2);
        # note that $val still has escaped <s and such
    }
}
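For comparison, here is a rough Python port of the loop above, run on the first example. It is a sketch, not a parser: the event names and the events() function are invented for illustration, the sample layout is assumed, and escaped characters are left as they are.

```python
import re

END_MARKUP = re.compile(r"^(-->|\?>|>)")
DATA = re.compile(r"^([^<]+)")
# Order matters: the start-tag pattern would also match end tags,
# comments and processing instructions, so it is tried after them.
TOKENS = [
    ("system_id", re.compile(r'^SYSTEM "([^"]+)">')),
    ("public_id", re.compile(r'^PUBLIC "([^"]+)"')),
    ("end_tag", re.compile(r"^</(\S+)")),
    ("comment", re.compile(r"^<!--(.*)")),
    ("pi", re.compile(r"^<\?(.*)")),
    ("start_tag", re.compile(r"^<(\S+)")),
    ("attribute", re.compile(r'^([A-Za-z_][^=]*)="([^"]*)"')),
]

# First pyxie example with line breaks inside the markup (assumed layout).
SAMPLE = """\
<Person
> <?A4TypeSetter PageBreak
?> <Surname
>McGrath</Surname
> <Given
>Sean</Given
> <e-mail
type="internet"
>sean@digitome.com</e-mail
> </Person
>
"""


def events(lines):
    """Yield (kind, value) pairs, looking at one line at a time."""
    for line in lines:
        line = line.rstrip("\n")
        m = END_MARKUP.match(line)
        if m:
            # A line starting with >, ?> or --> closes the previous
            # markup; whatever follows up to the next '<' is data.
            line = line[m.end():]
            m = DATA.match(line)
            if m:
                yield ("data", m.group(1))
                line = line[m.end():]
        for kind, pattern in TOKENS:
            m = pattern.match(line)
            if m:
                value = m.groups() if kind == "attribute" else m.group(1)
                yield (kind, value)
                break


for kind, value in events(SAMPLE.splitlines()):
    print(kind, repr(value))
```

Like the Perl, this reads at most one data run and one markup token per line, which is all the line-broken layout needs.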
see also: XMLWriter.py, a SAX Handler that implements a tweak on this algorithm (plus some indentation stuff).
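To give a feel for the writing side, here is a minimal SAX handler sketch in Python. It is not XMLWriter.py: the class and function names are hypothetical, it does no indentation, and it ignores comments, CDATA sections and the DOCTYPE, which the basic SAX ContentHandler does not report.

```python
import io
import xml.sax
from xml.sax.saxutils import escape


class LineBreakInMarkupWriter(xml.sax.ContentHandler):
    """Hypothetical writer: newlines appear only inside markup."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        # Element name ends its line; each attribute gets its own line.
        self.out.write("<" + name + "\n")
        for attr_name in attrs.getNames():
            value = escape(attrs.getValue(attr_name),
                           {'"': "&quot;", "\n": "&#10;"})
            self.out.write(attr_name + '="' + value + '"\n')
        self.out.write(">")

    def endElement(self, name):
        self.out.write("</" + name + "\n>")

    def characters(self, content):
        # Newlines in data become character references so that a
        # physical line break never occurs outside markup.
        self.out.write(escape(content, {"\n": "&#10;"}))

    def processingInstruction(self, target, data):
        self.out.write("<?" + target + (" " + data if data else "") + "\n?>")


def to_line_format(xml_text):
    """Parse xml_text and return it with line breaks inside the tags."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), LineBreakInMarkupWriter(out))
    return out.getvalue()


SOURCE = ('<Person> <?A4TypeSetter PageBreak?> <Surname>McGrath</Surname> '
          '<Given>Sean</Given> <e-mail type="internet">sean@digitome.com'
          '</e-mail> </Person>')

print(to_line_format(SOURCE))
```

Because escape() turns every > in data into &gt;, a line that starts with > can only be the end of a tag, which is what the greps and the reader loop rely on.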