HTML (and XML) utilities

Some simple C programs. Find them in this directory.

count [HTML-file]
An example of using the parser (but not the tree module) to count elements and attributes. Counting currently uses a simple linear search; for real use on large quantities of HTML, it should probably be changed to use a hash table.
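A minimal sketch of the hash-table approach suggested above. This is not code from count itself: the names count_name, get_count and NBUCKETS are hypothetical, and the table uses simple chaining with a fixed number of buckets.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024		/* hypothetical size, chosen arbitrarily */

/* One counted name (element or attribute), chained per bucket. */
typedef struct entry {
  char *name;
  unsigned long count;
  struct entry *next;
} entry;

static entry *buckets[NBUCKETS];

/* djb2-style string hash, reduced to a bucket index. */
static unsigned long hash(const char *s)
{
  unsigned long h = 5381;
  while (*s) h = h * 33 + (unsigned char)*s++;
  return h % NBUCKETS;
}

/* Increment the count for name, adding it on first sight.
   Returns the new count, or 0 if memory ran out.
   Names are compared case-sensitively; the caller is assumed to
   have normalized the case of HTML names already. */
unsigned long count_name(const char *name)
{
  unsigned long i;
  entry *e;

  assert(name != NULL);
  i = hash(name);
  for (e = buckets[i]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return ++e->count;

  if (!(e = malloc(sizeof(*e)))) return 0;
  if (!(e->name = malloc(strlen(name) + 1))) { free(e); return 0; }
  strcpy(e->name, name);
  e->count = 1;
  e->next = buckets[i];		/* push on the front of the chain */
  buckets[i] = e;
  return 1;
}

/* Return the current count for name, or 0 if it was never seen. */
unsigned long get_count(const char *name)
{
  entry *e;

  assert(name != NULL);
  for (e = buckets[hash(name)]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return e->count;
  return 0;
}
```

With this in place, the parser callbacks would call count_name() once per element or attribute, and each lookup costs roughly constant time instead of a scan over all names seen so far.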
htmlclean [HTML-file]
Basically a test program for the HTML parser: it reads a file and dumps the tree to stdout, which results in all missing tags being inserted. Note: empty elements are written the XML-way: <BR />. Can be run on its own output.
normalize [-x] [-l length] [-i indent] [HTML-file]
Pretty-prints an HTML file to stdout. Options are: -x for XML-compatible mark-up; -l N to set the maximum length of lines; -i N to set the additional indent for each level of nesting. Can be run on its own output.
num [options] [HTML-file]
Number all or some H* headers, in many different styles. Output to stdout. Try "num -?" for options. Insert CLASS="no-num" in the headers that you don't want numbered. Can be run on its own output.
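To illustrate the idea behind header numbering, here is a sketch of one possible style (dotted decimal, "2.1.3"). It is an assumption about how such numbering can be done, not the code of num: the function next_number and the counters array are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXLEVEL 6		/* H1 .. H6 */

static int counters[MAXLEVEL];	/* counters[0] belongs to H1 */

/* Advance the numbering for a header of the given level (1..6) and
   write a dotted label such as "2.1.3" into buf (capacity cap).
   Returns buf, or NULL if the label does not fit. */
char *next_number(int level, char *buf, size_t cap)
{
  size_t used = 0;
  int i, n;

  assert(level >= 1 && level <= MAXLEVEL && buf != NULL);
  counters[level - 1]++;
  for (i = level; i < MAXLEVEL; i++) counters[i] = 0; /* reset deeper levels */

  for (i = 0; i < level; i++) {
    n = snprintf(buf + used, cap - used, "%s%d", i ? "." : "", counters[i]);
    if (n < 0 || (size_t)n >= cap - used) return NULL;
    used += (size_t)n;
  }
  return buf;
}
```

A header marked CLASS="no-num" would simply be skipped, without calling next_number(), so that it does not disturb the counters.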
pipe [-l] [HTML-file]
Parse an HTML file and output it in a form similar to what nsgmls does. However, missing end tags are not added, and the file is not "corrected" in any way. Option -l inserts line numbers in the output. Example: to clean up a file, infer missing tags, and produce pipe-ready output, use the pipeline "htmlclean file | pipe".
toc [-l low] [-h high] [HTML-file]
Insert a table of contents, add IDs to headers, and add <A NAME=...> target anchors with the same IDs. The modified source is written to stdout. The place of the table is indicated by <!--toc--> in the source, or by the pair <!--begin-toc--> and <!--end-toc-->. (Whatever is between those two comments is discarded.) Options -l and -h set the lowest and highest header levels to include in the ToC; levels range from 1 to 6. Can be run on its own output.
multitoc [-x] [-s text ] [-e text ] [-l low | -h high | -b base | HTML-file]+
Reads multiple files and generates a single ToC on stdout.
unent [file]
Expand entities to UTF-8 sequences. Input must be in UTF-8, with &-entities defined by HTML 4; output will be in UTF-8 without those entities. Unknown entities are not expanded.
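Once an entity such as &eacute; has been looked up and found to stand for a code point (U+00E9 in this case), expanding it comes down to writing that code point as UTF-8. The following is a sketch of that step only; the function name utf8_encode is hypothetical and is not taken from the source of unent.

```c
#include <assert.h>
#include <stddef.h>

/* Write code point c as UTF-8 into buf (room for at least 4 bytes).
   Returns the number of bytes written, or 0 if c is a surrogate or
   lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long c, unsigned char *buf)
{
  assert(buf != NULL);
  if (c < 0x80) {			/* 1 byte: 0xxxxxxx */
    buf[0] = (unsigned char)c;
    return 1;
  } else if (c < 0x800) {		/* 2 bytes: 110xxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xC0 | (c >> 6));
    buf[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
  } else if (c < 0x10000) {		/* 3 bytes */
    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* surrogates are not characters */
    buf[0] = (unsigned char)(0xE0 | (c >> 12));
    buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
  } else if (c <= 0x10FFFF) {		/* 4 bytes */
    buf[0] = (unsigned char)(0xF0 | (c >> 18));
    buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
  }
  return 0;
}
```

For example, utf8_encode(0xE9, buf) stores the two bytes C3 A9. An unknown entity has no code point to encode, which fits with leaving it unexpanded.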
xml2asc, asc2xml
This pair of programs translates between UTF-8 and ASCII: xml2asc converts UTF-8 files to ASCII files with &#-entities, and asc2xml does the reverse.
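The UTF-8 to ASCII direction can be sketched as follows. This is an illustration under stated assumptions, not the source of xml2asc: the function utf8_to_ascii is hypothetical, it writes decimal references (&#233;) although hexadecimal ones would serve equally well, and it does not reject overlong UTF-8 forms.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Convert the UTF-8 string in to ASCII in out (capacity cap),
   replacing each non-ASCII character by a decimal &#N; reference.
   Returns 0 on success, -1 if the input is malformed or out is
   too small. */
int utf8_to_ascii(const char *in, char *out, size_t cap)
{
  const unsigned char *s = (const unsigned char *)in;
  size_t used = 0;

  assert(in != NULL && out != NULL);
  while (*s) {
    unsigned long c;
    int extra, n;

    /* The lead byte tells how many continuation bytes follow. */
    if (*s < 0x80)                { c = *s;        extra = 0; }
    else if ((*s & 0xE0) == 0xC0) { c = *s & 0x1F; extra = 1; }
    else if ((*s & 0xF0) == 0xE0) { c = *s & 0x0F; extra = 2; }
    else if ((*s & 0xF8) == 0xF0) { c = *s & 0x07; extra = 3; }
    else return -1;		/* stray continuation or invalid byte */
    s++;
    while (extra-- > 0) {
      if ((*s & 0xC0) != 0x80) return -1; /* truncated sequence */
      c = (c << 6) | (*s++ & 0x3F);
    }

    if (c < 0x80) {
      if (used + 2 > cap) return -1; /* need room for char and final NUL */
      out[used++] = (char)c;
    } else {
      n = snprintf(out + used, cap - used, "&#%lu;", c);
      if (n < 0 || (size_t)n >= cap - used) return -1;
      used += (size_t)n;
    }
  }
  if (used >= cap) return -1;
  out[used] = '\0';
  return 0;
}
```

The reverse direction would parse each &#N; reference back into a number and write that code point as UTF-8.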
wls
List all links in one or more HTML files.