HTML (and XML) utilities

Some simple C programs. Find them in the original directory.

count [HTML-file]: An example of using the parser (but not the tree module), to count elements and attributes. The count currently uses a simple linear search. To use it for real on large quantities of HTML, the algorithm should probably be changed to use a hash table.
htmlclean [HTML-file]: Basically a test program for the HTML parser: it reads a file and dumps the tree to stdout, which results in all missing tags being inserted. Note: empty elements are written the XML way: <BR />. Can be run on its own output. (tidy is a more powerful tool.)
normalize [-x] [-l length] [-i indent] [HTML-file]: Pretty-prints an HTML file to stdout. Options are: -x for XML-compatible mark-up; -l N to set the maximum length of lines; -i N to set the additional indent for each level of nesting. Can be run on its own output.
num [options] [HTML-file]: Number all or some H* headers, in many different styles. Output to stdout. Try "num -?" for options. Insert CLASS="no-num" in the headers that you don't want numbered. Can be run on its own output.
pipe [-l] [HTML-file]: Parse an HTML file and output it in a form similar to what nsgmls does. However, missing end tags are not added, and the file is not "corrected" in any way. Option -l inserts line numbers in the output. Example: to clean up a file first (inferring missing tags) and then convert it to this form, use the pipeline "htmlclean file | pipe".
toc [-l low] [-h high] [HTML-file]: Inserts a table of contents, adds IDs to headers, and adds <A NAME=...> target anchors with the same IDs. The modified source is written to stdout. The place of the table is indicated by a comment in the source, or by a pair of comments (whatever is between those two comments is discarded). Options -l and -h indicate the lowest and highest header levels to include in the ToC. Levels can be from 1 to 6. Can be run on its own output.
multitoc [-x] [-s text] [-e text] [-l low | -h high | -b base | HTML-file]+: Reads multiple files and generates a single ToC on stdout.
unent [file]: Expand entities to UTF-8 sequences. Input must be in UTF-8, with &-entities defined by HTML 4; output will be in UTF-8 without those entities. Unknown entities are not expanded.
xml2asc, asc2xml: This pair of programs translates UTF-8 files to ASCII files with &#-entities and vice-versa.
wls [-l] [-r] [-h] [-b base] [HTML-file]: List all links in one or more HTML files.
htmlprune: Removes all elements (with their contents) with a class of "exclude" (or any other class given on the command line).
incl [-x] [-a attrib] [-c class] [-b base] [file-or-URL]: Expand included files: elements with a certain attribute (by default a "class" attribute including the keyword "include") are replaced by the file named by the element's content. The file may be a URL as well.
cite [-b base] [-p template] [-a auxfile] bibfile [HTML-file]: Look for bibliographic references of the form [[key]] and replace them with a hyperlink to a bibliography, according to a certain replacement template. Uses a refer-style database of citations.
mkbib [-a auxfile] [-s sep] bibfile [HTML-file]: Create a bibliography. (Companion program for cite.) Take the list of keys found by cite and a template for a bibliography and insert the full citations in the template. Uses a refer-style database of citations.
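
The entry for count suggests replacing its linear search with a hash table. The following is a minimal sketch of what that could look like, not the code used by count itself. The names count_name, lookup_count and NBUCKETS, the fixed table size and the choice of the djb2 string hash are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* hypothetical fixed table size, chosen for illustration */

/* One entry per distinct element or attribute name; collisions are chained. */
typedef struct Entry {
  char *name;
  unsigned long count;
  struct Entry *next;
} Entry;

static Entry *buckets[NBUCKETS];

/* djb2 string hash (an assumption; any reasonable string hash would do). */
static unsigned long hash(const char *s)
{
  unsigned long h = 5381;
  while (*s) h = h * 33 + (unsigned char)*s++;
  return h;
}

/* Record one occurrence of name. Returns the new count, or 0 if out of memory. */
unsigned long count_name(const char *name)
{
  unsigned long i;
  Entry *e;

  assert(name != NULL);
  i = hash(name) % NBUCKETS;

  /* Only the entries in one bucket are compared, not the whole list. */
  for (e = buckets[i]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return ++e->count;

  /* Not seen before: create an entry at the head of the chain. */
  if (!(e = malloc(sizeof(*e)))) return 0;
  if (!(e->name = malloc(strlen(name) + 1))) { free(e); return 0; }
  strcpy(e->name, name);
  e->count = 1;
  e->next = buckets[i];
  buckets[i] = e;
  return 1;
}

/* Return the number of recorded occurrences of name, or 0 if never seen. */
unsigned long lookup_count(const char *name)
{
  Entry *e;

  assert(name != NULL);
  for (e = buckets[hash(name) % NBUCKETS]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return e->count;
  return 0;
}

#ifdef COUNT_DEMO   /* compile with -DCOUNT_DEMO to get a small demo program */
int main(void)
{
  const char *seen[] = {"p", "a", "p", "br", "p"};
  size_t i;

  for (i = 0; i < sizeof(seen) / sizeof(seen[0]); i++) count_name(seen[i]);
  printf("p: %lu\n", lookup_count("p"));
  return 0;
}
#endif
```

With chaining, each update costs roughly the length of one bucket's chain rather than the length of the whole list of names. A fixed number of buckets keeps the sketch short; a table that grows with the number of distinct names would be the usual refinement.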
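
To illustrate the kind of translation xml2asc performs (UTF-8 in, ASCII with &#-entities out), here is a minimal sketch. It is not the source of xml2asc. The function names, the use of decimal (rather than hexadecimal) character references and the error handling are assumptions, and the decoder is simplified: it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence starting at s. Stores the code point in *cp and
   returns the number of bytes consumed, or 0 if the sequence is malformed.
   Simplification: overlong forms and surrogates are not rejected. */
static int decode_utf8(const unsigned char *s, unsigned long *cp)
{
  int n, i;

  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
  else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
  else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
  else return 0;

  for (i = 1; i < n; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;  /* also stops at a premature '\0' */
    *cp = (*cp << 6) | (s[i] & 0x3F);
  }
  return n;
}

/* Copy the NUL-terminated UTF-8 string in to out, replacing every non-ASCII
   character by a decimal character reference such as &#233;.
   Returns 0 on success, -1 on malformed input or if out is too small. */
int utf8_to_ascii(const char *in, char *out, size_t outsize)
{
  const unsigned char *s = (const unsigned char *)in;
  size_t used = 0;
  unsigned long cp;
  int n, w;

  assert(in != NULL && out != NULL);
  while (*s) {
    if (!(n = decode_utf8(s, &cp))) return -1;
    if (cp < 0x80) {
      if (used + 1 >= outsize) return -1;   /* keep room for the final '\0' */
      out[used++] = (char)cp;
    } else {
      w = snprintf(out + used, outsize - used, "&#%lu;", cp);
      if (w < 0 || (size_t)w >= outsize - used) return -1;
      used += (size_t)w;
    }
    s += n;
  }
  if (used >= outsize) return -1;
  out[used] = '\0';
  return 0;
}

#ifdef XML2ASC_DEMO   /* compile with -DXML2ASC_DEMO to get a small demo program */
int main(void)
{
  char buf[64];

  if (utf8_to_ascii("caf\xC3\xA9", buf, sizeof buf) == 0) puts(buf);
  return 0;
}
#endif
```

The reverse direction (what asc2xml does) would scan for &#...; references, parse the number and encode it back into one to four UTF-8 bytes.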