HTML (and XML) utilities
Some simple C programs. Find them in the original directory.
- count [HTML-file]
- An example of using the parser (but not the tree module), to count
elements and attributes. The count currently uses a simple linear
search. To use it for real on large quantities of HTML, the algorithm
should probably be changed to use a hash table.
- htmlclean [HTML-file]
- Basically a test program for the HTML parser: it reads a file and
dumps the tree to stdout, which results in all missing tags being
inserted. Note: empty elements are written the XML-way:
<BR />. Can be run on its own output. (tidy is a more
powerful tool.)
- normalize [-x] [-l length] [-i indent] [HTML-file]
- Pretty-prints an HTML file to stdout. Options are: -x
for XML-compatible mark-up; -l N to set the
maximum length of lines; -i N to set the
additional indent for each level of nesting. Can be run on its own
output.
- num [options] [HTML-file]
- Number all or some H* headers, in many different styles. Output to
stdout. Try "num -?" for options. Insert CLASS="no-num" in the headers
that you don't want numbered. Can be run on its own output.
- pipe [-l] [HTML-file]
- Parse an HTML file and output it in a form similar to what nsgmls
does. However, missing end tags are not added, and the file is not
"corrected" in any way. Option -l inserts line numbers in the output.
Example: to have missing tags inferred first and then get this output
form, use the pipeline "htmlclean file | pipe".
- toc [-l low] [-h high] [HTML-file]
- Inserts a table of contents, adds IDs to headers, and adds <A
NAME=...> target anchors with the same IDs. Modified source is written
to stdout. The place of the table is indicated by <!--toc--> in the
source, or by the pair <!--begin-toc--> and <!--end-toc-->.
(Whatever is between those two comments is discarded.) Options
-l and -h indicate the lowest and
highest numbers of headers to include in the ToC. Numbers can be from 1
to 6. Can be run on its own output.
- multitoc [-x] [-s text ] [-e text ] [-l low | -h high | -b
base | HTML-file]+
- Reads multiple files and generates a single ToC on stdout.
- unent [file]
- Expand entities to UTF-8 sequences. Input must be in UTF-8, with
&-entities defined by HTML 4; output will be in UTF-8 without those
entities. Unknown entities are not expanded.
- xml2asc, asc2xml
- This pair of programs translates UTF-8 files to ASCII files with
&#-entities and vice-versa.
- wls [-l] [-r] [-h] [-b base] [HTML-file]
- List all links in one or more HTML files.
- htmlprune
- Removes all elements (with their contents) with a class of "exclude"
(or any other class given on the command line).
- incl [-x] [-a attrib] [-c class] [-b base] [file-or-URL]
- Expand included files: elements with a certain attribute (by default a
"class" attribute including the keyword "include") are replaced by the
file named by the element's content. The file may be a URL as well.
- cite [-b base] [-p template] [-a auxfile] bibfile [HTML-file]
- Look for bibliographic references of the form [[key]] and replace them
with a hyperlink to a bibliography, according to a certain replacement
template. Uses a refer-style database of citations.
- mkbib [-a auxfile] [-s sep] bibfile [HTML-file]
- Create a bibliography. (Companion program for cite.) Take the list of
keys found by cite and a template for a bibliography and insert the full
citations in the template. Uses a refer-style database of
citations.
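
The description of count suggests replacing its linear search with a hash table. The following is a minimal sketch of that idea, not the code used by count: the names CountTable, count_name and get_count, the bucket count and the choice of the FNV-1a hash are all assumptions made for illustration. Names are compared case-sensitively here, so a caller handling HTML would probably want to normalize case first.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024            /* hypothetical size, chosen arbitrarily */

typedef struct Entry {
  char *name;                    /* element or attribute name */
  unsigned long count;           /* occurrences seen so far */
  struct Entry *next;            /* next entry in the same bucket */
} Entry;

/* Must be zero-initialized before use (e.g., with memset) */
typedef struct {
  Entry *buckets[NBUCKETS];
} CountTable;

/* 32-bit FNV-1a hash of a NUL-terminated string */
static uint32_t hash_name(const char *s)
{
  uint32_t h = 2166136261u;
  for (; *s; s++) {
    h ^= (unsigned char)*s;
    h *= 16777619u;
  }
  return h;
}

/* Record one occurrence of name. Returns the new count, or 0 if
   memory could not be allocated. */
unsigned long count_name(CountTable *t, const char *name)
{
  Entry *e;
  uint32_t i;
  size_t len;

  assert(t != NULL && name != NULL);
  i = hash_name(name) % NBUCKETS;
  for (e = t->buckets[i]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return ++e->count;

  /* Not seen before: create an entry at the head of the bucket */
  e = malloc(sizeof(*e));
  if (!e) return 0;
  len = strlen(name) + 1;
  e->name = malloc(len);
  if (!e->name) { free(e); return 0; }
  memcpy(e->name, name, len);
  e->count = 1;
  e->next = t->buckets[i];
  t->buckets[i] = e;
  return 1;
}

/* Return the count recorded for name, or 0 if it was never seen */
unsigned long get_count(const CountTable *t, const char *name)
{
  const Entry *e;

  assert(t != NULL && name != NULL);
  for (e = t->buckets[hash_name(name) % NBUCKETS]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return e->count;
  return 0;
}
```

With a table like this, each lookup only walks the entries of one bucket instead of every name seen so far, which is what makes it attractive for large inputs.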
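
To illustrate the kind of translation xml2asc is described as doing (UTF-8 in, ASCII with &#-entities out), here is a small sketch. It is not taken from xml2asc: the function name utf8_to_ascii_refs is hypothetical, the use of decimal rather than hexadecimal references is an assumption, and the decoder is deliberately simple (it rejects truncated sequences but does not check for overlong encodings or surrogates).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the UTF-8 string in to out (capacity outsize bytes), replacing
   every non-ASCII character by a decimal reference "&#N;".
   Returns 0 on success, -1 on malformed input or if out is too small. */
int utf8_to_ascii_refs(const char *in, char *out, size_t outsize)
{
  const unsigned char *p = (const unsigned char *)in;
  size_t used = 0;

  assert(in != NULL && out != NULL);
  while (*p) {
    unsigned long cp;
    int extra, n;

    if (*p < 0x80) {                        /* plain ASCII: copy as-is */
      if (used + 2 > outsize) return -1;    /* keep room for the NUL */
      out[used++] = (char)*p++;
      continue;
    }

    /* The lead byte tells how many continuation bytes follow */
    if ((*p & 0xE0) == 0xC0)      { cp = *p & 0x1F; extra = 1; }
    else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
    else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
    else return -1;                         /* invalid lead byte */
    p++;
    while (extra-- > 0) {
      if ((*p & 0xC0) != 0x80) return -1;   /* truncated sequence */
      cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);
    }

    n = snprintf(out + used, outsize - used, "&#%lu;", cp);
    if (n < 0 || (size_t)n >= outsize - used) return -1;
    used += (size_t)n;
  }
  if (used >= outsize) return -1;
  out[used] = '\0';
  return 0;
}
```

For example, passing the bytes "caf\xC3\xA9" (the word "café" in UTF-8) yields "caf&#233;". The reverse direction would parse each &#N; reference and encode the number back into UTF-8.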