HTML (and XML) utilities
A collection of simple C programs, all found in this directory.
- count [HTML-file]
- An example of using the parser (but not the tree module) to count
elements and attributes. Counting currently uses a simple linear
search. For serious use on large quantities of HTML, the algorithm
should probably be changed to use a hash table.
- htmlclean [HTML-file]
- Basically a test program for the HTML parser: it reads a file and
dumps the tree to stdout, which results in all missing tags being
inserted. Note: empty elements are written the XML way, e.g.
<BR />. Can be run on its own output.
- normalize [-x] [-l length] [-i indent] [HTML-file]
- Pretty-prints an HTML file to stdout. Options are: -x
for XML-compatible mark-up; -l N to set the
maximum length of lines; -i N to set the
additional indent for each level of nesting. Can be run on its own
output.
- num [options] [HTML-file]
- Number all or some H* headers, in many different styles. Output to
stdout. Try "num -?" for options. Insert CLASS="no-num" in the headers
that you don't want numbered. Can be run on its own output.
- pipe [-l] [HTML-file]
- Parse an HTML file and output it in a form similar to what nsgmls
does. However, missing end tags are not added, and the file is not
"corrected" in any way. Option -l inserts line numbers in the output.
Example: to infer missing tags first and then produce the nsgmls-like
output, use the pipeline "htmlclean file | pipe".
- toc [-l low] [-h high] [HTML-file]
- Inserts a table of contents, adds IDs to headers, and adds <A
NAME=...> target anchors with the same IDs. The modified source is written
to stdout. The place of the table is indicated by <!--toc--> in the
source, or by the pair <!--begin-toc--> and <!--end-toc-->.
(Whatever is between those two comments is discarded.) Options
-l and -h indicate the lowest and
highest header levels to include in the ToC. Levels can be from 1
to 6. Can be run on its own output.
- multitoc [-x] [-s text] [-e text] [-l low | -h high | -b base | HTML-file]+
- Reads multiple files and generates a single ToC on stdout.
- unent [file]
- Expand entities to UTF-8 sequences. Input must be in UTF-8, with
&-entities defined by HTML 4; output will be in UTF-8 without those
entities. Unknown entities are not expanded.
- xml2asc, asc2xml
- This pair of programs translates UTF-8 files to ASCII files with
&#-entities and vice versa.
- wls
- List all links in one or more HTML files.
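
The note under count suggests replacing the linear search with a hash
table. The following is a minimal sketch of what that could look like,
not the code used by count. The names counter, counter_add, counter_get
and counter_free, the bucket count, and the choice of the 32-bit FNV-1a
hash are all assumptions made for illustration. Names are compared
case-sensitively, so the caller is assumed to normalize case first.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* hypothetical size; must be a power of two */

/* One counted name (element or attribute), chained within its bucket. */
typedef struct entry {
  char *name;
  unsigned long count;
  struct entry *next;
} entry;

/* The table itself; initialize it to all zeros before use. */
typedef struct {
  entry *buckets[NBUCKETS];
} counter;

/* 32-bit FNV-1a hash of a NUL-terminated string. */
uint32_t hash(const char *s)
{
  uint32_t h = 2166136261u;
  for (; *s; s++) {
    h ^= (unsigned char)*s;
    h *= 16777619u;
  }
  return h;
}

/* Record one occurrence of name. Returns the new count for that name,
   or 0 if memory could not be allocated. */
unsigned long counter_add(counter *c, const char *name)
{
  size_t i;
  entry *e;

  assert(c != NULL && name != NULL);
  i = hash(name) & (NBUCKETS - 1);

  /* Only the names that hash to this bucket are compared. */
  for (e = c->buckets[i]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return ++e->count;

  /* First occurrence: create an entry at the head of the chain. */
  e = malloc(sizeof(*e));
  if (!e) return 0;
  e->name = malloc(strlen(name) + 1);
  if (!e->name) { free(e); return 0; }
  strcpy(e->name, name);
  e->count = 1;
  e->next = c->buckets[i];
  c->buckets[i] = e;
  return 1;
}

/* Return how often name was recorded, or 0 if it was never seen. */
unsigned long counter_get(const counter *c, const char *name)
{
  const entry *e;

  for (e = c->buckets[hash(name) & (NBUCKETS - 1)]; e; e = e->next)
    if (strcmp(e->name, name) == 0) return e->count;
  return 0;
}

/* Release all entries and leave the table empty. */
void counter_free(counter *c)
{
  size_t i;

  for (i = 0; i < NBUCKETS; i++) {
    entry *e = c->buckets[i];
    while (e) {
      entry *next = e->next;
      free(e->name);
      free(e);
      e = next;
    }
    c->buckets[i] = NULL;
  }
}
```

A parser callback would call counter_add once per start tag and once per
attribute name. Chaining keeps the sketch short; the lookup cost then
depends on the chain length in one bucket rather than on the total number
of distinct names.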
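
The translation that xml2asc is described as doing (UTF-8 in, ASCII with
&#-entities out) can be sketched as follows. This is an illustration
under stated assumptions, not the program's actual source: the function
names utf8_decode and utf8_to_ascii are hypothetical, decimal character
references are assumed, and overlong sequences and surrogates are not
rejected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decode one UTF-8 sequence starting at s. Stores the code point in *cp
   and returns the number of bytes consumed, or 0 if the sequence is
   malformed. Overlong forms and surrogates are not checked here. */
int utf8_decode(const unsigned char *s, unsigned long *cp)
{
  int n, i;

  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
  else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
  else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
  else return 0;              /* stray continuation or invalid lead byte */

  for (i = 1; i < n; i++) {
    /* A premature NUL fails this test too, so we never read past it. */
    if ((s[i] & 0xC0) != 0x80) return 0;
    *cp = (*cp << 6) | (s[i] & 0x3F);
  }
  return n;
}

/* Copy the NUL-terminated UTF-8 string in to out, replacing every
   non-ASCII character by a decimal reference such as &#233;.
   At most outsize bytes are written, including the final NUL.
   Returns 0 on success, -1 on malformed input or lack of space. */
int utf8_to_ascii(const char *in, char *out, size_t outsize)
{
  const unsigned char *s;
  size_t used = 0;

  assert(in != NULL && out != NULL);
  s = (const unsigned char *)in;

  while (*s) {
    unsigned long cp;
    int w, n = utf8_decode(s, &cp);

    if (n == 0) return -1;
    if (cp < 0x80)
      w = snprintf(out + used, outsize - used, "%c", (int)cp);
    else
      w = snprintf(out + used, outsize - used, "&#%lu;", cp);
    /* snprintf reports the length it needed; reject truncated output. */
    if (w < 0 || (size_t)w >= outsize - used) return -1;
    used += (size_t)w;
    s += n;
  }
  if (outsize == 0) return -1;
  out[used] = '\0';
  return 0;
}
```

The reverse direction, as in asc2xml, would scan for "&#", parse the
number up to the ";" and encode it back into one to four UTF-8 bytes.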