HTML (and XML) utilities
Some simple C programs:
- count [HTML-file]
-
An example of using the parser (but not the tree module) to
count elements and attributes. The count currently uses a
simple linear search. For real use on large quantities of
HTML, the algorithm should probably be changed to use a
hash table.
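The hash-table suggestion can be sketched as follows. This is only an illustration, not code from count itself: the names counter, counter_add, counter_get and the table size NBUCKETS are hypothetical, and names are assumed to be case-normalized before they are counted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* hypothetical table size; tune for the input */

/* One counted name (element or attribute), chained within its bucket. */
typedef struct entry {
    char *name;
    unsigned long count;
    struct entry *next;
} entry;

typedef struct {
    entry *buckets[NBUCKETS];
} counter;

/* 32-bit FNV-1a string hash, reduced to a bucket index. */
unsigned int hash_name(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return (unsigned int)(h % NBUCKETS);
}

/* Record one occurrence of name.
   Returns the new count, or 0 if memory runs out. */
unsigned long counter_add(counter *c, const char *name)
{
    unsigned int i;
    size_t len;
    entry *e;

    assert(c != NULL && name != NULL);
    i = hash_name(name);

    /* Only the entries in one bucket are compared, not every name seen. */
    for (e = c->buckets[i]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return ++e->count;

    /* First occurrence: create an entry at the head of the chain. */
    e = malloc(sizeof *e);
    if (!e)
        return 0;
    len = strlen(name) + 1;
    e->name = malloc(len);
    if (!e->name) {
        free(e);
        return 0;
    }
    memcpy(e->name, name, len);
    e->count = 1;
    e->next = c->buckets[i];
    c->buckets[i] = e;
    return 1;
}

/* Return the current count of name (0 if it was never added). */
unsigned long counter_get(const counter *c, const char *name)
{
    const entry *e;

    for (e = c->buckets[hash_name(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->count;
    return 0;
}
```

A zero-initialized counter (for example "counter c = {0};") is an empty table. The parser callbacks for start tags and attributes would then call counter_add once per name. With a linear search, each lookup costs time proportional to the number of distinct names; with the table, it costs roughly the length of one bucket chain.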
- htmlclean [HTML-file]
-
Basically a test program for the HTML parser: it reads a file
and dumps the tree to stdout, which results in all missing
tags being inserted. Note: empty elements are
written the XML way: <BR />. The program can be run on its
own output.
- normalize [-x] [-l length] [-i indent] [HTML-file]
-
Pretty-prints an HTML file to stdout. Options are:
-x for XML-compatible mark-up; -l
N to set the maximum length of lines;
-i N to set the additional indent
for each level of nesting. Can be run on its own output.
- num [options] [HTML-file]
-
Number all or some H* headers, in many different styles.
Output to stdout. Try "num -?" for options. Insert
CLASS="no-num" in the headers that you don't want numbered.
Can be run on its own output.
- pipe [-l] [HTML-file]
-
Parse an HTML file and output it in a form similar to what
nsgmls does. However, missing end tags are not added, and the
file is not "corrected" in any way. Option -l inserts line
numbers in the output. Example: to have missing tags inferred
before the file is converted to this form, use the pipeline
"htmlclean file | pipe".
- toc [-l low] [-h high] [HTML-file]
-
Inserts a table of contents, adds IDs to headers, and adds <A
NAME=...> target anchors with the same ID. The modified
source is written to stdout. The place of the table is
indicated by <!--toc--> in the source, or by the pair
<!--begin-toc--> and <!--end-toc-->. (Whatever is
between those two comments is discarded.) Options
-l and -h indicate the
lowest and highest header levels to include in the ToC.
Levels can be from 1 to 6. Can be run on its own output.
- unent [file]
-
Expand entities to UTF-8 sequences. Input must be in UTF-8,
with &-entities defined by HTML 4; output will be in UTF-8
without those entities. Unknown entities are not expanded.
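The expansion step can be sketched as follows. This is a minimal illustration, not the code of unent: the function names expand_entities and put_utf8 are hypothetical, and the table holds only four HTML 4 entities chosen as examples. Entities missing from the table (here including &amp;) are copied through unchanged, which mirrors the rule that unknown entities are not expanded.

```c
#include <stddef.h>
#include <string.h>

/* A tiny, illustrative subset of the HTML 4 entity table. */
static const struct {
    const char *name;
    unsigned long cp;
} entities[] = {
    { "nbsp", 160 }, { "copy", 169 }, { "eacute", 233 }, { "euro", 8364 },
};

/* Write code point cp as UTF-8 at out; return the number of bytes written. */
size_t put_utf8(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Copy in to out, replacing each known &name; by its UTF-8 sequence.
   out must have room for strlen(in) + 1 bytes; that is enough because
   every expansion in this table is shorter than its reference. */
void expand_entities(const char *in, char *out)
{
    const size_t n = sizeof entities / sizeof entities[0];

    while (*in) {
        const char *semi;

        if (*in == '&' && (semi = strchr(in, ';')) != NULL) {
            size_t len = (size_t)(semi - in - 1);   /* length of the name */
            size_t i;

            for (i = 0; i < n; i++)
                if (strlen(entities[i].name) == len &&
                    strncmp(in + 1, entities[i].name, len) == 0)
                    break;
            if (i < n) {                 /* known entity: expand it */
                out += put_utf8(entities[i].cp, out);
                in = semi + 1;
                continue;
            }
        }
        *out++ = *in++;                  /* anything else is copied as-is */
    }
    *out = '\0';
}
```

For example, the input "caf&eacute; &foo;" becomes "café &foo;" in UTF-8: the known entity is expanded and the unknown one is left alone.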
- xml2asc, asc2xml
-
This pair of programs converts between encodings: xml2asc
translates UTF-8 files to ASCII files with &#-entities, and
asc2xml does the reverse.
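The UTF-8 to ASCII direction can be sketched as follows. This is an illustration under stated assumptions, not the code of xml2asc: the function name utf8_to_ascii_refs is hypothetical, the references are written in decimal, and the decoder does not reject overlong forms or surrogate code points.

```c
#include <stddef.h>
#include <stdio.h>

/* Convert the UTF-8 string in to ASCII in out (capacity outsize bytes),
   writing every multi-byte character as a decimal &#N; reference.
   Returns the output length, or -1 on malformed input or lack of space. */
long utf8_to_ascii_refs(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    while (*p) {
        unsigned long cp;
        int extra, k;

        /* The lead byte gives the number of continuation bytes. */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;               /* stray continuation or invalid byte */
        p++;

        /* Each continuation byte contributes six more bits. */
        for (k = 0; k < extra; k++, p++) {
            if ((*p & 0xC0) != 0x80)
                return -1;            /* truncated or malformed sequence */
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (extra == 0) {
            /* Plain ASCII is copied unchanged. */
            if (n + 1 >= outsize)
                return -1;
            out[n++] = (char)cp;
        } else {
            /* Everything else becomes a numeric character reference. */
            int w = snprintf(out + n, outsize - n, "&#%lu;", cp);
            if (w < 0 || (size_t)w >= outsize - n)
                return -1;
            n += (size_t)w;
        }
    }
    if (n >= outsize)
        return -1;
    out[n] = '\0';
    return (long)n;
}
```

For example, the UTF-8 input "café" yields the ASCII output "caf&#233;". The reverse direction would scan for &#N; references and re-encode each number as a UTF-8 sequence.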