html_analyzer-0.02 README

This file contains information outlining
the types of processing performed
by the html_analyzer software as
well as copyright, disclaimer, and
funding information.   [Edited in
part by TimBL to change confusing
use of the word "tag" for anchor
contents. Plain text sections not


The code is available via anonymous ftp from in the directory /Mosaic/misc in compress'd and gzip'd forms.


This directory contains the software to perform analysis of HyperText MarkUp Language (HTML) databases. Specifically, all .html files are processed to It is important to note that the validation process checks only http and relative addressed links. This is primarily because these are the only types of links being used by our systems. The second check we have termed "completeness" and the third check "consistency".

We believe that there ought to exist a one-to-one correspondence between a link and an anchor's contents, such that every occurrence of a given anchor contents describes only one link destination, that all occurrences of that anchor contents in the HTML database are hyperlinks, and that there is but one way to get to the portion of a document pointed to by a link. This means that every time a user sees a given anchor contents, it will always point to the same section of a document. It also means that each section of a document will have only one phrase pointing to it. We hypothesize that such a correspondence is necessary to give the user a clear internal representation of the associations in the HTML database.


	Below is a portion of HTML text extracted from a fictional database. 
A rudimentary understanding of HTML is assumed.  For more information on
HTML anonymous ftp to and look in the /pub/www/doc for the 
latest files on HTML and related papers.  Another really good source of 
information on HTML can be found via anonymous ftp to in
the /Web/mosaic-papers directory. 

<TITLE>Guide Demo V.1</TITLE>
<H1>Guide Demo V.1</H1>

List of informative files to view: <P>
<A NAME=1 HREF="">
Description of this demo</A> <P>

<A NAME=2 HREF="">
More Info Frame</A> <P> 

<A NAME=3 HREF="">
Free Text Frame</A> <P> 
<A NAME=4 HREF="">
Free Text Frame</A> <P>

The first link contains a description of this demo.  The second link
points to a frame that provides more information.  The third link
has a different tag than the second link, but points to the same place. 
The fourth tag is the same as the third tag, but points to a different 
file.

NOTE: for clarity, the following terms refer to the above constructs.

ANCHOR:	the text bounded by <A and </A>.  The last anchor in the above
	example is:
	<A NAME=4 HREF="">
	Free Text Frame</A>

LINK:	the location of the file being referenced (i.e., that which is double
	quoted following the HREF= token).  The LINK in the last anchor is:

TAG:	the series of words that describe the LINK.  This is what will get
	displayed to the user as a hyperlink.  The TAG field follows the
	"> and comes before the </A>.  In the above example, the last anchor
	has the tag: Free Text Frame.
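
A minimal sketch of how the LINK and TAG of each ANCHOR might be pulled out
of a file.  It is written in Python with the standard html.parser module,
not in the C used by html_analyzer itself, and the names AnchorCollector
and extract_pairs are made up for this illustration; they are not part of
the distribution.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (LINK, TAG) pairs: the HREF value and the displayed text."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (link, tag) pairs, in document order
        self._href = None    # HREF of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases element and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse newlines and runs of blanks inside the anchor text.
            label = " ".join("".join(self._text).split())
            self.pairs.append((self._href, label))
            self._href = None


def extract_pairs(html_text):
    """Return every (LINK, TAG) pair found in html_text."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.pairs
```

Anchors that carry only a NAME and no HREF are skipped, since they are
destinations rather than links.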

	Here's how the software handles the above text.  First, the wrapper 
calls recurse_fcn(), which calls extract-references() recursively on
all *.html files in the specified directory hierarchy.  extract-references()
splits the text into a non-HTML file and a file 
containing the links.  Next, html_analyzer() is called.  html_analyzer() reads
the information gathered about the links into two skiplist ADTs.  It then
attempts to confirm the http and relative addressed links specified by the
HREF.  If it is unable to do so (server not running, file not found, etc.), a
warning message is posted to the user with the suspected reason for failure. 
Next, the program processes each tag, looking for occurrences of it in the
non-HTML file.  If occurrences are found, violating the principle of
completeness, warning messages are provided notifying the user of the file
that contained the non-HTML'ed tag.  Finally, the links and tags are processed
to ensure that a one-to-one correspondence exists.  Warning messages
are provided that inform the user of inconsistent relations.
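
The first step above, visiting every *.html file under a directory, could
be sketched roughly as follows.  This is Python using os.walk, not the
recurse_fcn() of the distribution, and find_html_files is a name invented
for the example.

```python
import os


def find_html_files(root):
    """Return the paths of all *.html files below root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".html"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Each path returned would then be handed to the extraction step one file
at a time.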

	In the above example, there is no file: /u/CIMS/Demo_Description.html
located on, a httpd server listening on port 1729.  The
first series of tests will discover this and notify the user as such:

WARNING:  Failed in checking:
   In file: Home.html at line: 6 
   With tag of: Description of this demo
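
As a rough illustration of the validation test for relative addressed
links only, the sketch below resolves each HREF against the directory of
the file that contains it and reports those that do not name an existing
file.  It is Python, not the code in this distribution; resolve_link and
check_relative_links are hypothetical names, and http links (which need a
request to the server) are deliberately left out.

```python
import os


def resolve_link(source_file, href):
    """Return the local path that a relative HREF in source_file refers to."""
    target = href.split("#", 1)[0]            # drop any "#fragment" part
    if not target:
        return os.path.normpath(source_file)  # link within the same file
    base = os.path.dirname(source_file)
    return os.path.normpath(os.path.join(base, target))


def check_relative_links(source_file, hrefs):
    """Return the hrefs whose target is not an existing file."""
    return [h for h in hrefs
            if not os.path.isfile(resolve_link(source_file, h))]
```

Every HREF returned by check_relative_links would correspond to one
"Failed in checking" warning like the one above.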

Next, html_analyzer finds out that the tag used to describe this link occurred
elsewhere without a link.  This could have been in another file.  The user is
given a list of the file(s) that need to have that tag made into a hyperlink.  
The output will look something like this:

WARNING: These filenames contain the tag:
  Description of this demo
 Without a link to:


Done Verifying completeness.
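
The completeness test can be pictured with the short Python sketch below.
It assumes that the text of each file with its anchors already removed is
held in a dictionary keyed by filename (plain_text_by_file, a hypothetical
structure standing in for the non-HTML file), and it assumes a plain
case-sensitive substring match, which may differ from what html_analyzer
actually does.

```python
def find_unlinked_tags(tags, plain_text_by_file):
    """Map each tag to the sorted list of files whose plain text mentions it."""
    report = {}
    for tag in tags:
        files = sorted(name for name, text in plain_text_by_file.items()
                       if tag in text)
        if files:
            report[tag] = files
    return report
```

Each entry in the result corresponds to one warning: a tag, followed by
the files in which it appears without being a hyperlink.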

Next, the user will be informed that more than one tag is used to
describe the link to /u/CIMS/More_info.html on nsidc1.  In this case, both the
"More Info Frame" and the "Free Text Frame" point to the same file.  One of 
them needs to go.  To aid the HTML db maintainers with the task of deciding 
which tag to remove, the software informs the user of the number of occurrences
of each tag in the database.  The output would look something like this:

WARNING: Link used inconsistently.
  occurs 1 time with tag: Free Text Frame
  as in file: temp/t.html, but also
  occurs 1 time with tag: More Info Frame
  as in file: temp/t.html

Done Verifying consistency of links.
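
A sketch of this check in Python, assuming the (LINK, TAG) pairs have
already been gathered into a list.  The real program keeps them in
skiplists; the dictionary of counters here and the name
links_with_many_tags are stand-ins chosen for brevity.

```python
from collections import Counter, defaultdict


def links_with_many_tags(pairs):
    """Return {link: {tag: count}} for links described by more than one tag."""
    tags_by_link = defaultdict(Counter)
    for link, tag in pairs:
        tags_by_link[link][tag] += 1
    return {link: dict(counts)
            for link, counts in tags_by_link.items()
            if len(counts) > 1}
```

The per-tag counts are what let a maintainer see which of the competing
tags is used less often and is therefore cheaper to remove.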

Next, the user will be informed that the tag "Free Text Frame" points to
more than one file.  This is easily corrected by changing the name of the
tag that is used least often to another name.  The output would look like this: 

WARNING: Tag used inconsistently.
  TAG: Free Text Frame
  occurs 1 times with href:
  as in file: temp/t.html, but also
  occurs 1 time with href:
  as in file: temp/t.html

Done Verifying consistency of tags.
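
The reverse check might be sketched like this, again in Python and again
over a plain list of (LINK, TAG) pairs rather than the skiplists used by
the program.  The name tags_with_many_links is hypothetical.  For each
offending tag the links are listed least-used first, following the
suggestion above to rename the occurrence that is used least often.

```python
from collections import Counter, defaultdict


def tags_with_many_links(pairs):
    """Return {tag: [(link, count), ...]} for tags pointing to several links."""
    links_by_tag = defaultdict(Counter)
    for link, tag in pairs:
        links_by_tag[tag][link] += 1
    report = {}
    for tag, counts in links_by_tag.items():
        if len(counts) > 1:
            # Sort by count, then by link name so the order is deterministic.
            report[tag] = sorted(counts.items(),
                                 key=lambda item: (item[1], item[0]))
    return report
```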


	To run the html_analyzer after it has been installed (see the
file INSTALLION in this directory) type:

html_analyzer [-val] [-com] [-con] directory [path of repository]

	The -val, -com, and -con flags turn off the validation, completeness, and
consistency tests, respectively.  Only the name of a directory can be specified to check.  
If a directory is specified, all .html files within the directory hierarchy
will be processed.  If a file is specified, only that file will be processed.
Multiple filenames are not allowed as command line arguments.  The path of 
the temporary repository (default is /var/tmp) can be given if /var/tmp is
full or not desirable.  A directory (/html_analyzer) is created in this
repository to store the temporary files generated by execution.


	The libhttp directory was modified from the libwww directory from  
xmosaic-1.0, an HTML client developed by the National Center for
Supercomputing Applications.  This code is available from
in the Web directory.  This library was initially distributed by 
Tim Berners-Lee at the European Laboratory for Particle Physics (CERN).
Please see the file Copyrights in this directory for more information on
the copyrights that apply to this portion of the code.

	The Regents of the University of Colorado claim copyright on the
other portions of the distribution.  

	This distribution of the software may be freely distributed, used, 
and modified but may not be sold as a whole nor in parts without permission 
of the copyright owners of the parts.


	This software is provided as is.  The Laboratory for Atmospheric
and Space Physics (LASP) and the author are not responsible for support 
of this distribution.


	Development of this software was funded by the NASA Earth Observing 
System Project under NASA contract NAS5-32392.


version 0.02 from 0.01:

0) converted CHECK_HTML_DB and GET_ANCHORS to C code.
1) added verification of relative addressed links.
2) added one-to-many check of tags to links ( previously: many-to-one check
   of links to tags)
3) cleaned up 


	Here's a list of things that could be done to improve the html_analyzer:

0) add validation capabilities for other servers, e.g. gopher, ftp, etc.  This
   is really simple, but is not required for our usage.  Let me know if you
   want to do this but do not know how.
1) create a program to automatically prune anchors that no longer point to 
   valid files.  This entails some tricky questions as to how automated this
   process needs to be.  In other words, it might be nice for the user to
   have the option of specifying the correct location of the file and have the
   software make the changes to the HREFs as needed AS WELL as provide the
   user with the option of having the software remove all anchors pointing to
   the no-longer existent file.  Let me know if you're interested in this option;
   this seems like the next logical addition to the software. 
2) add a linked list to the data struct of the skiplist that points to a list
   of other files that have the same anchor and tag.  This will enable more
   sophisticated analysis, e.g. enable option 1) above by producing a list of
   files that point to a document for pruning purposes, etc.
3) add statistical analysis of the HTML db i.e. number of anchors per document,
   number of links to a document, list of files that point to a document, etc.
4) perform an empirical study to confirm the hypothesis of the importance of a 
   one-to-one correspondence between anchors and tags. [I might do this this
   fall if time allows].


	The purpose of this distribution is to further the development of
HTML database creation and maintenance utilities.  Comments, questions, 
and REVISIONS are indeed welcome.  Also, a minder is installed in the Makefile
that will add you to a user's list.  If you do not want to be added, feel
free to remove the commands that make it so. 


James E. Pitkow                               September 16, 1993
Undergraduate Research Assistant
Laboratory for Atmospheric and Space Physics
Campus Box 590
Boulder, CO 80309 USA