This file contains information outlining the types of processing performed by the html_analyzer software as well as copyright, disclaimer, and funding information. [Edited in part by TimBL to change confusing use of the word "tag" for anchor contents. Plain text sections not edited]
We believe that there ought to exist a one-to-one correspondence between a link and its anchor contents: every occurrence of a given anchor contents describes only one link destination, every occurrence of that anchor contents in the HTML database is a hyperlink, and there is only one way to get to the portion of a document pointed to by a link. This means that every time a user sees a given anchor contents, it will always point to the same section of a document. It also means that each section of a document will have only one phrase pointing to it. We hypothesize that such a correspondence is necessary to give the user a clear internal representation of the associations in the HTML database.
Below is a portion of HTML text extracted from a fictional database. A rudimentary understanding of HTML is assumed. For more information on HTML, anonymous ftp to info.cern.ch and look in /pub/www/doc for the latest files on HTML and related papers. Another good source of information on HTML is available via anonymous ftp to ftp.ncsa.uiuc.edu in the /Web/mosaic-papers directory.

    <TITLE>Guide Demo V.1</TITLE>
    <H1>Guide Demo V.1</H1>
    <BODY>
    List of informative files to view:
    <P>
    <A NAME=1 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html"> Description of this demo</A>
    <P>
    <A NAME=2 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html"> More Info Frame</A>
    <P>
    <A NAME=3 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html"> Free Text Frame</A>
    <P>
    <A NAME=4 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/Even_more_info.html"> Free Text Frame</A>
    <P>
    The first link contains a description of this demo. The second link points to a frame that provides more information. The third link has a different tag than the second link, but points to the same place. The fourth tag is the same as the third tag, but points to a different file.<P>
    </BODY>

NOTE: for clarity, the following terms refer to the above constructs.

ANCHOR: the text bounded by <A and </A>. The third anchor in the above example is:

    <A NAME=3 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html"> Free Text Frame</A>

LINK: the location of the file being referenced (i.e. that which is double quoted following the HREF= token). The LINK in the third anchor is:

    http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html

TAG: the series of words that describe the LINK. This is what will get displayed to the user as a hyperlink. The TAG field follows the "> and comes before the </A>. In the above example, the third anchor has the tag: Free Text Frame.

Here's how the software handles the above text.
First, the wrapper calls recurse_fcn(), which calls extract-references() recursively on all *.html files in the specified directory hierarchy. extract-references() splits the text into a non-HTML file and a file containing the links.

Next, html_analyzer() is called. html_analyzer() reads the information gathered about the links into two skiplist ADTs. It then attempts to confirm the http and relative addressed links specified by the HREF. If it is unable to do so (server not running, file not found, etc.), a warning message is posted to the user with the suspected reason for failure.

Next, the program processes each tag, looking for occurrences of that tag in the non-HTML file. If occurrences are found, violating the principle of completeness, warning messages notify the user of the file that contained the non-hyperlinked tag.

Finally, the links and tags are processed to ensure that a one-to-one correspondence exists. Warning messages inform the user of inconsistent relations.

In the above example, there is no file /u/CIMS/Demo_Description.html on nsidc1.colorado.edu, which runs an httpd server listening on port 1729. The first series of tests will discover this and notify the user as such:

    WARNING: Failed in checking: http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
    In file: Home.html at line: 6
    With tag of: Description of this demo
    FILE NOT LOCATED AS INDICATED!

Next, html_analyzer finds out that the tag used to describe this link occurred elsewhere without a link. This could have been in another file. The user is given a list of the file(s) in which that tag needs to be made into a hyperlink. The output will look something like this:

    VERIFYING COMPLETENESS...
    WARNING: These filenames contain the tag: Description of this demo
    Without a link to: http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
    t.html
    Done Verifying completeness.
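The splitting and completeness steps described above can be illustrated with a short sketch. This is not the html_analyzer source (which is written in C and uses skiplists); it is a minimal Python illustration, and the names AnchorSplitter, split_html, and missing_links are hypothetical. It assumes a tag counts as "occurring" when it appears as a plain substring of the text outside any hyperlink.

```python
from html.parser import HTMLParser


class AnchorSplitter(HTMLParser):
    """Split HTML into (link, tag) pairs and the text outside hyperlinks."""

    def __init__(self):
        super().__init__()
        self.pairs = []     # (href, anchor contents) for each hyperlink
        self.plain = []     # text that is not inside a hyperlink
        self._href = None   # HREF of the anchor currently open, if any
        self._buf = []      # pieces of the current anchor's contents

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases element and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._buf).split())
            self.pairs.append((self._href, text))
            self._href = None

    def handle_data(self, data):
        target = self._buf if self._href is not None else self.plain
        target.append(data)


def split_html(source):
    """Return ([(link, tag), ...], whitespace-normalized non-link text)."""
    parser = AnchorSplitter()
    parser.feed(source)
    parser.close()
    return parser.pairs, " ".join("".join(parser.plain).split())


def missing_links(pairs, plain_by_file):
    """Map each tag to the files whose non-link text mentions it."""
    report = {}
    for _link, tag in pairs:
        hits = sorted(name for name, text in plain_by_file.items()
                      if tag in text)
        if hits:
            report[tag] = hits
    return report
```

A caller would run split_html() on every file, collect the plain text per file name, and pass both to missing_links() to obtain the list of files that mention a tag without linking it.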
Next, the user will be informed that more than one tag is used to describe the link to /u/CIMS/More_info.html on nsidc1. In this case, both the "More Info Frame" and the "Free Text Frame" tags point to the same file. One of them needs to go. To help the HTML db maintainers decide which tag to remove, the software informs the user of the number of occurrences of each tag in the database. The output would look something like this:

    VERIFYING CONSISTENCY OF LINKS...
    WARNING: Link used inconsistently.
    HREF: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
    occurs 1 time with tag: Free Text Frame
    as in file: temp/t.html,
    but also occurs 1 time with tag: More Info Frame
    as in file: temp/t.html
    Done Verifying consistency of links.

Next, the user will be informed that the tag "Free Text Frame" points to more than one file. This is easily corrected by renaming the tag that is used least often. The output would look like this:

    VERIFYING CONSISTENCY OF TAGS...
    WARNING: Tag used inconsistently.
    TAG: Free Text Frame
    occurs 1 times with href: http://nsidc1.colorado.edu:1729/u/CIMS/Even_more_info.html
    as in file: temp/t.html,
    but also occurs 1 time with href: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
    as in file: temp/t.html
    Done Verifying consistency of tags.
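The two consistency checks amount to grouping the (link, tag) pairs in both directions and reporting any group with more than one member. The following is a rough Python sketch of that idea only; the function name inconsistencies is hypothetical, and the real program uses skiplists in C rather than dictionaries.

```python
from collections import Counter, defaultdict


def inconsistencies(pairs):
    """Return (links with several tags, tags with several links).

    Each result maps a key to a Counter recording how many times every
    partner occurs, mirroring the "occurs N time(s)" lines in the report.
    """
    tags_by_link = defaultdict(Counter)
    links_by_tag = defaultdict(Counter)
    for link, tag in pairs:
        tags_by_link[link][tag] += 1
        links_by_tag[tag][link] += 1
    # A one-to-one correspondence means every group has exactly one member.
    bad_links = {k: v for k, v in tags_by_link.items() if len(v) > 1}
    bad_tags = {k: v for k, v in links_by_tag.items() if len(v) > 1}
    return bad_links, bad_tags
```

Applied to the four anchors of the example, the first result flags the More_info.html link (two tags) and the second flags the "Free Text Frame" tag (two links). The per-partner counts are what a maintainer would use to decide which tag to rename.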
To run the html_analyzer after it has been installed (see the file INSTALLION in this directory) type:

    html_analyzer [-val] [-com] [-con] directory [path of repository]

The -val, -com, and -con options turn off the validation, completeness, and consistency tests, respectively. Only the name of a directory can be specified to check. If a directory is specified, all .html files within the directory hierarchy will be processed. If a file is specified, only that file will be processed. Multiple filenames are not allowed as command line arguments. The path of the temporary repository (default is /var/tmp) can be given if /var/tmp is full or otherwise undesirable. A directory (/html_analyzer) is created in this repository to store the temporary files generated by execution.
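For readers who want to script around the tool, the command line above can be mirrored in a few lines of Python. This is only an illustration of the documented syntax, not the program's actual argument handling; build_parser, enabled_tests, and scratch_dir are hypothetical names.

```python
import argparse
import os


def build_parser():
    """Mirror: html_analyzer [-val] [-com] [-con] directory [repository]."""
    p = argparse.ArgumentParser(prog="html_analyzer")
    p.add_argument("-val", action="store_true",
                   help="turn off the validation test")
    p.add_argument("-com", action="store_true",
                   help="turn off the completeness test")
    p.add_argument("-con", action="store_true",
                   help="turn off the consistency test")
    p.add_argument("directory", help="directory hierarchy to check")
    p.add_argument("repository", nargs="?", default="/var/tmp",
                   help="path of the temporary repository")
    return p


def enabled_tests(args):
    """List the tests that were NOT switched off on the command line."""
    switches = [("validation", args.val),
                ("completeness", args.com),
                ("consistency", args.con)]
    return [name for name, off in switches if not off]


def scratch_dir(repository):
    """Directory created inside the repository for temporary files."""
    return os.path.join(repository, "html_analyzer")
```

Note that each flag disables a test rather than enabling it, so running with no flags performs all three.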
The libhttp directory was modified from the libwww directory of xmosaic-1.0, an HTML client developed by the National Center for Supercomputing Applications. This code is available from ftp.ncsa.uiuc.edu in the Web directory. This library was initially distributed by Tim Berners-Lee at the European Laboratory for Particle Physics (CERN). Please see the file Copyrights in this directory for more information on the copyrights that apply to this portion of the code. The Regents of the University of Colorado claim copyright on the other portions of the distribution. This distribution of the software may be freely distributed, used, and modified, but may not be sold as a whole or in parts without permission of the copyright owners of the parts.
This software is provided as is. The Laboratory for Atmospheric and Space Physics (LASP) and the author are not responsible for support of this distribution.
Development of this software was funded by the NASA Earth Observing System Project under NASA contract NAS5-32392.
version 0.02 from 0.01:

  0) converted CHECK_HTML_DB and GET_ANCHORS to C code.
  1) added verification of relative addressed links.
  2) added one-to-many check of tags to links (previously: many-to-one check of links to tags)
  3) cleaned up
Here's a list of things that could be done to improve the html_analyzer:

  0) add validation capabilities for other servers, e.g. gopher, ftp, etc. This is really simple, but is not required for our usage. Let me know if you want to do this but do not know how.

  1) create a program to automatically prune anchors that no longer point to valid files. This entails some tricky questions as to how automated this process needs to be. In other words, it might be nice for the user to have the option of specifying the correct location of the file and have the software make the changes to the HREFs as needed, AS WELL as the option of having the software remove all anchors pointing to the no-longer existent file. Let me know if you're interested in this option; it seems like the next logical addition to the software.

  2) add a linked list to the data struct of the skiplist that points to a list of other files that have the same anchor and tag. This will enable more sophisticated analysis, e.g. enable option 1) above by producing a list of files that point to a document for pruning purposes, etc.

  3) add statistical analysis of the HTML db, i.e. number of anchors per document, number of links to a document, list of files that point to a document, etc.

  4) perform an empirical study to confirm the hypothesis of the importance of a one-to-one correspondence between anchors and tags. [I might do this this fall if time allows].
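As a rough idea of what item 3) might involve, the sketch below computes the three statistics named there from a list of (file, link, tag) records. Nothing like this exists in the distribution; the record layout and the function name link_statistics are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def link_statistics(records):
    """Summarize (source_file, link, tag) records.

    Returns a tuple of:
      - anchors per source document,
      - number of links pointing to each destination,
      - sorted list of files that point to each destination.
    """
    anchors_per_file = Counter()
    links_to_target = Counter()
    files_by_target = defaultdict(set)
    for source_file, link, _tag in records:
        anchors_per_file[source_file] += 1
        links_to_target[link] += 1
        files_by_target[link].add(source_file)
    pointing = {link: sorted(files)
                for link, files in files_by_target.items()}
    return anchors_per_file, links_to_target, pointing
```

The third result is also the list of files that item 1) would need in order to prune anchors pointing to a missing document.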
The purpose of this distribution is to further the development of HTML database creation and maintenance utilities. Comments, questions, and REVISIONS are indeed welcome. Also, a minder is installed in the Makefile that will add you to a user's list. If you do not want to be added, feel free to remove the commands that make it so.

Thanks,

James E. Pitkow
September 16, 1993

Undergraduate Research Assistant
Laboratory for Atmospheric and Space Physics
Campus Box 590
Boulder, CO 80309 USA
pitkow@aries.colorado.edu