Webbot Command Line Syntax

The generic syntax is:

         webbot [ options ] [ URI [ keywords ] ]

The options are grouped into the following sections:

Getting Help
User Interactions - how much or how little
Configuration File
Setting Basic Constraints for Traversal
Robots.txt and HTML META tags
Regular Expressions based Constraints
Selecting Breadth First (BFS) or Depth First Search (DFS)
Handling HTTP redirections
Using the HTTP/1.1 Persistent Cache
Verifying or Downloading Inlined Images and other objects
SQL Based Logging using MySQL
Regular Ascii Text Logging
Distribution and Statistics Features
Other Options

Options

The order of the options is not important, and options can be specified on either side of any URI. The currently available options are:

Getting Help

-v [ a | b | c | g | p | s | t | u ]

Verbose mode: Gives a running commentary on the program's attempts to read data in various ways. As the amount of verbose output is substantial, the -v option can now be followed by zero, one or more of the following flags (without space) in order to differentiate the verbose output generated:

a: Anchor relevant information
b: Bindings to local file system
c: Cache trace
g: SGML trace
p: Protocol module information
s: SGML/HTML relevant information
t: Thread trace
u: URI relevant information

The -v option without any appended flags shows all trace messages. For example, "-vpt" shows protocol and thread trace messages.

-version

Prints out the version number of the robot and the version number of libwww and exits.

User Interactions

These options determine how verbose and chatty the robot should be:

-n: Non-interactive mode - don't ask the user anything.
-q: Somewhat quiet mode.
-Q: Really quiet mode.
-ss: Print out date and time for start and stop for the job.

Configuration File

-r <address>: Rule file, also known as the configuration file. It holds a set of rules and configuration options that can be used to map URLs and to set up other aspects of the behavior of the command line tool. The address must be specified as a URI, so the file can be located on an HTTP server if need be. File URIs are parsed relative to the current folder, so a rule file address of "rules.conf" points to a file in the folder where the tool is started. A local rule file must have the suffix ".conf"; otherwise the media type must be application/x-www-rules.

Setting Basic Constraints for Traversal

These are some very simple constraints that can always be used when running the webbot.

-depth [ n ]: Limit jumps to n hops from the start page. The n-1 link is checked using a HEAD request. The default value is 0 which means that only the start page is searched. A value of 1 will cause the start page and all pages directly linked from that start page to be checked.
-prefix [ URI ]: Define a URI prefix for all URIs - if they do not match the prefix then they are not checked. The rejected URIs can be logged to a separate file.
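
As a rough sketch of how these two constraints combine, the shell fragment below assembles a command line that checks a start page and everything within two hops of it while staying under one URI prefix. The host www.example.org is a placeholder, and the leading echo makes this a dry run that only prints the command.

```shell
# Placeholder start page (an assumption); substitute your own site.
START="http://www.example.org/docs/"

# -n: non-interactive; -depth 2: at most two hops; -prefix: stay under START.
set -- -n -depth 2 -prefix "$START" "$START"

# Dry run: print the command line. Drop "echo" to actually run the webbot.
echo webbot "$@"
```

Using the start page itself as the prefix keeps the traversal inside that part of the site.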

Robots.txt and HTML META tags

There are situations where you may not want the robot to behave as a robot but more as a link checker in which case you may consider using these options:

-norobotstxt: Do not check for a robots.txt file.
-nometatags: Do not check for robots-related HTML META tags.

Regular Expressions based Constraints

Using regular expressions requires that you link against a regex library - see the installation instructions for details. With regular expressions you can control the constraints much more precisely, both to decide which URIs should be followed and to decide whether the webbot should use HEAD or GET when checking the links.

-exclude [ regex ]: Allows you to define a regular expression of which URIs should be excluded from the traversal. The rejected URIs can be logged to a separate file. This can be used to exclude specific parts of the URI space, for example all URIs containing "/old/": -exclude "/old/"
-check [ regex ]: Check all URIs that match this regular expression with a HEAD method instead of a GET method. This can be used to verify links while avoiding the download of large distribution files, for example: -check "\.gz$|\.Z$|\.zip$"
-include [ regex ]: Allows you to define a regular expression of which URIs should be included in the traversal
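
Before starting a long run, it can help to preview which URIs a pattern selects. The sketch below feeds a few made-up URIs through grep -E with the -check pattern quoted above. It assumes that the regex library linked into the webbot treats this pattern the same way grep's extended syntax does, which may not hold for every library.

```shell
# Made-up URIs, one per line (assumptions for illustration only).
URIS='http://www.example.org/dist/w3c-libwww.tar.gz
http://www.example.org/index.html
http://www.example.org/dist/old.tar.Z
http://www.example.org/zip.html'

# The -check pattern from the option description.
PATTERN='\.gz$|\.Z$|\.zip$'

# URIs matching the pattern would be checked with HEAD instead of GET.
HEAD_URIS=$(printf '%s\n' "$URIS" | grep -E "$PATTERN")
printf '%s\n' "$HEAD_URIS"
```

Only the two archive URIs are printed: the "$" anchors require the suffix at the very end of the URI, so zip.html does not match.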

Breadth First or Depth First Search

The webbot can perform either a Depth First Search (DFS) or a Breadth First Search (BFS). The default is DFS, where the robot issues new requests as soon as links are encountered. To change to the BFS algorithm, use the "-bfs" flag:

-bfs: Use Breadth First Search (BFS) instead of Depth First Search (DFS)

Handling HTTP Redirections

By default, the webbot doesn't follow HTTP redirections - it only registers them in the log files. However, by using the -redir option, it actually follows the redirections if the redirected address fulfills the traversing constraints.

-redir [ redirectioncode ]: Follow HTTP redirections. If no redirectioncode is given then follow all known redirections (301, 302, 303, 307). If you just want a single type of redirection to be followed then indicate that number as the redirectioncode, for example -redir 302.

Checking Inlined Images

The webbot can check inlined images as well as normal hyperlinks. You can control this using the following flags:

-img: Also test inlined images, using a HEAD request
-saveimg: Save the inlined images on local disk or pump them to a black hole. This is primarily meant for emulating a GUI client's behavior with the robot
-alt [ file ]: Specifies a Referer Log Format style log file of all inlined images that have no ALT attribute or an empty one.
-imgprefix [ URI ]: Define a URI prefix for all inlined image URIs - if they do not match the prefix then they are not checked. The rejected URIs can be logged to a separate file.

SQL Based Logging using MySQL

Using SQL based logging requires that you have linked against a MySQL library. See the installation instructions for details. I like the Web interface provided by www-sql which makes it easy to access the logged data. The data is stored in four tables within the same database (the default name is "webbot"):

uris: An index that maps URIs to integers so that they are easier to refer to
requests: Contains information from the request including the request-URI, the method, and the resulting status code.
resources: Contains information of the resource like content-type, content-encoding, expires, etc.
links: Contains information about which documents point to which documents, the type of the link etc. The type can either be implicit like "referer" or "image", or it can be explicit like "stylesheet", "toc", etc.

The command line options for handling the SQL logging are as follows:

-sqlserver [ srvrname ]: Specify the mysql server. The default is "localhost".
-sqldb [ dbname ]: Specify the database to use. The default is webbot. Note that webbot creates its own set of tables for handling the logs.
-sqluser [ usrname ]: Specify the user to connect to the database as. The default is "webbot".
-sqlpassword [ usrpswd ]: Specify the password needed to connect to the database. The default is the empty string.
-sqlrelative [ relroot ]: If you want to make the URI entries in the database relative then you can specify the root to which they should be made relative. This can for example be used to build the database on a machine other than the one normally running the service. On heavily loaded sites, it is often a good idea to have an internal test server that can be used to build the database, as building it does take some resources.
-sqlexternals: Use this flag if you want all links that have been filtered because they didn't fulfill the constraints to be logged as well in the same table as all other URIs.
-sqlclearlinks: Clears the links table before starting the traversal.
-sqlclearrequests: Clears the requests table before starting the traversal.
-sqlclearresources: Clears the resources table before starting the traversal.
-sqlclearuris: Clears the uris table before starting the traversal.
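
A command line using these options might look like the sketch below. The text does not say which option switches SQL logging on, so this assumes that supplying the connection options is enough; the start page is a placeholder, and the leading echo keeps it a dry run.

```shell
# Placeholder start page (an assumption); substitute your own site.
START="http://www.example.org/"

# Connect with the documented defaults spelled out, and start from
# empty links and requests tables. -sqlpassword is left out because
# its default is the empty string.
set -- -n -sqlserver localhost -sqldb webbot -sqluser webbot \
       -sqlclearlinks -sqlclearrequests "$START"

# Dry run: print the command line. Drop "echo" to actually run the webbot.
echo webbot "$@"
```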

Regular Ascii Text Logging

These log files are written in plain ASCII format to local files:

-404 [ file ]: Specifies a Referer Log Format style log file of all links resulting in a 404 (Not Found) status code
-l [ file ]: Specifies a Common Log File Format style log file with a list of visited documents and the result codes obtained.
-negotiated [ file ]: Specifies a log file of all URIs that were subject to content negotiation.
-referer [ file ]: Specifies a Referer Log Format style log file of which documents point to which documents
-reject [ file ]: Specifies a log file of all the URIs encountered that didn't fulfill the constraints for traversal.
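
Several of these logs can be requested in one run. The sketch below is a dry run that asks for a visit log, a 404 log, and a reject log while checking a start page and its direct links. The file names and the start page are placeholders chosen for illustration.

```shell
# Placeholder start page (an assumption); substitute your own site.
START="http://www.example.org/"

# -l: visited documents; -404: broken links; -reject: URIs that failed
# the traversal constraints. The log file names are hypothetical.
set -- -n -q -depth 1 -l visited.log -404 notfound.log -reject rejected.log "$START"

# Dry run: print the command line. Drop "echo" to actually run the webbot.
echo webbot "$@"
```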

Distribution and Statistics Features

Note that if you are using SQL based logging then a very large set of statistics can be drawn directly from the database.

-format [ file ]

Specifies a log file of which media types (content types) were encountered in the run and their distribution

-charset [ file ]

Specifies a log file of which charsets (content type parameter) were encountered in the run and their distribution

-hit [ file ]

Specifies a log file of URIs sorted by how many times they were referenced in the run

-lm [ file ]

Specifies a log file of URIs sorted by last modified date. This gives a good overview of the dynamics of the web site that you are checking.

-rellog [ file ]

Specifies a log file of any link relationship found in the HTML LINK tag (either the REL or the REV attribute) that has the relation specified in the -relation parameter (all relations are modelled by libwww as "forward"). For example "-rellog stylesheets-logfile.txt -relation stylesheet" will produce a log file of all link relationships of type "stylesheet". The format of the log file is

"<relationship> <media type> <from-URI> --> <to-URI>"

meaning that the from-URI has the forward relationship with to-URI.
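
A line in this format can be split into its fields with the shell's read builtin, as sketched below. The sample line is invented to fit the format above, and the sketch assumes that the media type contains no spaces.

```shell
# Invented sample line in the documented format (not real webbot output).
LINE='stylesheet text/css http://www.example.org/index.html --> http://www.example.org/style.css'

# Split on whitespace: relationship, media type, from-URI, "-->", to-URI.
read -r REL MEDIA FROM ARROW TO <<EOF
$LINE
EOF

echo "relation=$REL media=$MEDIA from=$FROM to=$TO"
```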

-title [ file ]

Specifies a log file of URIs sorted by any title found either as an HTTP header or in the HTML.

Persistent Cache

The webbot can use the persistent cache while traversing the web site which may cause a significant performance optimization. These are the command line options:

-cache: Enable the libwww persistent cache
-cacheroot [ dir ]: Where should the cache be located? The default is /tmp/w3c-cache
-cache_size [ size ]: How big should the cache be, in megabytes? The default is 20
-validate: Force validation using either the etag or the last-modified date provided by the server
-endvalidate: Force end-to-end validation by adding a max-age=0 cache control directive
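
As an illustration, the dry run below enables the cache with the documented default location and size spelled out, and forces validation of cached entries. The start page is a placeholder.

```shell
# Placeholder start page (an assumption); substitute your own site.
START="http://www.example.org/"

# Enable the cache, state its location and size (in megabytes) explicitly,
# and validate cached entries against the server.
set -- -n -cache -cacheroot /tmp/w3c-cache -cache_size 20 -validate "$START"

# Dry run: print the command line. Drop "echo" to actually run the webbot.
echo webbot "$@"
```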

Other Options

-delay [ n ]: Specify the write delay in milliseconds, that is, how long we can wait before flushing the output buffer when using pipelining. The default value is 50 ms. The longer the delay, the larger the TCP packets but also the longer the response time.
-nopipe: Do not use HTTP/1.1 pipelining (but still use persistent connections). The default for this option can be set using the configure script during installation.
-single: Single threaded mode. If this flag is set then the webbot uses blocking, non-interruptible I/O in interactive mode. Non-interactive mode always uses blocking I/O.
-timeout [ n ]: Timeout in seconds on open connections. If we don't get a reply within n seconds then abort the request. The default timeout is 20 seconds.

URI

The URI is the hypertext address of the document at which you want to start the robot.

keywords

Any further command line arguments are taken as keywords. Keywords can be used as search tokens in an HTTP request-URI encoded so that all spaces are replaced with "+" and unsafe characters are encoded using the URI "%xx" escape mechanism. An example of a search query is

  webbot http://... "RECORD=ID" "par1=a" "par2=b" "par3=c" "par4=d"

Henrik Frystyk Nielsen,
@(#) $Id: CommandLine.html,v 1.28 1999/05/04 13:18:52 frystyk Exp $