W3C Log Validator

Log Validator - Manual - Creating new modules

The Log Validator's modular design

The Log Validator is a very flexible tool and most of its behaviour can be changed by leveraging its modular design: the only fixed thing in its behaviour is, in fact, that its input is a Web server log (or a simple list of URIs) and a configuration file. The rest is controlled by its process and output modules, and since such modules are easy to create and use, anyone with some knowledge of the Perl language is able to extend or re-use the Log Validator for a great variety of purposes.

In order to be able to create new modules for the Log Validator, the first important thing is to understand its modular design:

W3C::LogValidator is the Log Validator's core module. It communicates with the other modules (configuration, process and output) and routes information between them. It parses the logs, keeps the result in memory (actually in a temporary Berkeley DB file) and feeds the list of URIs to the process modules, which in turn send back their results in a particular form (a results hash). The core module then forwards the results from all the process modules to an output module and exits.
Process modules take as input a list of URIs and their number of hits, sort this list, process it according to an arbitrary algorithm, and finally send the processed subset of the list back to the core module.

Examples of Process modules:
- W3C::LogValidator::Basic: a very simple module that sorts the URIs and returns them without modification, giving the list of the most popular documents.
- W3C::LogValidator::HTMLValidator: sorts the URIs it takes as input, filters the list to only process (X)HTML documents, validates them (using an external validator via an HTTP interface) and sends back as output the list of the most popular invalid ones.
Output modules receive as input the processed sub-lists (as many as the number of process modules called in the current LogValidator session) and render or send them in a particular format.

Examples of output modules:
- W3C::LogValidator::Output::Raw: outputs a plain-text report
- W3C::LogValidator::Output::HTML: outputs an HTML report
- W3C::LogValidator::Output::Mail: sends a mail with the results to the listed maintainer of the Web site
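
The flow described above can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: Demo::Process, Demo::Output and run_session are hypothetical names, the "hits" configuration key is a stand-in for the temporary Berkeley DB file, and the exact calling sequence of the real core module may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a process module: returns a results hash.
package Demo::Process;
sub new {
    my ($class, $config) = @_;
    my $self = { CONFIG => $config || {} };
    return bless $self, $class;
}
sub process_list {
    my $self = shift;
    my %hits = %{ $self->{CONFIG}{hits} };   # stand-in for the tied DB file
    my @uris = sort { $hits{$b} <=> $hits{$a} || $a cmp $b } keys %hits;
    my %result = (
        name  => "Demo",
        intro => "Most popular documents",
        thead => [ "Hits", "Address" ],
        trows => [ map { [ $hits{$_}, $_ ] } @uris ],
        outro => "",
    );
    return %result;
}

# Hypothetical stand-in for an output module.
package Demo::Output;
sub new { my ($class) = @_; return bless {}, $class; }
sub output {
    my ($self, $results) = @_;
    my $str = "Results for module $results->{name}\n";
    $str .= join("\t", @{ $results->{thead} }) . "\n";
    $str .= join("\t", @$_) . "\n" for @{ $results->{trows} };
    return $str;
}
sub finish { my ($self, $str) = @_; print $str; }

package main;

# What the core module does, reduced to its essentials (an assumption).
sub run_session {
    my ($config, $process_classes, $output_class) = @_;
    my $out    = $output_class->new($config);
    my $report = "";
    for my $class (@$process_classes) {
        my %results = $class->new($config)->process_list;
        $report .= $out->output(\%results);
    }
    return ($out, $report);
}

my %config = (
    verbose => 0,
    hits    => {
        "http://www.example.org/foo/" => 50,
        "http://www.example.org/bar/" => 2,
    },
);
my ($out, $report) = run_session(\%config, ["Demo::Process"], "Demo::Output");
$out->finish($report);
```

The point of the sketch is the contract between the pieces: a process module is built from the configuration hash and returns a results hash, and an output module turns that hash into a string before finishing with it.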

How to Create a module

Download the stable code archive or check out the code from CVS.
If you downloaded the archive, uncompress it.
Go to the samples directory inside the uncompressed archive or inside your CVS checkout.
Open the source code of the sample modules and start editing, following this documentation.

Creating a process module

The process module receives a configuration hash. From this hash, it can extract a few things including:

the location of the temporary Berkeley DB file which has the list of URIs with their respective hits.
the parameter MaxInvalid. This parameter can be chosen by the user to limit the number of results sent back by the process Module.
the verbosity level
parameter(s) specific to this module, which the module's author is free to define. Example: the server used for validation. (Of course, if you add such parameters you should document them, so that users can set them in their configuration file.)

Once all this information has been extracted, the module processes it and creates a results hash that will be passed to the core module.

We are editing the NewModule.pm module in the samples directory.


# Copyright (c) YYYY the World Wide Web Consortium :
#       Keio University,
#      European Research Consortium for Informatics and Mathematics 
#       Massachusetts Institute of Technology.
# written by Firstname Lastname <your@address.mail> for W3C

Replace YYYY with the current year and add your name and your email address.

The following code is the beginning of the constructor for the module. You are not supposed to modify it.

#
# $Id: Manual-Modules.html,v 1.8 2006/06/29 00:46:05 ot Exp $

package W3C::LogValidator::MyProcessModule;
use strict;
use warnings;


require Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw() ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our @EXPORT = qw();
our $VERSION = '0.1';

###########################
# usual package interface #
###########################
our $verbose = 1;
our %config;

sub new
{
        my $self  = {};
        my $proto = shift;
        my $class = ref($proto) || $proto;
	# mandatory vars for the API
	$self->{URIS}   = undef;

Here you can set up your internal parameters for this module. In the following example we set up a list of known extensions for the type of document we want to process, allowing us to filter the list of URIs before processing.

	# internal stuff here
	$self->{AUTH_EXT} = ".html .xhtml .phtml .htm /";
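
How the extension list is then used to filter URIs is not shown in this manual. A minimal sketch, assuming the list is a space-separated string as above, could look like this (uri_is_authorized is a hypothetical helper, not part of the Log Validator API):

```perl
use strict;
use warnings;

# Returns true when $uri ends with one of the space-separated
# extensions in $auth_ext (the format used by $self->{AUTH_EXT}).
sub uri_is_authorized {
    my ($uri, $auth_ext) = @_;
    $uri =~ s/[?#].*//;                      # ignore query string and fragment
    for my $ext (split ' ', $auth_ext) {
        return 1 if $uri =~ /\Q$ext\E\z/;    # literal match at end of URI
    }
    return 0;
}

my $auth_ext = ".html .xhtml .phtml .htm /";
my @uris = (
    "http://www.example.org/",
    "http://www.example.org/foo.html",
    "http://www.example.org/logo.png",
    "http://www.example.org/page.htm?lang=en",
);
my @kept = grep { uri_is_authorized($_, $auth_ext) } @uris;
print "$_\n" for @kept;
```

Stripping the query string before matching is a design choice made for this sketch; a real module might prefer to look at the Content-Type of the response instead.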

This is where you start processing the configuration hash passed to you by the core module. Standard parameters include:

$config{verbose} controls the verbosity of your output.
- set to 0, your code should be quiet. This means that except for the results hash that the process_list subroutine in your module returns, there should be no output
- 1 is the default value. When the verbosity is set to 1, the module may send to the standard output basic information about the tasks it is working on, including "announcing itself".
- 2 is for verbose. The module may send detailed information about the tasks it is working on and the data it is processing.
- 3 is for debug, and if the verbosity is set at that level, the module should send the same information as it does for level 2, in addition to debug information.
$config{MaxInvalid} controls how many results at most your processing module is supposed to return. For example, if you are creating a validation module, users will configure this parameter to only receive the list of the $config{MaxInvalid} most popular invalid documents. Note that if this parameter is set to 0, your module is supposed to return a full list.
$config{tmpfile} is where the temporary Berkeley DB file with the list of URIs and their number of hits is stored. We will see later how it is used.
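
The four verbosity levels can be honoured with a small helper. The sketch below is an illustration only: message_for_level and report_progress are hypothetical names, and the messages are made up.

```perl
use strict;
use warnings;

our $verbose = 1;    # default level, as in the sample module

# Hypothetical helper: return the message when the current verbosity
# is at least $level, otherwise return an empty string.
sub message_for_level {
    my ($level, $message) = @_;
    return $verbose >= $level ? $message : "";
}

# Build everything a module would print for one URI at the current level.
sub report_progress {
    my ($uri) = @_;
    my $out = "";
    $out .= message_for_level(1, "Now using the demo module:\n");
    $out .= message_for_level(2, "processing $uri\n");
    $out .= message_for_level(3, "debug: raw entry for $uri\n");
    return $out;
}

for my $level (0 .. 3) {
    $verbose = $level;
    print "--- verbose = $level ---\n", report_progress("http://www.example.org/");
}
```

At level 0 nothing is printed, and each higher level adds its own messages to those of the level below.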

	# don't change this
        if (@_) {%config =  %{(shift)};}
	if (exists $config{verbose}) {$verbose = $config{verbose}}

Internal parameters for this module may have been configured through the configuration file, but you may want to assign default fallback values, as follows (example taken from the W3C::LogValidator::HTMLValidator code):

        $config{ValidatorHost} = "validator.w3.org" if (! exists $config{ValidatorHost});
        $config{ValidatorPort} = "80" if (!exists $config{ValidatorPort});
        $config{ValidatorString} = "/check\?uri=" if (!exists $config{ValidatorString});
        $config{ValidatorPostString} = "\;output=xml" if (!exists $config{ValidatorPostString});
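
The same fallback logic can be written as a loop over a table of defaults, which scales better when a module has many parameters. In the sketch below, apply_defaults and validator_url are hypothetical helpers. How the four parameters are combined into a request URL is not documented here, so the concatenation in validator_url is only a plausible guess.

```perl
use strict;
use warnings;

my %defaults = (
    ValidatorHost       => "validator.w3.org",
    ValidatorPort       => "80",
    ValidatorString     => "/check?uri=",
    ValidatorPostString => ";output=xml",
);

# Copy each default into the configuration unless the user already set it.
sub apply_defaults {
    my ($config, $defaults) = @_;
    for my $key (keys %$defaults) {
        $config->{$key} = $defaults->{$key} unless exists $config->{$key};
    }
    return $config;
}

# ASSUMPTION: the real module may build its request differently;
# note that the URI is not escaped in this sketch.
sub validator_url {
    my ($config, $uri) = @_;
    return "http://" . $config->{ValidatorHost} . ":" . $config->{ValidatorPort}
         . $config->{ValidatorString} . $uri . $config->{ValidatorPostString};
}

my %config = ( ValidatorHost => "localhost", ValidatorPort => "8080" );
apply_defaults(\%config, \%defaults);
print validator_url(\%config, "http://www.example.org/"), "\n";
```

Values set by the user (here the host and port) are kept, and only the missing ones are filled in.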

This ends the constructor:

	bless($self, $class);
        return $self;
}

#########################################
# Actual subroutine to check the list of uris #
#########################################

Moving on to the main subroutine of the process module. Process modules must include this process_list subroutine.

sub process_list
{
	my $self = shift;
	my $max_invalid = undef;
	if (exists $config{MaxInvalid}) {$max_invalid = $config{MaxInvalid}}

Here we have an example of getting a parameter ($config{MaxInvalid}) from the configuration hash passed to our module by the core W3C::LogValidator module.

Below is the code that handles the temporary DB file. The technique is to tie it back (read-only, only the core module is supposed to modify this) to a hash, as follows:

	print "Now Using the CHANGEME module :\n" if $verbose;
	use DB_File;                                                                  
        my $tmp_file = $config{tmpfile};
	my %hits;                                                                     
	tie (%hits, 'DB_File', "$tmp_file", O_RDONLY) ||                              
	die ("Cannot create or open $tmp_file");

You will probably want to sort the list of URIs before starting to process them. If you wish to have a sorted list by hits, you can use the following code:

	my @uris = sort { $hits{$b} <=> $hits{$a} } keys %hits;

You are now free to do whatever you want with the sorted list. Use your imagination!

You are not even limited to processing the list directly: you could send it to an external program or to a service on the Web, for example a spelling checker or a WAI validator.
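
As an illustration of what such processing might look like, here is a minimal sketch of a loop that walks the URIs by popularity and honours MaxInvalid. It uses a plain hash as a stand-in for the tied DB file, and has_problem is a made-up check standing in for whatever test your module performs.

```perl
use strict;
use warnings;

# Stand-in for the tied DB file: URI => number of hits.
my %hits = (
    "http://www.example.org/a.html" => 120,
    "http://www.example.org/b.html" => 75,
    "http://www.example.org/c.html" => 30,
    "http://www.example.org/d.html" => 4,
);

# Hypothetical check: pretend b.html and d.html have a problem.
sub has_problem {
    my ($uri) = @_;
    return $uri =~ m{/[bd]\.html\z} ? 1 : 0;
}

# Walk the URIs from most to least popular and collect problem documents,
# stopping once $max_invalid have been found (0 or undef means no limit).
sub collect_problems {
    my ($hits, $max_invalid) = @_;
    my @rows;
    for my $uri (sort { $hits->{$b} <=> $hits->{$a} } keys %$hits) {
        next unless has_problem($uri);
        push @rows, [ $hits->{$uri}, $uri ];
        last if $max_invalid && @rows >= $max_invalid;
    }
    return @rows;
}

my @rows = collect_problems(\%hits, 1);
print join("\t", @$_), "\n" for @rows;
```

Each collected row is an array reference, which is the shape expected for the rows of the results table.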

	# do what pleases you!
	print "Done!\n" if $verbose;

   untie %hits;

When you are done, you may untie the DB file and the URI/hits hash.

This subroutine's output is a hash, whose structure is shown below:


   my %returnhash;
# the name of the module
	$returnhash{"name"}="CHANGEME";                                                  
#intro string
	$returnhash{"intro"}="An intro string for the module's results";
#Headers for the result table
	@{$returnhash{"thead"}}=("Header1", "Header2", "...");
# data for the results table: one array reference per row
	@{$returnhash{"trows"}}=
	(
	 ["data1", "data2", "..."],
	 ["etc", "etc", "etc"],
	 ["etc", "etc", "etc"],
	 ["etc", "etc", "etc"]
	);
#outro string
	$returnhash{"outro"}="An outro string for the module's results. Usually the conclusion";
	return %returnhash;
}
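
Because output modules rely on the exact shape of this hash, it can be useful to check it while developing. The following sketch uses a hypothetical helper, check_results_hash. Treating "name" as mandatory and requiring every row to have as many cells as there are headers are assumptions made for the example.

```perl
use strict;
use warnings;

# Return a list of problems found in a results hash; empty list means OK.
sub check_results_hash {
    my (%results) = @_;
    my @problems;
    push @problems, "name is missing" unless defined $results{name};
    for my $key (qw(name intro outro)) {
        push @problems, "$key should be a plain string"
            if defined($results{$key}) && ref($results{$key});
    }
    for my $key (qw(thead trows)) {
        push @problems, "$key should be an array reference"
            unless ref($results{$key}) eq 'ARRAY';
    }
    return @problems if @problems;

    my $columns = @{ $results{thead} };
    for my $row (@{ $results{trows} }) {
        if (ref($row) ne 'ARRAY') {
            push @problems, "each row should be an array reference";
            next;
        }
        push @problems, "row has " . scalar(@$row) . " cells, expected $columns"
            if @$row != $columns;
    }
    return @problems;
}

my %good = (
    name  => "Demo",
    intro => "",
    thead => [ "Hits", "Address" ],
    trows => [ [ 50, "http://www.example.org/foo/" ] ],
    outro => "",
);
# An accidental extra level of nesting in thead is caught by the check.
my %bad = (
    name  => "Demo",
    thead => [ [ "Hits", "Address" ] ],
    trows => [ [ 50, "http://www.example.org/foo/" ] ],
);
print "good: ", scalar(check_results_hash(%good)), " problem(s)\n";
print "bad:  $_\n" for check_results_hash(%bad);
```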

You may of course create subroutines that will be called by the main subroutine process_list.

# internal routines
#sub foobar
#{
#   my $self = shift;
#   ...
#}

End of the code proper.

package W3C::LogValidator::CHANGEME;

1;

Do not forget to replace the relevant bits in the embedded documentation for your module.

__END__

=head1 NAME

W3C::LogValidator::CHANGEME

=head1 SYNOPSIS


=head1 DESCRIPTION

This module is part of the W3C::LogValidator suite, and ....

=head1 AUTHOR

you <your@address>

=head1 SEE ALSO

W3C::LogValidator, perl(1).
Up-to-date complete info at http://www.w3.org/QA/Tools/LogValidator/

=cut

@@ add some explanation on how to include this newly coded module to an existing Log Validator installation

Creating an output module

We are editing the NewOutputModule.pm module in the samples directory.

# Copyright (c) YYYY the World Wide Web Consortium :
#       Keio University,
#       European Research Consortium for Informatics and Mathematics
#       Massachusetts Institute of Technology.
# written by Firstname Lastname <your@address.mail> for W3C

Replace YYYY with the current year, and add your name and your email address.

package W3C::LogValidator::Output::MyOutputModule;
use strict;

Above, replace "MyOutputModule" with the actual name of your output module.

Below is the standard constructor code for the module. Do not modify it unless you really know what you are doing.

###########################
# usual package interface #
#     don't modify        #
###########################

require Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw() ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our @EXPORT = qw();
our $VERSION = '0.1';


our %config;
our $verbose = 1;

sub new
{
        my $self  = {};
        my $proto = shift;
        my $class = ref($proto) || $proto;
	# configuration for this module
	if (@_) {%config =  %{(shift)};}
	if (exists $config{verbose}) {$verbose = $config{verbose}}
        bless($self, $class);
        return $self;
}

Now we come to the part of the code you will have to modify. This code is organized into two subroutines:

sub output: creates an output string out of the results hash
sub finish: post-processes the output string if necessary

Output modules must include those two subroutines.

#############################
# first subroutine is output #
#   create output string    #
#############################

First subroutine sub output

You create the result string by using the different entries in the results hash, including:

name (of the processing module from which we got the results) [string]
intro (free text sent by the processing module) [string]
thead (the headers of the result table) [list of n strings]

example:
```
["hits", "uri", "validation_errors"]
```

trows (rows of the result table) [array of m result rows, each a list of n strings, one per header]

example:

[["50", "http://www.example.org/foo/", "12"], 
 ["2", "http://www.example.org/bar/", "42"], 
 ...]

outro (free text sent by the processing module) [string]

You are free to do whatever you want in this subroutine provided you return the string. Below is an example that concatenates all the information into one string.

sub output
{
	my $self = shift;
	my %results;
	my $outputstr ="";

# you create the result string by using the different entries 
# in the results hash, including name (of the module), intro (text)
# thead (the headers of the result table), trows (rows of the result table)
# and outro

#sample code for a full-text tabbed result table below
	if (@_) {%results = %{(shift)}}
	$outputstr= "
************************************************************************
Results for module ".$results{'name'}."
************************************************************************\n";
	$outputstr= $outputstr.$results{"intro"}."\n\n" if ($results{"intro"});
	my @thead = @{$results{"thead"}};
	while (@thead)
	{
	   my $header = shift (@thead);    
	   $outputstr= $outputstr."$header   ";
	}
	$outputstr= $outputstr."\n";
	my @trows = @{$results{"trows"}};
	while (@trows)
	{
	   my @row=@{shift (@trows)};
	   my $tcell;
	   while (@row)
	   {
	       $tcell= shift (@row);   
	       chomp $tcell;
	       $outputstr= $outputstr."$tcell   ";
	   }
	   $outputstr= $outputstr."\n";
	}
	$outputstr= $outputstr."\n";
	$outputstr= $outputstr.$results{"outro"}."
************************************************************************\n\n" if ($results{"outro"});

This is the end of the example; now we can return the string.

# the subroutine returns the output string
	return $outputstr; 
}
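
The same results hash can of course be rendered differently. As a sketch of an alternative, the standalone function below (output_aligned, a hypothetical name) pads every column to the width of its longest cell. Unlike the output subroutine of a real module, it takes only a reference to the results hash and no $self.

```perl
use strict;
use warnings;

# Alternative rendering: pad every column to the width of its longest cell.
sub output_aligned {
    my ($results) = @_;
    my @rows = ( $results->{thead}, @{ $results->{trows} } );

    # First pass: find the widest cell in each column.
    my @width;
    for my $row (@rows) {
        for my $i (0 .. $#$row) {
            my $len = length $row->[$i];
            $width[$i] = $len if !defined $width[$i] || $len > $width[$i];
        }
    }

    # Second pass: left-align each cell and strip trailing spaces.
    my $out = "Results for module $results->{name}\n";
    for my $row (@rows) {
        my @cells = map { sprintf("%-*s", $width[$_], $row->[$_]) } 0 .. $#$row;
        (my $line = join("  ", @cells)) =~ s/\s+\z//;
        $out .= "$line\n";
    }
    return $out;
}

my %results = (
    name  => "Demo",
    thead => [ "hits", "uri", "validation_errors" ],
    trows => [
        [ "50", "http://www.example.org/foo/", "12" ],
        [ "2",  "http://www.example.org/bar/", "42" ],
    ],
);
my $text = output_aligned(\%results);
print $text;
```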

The finish subroutine does whatever action is needed with the output string, such as printing it or sending it as e-mail. Note that the main module already has an option for saving to a file, so in most cases you will just have to print.

################################################################
# finish does whatever action is needed with the output string #
#   like "print" or send as e-mail or whatever you like        #
# note that for saving to file, the main module has an option  #
#               for that already, just "print"                 #
################################################################

sub finish
{
# well for this output it's not too difficult :)
	my $self = shift;
	if (@_) 
	{ 
	   my $result_string = shift;
	   print $result_string;
	}
}
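
Since finish prints rather than returns, one way to test it during development is to redirect the default output handle into a string. The sketch below assumes a minimal stand-in module, Demo::Output, with the same finish as above. capture_stdout is a hypothetical helper.

```perl
use strict;
use warnings;

# Minimal stand-in output module with the finish subroutine shown above.
package Demo::Output;
sub new { my ($class) = @_; return bless {}, $class; }
sub finish {
    my $self = shift;
    if (@_) {
        my $result_string = shift;
        print $result_string;
    }
}

package main;

# Run $code with the default print handle redirected into a string.
sub capture_stdout {
    my ($code) = @_;
    my $buffer = "";
    open(my $fh, '>', \$buffer) or die "Cannot open in-memory file: $!";
    my $old = select($fh);     # make $fh the default handle for print
    $code->();
    select($old);              # restore the previous default handle
    close($fh);
    return $buffer;
}

my $module  = Demo::Output->new;
my $printed = capture_stdout(sub { $module->finish("report text\n") });
print "finish printed: $printed";
```

This only captures output written with a plain print; a finish subroutine that writes explicitly to STDOUT or sends mail would need a different approach.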

Do not forget to replace the relevant bits in the embedded documentation for your module.


package W3C::LogValidator::Output::MyOutputModule;

1;

__END__

=head1 NAME

W3C::LogValidator::Output::MyOutputModule - Short Description


=head1 DESCRIPTION

This module is part of the W3C::LogValidator suite, and ...

=head1 AUTHOR

Firstname Lastname <your@mail.address>

=head1 SEE ALSO

W3C::LogValidator, perl(1).
Up-to-date complete info at http://www.w3.org/QA/Tools/LogValidator/

=cut