checklink: patch to add "--omit" option, to ignore certain URLs

checklink.pl lacks the ability to ignore parts of a web hierarchy.
Ignoring everything under a certain URL can be desirable when it contains a
large or infinite number of pages.  (As an example of the latter, consider
dynamically generated pages that link to other dynamically generated
pages.)  Using the --depth argument is a partial workaround, but sometimes
I wish to check every link under a hierarchy, without seeing any reports
for a certain portion of it.

The patch below adds this functionality via an --omit option to checklink.pl.
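
To illustrate the intended behavior, here is a minimal standalone sketch
of the scope decision. It is not part of the patch. The function name
in_scope, the simple prefix test standing in for checklink's own
base-location check, and the example.org URLs are all assumptions made
for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched scope check: a URL is followed
# only if it lies under $base and does not match the optional $omit regexp.
# (The real script compares against Base_Location differently; the prefix
# test here is a simplification.)
sub in_scope {
    my ($url, $base, $omit) = @_;
    return 0 unless index($url, $base) == 0;         # outside the base hierarchy
    return 0 if defined($omit) && $url =~ m/$omit/;  # excluded by --omit
    return 1;
}

my $base = 'http://www.example.org/docs/';
my $omit = '/docs/calendar/';   # unanchored, so it may match anywhere in the URL

for my $url ('http://www.example.org/docs/intro.html',
             'http://www.example.org/docs/calendar/2004/02/',
             'http://www.example.org/other/') {
    printf "%-50s %s\n", $url, in_scope($url, $base, $omit) ? 'check' : 'skip';
}
```

As in the patch, the regexp is applied without anchors, so a short pattern
such as "calendar" would omit every URL containing that string.  With the
patch applied, a run along the lines of
"checklink.pl -r --omit '/docs/calendar/' http://www.example.org/docs/"
should behave correspondingly.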

					-Michael Ernst
					 mernst@csail.mit.edu


cd ~/bin/share/
diff -u -b -r /g2/users/mernst/bin/share/checklink.pl-orig /g2/users/mernst/bin/share/checklink.pl
--- /g2/users/mernst/bin/share/checklink.pl-orig	Fri Feb  6 11:54:10 2004
+++ /g2/users/mernst/bin/share/checklink.pl	Sun Feb  8 08:57:59 2004
@@ -165,6 +165,7 @@
     User              => undef,
     Password          => undef,
     Base_Location     => '.',
+    Omit_Location     => undef,
     Masquerade        => 0,
     Masquerade_From   => '',
     Masquerade_To     => '',
@@ -356,6 +357,7 @@
              'r|recursive'     => sub { $Opts{Depth} = -1
                                           if $Opts{Depth} == 0; },
              'l|location=s'    => \$Opts{Base_Location},
+             'o|omit=s'        => \$Opts{Omit_Location},
              'u|user=s'        => \$Opts{User},
              'p|password=s'    => \$Opts{Password},
              't|timeout=i'     => \$Opts{Timeout},
@@ -414,6 +416,8 @@
                               By default, for example for
                               http://www.w3.org/TR/html4/Overview.html
                               it would be http://www.w3.org/TR/html4/
+  -o/--omit regexp            Do not check pages whose URL matches the Perl
+                              regexp.
   -n/--noacclanguage          Do not send an Accept-Language header.
   -L/--languages              Languages accepted$langs.
   -q/--quiet                  No output if no errors are found.  Implies -s.
@@ -792,6 +796,8 @@
 
   return undef if ($current eq $rel);     # Relative path not possible?
   return undef if ($rel =~ m|^(\.\.)?/|); # Relative path starts with ../ or /?
+  return undef if (defined($Opts{Omit_Location})
+                   && ($current =~ m/$Opts{Omit_Location}/));
   return 1;
 }
 
@@ -2165,6 +2171,11 @@
 L<http://www.w3.org/TR/html4/Overview.html> for example, it would be
 L<http://www.w3.org/TR/html4/>.
 
+=item B<-o, --omit regexp>
+
+Perl regexp for URLs of documents that should not be checked, even
+if they would otherwise be within scope.
+
 =item B<-n, --noacclanguage>
 
 Do not send an Accept-Language header.

Diff finished at Sun Feb  8 09:10:12

Received on Sunday, 8 February 2004 15:13:52 UTC