Re: first questions on validator.nu

Hi,

On Dec 6, 2007, at 09:19, olivier Thereaux wrote:

> * Installation *
>
> I find the building mechanism you adopted rather fascinating. The  
> fact that the build script goes and fetch all dependencies and files  
> automatically, and starts the servlet, is great. A downside of this  
> is that the number of dependencies downloaded is huge! About half a  
> gigabyte, with a number of jars present in multiple instances (ant,  
> xerces-impl). Have you thought of a way to keep the number of jars  
> to a minimum, perhaps by renaming them and keeping them all in a  
> single directory?

I considered it very briefly and figured that by downloading from the  
original distribution sites I don't need to consider what legal or  
maintenance obligations I'd have if I distributed the third-party code  
myself. For example, I don't need to find out which packages would
require me to find the complete corresponding source code and distribute
that, too.

I did consider using Maven, but I figured that I'd run into trouble if
even one of the dependencies wasn't pre-Mavenized for me. Also, at
the time I had zero experience with Maven. I now have and, indeed, it  
is the case that if I were using Maven for the whole thing, I'd have  
to take care of packaging some of the dependencies for Maven myself.

The size issue itself I hadn't considered. So far, I haven't touched  
any disk limits on any of the systems on which I've run the build  
script. Is the on-disk footprint of the dependencies directory a  
problem? Or the download size? With the disks and bandwidth available  
today, is it a problem that is worth addressing?
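
To put numbers on the footprint question, here is a minimal sketch that totals the size of a directory tree and lists byte-identical jars. It assumes nothing about the real build script: the function names are illustrative, and the directory path you pass in is up to you.

```python
import hashlib
import os
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 16):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jar_report(root):
    """Return (total_bytes, duplicates) for all files under root.

    duplicates maps a digest to the sorted list of .jar paths sharing it;
    only digests seen more than once are included.
    """
    total = 0
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            total += os.path.getsize(path)
            if name.endswith(".jar"):
                by_digest[file_sha256(path)].append(path)
    duplicates = {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
    return total, duplicates
```

Jars that differ only in version (for example two Ant releases) will not show up as duplicates here, since the comparison is on file content.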

> * Opensourciness *
>
> In a discussion with Mike we were wondering if the tool could be  
> distributed as open source. I've seen licenses for the html5parser,  
> but apparently not for the whole. We were also wondering if the  
> dependencies were all OSS-friendly.

Disclaimer: IANAL, TINLA.

There is currently one data file that the software pulls in from the  
network at runtime (yeah, that's in itself bad) that I think might not  
be Open Source: the IANA language tag registry. I contacted the IANA  
and got permission to distribute the registry file as-is. (I
currently don't distribute it in any form.) I followed up asking about  
modifications but never got a reply. Therefore, the IANA language tag  
registry is potentially a non-Open Source invariant section. I could  
modify the software to read any designated file that is in the  
language tag registry format. However, the software performs RFC 4646  
language tag validation only if the data file is equivalent to the  
IANA registry file.
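
For readers unfamiliar with the registry file format: it is a sequence of records separated by "%%" lines, each record holding "Name: value" fields, with long values folded onto continuation lines that start with whitespace. A minimal sketch of reading such a file follows. The function names are illustrative, and this only shows reading the format; it is not the validator's actual code and does not perform RFC 4646 validation.

```python
def parse_registry(text):
    """Parse registry text into a list of records.

    Each record is a list of (field_name, field_body) pairs in file order.
    A line consisting of "%%" ends the current record. A line starting
    with whitespace continues the body of the previous field.
    """
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == "%%":
            records.append(current)
            current = []
        elif line[:1] in (" ", "\t") and current:
            name, body = current[-1]
            current[-1] = (name, body + " " + line.strip())
        elif ":" in line:
            name, _, body = line.partition(":")
            current.append((name.strip(), body.strip()))
    if current:
        records.append(current)
    return records


def subtags_of_type(records, wanted_type):
    """Return the set of lowercased Subtag values for records of wanted_type."""
    result = set()
    for record in records:
        fields = dict(record)  # keeps the last value for repeated names
        if fields.get("Type") == wanted_type and "Subtag" in fields:
            result.add(fields["Subtag"].lower())
    return result
```

Any file in this format could be fed to such a reader, which is the sense in which the data file could be made a designated, replaceable input.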

I think accepting even one invariant section is a slippery slope and a  
potential problem considering inclusion in software distributions that  
don't like slippery slopes. I don't really know what I should do about
the IANA registry file. The ideal solution would be for the IETF to
relax their licensing terms. I think using copyright to enforce the  
integrity of normative files is the wrong way to go. I think the IETF  
(and the W3C for that matter) should release their normative stuff  
Free as in Free Software and only require modified versions to carry a  
notice that they are modified versions.

Other than that, everything above the JDK should be Open Source as  
defined by OSI and Free Software as defined by the FSF. I am also  
pretty sure that all the *runtime* dependencies are also Free Software  
as defined by Debian. However, some packages used by Validator.nu have  
build/test-time dependencies that may not be Free Software in the  
Debian sense. Disclaimer: IANDD.

If you choose to use IcedTea on Linux, you should be able to make the  
stack from JDK and downwards Open Source, too. (I haven't tried  
running Validator.nu on IcedTea, but I have every reason to believe  
that it would run. I haven't tried gcj/Classpath, either, but I'm less  
confident about that option Just Working.)

As far as I can tell, one way to analyze Debian-compatibility is to
analyze GPLv3-compatibility. Here's a not-legal-advice-I-am-not-a-lawyer
overview I wrote earlier. Quoting the build script:

> dependencyPackages = [
>  ("http://www.nic.funet.fi/pub/mirrors/apache.org/commons/codec/binaries/commons-codec-1.3.zip",
> "c30c769e07339390862907504ff4b300"),

Apache License 2.0 => GPLv3-compatible
(JUnit needed for building tests for source; not needed at runtime.)

>  ("http://www.nic.funet.fi/pub/mirrors/apache.org/jakarta/httpcomponents/commons-httpclient-3.x/binary/commons-httpclient-3.1.zip",
> "1752a2dc65e2fb03d4e762a8e7a1db49"),

Apache License 2.0 => GPLv3-compatible
(JUnit needed for building tests for source; not needed at runtime.)

>  ("http://www.nic.funet.fi/pub/mirrors/apache.org/commons/logging/binaries/commons-logging-1.1.zip",
> "cc4d307492a48e27fbfeeb04d59c6578"),

Apache License 2.0 => GPLv3-compatible
(Weird build-time deps that Validator.nu doesn't need at runtime.)

>  ("http://download.icu-project.org/files/icu4j/3.6.1/icu4j_3_6_1.jar",
> "f5ffe0784a9e4c414f42d88e7f6ecefd"),

X Consortium-style => GPLv3-compatible

>  ("http://download.icu-project.org/files/icu4j/3.6.1/icu4j-charsets_3_6_1.jar",
> "0c8485bc3846fb8f243ed393f3f5b7f9"),

X Consortium-style => GPLv3-compatible

>  ("http://belnet.dl.sourceforge.net/sourceforge/jena/Jena-2.5.2.zip",
> "cd9c74f58b7175e56e3512443c84fcf8"),

3-clause BSD => GPLv3-compatible
(Build-time deps are Apache License 2.0, PD and a MIT-style one-off  
Sun license; not needed at runtime.)

Validator.nu only needs the IRI subpackage from Jena.

>  ("http://dist.codehaus.org/jetty/jetty-6.1.5/jetty-6.1.5.zip",  
> "c05153e639810c0d28a602366c69a632"),

Jetty itself is under Apache License 2.0 => GPLv3-compatible

Be warned, though, that some of the optional bits that Validator.nu  
does not use are under LGPL or the *CDDL*.

Note that the servlet API jar that comes with Jetty and that
Validator.nu needs is claimed to have CDDL bits in it. The FSF may have
GPLv3-compatible replacement files available in GNU Classpath. I bet
Debian has figured out a way to deal with the servlet API jar by now.

>  ("http://mirror.eunet.fi/apache/logging/log4j/1.2.14/logging-log4j-1.2.14.zip",
> "6c4f8da1fed407798ea0ad7984fe60db"),

Apache License 2.0 => GPLv3-compatible
(CDDL build-time deps that Validator.nu doesn't need at runtime if you
don't use logging to SMTP.)

>  ("http://mirror.eunet.fi/apache/xml/xerces-j/Xerces-J-bin.2.9.0.zip",
> "a3aece3feb68be6d319072b85ad06023"),

Apache License 2.0 + W3C Software Notice => GPLv3-compatible

>  ("http://ftp.mozilla.org/pub/mozilla.org/js/rhino1_6R5.zip",  
> "c93b6d0bb8ba83c3760efeb30525728a"),

MPL 1.1 or GPLv2 or later => GPLv3-compatible

>  ("http://download.berlios.de/jsontools/jsontools-core-1.5.jar",  
> "1f242910350f28d1ac4014928075becd"),

LGPL 2.1 or later => GPLv3-compatible
(Only used for HTML parser tests. Validator.nu doesn't need this at  
runtime.)

>  ("http://hsivonen.iki.fi/code/antlr.jar",  
> "9d2e9848c52204275c72d9d6e79f307c"),

Public Domain => GPLv3-compatible
(Only used for HTML parser tests. Validator.nu doesn't need this at  
runtime.)

>  ("http://www.cafeconleche.org/XOM/xom-1.1.jar",  
> "6b5e76db86d7ae32a451ffdb6fce0764"),

LGPL 2.1 (*no* "or later" option) => OK for Debian as standalone but  
*not* GPLv3-compatible
(Validator.nu doesn't need this at runtime. It is part of the extended  
feature set of the HTML parser and can be severed without harm to  
Validator.nu.)

>  ("http://www.slf4j.org/dist/slf4j-1.4.3.zip",  
> "5671faa7d5aecbd06d62cf91f990f80a"),

MIT => GPLv3-compatible
(Build-time deps include the concrete logging system you want to use:
log4j in this case.)

>  ("http://www.nic.funet.fi/pub/mirrors/apache.org/commons/fileupload/binaries/commons-fileupload-1.2-bin.zip",
> "6fbe6112ebb87a9087da8ca1f8d8fd6a"),

Apache License 2.0 => GPLv3-compatible

>  ("http://mirror.eunet.fi/apache/xml/xalan-j/xalan-j_2_7_0-bin.zip",  
> "ec42adbc83eb0d1354f73a600e274afe"),

Apache License 2.0 => GPLv3-compatible
It may have some Apache License 1.1 & MIT-style bits, but I don't grok  
what role those have. :-( I'm not sure what the GPLv3-compat status of  
Apache License 1.1 is.

>  ("http://mirror.eunet.fi/apache/ant/binaries/apache-ant-1.7.0-bin.zip",
> "ac30ce5b07b0018d65203fbc680968f5"),

Apache License 2.0 => GPLv3-compatible
You only need the core Ant plus Launcher to allow oNVDL+Xalan to run.  
However, Ant as a whole is likely to have *insane* build-time deps.  
Fortunately, Ant is already in Debian.

>  ("http://surfnet.dl.sourceforge.net/sourceforge/iso-relax/isorelax.20041111.zip",
> "10381903828d30e36252910679fcbab6"),

MIT => GPLv3-compatible

>  ("http://ovh.dl.sourceforge.net/sourceforge/junit/junit-4.4.jar",  
> "f852bbb2bbe0471cef8e5b833cb36078"),

CPL 1.0 => *not* GPLv3-compatible. Aargh!
Fortunately, this is not a run-time dep. An older and sufficient  
version of JUnit is already in Debian. I have to suspect this is  
against Debian's own policy, but that's not my fight. In any case, you  
could satisfy the build-time JUnit deps with stubs.

Note that I have to re-introduce a build-time dependency on MPL 1.0
code in order to sync with upstream. Debian could address this with a
one-class stub or with a quick patch in the future.
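
Each dependencyPackages entry quoted above pairs a download URL with a 32-digit hex string that looks like an MD5 digest. Here is a minimal sketch of how such pairs could be checked. The function names and the fetch callable are illustrative and are not taken from the actual build script.

```python
import hashlib


def md5_matches(data, expected_hex):
    """Return True if the MD5 digest of data (bytes) equals expected_hex."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()


def check_packages(packages, fetch):
    """Return the URLs whose fetched bytes do not match the listed MD5.

    packages is a list of (url, md5_hex) tuples shaped like the entries in
    dependencyPackages. fetch is any callable mapping a URL to the file's
    bytes; passing it in keeps the sketch independent of how the download
    is actually done.
    """
    return [url for url, md5_hex in packages
            if not md5_matches(fetch(url), md5_hex)]
```

An MD5 check of this kind guards against truncated or corrupted downloads; it says nothing about licensing, which is the topic of the annotations.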

> moduleNames = [
>  "build",

MIT => GPLv3-compatible

>  "syntax",

MIT => GPLv3-compatible

>  "util",

MIT => GPLv3-compatible

>  "htmlparser",

MIT => GPLv3-compatible

>  "xmlparser",

GPLv2 or later with Library Exception  => GPLv3-compatible

>  "onvdl",

3-clause BSD  => GPLv3-compatible

>  "validator",

MIT => GPLv3-compatible


Quoting olivier Thereaux again:

> * Running as servlet *
>
> I found the class that runs the validator per se  
> (nu.validator.servlet.Main) which indeed works nicely, but it uses  
> its own standalone server (it wraps around jetty?). It would be nice  
> to have a way to run this as a servlet from an existing jetty/tomcat/ 
> jigsaw (I'm particularly interested in running an instance on  
> jigsaw). Is that possible? I couldn't find any doc on this yet.

nu.validator.servlet.Main makes the software more debuggable and helps  
me avoid XML situps. The Main class does these things:
  1) Initializes log4j.
  2) Installs the servlet.
  3) Installs servlet filters.
  4) Runs Jetty.

The servlet class (nu.validator.servlet.VerifierServlet) itself takes  
care of its initialization in a static {} block, so it should be  
fairly easy to load it in jigsaw with whatever mechanism jigsaw uses  
if you can manage to get jigsaw to initialize log4j first and if jigsaw
doesn't do crazy class unloading and reloading that would cause the  
static{} block to run repeatedly.

To make the servlet work easily in the configurations that I've used  
it in, it does some quick and dirty sniffing of its deployment context  
in order to dispatch between the HTML5 facet and the generic facet.  
I'd be happy to make this sniffing less dirty if you have suggestions  
about what to do.

The servlet filters are optional. They implement compression of output  
(filter from Jetty) and support for form-based file upload and  
<textarea>-based input.

> * All dynamic *
>
> I see that the interface is dynamic  
> (nu.validator.servlet.FormEmitter) Would it make sense to have this  
> as a static document? Having the main interface be a dynamic  
> resource would be costly for a high-traffic service.

I have considered caching the front page in RAM, but so far generating  
it dynamically hasn't been a real problem, so I have postponed it as  
premature optimization.

> Also, the lack of static documents served along with the servlet
> means that stylesheets are hardcoded to pointing to
> http://hsivonen.iki.fi, which is subpar.

I have deliberately decided against serving any files from the disk  
through the servlet to avoid security holes and to avoid doing work  
that is best left to Apache. I think serving style sheets and scripts  
from an Apache instance responding at another host name is quite  
adequate.

The hard-coded network locations are indeed uncool. That's not the  
only instance. There are
  * URL of the style sheet
  * URL of the JavaScript file
  * URL of the about page
  * URL of the WHATWG HTML5 spec (twice)
  * URL of the IANA language tag registry
  * URL of the WHATWG wiki page for microsyntax descriptions

When I finish writing this email, I'll start parametrizing these.
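
One possible shape for that parametrization, sketched on the build-script side: keep defaults in one table and hand them to the JVM as -D system properties, so a local installation can override any of them. Every property name and URL below is a hypothetical placeholder, not an actual Validator.nu setting.

```python
# Hypothetical defaults; the keys and values are placeholders only.
DEFAULTS = {
    "example.style-sheet-url": "http://static.example.org/style.css",
    "example.script-url": "http://static.example.org/script.js",
    "example.about-page-url": "http://www.example.org/about.html",
}


def java_property_args(overrides=None):
    """Return sorted -Dkey=value arguments, with overrides replacing defaults."""
    settings = dict(DEFAULTS)
    settings.update(overrides or {})
    return ["-D%s=%s" % (key, settings[key]) for key in sorted(settings)]
```

The resulting list could be spliced into the java command line that launches the servlet, with the Java side reading each value through System.getProperty.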

> * Code review / doc-a-thon *
>
> Have you done code reviews of the validator in the past?

The (X)HTML5 schema has gotten code review from fantasai. The Java
code has not been subjected to review (before you took a look now).

> A webcast or teleconference-based intro would be very cool, it would  
> be a great way to present the features and quite likely help write  
> doc on the fly / interest people in participating in code. I'm sure  
> we could organize something. Would you be interested?

I'm interested in showing people around the code and/or answering  
questions over a telecon. However, I think I won't be able to write  
docs in real time during a telecon without wasting other people's time.

> * SVG validation *
>
> knowing that validator.nu had RNG and nvdl capabilities, I was  
> particularly interested in seeing how it worked with SVG. I haven't  
> had time to extensively test with a lot of SVG.

I haven't tested the SVG schema a lot, either. Moreover, I haven't  
tested the NVDL part of oNVDL at all yet.

> I noticed it only has one SVG schema.

It's the one from Relaxed without any improvements or testing on my  
part. Eventually, I'd like to improve SVG validation.

> I wonder if it would be possible to preparse documents with SVG  
> media type and look for version and baseprofile attributes on the  
> root element, switching between SVG tiny, basic and full based on  
> that.

I'm not sure if NVDL could already do that. Currently, the automatic  
schema selection for XML documents is based on the root element  
namespace. This is implemented in  
nu.validator.servlet.BufferingRootNamespaceSniffer. Since the SVG  
baseProfile and version attributes occur on the root element if  
present, it would be relatively easy to make the sniffer sniff those  
attributes as well and to extend the schema selection mechanism to  
label schemas not only with the namespace URI but with other data as  
well.
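
To illustrate the idea (in Python for brevity; this is not the Java code in BufferingRootNamespaceSniffer), here is a minimal sketch that reads only the root start tag and selects a schema label keyed by namespace, version and baseProfile. The schema labels and the lookup table are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical schema labels keyed by (namespace, version, baseProfile).
# The (namespace, None, None) entry is the namespace-only fallback.
SCHEMAS = {
    (SVG_NS, "1.1", "full"): "svg11-full",
    (SVG_NS, "1.1", "basic"): "svg11-basic",
    (SVG_NS, "1.1", "tiny"): "svg11-tiny",
    (SVG_NS, None, None): "svg-default",
}


def sniff_root(xml_bytes):
    """Return (namespace, version, baseProfile) of the root element.

    Only the first start event is consumed. Missing values are None.
    """
    source = io.BytesIO(xml_bytes)
    for _event, elem in ET.iterparse(source, events=("start",)):
        tag = elem.tag
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else None
        return namespace, elem.get("version"), elem.get("baseProfile")
    return None, None, None


def select_schema(xml_bytes):
    """Pick a schema label, falling back to the namespace-only entry."""
    namespace, version, profile = sniff_root(xml_bytes)
    return SCHEMAS.get((namespace, version, profile),
                       SCHEMAS.get((namespace, None, None)))
```

Documents that omit version or baseProfile fall through to the namespace-only entry, which matches the current namespace-based behavior.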

I think SVG profiling is a spec misfeature, though, so I haven't been  
particularly keen on developing support for it. However, I can see why  
a W3C installation might want to support it, so I'd accept a [suitably  
licensed] patch.

Ideally, I'd like to see a single unified SVG spec whose latest  
version the validator would track. That is, I'd prefer an "SVG5"  
schema over profiles.

> I tried validating an SVG document with a number of foreign  
> namespace content in it (typical sodipodi/inkscape output) and found  
> that the validator.nu complained about these. Is it on purpose?

It's on purpose in the sense that, by design, RELAX NG schemas are  
white lists--not black lists. It isn't on purpose in the sense that I  
haven't really reviewed the SVG schema yet.

> I've heard a lot of arguments in favor of dropping anything in  
> namespaces not known by the validator, and/or using nvdl to validate  
> foreign namespace fragments. Is that something validator.nu can do,  
> is it planned? I'm certain this would be a fantastic tool for the  
> adoption of SVG.

It might be that the NVDL part of oNVDL allows something like this  
already. I haven't investigated yet. I have considered this issue,  
though, and I am planning on punching a hole in the XHTML5 schema for  
embedded RDF, for example. I guess the SVG schema should get that  
hole, too. I have also considered an option to filter out unknown  
namespaces, but I'm not sure if it is good to open such "anything  
goes" holes in a validator.
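
For concreteness, here is a minimal sketch of what a filter for unknown namespaces could look like, written in Python rather than as a Java SAX filter. The function names are illustrative, and whether such a filter belongs in a validator at all is the open question.

```python
import xml.etree.ElementTree as ET


def namespace_of(tag):
    """Return the namespace URI from an ElementTree tag, or "" if none."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def strip_unknown(elem, known):
    """Remove, in place, descendant elements and attributes of elem whose
    namespace is not in known. Unqualified attributes are always kept.
    Removing an element also drops the text that directly follows it."""
    for name in list(elem.attrib):
        if name.startswith("{") and namespace_of(name) not in known:
            del elem.attrib[name]
    for child in list(elem):
        if namespace_of(child.tag) not in known:
            elem.remove(child)
        else:
            strip_unknown(child, known)
```

Calling strip_unknown(root, {"http://www.w3.org/2000/svg"}) on a parsed document would leave only SVG elements, SVG-namespaced attributes and unqualified attributes for the schema to see.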

Thank you for your comments.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Friday, 7 December 2007 11:43:34 UTC