Xerophily: XSD Regex parser 
2008-08-08, rev. 2009-12-09

This directory contains Xerophily, a parser for regular expressions,
as they are defined by the XML Schema Definition Language (XSD),
versions 1.0 and 1.1.  (Xerophily is the property possessed by plants
well adapted to growing in dry, especially hot and dry, conditions.
It's also one of the few words I could find containing an 'X', an 'R',
and a 'P' for the keywords 'XSD', 'regex', and 'parser'.)  Xerophily
was prepared by C. M. Sperberg-McQueen, a first version in 2004 or
earlier, and the current version in May/June 2008.

Copyright in the code is held by the World Wide Web Consortium, which
licenses the code both under the W3C license and under the GNU Lesser
General Public License (LGPL).  The code in directory

  http://www.w3.org/XML/2008/03/xsdl-regex/

carries standard W3C license notices. For the LGPL version, see the
files in the subdirectory  

  http://www.w3.org/XML/2008/03/xsdl-regex/LGPL

Apart from the license differences, the two versions should be
identical.

In its current state, the parser has no user interface to speak of;
that may change.  If you are comfortable working with Prolog, have at
it.

This code has been developed and tested with SWI Prolog; most of it
should run with other Prologs, but the routines for reading schema
documents probably will need adjustment.

If SWI Prolog is in the path as swipl, and the code is in the
../xsdl-regex directory, then one simple way to run the parser on all
the patterns in a schema document testdoc.xsd, and write an annotated
version of the schema document to out.xsd, is to issue the following
command from the command line.

  swipl -f ../xsdl-regex/load.pl -g "annotate_xsd('testdoc.xsd')" -t 'halt(13)' > out.xsd

For example, the annotated version of the dummy schema document with
the regexes from the Last Call draft of 20 June 2008 was produced
using

  swipl -f load.pl \
        -g "annotate_xsd('regexes.20080620.xsd')" \
        -t 'halt(13)' \
      > regexes.20080620.annotated.xsd

Good luck.


Contents:

load.pl             Loads everything needed.  This is the file to load if
                    you want to run the parser.

readxsd.pl          Predicates for reading an XSD schema document and
                    annotating all the patterns in it with parse trees.

                    Exports the following predicates (also some others, but
                    at this point I think the others are all cruft and should 
                    be ignored):
 
                      annotate_xsd(+Filename):  reads Filename, parses all
                        patterns using the default grammar (at the moment, 
                        that's the grammar of the Last Call draft of June 2008),
                        and writes an annotated copy of the schema document
                        to stdout.

                      annotate_xsd(+Filename,+List_of_grammars):  reads Filename,
                        parses patterns using the grammars specified,
                        and writes an annotated copy of the schema document
                        to stdout.
                    
                      annotate_xsd(+Inputfile,+List_of_grammars,+Outputfile): 
                        reads Inputfile, parses patterns using the grammars 
                        specified, and writes an annotated copy of the schema document
                        to Outputfile.
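
                    The three forms might be used from the Prolog top level
                    roughly as follows.  This is a sketch, not a tested
                    session: the file names are placeholders, and it assumes
                    that the grammar names listed under g_opts.pl below are
                    written as quoted atoms.

                    ```prolog
                    % Load the parser first (path to load.pl is an assumption).
                    ?- consult('load.pl').

                    % Default grammar; annotated copy goes to stdout.
                    ?- annotate_xsd('testdoc.xsd').

                    % Two named grammars; annotated copy goes to stdout.
                    ?- annotate_xsd('testdoc.xsd', ['2E', 'W2']).

                    % Same grammars; annotated copy goes to out.xsd.
                    ?- annotate_xsd('testdoc.xsd', ['2E', 'W2'], 'out.xsd').
                    ```

                    The quotes matter: an unquoted W2 would be read as a
                    Prolog variable, and an unquoted 2E is not a valid atom.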

parseregex.pl       Predicates for running the parser on a single string.

                      regex(String,AST):  parses String using the default
                        grammar, binds AST to the corresponding abstract syntax 
                        tree.  The String may be given in several forms:  as 
                        a Prolog atom, as a double-quoted string (list of 
                        character codes), or as an SWI Prolog string.

                      regex(String,G,AST):  parses String against grammar G,
                        binds AST to the corresponding abstract syntax tree.

                      ambig(String,G,ASTs):  succeeds if String is ambiguous
                        against grammar G, binds ASTs to a list of the corresponding 
                        abstract syntax trees.

                      ambig(String,Gs,ASTs):  succeeds if String is ambiguous
                        against any grammar in the list of grammars Gs, binds ASTs to a 
                        list of grammar + AST-list pairs.

                      allparses(String,Grammars,ASTs):  parses String against the
                        grammars in Grammars, returns a list of all parse trees.

                      divergent(String,Gs,ASTs):  succeeds if String has more than
                        one parse in the grammars given in Gs; fails if all grammars
                        in Gs produce the same parse.
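
                    A few illustrative queries, assuming load.pl has been
                    consulted and that grammar names are quoted atoms.  The
                    regex strings are arbitrary examples; the shape of the
                    resulting trees is not shown because it depends on the
                    grammar chosen.

                    ```prolog
                    % Parse with the default grammar.
                    ?- regex('a|b*', AST).

                    % Parse against one named grammar.
                    ?- regex('[a-z]+', 'W2', AST).

                    % Succeeds only if the string is ambiguous in 2Epure.
                    ?- ambig('[a-b-c]', '2Epure', ASTs).

                    % Collect every parse under two grammars.
                    ?- allparses('[a-b-c]', ['2E', 'W2'], ASTs).

                    % Succeeds only if the two grammars disagree.
                    ?- divergent('[a-b-c]', ['1E', 'W2'], ASTs).
                    ```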


regex.dcg.pl        The regex grammar itself, in definite-clause grammar form.
                    Variants are included for 1.0 First Edition, PER, 
                    Second Edition, and various drafts of 1.1.

                    The start symbol is regExp(Options,Expression), where 
                    Options is a set of grammar options which determines
                    precisely which grammar is used; the g_opts module
                    can and should be used to get a useful value for Options.
                    The Expression returned is an abstract syntax tree for
                    the parse.  Various utility routines elsewhere can
                    emit this in an XML form.
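
                    Calling the start symbol directly might look like the
                    sketch below.  It uses the standard phrase/2 and the
                    default_grammar/2 predicate described under g_opts.pl.
                    It assumes that load.pl has been consulted, that
                    regExp is visible from the top level, and that the
                    grammar consumes a list of character codes; none of
                    this has been checked against the code.

                    ```prolog
                    % Fetch the options of the default grammar, turn the
                    % regex into a code list, and run the DCG over it.
                    ?- default_grammar(_G, Opts),
                       atom_codes('a|b*', Codes),
                       phrase(regExp(Opts, AST), Codes).
                    ```

                    For ordinary use, regex/2 and regex/3 in parseregex.pl
                    are the more convenient entry points.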

g_opts.pl           Utilities for managing grammar options.  The predicates most
                    likely to be useful to casual users are:

                      get_grammars(-Grammars):  unifies Grammars with a list of 
                        all the grammars known to the system.  If you want to
                        check a schema document against all known grammars, 
                        you can say get_grammars(Gs), annotate_xsd(File,Gs).

                        At the moment, the known grammars are the following.
                        The following have code embedded in the parser to
                        enforce the non-grammatical constraints expressed in the 
                        prose of the spec (disambiguation rules, etc.)

                         1E       the grammar of 1.0 1E (May 2001)
                         PER      the grammar of the Proposed Edited Recommendation
                                  of June 2004
                         2E       the grammar of 1.0 2E (November 2004)
                         D4       the XSD 1.1 draft of July 2004
                         D5       the XSD 1.1 draft of February 2005 
                         D6       the XSD 1.1 draft of January 2006
                         LC1      the XSD 1.1 last-call draft of February 2006
                         D8       the XSD 1.1 status-quo text of early 2008
                         W        a change proposal of early 2008
                         W2       the XSD 1.1 last-call draft of June 2008 

                       There are also 'pure' versions of the grammars, which 
                       do NOT enforce the non-grammatical rules in the prose.
                       These were added to make it easy to check statements like
                       "As given, the grammar is ambiguous, so the prose rule is
                       needed to disambiguate it."  They are unlikely to be useful
                       outside the Working Group.

                         1Epure 
                         PERpure 
                         2Epure 
                         D6pure 
                         LC1pure 
                         D8pure 
                         Wpure 
                         W2pure

                    The other exported predicates are useful when writing predicates
                    to use the parser in various ways:

                      grammar(G):  true if G is an atom naming a grammar.  

                      default_grammar(G): true if G is the default grammar.

                      default_grammar(G, Opts):  true if G is the default grammar
                        and Opts is the set of grammar options for G.

                      grammar_option(Options, Option):  true if Option is among
                        the grammar options given in Options.  For example, if Opts
                        is bound to the options for a given grammar, then 
                        grammar_option(Opts,vbar(V)) will bind V to the appropriate 
                        value of the vbar option.  Used only inside of grammar rules.

                      get_grammar_options(+G,-Opts):  gets the options for the 
                        given grammar.  (Could also run backwards, but why?)

                      get_grammar_options(+G,Opts,pure):  gets the options for the
                        pure variant of a grammar.
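
                    The predicates above can be combined along these lines.
                    The helper vbar_settings/1 is hypothetical and not part
                    of g_opts.pl; the sketch assumes load.pl has been
                    consulted and that grammar_option/2 can also be called
                    outside grammar rules.

                    ```prolog
                    % vbar_settings(-Pairs): hypothetical helper.
                    % Pairs each known grammar with its vbar option value.
                    vbar_settings(Pairs) :-
                        get_grammars(Gs),
                        findall(G-V,
                                ( member(G, Gs),
                                  get_grammar_options(G, Opts),
                                  grammar_option(Opts, vbar(V)) ),
                                Pairs).

                    % Options of 2E with and without the prose rules.
                    ?- get_grammar_options('2E', Opts).
                    ?- get_grammar_options('2E', PureOpts, pure).
                    ```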

ast.pl              Utilities for working with the abstract syntax trees
                    returned by the parser.

guards.pl           Routines for checking the ad-hoc restrictions on
                    regexes (i.e. restrictions not written into the grammar).
                    Called from regex.dcg.pl.

lookahead.dcg.pl    Defines lookahead/2 and lookahead/3, utilities for 
                    handling lookahead in DCGs.  Copied very directly from
                    O'Keefe, Craft of Prolog, so no W3C copyright is claimed.
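
                    For readers unfamiliar with the technique, here is a
                    minimal, self-contained sketch of lookahead in a DCG
                    using a pushback list.  The names peek, letters and
                    field are invented for this illustration; this is not
                    the definition found in lookahead.dcg.pl.

                    ```prolog
                    % peek(?X): X is the next input item; it is matched and
                    % then pushed back, so nothing is consumed.
                    peek(X), [X] --> [X].

                    % letters(-Cs): one or more lowercase letter codes.
                    letters([C|Cs]) -->
                        [C], { code_type(C, lower) }, letters_rest(Cs).

                    letters_rest([C|Cs]) -->
                        [C], { code_type(C, lower) }, letters_rest(Cs).
                    letters_rest([]) --> [].

                    % field(-Cs): letters that must be followed by a colon;
                    % the colon is checked but left in the input.
                    field(Cs) --> letters(Cs), peek(0':).

                    ?- atom_codes('abc:def', In),
                       phrase(field(F), In, Rest),
                       atom_codes(A, F),
                       atom_codes(R, Rest).
                    ```

                    The first solution binds A to abc and R to ':def',
                    showing that the colon was inspected without being
                    consumed.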

testgenerator1.pl   Generates test strings using a sort of kludged-up random
                    string generator.  Makes maketests.pl obsolete.

                      teststring(N,S):  generates a random string S with N parts.
                        Each part is randomly selected from a list of strings;
                        each string in the list matches some non-terminal in the
                        grammar, or is a terminal specifically mentioned in the
                        grammar.  This makes it easier to get 'interesting' 
                        strings than a random selection of N characters.

                      teststring(N,A,S):  the same, but also returns an atom A,
                        which is convenient for some purposes.
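
                    The idea can be sketched in a few self-contained lines.
                    This is not the code in testgenerator1.pl: the
                    predicates fragment/1, random_parts/2 and random_pick/2
                    and the short list of parts are made up for the
                    example, and random_member/2 is taken from SWI Prolog's
                    library(random).

                    ```prolog
                    :- use_module(library(random)).  % random_member/2

                    % fragment(?Atom): a hypothetical, much-reduced list of
                    % parts; each is a terminal or matches a non-terminal.
                    fragment('a').
                    fragment('[a-z]').
                    fragment('\\d').
                    fragment('|').
                    fragment('*').
                    fragment('{2,3}').
                    fragment('(').
                    fragment(')').

                    % random_parts(+N, -Atom): concatenates N fragments
                    % chosen at random (with repetition) into one atom.
                    random_parts(N, Atom) :-
                        findall(F, fragment(F), Fs),
                        length(Parts, N),
                        maplist(random_pick(Fs), Parts),
                        atomic_list_concat(Parts, Atom).

                    % random_pick(+Fs, -F): F is a random member of Fs.
                    random_pick(Fs, F) :- random_member(F, Fs).

                    ?- random_parts(5, A).
                    ```

                    Drawing whole fragments rather than single characters
                    is what makes unbalanced brackets, stray quantifiers
                    and similar edge cases turn up often.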

license.pl          Defines the W3C license (for use with the W3C-licensed
                    copy of this code), and calls :- license(w3c).
                    Currently not clear whether every file needs to have
                    license(w3c) or license(lgpl) added.

ast_dot.xsl         Stylesheet for translating from an XML dump of the AST
                    to Graphviz 'dot' notation.  Used to generate at least
                    some of the images in ./images.
                    
show-asts.xsl       Stylesheet to translate abstract syntax trees into
                    HTML using nested lists.

images              A directory with .dot and .png files showing the structure
                    of various expression types. 

maketests.pl        An early attempt at generating test strings.  The file 
                    testgenerator1.pl is later and better.

re.xml              Mind-numbingly detailed summary of all the changes 
                    ever made to the regex grammar in published drafts
                    (and a few WG-internal proposals, too).  Some
                    discussion of test case generation.  Never completely
                    finished; some rough edges remain.  Useful for the 
                    dedicated, perhaps, but only for them.

recognize.pl        Appears to be replaced by parseregex.pl. Or vice
                    versa.