"""
bnf2turtle -- write a turtle version of an EBNF grammar
=======================================================

:Author: `Dan Connolly`_
:Version: $Revision: 1.14 $ of $Date: 2006/06/10 05:24:26 $
:Copyright: `W3C Open Source License`_ Share and enjoy.

.. _Dan Connolly: http://www.w3.org/People/Connolly/
.. _W3C Open Source License: http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231

Usage
-----

Invoke a la::

  python bnf2turtle.py foo.bnf >foo.ttl

where foo.bnf is full of lines like::

  [1] document ::= prolog element Misc*

as per the `XML formal grammar notation`_. The output is `Turtle`_,
the Terse RDF Triple Language::

  :document rdfs:label "document"; rdf:value "1";
   rdfs:comment "[1] document ::= prolog element Misc*";
   a g:NonTerminal;
    g:seq (
      :prolog
      :element
      [ g:star
        :Misc
       ]
     )
  .


.. _XML formal grammar notation: http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-notation
.. _Turtle: http://www.dajobe.org/2004/01/turtle/


Motivation
----------

Many specifications include grammars that look formal but are not
actually checked, by machine, against test data sets. Debugging the
grammar in the XML specification has been a long, tedious manual
process. Only when the loop is closed between a fully formal grammar
and a large test data set can we be confident that we have an accurate
specification of a language [#]_.


The grammar in the `N3 design note`_ has evolved based on the original
manual transcription into a python recursive-descent parser and
subsequent development of test cases. Rather than maintain the grammar
and the parser independently, our goal_ is to formalize the language
syntax sufficiently to replace the manual implementation with one
derived mechanically from the specification.


.. [#] and even then, only the syntax of the language.
.. _N3 design note: http://www.w3.org/DesignIssues/Notation3

Related Work
------------

Sean Palmer's `n3p announcement`_ demonstrated the feasibility of the
approach, though that work did not cover some aspects of N3.

In development of the `SPARQL specification`_, Eric Prud'hommeaux
developed Yacker_, which converts EBNF syntax to perl and C and C++
yacc grammars. It includes an interactive facility for checking
strings against the resulting grammars.
Yosi Scharf used it in `cwm Release 1.1.0rc1`_, which includes
a SPARQL parser that is *almost* completely mechanically generated.

The N3/turtle output from yacker is lower level than the EBNF notation
from the XML specification; it has the ?, +, and * operators compiled
down to pure context-free rules, obscuring the grammar
structure. Since that transformation is straightforwardly expressed in
semantic web rules (see bnf-rules.n3_), it seems best to keep the RDF
expression of the grammar in terms of the higher level EBNF
constructs.
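
To illustrate (this rule is hypothetical, not taken from yacker's
output), a repetition rule such as::

  names ::= name*

compiles down to pure context-free rules along the lines of::

  names ::= <empty>
  names ::= name names

which hides the original ``*``; keeping the EBNF operators in the RDF
form preserves the structure.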

.. _goal: http://www.w3.org/2002/02/mid/1086902566.21030.1479.camel@dirk;list=public-cwm-bugs
.. _n3p announcement: http://lists.w3.org/Archives/Public/public-cwm-talk/2004OctDec/0029.html
.. _Yacker: http://www.w3.org/1999/02/26-modules/User/Yacker
.. _SPARQL specification: http://www.w3.org/TR/rdf-sparql-query/
.. _Cwm Release 1.1.0rc1: http://lists.w3.org/Archives/Public/public-cwm-announce/2005JulSep/0000.html
.. _bnf-rules.n3: http://www.w3.org/2000/10/swap/grammar/bnf-rules.n3

Open Issues and Future Work
---------------------------

The yacker output also has the terminals compiled to elaborate regular
expressions. The best strategy for dealing with lexical tokens is not
yet clear. Many tokens in SPARQL are case insensitive; this is not yet
captured formally.
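One possible formalization, not adopted here, spells each
case-insensitive keyword out as a character-class pattern, e.g.::

  BASE ::= [Bb][Aa][Ss][Ee]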

The schema for the EBNF vocabulary used here (``g:seq``, ``g:alt``, ...)
is not yet published; it should be aligned with `swap/grammar/bnf`_
and the bnf2html.n3_ rules (and/or the style of linked XHTML grammar
in the SPARQL and XML specifications).

It would be interesting to corroborate the claim in the SPARQL spec
that the grammar is LL(1) with a mechanical proof based on N3 rules.

.. _swap/grammar/bnf: http://www.w3.org/2000/10/swap/grammar/bnf
.. _bnf2html.n3: http://www.w3.org/2000/10/swap/grammar/bnf2html.n3  



Background
----------

The `N3 Primer`_ by Tim Berners-Lee introduces RDF and the Semantic
web using N3, a teaching and scribbling language. Turtle is a subset
of N3 that maps directly to (and from) the standard XML syntax for
RDF.



.. _N3 Primer: http://www.w3.org/2000/10/swap/Primer.html




Colophon
--------

This document is written in ReStructuredText_. The examples
in the docstrings below are executable doctest_ unit tests.
Check them a la::

  python bnf2turtle.py --test


.. _ReStructuredText: http://docutils.sourceforge.net/docs/user/rst/quickstart.html
.. _doctest: http://www.python.org/doc/lib/module-doctest.html

"""

__version__ = "$Id: bnf2turtle.py,v 1.14 2006/06/10 05:24:26 connolly Exp $"

import re

def main():
    import sys
    if '--test' in sys.argv: _test()
    else:
        toTurtle(file(sys.argv[1]))
        

def toTurtle(lines):
    """print a turtle version of the lines of a BNF file
    """
    
    startTurtle()
    isToken = 0
    for r in eachRule(lines):
        if r == '@terminals': isToken = 1
        else:
            num, sym, expr = ruleParts(r)
            asTurtle(num, sym, expr, isToken, r)


def eachRule(lines):
    """turn an iterator over lines into an iterator over rule strings.

    a line that starts with [ or @ starts a new rule
    """
    
    r = ''
    for l in lines:
        if l.startswith("/*"): continue # whole-line comments only
        if l.startswith('[') or l.startswith('@'):
            if r: yield r
            r = l.strip()
        else:
            r += l.strip()
    if r: yield r


def ruleParts(r):
    """parse a rule into a rule number, a symbol, and an expression

    >>> ruleParts("[2]     Prolog    ::=           BaseDecl? PrefixDecl*")
    ('2', 'Prolog', (',', [('?', ('id', 'BaseDecl')), ('*', ('id', 'PrefixDecl'))]))

    """
    
    assert r.find(']') > 0
    num, r = r.split(']', 1)
    num = num[1:]
    sym, r = r.split('::=', 1)
    sym = sym.strip()
    return (num, sym, ebnf(r)[0])


def ebnf(s):
    """parse a string into an expression tree and a remaining string

    >>> ebnf("a b c")
    ((',', [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')

    >>> ebnf("a? b+ c*")
    ((',', [('?', ('id', 'a')), ('+', ('id', 'b')), ('*', ('id', 'c'))]), '')

    >>> ebnf(" | x xlist")
    (('|', [(',', []), (',', [('id', 'x'), ('id', 'xlist')])]), '')

    >>> ebnf("a | (b - c)")
    (('|', [('id', 'a'), ('-', [('id', 'b'), ('id', 'c')])]), '')

    >>> ebnf("a b | c d")
    (('|', [(',', [('id', 'a'), ('id', 'b')]), (',', [('id', 'c'), ('id', 'd')])]), '')
    
    >>> ebnf("a | b | c")
    (('|', [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')
    
    >>> ebnf("a) b c")
    (('id', 'a'), ' b c')
    
    >>> ebnf("BaseDecl? PrefixDecl*")
    ((',', [('?', ('id', 'BaseDecl')), ('*', ('id', 'PrefixDecl'))]), '')

    """

    e, s = alt(s)
    if s:
        t, ss = token(s)
        if t[0] == ')':
            return e, ss
    return e, s


def alt(s):
    """parse alt

    >>> alt("a | b | c")
    (('|', [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')

    """

    args = []
    while s:
        e, s = seq(s)
        if not e:
            if args: break
            e = (',', []) # empty sequence
        args.append(e)
        if s:
            t, ss = token(s)
            if not t[0] == '|': break
            s = ss
    if len(args) > 1:
        return ('|', args), s
    elif args:
        return args[0], s
    else:
        return (',', []), s # empty input: empty sequence


def seq(s):
    """parse seq

    >>> seq("a b c")
    ((',', [('id', 'a'), ('id', 'b'), ('id', 'c')]), '')

    >>> seq("a b? c")
    ((',', [('id', 'a'), ('?', ('id', 'b')), ('id', 'c')]), '')

    """

    args = []
    while s:
        e, ss = diff(s)
        if e:
            args.append(e)
            s = ss
        else: break
    if len(args) > 1:
        return (',', args), s
    elif len(args) == 1:
        return args[0], s
    else:
        return None, s


def diff(s):
    """parse diff

    >>> diff("a - b")
    (('-', [('id', 'a'), ('id', 'b')]), '')

    """

    e1, s = postfix(s)
    if e1:
        if s:
            t, ss = token(s)
            if t[0] == '-':
                s = ss
                e2, s = primary(s)
                if e2:
                    return ('-', [e1, e2]), s
                else:
                    raise SyntaxError
            
    return e1, s


def postfix(s):
    """parse postfix

    >>> postfix("a b c")
    (('id', 'a'), ' b c')

    >>> postfix("a? b c")
    (('?', ('id', 'a')), ' b c')
    """

    e, s = primary(s)
    if not e: return None, s

    if s:
        t, ss = token(s)
        if t[0] in '?*+':
            return (t[0], e), ss
        
    return e, s

def primary(s):
    """parse primary

    >>> primary("a b c")
    (('id', 'a'), ' b c')
    """

    t, s = token(s)
    if t[0] == 'id' or t[0] == "'" or t[0] == '[' or t[0] == '#':
        return t, s

    elif t[0] == '(':
        e, s = ebnf(s)
        return e, s

    else:
        return None, s


def token(s):
    """parse one token; return the token and the remaining string

    A token is represented as a tuple whose 1st item gives the type;
    some types have additional info in the tuple.
    
    >>> token("'abc' def")
    (("'", 'abc'), ' def')
    """
    # ' help emacs

    s = s.strip()
    if s.startswith("'"):
        l, s = s[1:].split("'", 1)
        return ("'", l), s
    elif s.startswith('"'):
        l, s = s[1:].split('"', 1)
        return ("'", l), s
    elif s.startswith("["):
        l, s = s[1:].split("]", 1)
        return ("[", l), s
    elif s.startswith("#"):
        i = re.match("\w+", s[1:]).end(0) + 1
        return (('#', s[:i]), s[i:])
    elif s[0].isalpha():
        i = re.match("\w+", s).end(0)
        return (('id', s[:i]), s[i:])
    elif s.startswith("@"):
        i = re.match("\w+", s[1:]).end(0) + 1
        return (('@', s[1:i]), s[i:])
    elif s[0] in '(?)*+|-':
        return ((s[0],) , s[1:])
    else:
        raise ValueError, "unrecognized token: %s" % s


##########
# turtle generation
#

def startTurtle():
    print "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>."
    print "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>."
    print "@prefix : <http://www.w3.org/2001/sw/DataAccess/rq23/parsers/sparql#>."
    print "@prefix g: <http://www.w3.org/2001/sw/DataAccess/rq23/parsers/grammar@@#>."
    print "#@@ case insensitivity of 'BASE' and the like is not expressed."
    
def asTurtle(num, sym, expr, isToken, orig):
    print
    print ':%s rdfs:label "%s"; rdf:value "%s";' % (sym, sym, num)
    print ' rdfs:comment "%s";' % esc(orig)
    if isToken: print " a g:Terminal;",
    else: print " a g:NonTerminal;",
    print

    ttlExpr(expr, indent='  ', obj=0)
    print "."


def ttlExpr(expr, indent, obj=1):
    op, args = expr

    if obj:
        bra = "[ "
        ket = " ]"
    else:
        bra = ket = ''

    if op == ',':
        print indent + bra + "g:seq ("
        for a in args:
            ttlExpr(a, indent + '  ')
        print indent + " )" + ket

    elif op == '|':
        print indent + bra + "g:alt ("
        for a in args:
            ttlExpr(a, indent + '  ')
        print indent + " )" + ket

    elif op == '-':
        print indent + bra + "g:diff ("
        for a in args:
            ttlExpr(a, indent + '  ')
        print indent + " )" + ket

    elif op == '?':
        print indent + bra + "g:opt "
        ttlExpr(args, indent + '  ')
        if ket: print indent + ket

    elif op == '+':
        print indent + bra + "g:rep "
        ttlExpr(args, indent + '  ')
        if ket: print indent + ket

    elif op == '*':
        print indent + bra + "g:star "
        ttlExpr(args, indent + '  ')
        if ket: print indent + ket

    elif op == 'id':
        if obj:
            print "%s:%s" % (indent, args)
        else:
            print "%sg:seq ( :%s )" % (indent, args)
    elif op == "'":
        print '%s"%s"' %(indent, esc(args))
    elif op == "[":
        print '%s%s g:matches "[%s]" %s' % (indent, bra, esc(args), ket)
    elif op == "#":
        assert not('"' in args)
        print '%s%s g:matches "\\\\x%s" %s' % (indent, bra, args[2:], ket)
    else:
        raise RuntimeError, op


def esc(st):
    r"""escape a string

    >>> esc(r'abc')
    'abc'

    >>> esc(r'[^"\\]')
    '[^\\"\\\\\\\\]'
    """
    
    if not ('"' in st or '\\' in st): return st
    s = ''
    for c in st:
        if c == '"' or c == '\\':
            s += '\\'
        s += c
    return s


def _test():
    import doctest
    doctest.testmod()

    
if __name__ == '__main__':
    main()

# $Log: bnf2turtle.py,v $
# Revision 1.14  2006/06/10 05:24:26  connolly
# handle empty alternative (at start of production, at least)
# handle /* */ comments (whole-line only)
# change unrecognized token from RuntimeError to ValueError
#
# Revision 1.13  2006/02/10 06:02:05  connolly
# updated doctest: rule labels are numerals, not ints
#
# Revision 1.12  2006/02/10 05:57:42  connolly
# document --test flag
#
# Revision 1.11  2006/02/10 05:56:23  connolly
# citing sources and such done-ish. rst sorta working
#
# Revision 1.10  2006/02/10 02:51:35  connolly
# added a license
# moved main to top
# started citing sources, trying to use rst; losing
#
# Revision 1.9  2006/02/09 23:19:14  connolly
# handle a ::= b rules
# RDF housekeeping: comment, label, namespaces
#
# Revision 1.8  2006/02/09 22:51:49  connolly
# include original rules as comments
# indentation tweaks
#
# Revision 1.7  2006/02/09 22:32:08  connolly
# fixed bug in diff
#
# Revision 1.6  2006/02/09 22:29:38  connolly
# precedence parsing unit tests pass
#
# Revision 1.5  2006/02/09 21:29:39  connolly
# docstrings with unit tests
#
# Revision 1.4  2006/02/09 21:02:07  connolly
# generates ttl that cwm parses
#
# Revision 1.3  2006/02/09 20:45:43  connolly
# walking and generating indented turtle sorta works;
# some high level turtle syntax errors to work out
#
# Revision 1.2  2006/02/09 19:23:03  connolly
# seems to be parsed
#
# Revision 1.1  2006/02/09 18:46:20  connolly
# seems to tokenize ok
#

