We transform XHTML to LaTeX and BibTeX to allow technical articles to be developed using familiar XHTML authoring tools and techniques.
Occasionally a web page turns the corner from a casually drafted idea to an article worthy of publication. Computer science conferences often require submissions using specific LaTeX styles; for example, the ISWC2004 submission instructions require that submitted papers be formatted in the Springer style for Lecture Notes in Computer Science (LNCS). XSLT is a convenient notation to express a transformation from XHTML to LaTeX.
Tools to transform from LaTeX to HTML are commonplace, but there are far fewer to go the other way. A little bit of searching yielded some work[Gur00] that was designed to undo a transformation to XHTML. It used an odd XHTML namespace and exhibited various other quirks specific to reversing that transformation, but it provided quite a boost up the LaTeX learning curve[Mann94].
That code did not integrate with BibTeX. To take advantage of the automatic bibliography formatting traditionally provided by LaTeX styles, xh2bib.xsl was born after a bit of study of the BibTeX format[Spen98].
Together with the traditional pdflatex and bibtex tools[tetex] and an XSLT processor such as xsltproc[XSLTPROC], this transformation can turn ordinary web pages with just a bit of special markup into camera-ready PDF in specialized LaTeX styles.
This article demonstrates the basic features. See:
They are produced a la:
    $ make Overview.pdf
    xsltproc --novalid --stringparam DocClass llncs \
        --stringparam Bib Overview --stringparam BibStyle splncs \
        --stringparam Status prepub \
        -o Overview.tex xh2latex.xsl Overview.html
    TEXINPUTS=.:../../../2004/LLCS: pdflatex Overview.tex
    This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
    ...
    Output written on Overview.pdf (3 pages, 62474 bytes).
    Transcript written on Overview.log.
    xsltproc --novalid -o Overview.bib xh2bib.xsl Overview.html
    BSTINPUTS=.:../../../2004/LLCS: bibtex Overview
    This is BibTeX, Version 0.99c (Web2C 7.4.5)
    The top-level auxiliary file: Overview.aux
    The style file: splncs.bst
    Database file #1: Overview.bib
    TEXINPUTS=.:../../../2004/LLCS: pdflatex Overview
    This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
    ...
    Output written on Overview.pdf (3 pages, 67583 bytes).
    Transcript written on Overview.log.
    TEXINPUTS=.:../../../2004/LLCS: pdflatex Overview
    This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
    ...
    Output written on Overview.pdf (3 pages, 67167 bytes).
    Transcript written on Overview.log.
The transformation xh2latex.xsl works in the obvious way for many idioms:
Table support is limited to tables with border="1" and where all rows have the same number of cells. For example:
Name | Address | Phone |
---|---|---|
John Doe | 123 High St. | 555-1212 |
Jane Smith | 456 Low St. | 555-1234 |
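To make the table restriction concrete, here is a rough Python sketch (not the XSLT in xh2latex.xsl) that checks the two conditions and emits a tabular environment. The function name table_to_tabular, the all-left-aligned column spec, and the use of \hline are assumptions made for illustration; the output of the actual stylesheet may differ.

```python
import xml.etree.ElementTree as ET

def table_to_tabular(table):
    """Convert a simple XHTML table element to a LaTeX tabular string.

    Hypothetical helper: only accepts border="1" tables whose rows all
    have the same number of cells, mirroring the restriction above.
    """
    rows = [
        ["".join(cell.itertext()).strip()
         for cell in tr if cell.tag in ("th", "td")]
        for tr in table.iter("tr")
    ]
    widths = {len(row) for row in rows}
    if table.get("border") != "1" or len(widths) != 1:
        raise ValueError("unsupported table")
    ncols = widths.pop()
    # Assumed layout: left-aligned columns with rules everywhere.
    lines = ["\\begin{tabular}{|" + "l|" * ncols + "}", "\\hline"]
    for row in rows:
        lines.append(" & ".join(row) + " \\\\ \\hline")
    lines.append("\\end{tabular}")
    return "\n".join(lines)

SAMPLE_TABLE = """<table border="1">
<tr><th>Name</th><th>Address</th><th>Phone</th></tr>
<tr><td>John Doe</td><td>123 High St.</td><td>555-1212</td></tr>
<tr><td>Jane Smith</td><td>456 Low St.</td><td>555-1234</td></tr>
</table>"""

print(table_to_tabular(ET.fromstring(SAMPLE_TABLE)))
```

The sketch does no escaping of characters that are special in LaTeX (such as & or %), so it is only suitable for plain cell text like the example.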
Specialized markup is required for other idioms. An article.css stylesheet provides visual feedback for this special markup.
To use a LaTeX package, add a link to the head of your document a la:
<link rel="usepackage" title="url" href="ftp://cam.ctan.org/tex-archive/macros/latex/contrib/misc/url.sty" />
The package name is taken from the title attribute. The href attribute is not used in the LaTeX conversion.
We recommend the url.sty package, per a TeX FAQ. For example: http://www.w3.org/People/Connolly/.
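The package convention can be sketched in a few lines of Python. This is an illustration of the rule described above, not the code in xh2latex.xsl; the function name usepackage_lines and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

def usepackage_lines(xhtml_text):
    """Return one usepackage line per link rel="usepackage" element."""
    root = ET.fromstring(xhtml_text)
    lines = []
    for link in root.iter("{%s}link" % XHTML):
        if link.get("rel") == "usepackage" and link.get("title"):
            # Only the title is used; the href is ignored.
            lines.append("\\usepackage{%s}" % link.get("title"))
    return lines

SAMPLE_DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Example</title>
  <link rel="stylesheet" href="article.css" />
  <link rel="usepackage" title="url"
        href="ftp://cam.ctan.org/tex-archive/macros/latex/contrib/misc/url.sty" />
  <link rel="usepackage" title="epsfig" />
</head>
<body><p>...</p></body>
</html>"""

print("\n".join(usepackage_lines(SAMPLE_DOC)))
```

Links with any other rel value, such as the stylesheet link in the sample, are skipped.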
The following patterns are used to extract the title page material:
Support for WWW2006 style authors, following ACM style, is in progress.
The a[@rel="ref"] pattern is transformed to the LaTeX \ref{label} idiom, assuming the reference takes the form href="#label". @@needs testing
The footnote pattern is *[@class="footnote"].
The div[@class="figure"] pattern is transformed to a figure environment; any div/@id is used as a figure label. The file pattern is object/@data. Figures are currently assumed to be PDF; the object/@height attribute is copied over. The caption pattern is p[@class="caption"]. @@need to test this. Be sure to include the epsfig package a la:
<link rel="usepackage" title="epsfig" />
An a element whose content starts with an open square bracket [ is interpreted as a citation reference. The href is assumed to be a local link a la #tag.
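The two link patterns described so far (rel="ref" for cross-references and a leading [ for citations) amount to a small dispatch on the a element. The Python sketch below illustrates that logic; it is not the stylesheet's own code, the function name anchor_to_latex is hypothetical, and emitting \cite{tag} for citations is an assumption based on the usual LaTeX idiom.

```python
import xml.etree.ElementTree as ET

def anchor_to_latex(a):
    """Map an XHTML a element to a LaTeX command, or None if no pattern applies."""
    href = a.get("href", "")
    text = "".join(a.itertext())
    if not href.startswith("#"):
        return None  # only local links of the form #label are handled
    target = href[1:]
    if a.get("rel") == "ref":
        return "\\ref{%s}" % target
    if text.startswith("["):
        # Assumption: citations become the standard LaTeX cite command.
        return "\\cite{%s}" % target
    return None

cite = ET.fromstring('<a href="#tetex">[tetex]</a>')
ref = ET.fromstring('<a rel="ref" href="#fig1">Figure 1</a>')
plain = ET.fromstring('<a href="http://www.w3.org/">W3C</a>')

print(anchor_to_latex(cite))   # \cite{tetex}
print(anchor_to_latex(ref))    # \ref{fig1}
print(anchor_to_latex(plain))  # None
```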
The pattern dl/@class="bib" is used to find the bibliography. Each item is marked up a la:
    <dt class="misc">[<a name="tetex">tetex</a>]</dt>
    <dd>
      <span class="author">Thomas Esser</span>
      <cite><a href="http://www.tug.org/tex-archive/help/Catalogue/entries/tetex.html"
      >The TeX distribution for Unix/Linux</a></cite>
      February <span class="year">2003</span>
    </dd>
or
<dt class="misc" id="tetex">[tetex]</dt> ...
Note the placement of the BibTeX item type misc and the tag tetex, and keep in mind that BibTeX ignores works in the bibliography that are not cited from the body.
The xh2bib.xsl transformation turns this markup into BibTeX format. xh2latex.xsl transforms the entire bibliography dl to a \bibliography{...} reference.
Capitalization of titles seems to get mangled. I'm not sure if that's a feature of certain bibliography styles or what.
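To show how the bibliography markup lines up with a BibTeX entry, here is a Python sketch of the mapping. It is an illustration only, not the XSLT in xh2bib.xsl: the function name bibtex_entries is hypothetical, the choice of author, title, and year fields is an assumption, and the sketch assumes that dt and dd elements strictly alternate inside the dl.

```python
import xml.etree.ElementTree as ET

def text_of(parent, path):
    """Return the stripped text content of the first match, or None."""
    node = parent.find(path)
    return "".join(node.itertext()).strip() if node is not None else None

def bibtex_entries(dl):
    """Turn each dt/dd pair of a dl class="bib" element into a BibTeX entry."""
    children = list(dl)
    entries = []
    for dt, dd in zip(children[0::2], children[1::2]):
        entry_type = dt.get("class")          # e.g. misc
        anchor = dt.find("a")
        # The tag comes from dt/@id or, failing that, from dt/a/@name.
        key = dt.get("id") or (anchor.get("name") if anchor is not None else None)
        fields = [
            ("author", text_of(dd, ".//span[@class='author']")),
            ("title", text_of(dd, ".//cite")),
            ("year", text_of(dd, ".//span[@class='year']")),
        ]
        body = ",\n".join("  %s = {%s}" % (k, v) for k, v in fields if v)
        entries.append("@%s{%s,\n%s\n}" % (entry_type, key, body))
    return entries

SAMPLE_BIB = """<dl class="bib">
<dt class="misc">[<a name="tetex">tetex</a>]</dt>
<dd>
  <span class="author">Thomas Esser</span>
  <cite><a href="http://www.tug.org/tex-archive/help/Catalogue/entries/tetex.html"
  >The TeX distribution for Unix/Linux</a></cite>
  February <span class="year">2003</span>
</dd>
</dl>"""

print("\n\n".join(bibtex_entries(ET.fromstring(SAMPLE_BIB))))
```

The same function accepts the second form of the markup, where the tag is given as an id attribute on the dt element.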
Formatting a LaTeX document is done in several passes. One typical manual shows:
    ucsub> latex MyDoc.tex
    ucsub> bibtex MyDoc
    ucsub> latex MyDoc.tex
    ucsub> latex MyDoc.tex
The following excerpt from html2latex.mak shows some rules to accomplish this using make:
    .html.tex:
    	$(XSLTPROC) --novalid $(HLPARAMS) \
    		-o $@ xh2latex.xsl $<
    .html.bib:
    	$(XSLTPROC) --novalid -o $@ xh2bib.xsl $<
    .tex.aux:
    	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $<
    .tex.bbl:
    	BSTINPUTS=$(BSTINPUTS) $(BIBTEX) $*
    .aux.pdf:
    	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $*
    	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $*
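For readers who would rather drive the passes from a script than from make, the same sequence can be written down as a list of commands. The Python sketch below only builds and prints the command lines; the function name build_commands and its defaults are hypothetical, and the TEXINPUTS and BSTINPUTS settings and the Status parameter from the transcript are left out for brevity. The first pdflatex run records the citations in the .aux file, bibtex turns them into a .bbl file, and the last two runs pull in the bibliography and settle the references.

```python
def build_commands(stem, doc_class="llncs", bib_style="splncs"):
    """Return the command lines, in order, that take stem.html to stem.pdf."""
    html = stem + ".html"
    return [
        ["xsltproc", "--novalid",
         "--stringparam", "DocClass", doc_class,
         "--stringparam", "Bib", stem,
         "--stringparam", "BibStyle", bib_style,
         "-o", stem + ".tex", "xh2latex.xsl", html],
        ["pdflatex", stem + ".tex"],   # pass 1: writes citations to the .aux file
        ["xsltproc", "--novalid", "-o", stem + ".bib", "xh2bib.xsl", html],
        ["bibtex", stem],              # formats the cited entries into a .bbl file
        ["pdflatex", stem],            # pass 2: reads the .bbl file
        ["pdflatex", stem],            # pass 3: resolves remaining references
    ]

for command in build_commands("Overview"):
    print(" ".join(command))
```

Each inner list could be handed to subprocess.run as is, once the working directory contains the two stylesheets and the environment has the search paths set.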
Sources: