W3C

Transforming XHTML to LaTeX and BibTeX


$Revision: 1.23 $ of $Date: 2008/04/24 21:28:36 $

Abstract

We transform XHTML to LaTeX and BibTeX to allow technical articles to be developed using familiar XHTML authoring tools and techniques.

Introduction

Occasionally a web page turns the corner from a casually drafted idea to an article worthy of publication. Computer science conferences often require submissions using specific LaTeX styles; for example, the ISWC2004 submission instructions require that papers be formatted in the Springer style for Lecture Notes in Computer Science (LNCS). XSLT is a convenient notation to express a transformation from XHTML to LaTeX.

Tools to transform from LaTeX to HTML are commonplace, but there are far fewer to go the other way. A little bit of searching yielded some work[Gur00] that was designed to undo a transformation to XHTML. It used an odd XHTML namespace and exhibited various other quirks specific to reversing that transformation, but it provided quite a boost up the LaTeX learning curve[Mann94].

That code did not integrate with BibTeX. To take advantage of the automatic bibliography formatting traditionally provided by LaTeX styles, xh2bib.xsl was written after a bit of study of the BibTeX format[Spen98].

Together with traditional pdflatex and bibtex tools[tetex] and an XSLT processor such as xsltproc[XSLTPROC], this transformation can turn ordinary web pages with just a bit of special markup into camera-ready PDF in specialized LaTeX styles.

A Quick Example

This article demonstrates the basic features. See:

They are produced a la:

$ make Overview.pdf
xsltproc  --novalid --stringparam DocClass llncs \
  --stringparam Bib Overview --stringparam BibStyle splncs \
  --stringparam Status prepub  \
        -o Overview.tex xh2latex.xsl Overview.html
TEXINPUTS=.:../../../2004/LLCS: pdflatex  Overview.tex
This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
...
Output written on Overview.pdf (3 pages, 62474 bytes).
Transcript written on Overview.log.
xsltproc  --novalid -o Overview.bib xh2bib.xsl Overview.html
BSTINPUTS=.:../../../2004/LLCS: bibtex  Overview
This is BibTeX, Version 0.99c (Web2C 7.4.5)
The top-level auxiliary file: Overview.aux
The style file: splncs.bst
Database file #1: Overview.bib
TEXINPUTS=.:../../../2004/LLCS: pdflatex  Overview
This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
...
Output written on Overview.pdf (3 pages, 67583 bytes).
Transcript written on Overview.log.
TEXINPUTS=.:../../../2004/LLCS: pdflatex  Overview
This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
...
Output written on Overview.pdf (3 pages, 67167 bytes).
Transcript written on Overview.log.

Features

The transformation xh2latex.xsl works in the obvious way for many idioms:

Table support is limited to tables with border="1" and where all rows have the same number of cells. For example:

Name        Address       Phone
John Doe    123 High St.  555-1212
Jane Smith  456 Low St.   555-1234
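The table restriction can be illustrated with a minimal Python sketch. This is not the XSLT in xh2latex.xsl; the function name table_to_tabular, the left-aligned column spec and the un-namespaced sample markup are all assumptions made for the example. It shows why rows with differing cell counts are a problem: the tabular column spec is fixed once, from the first row.

```python
import xml.etree.ElementTree as ET

def table_to_tabular(table):
    """Render a simple table element as a LaTeX tabular environment."""
    rows = []
    for tr in table.iter("tr"):
        cells = [c for c in tr if c.tag in ("th", "td")]
        rows.append(["".join(c.itertext()).strip() for c in cells])
    if not rows:
        raise ValueError("table has no rows")
    ncols = len(rows[0])
    # The column spec is derived from the first row, so ragged rows are rejected.
    if any(len(r) != ncols for r in rows):
        raise ValueError("all rows must have the same number of cells")
    lines = ["\\begin{tabular}{|" + "l|" * ncols + "}", "\\hline"]
    for r in rows:
        lines.append(" & ".join(r) + " \\\\ \\hline")
    lines.append("\\end{tabular}")
    return "\n".join(lines)

# Hypothetical sample markup matching the table above (no XHTML namespace).
sample = ET.fromstring(
    '<table border="1">'
    '<tr><th>Name</th><th>Address</th><th>Phone</th></tr>'
    '<tr><td>John Doe</td><td>123 High St.</td><td>555-1212</td></tr>'
    '<tr><td>Jane Smith</td><td>456 Low St.</td><td>555-1234</td></tr>'
    '</table>')
print(table_to_tabular(sample))
```

Escaping of LaTeX special characters in cell text is left out of the sketch.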

Specialized markup is required for other idioms. An article.css stylesheet provides visual feedback for this special markup.

To use a LaTeX package, add a link to the head of your document a la:

  <link rel="usepackage" title="url"
    href="ftp://cam.ctan.org/tex-archive/macros/latex/contrib/misc/url.sty" />

The package name is taken from the title attribute. The href attribute is not used in the LaTeX conversion.
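As a rough illustration of that rule, the following Python sketch collects the link elements and emits one \usepackage line per title. It is a stand-in for the XSLT, not a copy of it; the function name usepackage_lines and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

# Elements in an XHTML document live in this namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"

def usepackage_lines(xhtml_source):
    """Return one usepackage line per link element with rel="usepackage"."""
    root = ET.fromstring(xhtml_source)
    lines = []
    for link in root.iter(XHTML + "link"):
        if link.get("rel") == "usepackage" and link.get("title"):
            # The package name comes from @title; @href is ignored.
            lines.append("\\usepackage{" + link.get("title") + "}")
    return lines

# Hypothetical minimal document using the link shown above.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title>
<link rel="usepackage" title="url"
  href="ftp://cam.ctan.org/tex-archive/macros/latex/contrib/misc/url.sty" />
</head><body/></html>"""

for line in usepackage_lines(sample):
    print(line)
```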

We recommend the url.sty package, per a TeX FAQ. For example: http://www.w3.org/People/Connolly/.

Front Matter

The following patterns are used to extract the title page material:

Support for WWW2006 style authors, following ACM style, is in progress.

Cross references and footnotes

The a[@rel="ref"] pattern is transformed to the LaTeX \ref{label} idiom, assuming the reference takes the form href="#label". @@needs testing

The footnote pattern is *[@class="footnote"].
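The two patterns in this section can be sketched in Python as follows. The function name inline_to_latex and the sample elements are assumptions, and the \footnote output is presumed rather than taken from the stylesheet; only the \ref{label} form is stated above.

```python
import xml.etree.ElementTree as ET

def inline_to_latex(el):
    """Map one inline element to a LaTeX fragment, or None if no rule applies."""
    href = el.get("href", "")
    # a[@rel="ref"] with href="#label" becomes \ref{label}
    if el.tag == "a" and el.get("rel") == "ref" and href.startswith("#"):
        return "\\ref{" + href[1:] + "}"
    # *[@class="footnote"]: any element name, selected by class alone
    if el.get("class") == "footnote":
        return "\\footnote{" + "".join(el.itertext()).strip() + "}"
    return None

# Hypothetical sample elements (no XHTML namespace, for brevity).
ref = ET.fromstring('<a rel="ref" href="#fig1">1</a>')
note = ET.fromstring('<span class="footnote">See the errata.</span>')
print(inline_to_latex(ref))
print(inline_to_latex(note))
```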

Figures

The div[@class="figure"] pattern is transformed to a figure environment; any div/@id is used as a figure label. The file pattern is object/@data. Figures are currently assumed to be PDF; the object/@height attribute is copied over. The caption pattern is p[@class="caption"]. @@need to test this. Be sure to include the epsfig package a la:

  <link rel="usepackage" title="epsfig" />
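A minimal Python sketch of the figure mapping follows. The use of the \epsfig command, the function name figure_to_latex, the file name arch.pdf and the height value are assumptions; the commands actually emitted by xh2latex.xsl may differ. Because the height attribute is copied over unchanged, the sample uses a value that is already a LaTeX dimension.

```python
import xml.etree.ElementTree as ET

def figure_to_latex(div):
    """Render div[@class="figure"] as a LaTeX figure environment."""
    obj = div.find("object")
    if obj is None or not obj.get("data"):
        raise ValueError("figure needs an object element with a data attribute")
    opts = "file=" + obj.get("data")
    if obj.get("height"):
        # Copied over as-is, so it must already be a valid LaTeX length.
        opts += ",height=" + obj.get("height")
    lines = ["\\begin{figure}", "\\epsfig{" + opts + "}"]
    caption = div.find("p[@class='caption']")
    if caption is not None:
        lines.append("\\caption{" + "".join(caption.itertext()).strip() + "}")
    if div.get("id"):
        # The label follows the caption so that \ref picks up the figure number.
        lines.append("\\label{" + div.get("id") + "}")
    lines.append("\\end{figure}")
    return "\n".join(lines)

# Hypothetical sample markup (no XHTML namespace, for brevity).
sample = ET.fromstring(
    '<div class="figure" id="fig1">'
    '<object data="arch.pdf" height="2in"></object>'
    '<p class="caption">System architecture</p></div>')
print(figure_to_latex(sample))
```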

Citations and Bibliography

An a element starting with an open square bracket [ is interpreted as a citation reference. The href is assumed to be a local link a la #tag.
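The citation test can be sketched in Python like this. The function name citation_to_latex is hypothetical, and the \cite output is an assumption about what the stylesheet produces.

```python
import xml.etree.ElementTree as ET

def citation_to_latex(a):
    """Return a cite command for a citation link, or None for any other link."""
    text = a.text or ""
    href = a.get("href", "")
    # A citation is a link whose text starts with "[" and whose href is local.
    if text.startswith("[") and href.startswith("#"):
        return "\\cite{" + href[1:] + "}"
    return None

# Hypothetical sample links (no XHTML namespace, for brevity).
cite = ET.fromstring('<a href="#tetex">[tetex]</a>')
plain = ET.fromstring('<a href="http://www.w3.org/">W3C</a>')
print(citation_to_latex(cite))
print(citation_to_latex(plain))
```

The first call yields a \cite for the tag tetex; the second yields None, so ordinary links are left alone.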

The pattern dl[@class="bib"] is used to find the bibliography. Each item is marked up a la:

<dt class="misc">[<a name="tetex">tetex</a>]</dt>
<dd>
<span class="author">Thomas Esser</span>
<cite><a
href="http://www.tug.org/tex-archive/help/Catalogue/entries/tetex.html"
>The TeX distribution for Unix/Linux</a></cite>
February <span class="year">2003</span>
</dd>

or

<dt class="misc" id="tetex">[tetex]</dt>
...

Note the placement of the BibTeX item type misc and the tag tetex. Keep in mind that BibTeX ignores works in the bibliography that are not cited from the body.

The xh2bib.xsl transformation turns this markup into BibTeX format. xh2latex.xsl transforms the entire bibliography dl to a \bibliography{...} reference.
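To make the mapping concrete, here is a Python sketch of the markup-to-BibTeX step. It is not the XSLT itself: the function names and the choice of fields (author, title, year) are assumptions, and the sketch expects dt and dd elements to alternate strictly.

```python
import xml.etree.ElementTree as ET

def text_of(el):
    """Concatenate all text under el and normalize whitespace."""
    return " ".join("".join(el.itertext()).split())

def dl_to_bibtex(dl):
    """Turn each dt/dd pair of a bibliography dl into a BibTeX entry."""
    entries = []
    items = list(dl)
    for dt, dd in zip(items[0::2], items[1::2]):
        # The tag comes from dt/@id or, in the older form, from dt/a/@name.
        anchor = dt.find("a[@name]")
        key = dt.get("id") or (anchor.get("name") if anchor is not None else "")
        fields = []
        for name, path in (("author", "span[@class='author']"),
                           ("title", "cite"),
                           ("year", "span[@class='year']")):
            el = dd.find(path)
            if el is not None:
                fields.append("  %s = {%s}" % (name, text_of(el)))
        # The BibTeX item type is the class of the dt element.
        entries.append("@%s{%s,\n%s\n}" % (dt.get("class"), key, ",\n".join(fields)))
    return "\n\n".join(entries)

# The example item from above, wrapped in a dl (no XHTML namespace).
sample = ET.fromstring("""<dl class="bib">
<dt class="misc">[<a name="tetex">tetex</a>]</dt>
<dd>
<span class="author">Thomas Esser</span>
<cite><a
href="http://www.tug.org/tex-archive/help/Catalogue/entries/tetex.html"
>The TeX distribution for Unix/Linux</a></cite>
February <span class="year">2003</span>
</dd>
</dl>""")
print(dl_to_bibtex(sample))
```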

Capitalization of titles seems to get mangled. I'm not sure if that's a feature of certain bibliography styles or what.

Bugs/Caveats/Misfeatures

Makefile support

Formatting a LaTeX document is done in several passes. One typical manual shows:

ucsub>  latex MyDoc.tex
ucsub>  bibtex MyDoc
ucsub>  latex MyDoc.tex
ucsub>  latex MyDoc.tex

The following excerpt from html2latex.mak shows some rules to accomplish this using make:

.html.tex:
	$(XSLTPROC) --novalid $(HLPARAMS) \
		-o $@ xh2latex.xsl $< 

.html.bib:
	$(XSLTPROC) --novalid -o $@ xh2bib.xsl $<

.tex.aux:
	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $<

.tex.bbl:
	BSTINPUTS=$(BSTINPUTS) $(BIBTEX) $*


.aux.pdf:
	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $*
	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $*

Sources:

References

[tetex]
Thomas Esser. The TeX distribution for Unix/Linux. February 2003.
[Mann94]
Shannon Mann. Beginner's LaTeX Tutorial. 1994-06-16T15:32:27.
[Spen98]
Spencer Rugaber. The Citation project. Summer 1998.
[Gur00]
Eitan M. Gurari. XSLT from XHTML+MathML to LATEX. July 19, 2000.
[XSLTPROC]
Daniel Veillard. The xsltproc tool in libxslt: The XSLT C library for Gnome 1.1.2. Dec 24 2003.