Building a new technical documentation tool chain

Writing good documentation is hard. While I happen to think that API references generated from source code can be extremely useful, they’re only part of the story, and eventually everyone needs to write something more substantial for their software. You can get away with writing HTML directly, and separately using a word processor to produce PDFs, for only so long; eventually you need a proper tool chain with the following characteristics:

  • Lets the author concentrate on content rather than style
  • Generates multiple formats from one source (HTML, PDF, man pages, HTML Help etc)
  • Does all the tedious work for you such as TOCs, cross-references, source code highlighting, footnotes
  • Is friendly to source control systems & diffs in general
  • Is standard enough that you could submit the content to a publisher if you wanted to
  • Is preferably cross-platform, standards-based and not oriented to any particular language or technology

When I came to write the OGRE manual many, many years ago, I went with Texinfo; it seemed a good idea at the time, and ticked most of the boxes above. The syntax is often a bit esoteric, and the tools used to generate output are frequently a bit flaky (texi2html has caused me many headaches over the years thanks to poorly documented breaking changes), but it worked most of the time.

I’ve been meaning to replace this tool chain with something else for new projects for a while, and DocBook sprang to mind since it’s the ‘new standard’ for technical documentation. It’s quite popular with open source projects now and it’s the preferred format for many publishers such as O’Reilly. In the short term, I want to write some developer instructions for OGRE for our future Mercurial setup, but in the long term, I’d really like a good documentation tool chain for all sorts of other purposes, and Texinfo feels increasingly unsatisfactory these days.

Having spent some time this week establishing a new working tool chain, and encountering & resolving a number of issues along the way, I thought I’d share my setup with you.

Overview

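The tool chain I’ve settled on is outlined here. I’ve concentrated on the PDF output route because this tends to be the most complex: plain text goes through AsciiDoc to become DocBook XML, Saxon (with the DocBook XSL stylesheets and XSLTHL) turns that into XSL-FO, and Apache FOP renders the XSL-FO as a PDF.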

I chose to use AsciiDoc, which was kindly suggested to me by tuan kuranes, to produce my DocBook XML from plain text. DocBook is XML, and like all XML it is verbose to edit by hand and prone to human error; in particular, DocBook has a lot of long tag names, which makes typing them tedious and inefficient. Of course, you can use GUI tools to hide these inefficiencies, such as XMLMind (free for non-commercial use) and Syntext (open source edition), or even just Eclipse or Visual Studio with the DTD hooked in for auto-completion. But there are two reasons why I chose not to do that:

  1. Maybe I’m just old-skool, but to me relying on a GUI tool to mask how awkward content is to write is the coding equivalent of ‘papering over the cracks’. I would rather resolve the fundamental problem than simply hide it from myself.
  2. Even if you hide the XML from your own eyes, it causes problems elsewhere. XML doesn’t diff very well, and XML generated from a GUI tool diffs even less well, making source control of the content less useful; change tracking is cumbersome and merging an absolute nightmare.

AsciiDoc plain text content is fast and simple to write, it’s easily readable, diffs well and doesn’t require any special editing tools. It’s similar to Texinfo, but less esoteric and the tools are far more modern.

Customising the tool chain

AsciiDoc comes with the ability to generate HTML (single or chunked) or DocBook XML directly, and you can feed the DocBook through other transformations to produce PDF or HTML Help files. I was particularly interested in the PDF route, for which AsciiDoc provides an automated tool chain called a2x. However, I found that this wasn’t good enough for my needs, because the default XSL process couldn’t handle code syntax highlighting and line numbers unless you used dblatex rather than FOP as the PDF generator, and dblatex brings its own problems. It is awkward to get working on Windows, and it also requires a LaTeX implementation, a route which seemed perverse to me since I was trying to avoid ‘old’ tools. If I was going to end up going through LaTeX I might as well just write in it in the first place! Going that way seemed a hack, so I wanted to keep the direct DocBook to PDF generation.

So, the answer was to replace a2x with my own series of steps, swapping the XSL processor that a2x uses (xsltproc) for Saxon, a Java-based XSL processor which can be extended with syntax highlighting relatively easily.

Installing the tools

I’ll go over the installation on Windows, but it’s mostly the same on Mac / Linux, and if anything slightly easier since the packages are often already installed. (Edit: I’ve since followed my own instructions here and got the same tool chain running on Mac OS X, so Linux should be fine too.) You will need:

  • AsciiDoc (I’m using 8.5.3)
  • Python (I’ve tested with 2.5 and 2.6)
  • Saxon 6.5 (the latest Saxon is 9 but that’s for XSLT 2.0; the DocBook stylesheets are XSLT 1.0 and Saxon 6.5 is the stable release for that)
  • Apache FOP (I’m using 0.95)
  • Java runtime (I’m using 1.6)
  • XSLTHL syntax highlighting implementation
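  • Technically optional, to avoid network access when processing DocBook, but IMO you want these: local copies of the DocBook XML (4.5) and DocBook XSL (1.75.2) distributions, plus the XML Commons Resolver (1.2), all of which appear in the paths used below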

Note: optionally you can also install XML tools such as xmllint to validate the XML you’re producing, but you don’t generally need to. AsciiDoc normally expects xsltproc and xmllint to be installed for a2x to work, but we won’t be using a2x.

Base Configuration

  1. Run the installers for Python and Java if you haven’t already, and make sure they’re both on your system path
  2. Extract the other archives somewhere on your disk – personally I put them all underneath a common ‘DocBook’ root folder
  3. Add the FOP root folder to your system path too – not necessarily required but it’s convenient

Create local mappings for DocBook URIs

When processing XSL, Saxon will quite happily pull DocBook definitions from the internet, but this will slow the process down, which is especially noticeable when processing small documents. So, instead I favour making the global URIs resolve to the local copies of the DocBook XML and DocBook XSL that we downloaded earlier. In Saxon, you do this via a CatalogManager.properties file which must be somewhere on your Java classpath. For me, the file looked like this:

catalogs=E:/UsefulResources/DocBook/docbook-xsl-1.75.2/catalog.xml;E:/UsefulResources/DocBook/docbook/4.5/catalog.xml
relative-catalogs=false
static-catalog=yes
catalog-class-name=org.apache.xml.resolver.Resolver
verbosity=1

Notice how I just reference the ‘catalog.xml’ files that are included in the DocBook XML and XSL distributions; these files include the mappings from the URIs to the local files. I saved this file as e:\UsefulResources\DocBook\CatalogManager.properties, and added this folder to my classpath (see ‘Running the tool chain’ below). You can skip this step if you want, but it will speed up your XSL processing considerably so it’s highly recommended.
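If you script your setup, the properties file can be generated rather than typed. The following is only a rough sketch in Python, not part of the original setup: the function name catalog_manager_properties is made up for illustration, and the keys and values simply mirror the file shown above.

```python
def catalog_manager_properties(catalog_files, verbosity=1):
    """Return CatalogManager.properties text listing the given catalog.xml files.

    Hypothetical helper: it only reproduces the keys shown in the example file.
    """
    # Forward slashes are accepted by Java on Windows and avoid backslash
    # escaping problems inside .properties files.
    joined = ";".join(str(path).replace("\\", "/") for path in catalog_files)
    lines = [
        "catalogs=" + joined,
        "relative-catalogs=false",
        "static-catalog=yes",
        "catalog-class-name=org.apache.xml.resolver.Resolver",
        "verbosity=%d" % verbosity,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example install root taken from the article; substitute your own.
    root = "E:/UsefulResources/DocBook"
    print(catalog_manager_properties([
        root + "/docbook-xsl-1.75.2/catalog.xml",
        root + "/docbook/4.5/catalog.xml",
    ]))
```

Write the returned text to a file named CatalogManager.properties in a folder that you then add to the Java classpath, as described above.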

Defining a custom style

AsciiDoc helpfully comes with some default stylesheets for PDF output, but I supplemented them to get source code highlighting and line numbering working via XSLTHL, which doesn’t work (via FOP) in the standard release. Also, I figured that I’m going to want to customise the styles eventually anyway, so I might as well know how to do it. It’s actually very easy – just define an XSL stylesheet which pulls in the existing AsciiDoc base definitions, and adds to them.

Here’s my simple stylesheet example: fo_steve.xsl.

Most of this file is simply referencing the base stylesheets, enabling syntax colouring and line numbering (where opted for), and doing a single simple style tweak to code syntax highlighting – setting keywords to be red. Obviously there’s lots more you can do, this was just a test to prove the concept works.

Running the tool chain

It might look like there are a lot of moving parts here, but it actually fits together fairly simply, especially once you’ve done the setup described above. Turning a text file into a PDF takes just 3 stages, which are easy to script, much like a2x.py does in AsciiDoc in fact, except that we’re replacing xsltproc with Saxon for greater flexibility (and it also works better on Windows). In this example, I’ll demonstrate how this works with an example file from the AsciiDoc release, ‘doc/source-highlight-filter.txt’, since it demonstrates the use of the syntax highlighter, which is what prompted my deviation from a2x in the first place.

For comparison with the final PDF linked at the bottom, here’s the AsciiDoc input file: source-highlight-filter.txt

Step 1 – run AsciiDoc to create DocBook

Assuming you’re in the AsciiDoc root folder already:


python asciidoc.py --backend docbook --doctype article --out-file doc/source-highlight-filter.xml doc/source-highlight-filter.txt

That was easy – you now have a DocBook XML file at doc/source-highlight-filter.xml containing the DocBook representation of your far more friendly text file. a2x usually runs xmllint against this to verify it, but I’m not doing that, especially because it’s not part of my install steps and these Unix-oriented tools require a separate setup process.

Step 2 – Run Saxon to turn DocBook XML into XSL-FO

This step requires quite a lot of typing, which is why you’ll almost certainly want to script it! Replace my “E:/UsefulResources/DocBook” with the real paths you used when unzipping the packages listed above.


java -cp "E:/UsefulResources/DocBook/saxon6/saxon.jar;E:/UsefulResources/DocBook/xslthl/xslthl-2.0.1.jar;E:/UsefulResources/DocBook/docbook-xsl-1.75.2/extensions/saxon65.jar;E:/UsefulResources/DocBook/xml-commons-resolver-1.2/resolver.jar;E:/UsefulResources/DocBook" -Dxslthl.config="file:///E:/UsefulResources/DocBook/docbook-xsl-1.75.2/highlighting/xslthl-config.xml" com.icl.saxon.StyleSheet -x org.apache.xml.resolver.tools.ResolvingXMLReader -y org.apache.xml.resolver.tools.ResolvingXMLReader -r org.apache.xml.resolver.tools.CatalogResolver -o doc\source-highlight-filter.fo doc\source-highlight-filter.xml E:\UsefulResources\DocBook\fo_steve.xsl

If you wanted to make the command-line simpler, you could update your CLASSPATH environment variable with the contents of the -cp option above and then omit it from the command line. Personally I’ve put this inside a batch file anyway so it doesn’t matter to me. Note: on Mac and Linux you should replace the semi-colons (‘;’) separating each element of the classpath with a colon (‘:’) instead.
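If the script is written in something portable rather than a batch file, the separator difference can be handled automatically. Here is a small sketch in Python, assuming a hypothetical helper called build_classpath; it relies on os.pathsep, which is ‘;’ on Windows and ‘:’ on Mac and Linux.

```python
import os


def build_classpath(entries, separator=os.pathsep):
    """Join classpath entries with the separator Java expects on this platform."""
    # os.pathsep is ';' on Windows and ':' on Mac and Linux, which matches
    # the classpath separator rule described above.
    return separator.join(str(entry) for entry in entries)


if __name__ == "__main__":
    # Example install root taken from the article; substitute your own.
    root = "E:/UsefulResources/DocBook"
    print(build_classpath([
        root + "/saxon6/saxon.jar",
        root + "/xslthl/xslthl-2.0.1.jar",
        root + "/docbook-xsl-1.75.2/extensions/saxon65.jar",
        root + "/xml-commons-resolver-1.2/resolver.jar",
        root,  # folder containing CatalogManager.properties
    ]))
```

The result can be passed to java after -cp, or exported as the CLASSPATH environment variable.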

I’ve called my custom stylesheet, described in the ‘Defining a custom style’ section above, “fo_steve.xsl” but of course you can call yours whatever you like and use individual ones per document if you wish.
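One way to support per-document stylesheets in a script is to look for a stylesheet next to the input file and fall back to a shared one. The sketch below assumes a naming convention of foo-style.xsl alongside foo.txt; both that convention and the function name pick_stylesheet are illustrative choices, not anything the tools require.

```python
from pathlib import Path


def pick_stylesheet(input_file, default_stylesheet):
    """Return a document-specific stylesheet if one exists, else the default.

    Assumed convention: 'foo-style.xsl' sits in the same folder as 'foo.txt'.
    """
    source = Path(input_file)
    candidate = source.with_name(source.stem + "-style.xsl")
    # Only use the per-document stylesheet when the file is really there.
    return candidate if candidate.is_file() else Path(default_stylesheet)
```

Whatever this returns is simply passed as the last argument of the Saxon call in Step 2.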

Step 3 – Run FOP to create PDF

This is the simplest of the steps: all you need to do now is run Apache FOP on the .fo file created in the last step to produce a PDF:


fop -fo doc\source-highlight-filter.fo -pdf doc\source-highlight-filter.pdf

And the final result should look something like this: source-highlight-filter.pdf
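Since all three steps are meant to be scripted, here is one possible wrapper, sketched in Python rather than as a batch file. The function names build_commands and run_pipeline are hypothetical, the folder layout under docbook_root mirrors the example paths used above, and it assumes that python, java and fop can all be found on your system path (with python being an interpreter your AsciiDoc version supports).

```python
import os
import shutil
import subprocess
from pathlib import Path

RESOLVING_READER = "org.apache.xml.resolver.tools.ResolvingXMLReader"
CATALOG_RESOLVER = "org.apache.xml.resolver.tools.CatalogResolver"


def build_commands(source, docbook_root, stylesheet,
                   python="python", asciidoc="asciidoc.py", fop="fop"):
    """Return the three command lines (as argument lists) for one document."""
    source = Path(source)
    root = Path(docbook_root)
    xml = source.with_suffix(".xml")
    fo = source.with_suffix(".fo")
    pdf = source.with_suffix(".pdf")

    # Same classpath entries as the Step 2 command; os.pathsep picks ';' or ':'.
    classpath = os.pathsep.join(str(entry) for entry in (
        root / "saxon6" / "saxon.jar",
        root / "xslthl" / "xslthl-2.0.1.jar",
        root / "docbook-xsl-1.75.2" / "extensions" / "saxon65.jar",
        root / "xml-commons-resolver-1.2" / "resolver.jar",
        root,  # folder containing CatalogManager.properties
    ))
    # XSLTHL wants a file: URI, so make the path absolute first.
    xslthl_config = (root / "docbook-xsl-1.75.2" / "highlighting"
                     / "xslthl-config.xml").resolve().as_uri()

    step1 = [python, asciidoc, "--backend", "docbook", "--doctype", "article",
             "--out-file", str(xml), str(source)]
    step2 = ["java", "-cp", classpath, "-Dxslthl.config=" + xslthl_config,
             "com.icl.saxon.StyleSheet",
             "-x", RESOLVING_READER, "-y", RESOLVING_READER,
             "-r", CATALOG_RESOLVER,
             "-o", str(fo), str(xml), str(stylesheet)]
    step3 = [fop, "-fo", str(fo), "-pdf", str(pdf)]
    return [step1, step2, step3]


def run_pipeline(source, docbook_root, stylesheet):
    """Run AsciiDoc, Saxon and FOP in order, stopping at the first failure."""
    for command in build_commands(source, docbook_root, stylesheet):
        # Look the program up on the path first; on Windows this also finds
        # script launchers such as a .bat or .cmd wrapper.
        command[0] = shutil.which(command[0]) or command[0]
        subprocess.run(command, check=True)
```

Run from the AsciiDoc root folder, a call such as run_pipeline("doc/source-highlight-filter.txt", "E:/UsefulResources/DocBook", "E:/UsefulResources/DocBook/fo_steve.xsl") would carry out the same three steps as above. Keeping command construction separate from execution makes it easy to print the commands and check them before running anything.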

Conclusion

I’m really happy with this new tool chain; the input format is really easy to write and plays nice with source control, and the output quality is extremely good, including language-sensitive code highlighting and rich style customisation. It’s also cross-platform, standards-based, and built on mature yet still modern open source components. This will definitely be the way that I write my technical documentation in future.

  • morricone

    The toolchain looks good and I like the syntax of asciidoc. But the default pdf layout hurts my eyes. Someone with Knuth’s sense of typography should take a look into that.

  • http://www.stevestreeting.com Steve

    Seriously? Personally I find default LaTeX layouts really awful and liked the standard output I got here infinitely more. But that’s taste I guess – you can ‘fix’ all that via stylesheets.

  • Nicholas

    Good stuff.

    With regard to styling, is it possible to apply style (for PDF output) without modifying the tool chain itself? (For HTML, this is independently possible via CSS.) I haven’t worked that one out and am wondering if docbook was not designed for such functionality.

  • http://www.stevestreeting.com Steve

    @Nicholas: sure, you just need to alter that fo_steve.xsl stylesheet, or reference a different one (project-specific, say). The style is applied when going from DocBook to XSL-FO, so Step 2 above. One option is for your script to look for an XSL stylesheet alongside the input text (say foo-style.xsl alongside input file foo.txt), and if it exists, use that as the style, otherwise use a standard one. Or, you could just expect a second parameter to your batch file identifying the stylesheet instead of deriving it. Or all of the above :) It can get as complex as you’re willing to script really, it’s just a case of supplying a different parameter to the Saxon call.

    Obviously because PDF is ‘baked’ rather than styled dynamically like HTML you can’t apply a different style once the PDF has been created though.

  • Nicholas

    I get it.. style (as in bold/italic, etc.) is applied by importing the docbook transforms (aka stylesheets) into a custom transform (fo_steve.xsl). I zapped over your fo_steve.xsl.txt link!

  • Stodge

    Awesome – thanks for this. May be useful to us at work…

  • http://www.stevestreeting.com Steve

    @Nicholas: yeah, the stylesheet link was buried a little there. I’ve edited that section and hopefully made it more useful.

    @Stodge: no problem!

    I successfully followed my own instructions here to get the same tool chain working on Mac OS X (I just added a couple of clarifications such as using ‘:’ instead of ‘;’ to separate classpath entries). It works like a charm! I have no reason to believe it wouldn’t work the same on Linux too.

  • http://www.overminddl1.com OvermindDL1

    I am curious, have you not looked at quickbook? It is based on docbook with lots of bug fixes, a heavy emphasis on C++, doxygen integration, PDF/HTML/others output, etc… It is part of Boost (which you are already using in OGRE), and boost has a subproject called boostbook (specialized boost formatting and such for a universal look-n-feel for all projects in boost) that you can easily rip from. They encourage its use. Have you not looked at it?

    Boost.ASIO docs use just about every possible feature if you want a good example.

  • http://www.overminddl1.com OvermindDL1

    Oh, and yes Quick/BoostBook uses a Wiki-style syntax for the initial formatting, *FAR* easier to use than XML obviously.

  • http://www.stevestreeting.com Steve

    I hadn’t seen quickbook before, was quite hard to find even in Google unless you knew it was in Boost. Looking at it, I prefer AsciiDoc; the syntax is a bit nicer and AsciiDoc appears to be more regularly maintained.

  • http://www.python-ogre.org Andy

    Did you happen to look at Sphinx at all (http://sphinx.pocoo.org/) ??

  • http://www.stevestreeting.com Steve

    No, but the output looks quite nice. Still, I do still like the ability to go via DocBook.

  • http://francisshanahan.com Francis Shanahan

    Thanks Steve, this helped a lot. I documented my setup with Graphviz and Pygment for my own future reference and some updated versions of other software here:

    http://francisshanahan.com/index.php/2010/setup-asciidoc-fop-pygment-on-windows/

    Thanks again,
    fs

  • http://www.stevestreeting.com Steve

    Thanks Francis, it’s interesting to see the Pygment route which I never got around to exploring.
