How to Create an HTML Dump of MediaWiki

If you will be traveling and need offline access to your MediaWiki wiki, what should you do?

If you need to grab pages from a wiki that you aren’t the administrator of, you can try running a web crawler on it or try this Google Gears hack.
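For the web crawler route, here is a minimal sketch using wget. The URL and output folder are placeholders I made up, not a real wiki. The function only prints the command unless RUN=1 is set, so you can review it before crawling someone else's server. Keep in mind that a MediaWiki site exposes many edit and history URLs, so an unrestricted crawl can take a long time.

```shell
#!/bin/sh
# Mirror a wiki you do not administer, for offline reading.
# Usage: mirror_wiki URL OUT_DIR
mirror_wiki() {
    url=$1
    out_dir=$2
    # Dry run by default: print the command instead of executing it.
    if [ "${RUN:-0}" = 1 ]; then runner=""; else runner="echo"; fi
    # --mirror: recursive download with timestamping
    # --convert-links / --adjust-extension: make the saved pages browsable offline
    # --page-requisites: also fetch the CSS and images each page needs
    # --no-parent: stay below the starting URL; --wait=1: pause 1s between requests
    $runner wget --mirror --convert-links --adjust-extension \
        --page-requisites --no-parent --wait=1 \
        --directory-prefix="$out_dir" "$url"
}

# Hypothetical wiki URL and destination folder.
mirror_wiki "http://wiki.example.org/" "./wiki-mirror"
```

Run the script with RUN=1 in the environment once the printed command looks right.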

But if you are the administrator of the wiki (or you know the admin), you can make a static HTML dump of it. A MediaWiki extension, DumpHTML, does it for you. Here’s how to run it:

Fetch the DumpHTML extension with shell commands like so:

cd /whatever/mediawiki/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/DumpHTML

Run a shell script something like this as a cron job (create the appropriate folders first):

#!/bin/sh
# Generate a new html dump of wiki.orbswarm.com LCS 8-17-08

echo "deleting contents of /home/swarm/wiki.orbswarm.com-html"
rm -rf /home/swarm/wiki.orbswarm.com-html

# dumpHTML.php expects to be run from the directory it lives in. The skin won't get HTMLified if you run it from another directory
cd /home/swarm/wiki.orbswarm.com/extensions/DumpHTML
/home/swarm/php5/bin/php dumpHTML.php -d /home/swarm/wiki.orbswarm.com-html -k monobook --image-snapshot --force-copy

echo "deleting /home/swarm/wiki.orbswarm.com/offline/*"
rm -rf /home/swarm/wiki.orbswarm.com/offline/*

/bin/tar -czf /home/swarm/wiki.orbswarm.com/offline/swarm-wiki-html.tar.gz /home/swarm/wiki.orbswarm.com-html/
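
To run a script like the one above once a day, it needs a crontab entry. Here is a rough sketch, assuming the script has been saved as /home/swarm/bin/dump-wiki-html.sh (a made-up path) and that 03:15 is a quiet time on the server. It only prints the proposed crontab, so nothing changes until you pipe the output into crontab yourself.

```shell
#!/bin/sh
# Hypothetical location of the dump script; adjust to wherever you saved it.
DUMP_SCRIPT=/home/swarm/bin/dump-wiki-html.sh

# Fields: minute hour day-of-month month day-of-week command.
# This entry runs the script every day at 03:15.
CRON_LINE="15 3 * * * $DUMP_SCRIPT"

# Print the current crontab (if any) followed by the new entry.
new_crontab() {
    # Having no crontab yet is not an error, hence "|| true".
    crontab -l 2>/dev/null || true
    echo "$CRON_LINE"
}

new_crontab
```

If the printed result looks right, installing it is a matter of piping the function's output into "crontab -".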

The way the above script is set up, a fresh .tar.gz file is placed in a web-accessible folder every day. I can then download it before I go on my trip.
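
Once the archive is on the laptop, it still has to be unpacked. Below is a small sketch of that step. The function name and the destination folder are my own inventions, and it assumes the dump contains an index.html start page. Because tar was given an absolute path, the files are stored under home/swarm/wiki.orbswarm.com-html/ inside the archive, so the function searches for the start page instead of hard-coding where it ends up.

```shell
#!/bin/sh
# Unpack a downloaded wiki dump and print the path of a start page.
# Usage: unpack_dump ARCHIVE DEST_DIR
unpack_dump() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip, -f archive file, -C unpack inside DEST_DIR
    tar -xzf "$archive" -C "$dest" || return 1
    # Print the first index.html found; if the dump holds several,
    # this may not be the top-level one.
    find "$dest" -name index.html | head -n 1
}
```

Calling it as unpack_dump swarm-wiki-html.tar.gz "$HOME/offline-wiki" prints a path you can open in a browser.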

5 Comments

  1. Tom says:

    A new (slightly modified) version of DumpHTML is robust against character-encoding problems because it saves articles and media files with MD5 hashed filenames instead of double-byte encoded unicode. See http://www.mediawiki.org/wiki/Extension:DumpHTML and download a patch from https://bugzilla.wikimedia.org/show_bug.cgi?id=8147 .

  2. Lee says:

    Thanks Tom. I had run into the long-filename problem myself. My workaround was to change the single very-long-filename to a shorter filename. I’ll wait a while and hopefully your patch will be in the next version of dumpHTML (for all my nerdiness, I find the patch command a bother to use).

    Thanks

  3. Eric Carter says:

    Why did you choose to use --force-copy? As best I can tell from the limited documentation, this is an option for compatibility with Wikimedia Commons. Is it needed in my personal MediaWiki installation that doesn’t link to Wikimedia Commons?

  4. Eric Carter says:

    Oh hey, there’s some pretty good documentation here:
    http://www.mediawiki.org/wiki/Extension_talk:DumpHTML

    For anyone interested.

  5. lee says:

    Eric, I use --force-copy just because the docs use it. It works well enough so I didn’t fiddle with it.

    Oh and I should note that on the mediawiki page you note, I am the user “Gadlen”.
