How to Create an HTML Dump of MediaWiki
If you will be traveling and need offline access to your MediaWiki wiki, what should you do?
If you need to grab pages from a wiki that you aren’t the administrator of, you can try running a web crawler on it or try this Google Gears hack.
But if you are the administrator of the wiki (or you know the admin), you can make an HTML dump of it. There is a MediaWiki extension, DumpHTML, that does this for you. Here’s how to run it:
Fetch the DumpHTML extension with shell commands like so:
cd /whatever/mediawiki/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/DumpHTML
Then run a shell script like the following as a cron job (create the appropriate folders first):
#!/bin/sh
# Generate a new html dump of wiki.orbswarm.com LCS 8-17-08
echo "deleting contents of /home/swarm/wiki.orbswarm.com-html"
rm -rf /home/swarm/wiki.orbswarm.com-html
# dumpHTML.php expects to be run from the maintenance directory.
# The skin won't get HTMLified if you run it from another directory.
cd /home/swarm/wiki.orbswarm.com/extensions/DumpHTML
/home/swarm/php5/bin/php dumpHTML.php -d /home/swarm/wiki.orbswarm.com-html -k monobook --image-snapshot --force-copy
echo "deleting /home/swarm/wiki.orbswarm.com/offline/*"
rm -rf /home/swarm/wiki.orbswarm.com/offline/*
/bin/tar -czf /home/swarm/wiki.orbswarm.com/offline/swarm-wiki-html.tar.gz /home/swarm/wiki.orbswarm.com-html/
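To regenerate the dump every day, a crontab entry along these lines would do it (the script path and the 4 a.m. schedule here are my own assumptions; point it at wherever you saved the script above):

```shell
# Hypothetical crontab entry: run the dump script every day at 4:00 a.m.
# The path /home/swarm/bin/dump-wiki-html.sh is an assumed location for the script above.
0 4 * * * /home/swarm/bin/dump-wiki-html.sh
```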
The way the script above is set up, a fresh .tar.gz file lands in a web-accessible folder every day. I can then download it before I go on my trip.
A new (slightly modified) version of DumpHTML is robust against character-encoding problems: it saves articles and media files under MD5-hashed filenames instead of double-byte-encoded Unicode. See http://www.mediawiki.org/wiki/Extension:DumpHTML and download a patch from https://bugzilla.wikimedia.org/show_bug.cgi?id=8147 .
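If you haven’t used patch before, the workflow is only a couple of commands. Here is a generic, self-contained demonstration; all the filenames below are throwaway examples I made up, not files from the bug report:

```shell
# Generic demonstration of the diff/patch workflow.
# All filenames here are hypothetical examples.
printf 'old line\n' > dumpHTML.php.example
printf 'new line\n' > fixed.example
diff -u dumpHTML.php.example fixed.example > fix.patch || true  # diff exits 1 when files differ
patch -p0 dumpHTML.php.example < fix.patch
cat dumpHTML.php.example   # now prints "new line"
```

For the real thing you would save the diff from the bug report to a file and run patch from the extension directory the same way.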
Thanks, Tom. I had run into the long-filename problem myself. My workaround was to shorten the one very long filename. I’ll wait a while; hopefully your patch will be in the next version of dumpHTML (for all my nerdiness, I find the patch command a bother to use).
Thanks
Why did you choose to use --force-copy? As best I can tell from the limited documentation, it’s an option for compatibility with Wikimedia Commons. Is it needed in my personal MediaWiki installation, which doesn’t link to Wikimedia Commons?
Oh hey, there’s some pretty good documentation here:
http://www.mediawiki.org/wiki/Extension_talk:DumpHTML
For anyone interested.
Eric, I use --force-copy just because the docs use it. It worked well enough that I didn’t fiddle with it.
Oh, and I should note that on the MediaWiki page you mention, I am the user “Gadlen”.