The Case
MediaWiki, the wiki software that powers Wikipedia, produces articles that are clean, well formatted, and simple to read. But collecting them is a noisy, recursive nightmare. Behind the scenes, MediaWiki also stores article revisions, user histories, and discussions between users: useful data, to be sure, but unnecessary for a straightforward data collection. Many of these pages are dynamically generated, so a web crawl can recurse indefinitely. And of course, all of this data lives in a database, which isn’t easily collected or produced.
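The articles themselves, though, are a small and well-defined set. MediaWiki exposes an Action API that can enumerate just the current pages in the article namespace, leaving out the revision histories, talk pages, and special pages that swamp a blind crawl. The sketch below is a minimal illustration of that idea, not a complete collection tool; it assumes the wiki’s Action API is enabled and reachable at a hypothetical api.php URL, and that anonymous reads are permitted (a private wiki would typically require authentication first).

```python
import requests

# Hypothetical endpoint; substitute the private wiki's real api.php URL.
API_URL = "https://wiki.example.internal/api.php"

def iter_article_titles(session):
    """Yield the titles of current article pages (namespace 0) via the
    MediaWiki Action API, skipping histories, talk pages, and special pages."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": 0,   # main/article namespace only
        "aplimit": "max",
        "format": "json",
    }
    while True:
        data = session.get(API_URL, params=params, timeout=30).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the API's continuation token

if __name__ == "__main__":
    with requests.Session() as session:
        titles = list(iter_article_titles(session))
        print(f"{len(titles)} articles in the main namespace")
```

Enumerating titles this way puts a concrete number on the real target set before any collection begins, which is exactly the figure a blind crawl obscures.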
A client was using MediaWiki to store technical data and documentation in a private wiki on their network. They needed to collect the content used by an entire business unit, but the usual tools returned hundreds of thousands of pages for what should have amounted to a few thousand articles. The results were too noisy for automated collection to be of use, and the target data set was far too large to collect manually.