The legend
There is an internet legend that if you open any Wikipedia page, click the first link that is not in parentheses, and repeat this on every subsequent page, you will eventually arrive at the Wikipedia page for “Philosophy”. Other investigations of this legend exist online (though the most recent one we've seen was done in 2010), but to our knowledge none has taken advantage of the efficiency and optimization that parallel computing allows, and that is what we intend to do. We've also made our code public, so you can check out the path to Philosophy any time you're curious. The project uses MapReduce and MPI and is written in Python, relying heavily on the XML (SAX) parser as well as regular expressions.
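To make the rule concrete, here is a minimal serial sketch in Python of "follow the first link that is not in parentheses". It is not the project's parallel implementation. The helper names `first_link` and `path_to`, the toy page texts, and the simplified handling of `[[Target|label]]` wiki markup are all assumptions made for illustration.

```python
import re

# Matches the target of a wiki link: [[Target]] or [[Target|label]].
# (Illustrative pattern; real wikitext has many more cases.)
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def first_link(wikitext):
    """Return the target of the first [[link]] outside parentheses, or None."""
    depth = 0  # current nesting depth of round parentheses
    for i, ch in enumerate(wikitext):
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0 and wikitext.startswith("[[", i):
            match = LINK_RE.match(wikitext, i)
            if match:
                return match.group(1).strip()
    return None


def path_to(target, start, first_links, max_steps=1000):
    """Follow first links from start; return (path, reached_target)."""
    path = [start]
    seen = {start}
    current = start
    while current != target and len(path) <= max_steps:
        nxt = first_links.get(current)
        # Stop on a dead end (no link / unknown page) or on a cycle.
        if nxt is None or nxt in seen:
            return path, False
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path, current == target


# Toy data standing in for parsed Wikipedia pages (hypothetical text).
toy_pages = {
    "Cat": "The cat (see [[Felis]]) is a [[Mammal|mammal]].",
    "Mammal": "A mammal is an [[Animal]].",
    "Animal": "Animals are studied in [[Science]].",
    "Science": "Science grew out of [[Philosophy]].",
    "Loop A": "See [[Loop B]].",
    "Loop B": "See [[Loop A]].",
}
toy_first_links = {title: first_link(text) for title, text in toy_pages.items()}

cat_path, cat_reached = path_to("Philosophy", "Cat", toy_first_links)
loop_path, loop_reached = path_to("Philosophy", "Loop A", toy_first_links)
```

In this toy data the chain from "Cat" skips the parenthesized link to "Felis" and reaches "Philosophy", while "Loop A" and "Loop B" point at each other and never get there. Tracking visited pages is what lets the walk stop on such cycles instead of running forever.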
Motivation
The original motivation for the project was a joking remark made while we discussed potential project ideas. We were scanning large datasets available online, specifically those for which a parallel program would be helpful, and came across the legend while looking for places to download Wikipedia. Discovering whether the legend is true seemed creative, parallelizable, and interesting, and doing so would build a partial network of Wikipedia's interconnectedness (based on the first links on n pages).
We are also very interested in other related patterns or observations we can make from looking at how links/topics in Wikipedia are related to each other (see Application Level Objectives for details and additional features).
Caveat
It is important to note that all results are specific to the state of Wikipedia on a particular day (the day we downloaded the data). Because the site is constantly changing, the connections will change as well.