The program uses Hadoop MapReduce to compile, for each article, the list of steps needed on the way to “Philosophy”. We took the data from Wikipedia itself: a raw English Wikipedia data dump (pages-articles.xml.bz2) from November 2013, 9.5 GB compressed and 44 GB uncompressed.
In the pre-processing stage, done independently of the analysis, we used an XML parser to separate the articles out of the Wikipedia data dump. We then batched these millions of articles into distributable blocks and used MPI to process the blocks in parallel, reducing each page to a dictionary entry that holds the article title and the links located within the article. We excluded links in parentheses, non-main article text (as per the myth), redirect pages, disambiguation pages, and non-article pages such as those whose titles begin with “Category:”, “Image:” and “File:”.
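To make the filtering rules above concrete, here is a minimal Python sketch of how one page could be reduced to a (title, links) entry. It is not our actual pre-processing code: the function names, the regular expression, and the disambiguation check (looking for a `{{disambiguation` template) are assumptions made for illustration, and it does not handle the non-main article text rule or links nested inside image captions.

```python
import re

# Title prefixes treated as non-article pages (taken from the text above).
NON_ARTICLE_PREFIXES = ("Category:", "Image:", "File:")

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures Target.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def strip_parentheticals(text):
    """Drop text inside round brackets, but keep brackets that sit inside a link."""
    out = []
    depth = 0    # nesting depth of ( ... ) outside links
    in_link = 0  # nesting depth of [[ ... ]]
    i = 0
    while i < len(text):
        two = text[i:i + 2]
        if two == "[[" or (two == "]]" and in_link > 0):
            in_link += 1 if two == "[[" else -1
            if depth == 0:
                out.append(two)
            i += 2
            continue
        ch = text[i]
        if in_link == 0 and ch == "(":
            depth += 1
        elif in_link == 0 and ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)


def extract_links(wikitext):
    """Return article link targets in order of appearance, skipping parentheticals."""
    links = []
    for match in LINK_RE.finditer(strip_parentheticals(wikitext)):
        target = match.group(1).strip()
        if not target or target.startswith(NON_ARTICLE_PREFIXES):
            continue
        # Wikipedia treats the first letter of a title as case-insensitive.
        links.append(target[0].upper() + target[1:])
    return links


def page_entry(title, wikitext):
    """Return (title, links), or None for pages that are excluded entirely."""
    if title.startswith(NON_ARTICLE_PREFIXES):
        return None
    if wikitext.lstrip().upper().startswith("#REDIRECT"):
        return None
    if "{{disambiguation" in wikitext.lower():  # assumed marker, see lead-in
        return None
    return title, extract_links(wikitext)
```

For example, `page_entry("Foo", "'''Foo''' (from [[Greek]]) is a [[bar]].")` keeps only the link to “Bar”, because the link to “Greek” sits inside parentheses.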
Next, we put our data through MapReduce code that sifts through the dictionaries of each page’s (key) links (values) and identifies whether following the first link leads back to “Philosophy”. The output is a list of edges that takes an article to Philosophy and, if the article disproves the theory, the reason why. One of our Additional Features is to show any infinite loops (such as “Sand Fence” > “Snow Fence” > “Sand Fence”).
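The first-link walk and the loop check can be sketched in a few lines of plain Python. This is an illustration of the idea only, not our Hadoop job: the function name `trace_to_philosophy`, the status strings, and the in-memory `first_link` dictionary are assumptions, and the entries other than the Sand Fence loop are toy data.

```python
def trace_to_philosophy(start, first_link, target="Philosophy"):
    """Follow first links from `start`; return (path, status).

    `first_link` maps an article title to the title of its first link,
    or to None when the article has no usable link.
    Status is "reached", "loop", or "dead end".
    """
    path = [start]
    seen = {start}
    current = start
    while current != target:
        nxt = first_link.get(current)
        if nxt is None:
            # No first link, or the page is missing from the data.
            return path, "dead end"
        path.append(nxt)
        if nxt in seen:
            # We came back to an article already on the path.
            return path, "loop"
        seen.add(nxt)
        current = nxt
    return path, "reached"


# Toy data: only the Sand Fence / Snow Fence loop comes from the text above.
first_link = {
    "Sand Fence": "Snow Fence",
    "Snow Fence": "Sand Fence",
    "Article A": "Article B",
    "Article B": "Philosophy",
    "Article C": None,
}

for title in ("Sand Fence", "Article A", "Article C"):
    path, status = trace_to_philosophy(title, first_link)
    print(" > ".join(path), "[" + status + "]")
```

Consecutive pairs in the returned path are the edges that lead an article to Philosophy, and the status string plays the role of the reason given when an article disproves the theory.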
Click here to see our code.