our perspective
We planned ahead and started early, and we are lucky we did. If we hadn't, we never could have finished on time. Even beginning weeks in advance, it was a sprint to the finish line.
I think our 90% number (the approximate percentage of articles whose links lead back to Philosophy) is surprising. First, it was higher than I expected. Second, it was amazing how much Wikipedia changed in only a month: from one of our downloads we estimate 85% (with a small degree of error), and from a later Wikipedia download, around 92% (no known degree of error). Manually stepping through Wikipedia pages online, we discovered that on some days 'Knowledge' led back to Philosophy, while on others it didn't. It just depends on how the article looks that day.
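To make the percentage concrete, here is a minimal sketch of the idea behind it: follow each article's first link until the chain hits Philosophy, loops, or dead-ends, then count the share that arrive. This is not our MapReduce or MPI code; the function names and the tiny `toy_links` table are made up for illustration and are not real Wikipedia data.

```python
def reaches_target(start, first_link, target="Philosophy"):
    """Follow first links from `start`; True if the chain reaches `target`."""
    seen = set()
    page = start
    # Stop on a dead end (None) or when a page repeats (a loop).
    while page is not None and page not in seen:
        if page == target:
            return True
        seen.add(page)
        page = first_link.get(page)
    return False


def fraction_reaching(first_link, target="Philosophy"):
    """Share of articles (other than the target) whose chain reaches it."""
    pages = [p for p in first_link if p != target]
    if not pages:
        return 0.0
    hits = sum(reaches_target(p, first_link, target) for p in pages)
    return hits / len(pages)


# Hypothetical first-link table: article -> first link in that article.
toy_links = {
    "Science": "Knowledge",
    "Knowledge": "Philosophy",
    "Philosophy": "Existence",
    "Cat": "Dog",   # Cat and Dog point at each other: a loop
    "Dog": "Cat",
    "Stub": None,   # article with no usable first link
}

print(fraction_reaching(toy_links))
```

Changing a single entry, such as where 'Knowledge' points, changes the result for every article upstream of it, which fits the day-to-day swings we saw.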
The biggest challenge for me was the nit-picky set-up of the Wikipedia pages. The MapReduce code was challenging because it required applying logic in new ways, but we were able to sit down and crank it out in about a day. The MPI code was reasonably easy and only took a few hours to implement. The unforeseen time killer in the project turned out to be the numerous adjustments and iterations we had to make to the link parsing file, and each modification meant another 12 hours of wait time until the next batch of links was in our hands. The last iteration was just 24 hours ago, making for a stressful end to the project, though we knew our parallel code ran well.
As a next step, I'd love to do further analysis on the first links. Because we had to modify the link parser so close to the finish, time for analysis was very limited. I'm pleased that we could trace the path, but I'd love to build some more detailed visualizations to go along with it. In fact, next time, I'd be inclined to avoid Wikipedia data altogether: I think focusing on big-picture code and parallelization would have been much more exciting than iterating through the link parser with minor adjustments.
Finally, the fact that 90% of articles do link back to "Philosophy" on Wikipedia is an interesting observation in and of itself. The common explanation is that every article is supposed to start with a general description or explanation, which usually means naming the overarching subject or branch it belongs to. And since "Philosophy" is known as the Mother of All Sciences, what better article to be the origin node of Wikipedia?
It was a blast and we learned a lot, especially about MapReduce- thanks!