January 12 linkdump: Reddit on Hadoop on steroids, Hadoop lessons learned

A great Hadoop story, and a great read too, from Lau Jensen on the Best In Class blog:

Hadoop opens a world of fun with the promise of some heavy lifting and in order to feed the beast I’ve written a Reddit-scraper in just 30 lines of Clojure.

[…]

Now that we’re sitting with almost unlimited insight into the posts which make Redditors tick, we can think of many stats that would be fun to compute. Since this is a tutorial I’ll go with the simplest version, i.e. something like calculating total number of upvotes per domain/author, but for a future experiment it would be fun to pull out the top authors/posts and also scrape the URLs they link, categorizing them after content length, keywords, number of graphical elements etc, just to get the recipe for a successful post.
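
To make the "total upvotes per domain/author" statistic concrete, here is a minimal sketch of the map, sort, reduce shape such a job takes. It is not Lau Jensen's Clojure code and does not use Hadoop itself: it is plain Python run in memory, and the sample records, the field names (`domain`, `author`, `ups`), and the function names are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records; the field names are assumptions,
# not the real output of the Reddit scraper.
posts = [
    {"domain": "example.com", "author": "alice", "ups": 10},
    {"domain": "example.org", "author": "bob", "ups": 5},
    {"domain": "example.com", "author": "bob", "ups": 7},
]


def map_post(post, key_field):
    """Map step: emit one (key, upvotes) pair per post."""
    yield post[key_field], post["ups"]


def reduce_upvotes(key, values):
    """Reduce step: sum all upvote counts emitted for one key."""
    return key, sum(values)


def total_upvotes(posts, key_field):
    """Run map, sort (standing in for the shuffle), and reduce in memory."""
    pairs = [pair for post in posts for pair in map_post(post, key_field)]
    pairs.sort(key=itemgetter(0))  # groupby needs equal keys to be adjacent
    return dict(
        reduce_upvotes(key, (ups for _, ups in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    print(total_upvotes(posts, "domain"))
    print(total_upvotes(posts, "author"))
```

Switching `key_field` between `"domain"` and `"author"` gives the two variants mentioned in the quote; on a real cluster the sort and grouping would be handled by the framework rather than by `sort` and `groupby`.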

Alex Popescu has a few notes and questions about ReadPath's use of Hadoop in production:

If you thought using NoSQL solutions would automatically address and solve backup and restore policies, you were wrong. […]
