I have finished coding a recursive crawler. It was fun, and it took a lot of hard work, some meditation, and lots of Googling, but I finally did it. My friend Abhijeet asked me to make a recursive crawler, and I was wondering how I could do that. So I came up with the idea of keeping two lists:

1. processed list (all crawled URLs are stored here)

2. unprocessed list (all new URLs are stored here)

Now, if a new URL already exists in either of these lists, skip it and move on. Happy crawling, guys... :)
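Here is a minimal sketch of the two-list idea in Python. It is not the actual crawler code: the `crawl` function, the `get_links` callback, and the in-memory `site` dictionary are made-up names for illustration, and the dictionary stands in for real HTTP requests so the example runs on its own.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str, get_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Crawl from start_url, visiting each URL at most once."""
    processed: List[str] = []          # URLs already crawled, in crawl order
    unprocessed = deque([start_url])   # URLs discovered but not yet crawled

    while unprocessed:
        url = unprocessed.popleft()
        processed.append(url)
        for link in get_links(url):
            # Skip any URL that already exists in either list.
            if link in processed or link in unprocessed:
                continue
            unprocessed.append(link)
    return processed


# Hypothetical in-memory "site": each page maps to the links found on it.
site = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b", "/c"],
    "/b": ["/a"],
    "/c": [],
}
order = crawl("/", lambda url: site.get(url, []))
print(order)
```

Even though the pages link back to each other, every URL is crawled only once. For a big site, a `set` for the processed URLs would probably make the lookups faster than a list.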

This program does the following:

  1. stores data in MongoDB
  2. parses each page's HTML for the title, meta data, and meta keywords
  3. handles errors, so a failed page request does not break the crawl
  4. does not follow any domain other than the given one
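To give a rough idea of how these pieces could fit together, here is a small sketch using only the Python standard library. It is an illustration, not the crawler's real code: `PageInfoParser`, `same_domain`, `build_record`, and `fake_fetch` are made-up names, I am assuming "meta data" means the description meta tag, and the MongoDB step is only mentioned in a comment (with pymongo, a record like this could be saved with `collection.insert_one(record)`).

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


class PageInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: Dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def same_domain(url: str, root: str) -> bool:
    """True when url has the same host as root (expects absolute URLs)."""
    return urlparse(url).netloc == urlparse(root).netloc


def build_record(url: str, fetch: Callable[[str], str]) -> Optional[dict]:
    """Fetch and parse one page; return None instead of crashing on failure."""
    try:
        html = fetch(url)
    except Exception as exc:  # a failed request should not stop the crawl
        print(f"skipping {url}: {exc}")
        return None
    parser = PageInfoParser()
    parser.feed(html)
    # Assumption: a dict like this is what would be stored in MongoDB.
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": parser.meta.get("keywords", ""),
    }


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request."""
    if url.endswith("/broken"):
        raise ConnectionError("request failed")
    return (
        "<html><head><title> Demo page </title>"
        '<meta name="keywords" content="crawler, python">'
        '<meta name="description" content="A test page">'
        "</head><body></body></html>"
    )


record = build_record("http://example.com/", fake_fetch)
failed = build_record("http://example.com/broken", fake_fetch)
```

Here `record` holds the title and meta values of the fake page, and `failed` is `None` because the request raised an error. Relative links would need to be turned into absolute URLs (for example with `urllib.parse.urljoin`) before calling `same_domain`.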

Here is the link