OK, as I write my crawler, I've noticed that there are many pieces to the whole crawler puzzle.  I've got the part that pulls pages off the internet as fast as possible (that part works great), but that's the easy part.  The hard part is doing something useful with the data once I've fetched it.  The first step is pulling URLs out of the HTML for future crawling.  Storing/de-duping/… of URLs is a whole other problem when doing a huge crawl, so I'll cover that later.
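
To make that first step concrete, here is a minimal sketch of URL extraction in Python using only the standard library. It is not the code from my crawler; the names `LinkExtractor` and `extract_links` are made up for illustration, and it assumes you only care about `href` attributes on `<a>` tags and only want http/https links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page the HTML came from
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                # Drop "#fragment" so the same page is not listed twice.
                absolute, _fragment = urldefrag(absolute)
                # Assumption: skip mailto:, javascript:, and other schemes.
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the links found in html, in document order (duplicates kept)."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(page_html, page_url)` on each fetched page gives a list of candidate URLs. Duplicates are left in on purpose, since de-duping is its own problem.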
