Crawling the whole internet using a home internet connection and a few PCs. Can it be done in a reasonable amount of time, and without buying a million hard drives and computers?

How big is this problem? Let's assume that we want to do the simplest type of analysis: pull every page on the internet, grab the URLs from it, then delete the file and crawl more. We only need temporary storage for the HTML, so the only thing we have to keep is the URLs.

Recently, Google announced that its systems had seen their trillionth unique URL (2008, I think). So, let's say that there are 1T URLs that we need to store. Are those all distinct pages, or do the 1T include duplicate pages reachable under different URLs?

For each URL, we'd normally store its return code (200, 404, …), the date/time last fetched, Last-Modified, MIME type, etc., but here we're just going to store the URL. If we use flat files, one per domain, we can cut off the domain part of each URL (e.g. http://mydomain.com/) and store only the path and filename.

Making a wild guess at the average URL length across the entire internet, let's say it's somewhere between 40 and 100 characters. We'll choose 80 chars as a stab in the dark. 1T URLs * 80 chars = 80TB worth of URLs, at one byte per character.

If we instead store all of the URLs in a key/value database that supports compression (Tokyo Cabinet) and use the highest compression possible, we might be able to get the storage down to below 10TB. That's possible with a single PC for a couple of grand. We now have a machine that's capable of storing every URL on the internet (or close to it).
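To make the back-of-the-envelope numbers easier to play with, here is a minimal Python sketch of the estimate and of the per-domain trick of dropping the domain from each URL. The 10:1 compression ratio is purely an assumption picked to land under the 10TB guess (it is not a measured Tokyo Cabinet figure), and the helper names `raw_storage_tb` and `split_url` are made up for illustration.

```python
from urllib.parse import urlsplit

NUM_URLS = 10**12            # 1T URLs, the figure used in the estimate
AVG_URL_CHARS = 80           # the stab-in-the-dark average URL length
ASSUMED_COMPRESSION = 10     # assumption: 10:1, not a measured ratio


def raw_storage_tb(num_urls, avg_chars):
    """Uncompressed size in decimal terabytes, at one byte per character."""
    return num_urls * avg_chars / 10**12


def split_url(url):
    """Split a URL into (domain, path) so only the path goes in the per-domain file."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        # Keep the query string, since it is part of what identifies the page.
        path += "?" + parts.query
    return parts.netloc, path


raw_tb = raw_storage_tb(NUM_URLS, AVG_URL_CHARS)
compressed_tb = raw_tb / ASSUMED_COMPRESSION

domain, path = split_url("http://mydomain.com/some/page.html")

print(f"raw: {raw_tb:.0f}TB, compressed (assumed): {compressed_tb:.0f}TB")
print(f"file for {domain!r} stores {path!r}")
```

Changing `AVG_URL_CHARS` to 40 or 100 gives the low and high ends of the guess (40TB and 100TB uncompressed), which shows how much the final number depends on that one assumption.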