
Update on the GeoCities rescue from archiveteam.org
Submitted by Steve on Mon, 06/29/2009 - 19:53
If you didn't already know, I've been trying to help Jason Scott at http://archiveteam.org to back up Yahoo's 18-year-old free hosting service http://geocities.com before they take it down for good later this summer ( http://help.yahoo.com/l/us/yahoo/geocities/geocities-05.html ). Well, everything was going well until I got capped by Comcast ( http://badcheese.com/?q=node/94 ) at 250GB of bandwidth in May 2009. So, my effort was put on hold. Jason and the rest of the gang at http://archiveteam.org have methodically run through all of the 'neighborhoods' at GeoCities already and downloaded nearly 1TB of content. There is still a lot of information in the users' directories that Jason has not downloaded yet.
Since I was using the http://archive.org crawler (Heritrix) to crawl GeoCities, I was asking a lot of questions in the forums and it turns out that the guys at archive.org were actually paying attention to my problems. So when Comcast cut me off, the guys at http://archive.org decided to help us out and do a "deep crawl" of GeoCities which started in early June. They also said that they'll do "catch up" crawls until the service closes to make sure they get any recent updates up until the day that Yahoo closes their doors.
So, thanks to Jason @ http://archiveteam.org and Gordon Mohr @ http://archive.org, GeoCities and all of its animated gif goodness will remain on the internet until the end of time. Yea!
NOTE: We tried contacting people at Yahoo and most tape and harddrive storage companies to help us with the project and nobody even bothered to return our phone calls or emails except for the guys at http://archive.org

Collage of 5000+ images taken from Amazon's iPhone app via mechanical turk
Submitted by Steve on Tue, 06/16/2009 - 10:44
I put together a little script on my home machine to download all of the images that people submit through Amazon’s iPhone application. The application allows people to take a photo of anything they want to, then the photo goes to Amazon’s mechanical turk service where someone does a search for the product on Amazon’s website and returns a url to that image. The iPhone user can then purchase the product using their phone. Turn-around time varies, but the average from my experience with the application is about 2 minutes which I think is pretty good.
The 5000+ images were from a time period of less than two weeks and I didn’t collect all of the possible images over the two week period, just a good percentage of them. Click here for the big image [flickr.com].
I also made a web page so people can click on each image and see the raw captured image if they’d like: turk images
Be warned, there are a couple NSFW images in there.
Not surprising stats that I noticed while looking at the images: Lots of pictures of dogs, cats, knees, feet, shoes, faces, … It seems like people are just bored and take pictures of just about anything to see what the application will return. Popular product images are: watches, kitchenware, remote controls and most predictably books. Most pictures were taken in someone’s home or while out shopping.

I got the dreaded phone call from Comcast today
Submitted by Steve on Mon, 06/08/2009 - 15:03
"Sir, we'd like to talk to you about your internet usage if we can."
I knew exactly what the Comcast representative was going to say. I was given a phone number that was not the 1-800-comcast typical sit-on-hold-for-thirty-minutes-and-get-a-peon-who-gives-me-the-runaround line. I got straight to a tech. I knew this was trouble.
"Sir, I'm not sure if you're aware of it, but Comcast has an acceptable use policy that states that no one user can use more than 250 gigabytes of data in one month, and it looks like you used (pause) Oh my (pause) 750 gigabytes during the month of May. If you continue next month to exceed the 250GB limit, you'll be disconnected from Comcast internet for 12 months."
It turns out that this is a new-ish addition to Comcast's Acceptable Use Policy that was changed on October 1st, 2008 and there are no exceptions. I explained that I was trying to crawl geocities.com for http://archiveteam.org and for posterity and was trying to be a good netizen ( Network Citizen http://en.wikipedia.org/wiki/Netizen ) and that I didn't use all of that bandwidth to download porn or mp3s or movies, but they didn't seem to care. "No exceptions" they said.
I asked them about other account options. Comcast has a business-class line that has a 750GB bandwidth cap, but that's still really not enough to download all of geocities in a summer, much less offer it up in any sort of manner for people to grab later. I was even contacted by archive.org (the Internet archive) asking me if I could help them out with their own archive of geocities which hasn't been updated since 2001, but now I'll have to turn them down since I don't have the bandwidth to grab the site.
So, my true feelings about Comcast now? I'm even more upset at them than I was before. Comcast screwed me from day-one; over-promising me at the time of install, then under-delivering and over-charging when I got my first bill. Taking care of that took about 3 months of emails, calls and regular mail. It's getting harder and harder to get someone to actually HEAR a customer's complaint nowadays. Upper management has so many layers of defense in place to make sure that complaints get taken care of somewhere downstream that they don't seem to care anymore at all. It's like the beach-landing scene at the beginning of the movie "Saving Private Ryan". When I get off the 30-minute boat ride to the shore (on hold time), I'm shot at with every excuse that the first-level tech support guy has in his arsenal of excuses. Then escalating the call to the manager-level is like facing the sniper in the pillbox who is mowing down everyone who made it through the first line of defense. I eventually sneak up to the pillbox by sending emails, cold-calling techs and upper-management types that I can find on the internet and take out the pillbox sniper in the back of the head and get the original pricing and features that were originally promised to me by the installation/sales guy in the first place.
I even contacted the old "comcast cares" twitter account, but no response from that guy either. Nothing from anyone at all about this bandwidth cap issue.
I'm working on getting a second line now. Qwest DSL is about the same speed/price, but they have a little-documented bandwidth cap also. It's about 400GB, and when you reach it, you just have to hit "accept" on a webpage to continue internet use. No evil threats with disconnection, no extra charges, just a web page.
Another option is a hosted machine somewhere locally that I can go to and swap out harddrives with. Good bandwidth on a host, plus sneakernet back to my home to crunch the data is actually also acceptable with my current projects, so that's something that I'm working on also.
I'm also looking into the business-class comcast connection, however according to the business class Acceptable Use Policy, they state: "Comcast reserves the right to suspend or terminate Service accounts where data consumption is not characteristic of a typical commercial user of the Service as determined by the company in its sole discretion, or where it exceeds published data consumption limitations. Common activities that may cause excessive data consumption in violation of this Policy include, but are not limited to, numerous or continuous bulk transfers of files and other high capacity traffic using (i) file transfer protocol (“FTP”), (ii) peer-to-peer applications, and (iii) newsgroups." So it looks like anything that would require a large amount of data-transfer, even for a business-class account, could flag the account for termination also.
Comcast seems to hold all of the cards.
The reason that Comcast and other Cable/Entertainment providers are putting these bandwidth caps in place is simple. Cable companies make money on 'premium' services. These services are things like Pay-Per-View, HBO, etc ... You are all familiar with the typical bait-and-switch that cable companies provide you with when you sign up. Get tons of stuff for only $20 a month!!! Then in the fine print, you find out that $20 gives you basic cable, no DVR, no movie channels, and the $20/mo is only good for a short time. After the initial period expires, they start to slowly ream you for more and more money until you end up paying out the nose for standard TV and internet. I'm paying approx $150 for Comcast's Internet and HD DVR with basic HD service (no movie channels, no phone service). This is almost a car payment. At least with a car payment, you get the car after a few years and can stop paying! This is $150/mo FOR LIFE with nothing to show for it when you leave! I have a real tough time swallowing that one.
But where was I, oh yea. Premium services. The killer problem with the cable company's revenue model is that downloadable content is right around the corner if not here already. If people can download all of their favorite movies and TV shows directly off the internet via HULU, YouTube or any number of emerging video websites, why would anyone pay for cable TV anymore? Cable companies are a thing of the past and will die out some day. Perhaps not soon, but the writing is on the wall and Comcast and other cable companies know it. That's the real reason for the bandwidth caps. Not to stop piracy, not to keep their neighboring customers happy, but to limit LEGAL multimedia downloading that competes with their high-priced premium services. It's an anti-competition move to save their asses. Remember newspapers and Craigslist? Just think Cable companies and HULU now.
So, what's the lesson learned here? I'm unable to build a startup cheaply that competes with largely-funded companies due to bandwidth caps and low upload speeds (I can't crawl the web, I can't store large chunks of data, I can't host a high-volume website) from my home internet connection, so I need a business-class connection (which still has a cap, is more expensive, still not unlimited use, ...) or better (hosted machine in a datacenter (actually the best solution for my needs)) from a company that doesn't have a competing interest in the Entertainment industry. I need to get away from Comcast. Comcast has some really upset and pissed off users out there and I don't see it getting any better any time soon. The more popular online video sites become, the more Comcast is going to clamp-down on internet usage. For my projects, I can't have that. I'm going a different direction. Sorry Comcast. You lose in the end.

Why making a startup out of your basement is so difficult
Submitted by Steve on Tue, 05/19/2009 - 13:37
I like building startups.
Sure, I could make a blog about my dog's favorite chew toy.
Sure, I could make another url-shortening service.
Sure, I could make a mashup of something that involves CraigsList, eBay, Google maps, Hulu and flickr.
However, that crap is boring and I prefer to think bigger. I always like to dream about putting something together that will really make a difference. I would love to take on the big guys and beat them at their own game. However, there are more and more stumbling blocks to do this as I try more and more different things out of my basement with my shoestring budget:
- ISP bandwidth caps - I can get 7Mbps downstream from my Comcast ISP, but I can't crawl any large sites due to Comcast's 250GB/mo bandwidth cap, and don't even get me started with the upstream bandwidth ...
- Storage/indexing - I can buy a few 1TB hard drives every once in a while and stay within my startup shoestring budget, but getting mysql or lucene to index terabytes of content is not a simple solution anymore. I've got a basement with 8 old-ish computers of varying capacities that are pretty busy on a regular basis with miscellaneous things to do. Keeping on top of a mountain-sized chunk of data is not an easy task anymore.
- Google is huge. I can't compete with Google at crawling/indexing/search. Nobody really can. However, I could choose a subset of Google's empire and focus on it and build a better mousetrap and win, but just choosing an interesting niche to hack on is a difficult task in itself. To me, Google's adsense/adwords system seems the most profitable and interesting target that I'd love to build a competitor to. Troy and I built http://bidboxr.com in an effort to experiment with the online ad space and we learned a lot in the process, but we didn't go the extra distance and take on the big-G at their own game. One of our other ideas, http://mediawombat.com, a flash search engine, proved to be a valuable experiment, but again, my shoestring budget kept me from really hitting this one home.
- People/Time - I'm a husband and father of two wonderful children. I find it difficult to go to the bathroom without being interrupted by someone or something nowadays. Finding spare time is becoming a difficult thing to do in itself. Most of my development time is done in the wee hours of the night when I'm low on energy, but finally have some free time to myself. I tend to try to work out the details in my head over a series of days or weeks. Take notes about possible solutions and new directions, then think about that. When I think that my brain has slept on the problem enough nights, I can usually whip out a solution in code-form in an hour or two. This is my current way of building things. The actual work has been delayed a lot longer than it was when I was single or in college and could just pound on the keyboard for 24 hours in a row until it worked. I'm not sure if this new way is better or worse, but it fits in with my lifestyle a little bit more.
- I'm trying to learn about SEO. It seems to me that SEO is an always-changing and challenging market that is profitable and un-tapped on several fronts. I've got a few experiments going with an SEO twist mostly so I can learn about the whole SEO world, but SEO takes a long time - search engines take a long time to index your content, so changes are only reflected after a long period of time. My SEO experiment is http://xis.cc and is doing well with Yahoo, but Google doesn't like it very much.
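To put the bandwidth-cap bullet in numbers, here is a rough back-of-the-envelope sketch in Python. It plugs in the 7Mbps line speed from the list above and Comcast's 250GB monthly cap. It assumes the line could be kept saturated around the clock and uses decimal units (1GB = 1000MB); the function name and both assumptions are mine, purely for illustration.

```python
def days_until_cap(cap_gb: float, line_mbps: float) -> float:
    """Days of continuous downloading at line_mbps before cap_gb is used up."""
    megabytes_per_second = line_mbps / 8               # 8 bits per byte
    seconds = (cap_gb * 1000) / megabytes_per_second   # decimal GB -> MB
    return seconds / 86_400                            # seconds per day


if __name__ == "__main__":
    # Assumed inputs: 7 Mbps downstream, 250 GB monthly cap.
    print(f"{days_until_cap(250, 7):.1f} days of full-speed crawling per month")
```

In other words, a crawler that actually fills the pipe would burn through the whole month's allowance in a little over three days.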
Anyway, that's all I have for now. More later. :)

Table Mesa & Flatirons Summer and Snow
Submitted by Steve on Wed, 05/13/2009 - 18:13
This is a great shot and pseudo-typical Boulder weather. A cross between Winter and Summer in the same shot. :)

Apache tuning for small-ish linux machines
Submitted by Steve on Wed, 04/22/2009 - 19:51
I started with a dedicated web server running on a 256MB Linux machine with a single core. It's the machine that's hosting this website right now. I've had some very good experiences with this machine and some not-so-good. I've upgraded the memory to 512MB, but I still find myself stretching for resources. Also, apache seemed to crash on occasion and I kept fighting with it over and over again to provide good response times and still tune for low-memory. I found several issues that I'd like to mention in case others are having similar issues. Mysql is taking 200MB for a key buffer, the OS takes approx 100MB, so that leaves about 200MB for Apache/PHP. The solution is to not keep servers hanging around processing unlimited keepalive requests. The setting "MaxRequestsPerChild" forces apache to respawn children after a certain number of requests have been processed. This keeps apache and PHP from dying - if PHP has a bug and hangs, then PHP will be broken, but apache will continue to serve static content until it reaches the limit of this value, then it'll respawn and PHP will be all better again. This is not optimal for a mega-busy webserver, but it's a good 'stable' configuration for a medium-busy web server like mine. I host 40+ medium-to-small websites on this single machine using these settings.
Operating system: Fedora FC6, Apache: 2.2.6, PHP: 5.1.6
Apache settings:
Timeout 5
KeepAlive On
MaxKeepAliveRequests 200
KeepAliveTimeout 3
<IfModule prefork.c>
StartServers 1
MinSpareServers 1
MaxSpareServers 5
MaxClients 25
MaxRequestsPerChild 500
</IfModule>
This may not seem all that important, but it took a while to hone these to make apache work just right under low-memory conditions and possibly buggy php and still continue to live on a busy machine and serve-up a bunch of content without any noticeable issues. If you've got a machine with a similar setup to mine, try out this apache config and let me know how it goes.
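For anyone adapting these numbers to a different box, the memory budget described above can be sketched as a small calculation. This is only an illustration: the 512MB total, 200MB for MySQL and roughly 100MB for the OS come from the post, but the 8MB-per-child figure is an assumption, not a measurement, and the function name is made up. Check the resident size of your own httpd processes (with ps or top, for example) before trusting the result.

```python
def max_clients(total_mb: int, reserved_mb: dict, per_child_mb: int) -> int:
    """Largest number of prefork children that fits in the RAM left over."""
    available_mb = total_mb - sum(reserved_mb.values())
    return max(0, available_mb // per_child_mb)


if __name__ == "__main__":
    reserved = {"mysql_key_buffer": 200, "os": 100}  # figures from the post
    # 8 MB per Apache/PHP child is an assumed value for illustration only.
    print(max_clients(512, reserved, 8))
```

With those inputs the result is 26, in the same ballpark as the MaxClients 25 in the config above. A heavier PHP application would push the per-child figure up and the safe MaxClients value down.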

Got rejected by TechStars again, but we're getting better at it
Submitted by Steve on Mon, 04/13/2009 - 10:16
Last year we applied to TechStars 2008 for our website http://mediawombat.com (a search engine that indexes the contents of flash media (*.swf files)) and were rejected. That was our first rejection and stung a little bit. Much like Micah, we were already dreaming about the fast servers and huge pipes that we could afford with the seed money and were looking forward to doing something huge.
This year, we came up with another concept and put it together. The new idea is http://bidboxr.com (a combination of ebay and adsense – where you would put your auctions on other peoples’ websites as ads) and got rejected by DreamIT Ventures before the deadline for TechStars (they told us that competing with ebay was crazy). I attended TS4AD in March and I used all of the tips and tricks that I learned there to pitch our site, our team and keep TechStars in the loop (like they requested) while we worked on our site some more. We were sent an email on March 30th saying that we had made the top-50 companies and that TechStars will notify us in two weeks if we made the cut or not. We were a little happier, but tried to keep the pessimistic attitude. The chances had gone from 1/200 to 1/3, so we were feeling lucky, but it wasn’t a done-deal yet.
This morning (April 13) we got our TechStars 2009 rejection letter, but it didn’t sting as much as the earlier ones did. I was keeping pessimistic about the whole thing just in case (there were 500+ applicants to TechStars this year after all) and we already have small investors expressing interest in our site and the site is up and running with hardly any costs to keep it running, so we didn’t really need TechStars all that much to begin with.
So, I don’t want to sound like one of those American Idol rejects shown on the first few episodes of every season who say things like, “@#$% you Simon! I’m going to make it big without you and your stupid show”. But for those of us who didn’t get into YC or TS this year (Micah), don’t worry. I think that I’ve figured out how to make it (not big, but get some traction at least) without the use of seed investors.
I’m a technical guy. I write code, get websites up and running and deal with the computer/technology piece of the puzzle. I don’t know anything about pounding the pavement and cold-calling people to see if they’d be interested in our technology. The thing that I have going for me is that my partner is also a technical guy, but has a background in sales! He understands the technology (he created a lot of it) and also does a fantastic job of getting people to come and look at our site and sit down with us to discuss possible business opportunities.
TechStars suggests that 2/3 to 3/4 of your team should be technical and I agree. Having a team that can whip out tons of code in a very short period of time is essential to making major changes to your site in the 3 months of TechStars and keeping your initial customers happy. However, if you don’t make it into a seed program, my suggestion is that you add a little more heft into your sales force to do some of the footwork that the seed programs may have helped you out with and also change your thinking to more like a penniless startup (read on).
When I attended TS4AD in March, I was really impressed by one speaker in particular. He was the CEO(?) of http://dailycandy.com and talked about starting a company in the post-9/11 NYC economy where there was no money. His first server was a computer sitting under someone’s desk and the whole website operated on a business-class DSL line that they paid < $100/month for. They didn’t go the VC route. His rule of thumb was, “Don’t spend a dollar until you have two”. I wrote that quote down. They focused on one thing at a time and did email marketing – it was free and software was hand-written, so costs were essentially nothing. The company slowly grew over time and so did the product, but by keeping the overhead low (no copy machine, every employee assembled their own desks and chairs, etc.), the company was able to maintain the startup culture as long as possible in tough economic times and survive. His talk was great and left me feeling like there was a second option that was available to me as a founder of a startup that I hadn’t even thought of before. For a short while, we even considered pulling our application out of TechStars to run the way of the cheap-o startup, but we didn’t pull out our application – just in case.
So, in short, we (and you guys too) really don’t need a seed program (Take that, Simon!). We had made the top-50 at TS and missed getting selected by a very slim margin. This gives us some good confidence and tells us that our idea has some real merit. We’re currently very fueled by this notion and using that energy to move forward with our expansion plans on our own. Just this morning, we were contacted by a company who is willing to list over 5000 products on our new site and it’s making us feel even better about our situation!
For those of you who were rejected by TechStars, DreamIT, Y-Combinator, … we feel your pain, and we all need to keep our collective chins up and noses to the grindstone. Getting rejected is not the end of your startup, it’s only the beginning. Get out there and ring some doorbells. Take a small business owner out to lunch and enjoy being an entrepreneur!
Oh, and next year – we’ll probably do it all over again. :)

A little experiment
Submitted by Steve on Fri, 03/27/2009 - 20:20

So, with Woz's legions of geek/hacker fans, how is he not going to automatically win Dancing With The Stars via a python script?
Submitted by Steve on Tue, 03/10/2009 - 20:01
Ok, so I know that Steve Wozniak (one of my childhood heroes) has legions of Apple fanboys and geek hackers under his belt that he can call to action in a moment’s time, right? Voting can be done via the Internet with any email address (doesn’t even require registration) for a couple of hours every night. My question is, where’s the hacker that posts the automated python script to get a fake email address, auto-register at abc.com and mass-vote for Woz on Dancing With The Stars? I can’t believe that this hasn’t already been done. I myself tried to put something together in Perl quickly on Monday night, but I didn’t have time (family stuff interrupted). So, hacker nerds … Where is it? I’ll be the first one to start it up and get it running on my systems, so let’s get it going! :)

Denver DIA wireless is free, but completely broken
Submitted by Steve on Tue, 02/03/2009 - 07:25
So, I’ve got 2 hours to wait before my flight leaves for JFK today, so I open up my laptop to change my email to vacation mode and read some news, etc … Linux connects to the WAP, but fails to get any DNS info from the server, so I reboot into WinXP. XP detects the WAP with “excellent” connection status, but getting the ‘portal’ server to actually serve any information is almost impossible. The server serves up a 30 second commercial before it allows the user to get access to the internet, but I’m guessing from the speed of things (it's now about 45 minutes since I started trying to get access to the internet, which is when I decided to whip out my offline BLOG writer application) that the server is busy serving up the ad to so many people that it is completely unresponsive. The actual wireless strength of the signal is great, it’s just that whoever they chose as the ISP (http://freefinet.rtr ???) does a completely crappy job at actually doing the connection redirection. Blech! Sure, it’s free wireless, that you have to hit “accept” for twice, and watch a 30-second video that never shows up, but come on!!! Ever heard about caching? Proxying? Load-balancing? Or how about this, if a connection completely fails after about 30 minutes of attempts, how about just lifting the proxying crap and just allowing normal internet in case your stupid server can’t handle the load??? In my book, DIA wireless is completely useless to me. This is both the fault of the ISP and DIA. DIA should give the ISP 2 days to fix the problem or drop them like a hot rock. I'm sure that there are thousands of ISPs in Denver willing to provide the access hardware for one free ad.
Another reason that relying on "the cloud" is a bad idea.
FYI: I was able to get access to the internet after 63 minutes of attempting to get through the ISP redirection/bullshit.


Dell's new packaging
Submitted by Steve on Tue, 11/18/2008 - 14:54
Just got in some rails from DELL. Looks like they also altered their packaging standards.

- Steve

mysqlgame is my kind of game!
Submitted by Steve on Sun, 11/02/2008 - 14:00
Ok, no fancy graphics. No sound effects. No FPS actually. :) Check out mysqlgame here: http://mysqlgame.appspot.com

Had to post this.
Submitted by Steve on Fri, 10/17/2008 - 14:15
Sarah was flashing some smiles last night during dinner, so I had to put this picture out there.
I’m so proud. :)

Gonna play with Nutch tonight
Submitted by Steve on Tue, 10/07/2008 - 15:38
Gonna play with Nutch tonight as a possible replacement for my own personal web crawler. Nutch is a Java-based web crawler that we may implement for our large crawling process, but not for our ripping/indexing layer.
Followup:
Why I'm not going to be going with Nutch:
fetching http://www.everydaybirthday.com/
fetching http://welcome.hp.com/gms/gr/el/sz3/smb/notebooks_tabletpcs.html
fetching http://boomp3.com/listen/fbnoc45_p/am-gold-1970-04-your-song-elton-john
java.lang.NullPointerException
at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:1736)
at org.apache.hadoop.mapred.SequenceFileRecordReader.getProgress(SequenceFileRecordReader.java:108)
at org.apache.hadoop.mapred.MapTask$1.getProgress(MapTask.java:165)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:155)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
fetcher caught:java.lang.NullPointerException
[the same NullPointerException stack trace and "fetcher caught" line repeat six more times]
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

HOWTO: diff'ing two huge files in linux
Submitted by Steve on Mon, 10/06/2008 - 14:18
Situation: I have two computers with a large number of files on them (approximately 250 million files on each machine). I need to sync them up and rsync is not an option because it takes way too long. So, I need to 'diff' the files on the two machines.
I ran 'find' on both machines and saved the output to a file on each. These files turned out to be about 15GB each, which is too large to just 'diff' because diff wants to read everything into memory:
[root@fs105 tmp]# diff imagelist_image01.txt imagelist.txt
diff: memory exhausted
Solution: sort the files manually first, then use the 'comm' command to find the differences
[root@fs105 tmp]# ls -la
total 27935834
drwxr-xr-x 2 root root 120 Oct 6 10:18 .
drwxr-xr-x 8 root root 192 Oct 3 13:56 ..
-rw-r--r-- 1 root root 13859131915 Oct 6 10:13 imagelist_image01.txt
-rw-r--r-- 1 root root 14719246513 Oct 3 14:02 imagelist.txt
[root@fs105 tmp]# sort -S 2G -T . imagelist.txt > imagelist_image02_sorted.txt ; sort -S 2G -T . imagelist_image01.txt > imagelist_image01_sorted.txt
[root@fs105 tmp]# comm -3 imagelist_image01_sorted.txt imagelist_image02_sorted.txt > diff.txt
[root@fs105 tmp]# ls -lah ; wc -l diff.txt
total 55G
drwxr-xr-x 2 root root 240 Oct 6 14:16 .
drwxr-xr-x 8 root root 192 Oct 3 13:56 ..
-rw-r--r-- 1 root root 895M Oct 6 15:09 diff.txt
-rw-r--r-- 1 root root 13G Oct 6 14:11 imagelist_image01_sorted.txt
-rw-r--r-- 1 root root 13G Oct 6 10:13 imagelist_image01.txt
-rw-r--r-- 1 root root 14G Oct 6 12:25 imagelist_image02_sorted.txt
-rw-r--r-- 1 root root 14G Oct 3 14:02 imagelist.txt
17487092 diff.txt
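Here's a minimal sketch of the same sort-then-comm idea wrapped in a function. The name list_diff, the SORT_MEM and SORT_TMP variables and the output file suffixes are all made up for illustration; they are not from the session above. It differs from the session in two ways. 'comm -3' puts both sides into one two-column file (the second column is tab-indented), so the sketch uses -23 and -13 to write one file per side instead. It also forces LC_ALL=C, because comm expects both inputs sorted in the same collation order that it uses itself.

```shell
#!/bin/sh
# list_diff is a hypothetical helper name, not part of the original session.
# Usage: list_diff LIST_A LIST_B OUT_PREFIX
# SORT_MEM and SORT_TMP are assumed tuning knobs; their defaults mirror
# the "-S 2G -T ." options used in the session above.
list_diff() {
    ld_a=$1
    ld_b=$2
    ld_out=$3
    ld_mem=${SORT_MEM:-2G}
    ld_tmp=${SORT_TMP:-.}

    # Sort both lists in plain byte order so comm sees the same collation.
    LC_ALL=C sort -S "$ld_mem" -T "$ld_tmp" "$ld_a" > "$ld_out.a.sorted" || return 1
    LC_ALL=C sort -S "$ld_mem" -T "$ld_tmp" "$ld_b" > "$ld_out.b.sorted" || return 1

    # -23 keeps lines found only in the first list,
    # -13 keeps lines found only in the second list.
    LC_ALL=C comm -23 "$ld_out.a.sorted" "$ld_out.b.sorted" > "$ld_out.only_in_a" || return 1
    LC_ALL=C comm -13 "$ld_out.a.sorted" "$ld_out.b.sorted" > "$ld_out.only_in_b" || return 1
}
```

With the lists from this post, something like 'list_diff imagelist_image01.txt imagelist.txt imagelist' should leave imagelist.only_in_a and imagelist.only_in_b next to the sorted copies, so you can see which machine is missing which files.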

Adobe confirms Flash for iPhone – time to write the iPhone app!
Submitted by Steve on Tue, 09/30/2008 - 15:17
Today, Adobe confirmed that they got Flash to work on the iPhone. Now Apple just has to release it.
Of course, this to me means that I’ll have to start writing my iPhone app that I’ve been putting off for 6 months now. I’ll be writing a search interface to access my MediaWombat.com site’s results.
Who wants to sell/rent me their Intel Mac Mini for cheap? :)

MediaWombat.com Gets an IM search bot!
Submitted by Steve on Sun, 09/28/2008 - 21:40
For you Google Chat people, add flashsearch@bot.im to your contact list and start searching for Flash stuff! :)
- Steve

Why isn't there a linux distro out there that is made for huge web 2.0 infrastructures?
Submitted by Steve on Tue, 09/16/2008 - 15:23
I'm the Lead Linux Sysadmin for a prominent web 2.0 company.
Most web 2.0 companies are built around the same technologies: Linux, MySQL, memcached, Tomcat, Apache, PHP, high-availability (Linux-HA), MySQL replication, load-balancing, ...
How come there isn't some Linux distribution out there already that I can just deploy, configure and turn off what I don't need? How come every time I install a Linux distro, I have to configure it the same way for my enterprise, or make my own kickstart script or make a custom distro myself, or write cfengine and post-install scripts to do it all? These platforms are *so* common amongst all of the new-ish web 2.0 companies, I'm mega-surprised that this type of distro isn't already in existence somewhere. I could build one, but like I said, I'm the Lead sysadmin for a dot-com company, so I don't have any time to do this.
Can one of you Mtn Dew swilling, no-sleep getting, no girlfriend, wife or children having, college kids out there whip this up for me tonight so I can come into work tomorrow and just install it everywhere and get some peace of mind so I can do some of my more lengthy tasks on my to-do list?
Oh, and throw some off-site backup stuff in there as well. :)

DjangoCon 2008 Keynote: Cal Henderson (Flickr Design Engineer) "Why I hate Django"
Submitted by Steve on Tue, 09/16/2008 - 13:54

Disconnecting from Google (or) Regaining some of my Internet privacy back (Part 1)
Submitted by Steve on Thu, 09/11/2008 - 10:56
I love Google. I’ve got tons of accounts on all of the associated Google services and I love them all. However, I’m beginning to get paranoid about what Google knows about me. There have been compromises of different sites before and I’m sure that Google will get compromised too some day. Or, Google will throw away their “don’t be evil” mantra and blackmail everyone who has a gmail account with their personal email content in exchange for something. Look at it this way. If you use Google services, have the Google toolbar installed and use Google Chrome, there’s nothing that Google can’t find out about you. They can see all of your files. They can track every website that you visit and the information that you send back and forth to those websites, and of course track everything that you do on the Google services, from every search you’ve ever done to your health information (if you decide to connect Google to your local hospitals).
So, disconnecting from Google (to me) started sounding like an option. However, I like lots of the Google services and wanted to keep using them if necessary (gCal, gReader, …). So how do I disconnect from the cloud, as people are now calling it? BTW, cloud computing != "the cloud" != distributed or cluster computing != the Internet != big brother. Cloud computing is "reliable services delivered through next-generation data centers that are built on compute and storage virtualization technologies and are accessible anywhere in the world." (Wikipedia).
I'd like to regain some of my privacy, not because I have anything to hide, more because I'm paranoid of Google. I've also got a family and two children now and I've been reading some disturbing things about identity theft recently. I'd like to continue to put tidbits about my family online, but I'm going to start to cleanse my personal info from Google's know-it-all data warehouse brain if I can. I'll document my efforts here if anyone is interested.
Tools: I have a Linux host that I can use as my replacement (serious) personal email account that I can be sure nobody can access but me, so email is a good first step at removing Google's eyes from my info. I'll start there.
- First, remove Google Chrome, Google Desktop and Google Toolbar from any and all of my machines.
- Second, make a new 'serious' password that I won't use on any Google website for anything and change all of my Google passwords to a new 'Google-only' password.
- Next, I think, is to change my email address on all of the websites that I registered under my gmail account and change my password on each one while I'm at it. This way Google has my password, but it's not what I use. Sites that have my credit card info are probably the first ones to hit.
Ok. I'm going to start doing this. I'll chime in with part 2 when I'm done.

It's 2008, how come cheap wireless isn't everywhere by now?
Submitted by Steve on Sun, 09/07/2008 - 18:29
Well, another weekend and I’m at the mother-in-law’s place again. She’s got AOL dial-up and it royally stinks. Myself, I have an iPhone (no tethering – don’t get me started) and there is no wireless in her neighborhood to illegally leech from. I called AT&T and they’ll sell me a 3G laptop PCMCIA modem, but the service is $60/mo. How sucky is that? I’m already paying them $130/mo for my iPhone family plan – now they want to leech another $60 out of my wallet so I can have lame 3G connectivity to check my email a couple of times on the weekend?
AOL dial-up is a whole other issue. Yea, it’s 56Kbit and downloading anything > 200k is painful. Some web pages take tons of time and forget JavaScript-heavy pages like Google Mail or Google Reader. I looked into a Windows proxy for AOL, so I could run a DHCP, DNS and NAT server on my mother-in-law’s Windows XP machine, hook a wireless access point to an Ethernet card in her machine, and have wireless access for myself from the house. No dice. I found tons of great Windows utilities, but AOL keeps their raw devices hidden and all of the NAT software that I could find didn’t work with AOL dial-up (AOL high-speed is ok, but she doesn’t have it). So, what to do?
I looked into a third option – there’s PengAOL – an AOL client that uses the AOL 3.0 protocol to connect to AOL from a Linux machine. I figured that I knew enough about Linux to connect to AOL using PengAOL and then NAT through that connection and I’d be in like Flynn. Unfortunately, the only machine that I own that I wanted to dedicate to her was a cheap-o 200MHz Pentium-1 machine with 64MB of RAM that wouldn’t install any modern Linux distros. I started installing Gentoo on it last night before my trip down to Pueblo, but unfortunately a 200MHz computer compiling Gentoo won’t complete in any short time at all and I only had a few hours to give it a whirl. Anyway, I’m in Pueblo now and I’m without any Internet access at all and hating it.
“It’s 2008 – how come wireless isn’t everywhere for cheap by now? I’d be happy with 56Kbps – as long as I can check stuff out of CVS or check my email or VPN somewhere or load a reasonably-sized web page, I should be able to do that by now with the current technology without spending a fortune,” I thought to myself. I graduated in 1994 – the year that the Internet changed from research-only to commercial. That was 14 years ago. 100MHz 486s were top-of-the-line back then. Times have changed a LOT computer-wise! Cheap laptops are everywhere, your phone can browse the web faster than AOL dial-up – even in the business world, people prefer laptops to desktops nowadays, but where’s the cheap mobile connectivity?
Wi-Fi is nice and speedy, but its very short range sucks. I can’t even get Wi-Fi across my entire house if I centrally locate the access point.
Cellular (EDGE, 3G, …) would be great, but carriers charge an arm and a leg for something that’s not fast enough to use as your every-day connection and I don’t feel that I want it bad enough to spend $60/mo for it.
WiMAX is a pipe-dream for the time being. It’s like solar. Sounds great, but nobody I know has it except for some nutball on the outskirts of town.
Dial-up would be nice for emergency situations, but there are no free ISPs anymore. Hell, I don’t think that there are even functioning BBSs anymore that would provide me some sort of shell access for free. AOL’s the only choice and don’t even get me started on how much AOL sucks the big one. AOL even makes it super-difficult to unsubscribe! Talk about an asshole company – can you say monopoly?
Oh well. I guess that I’ll work ‘offline’ for a while this weekend. I’ll find something to do, I suppose. I’ll listen to some MP3s or watch some videos that I have on my laptop and see if I can write some code or something to pass the time. I know … I should write a blog post! :)

MediaWombat.com gets a facelift
Submitted by Steve on Fri, 09/05/2008 - 16:01
Troy has been hard at work giving MediaWombat.com a much-needed facelift. He’s put out a great new version of the search results page (it still has a few technical wrinkles to be ironed out). Now, there’s an intermediate search results page which shows meta-results from all of the results in one place, then when you click through to one of the results, you get the new mega-results page.
Seen here running in Google Chrome, it looks pretty nice to me. I searched for “fish”.
Meanwhile, I’ve been working hard on getting more data into our back-end. I’m just about ready to fire up the engines and get our crawler working at light-speed and see what happens. Our crawler is currently the slowest part of the back-end process. Our ripper and S3 uploader have caught up with our crawling process, so I’ve been working with another company to speed up the crawling by leaps and bounds. I’m also almost done re-writing the crawler in C as a multi-threaded crawler that works in more of a bulk mode, replacing the single-threaded Perl crawler that we’re currently using. On top of that, I’m looking to use some of the same libraries that Google used in their new Chrome browser to help me with speedy URL mangling.
Things are coming together nicely for Media Wombat. Check it out and let me know what you think! Oh, and if you’re a Flash developer and have some Flash sites that you’d like us to add to our index, don’t forget to submit your site to us here: http://mediawombat.com/newurl.php

2008 Malibu Hybrids suck at gas mileage
Submitted by Steve on Sun, 08/31/2008 - 20:47
I drove a 2008 Malibu hybrid around Denver for 5 days during the DNC this year and man did the mileage suck. Check out this picture taken from the dashboard of the car. Yea, 18.6 MPG in hybrid-mode city driving. That’s about the same as my 2001 gas-guzzling 8-cylinder Dodge Dakota truck. Come on! Hybrids doing < 20 MPG? Give me a break!

How do you keep a 4 year old boy busy?
Submitted by Steve on Sun, 08/31/2008 - 19:49
I found a way to keep my 4 year old occupied and allow me some computer time. I found a public domain Windows program that prints out mazes. He’s so ga-ga over the mazes that I can’t print them out fast enough! I’m sitting at my computer blogging about this with one hand and hitting the print button with the other. Every now and then I print out an insanely huge one and he goes crazy but tries it! :) Anyway, here are the URLs to the programs that I found:
http://peter.sorotokin.com/winmaze/ (Windows executable that you can print directly from – new one every time, but only square ones)
http://sorotokin.com/maze/ (The Maze Machine) – a web form that makes mazes. It does fancy ones, but they’re not random: the same parameters give you the same maze over and over again, even if you hit refresh. The circular one above is done with this web application.

I'm a VIP driver for the DNC in Denver
Submitted by Steve on Sat, 08/16/2008 - 17:18
I'm driving around Chris Whittington (Louisiana Democratic State Chairman) and his bodyguard for the 2008 Denver DNC. It's kind of fun, but I haven't seen any big celebs yet. Some pictures here: http://badcheese.com/~steve/gallery/thumbnails.php?album=95

My iPhone 3G gripelist
Submitted by Steve on Sat, 08/16/2008 - 16:45
If you read tech blogs, you've probably heard all of the gripes about what people want from their iPhones. You'll hear things like tethering, unlocking, blah, blah, blah, ... I've got an iPhone 3G and here's my list of gripes about it (all new except one, and that one has a good reason):
1.) Tethering. Yea, this is #1 on most people's list and I wanted to do something original, but tethering is such a nice feature that I wish so badly that I had it. I don't want to download bittorrent movies over my AT&T network or anything, but here's why I want it. This weekend, I'm staying at my mother-in-law's house and she's got crappy AOL dial-up and she always asks me to 'fix' her computer. She usually downloads some 3rd party software that some pop-up told her to install and she ends up muddying up her whole machine. Re-downloading AVG Anti-Virus, Ad-Aware, Wireshark (to check for weird network traffic), Firefox 3.0 and other stuff takes about 24 hours on dial-up AOL. Sure, I could fill up a thumb drive and take it wherever I go, but I've had many thumb drives die on me so it's not reliable hardware, and I'd have to keep the thumb drive up-to-date all of the time. I have a laptop and an iPhone ... If I could tether my laptop, I'd have it all downloaded in a few minutes and get on with my life.
2.) Let the App Store applications have access to the iPod area of the iPhone. I want to have my alarm clock play a song instead of 10-year-old Mac sounds when I wake up. I want 3rd party apps to be able to change the sort order on podcasts, so I can listen to them oldest-first instead of newest-first (duh... Apple, this should be a 2-minute code fix - get on it!). I'm sure that Apple keeps the app/iPod memory spaces separate so the apps can't muck with the protection on the iPod-side of things, but come on!







