So, I loaded all of the Netflix data into a MySQL DB last night and started all of the index builds while I slept. Most of it was done when I woke up this morning. I’ve been thinking about the whole problem a little bit. It seems that it’s more of a profiling problem than a statistical one. I installed a CPAN module to pull info from IMDB. I think the genre data will be the most helpful for classifying movies; from there I can classify people based on the movie classifications and their likes or dislikes. Also, I used to be a Netflix subscriber, so I know that when a person joins, he/she starts rating stuff right away just to populate the Netflix suggestion DB and get recommendations from the system, so that’s a known pattern that I’ll be watching for. Some immediate things I noticed: The average ‘rating’ across all movies was 3.6. There are many customers who only reviewed one movie - this could either present a problem (noise, unpredictable) or an easy classification type. I’m not sure yet. :) More later.
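To make the profiling idea a bit more concrete, here is a rough sketch in Python of one way it could work: tag each movie with genres, then average each customer's ratings per genre to get a taste profile, and separately flag customers with only one rating. Everything here is an assumption for illustration. The sample data, the function names `genre_profile` and `single_rating_customers`, and the choice of a simple per-genre mean are all hypothetical; the real data would come from the MySQL tables and the IMDB lookup.

```python
from collections import defaultdict

# Hypothetical sample data: movie title -> genres (as might be pulled from IMDB).
MOVIE_GENRES = {
    "Movie A": ["Comedy"],
    "Movie B": ["Comedy", "Romance"],
    "Movie C": ["Horror"],
}

# Hypothetical ratings: (customer_id, movie title, rating on a 1-5 scale).
RATINGS = [
    (1, "Movie A", 5),
    (1, "Movie B", 4),
    (1, "Movie C", 1),
    (2, "Movie C", 5),
]


def genre_profile(ratings, movie_genres):
    """Return {customer_id: {genre: mean rating}} for every genre a customer rated."""
    # sums[customer][genre] holds [running total, count].
    sums = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for customer, movie, rating in ratings:
        # Movies with no known genre are skipped.
        for genre in movie_genres.get(movie, []):
            acc = sums[customer][genre]
            acc[0] += rating
            acc[1] += 1
    return {
        customer: {genre: total / count for genre, (total, count) in genres.items()}
        for customer, genres in sums.items()
    }


def single_rating_customers(ratings):
    """Return the sorted IDs of customers who rated exactly one movie."""
    counts = defaultdict(int)
    for customer, _movie, _rating in ratings:
        counts[customer] += 1
    return sorted(c for c, n in counts.items() if n == 1)


if __name__ == "__main__":
    print(genre_profile(RATINGS, MOVIE_GENRES))
    print(single_rating_customers(RATINGS))
```

A per-genre mean is only the simplest possible profile; it says nothing yet about how to handle the burst of ratings a new subscriber enters on day one, or what to predict for the one-rating customers.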
