I’ve seen many infrastructures in my day. I work for a company with a very complicated infrastructure now. They’ve got a dev/stage/prod environment for every product (and they’ve got many of them). Trust is not a word spoken lightly here. There is no ‘trust’ for even sysadmins (I’ve been working here for 7 months now and still don’t have production sudo access). Developers constantly complain about not having the access that they need to do their jobs and there are multiple failures a week that can only be fixed by a small handful of people that know the (very complex) systems in place. Not only that, but in order to save work, they’ve used every cutting-edge piece of software that they can get their hands on (mainly to learn it so they can put it on their resume, I assume), but this causes more complexity that only a handful of people can manage. As a result of this the site uptime is (on a good month) 3 nines at best.
In my last position (pronto.com) I put together an infrastructure that any idiot could maintain. I used unmanaged switches behind a load-balancer/firewall and a few VPNs around to the different sites. It was simple. It had very little complexity, and a new sysadmin could take over in a very short time if I were to be hit by a bus. A single person could run the network and servers and if the documentation was lost, a new sysadmin could figure it out without much trouble.
Over time, I handed off my ownership of many of the Infrastructure components to other people in the operations group and of course, complexity took over. We ended up with a multi-tier network with bunches of VLANs and complexity that could only be understood with charts, documentation and a CCNA. Now the team is 4+ people and if something happens, people run around like chickens with their heads cut off not knowing what to do or who to contact when something goes wrong.
Complexity kills productivity. Security is inversely proportionate to usability. Keep it simple, stupid. These are all rules to live by in my book.
Downtimes: Beatport: not unlikely to have 1-2 hours downtime for the main site per month. Pronto: several 10-15 minute outages a year Pronto (under my supervision): a few seconds a month (mostly human error though, no mechanical failure)
Ok, rant over. :)
Tweet