Thursday, September 23, 2010

Learning from the Mistakes of Others

I read an article recently about the now infamous BP oil spill and what went wrong.  Much of the focus was on the mechanism in place, called the Blowout Preventer or BOP, that failed to seal the oil pipe after the Deepwater Horizon rig exploded.

The Popular Science article I read paints quite a different picture.  In the weeks, days, hours, and minutes leading up to the explosion, processes were not followed, bad practices were employed, and warning signs were ignored.  All of these failures turned a preventable situation into a global disaster.

A lesson to be learned here is that in our IT roles we have the power, here and now, to prevent catastrophic failures for our organizations by following process, establishing and implementing best practices, and paying attention to signs of a potential problem.

I'll give a few examples of situations I've seen, what went wrong, and how each could have been prevented.

1. Virus and database failure.  An employee receives an email with a virus that propagates through his machine and onto other computers on the network.  The virus gets into a SQL Server instance and corrupts the database.  On-site backups fail, offsite backups brought back in fail, and luckily a copy of production that had been moved to a testing database server was used to restore production.

What went wrong and how to prevent it?  The employee whose computer was infected had his virus scanner disabled (he felt it was slowing everything down), Outlook was allowed to run scripts automatically (i.e., it wasn't locked down properly), and when the employee noticed his machine was out of space and a mysterious file was strewn all over his computer... he didn't notify anyone and left his machine connected to the network.  As you can see, a series of failures.  Next, the database backups were assumed to be good, but the backup process was never tested.  Doing daily backups and shipping them offsite is a great idea, but only if the backups work.  A restore to a test server should be done periodically to ensure the backups are usable (a sketch of what that might look like is below).  Also, there was no log shipping or other routine in place to capture data between backups.  Luckily, not much data was lost, but some surely was, because the "restored" data was a one-off copy of production data on a test database.
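Here is a minimal sketch of the kind of scheduled restore test that would have caught the bad backups before they were needed.  It assumes SQL Server, a test server that can reach the most recent backup file, and the pyodbc driver; the server name, backup path, database names, and logical file names are all placeholders for illustration, not the actual environment from this story.

```python
import pyodbc

# Hypothetical test server, backup location, and scratch database name.
TEST_SERVER = "SQLTEST01"
BACKUP_FILE = r"\\backupshare\nightly\Production_full.bak"
RESTORE_DB = "Production_RestoreTest"

# RESTORE can't run inside a transaction, so connect with autocommit on.
conn = pyodbc.connect(
    f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={TEST_SERVER};"
    "Trusted_Connection=yes",
    autocommit=True)
cur = conn.cursor()

# 1. Sanity check that the backup file is readable and structurally intact.
cur.execute(f"RESTORE VERIFYONLY FROM DISK = N'{BACKUP_FILE}'")

# 2. Restore it under a throwaway name on the test server.
#    The logical file names ('Production', 'Production_log') are assumptions;
#    check yours with RESTORE FILELISTONLY.
cur.execute(f"""
    RESTORE DATABASE [{RESTORE_DB}]
    FROM DISK = N'{BACKUP_FILE}'
    WITH REPLACE,
         MOVE N'Production'     TO N'D:\\Data\\{RESTORE_DB}.mdf',
         MOVE N'Production_log' TO N'D:\\Data\\{RESTORE_DB}.ldf'
""")
while cur.nextset():  # drain informational messages so the restore runs to completion
    pass

# 3. Run an integrity check against the restored copy.
cur.execute(f"DBCC CHECKDB ([{RESTORE_DB}]) WITH NO_INFOMSGS")
while cur.nextset():
    pass

print("Backup restored and passed integrity check.")
```

Run something like this on a weekly schedule and alert on any failure, and "we have backups" becomes "we have restores" - which is the only thing that actually matters when the production database is corrupted.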

2. A power outage occurs in an office building.  The server room's battery backup kicks in, but the outage is expected to last a while, so generator power is started up.  After the IT staff has gone home for the day, while employees are still working to meet a deadline, all IT services are lost.  It takes several calls and several hours to resolve.

What happened and how to prevent it?  IT did not have a clear procedure or process for running their server room on battery backup and/or generator power.  Ironically, they had just completed a Disaster Recovery Plan but hadn't contemplated this scenario.  An outside consultant handled the server room maintenance, and their procedure for using generator power included unplugging the main power line for the servers.  When power was restored and the generator taken offline, the servers were running solely on battery power.  Several hours later, after everyone had left, the battery backups died.  As mentioned, process played a part here.  Someone should have known and ensured the servers were running on the right power source.  No tools or notifications were set up to alert IT when servers started running on batteries or what the status of the batteries was.  Obviously, once all the servers went down there was no way to remotely access them, and a further delay was caused trying to locate the right people, figure out the problem, and physically get someone to the building to correct it.  Further compounding the issue, scheduled tasks did not get completed and some processes were stopped mid-run, so it took a full day of the programming team's time to assess everything in flight that might have failed.
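The missing piece here was any notification that the servers were on battery.  A rough sketch of that kind of monitor is below, assuming an APC-style UPS managed by the apcupsd daemon (so `apcaccess status` reports fields like STATUS, BCHARGE, and TIMELEFT); the mail relay, addresses, and polling interval are placeholders, and most UPS vendors ship their own tooling that can do the same job.

```python
import smtplib
import subprocess
import time
from email.message import EmailMessage

SMTP_HOST = "mail.example.com"      # placeholder mail relay
ALERT_TO = "it-oncall@example.com"  # placeholder on-call address
CHECK_INTERVAL = 60                 # seconds between polls

def ups_status():
    """Parse `apcaccess status` output into a dict of field -> value."""
    out = subprocess.run(["apcaccess", "status"],
                         capture_output=True, text=True, check=True).stdout
    fields = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

def send_alert(subject, body=""):
    msg = EmailMessage()
    msg["From"] = "ups-monitor@example.com"
    msg["To"] = ALERT_TO
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

was_on_battery = False
while True:
    status = ups_status()
    on_battery = "ONBATT" in status.get("STATUS", "")
    if on_battery and not was_on_battery:
        send_alert("Server room is running on battery power",
                   f"Charge: {status.get('BCHARGE', '?')}, "
                   f"estimated runtime left: {status.get('TIMELEFT', '?')}")
    elif not on_battery and was_on_battery:
        send_alert("Server room is back on utility/generator power")
    was_on_battery = on_battery
    time.sleep(CHECK_INTERVAL)
```

Had something like this paged the on-call person when the consultant pulled the main power line, the batteries dying hours later would have been a non-event instead of an outage.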

3. This last scenario involves ignoring warning signs.  These events happened at two different organizations, but the results were the same.  In both cases the server room started getting progressively warmer and nobody bothered to investigate.  The problem culminated in a number of servers shutting down due to overheating.  The IT staff had no plan for handling A/C failures and, by not investigating, didn't realize the dedicated A/C units for those server rooms had stopped working.  The servers would be restarted only to shut down again a few minutes later from the heat.  In both cases, someone had to scramble to find some type of large fan to cycle cool air into the server room.  And in both cases, weeks after the A/C was fixed, several machines experienced drive failures and other quirks likely caused by the overheating.

What went wrong?  Someone noticed a problem and failed to investigate.  A server room is normally small with a dedicated A/C unit, and without that cold air the room can get very hot very quickly, causing the servers to overheat.  As mentioned in scenario 2, while this office had a Disaster Recovery Plan, it didn't include environmental failures like this one.  While some servers were turned back on, others were left off (stopping work for some departments) to keep the heat down.  There was no plan establishing which departments had priority or how long each needed to keep operating.  Additionally, this business had a "hot" backup site, but no mechanism was in place to switch over to it.
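A simple temperature watchdog would have turned "the room feels warm" into an actionable alert long before servers started dropping.  The sketch below assumes a networked temperature sensor that answers a plain HTTP request with a Celsius reading; the sensor URL, thresholds, and alert hook are all hypothetical, and in practice you would read whatever your environmental monitor actually exposes (SNMP, IPMI, a vendor API) and page through whatever alerting you already have.

```python
import time
import urllib.request

SENSOR_URL = "http://192.168.1.50/temp"  # hypothetical sensor endpoint returning e.g. "27.5"
WARN_AT_C = 27.0                         # start paging before it becomes an emergency
CRITICAL_AT_C = 35.0                     # point at which low-priority servers should be shut down
CHECK_INTERVAL = 120                     # seconds between readings

def read_temp_c():
    """Fetch the current server room temperature from the (assumed) sensor."""
    with urllib.request.urlopen(SENSOR_URL, timeout=10) as resp:
        return float(resp.read().decode().strip())

def alert(message):
    # Placeholder: wire this into email, SMS, or whatever paging system is already in place.
    print(f"ALERT: {message}")

while True:
    try:
        temp = read_temp_c()
    except Exception as exc:  # a dead sensor is itself a warning sign worth paging on
        alert(f"Server room temperature sensor unreachable: {exc}")
        time.sleep(CHECK_INTERVAL)
        continue

    if temp >= CRITICAL_AT_C:
        alert(f"Server room at {temp:.1f} C - begin shutting down low-priority servers")
    elif temp >= WARN_AT_C:
        alert(f"Server room at {temp:.1f} C and rising - check the A/C")

    time.sleep(CHECK_INTERVAL)
```

Pair the alert with a written priority list of which servers stay up and which get powered down, and the decision the IT staff had to improvise in the heat of the moment becomes a checklist instead.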
