Thursday, September 23, 2010

Learning from the Mistakes of Others

I recently read an article about the now-infamous BP oil spill and what went wrong.  Much of the public focus has been on the mechanism in place, called the blowout preventer or BOP, that failed to seal the oil pipe after the Deepwater Horizon rig exploded.

The Popular Science article I read paints quite a different picture.  In the weeks, days, hours, and minutes leading up to the explosion, processes were not followed, bad practices were employed, and warning signs were ignored.  All these failures turned a preventable situation into a global disaster.

A lesson to be learned here is that in our IT roles we have the power, here and now, to prevent catastrophic failures for our organizations by following process, establishing and implementing best practices, and paying attention to signs of a potential problem.

I'll give a few examples of some situations I've seen, what went wrong, and how it could have been prevented.

1. Virus and Database failure.  An employee receives an email with a virus that propagates through his machine and into other computers on the network.  The virus gets into a SQL Server and corrupts the database.  The on-site backups fail, the offsite backups brought in fail too, and it is only by luck that a production copy had been moved to a testing database server, which is used to restore production.

What went wrong and how to prevent it?  The employee whose computer was infected had the virus scanner disabled (he felt it was slowing everything down), Outlook was allowed to run scripts automatically (i.e., it wasn't locked down properly), and when the employee noticed his machine was out of disk space and a mysterious file was all over his computer, he didn't notify anyone and left his machine connected to the network.  As you can see, this was a series of failures.  Next, the database backups were assumed to be good, but the backup process was never tested.  Doing daily backups and shipping backups offsite is a great idea, but only if the backups work.  A restore to a test server should be done periodically to ensure backups are working.  Also, there was no log shipping or other routine in place to capture data between backups.  Luckily, not much data was lost, but some surely was, because the "restore" data was a one-off copy of production data to a test database.
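
A periodic restore test can be scripted.  The sketch below is only an illustration of the idea, not what this organization ran: it uses SQLite from the Python standard library as a stand-in for SQL Server, and the table name and the helper functions `row_counts` and `restore_test` are made up for the example.  It restores a backup into a separate database, runs an integrity check, and compares row counts against the source.

```python
import sqlite3


def row_counts(conn):
    """Return {table_name: row_count} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
            for t in tables}


def restore_test(source, restored):
    """Compare a restored copy against its source; return a list of problems."""
    problems = []
    # PRAGMA integrity_check returns a single row containing 'ok' when healthy.
    check = restored.execute("PRAGMA integrity_check").fetchone()[0]
    if check != "ok":
        problems.append(f"integrity check failed: {check}")
    expected, actual = row_counts(source), row_counts(restored)
    for table, count in expected.items():
        if table not in actual:
            problems.append(f"missing table: {table}")
        elif actual[table] != count:
            problems.append(
                f"{table}: expected {count} rows, restored {actual[table]}")
    return problems


# Hypothetical "production" database with a couple of rows.
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
production.executemany("INSERT INTO orders (item) VALUES (?)",
                       [("widget",), ("gadget",)])
production.commit()

# Stand-in for "take a backup, then restore it on the test server".
restored = sqlite3.connect(":memory:")
production.backup(restored)

print(restore_test(production, restored))  # prints [] when the copy matches
```

On a live system the row counts would need to be captured at backup time, since production keeps changing afterwards.  The point is simply that a backup is not counted as good until something has actually restored it and looked at the result.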

2. A power outage occurs in an office building.  The server room backup power kicks on, but the outage is expected to last a while, so generator power is started up.  Later, after all the IT staff have gone home for the day but while employees are still working to meet a deadline, all IT services are lost.  It takes several calls and several hours to resolve.

What happened and how to prevent it?  IT did not have a clear procedure or process for running their server room on battery backup and/or generator power.  Ironically, they had just completed a Disaster Recovery Plan and didn't contemplate this scenario.  An outside consultant handled the server room maintenance, and their procedure for using generator power included unplugging the main power line for the servers.  When power was restored and the generator was taken offline, the main line was never plugged back in, so the servers were running solely on battery power.  Several hours later, after everyone had left, the battery backups died.  As mentioned, process played a part here.  Someone should have known about, and confirmed, which power source the servers were running on.  No tools or notifications were set up to tell IT when the servers started running on batteries or what the status of the batteries was.  Obviously, once all the servers went down there was no way to access them remotely, and further delay was caused by trying to locate the right people, figure out the problem, and physically get someone to the building to correct it.  Compounding the issue, scheduled tasks did not get completed and some processes were stopped midway, so it took a full day of the programming team's time to assess anything in process that might have failed.
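
The missing notification does not need to be elaborate.  Here is a minimal sketch of the decision logic, assuming some monitoring tool can supply whether the UPS is on battery and roughly how much runtime is left.  The `UpsReading` fields, the `ups_alert` function, and the 30-minute threshold are all hypothetical; how the readings are collected and how the alert is delivered (email, text, pager) would depend on the equipment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UpsReading:
    """One status sample from a UPS (fields are assumptions for this sketch)."""
    on_battery: bool
    charge_percent: float
    minutes_remaining: float


def ups_alert(reading: UpsReading, critical_minutes: float = 30.0) -> Optional[str]:
    """Return an alert message, or None when the servers are on mains power."""
    if not reading.on_battery:
        return None
    # Any time on battery is worth a warning; low runtime escalates it.
    level = "CRITICAL" if reading.minutes_remaining <= critical_minutes else "WARNING"
    return (f"{level}: servers on battery, {reading.charge_percent:.0f}% charge, "
            f"about {reading.minutes_remaining:.0f} minutes remaining")


for sample in (UpsReading(False, 100, 120),
               UpsReading(True, 80, 90),
               UpsReading(True, 20, 15)):
    print(ups_alert(sample))
```

Even the plain warning would have been enough in this scenario, because the servers sat on battery for hours before anything failed.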

3. This last scenario involves ignoring warning signs.  These events happened at two different places, but the results were the same.  In both cases the server room got progressively warmer and nobody bothered to investigate.  The problem culminated in a number of servers shutting down due to overheating.  The IT staff had no plan for handling A/C failures and, by not investigating, didn't realize the dedicated A/C units for those server rooms were not working.  The servers would be restarted only to die again a few minutes later from overheating.  In both cases, someone had to scramble to find some type of large fan to push cool air into the server room.  In both cases, weeks after the A/C was fixed, several machines experienced drive failures and other quirks, likely due to the overheating.

What went wrong?  Someone noticed a problem and failed to investigate.  A server room is normally small with a dedicated A/C unit, and without that cold air the room can quickly get very hot, causing the servers to overheat.  As in Scenario 2, while this office had a Disaster Plan, it didn't cover environmental failures like this.  While some servers were turned back on, others were left off (stopping work for some departments) to keep the heat down.  There was no plan setting out which departments had priority to keep operating and for how long.  Additionally, this business had a "hot" backup site, but no mechanism was in place to switch over to it.
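
A room that gets "warmer and warmer" is exactly the kind of warning sign a simple check can catch before servers start shutting down.  The sketch below assumes temperature readings in Celsius arrive from some sensor, oldest first.  The function name `temperature_alert` and both thresholds (27 C as a hard limit, a 3 C rise as a trend warning) are illustrative assumptions, not recommended values.

```python
from typing import Optional, Sequence


def temperature_alert(readings_c: Sequence[float],
                      max_c: float = 27.0,
                      rise_c: float = 3.0) -> Optional[str]:
    """Check recent server room temperatures (oldest first) for trouble.

    Returns a CRITICAL message when the latest reading is at or above max_c,
    a WARNING when the room has warmed by rise_c or more across the window,
    and None otherwise.
    """
    if not readings_c:
        return None
    latest = readings_c[-1]
    if latest >= max_c:
        return f"CRITICAL: server room at {latest:.1f} C (limit {max_c:.1f} C)"
    rise = latest - readings_c[0]
    if rise >= rise_c:
        return (f"WARNING: server room warmed {rise:.1f} C "
                f"over the last {len(readings_c)} readings")
    return None


print(temperature_alert([21.0, 21.2, 21.1]))   # steady room
print(temperature_alert([21.0, 23.0, 24.5]))   # warming trend
print(temperature_alert([24.0, 26.0, 28.0]))   # over the limit
```

The trend warning matters more than the hard limit here: in both offices the room was warming for some time before anything shut down, which is when someone could still have checked the A/C.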

Sunday, September 19, 2010

Uses for Power Line network plugs

Many people complain that the top speed of the "Power Line" style network adapters is slow, but I've found a few ways and places that these can be useful.

If you don't know, Power Line adapters plug into the wall and use your existing electrical wiring to transmit data.  This works because the current going through the lines is a slow wave (50 or 60 cycles per second), and the adapter adds a much higher-frequency data signal on top of it, so power and data can share the same wires.

1. I've used one of the newer Netgear Power Lines to access internet content on my Xbox 360.  I normally use it to download patches and updates and to watch Netflix (I'm normally able to get full picture quality with no lag).  I'm not sure how good this would be for gaming, but for my purposes it works great.

2. I keep one always connected to my router and the wall so that I can have a floating hard-wired connection in the house as needed.  Sometimes my household wireless has trouble in some spots and the speed is not good enough for what I'm trying to do (like streaming Netflix on my netbook), so having the ability to plug in a connection anywhere is ideal.

3. Network printing.  I couldn't find a spot in my office for the printer so I put it in a closet and used a Power Line for connectivity to it.

4. WAPs.  Another solution to the weak wireless spots in item 2 is to use a Power Line plug to give a Wireless Access Point a wired connection back to the router.  This way you can stay wireless but extend your range, or test a WAP's range and performance in a given spot before running Cat 5.

5. IP Phones.  A Power Line plug gives an IP phone, or your VoIP router, a connection virtually anywhere in your house.

Monday, September 13, 2010

ClearCase ... not a fan

So one of my clients is switching to ClearCase for source management.  Not a fan!

With, say, Visual SourceSafe, installation is as simple as running the installer, maybe specifying a few simple options, and waiting for it to finish.

With ClearCase I've had literally 30 steps to perform, and it's still not installed and integrated with Visual Studio.

Some steps:
1. To check the version I had to run a batch file and email its output to someone.  Ironically, the utility runs and the output message is "take this text file and email it to the person who sent you this utility."

2. I've had to search the HDD for files so I could update the registry with the correct path.  (Why couldn't the install figure this out and update the registry?)

3. I've had to run additional command line utilities to attempt to register ClearCase with Visual Studio.

Really?  I feel like we're back in the 90s, running EMM386 and changing Config.sys files to make enough Virtual and Extended Memory to play an MS-DOS game!

People who complain Windows is too complex/buggy need to try installing ClearCase!