Jun 16
The Concept of DR (Disaster Recovery) is dead!
Pretty provocative statement? What do I mean by this?
Typical Scenario
Large Enterprises of 1,000 users or more have looked at what they implement in IT and realized at some point that, although they have done their best to eliminate any potential "Single Points of Failure" during the design stage, they now have to accept (especially post 9/11) that in the event of a Natural Disaster or other major failure of the building or Data Centre they will not be able to continue to trade or stay in business.
This has generally led to the devising of a “Disaster Recovery” plan. Normally, in the first instance, a site is chosen at a third party’s Hosting Centre. Then, due to the cost of the service and/or dedicated Hardware (that everyone expects never to use, BTW), the plan is stifled at birth and can remain moribund.
Any of this sound vaguely familiar?
This then possibly progresses at some point (when the next yearly(?) review of the DR Plan comes around) to adding a bit more hardware and possibly a dedicated Leased line of some kind to enable comms. This is still somewhat half-hearted, though, and when someone asks why a test takes so long, the usual answers are:
- A – It wasn’t done properly by *the other guys/consultants/*
- B – We should test every quarter – "It would be easier and quicker….."
When the next review comes around, the board hits the roof when they’re told that in the event of a Disaster the IT dept. thinks it would realistically take 2 – 5 days to get up and running. (This of course depends on what the board are told, or indeed how much it cost to implement the DR in the first place.)
Result
By the time they’ve got around to doing quarterly tests, you can almost guarantee enough full time work for one Project Manager, as well as 3 – 4 IT staff dedicated to this for 1 – 2 weeks every quarter, not forgetting the users from the Business to run the testing. On top of this you also have an array of Hardware that may or may not fit your needs, depending on whether you have purchased the kit outright or it’s *shared* from a service supplier. ALL of it doesn’t do much – BUT it still needs to be replaced and/or upgraded at various stages.
Is this the best use of budget and resources? I don’t think so – do you?
Time to re-think the approach?
If you are going to re-think or re-design your DR strategy, the very first thing to think about is getting the appropriate buy-in from the rest of the company. What do the other dept. heads think is a valid recovery time? Explain to them that cost does not scale in a straight line with this time: each step towards a shorter recovery time costs disproportionately more than the last.
Zero down time costs squillions, and even reducing the down time of critical systems to less than 5 minutes can seem extortionate. Yet we all feel that a 4 – 6 hour recovery time for DR should be within our grasp at a reasonable cost. Shouldn’t it?
Take a leaf out of the book of bigger organizations and see whether their best practices can help you deploy a better model of DR.
We now live in a world with a reasonably healthy supply of comms and bandwidth (apart from our colleagues further away from the main centres. Sorry, chaps). **SO**, instead of implementing a design based on *Live* – *DR*, look to implement a design based on *Live* – *Live*.
My personal choice of design to mitigate a disaster is a system that incorporates the idea of 2 Data Centres from the start. If you can’t afford much, then just make the secondary site smaller, even if it means that you only incorporate 1 x Citrix, 1 x Database and 1 x AD/Exchange. Most companies at this level will have either AD or NDS (or some other Directory Services) and will be aware that there are limitations around tombstoning of deleted objects in the tree. These mean you CANNOT just copy a Domain Controller or Catalogue and leave it on tape at the DR Site for when you come running – any more than 60 days and you’re Toast! You really need to have something live at the Secondary site.
This is where the cheap bandwidth part comes in – think about how much time and effort it takes to keep on top of the Backup System. Wouldn’t this be so much easier if you just had some form of replication of your backend systems to the Secondary site? Don’t forget that in some cases there will be no alternative except to go out and buy the kit and place it on the secondary site – HP Superdomes don’t just sit around in warehouses. So if you’ve already bought it, you might as well be using it rather than letting it sit around gathering dust.
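To make the tombstone point concrete, here is a minimal sketch of the date check involved. It assumes the 60 day figure quoted above; the real limit is a setting on your own directory, so treat the constant and the function names as illustrative only.

```python
from datetime import date, timedelta

# Assumption: 60 days, the figure quoted in this post. The real limit is a
# directory setting, so check your own environment before relying on it.
TOMBSTONE_LIFETIME = timedelta(days=60)


def dc_copy_is_usable(copy_taken: date, today: date,
                      lifetime: timedelta = TOMBSTONE_LIFETIME) -> bool:
    """Return True if an offline Domain Controller copy is no older than
    the tombstone lifetime."""
    return today - copy_taken <= lifetime


def days_remaining(copy_taken: date, today: date,
                   lifetime: timedelta = TOMBSTONE_LIFETIME) -> int:
    """Days left before the copy goes stale (negative once it has)."""
    return (copy_taken + lifetime - today).days


if __name__ == "__main__":
    taken = date(2024, 1, 1)
    print(dc_copy_is_usable(taken, date(2024, 2, 1)))   # 31 days old
    print(dc_copy_is_usable(taken, date(2024, 4, 1)))   # 91 days old
```

A copy left on tape at a DR site will fail this check sooner or later, which is the argument for keeping something live at the Secondary site instead.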
**Caveat**
This will not suit all situations – for organizations that simply cannot cost justify this level of continuity, the numbers will never show it to be viable. But certainly, any organization that relies on minute by minute continuity of service to stay in business could find this design approach highly effective.
If this concept still stretches your budget too far, then think about at least deploying a Remote Access system from another site. Once you have a DC and a Secondary Exchange Server with OWA, at least you can still get eMail. Got a bit more cash? Start replicating the more important files to the DC – expand to more storage and then add a Citrix Server. You can always accomplish this by Stealth if you have to.
Why I think DR is dead
By designing systems as *Live* – *Live* or Primary and Secondary from the start, it means that whenever you initiate Change Control to make changes to the System (You do have Change Control, don’t you?) you are always doing it at both sites regardless of location. It becomes more than a process – it becomes ingrained!
This is what I mean by the provocative statement at the start. Even if the second site is smaller, has less kit and can handle fewer users – SO WHAT! When your main Data Centre has just been hit, the fact is that you can use the Citrix servers in the secondary site from:
- an Internet Cafe
- Home
- an Airport
- a Laptop at a Wireless HotPoint in Starbucks
- Any site you choose
*INSTANTLY* with no cross over or manual intervention required
The IT Director will appear to be some kind of Svengali who has magically managed to keep the Systems running despite the disaster.
The Secondary Site needs to be scalable
The reality is that with all this going on around you, so many people will be trying to fix things back at HQ, or looking for new Office space until the current HQ can be used again, that only 30 – 40% of users are likely to need access to the system.
The next thing to do then is shore this up and anticipate your needs. There is a distinct likelihood that you will see capacity back up to 70 – 80% within days – BUT you now have days in which to get the hardware, re-image, recover tapes and be ready for it.
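As a rough illustration of that sizing exercise, the sketch below turns the percentages above into a server count. The users-per-server figure is purely my assumption for the example (measure your own), and the function name is hypothetical.

```python
import math


def servers_needed(total_users: int, percent_active: int,
                   users_per_server: int) -> int:
    """Citrix servers needed to carry a given percentage of the user base."""
    active_users = math.ceil(total_users * percent_active / 100)
    return math.ceil(active_users / users_per_server)


TOTAL_USERS = 1000       # the size of company discussed in this post
USERS_PER_SERVER = 50    # assumption: illustrative only, not a sizing rule

day_one = servers_needed(TOTAL_USERS, 40, USERS_PER_SERVER)
within_days = servers_needed(TOTAL_USERS, 80, USERS_PER_SERVER)
shortfall = within_days - day_one

if __name__ == "__main__":
    print(f"Day one: {day_one} servers, within days: {within_days}, "
          f"extra to source: {shortfall}")
```

The useful number is the shortfall: that is the kit you need to be able to get hold of, re-image and bring online in the days you have bought yourself.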
And won’t you come out of that smelling of roses?
Costs
Now to cap off this little rant, let’s get back to costs shall we? Please believe me when I say that no 2 companies will have the same requirements so this is all pure guess work – BUT caveats read and understood :)
Once you get to ~ 1,000 users the resources required will pretty much scale the same (ish….) (BTW, as far as I know the ~ symbol means approx., OK?)
1,000 users can mean an IT head count of ~20 – 50, depending on the propensity of IT to do in-house development and/or do things in Access, etc.
A traditional approach to DR could see us with:
A bunch of hardware sitting around that gets fired up once a quarter – or it’s left powered up all the time, even though it’s not used and it consumes Power?
One full time Project Manager
4 IT Staff x 10 days x 4 times a year = 160 Days
6 Test Users x 4 days x 4 times a year = 96 Days (call it ~100)
This gives us the equivalent of ~2.5 full time employees, at a cost to the company of ~50K/pa each
Manpower = 125K
Hardware = 50K plus?
Not forgetting any additional manpower needed when there have been big changes since the last test that need to be implemented, as well as any failed tests that need fixing and then testing again.
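For anyone who wants to plug in their own numbers, the manpower sums above can be sketched as below. The 220 working days per year is my assumption, not a figure from this post, and it is the one that moves the answer most.

```python
WORKING_DAYS_PER_YEAR = 220   # assumption: not a figure from this post
COST_PER_FTE = 50_000         # ~50K/pa, as quoted above


def person_days(people: int, days_per_test: int, tests_per_year: int) -> int:
    """Total person-days spent on DR testing in a year."""
    return people * days_per_test * tests_per_year


it_days = person_days(4, 10, 4)     # 4 IT staff, 10 days, quarterly
user_days = person_days(6, 4, 4)    # 6 test users, 4 days, quarterly

# Convert testing effort to full time equivalents, then add the
# full time Project Manager.
testing_fte = (it_days + user_days) / WORKING_DAYS_PER_YEAR
total_fte = 1 + testing_fte
manpower_cost = total_fte * COST_PER_FTE

if __name__ == "__main__":
    print(f"IT days: {it_days}, user days: {user_days}")
    print(f"Total FTE: {total_fte:.1f}, manpower cost: {manpower_cost:,.0f}")
```

On that assumption the total comes out a little over 2 full time employees, in the same ballpark as the rounded figure used above, and that is before any re-tests or catch-up work are counted.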
**I have not covered initial design and implementation – just on-going commitment**
Although this is only a rough guide, what I’m trying to show here is that it’s the labour and/or resources involved with continually testing and verifying the traditional approach that sucks the very lifeblood out of the whole concept. And if you aren’t doing regular testing then it’s probably taking even longer to get things back to a workable state.
Powerful idea
One beautiful point here is that simply by referring to this as a Secondary site as opposed to DR, and incorporating the concept that this is part of normal IT capacity, you are changing the board’s perception of the capital and operational costs into something that is a percentage of normal IT funding instead of Dead money for kit you hope you don’t use.
Anyway – that’s my 2 cents on that……………….
Additional 5 cents worth below…..
As soon as you even mention the concept of DR to the board – what they are drawing in their head is a symbolic switch and a mirror.
You have now done such a good job of selling it that they believe that when the “proverbial hits the rotating blades” you will be able to throw the switch at a moment’s notice, this mirror will magically mimic the production system, and all will be 100% fantastic.
Try and rethink the architecture so that servers, services, objects and the like DO actually rely on TCP/IP to traverse the environment – this way you can quite simply redirect or “load balance” traffic, or should it come to it, use DNS to act as your switch to redirect part of the load to Systems that are working.
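A minimal sketch of the "DNS as your switch" idea might look like the following. The site names, addresses (taken from the documentation IP ranges) and function names are all hypothetical, and actually pushing the result into your DNS server is left out because that depends entirely on which DNS you run.

```python
import socket

# Hypothetical sites and addresses, for illustration only.
SITES = {
    "primary": "192.0.2.10",
    "secondary": "198.51.100.10",
}


def tcp_probe(address: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Treat a site as healthy if a TCP connection to it succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def addresses_to_publish(sites: dict, probe=tcp_probe) -> list:
    """Return the addresses a DNS name should resolve to right now.

    Every healthy site is published, which spreads the load. If nothing
    answers, publish everything rather than an empty record set.
    """
    healthy = [addr for addr in sites.values() if probe(addr)]
    return healthy or list(sites.values())


if __name__ == "__main__":
    # Simulate the primary site being down by swapping in a fake probe.
    primary_down = lambda addr: addr != SITES["primary"]
    print(addresses_to_publish(SITES, probe=primary_down))
```

The probe is passed in as a parameter so the selection logic can be tried out without touching the network. In practice you would also want a short TTL on the record so that clients pick up the change quickly.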
Taking this concept further, if you have QoS (Quality of Service) in use at the Cisco/Switch/Router level you can even start pre-defining a “cost” element for traffic crossing to the secondary site – this *could* allow you to design a system where you don’t even need to throw a switch to cut over at all. Now how cool would that be?
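To show what a pre-defined "cost" buys you, here is a small sketch that models it as a lowest-cost-wins rule, similar in spirit to a routing metric. It is not router configuration, and the class, site names and cost values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    site: str    # hypothetical site name
    cost: int    # lower is preferred
    up: bool     # whether the path is currently usable


def preferred_site(paths):
    """Pick the usable path with the lowest cost, or None if all are down."""
    usable = [p for p in paths if p.up]
    if not usable:
        return None
    return min(usable, key=lambda p: p.cost).site


if __name__ == "__main__":
    normal = [Path("primary", 10, True), Path("secondary", 100, True)]
    disaster = [Path("primary", 10, False), Path("secondary", 100, True)]
    print(preferred_site(normal))
    print(preferred_site(disaster))
```

Nobody throws a switch in the second case: the secondary site simply becomes the cheapest path that still works.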
