How to Test a Data Center’s Ability to Recover
Your company has a data center or two, or maybe more, and you are wondering how best to test them to make sure you can recover every aspect under any circumstances. How far should you go, and what kind of commitment is required?
There are a number of answers, and each requires a different level of commitment in terms of resources, both people and money.
If you have an outsourced third-party recovery contract, or fully resilient data centers, then you should be performing exercises at least annually to make sure you can recover to your backup location as quickly as possible. Both speed and accuracy must be tested and recorded during the exercise, to reduce the surprises that may crop up if you have to recover in anger during a real event.
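One way to record speed and accuracy is to log, for each system, the time it took to recover against an agreed target, plus whether post-recovery validation passed. The following is only a minimal sketch: the `ExerciseResult` fields, the system names, and the numbers are all hypothetical placeholders, not figures from a real exercise.

```python
from dataclasses import dataclass


@dataclass
class ExerciseResult:
    system: str
    target_minutes: int   # agreed recovery time target (assumed value)
    actual_minutes: int   # time measured during the exercise
    data_verified: bool   # did post-recovery validation pass?


def summarize(results):
    """Return the systems that missed their time target and those that failed validation."""
    late = [r.system for r in results if r.actual_minutes > r.target_minutes]
    inaccurate = [r.system for r in results if not r.data_verified]
    return late, inaccurate


# Hypothetical results from one recovery exercise.
results = [
    ExerciseResult("payroll", 240, 195, True),
    ExerciseResult("email", 120, 150, True),
    ExerciseResult("order-db", 60, 55, False),
]

late, inaccurate = summarize(results)
print("Missed time target:", late)
print("Failed validation:", inaccurate)
```

Keeping the same record from one exercise to the next makes it easier to see whether recovery is getting faster and more reliable over time.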
However, your work shouldn’t end there. Your primary data center may be vulnerable to multiple risks that are not addressed by a simple recovery to your alternate location. For example, if you have diverse power feeds, have you ever turned off one of those feeds to confirm that a single feed can power the entire data center? This is an invaluable exercise that can teach you a lot about the power infrastructure of your data center. It will uncover any erroneous wiring, such as cabinets that are not correctly connected to both power feeds. You would much rather find this out during a maintenance window under your control than during the working day, when it can cause a lot more damage. Once you’ve done this with one of the power feeds, it is a good idea to repeat the test with the other feed.
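Before the maintenance window, it can help to check your cabinet documentation for obvious gaps. The sketch below assumes you keep an inventory mapping each cabinet to the feeds it is wired to; the feed labels "A" and "B", the cabinet names, and the `find_feed_gaps` helper are hypothetical. A paper check like this does not replace the live test, since the point of the test is to catch wiring that differs from the documentation.

```python
def find_feed_gaps(cabinet_feeds, required=("A", "B")):
    """Return {cabinet: [missing feeds]} for cabinets not wired to every required feed."""
    gaps = {}
    for cabinet, feeds in cabinet_feeds.items():
        missing = sorted(set(required) - set(feeds))
        if missing:
            gaps[cabinet] = missing
    return gaps


# Hypothetical inventory: cabinet name -> feeds each power strip is connected to.
inventory = {
    "R01": ["A", "B"],   # correctly dual-fed
    "R02": ["A"],        # only one feed documented
    "R03": ["B", "B"],   # both strips on the same feed
}

gaps = find_feed_gaps(inventory)
for cabinet, missing in gaps.items():
    print(f"{cabinet}: not connected to feed(s) {', '.join(missing)}")
```

Any cabinet this flags is one you would expect to lose power when the missing feed is switched off, so it is worth correcting or at least noting before the test begins.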
A further extension of this is to take both feeds down at the same time, to simulate a complete loss of power. This must be planned very carefully, but the advantage is that every system in the data center will come down. Systems can be shut down gracefully, or they can be allowed to come down hard to simulate an abrupt power outage. Do this only if you have documented processes for recovering your systems and you are confident those processes will succeed.
These methods are an ideal way to test and validate your documented processes, perform tests that are much more in-depth than a recovery to your alternate site, and gain a much better understanding of your data center assets. If you intend to run the types of exercises I’ve mentioned here, you should begin holding planning sessions 3 to 4 months before the exercise date. There are a lot of decisions to be made in order to do this correctly.
Thanks.
Harlan Dolgin, CBCP
Dolgin Consulting
314.304.4354
hdolgin@dolginconsulting.com
