Over the past 20-plus years, I have found that one thing executives have in common is the desire to share their biggest IT nightmares and deepest secrets over dinner. I have heard tales of data being backed up to tape religiously, only for the team to find, when the backups were needed, that every tape was blank. One person I spoke with had backed up the data correctly but had never tested the restore process, and when the time came, it did not work.
There were confessions about Disaster Recovery sites that could not possibly have been made operational in less than weeks or even months. One executive shared that, prior to a Disaster Recovery exercise, their teams spent all week backing up the production environment to get ready for the test. I cannot make this stuff up!
So far, my favorite data horror story is the one about the administrator who found out the operations manager was taking tapes home each evening in his car. This happens more often than we like to think, and it is something of a shared secret among IT groups.
These stories share a common thread. The executives who told them had never exercised the Disaster Recovery plan until the disaster occurred. Many had no documented plan, or they had a plan that had gone years without updates, with contact lists full of employees who no longer worked at the company. All of these problems could have been discovered during a routine exercise.
Early in my tenure at the State of Nebraska, I decided to eliminate the disaster recovery site in favor of two active data centers. The idea behind two data centers was to keep data and network configuration replicated and synchronized between the sites. The solution Nebraska put in place now works in real time with a geographically separate site. I made this decision for several reasons, the most important being availability. The former disaster recovery site was simply not a realistic option in the event of an outage or disaster.
The main purpose of the disaster recovery site was to recover quickly, so that business operations could continue with minimal impact on customers. While it seemed desirable to recover all applications as quickly as possible, the recovery process had to prioritize the most critical applications, or the ones that affected the largest number of citizens, such as those on the mainframe. So the State of Nebraska implemented a design that allowed the mainframe to move from one site to the other in the event of a failure at either site.
The Operations team proactively tested the design with some assistance from partner agencies. The test met both the Recovery Time Objective (RTO), which was measured in minutes, and the Recovery Point Objective (RPO), which was zero. This result assured the team that, in the event of an outage, the citizens of Nebraska would have access to business-critical applications with limited downtime and no data loss.
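For readers who want to see what "meeting" an RTO and an RPO means in practice, the comparison can be sketched in a few lines of Python. This is only an illustration, not the tooling Nebraska used: the `DrillResult` fields, the function names, the timestamps, and the 15-minute RTO are all hypothetical (the exercise described above states only that the RTO was minutes and the RPO was zero).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during one failover exercise (hypothetical fields)."""
    outage_start: datetime           # when the active site was taken offline
    service_restored: datetime       # when the application answered at the other site
    last_committed_write: datetime   # newest write accepted at the failed site
    last_replicated_write: datetime  # newest write present at the surviving site


def measured_recovery_time(drill: DrillResult) -> timedelta:
    """Elapsed time from the outage until service was available again."""
    return drill.service_restored - drill.outage_start


def measured_data_loss(drill: DrillResult) -> timedelta:
    """Window of writes that never reached the surviving site (zero if none)."""
    gap = drill.last_committed_write - drill.last_replicated_write
    return max(gap, timedelta(0))


def meets_objectives(drill: DrillResult, rto: timedelta, rpo: timedelta) -> bool:
    """True when both the recovery time and the data-loss window are within target."""
    return measured_recovery_time(drill) <= rto and measured_data_loss(drill) <= rpo


# Illustrative numbers only: a 4.5-minute failover with fully replicated data.
drill = DrillResult(
    outage_start=datetime(2024, 1, 10, 2, 0, 0),
    service_restored=datetime(2024, 1, 10, 2, 4, 30),
    last_committed_write=datetime(2024, 1, 10, 1, 59, 58),
    last_replicated_write=datetime(2024, 1, 10, 1, 59, 58),
)

# Assumed targets: RTO of 15 minutes, RPO of zero.
print(meets_objectives(drill, rto=timedelta(minutes=15), rpo=timedelta(0)))
```

Recording these four timestamps during every exercise gives a documented, repeatable answer to whether the plan actually works, rather than an assumption that it does.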
When the primary focus is availability, a traditional Disaster Recovery site misses the mark. Critical applications must be accessible at all times for governments to serve citizens effectively. Application availability begins with network architecture, which is why Nebraska redesigned its network as part of recent consolidation efforts. The consolidated configuration now delivers high availability for citizens in the dynamic and agile 21st century.
The moral of these horror stories is that IT groups responsible for ensuring business continuity and data availability should confirm that their plans work through documented Disaster Recovery exercises. The reality is that delivering high availability of applications across multiple interdependent systems is an increasingly difficult task.