Why Thinking ‘Post-Failure’ Is Important
In the event of a catastrophic failure, one that brings your infrastructure down and your operations to a standstill, you need a sensible, priority-emphasized, easy-to-follow plan of action. As company operations increasingly rely on network services, digital databases, electronic communications, and web traffic, IT infrastructures (and their ability to be highly available) become increasingly important.
However, failures happen. Whether it’s an act of God, a squirrel taking out an entire power grid (it really happened), or just good old fashioned human error—the Ponemon Institute reported that 22% of all data center outages in 2016 were due to human error—your workflow, your revenue stream, and your infrastructure are susceptible to stoppages. If, and when, disaster strikes your enterprise you need a clear idea of how (and maybe more importantly, when) you’re going to get back up and running so that you can get back to business.
The Potential Consequences of a Catastrophic Failure are Horrendous
According to a survey conducted by the Uptime Institute, in 2018 nearly a third of all data centers experienced an outage—a 25% increase from the year before. The top three causes? Power outages (33%), network failures (30%), and software errors (28%). So, as you can see, it happens. Maybe more often than you’d think. But what does this mean for you and your company? Well, uptime is absolutely integral to safeguarding your workflow and revenue. In fact, per a report by Gartner, Inc., companies lose an average of $5,600 for every minute of IT downtime—website, servers, database, and the like. That’s $336,000 an hour.
Now consider that the average business experiences 14.1 hours of IT downtime annually (according to the Aberdeen Group). That’s as much as $4.7 million. Annually.
Obviously, this all depends on the size of your company, your business model, and your reliance on IT, but the idea remains the same—every minute your infrastructure is down, you’re losing a lot of money.
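The arithmetic behind those figures is easy to rerun with your own numbers. Here is a minimal sketch; the $5,600-per-minute and 14.1-hour values are the industry averages cited above, so substitute your own measurements:

```python
def annual_downtime_cost(cost_per_minute: float, downtime_hours: float) -> float:
    """Estimate yearly revenue lost to IT downtime."""
    return cost_per_minute * 60 * downtime_hours

# Industry averages cited above; replace with your own measurements.
hourly_loss = 5600 * 60                          # $336,000 per hour
annual_loss = annual_downtime_cost(5600, 14.1)   # roughly $4.7 million per year
print(f"${hourly_loss:,.0f}/hour, ${annual_loss:,.0f}/year")
```

Plugging in your own per-minute loss and measured downtime turns the abstract statistics into a budget line you can defend.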
If every minute counts during an IT stoppage, you’re going to want to get back up and running as quickly as possible. So, why not take the time now (while all systems are go) to come up with a plan to mitigate downtime in the event of a massive failure—it could quite literally be the one thing that saves your company. Is this hyperbole? Nope. According to the U.S. Bureau of Labor, 80% of companies that experience an IT-related catastrophic failure (and don’t have a disaster recovery plan in place) will fail within a little over a year, while 43% won’t even reopen. The numbers are worse when the catastrophe is data related. Again according to the U.S. Bureau of Labor, a staggering 93% of companies that experience a significant data loss are out of business within five years.
Here Is Your Post-Failure Plan and Checklist
A disaster recovery plan is going to save you time, which is going to save you money, maybe even your company. It’s as simple as that. Yes, there are some scary statistics out there (and there’s more to come), but there are some hopeful ones too. For example, according to Datto, “With a reliable backup and disaster recovery solution in place, the majority of SMBs will fully recover from a ransomware infection. With a reliable backup and recovery solution (BDR) in place, 96% of MSPs report clients fully recover from ransomware attacks.”
A reliable disaster recovery plan isn’t something you can draw up in a day and put on the shelf somewhere, though. A reliable disaster recovery plan takes coordination, forethought, testing, and consistent retooling.
Below, you’ll find a checklist for creating and maintaining your own disaster recovery plan.
Your 17-Step Disaster Recovery Plan Checklist
- Commit to implementing a disaster recovery plan. Make it a policy. You need to make disaster recovery a priority, not just give it lip service. It’s human nature to eschew actions that don’t result in an immediate benefit, so the first item on this checklist can be quite a doozy. But it’s important. If you and your company don’t seriously commit to a disaster recovery plan, whatever you do to create one will be half-baked and pretty ineffective. One way to make disaster recovery a priority (and stick to the commitment) is to make it company policy and a part of company culture. The very top levels of your organization can—in writing—establish a policy statement that includes items from this checklist (all of them, some of them, or a version of these that speaks to your enterprise uniquely). At the very least, the statement should contain a written intention to create a disaster recovery plan; a commitment to a risk and impact assessment; a commitment to testing the plan; a written intention to update the disaster recovery plan on a regular basis (usually annually); and a clear method of making employees aware of the disaster recovery plan.
- Do an impact assessment. It’s important to examine, investigate, and establish the real-world consequences of a catastrophic failure (typically revenue lost) for your business. How much money will you lose for every hour your infrastructure is down? Don’t rely on industry estimates; each company is unique, and you don’t want to underestimate your potential losses. It’s also a good idea to determine the cost of infrastructure maintenance, renewals, upgrades, and operation. Once you have these two numbers, you can maintain a cost-to-impact ratio—how much you’re spending versus how much you could potentially lose—that will help guide your decisions when it comes to upgrading your infrastructure. You certainly don’t want to invest in protections that cost more than they’re worth. These assessments should be done regularly (at least annually), as priorities, workflow, costs, and risks are always changing.
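The cost-to-impact comparison is easier to keep honest when it’s written down as an actual calculation. The sketch below is purely illustrative; the `Safeguard` structure, the figures, and the names are assumptions for demonstration, not from any cited report:

```python
from dataclasses import dataclass

@dataclass
class Safeguard:
    name: str
    annual_cost: float           # maintenance + renewals + upgrades + operation
    hours_saved_per_year: float  # downtime you estimate this safeguard prevents

def cost_to_impact(safeguard: Safeguard, loss_per_hour: float) -> float:
    """A ratio below 1.0 means the safeguard costs less than the losses it prevents."""
    prevented_loss = safeguard.hours_saved_per_year * loss_per_hour
    return safeguard.annual_cost / prevented_loss

# Hypothetical figures for illustration only.
failover = Safeguard("automatic failover cluster",
                     annual_cost=40_000, hours_saved_per_year=10)
print(cost_to_impact(failover, loss_per_hour=100_000))  # prints 0.04
```

Run the same calculation for each safeguard you’re considering, and the ones that don’t protect more than they cost will stand out immediately.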
- Create (and consistently update) a weighted list of priorities. As a company, you need to inventory your equipment and systems and list them in order of how critical they are to your operations. If your phone system goes down—and it’s only used for inter-office communication—it probably isn’t mission-critical. Put it at the bottom of the list. However, if your website accounts for the majority of your business, you’d better move power supply, network services, and hosting clusters to the very top. It’s a good idea to draw a line between the absolute essentials (what you absolutely need to operate) and everything else. This way, you’ll know exactly what to prioritize in your disaster recovery plan—the goal is to keep your downtime as short as possible.
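A weighted inventory like this doesn’t need special tooling; a spreadsheet or a short script will do. In the sketch below, the system names, weights, and threshold are hypothetical examples of how that "line" between essentials and everything else might be drawn:

```python
# Criticality weight: higher means more essential to operations.
# Names and weights are illustrative; inventory your own systems.
systems = [
    ("inter-office phones", 2),
    ("power supply", 10),
    ("network services", 9),
    ("hosting clusters", 9),
    ("intranet wiki", 3),
]

ESSENTIAL_THRESHOLD = 8  # the "line" between absolute essentials and everything else

ranked = sorted(systems, key=lambda item: item[1], reverse=True)
essentials = [name for name, weight in ranked if weight >= ESSENTIAL_THRESHOLD]
print(essentials)  # ['power supply', 'network services', 'hosting clusters']
```

Whatever form it takes, the point is the same: when disaster strikes, the ordered list already exists and nobody has to debate priorities mid-crisis.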
- Establish a recovery team (with backups) and clearly communicate your employees’ responsibilities. It should go without saying that your employees’ safety is the top priority. Beyond that, though, they are going to be the ones who get you back up and running. Establish a team of employees to spearhead the recovery process—each with their own clearly defined role and set of responsibilities. Then, establish a second team of employees who can take on the same roles and perform the same functions as those on the first team. This is especially important in areas prone to natural disasters; roads could become impassable, homes could be damaged, etc. With two disaster recovery teams, it becomes far more probable that you’ll be able to assemble an entire recovery crew when the worst happens.
- Make documentation clarity, security, and dissemination a top priority. This is a tedious task that will likely require more than just engineers, but it’s integral: create a master document, secure that master document (ideally in multiple places, including an offsite location), and get it into the hands of all company employees. This document should include contact information for key employees (especially the recovery teams) and should be plainly and clearly written. It cannot be stressed enough that you should take great pains to make a document that is intuitive, written in plain language, well-organized, and absolutely clear on what steps need to be taken, when, and by whom. This could save your business. The more people capable of performing a task in the event that a predetermined employee cannot—thanks to your incredibly clear documentation—the better. It’s high-quality redundancy, but for people.
- Make sure your infrastructure is as secure as it can be (regularly) before a disaster strikes. What’s the best way to handle a catastrophic failure? Avert it. Yep, the best case scenario when disaster strikes is that all of your automated defenses (redundant nodes, automatic failovers, load balancers, cluster monitors, backup power supplies, network redundancies, etc.) do exactly what they are supposed to do and the failure is nothing more than a few lines of text on a log. However, that means being vigilant about testing your infrastructure, maintaining and upgrading your equipment, and being on the lookout for gaps in your protection.
- Test your backups. Perform restore testing. It’s quick, it’s easy. It’s a no-brainer. You want to know if your backups can actually do what they are supposed to do: restore data.
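In practice, a restore test can be as simple as restoring into a scratch directory and comparing checksums against the live data. The sketch below assumes a tar-based backup whose archive paths are relative to the backed-up directory’s parent; adapt it to whatever backup format and layout you actually use:

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def restore_test(backup_tar: Path, original_dir: Path) -> bool:
    """Restore a tar backup into a scratch directory and verify that every
    file in original_dir has an identical restored copy.

    Assumes the archive stores paths relative to original_dir's parent,
    e.g. created with: tar -C <parent> -cf backup.tar <dirname>
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(backup_tar) as tar:
            tar.extractall(scratch)
        for original in original_dir.rglob("*"):
            if original.is_file():
                restored = Path(scratch) / original.relative_to(original_dir.parent)
                if not restored.is_file() or sha256(restored) != sha256(original):
                    return False
    return True
```

Schedule a run like this after every backup rotation; a backup that has never been restored is, for planning purposes, a backup that doesn’t exist.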
- Maintain (and replace) your backups. Even the most well-engineered backup media will degrade, break, or otherwise fail. That means you should be rotating and replacing your backups. Make a schedule and stick to it. Again, at least annually.
- Make sure your backups are redundant and secure. Your backups should have backups and the security for those backups should be redundant. Are you storing physical backups onsite, in a fireproof container? Good, now make a backup of that backup and store it offsite in a fireproof environment.
- Perform an inventory of your safeguards. Do you know what kind of protections you actually have in place? Have you mapped your infrastructure? Knowing what’s in place and where it actually is will help you to more quickly diagnose, isolate, and remedy any problems.
- Shore up your safeguards. Performing an inventory of your protections will make it easier to identify and shore up any weakness or single points of failure in your infrastructure.
- Make employee training a (regular) priority. Remember that Ponemon Institute report about data center outages in 2016? No? Okay, let me remind you. The Ponemon Institute reported that 22% of all data center outages came down to human error. A well-trained (and oft-trained) workforce can be one of your best defenses against disaster. There’s nothing to recover from if that guy in HR never presses the wrong switch.
- Select an alternate location for operations in the event that yours becomes unusable. Have another location (even if it’s sub-optimal) to run operations during recovery.
- Create a step-by-step (maybe even minute-by-minute) schedule of tasks that must be completed in the wake of a catastrophic failure. You’ve prioritized your systems, you’ve created a recovery team—yes, close reader, two recovery teams—and you’ve got a temporary base of operations all lined up, now it’s time to create your plan of action. Don’t skimp on the details. Create a thorough (to the point of tedium) step-by-step disaster recovery plan. Assign individual team members and employees to individual tasks, create time-to-completion estimations/expectations down to the minute, and develop a clear order of operations. Step 1. Do this. Step 2. Do this. You get the idea.
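Keeping the schedule itself as structured data makes it easy to update, print, and hand to both recovery teams. The tasks, owners, and timings below are hypothetical examples of the level of detail to aim for:

```python
from dataclasses import dataclass

@dataclass
class RecoveryStep:
    order: int
    task: str
    owner: str         # primary-team member responsible
    backup_owner: str  # second-team member who can step in
    minutes: int       # time-to-completion estimate

# Illustrative plan; your tasks, owners, and estimates will differ.
plan = [
    RecoveryStep(1, "Confirm employee safety and assemble recovery team",
                 "Ops lead", "Deputy ops lead", 30),
    RecoveryStep(2, "Restore power / switch to backup supply",
                 "Facilities", "Ops lead", 20),
    RecoveryStep(3, "Bring hosting clusters online from backups",
                 "Sysadmin", "Network engineer", 45),
]

for step in sorted(plan, key=lambda s: s.order):
    print(f"Step {step.order}: {step.task} ({step.owner}, {step.minutes} min)")
total = sum(step.minutes for step in plan)
print(f"Estimated time to recovery: {total} minutes")
```

The summed estimate doubles as your target recovery time; when you practice the plan (next step), compare the stopwatch against it and revise whichever is wrong.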
- Practice your disaster recovery plan, regularly. Run through your plan. Practice it. Observe the results. Time them. Tweak your plan if you have to. It might seem a bit strange to do this—like some sort of IT war game—but it could save you thousands, if not hundreds of thousands, in revenue. The quicker you recover, the less damage (to workflow, revenue, etc.) you’ll take.
- Seek an expert opinion. Bring in a consultant to take a look at your disaster recovery plan. They might pick up on something you overlooked or might have ideas on how to cut down on costs.
- Make sure you revisit, reprioritize, renew, and reinvest. A disaster recovery plan is not static. It’s a living document that has to be changed in response to real-world variables. The only way you can change it is by revisiting it, so make sure you’re doing that regularly and consistently (again, at least annually). Revisiting your disaster recovery plan gives you the opportunity to adjust your priorities and make changes that could shore up your plan, save you money, or both. It’s also important to keep reinvesting not only in your plan (extra training and resources are an investment) but also in your safeguards. What you don’t spend now you could end up spending, and then some, in the event that a catastrophic failure occurs.