Understanding Disaster Recovery
As an IT consultancy, we frequently receive requests for the development of disaster recovery (DR) solutions. Many clients come to us with a specific design or product in mind that they believe will meet their DR requirements. The only problem: they very often have not been properly advised on how to determine the correct fit. Their design or product might be right for the business, but it might not be. This is where we step in to help.
Definitions to Understand
First, we need to get on the same page for some basic technology definitions (I’ll cover even more terms in a bit):
Backup (noun) – A partial or full copy of your data in a restorable format.
High Availability (HA) – A system that incorporates technology that allows whole or single components to fail and be restored quickly on other resources, thereby maintaining high uptime.
Disaster Recovery (DR) – The organization of a system (or systems) that provides for the continued operation of a business entity in the event of a major outage from natural disasters, water or fire damage, power issues or other big, business-stopping issues.
Now that we have some common terminology, we can discuss how these terms interact. First and foremost, every business should have some protection for its data. This is usually implemented through backup software and a destination for the copied data (tape, external drives, cloud provider, etc.). This service is the most critical for business data (it comes before HA and DR), which is why we ask “What are you doing for backups?” at the beginning of every HA/DR engagement. It is the first step in a DR design and may be implemented in different ways, depending on requirements. We just need to verify the basics are covered before moving forward.
Defining Business Requirements
So, you have backups all set up and working. Awesome. However, before we can dive into HA and DR solutions, we need to establish your business requirements. We do this by working with multiple parts of the business (finance, IT, operations and others) to identify key Line of Business (LOB) systems, personnel, locations and the actual impact of downtime. This impact is usually defined as “revenue loss per hour” but can be measured in other ways as necessary. Manufacturing and retail operations can more easily identify this loss, but it can (and should) still be determined for every business. We start here to determine which solutions (HA, DR or both) make the most sense to develop.
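To make the “revenue loss per hour” measure concrete, here is a minimal sketch of the arithmetic. The function name and the dollar figures are hypothetical and exist only for illustration; a real assessment would use numbers supplied by finance and operations.

```python
def downtime_cost(revenue_loss_per_hour: float, hours_down: float) -> float:
    """Estimate the revenue lost during an outage of the given length."""
    if revenue_loss_per_hour < 0 or hours_down < 0:
        raise ValueError("inputs must be non-negative")
    return revenue_loss_per_hour * hours_down


# Hypothetical figures: $2,500 lost per hour, one eight-hour outage.
example_cost = downtime_cost(revenue_loss_per_hour=2_500.0, hours_down=8.0)
print(f"Estimated loss: ${example_cost:,.2f}")
```

Even a rough number like this gives the business something to weigh against the cost of an HA or DR design.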
Once the key LOB systems, personnel, locations and downtime impact have been identified, we must establish the amounts of downtime and data loss that are acceptable. There are two key terms we need to understand:
- RTO – Recovery Time Objective
- RPO – Recovery Point Objective
The RTO value defines the amount of time that is acceptable for the business to be off-line. In basic terms, this would be the gap of time from the start of an outage (flooded server closet, for example) to the point where the servers are up and running. This is usually measured in hours or days, but it depends on the client and type of business. If a requirement has been set for the business to be up and running in five business days and the related design supports this, the RTO value is considered five days.
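As a rough illustration, an outage can be checked against an RTO by comparing timestamps. This sketch assumes the RTO is expressed in elapsed calendar time (not business days), and the function names and dates are hypothetical.

```python
from datetime import datetime, timedelta


def outage_duration(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Time from the start of the outage until servers are up and running."""
    if service_restored < outage_start:
        raise ValueError("service_restored must not be earlier than outage_start")
    return service_restored - outage_start


def meets_rto(outage_start: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """True when the outage was resolved within the agreed RTO."""
    return outage_duration(outage_start, service_restored) <= rto


# Hypothetical outage: server closet floods Monday morning, restored Thursday afternoon.
start = datetime(2024, 3, 4, 9, 0)
restored = datetime(2024, 3, 7, 15, 0)
print(meets_rto(start, restored, rto=timedelta(days=5)))
```

If the requirement is stated in business days, the comparison would need a business calendar, which this sketch deliberately leaves out.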
The RPO value defines the amount of data that is acceptable to lose. In normal server environments, you might have your key LOB applications performing a backup of the data throughout the day at a set interval, say every four hours. This would (loosely) mean that if there was an outage sometime in the day, there would be at most four hours of lost data. This would make the RPO value equal to four hours. If other key systems have a different backup interval, the RPO value equals the longest backup interval among them.
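The “longest interval wins” rule can be sketched in a few lines. The system names and intervals below are hypothetical, and the sketch assumes each system's worst-case data loss equals its backup interval, as in the loose reading above.

```python
from datetime import timedelta


def effective_rpo(backup_intervals: dict[str, timedelta]) -> timedelta:
    """The overall RPO is driven by the system that is backed up least often."""
    if not backup_intervals:
        raise ValueError("at least one system is required")
    return max(backup_intervals.values())


# Hypothetical key LOB systems and their backup intervals.
intervals = {
    "ERP database": timedelta(hours=4),
    "File server": timedelta(hours=12),
    "Email": timedelta(hours=1),
}
print(effective_rpo(intervals))
```

In this example the file server sets the RPO at twelve hours, even though the other systems are backed up far more often.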
These values are usually defined (after some instruction) by a business’s upper management and/or ownership through heavy discussion. We find that these numbers are often in conflict with what IT/IS departments are capable of handling. This is not unusual, which is why we go through the exercise above. Without a solid understanding of the aforementioned values (read: business requirements), it is unlikely that an appropriate structure (and budget) has been designed to meet business needs.
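One simple way to surface that conflict is to put the required values next to what the current environment can deliver. This is only a sketch: the `RecoveryTargets` class, the `find_gaps` function and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rto: timedelta  # acceptable downtime
    rpo: timedelta  # acceptable data loss


def find_gaps(required: RecoveryTargets, achievable: RecoveryTargets) -> list[str]:
    """List every objective where current capability falls short of the requirement."""
    gaps = []
    if achievable.rto > required.rto:
        gaps.append(f"RTO: required {required.rto}, achievable {achievable.rto}")
    if achievable.rpo > required.rpo:
        gaps.append(f"RPO: required {required.rpo}, achievable {achievable.rpo}")
    return gaps


# Hypothetical values: management wants 8h/1h, IT can currently deliver 2 days/4h.
required = RecoveryTargets(rto=timedelta(hours=8), rpo=timedelta(hours=1))
achievable = RecoveryTargets(rto=timedelta(days=2), rpo=timedelta(hours=4))
for gap in find_gaps(required, achievable):
    print(gap)
```

An empty list means the existing design already meets the stated requirements; anything else is a conversation about budget or expectations.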
Getting into High Availability
With the key LOB applications, personnel, locations and RTO/RPO values carefully defined, we can start the process of designing a solid DR/HA plan that matches those requirements. The next step is to review the LOB applications and identify HA (high availability) solutions that fit within the requirements. Many products already have an HA component. For example:
- Microsoft SQL Server has Windows Server Failover Clustering and Always On Availability Groups
- Microsoft Exchange has Database Availability Groups (DAGs)
- Microsoft SharePoint has a farm of servers
- VMware has an HA component
…and so on.
Adding these HA components can greatly improve your RTO/RPO values. However, if they are implemented in the same location as your key infrastructure, the environment may still be at risk for downtime. This is a common issue with our on-premises clients. These clients will have HA services in place to protect uptime for key LOB applications, but if there is a site outage, everything goes down. This is what drives us to the final design element: DR.
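The single-site risk is easy to spot once the placement of each HA node is written down. The sketch below assumes a simple hand-maintained inventory that maps each system to the sites hosting its nodes; the function name, system names and site names are hypothetical.

```python
def single_site_systems(placements: dict[str, list[str]]) -> list[str]:
    """Return systems whose HA nodes all sit in one site (or that have no nodes listed)."""
    return sorted(
        name for name, sites in placements.items() if len(set(sites)) <= 1
    )


# Hypothetical inventory: where each HA node of a system is hosted.
placements = {
    "SQL cluster": ["HQ", "HQ"],
    "Exchange DAG": ["HQ", "Branch office"],
    "VMware HA cluster": ["HQ", "HQ", "HQ"],
}
print(single_site_systems(placements))
```

Every system this returns will go down with its site, no matter how many HA nodes it has.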
Building The Disaster Recovery Plan
With our backups in place and HA implemented where necessary, we can review our RTO/RPO values and create an appropriate DR strategy. This usually includes a remote location which might be a datacenter, secondary business location or cloud provider. I say ‘usually’ as some clients may have an RTO value (the time the business can be down) that is so high that they would find little value in the cost associated with a remote location. We ask those particular clients to declare this in their DR documentation.
For those clients that need remote locations, we evaluate the best fit by reviewing the location information we identified earlier in our DR/HA evaluation process. Choosing a secondary business location often makes the most sense as there is already an investment in connectivity (bandwidth) and the remote office(s) could have space available for hardware. It would make sense to capitalize on these operating expenses and gain a DR location.
For clients without secondary sites, we may review regional datacenter options if a great deal of remote hardware is needed or they have equipment already available. This hardware is racked in purchased rack space at the datacenter and configured to be used for DR failover from the main location. We have designed solutions where entire environments fail over to the datacenter, including LOB applications, phone and web services.
For clients considering a cloud provider, there are a few more terms we need to define:
IaaS – Infrastructure as a Service – An entire hosted environment that may mirror your on-premises locations. Can be completely configured and managed by the client. Very flexible, but usually the most expensive, especially for traffic leaving the cloud. Microsoft Azure and Amazon EC2 provide IaaS services.
PaaS – Platform as a Service – Allows clients to create, configure and run Web applications without having to manage the associated infrastructure. Amazon Elastic Beanstalk and Google App Engine are PaaS examples.
SaaS – Software as a Service – On-demand software that can be provided to clients on a subscription model. Microsoft Office 365 and Salesforce.com are examples.
Which design is incorporated depends on the business requirements, RTO/RPO values and OpEx comfort of the business. We choose the appropriate options through careful review of this information.
Wrapping Up Our Hard Work
Once all of the backup, HA and DR design and implementation work is complete, we update the client’s documentation to reflect the new services, note any deviations from requirements and lay out a DR “what will happen when…” document. From this point, our client has the appropriate understanding of their business needs, the solutions in place to support those requirements and the documentation necessary for ongoing support and audit fulfilment.
VP of Technical Services | Cyber Advisors Inc.