High availability describes a system that is designed to be fault-tolerant and highly dependable, that operates continuously without intervention, and that has no single point of failure. These systems are highly sought after to increase the availability and uptime required to keep an infrastructure running without issue. The following characteristics define a High Availability system.
High Availability Clustering
High-availability server clusters (aka HA Clusters) are defined as groups of servers that support applications or services which can be utilized reliably with a minimal amount of downtime. These server clusters function using a type of specialized software that utilizes redundancy to achieve mission-critical levels of five 9’s uptime. Currently, approximately 60% of businesses require five 9’s or greater to provide vital services for their businesses.
High availability software capitalizes on the redundant software installed on multiple systems by clustering together a group of servers that focus on a common goal in case components fail. Without this form of clustering, if the application or website crashes, the service will not be available until the servers are repaired. HA clustering addresses these situations by detecting the faults and quickly restarting or replacing the failed server or service with a new process, without requiring human intervention. This is defined as a “failover” model.
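The failover model can be sketched in a few lines of Python. This is a minimal illustration and not how any particular clustering product works. The `Node` class, its `healthy` flag, and the `pick_active` function are hypothetical names introduced here, and the flag stands in for a real health check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical cluster member; `healthy` stands in for a real health check."""
    name: str
    healthy: bool = True


def pick_active(nodes: list[Node], current: Optional[Node]) -> Optional[Node]:
    """Keep the current node while it is healthy; otherwise fail over
    to the first healthy node in the list, with no human intervention."""
    if current is not None and current.healthy:
        return current
    for node in nodes:
        if node.healthy:
            return node
    return None  # no healthy node left: the service is down


primary, standby = Node("node-a"), Node("node-b")
cluster = [primary, standby]

active = pick_active(cluster, None)    # node-a serves traffic
primary.healthy = False                # simulate a crash on node-a
active = pick_active(cluster, active)  # the service moves to node-b
print(active.name)                     # node-b
```

A real cluster manager would run this decision continuously and would also have to start the service on the new node, which the sketch leaves out.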
The illustration below demonstrates a simple two node high availability cluster.
High Availability clusters are often used for mission-critical databases, data sharing, applications, and e-commerce websites spread over a network. High Availability implementations build redundancy within a cluster to remove any one single point of failure, including across multiple network connections and data storage, which can be connected redundantly via geographically diverse storage area networks.
High Availability clustered servers usually use a monitoring methodology called Heartbeat to track each node’s status and health within the cluster over a private network connection. One critical circumstance all clustering software must be able to address is called split-brain, which occurs when all private internal links go down simultaneously, but the nodes in the cluster continue to run. If this occurs, every node within the cluster may incorrectly determine that all the other nodes have gone down and attempt to start services that other nodes may still be running. The resulting duplicate instances running the same services could cause data corruption on the system.
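To make the heartbeat and split-brain ideas concrete, here is a minimal sketch of a heartbeat timeout combined with a majority (quorum) check. Quorum is one common guard against split-brain and is an assumption added for illustration; the text above does not name it. The `HeartbeatMonitor` class, its method names, and the 3-second timeout are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer (hypothetical helper)."""

    def __init__(self, peers, timeout_s=3.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        self.last_seen = {peer: now for peer in peers}

    def record(self, peer):
        """Call whenever a heartbeat message arrives from `peer`."""
        self.last_seen[peer] = self.clock()

    def reachable_peers(self):
        """Peers whose most recent heartbeat is within the timeout."""
        now = self.clock()
        return [p for p, t in self.last_seen.items() if now - t <= self.timeout_s]

    def has_quorum(self):
        """True if this node plus the peers it can reach form a strict
        majority of the cluster. A node without quorum should not start
        services, which is what prevents duplicate instances."""
        cluster_size = len(self.last_seen) + 1  # peers plus this node
        visible = len(self.reachable_peers()) + 1
        return visible > cluster_size // 2


# Usage with a fake clock so the example does not need to sleep.
fake_now = [0.0]
monitor = HeartbeatMonitor(["node-b", "node-c"], clock=lambda: fake_now[0])
fake_now[0] = 5.0         # the private links go quiet for 5 seconds
monitor.record("node-b")  # only node-b is heard from again
print(monitor.reachable_peers(), monitor.has_quorum())
```

In this three-node example the local node still sees two of three members, so it keeps quorum. In a two-node cluster neither side can form a majority once the link fails, so setups of that size generally need another tie-breaking mechanism, such as fencing.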
A typical version of high availability software provides attributes that include both hardware and software redundancy. These features include:
- The automatic detection and discovery of hardware and software components.
- Autonomous assignment of both active and contingent roles to new elements.
- Detection of failed software services, hardware components, and other system constructs.
- Monitoring of redundant components and notification when they need to be activated.
- Ability to scale the cluster to accommodate the required changes without external intervention.
Fault tolerance is defined as the ability of a system’s infrastructure to foresee and withstand errors and provide an automatic response to those issues if encountered. The primary quality of these systems is advanced design factors, which can be called upon should a problem occur. Configuring an infrastructure that anticipates every possible failure is a considerable task that requires the knowledge and experience to counter multiple concerns before they occur. System architects who design such frameworks need both the methodologies to alleviate these problems in advance and the ability to implement them.
The following redundancy methodologies are available and should be reviewed during the initial stages of design and implementation.
- N + 1 Model - This concept takes the sum of equipment needed (which we will refer to as ‘N’) to keep the entire framework up and running, and adds one independent backup component that can take over if any one of the ‘N’ components fails.
- N + 2 Model - Similar to the N + 1 model but with an additional layer of protection if two components should fail.
- 2N Model - This modality has a dedicated redundant backup for each element (every component is duplicated) to ensure the system’s framework is fully functional.
- 2N + 1 Model - Again, this model is similar to the 2N model but with a supplemental component to add a tertiary layer of protection to the system’s framework.
As models progress from N + x to 2N + x, the cost also increases substantially. For truly redundant systems that require uptime, however, these modalities are critical for stability and availability.
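The difference between the models is easiest to see as a component count. The sketch below uses the conventional reading, in which N + 1 adds one spare in total and 2N duplicates every component. The function name `components_required` and the example value of four components are illustrative assumptions.

```python
def components_required(n: int, model: str) -> int:
    """Total components to provision when `n` components are needed
    to carry the full load, under the named redundancy model."""
    totals = {
        "N": n,               # no redundancy
        "N+1": n + 1,         # one shared spare
        "N+2": n + 2,         # two shared spares
        "2N": 2 * n,          # a dedicated backup for every component
        "2N+1": 2 * n + 1,    # full duplication plus one extra spare
    }
    return totals[model]


for model in ("N", "N+1", "N+2", "2N", "2N+1"):
    total = components_required(4, model)
    print(f"{model:>5}: {total} components, {total - 4} spare")
```

With four required components, the spare count goes from 1 (N + 1) to 5 (2N + 1), which is where the added cost comes from.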
Dependability and Reliability
One of the central tenets of a high availability system is uptime. Uptime is of premier importance, especially if the purpose of a system is to provide an essential service like the 911 systems that respond to emergency situations. In business, having a high availability system is required to ensure a vital service remains online. One example would be an ISP or other service that cannot tolerate a loss of function. These systems must be designed with high availability and fault tolerance to ensure reliability and availability while minimizing downtime.
Orchestrated Error Handling
Should an error occur, the system will adapt and compensate for the issue while remaining up and online. Building this type of system requires forethought and planning for the unexpected. Foreseeing problems in advance and planning for their resolution is one of the main qualities of a high availability system.
Should the system encounter an issue like a traffic spike or an increase in resource usage, the system’s ability to scale to meet those needs should be automatic and immediate. Building features like these into the system allows it to respond quickly to any change in how the architecture’s processes are functioning.
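One simple way to express automatic scaling is a proportional rule: grow or shrink the instance count by the ratio of observed utilization to a target utilization. The sketch below is an assumption-laden illustration, not a description of any specific autoscaler. The function name, the 60% target, and the minimum and maximum bounds are all hypothetical.

```python
import math


def desired_instances(current: int, utilization_pct: int, target_pct: int = 60,
                      minimum: int = 2, maximum: int = 20) -> int:
    """Scale the instance count in proportion to how far the observed
    utilization is from the target, clamped to fixed bounds."""
    if current < 1 or target_pct <= 0:
        raise ValueError("current must be >= 1 and target_pct must be > 0")
    wanted = math.ceil(current * utilization_pct / target_pct)
    return max(minimum, min(maximum, wanted))


spike = desired_instances(current=4, utilization_pct=90)  # traffic spike
quiet = desired_instances(current=4, utilization_pct=15)  # quiet period
print(spike, quiet)
```

Keeping a minimum of two instances in the quiet case preserves redundancy even when load alone would justify a single instance.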
Availability & Five 9’s Uptime
Five 9’s is a widely used industry measure of uptime. This measurement can be related to the system itself, the system processes within a framework, or the program operating inside an infrastructure. This estimation is often related to the program being delivered to clients in the form of a website or web application. A system’s availability can be measured as the percentage of time that the system is available by using this equation: x = (n - y) * 100 / n. In this formula, “n” is the total number of minutes within a calendar month, and “y” is the number of minutes that the service is inaccessible within that calendar month. The table below outlines downtime related to the percentage of “9’s” represented.
|Availability %|90%|99%|99.9%|99.99%|99.999%|
|Downtime/Year|36.53 days|3.65 days|8.77 hours|52.60 minutes|5.26 minutes|
As we can see, the higher the number of “9’s”, the more uptime is provided. A high availability system’s goal is to achieve a minimal amount of potential downtime to ensure the system is always available to provide the designated services.
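The availability equation and the yearly downtime figures can be checked with a short calculation. This sketch assumes a year of 365.25 days, which is the value that reproduces the downtime figures quoted above. The function names are illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes (assumed year length)


def availability_percent(total_minutes: float, down_minutes: float) -> float:
    """x = (n - y) * 100 / n, with n = total minutes and y = minutes down."""
    return (total_minutes - down_minutes) * 100 / total_minutes


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100


# A 30-day month has 43,200 minutes; suppose 22 minutes of outage.
print(f"{availability_percent(43_200, 22):.3f}%")

for pct in (90, 99, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):,.2f} minutes/year")
```

In the monthly example, 22 minutes of outage yields roughly 99.949%, which falls short of four 9’s even though it sounds small.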
One of the main High Availability components is called Heartbeat. Heartbeat is a daemon that works with cluster management software like Pacemaker, which is designed specifically for high-availability cluster resource management. Its most important characteristics are:
- No specific or fixed maximum number of nodes - Heartbeat can be used to build large clusters as well as elementary ones.
- Resource monitoring: resources can be automatically restarted or moved to another node on failure.
- A fencing mechanism needed to remove failed nodes from the cluster.
- A refined policy-based resource management, resource inter-dependencies, and constraints.
- A time-based rule set to allow for different policies depending on a defined timeframe.
- A group of resource scripts (for software like Apache, DB2, Oracle, PostgreSQL, etc.) included for more granular management.
- A GUI for configuring, controlling and monitoring resources and nodes.
The first segment of a highly available system is the clearly designed utilization of clustered application servers that are engineered in advance to distribute load amongst the whole cluster, which includes the ability to failover to a secondary and possibly a tertiary system.
The second division includes the need for database scalability. This entails scaling either horizontally or vertically, using multi-master replication and a load balancer to improve the stability and uptime of the database.
The third characteristic is geographic diversity. This ensures that should a natural disaster strike a single location, that failure will not hinder the ability to provide the service.
The fourth and possibly most important component is to provide a backup replication and disaster recovery methodology. The ability to ensure a working backup helps guarantee that our data is safe. The latest backup strategy (3-2-3) states that you should have three copies of your data, on two different media types, in three geographically diverse offsite locations for disaster recovery.
Uncomplicated deployments should be mapped to your specific business requirements. The following traits will benefit our operational framework regardless of industry vertical:
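The 3-2-3 rule is simple enough to verify mechanically against a backup inventory. The sketch below is one possible reading of the rule: at least three copies, at least two media types, and at least three distinct offsite locations. The `BackupCopy` type, the function name, and all media and location values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One stored copy of the data (all field values here are made up)."""
    media: str      # e.g. "disk", "tape", "object-storage"
    location: str   # e.g. a data center or region name
    offsite: bool


def meets_3_2_3(copies: list[BackupCopy]) -> bool:
    """At least three copies, two media types, three distinct offsite locations."""
    media_types = {c.media for c in copies}
    offsite_locations = {c.location for c in copies if c.offsite}
    return (len(copies) >= 3
            and len(media_types) >= 2
            and len(offsite_locations) >= 3)


plan = [
    BackupCopy("disk", "us-east", offsite=True),
    BackupCopy("tape", "us-west", offsite=True),
    BackupCopy("object-storage", "eu-central", offsite=True),
]
print(meets_3_2_3(plan))  # True
```

A check like this only confirms that copies exist in the right places. Whether each copy can actually be restored still has to be tested separately.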
- Modest Training Requirements
- Increased Productivity
- Extended Life Cycle
- Cost Effectiveness
- Operational Efficiency
- Rapid Implementation
- Reduced Security Risks
- Straightforward Integration
- Simplified Management
These features define many of the primary aspects needed to ensure a highly reliable, fault-tolerant clustering solution. High availability, at its core, should be designed with these characteristics in mind. Capabilities such as these are key assets to require when evaluating deployment options.
Best Practices Objectives
The primary goal of any high availability best practice objective is the optimal design, installation, deployment, integration, and adherence to a standard convention at the lowest reasonable cost and the minimal complexity while achieving the stated benchmarked targets of eliminating every single point of failure in the system.
First, a determined goal should be defined prior to the design of the system. This includes establishing what the Recovery Point Objective (RPO) is. The RPO is the largest amount of data, measured in time, that your company is willing to lose during a major outage (the tolerable amount of downtime is a separate target, the Recovery Time Objective, or RTO). The HA hardware, software, and adjunct services should all have a defined and tested RPO.
Next, the system should be built with the most robust, cost-effective hardware available. This includes systems that are resilient to power outages and hardware failures, spanning everything from hard disks and networking components to the operating system and the application itself, encompassing the whole software stack.
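An RPO is conventionally expressed as a time window: the age of the most recent recoverable copy of the data when the outage begins. The following sketch checks a single outage against that window. The function name, the timestamps, and the 15-minute RPO are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def rpo_met(last_good_backup: datetime, outage_start: datetime,
            rpo: timedelta) -> bool:
    """True if the data written since the last good backup (or replica
    sync) spans no more time than the RPO allows."""
    exposure = outage_start - last_good_backup
    return timedelta(0) <= exposure <= rpo


outage = datetime(2024, 1, 10, 12, 0)
rpo = timedelta(minutes=15)

print(rpo_met(datetime(2024, 1, 10, 11, 50), outage, rpo))  # True: 10 minutes of data at risk
print(rpo_met(datetime(2024, 1, 10, 9, 0), outage, rpo))    # False: 3 hours of data at risk
```

Running a check like this against real backup and replication timestamps is one way to turn a stated RPO into something that is regularly tested.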
Evaluation & Testing
Once the system is built, an integral linchpin is testing our target system to ensure the failover system is ready to switch over if the source fails. This requires preparing our network configurations, servers, real-time synchronous replication software, and switches so that processing can transition from the source production system to the target system at a moment’s notice. The method used in this scenario is known as a “hot standby” system. Additionally, this includes setting up a regimented testing schedule so that the system is retested regularly.
Ensuring a reproducible and repeatable iteration of the entire software stack across multiple regions is key to constant durability, deliverability, and soundness of the application framework. The other significant service area is the replicable hardware segment, which complements the software and monitoring frameworks. Being able to rely on a dedicated duplication methodology is fundamental to guaranteeing a fully fault-tolerant and reliable system.
Monitoring & Tracking
Lastly, ongoing monitoring, evaluation, and observation should be tightly regulated to ensure performance goals are met. Any deviation from the norm should be investigated and assessed to determine the impact the variance has on the system. Once that disposition has been established, a follow-up analysis should be performed as to whether any changes should be enacted to include the adjustment or alterations needed to bring the system into a new stable state.
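A "deviation from the norm" has to be defined before it can be investigated. One simple definition, used here purely as an assumption, is a measurement more than three standard deviations from the mean of a recent baseline. The function name, the threshold, and the sample response times are illustrative.

```python
from statistics import mean, stdev


def deviates(baseline: list[float], sample: float, threshold: float = 3.0) -> bool:
    """Flag `sample` if it lies more than `threshold` standard deviations
    from the mean of the baseline measurements."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline measurements")
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return sample != baseline[0]
    return abs(sample - mean(baseline)) / spread > threshold


response_ms = [120, 118, 125, 122, 119, 121, 123]
print(deviates(response_ms, 124))  # False: within normal variation
print(deviates(response_ms, 190))  # True: worth investigating
```

Production monitoring tools generally use more robust methods, but the principle of comparing new measurements against an established baseline is the same.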
The primary goal of a high availability system is to prevent and eliminate all single points of failure. This should include multiple action plans that have been tested and are in place, ready to react independently and immediately to any and all service disturbances, disruptions, and failures. This includes hardware, software, and application irregularities. The eradication of downtime can be accomplished with composed, skilled planning and implementation of a system. A critical eye is required to envision and prepare for any occurrence or disaster that could impede the primary objective of the stated and expected uptime goal. A well-instituted High Availability system can achieve this target with proper planning and design, reducing or eliminating disruptions and maximizing availability.
Careful Planning + Reliable Implementation Methodologies + Stable Software Platforms + Sound Hardware Infrastructure + Smooth Technical Operations + Prudent Management Goals + Consistent Data Security + Predictable Redundancy Systems + Robust Backup Solutions + Multiple Recovery Options = 100% Uptime