The term “high availability” is defined by the Institute of Electrical and Electronics Engineers (IEEE) as: “Availability of resources in a computer system, in the wake of component failures in the system.”
A system can be called highly available if its applications and services remain available in the event of an error, without direct human intervention. This implies that a user experiences little or no interruption. High availability does not mean permanent availability, though.
Reliability and Availability
There are two strategies to uphold and guarantee a service.
- Use of highly reliable components with a low probability of downtime (error elimination): Errors can never be ruled out entirely, since they are unpredictable. In practice, reliability is quantified as Mean Time Between Failures (MTBF). A higher MTBF generally comes with higher investment and maintenance costs.
- Use of reliable components combined with automatic recovery: Errors are compensated for so that the system as a whole is still functional. Once an error has been detected, availability can be maintained by:
  - error correction: restarting processes or the whole (operating) system
  - error compensation: redundant components keep the service running
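To get a feeling for how the two strategies compare, here is a minimal sketch. It uses the common steady-state formula availability = MTBF / (MTBF + MTTR), where MTTR (mean time to repair) is a quantity I am introducing for illustration. All numbers are made-up assumptions, and the redundancy calculation assumes that the replicas fail independently of each other.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the share of time a component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas when one is enough."""
    return 1.0 - (1.0 - single) ** copies


# Strategy 1: a highly reliable component, repaired by hand (assumed 8 h).
reliable = availability(mtbf_hours=50_000, mttr_hours=8)

# Strategy 2, error correction: a cheaper component that fails ten times
# as often but is restarted automatically (assumed 3 minutes = 0.05 h).
auto_restart = availability(mtbf_hours=5_000, mttr_hours=0.05)

# Strategy 2, error compensation: two of the cheaper components with manual
# repair, where a single surviving replica keeps the service up.
redundant = redundant_availability(availability(5_000, 8), copies=2)

for name, value in [
    ("reliable", reliable),
    ("auto restart", auto_restart),
    ("redundant", redundant),
]:
    print(f"{name:>12}: {value:.6%}")
```

With these assumed figures, both recovery-based variants come out ahead of the single expensive component, which is the point of the second strategy: fast recovery can matter more than rare failure.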
Availability is the percentage of a predefined time period during which the system functions properly. It says nothing about how often the system experienced downtime. A system can therefore go down several times, but it must be restored quickly each time to meet its required availability. The ratio of uptime to total time is often expressed as a number of “nines”. The following table shows how availability translates into actual downtime per year and per week. Maintenance work also counts toward this downtime.
| Availability Class | Term | Availability in % | Downtime per Year | Downtime per Week |
|---|---|---|---|---|
| 2 | resilient | 99% | 3.7 days | 1.7 hours |
| 3 | available | 99.9% | 8.8 hours | 10.1 minutes |
| 4 | highly available | 99.99% | 52.6 minutes | 1.0 minute |
| 5 | error insensitive | 99.999% | 5.3 minutes | 6 seconds |
| 6 | error tolerant | 99.9999% | 32 seconds | 0.6 seconds |
| 7 | error resistant | 99.99999% | 3 seconds | 0.06 seconds |
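The downtime figures follow directly from the availability percentage: downtime = (1 - availability) × period. Here is a small sketch of that arithmetic. The function names are my own, and I assume a 365-day year, so the results can differ from the table in the last digit because of rounding.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Allowed downtime within a period for a given availability."""
    return (1.0 - availability_percent / 100.0) * period_seconds


def humanize(seconds: float) -> str:
    """Format a duration using the largest unit that fits."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.2f} seconds"


# Availability classes 2 to 7, i.e. two to seven "nines".
for nines in range(2, 8):
    percent = 100.0 - 10.0 ** (2 - nines)
    per_year = humanize(downtime_seconds(percent, SECONDS_PER_YEAR))
    per_week = humanize(downtime_seconds(percent, SECONDS_PER_WEEK))
    print(f"class {nines}: {percent:.5f}%  year: {per_year}  week: {per_week}")
```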
The Harvard Research Group (HRG) divides high availability into six classes in its Availability Environment Classification (AEC):
- Conventional (AEC-0): Function can be interrupted; data integrity is not essential.
- Highly Reliable (AEC-1): Function can be interrupted, but data integrity must be ensured.
- High Availability (AEC-2): Function may be interrupted only minimally and only within fixed times during the main operating hours.
- Fault Resilient (AEC-3): Function must be maintained continuously within fixed times during the main operating hours.
- Fault Tolerant (AEC-4): Function must be maintained continuously; 24/7 operation (24 hours a day, 7 days a week) must be ensured.
- Disaster Tolerant (AEC-5): Function must be available under all circumstances.
In enterprises, high availability is frequently defined within the framework of service level agreements (SLAs) and is an important criterion for evaluating IT services.
What About the Cloud?
The aforementioned strategies to uphold and guarantee a service are a little harder to implement in a cloud environment. Most, if not all, cloud providers do not disclose information about their data centers, such as their geographic location or tier classification. The service level agreements on offer refer to the platform as a whole but do not take the uptime of an individual host into consideration.
Zimory Public Cloud seeks to combine both reliability and availability. Data centers and connected cloud providers are distinguished by three levels of quality (gold, silver and bronze), which are reflected in binding service level agreements. In addition, individual resources can have further quality characteristics such as certifications, failover systems and guaranteed support levels.
The following definition for the classification of data centers is used:
- Tier 4: Has multiple active supply paths for power and air conditioning, has redundant components, is fault-tolerant and provides an availability of at least 99.995%.
- Tier 3: Has multiple supply paths for power and air conditioning, of which only one is active in normal operation; has redundant components, is concurrently maintainable and provides an availability of at least 99.982%.
- Tier 2: Has one path each for power and air conditioning, has redundant components and provides an availability of at least 99.741%.
- Tier 1: Has one path each for power and air conditioning, has no redundant components and provides an availability of at least 99.671%.
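To make the tier percentages more tangible, the following sketch converts them into hours of downtime per year and picks the lowest tier that still meets a required availability. The percentages are the ones from the list above; the function names and the 365-day year are my own assumptions.

```python
from typing import Optional

# Minimum availability per data center tier, taken from the list above.
TIER_AVAILABILITY_PERCENT = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def annual_downtime_hours(tier: int) -> float:
    """Maximum downtime per year that the tier's availability permits."""
    unavailable = (100.0 - TIER_AVAILABILITY_PERCENT[tier]) / 100.0
    return unavailable * HOURS_PER_YEAR


def lowest_sufficient_tier(required_percent: float) -> Optional[int]:
    """Return the lowest tier meeting the requirement, or None if none does."""
    for tier in sorted(TIER_AVAILABILITY_PERCENT):
        if TIER_AVAILABILITY_PERCENT[tier] >= required_percent:
            return tier
    return None


for tier in sorted(TIER_AVAILABILITY_PERCENT):
    print(f"Tier {tier}: up to {annual_downtime_hours(tier):.2f} hours per year")

print("99.9% requires at least tier", lowest_sufficient_tier(99.9))
```

Keep in mind that these figures describe the facility only. The availability of an application hosted there also depends on its own hardware and software.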
Zimory makes certifications and quality standards of the various data centers transparent and easily understandable. Users have the option to select higher-level certifications for specific applications and to choose less expensive services for other applications.
In addition to providing reliable hardware infrastructure, Zimory Public Cloud also allows multi-tier software architectures to be built. Multi-tier architectures are scalable, since the individual layers are logically separated. For example, in a distributed system architecture, the data layer runs on a central database server, the logic layer runs on a remote application server, and delivery is handled by a web server. In such an architecture, the individual components can be adapted to increasing load by replication. For example, if many users use the application, a clone of the application server can be created, which shares the requests with the first server. This clone operation can be triggered through the API of the Public Cloud with a rule-based trigger.
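The idea behind such a rule-based trigger can be sketched in a few lines. Note that this is not the actual Public Cloud API: `ScaleOutRule`, `apply_rule` and `fake_clone` are hypothetical names I made up for illustration, and the real clone call would replace the placeholder function.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScaleOutRule:
    """Hypothetical rule: keep the load per application server below a limit."""

    max_requests_per_server: float
    max_servers: int

    def clones_needed(self, requests_per_second: float, servers: int) -> int:
        """Return how many additional servers the rule asks for."""
        if servers < 1:
            raise ValueError("at least one server must already be running")
        wanted = math.ceil(requests_per_second / self.max_requests_per_server)
        target = min(wanted, self.max_servers)
        return max(0, target - servers)


def apply_rule(
    rule: ScaleOutRule,
    requests_per_second: float,
    servers: List[str],
    clone: Callable[[str], str],
) -> List[str]:
    """Clone the first server as often as the rule demands.

    `clone` stands in for the provider's API call: it takes the name of the
    source server and returns the name of the new clone.
    """
    extra = rule.clones_needed(requests_per_second, len(servers))
    return servers + [clone(servers[0]) for _ in range(extra)]


def fake_clone(source: str) -> str:
    """Placeholder for a real API call; it only invents a name."""
    return f"clone-of-{source}"


rule = ScaleOutRule(max_requests_per_server=100, max_servers=4)
pool = apply_rule(rule, requests_per_second=250, servers=["app-1"], clone=fake_clone)
print(pool)
```

Capping the pool with `max_servers` is a deliberate choice here: a rule that clones without an upper bound can turn a traffic spike into an unexpectedly large bill.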
Each user of the Public Cloud also has full control of the underlying network layer. All virtual machines of an individual user are deployed in their own exclusive VLAN, which allows clusters to be built, e.g. to implement high availability services. In a series of articles, I will demonstrate how high availability application and database clusters can be built in the cloud using open source projects such as Heartbeat, DRBD, HAProxy and MySQL. So stay tuned to find out how you can squeeze that extra “nine” into the uptime of your multi-tier application.