Availability is the ability of the loss or a reduction in accessibility of elements of a distributed system. By proper use of Data Integrity# and Authentication Exchange#, it could counter Denial of Service (DoS)#.
Some would discuss Availability in terms of 24/7/365, meaning the service will be up 24 hours a day, 7 days a week, 365 days a year. Nines could be used to refer Availability in percentage, such as four nines referring to 99.990%. However, due to the ever distributed networks, devices, and services, defect per million (DPM) is adopted instead to measure Availability.
Availability | Nines | DPM | Downtime Per Year |
---|---|---|---|
99.000% | Two Nines | 10,000 | 3 days, 15 hours, 36 minutes |
99.900% | Three Nines | 1,000 | 8 hours, 46 minutes |
99.990% | Four Nines | 100 | 53 minutes |
99.999% | Five Nines | 10 | 5 minutes |
A high availability service will prevent financial loss and productivity loss, reduces reactive support costs (the cost to fix when things break), and improves customer satisfaction and loyalty.
The availability of the service could be disrupted by operational errors, network equipment failures, software failures, and security holes. Examples of operational errors, which is typically human errors, are the result of poor change-management processes, and lack of training and documentation. Network equipment failures include hardware failures, power outages or service provider outages, overheating and backhoe (type of excavating equipment or digger). Causes of software failures could be software crashes, unsuccessful switch-overs, or latent code failures.
Design practices could improve the availability of a service including
- hardware redundancy to avoid single point of failure
- software availability such as Spanning Tree Protocol (STP) and Hot Standby Router Protocol (HSRP)
- network/server redundancy
- link/carrier availability#
- clean implementation/cable management
- backup power/temperature management
- network monitoring simply network design,
- change control management (testing before applying changes)
- training
- backup/automatic recover
- security posture and policies