This article explains what availability is and how the two common ways of calculating it work. It then uses case studies to show how to choose between them.

What are unavailability and availability?

Software systems are built for users, so whether a software system is available is determined by its users. For a system that nobody uses, talking about availability is not meaningful.

From this, we derive the definition of “software unavailability”: unavailability that users can perceive.

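So, does planned downtime count as unavailability? By convention, planned downtime is generally not counted as unavailability, even though planned downtime that users perceive is, by the definition above, unavailable from their point of view.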

The opposite of unavailable is available. The ability of a software system to maintain an available state is what we call availability. This is the essence of availability.

In the industry, availability is calculated in two ways:

  1. Time-based availability calculation
  2. Availability calculation based on the success rate of units of work

Let’s discuss both approaches in detail.

Method 1: time-based availability calculation

The formula is: availability = system uptime / (system uptime + downtime)

Downtime here usually refers to unplanned downtime.
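As a minimal sketch of this formula, the Python snippet below computes time-based availability from an uptime and an unplanned-downtime duration. The function name and the 30-day window with 43.2 minutes of downtime are illustrative assumptions, not figures from a real system.

```python
from datetime import timedelta


def time_based_availability(uptime: timedelta, unplanned_downtime: timedelta) -> float:
    """availability = uptime / (uptime + downtime), with downtime = unplanned downtime."""
    total = uptime + unplanned_downtime
    if total <= timedelta(0):
        raise ValueError("the observation window must be longer than zero")
    # Dividing one timedelta by another yields a plain float ratio.
    return uptime / total


# Hypothetical numbers: a 30-day window with 43.2 minutes of unplanned downtime.
window = timedelta(days=30)
downtime = timedelta(minutes=43, seconds=12)
availability = time_based_availability(window - downtime, downtime)
print(f"time-based availability: {availability:.4%}")
```

Note that planned downtime never enters the calculation; under this method it is simply left out of the downtime term.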

Calculated this way, the availability metric is like the score on the scoreboard in a basketball game: we want more points, but the score cannot tell us how to get them.

Most articles in the industry also describe availability this way.

Method 2: availability calculation based on the success rate of units of work

This is the other way of calculating availability, described in Google SRE: availability = number of successful requests / total number of requests. This is the request success rate. However, not all services are request-driven; scheduled-task services, for example, are not.

So, let’s generalize the formula to: availability = number of successful units of work / (number of successful units of work + number of failed units of work).
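The generalized formula can be sketched in a few lines of Python. The function name, the choice to return `None` when there is nothing to count, and the sample numbers are assumptions made for illustration.

```python
from typing import Optional


def work_unit_availability(succeeded: int, failed: int) -> Optional[float]:
    """availability = successful units / (successful units + failed units)."""
    if succeeded < 0 or failed < 0:
        raise ValueError("counts must be non-negative")
    total = succeeded + failed
    if total == 0:
        # No units of work were observed, so the ratio is undefined.
        return None
    return succeeded / total


# Hypothetical request-driven service: 9,990 successful requests, 10 failed.
api_availability = work_unit_availability(9990, 10)

# Hypothetical scheduled-task service: 29 successful runs, 1 failed run.
job_availability = work_unit_availability(29, 1)

print(api_availability, job_availability)
```

The same function covers both kinds of service; what changes is how each owner defines a unit of work (a request, a job run, a processed message) and what counts as success.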

With this method, the owner of each service must actively think about how to measure the availability of the service they are responsible for.

We have also noticed a potential benefit of this method: developers are forced to think about what success means for a service. On teams whose product management is weak, developers will push the product side to give a clearer definition of “success.”

Which availability calculation method should you choose?

Dimensions to consider when choosing a calculation method

We choose between the two methods along the following two dimensions:

  1. Whether it guides practical work: can it really show us how to improve availability?
  2. Accuracy: can it pinpoint the availability of an individual internal service?

Choosing a calculation method through case studies

Examples make it easier to see the nature of availability calculation.

Let’s look at two cases.

Case 1: At 3 a.m., only 3 users are accessing our service, and we are releasing an internal service A, which causes multiple requests from these 3 users to fail. How should we calculate availability at this point?

With the time-based approach, planned downtime is not counted as unavailable time. For these users, however, our service is unavailable at 3 a.m.

So, in this case, the time-based approach gives no guidance for practical work, because it does not reflect the availability that users perceive. Nor can it pinpoint which service is unavailable.

It makes more sense here to use the approach based on the success rate of units of work.
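To make the contrast in Case 1 concrete, here is a small Python sketch that evaluates the same hour both ways. The request counts, the one-hour window, and the second service `service_b` are invented for illustration; only service A and the three users come from the case itself.

```python
from collections import Counter

# Hypothetical request log for the 3 a.m. hour: (service, succeeded).
requests = (
    [("service_a", False)] * 12   # requests that failed during the release of service A
    + [("service_a", True)] * 6   # requests to service A that still succeeded
    + [("service_b", True)] * 12  # an unaffected service, added for contrast
)

# Time-based view: the release is planned, so no unplanned downtime is recorded.
uptime_minutes, unplanned_downtime_minutes = 60, 0
time_based = uptime_minutes / (uptime_minutes + unplanned_downtime_minutes)

# Work-unit view: success rate per service.
succeeded = Counter(service for service, ok in requests if ok)
total = Counter(service for service, _ in requests)
per_service = {service: succeeded[service] / total[service] for service in total}

print(f"time-based: {time_based:.0%}")
for service, rate in per_service.items():
    print(f"{service}: {rate:.1%}")
```

Under these assumed numbers the time-based figure stays at 100%, while the per-service success rates show both that users were affected and that service A is the one to look at.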

Case 2: There is a problem with DNS, and users cannot reach our services at all: www.example.com/login (referred to as the login service) and www.example.com/push (referred to as the push service). How should we calculate availability at this point?

In this case, users’ requests never reach our services, so there are no units of work to count. How, then, can we calculate their success rate? The work-unit approach can still guide practical work, but here it is not accurate, so it is inappropriate for this case.

If the login service and the push service are covered by dial testing (synthetic probing from outside) and monitoring, we can know accurately that both services are in fact unavailable.
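A minimal sketch of such dial testing in Python follows. The `https://` scheme, the function names, the timeout, and the number of rounds are assumptions; a real setup would probe on a schedule from several locations. Because the probe sends its own requests, it still produces measurements when no user request can reach the service.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers successfully, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # DNS resolution failures surface as URLError and count as failed probes.
        return False


def probe_availability(
    urls: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
    rounds: int = 3,
) -> Dict[str, float]:
    """Probe each URL `rounds` times and return its probe success rate."""
    results = {}
    for url in urls:
        outcomes = [probe(url) for _ in range(rounds)]
        results[url] = sum(outcomes) / rounds
    return results


# Simulate the DNS outage with a stand-in probe so the example needs no network.
services = ["https://www.example.com/login", "https://www.example.com/push"]
during_outage = probe_availability(services, probe=lambda url: False)
print(during_outage)
```

Passing the probe in as a parameter keeps the calculation testable; dropping the `probe=` argument would use the real HTTP probe instead.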

Final conclusion

From these two examples, we can see that no single method can accurately calculate availability as users see it. In practice, therefore, we need to use both methods together.

Appendix:

The relationship between antifragility and availability

Both antifragility and availability describe the ability of a software system to maintain an available state, but their focus differs. Antifragility emphasizes taking the initiative to attack unavailable states. Availability, on the other hand, focuses on stating a fact.