Site Reliability: SLI, SLO & SLA
Service Level Indicator(SLI), Service Level Object(SLO) & Service Level Agreement(SLA) are parameters with which reliability, availability and performance of the service are measured. The SLA, SLO, and SLI are related concepts though they’re different concepts.
It’s easy to get lost in a fog of acronyms, so before we dig in, here is a quick and easy definition:
- SLA or Service Level Agreement is a contract that the service provider promises customers on service availability, performance, etc.
- SLO or Service Level Objective is a goal that service provider wants to reach.
- SLI or Service Level Indicator is a measurement the service provider uses for the goal.
Service Level Indicator
SLI are the parameters which indicates the successful transactions, requests served by the service over the predefined intervals of time. These parameters allows to measure much required performance and availability of the service. Measuring these parameters also enables to improve them gradually.
Key Examples are:
- Availability/Uptime of the service.
- Number of successful transactions/requests.
- Consistency and durability of the data.
Service Level Objective
SLO defines the acceptable downtime of the service. For multiple components of the service, there can be different parameters which defines the acceptable downtime. It is common pattern to start with low SLO and gradually increase it.
Key Examples are:
- Durability of disks should be 99.9%.
- Availability of service should be 99.95%
- Service should successfully serve 99.999% requests/transactions.
Service Level Agreement
SLA defines the penalty that service provider should pay in an event of service unavailability for pre-defined period of time. Service provider should clearly define the failure factors for which they will be accountable(Domain of responsibility). It is common pattern to have loose SLA than SLO, for instance: SLA is 99% and SLO is 99.5%. If the service is overly available, then SLA/SLO can be used as error budget to deploy complex releases to production.
Key Examples of Penalty are:
- Partial refund of service subscription fee.
- Additional subscription time added for free.
So here is the relationship. The service provider needs to collect metrics based on SLI, define thresholds of metrics based on SLO, and monitor the thresholds of metrics so that it won’t break SLA. In practical, the SLIs are the metrics in the monitoring system; the SLOs are alerting rules, and the SLAs are the numbers of the monitoring metrics applying to the SLOs.
Usually the SLO and the SLA are similar while the SLO is tighter than the SLA. The SLOs are generally used for internal only, and the SLAs are for external. If a service availability violates the SLO, operations need to react quickly to avoid it breaking SLA, otherwise, the company might need to refund some money to customers.
The SLA, SLO, and SLI are based on such assumption that is the service will not be available 100%. Instead, we guarantee that the system will be available greater than a certain number, for example, 99.5%.
When we apply this definition to availability, for example, SLIs are the key measurements of the availability of a system; SLOs are goals we set for how much availability we expect out of a system; and SLAs are the legal contracts that explains what happens if our system doesn’t meet its SLO.
SLIs exist to help engineering teams make better decisions. Your SLO performance is critical information to have when you’re making decisions about how hard and fast you can push your systems. SLOs are also important data points for other engineers when they’re making assumptions about their dependencies on your service or system. Lastly, your larger organization should use your SLIs and SLOs to make informed decisions about investment levels and about balancing reliability work against engineering velocity.
Note this abstract is taken from SRE Fundamentals, CRE and the book Site Reliability Engineering: How Google Runs Production Systems