Site Reliability Engineering (SRE)

I’m a bit late to post on this blog in 2022 due to some personal exigencies. With three months of the year already gone, and given how widely the term Site Reliability Engineering has spread, I believe SRE is a good topic to start the year with. I’ll try to convey what I’ve learned about SRE from more than a decade as a system admin and another half a decade as an SRE.

According to Ben Treynor Sloss, the senior VP overseeing technical operations at Google who coined the term, SRE is

“what happens when a software engineer is tasked with what used to be called operations.”

In other words, Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. In summary, an SRE is a professional with a solid background in coding and automation who uses that experience to solve problems in infrastructure and operations.

If you think of DevOps as a philosophy and an approach to working, you can argue that SRE implements some of the philosophy that DevOps describes, and is somewhat closer to a concrete definition of a job or role than, say, “DevOps engineer.” So, in a way, we can say:

class SRE implements DevOps;

abstract class DevOps {
  // Reduce organization silos
  abstract reduceOrganizationSilos(): BetterCollaboration;

  // Accept failure as normal
  abstract acceptFailureAsNormal(): ReliabilityGoal;

  // Implement gradual change
  abstract implementGradualChange(): ErrorBudget;

  // Leverage tooling and automation
  abstract leverageAutomation(): LongTermValue;

  // Measure everything
  abstract measureEverything(): BetterObservability;
}

class SRE implements DevOps {
  // SRE supplies a concrete practice for each method above
}

I will explain more about SRE in this blog post, quoting from the Introduction of the SRE book [Site Reliability Engineering: How Google Runs Production Systems], written by Ben Treynor Sloss and edited by Betsy Beyer.

“Hope is not a strategy.”
-Traditional SRE saying
It is a truth universally acknowledged that systems do not run themselves. How, then, should a system — particularly a complex computing system that operates at a large scale — be run?

When we say “Hope is not a strategy” we mean that we need to apply best practices instead of just letting software and new features launch and trusting that they will be successful. We use it to call out anyone who is letting something happen (such as a launch or running a system) without applying the proper principles and best practices. The book lays out the principles, practices and management of Site Reliability Engineering in detail.

A site reliability engineer can be a generalist or a specialist. Depending on the individual’s skill set, organizations can engage an SRE in a number of general or specialist roles, such as educator, SLO guard, infra architect or incident response leader. Details about SLA, SLO and SLI can be found in a previous post here. SREs may contribute to the code base of a product or write development policies and procedures as and when needed.

Workflows, priorities and day-to-day operations for SREs vary from team to team. Even so, all SRE teams share a set of basic responsibilities for the services, products and platforms they support, and always adhere to the core responsibility for availability, latency, performance, monitoring, efficiency, change management, emergency response and capacity planning. As defined in the SRE book, Google caps operational work for SREs at 50% of their time; the remainder should be spent on coding and project work. They achieve this by reintegrating developers into on-call rotations, routing excess operational work to the product development team, and even reassigning bugs and tickets to development or engineering managers.
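The 50% cap on operational work is easy to check with a little arithmetic. The sketch below is only an illustration: the `WeeklyHours` shape, the function names and the sample numbers are assumptions made for this example, not something taken from the SRE book.

```typescript
// Hypothetical record of how one SRE spent a week (names are illustrative).
interface WeeklyHours {
  operational: number; // on-call, tickets, manual tasks
  project: number; // coding and engineering project work
}

// The 50% cap on operational work mentioned in the text.
const OPS_CAP = 0.5;

// Fraction of total time spent on operational work (0 when no hours are logged).
function operationalShare(hours: WeeklyHours): number {
  const total = hours.operational + hours.project;
  return total === 0 ? 0 : hours.operational / total;
}

// True when operational work exceeds the cap, which is the signal to route
// the excess back to the product development team.
function exceedsOpsCap(hours: WeeklyHours): boolean {
  return operationalShare(hours) > OPS_CAP;
}

const week: WeeklyHours = { operational: 26, project: 14 };
console.log(operationalShare(week).toFixed(2)); // 0.65
console.log(exceedsOpsCap(week)); // true
```

A team could run a check like this over its time-tracking data each quarter; where the hours come from is left open here.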

One of the key responsibilities of an SRE is to quantify confidence in the systems they maintain. Confidence can be measured by both past reliability and future reliability. Past reliability is captured by analysing historical monitoring data, and future reliability by predictions based on past system behavior. We will discuss the principles, practices and management of Site Reliability Engineering in later posts, which will follow shortly.

An SRE is responsible for all of these areas:

  • General system uptime
  • Systems performance
  • Latency
  • Incident and outage management
  • Systems and application monitoring
  • Change management
  • Capacity planning

In a nutshell, the Service Reliability Hierarchy is as follows:

[Figure: Service Reliability Hierarchy]

It’s easy to define what site reliability engineers do, but which skills SREs need to do their jobs is a more complicated question. As mentioned earlier, SRE skills vary widely from team to team depending on factors such as the types of systems managed and the types of reliability challenges faced. Even so, modern and aspiring SREs need a core set of standard skills that help them understand, manage and deploy complex distributed systems at a typical organization today.

Now we can look into the skill sets that an SRE should master:


Coding:

Coding is an essential skill to master for an SRE role. Depending on the role, understanding development and coding can go a long way. Because the day-to-day tasks of an SRE include automating processes and dealing with systems, knowing Bash, Python, YAML and Golang can help you in the long run.

Version Control Tools:

As an SRE working with code, you’ll be using Git or some other version control tool. So it makes sense to learn about version control tools, mainly distributed version control systems, and to have a good understanding of Git and GitHub.

Cloud Computing:

Cloud computing is one of the key skills that modern SREs can’t live without. Around 90% of businesses use the cloud in some form: private, public or hybrid. You cannot manage the reliability of a cloud platform if you don’t understand cloud architecture, cloud networking, data storage, observability and so on.

Distributed Computing:

Knowing how distributed computing works and understanding the concept of microservices are both significant advantages for an SRE. You’ll be handling large, distributed systems, so having some experience with these topics can really help you progress as an SRE.

Agile & DevOps:

As mentioned earlier, class SRE implements DevOps. Many would say that SRE is to DevOps what Scrum is to Agile. DevOps is not a role; it is more of a cultural aspect that can’t be assigned to one person but should be practiced as a team. “DevOps engineer” is most often just a title used to hire system admins. SREs focus more on system availability, observability and scaling. DevOps is a practice of bringing development and operations teams together, whereas Agile is an iterative approach that focuses on collaboration, customer feedback and small, rapid releases. DevOps focuses on constant testing and delivery, while the Agile process focuses on constant change. Automation is the key to DevOps, and doing DevOps requires tooling. Understanding these toolsets and the aforementioned cultural aspect of DevOps is essential for being an SRE.

Operating Systems:

A good understanding of operating systems, usually Linux or Windows, which are common in most organisations, will be helpful. In this Cloud & DevOps era, most public cloud management tools and the toolsets that are part of DevOps follow the conventions of the Linux CLI. Cloud native systems like Kubernetes and containers also follow the same CLI principles even if you run them in a Windows environment. So working with Linux or *NIX systems is an essential skill for any SRE, even if you come from a Windows background.

Understanding of Databases:

There are many types of NoSQL databases, and each has specific use cases where it excels. Compare and contrast them with relational databases like MySQL. This is an excellent time to dive into understanding what a data model is, why data models are necessary, and how the data model should inform your choice of database and your service architecture.

Cloud Native Applications:

Knowing cloud native applications is another skill to master as an SRE. You don’t have to know them in depth, but some knowledge areas can help you and your organization as you get on the road to becoming a successful SRE. Knowing what Docker is, having some idea of how containers work, and understanding how to run a secure application on Kubernetes are all skills to master as an SRE.


Networking:

In modern distributed environments at scale, networking plays a pivotal role. It is also often considered the culprit when something goes wrong. Even if the organization has separate network engineers or a connectivity team, SREs need an in-depth understanding of networking and of the protocols and topologies used in modern system design, so they know when the network is the root cause of an incident and how to resolve those issues efficiently and effectively.


Monitoring:

As mentioned earlier, monitoring is an integral part of the Service Reliability Hierarchy. Monitoring tools make your life easier when you’re an SRE. They give you a quick look into your system’s performance and the issues it is dealing with. Implementing these tools and getting insights from them is a primary goal of an SRE, so that the system experiences as little downtime as possible. Prometheus and Grafana are widely used monitoring solutions, so it makes sense to learn those.

CI/CD Pipelines:

It’s hard to address reliability problems that emerge from the source code or the deployment process if an SRE doesn’t have a good understanding of how CI/CD processes work and which tools are used in that area. Even though SREs don’t typically develop software, they must know how software is written and deployed. Most organizations today rely on a CI/CD pipeline for this, so this is also an essential skill for SREs.

Security Engineering/Response:

SREs who don’t understand security fundamentals are at risk of implementing solutions that are effective from a reliability standpoint but not really secure. Although SREs don’t own this domain, they need significant skills in it.

Incident Management:

SREs must know how incident response roles are structured and have to take the lead in organizing the incident response team, communicating with stakeholders and devising the best strategy to ensure rapid and effective incident resolution.

Problem Management:

As mentioned earlier in the Service Reliability Hierarchy, postmortem and root cause analysis are a must for reliability engineering. Knowing how to run a postmortem and derive an RCA is an important skill for an SRE to possess.


Communication:

As an SRE, you’ll need to report critical incidents that affect applications, and you’ll be working with software engineers and others. In all these situations, having effective, well-developed communication skills makes life much easier. Since it helps avoid miscommunication while reporting incidents, this is also a skill to master if you are on the path to becoming an SRE.

The list of SRE skills could go on indefinitely, but the ones mentioned here are the best skills to have if you want to transition to an SRE role or excel in your current role as an SRE.

I have worked as a system admin, architect and more, and what I enjoyed most was my tenure as an SRE and SRE lead. If you enjoy working on the backend and want to get closer to your system’s performance, reliability, and scalability, then an SRE role might just be perfect for you!

Site Reliability: SLI, SLO & SLA

Service Level Indicator (SLI), Service Level Objective (SLO) and Service Level Agreement (SLA) are parameters with which the reliability, availability and performance of a service are measured. SLA, SLO and SLI are related but distinct concepts.

It’s easy to get lost in a fog of acronyms, so before we dig in, here is a quick and easy definition:

  • SLA or Service Level Agreement is a contract in which the service provider promises customers a level of service availability, performance, etc.
  • SLO or Service Level Objective is a goal that the service provider wants to reach.
  • SLI or Service Level Indicator is a measurement the service provider uses to track that goal.

Service Level Indicator
SLIs are the parameters that indicate how many transactions or requests the service served successfully over predefined intervals of time. These parameters make it possible to measure the performance and availability of the service, and measuring them also makes it possible to improve them gradually.

Key Examples are:

  • Availability/Uptime of the service.
  • Number of successful transactions/requests.
  • Consistency and durability of the data.
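As a rough illustration of the second example, an availability SLI can be computed as the share of requests served successfully in a measurement window. This is a minimal sketch: the `RequestCounts` shape, the function name and the sample numbers are assumptions for the example, and treating an empty window as 100% is a choice made here, not a rule from the SRE book.

```typescript
// Hypothetical request counters for one measurement window.
interface RequestCounts {
  total: number;
  successful: number;
}

// Availability SLI as the percentage of requests served successfully.
// An empty window is reported as 100% because nothing failed (an assumption).
function availabilitySli(counts: RequestCounts): number {
  if (counts.total === 0) return 100;
  return (counts.successful / counts.total) * 100;
}

const lastHour: RequestCounts = { total: 200000, successful: 199900 };
console.log(availabilitySli(lastHour).toFixed(2) + "%"); // prints 99.95%
```

In a real setup the counters would come from the monitoring system rather than being hard-coded.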

Service Level Objective
The SLO defines the acceptable downtime of the service. When a service has multiple components, each can have its own parameters defining the acceptable downtime. It is a common pattern to start with a low SLO and gradually increase it.

Key Examples are:

  • Durability of disks should be 99.9%.
  • Availability of service should be 99.95%.
  • Service should successfully serve 99.999% requests/transactions.
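To make "acceptable downtime" concrete, an availability target can be converted into minutes of tolerated downtime. The sketch below assumes a 30-day window and treats each percentage purely as an availability target; the function and constant names are illustrative.

```typescript
// Assumed measurement window: 30 days = 43,200 minutes.
const MINUTES_PER_30_DAYS = 30 * 24 * 60;

// Minutes of downtime an availability target tolerates over the 30-day window.
function allowedDowntimeMinutes(sloPercent: number): number {
  return MINUTES_PER_30_DAYS * (1 - sloPercent / 100);
}

for (const slo of [99.5, 99.9, 99.95]) {
  console.log(`${slo}% -> ${allowedDowntimeMinutes(slo).toFixed(2)} min per 30 days`);
}
// 99.5% -> 216.00 min per 30 days
// 99.9% -> 43.20 min per 30 days
// 99.95% -> 21.60 min per 30 days
```

The numbers show why each extra "nine" is expensive: moving from 99.5% to 99.95% cuts the tolerated downtime from 216 minutes to under 22.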

Service Level Agreement
The SLA defines the penalty that the service provider should pay in the event of service unavailability for a pre-defined period of time. The service provider should clearly define the failure factors for which they will be accountable (the domain of responsibility). It is a common pattern to have a looser SLA than SLO, for instance an SLA of 99% and an SLO of 99.5%. If the service is more available than required, the unused margin against the SLA/SLO can be spent as an error budget to deploy complex releases to production.
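The error budget idea can be sketched for a request-based SLO as follows. The function names, the inputs and the "ship only while budget remains" policy are assumptions for illustration; real teams define their own budget policies.

```typescript
// Share of the error budget still unspent for a request-based SLO.
// 1 means untouched, 0 means fully spent, negative means overspent.
// sloPercent, total and failed are illustrative inputs, not a real API.
function errorBudgetRemaining(sloPercent: number, total: number, failed: number): number {
  if (sloPercent <= 0 || sloPercent >= 100 || total <= 0) {
    throw new RangeError("expected 0 < sloPercent < 100 and total > 0");
  }
  // Failures the SLO tolerates in this window, e.g. 0.5% of all requests.
  const allowedFailures = (total * (100 - sloPercent)) / 100;
  return 1 - failed / allowedFailures;
}

// Hypothetical policy: ship complex releases only while budget remains.
function canShipComplexRelease(sloPercent: number, total: number, failed: number): boolean {
  return errorBudgetRemaining(sloPercent, total, failed) > 0;
}

console.log(errorBudgetRemaining(99.5, 1000000, 2000).toFixed(2)); // 0.60
console.log(canShipComplexRelease(99.5, 1000000, 6000)); // false
```

With an SLO of 99.5% and one million requests, 5,000 failures are tolerated, so 2,000 failures leave 60% of the budget and 6,000 failures overspend it.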

Key Examples of Penalty are:

  • Partial refund of service subscription fee.
  • Additional subscription time added for free.

So here is the relationship. The service provider needs to collect metrics based on the SLIs, define thresholds for those metrics based on the SLOs, and monitor the thresholds so that the SLA is not broken. In practice, the SLIs are the metrics in the monitoring system, the SLOs are the alerting rules, and the SLAs are the committed numbers that those metrics must not fall below.

Usually the SLO and the SLA are similar, but the SLO is tighter than the SLA. SLOs are generally for internal use only, while SLAs are external. If service availability violates the SLO, operations need to react quickly to avoid breaking the SLA; otherwise, the company might need to refund some money to customers.
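The relationship between the three can be sketched as a simple classification of a measured SLI against the internal SLO and the external SLA. The status names and the function below are hypothetical; the thresholds reuse the 99% SLA and 99.5% SLO example from earlier.

```typescript
type ReliabilityStatus = "healthy" | "slo-breached" | "sla-breached";

// Compare a measured SLI (percent) against the internal SLO and the external SLA.
// Assumes the SLO is at least as strict as the SLA, as described in the text.
function classify(sliPercent: number, sloPercent: number, slaPercent: number): ReliabilityStatus {
  if (sloPercent < slaPercent) {
    throw new RangeError("the SLO should be at least as strict as the SLA");
  }
  if (sliPercent < slaPercent) return "sla-breached"; // penalties may apply
  if (sliPercent < sloPercent) return "slo-breached"; // react before the SLA breaks
  return "healthy";
}

console.log(classify(99.7, 99.5, 99)); // healthy
console.log(classify(99.2, 99.5, 99)); // slo-breached
console.log(classify(98.5, 99.5, 99)); // sla-breached
```

The middle state is the useful one: it is the early warning that gives operations time to act before customers are owed anything.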

SLA, SLO and SLI are all based on the assumption that the service will not be available 100% of the time. Instead, we guarantee that the system’s availability will stay above a certain number, for example 99.5%.

When we apply this definition to availability, for example, SLIs are the key measurements of the availability of a system; SLOs are goals we set for how much availability we expect out of a system; and SLAs are the legal contracts that explain what happens if our system doesn’t meet its SLO.

SLIs exist to help engineering teams make better decisions. Your SLO performance is critical information to have when you’re making decisions about how hard and fast you can push your systems. SLOs are also important data points for other engineers when they’re making assumptions about their dependencies on your service or system. Lastly, your larger organization should use your SLIs and SLOs to make informed decisions about investment levels and about balancing reliability work against engineering velocity.

Note: this abstract is taken from SRE Fundamentals, CRE and the book Site Reliability Engineering: How Google Runs Production Systems.