Site Reliability Engineering (SRE)

According to Wikipedia, SRE is a discipline that incorporates aspects of software engineering and applies them to IT operations problems. Its main goals are to create ultra-scalable and highly reliable software systems.

Many people are still unsure what a Site Reliability Engineer is, despite all the information available online. In this post I will try to explain in simple terms what an SRE is, along with the challenges and advantages of having one on your team.

Let's start with the traditional approach to system management.

In the old days, to manage your system you would hire a System Administrator (SysAdmin). That person would be responsible for configuring your system, its upkeep, and its overall reliability. However, you would still need Developers to assemble and deploy software components.

This approach is functional, but not without flaws.

It caused division and conflict between Developers and SysAdmins: developers want to ship changes quickly, while SysAdmins want to keep the system stable. Ben Treynor of Google saw these concerns and invented the idea of “Site Reliability Engineering.” By merging work that had been done by two different departments, he put them in the “same boat”.

As a result, all efforts were pointed in the same direction and the SRE position emerged. It became possible to effectively eliminate manual intervention through automation, which in turn makes systems more reliable. It’s almost as if an SRE’s job is to automate themselves out of a job.

The SRE role is still forming and evolving. Eventually you will reach the point where your system is automated to the limit. Because of that, in the near future the SRE role may merge with the AIOps role and expand to include additional skills such as observability.

At Pacto Systems we adopted the SRE model, and with it we managed to reduce our costs by automating regular tasks and creating reusable building blocks, all in the same way we build software: using source control, CI/CD, etc.