Site Reliability Engineering
How Google Runs Production Systems
Summary:
The book describes how Google keeps its large-scale systems reliable and scalable, covering principles and practices such as service level objectives (SLOs), automation, and incident management. It also explains the role of the Site Reliability Engineer (SRE) and how SREs balance system stability against the demands of new features and growth.
Key points:
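1. Risk Acceptance: The book argues for weighing the risk of failure against the cost of preventing it. It introduces the "error budget": the amount of unreliability a service is allowed to have, which is 1 minus its SLO target. While budget remains, a team can keep shipping features; once it is spent, reliability work takes priority.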
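
The error-budget idea can be made concrete with a little arithmetic. The sketch below is an illustration only, not code from the book: the function names, the request-based accounting, and the 99.9% target are all assumptions chosen for the example.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests allowed in a window, given an availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # The budget is the complement of the SLO, scaled to the traffic seen.
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = error_budget(slo_target, total_requests)
    if allowed == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 1.0 if failed == 0 else -1.0
    return 1.0 - failed / allowed


def can_release(slo_target: float, total_requests: int, failed: int) -> bool:
    """Assumed policy: feature releases continue only while budget remains."""
    return failed < error_budget(slo_target, total_requests)


if __name__ == "__main__":
    # One million requests in the window against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))
    print(can_release(0.999, 1_000_000, failed=400))
    print(can_release(0.999, 1_000_000, failed=1_000))
```

With a 99.9% target and one million requests, the budget works out to 1,000 failed requests. A real policy would usually measure over a rolling window and might throttle releases gradually instead of using a hard cutoff.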
Books similar to "Site Reliability Engineering":

Software Engineering at Google
Titus Winters|Tom Manshreck|Hyrum Wright

System Design Interview – An insider's guide
Alex Xu

Accelerate
Nicole Forsgren PhD|Jez Humble|Gene Kim

Outcomes Over Output
Josh Seiden

Meltdown
Chris Clearfield|András Tilcsik

Extreme Programming Explained
Kent Beck|Cynthia Andres

Modern Software Engineering
David Farley

The Pragmatic Programmer
David Thomas|Andrew Hunt

Working Effectively with Legacy Code
Michael Feathers

Making Websites Win
Karl Blanks|Ben Jesson