Site Reliability Engineering
How Google Runs Production Systems
Summary:
The book describes how Google keeps its large-scale systems reliable and scalable, covering principles and practices such as service level objectives (SLOs), automation, and incident management. It also explains the role of a Site Reliability Engineer (SRE) and how SREs balance system stability against the demand for new features and growth.
Key points:
1. Risk Acceptance: The book argues for weighing the cost of failures against the cost of preventing them, and introduces the "error budget": the amount of unreliability a service is allowed under its SLO. The remaining budget helps teams decide when to prioritize reliability work over feature development.
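The error budget idea can be illustrated with a little arithmetic. The sketch below is not taken from the book. It assumes a request-based SLO, and the function names, the example figures, and the "prioritize reliability once the budget is spent" rule are hypothetical choices made for illustration.

```python
from fractions import Fraction
import math


def allowed_failures(slo_target: str, total_requests: int) -> int:
    """Failed requests permitted in a window: (1 - SLO) * total requests.

    The SLO is passed as a string (e.g. "0.999") so it can be parsed exactly.
    """
    slo = Fraction(slo_target)
    if not 0 < slo <= 1:
        raise ValueError("SLO target must be in (0, 1]")
    return math.floor((1 - slo) * total_requests)


def budget_remaining(slo_target: str, total_requests: int,
                     failed_requests: int) -> Fraction:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = allowed_failures(slo_target, total_requests)
    if allowed == 0:
        # No budget at all: any failure exhausts it.
        return Fraction(1) if failed_requests == 0 else Fraction(0)
    return 1 - Fraction(failed_requests, allowed)


def prioritize_reliability(slo_target: str, total_requests: int,
                           failed_requests: int) -> bool:
    """Hypothetical policy: favor reliability work once the budget is spent."""
    return budget_remaining(slo_target, total_requests, failed_requests) <= 0


if __name__ == "__main__":
    # Assumed example: 99.9% SLO over 1,000,000 requests, 250 failures so far.
    print(allowed_failures("0.999", 1_000_000))
    print(budget_remaining("0.999", 1_000_000, 250))
    print(prioritize_reliability("0.999", 1_000_000, 250))
```

Exact fractions are used instead of floats so that a target such as 0.999 does not pick up rounding error when the budget is computed.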
Books similar to "Site Reliability Engineering":
Software Engineering at Google
Titus Winters|Tom Manshreck|Hyrum Wright
System Design Interview – An insider's guide
Alex Xu
Accelerate
Nicole Forsgren PhD|Jez Humble|Gene Kim
Outcomes Over Output
Josh Seiden
Meltdown
Chris Clearfield|András Tilcsik
Extreme Programming Explained
Kent Beck|Cynthia Andres
Modern Software Engineering
David Farley
The Pragmatic Programmer
David Thomas|Andrew Hunt
Working Effectively with Legacy Code
Michael Feathers
Making Websites Win
Karl Blanks|Ben Jesson