Site Reliability Engineering
How Google Runs Production Systems
Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff
The book explains Google's approach to keeping its large-scale systems reliable and scalable, covering principles and practices such as service level objectives (SLOs), automation, and incident management. It also describes the role of the Site Reliability Engineer (SRE), including how SREs balance the need for system stability against the demands of new features and growth.