Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, Site Reliability Engineering: How Google Runs Production Systems, O’Reilly Media

By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.

To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places a 50% cap on the aggregate “ops” work for all SREs: tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.

SREs should receive a maximum of two events per 8–12-hour on-call shift. This target volume gives the on-call engineer enough time to handle the event accurately and quickly, clean up and restore normal service, and then conduct a postmortem.

Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. As such, monitoring strategy should be constructed thoughtfully. A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs. However, this type of email alerting is not an effective solution: a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.

• Set up alerting for acute problems.
• Compare behavior: did a software update make the server faster?
• Examine how resource consumption behavior evolves over time, which is essential for capacity planning.

key principle of any effective software engineering, not only reliability-oriented engineering, simplicity is a quality that, once lost, can be extraordinarily difficult to recapture. Nevertheless, as the old adage goes, a complex system that works necessarily evolved from a simple system that works. Chapter 9, Simplicity, goes into this topic in detail.

the more likely it is to be toil:
Manual
This includes work such as manually running a script that automates some task. Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time.
Repetitive
If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
Automatable
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.
Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it.
No enduring value
If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work (such as digging into legacy code and configurations and straightening them out) was involved.
O(n) with service growth
If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil. An ideally managed and designed service can grow by at least one order of magnitude with zero additional work, other than some one-time efforts to add resources.

The Four Golden Signals
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
Latency
The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
Traffic
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
Errors
The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
Saturation
How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very
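To make the four signals concrete, here is a minimal Python sketch that derives them from a batch of request records. Everything in it is an assumption for illustration and not from the book: the `Request` record, the `golden_signals` function, the choice of memory as the constrained resource, and the one-second latency policy borrowed from the example in the Errors definition. It treats only 5xx responses as explicit failures and cannot detect implicit ones such as wrong content.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    status: int        # HTTP status code
    latency_s: float   # time taken to serve the request, in seconds


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def golden_signals(requests, window_s, memory_used, memory_total,
                   slo_latency_s=1.0):
    """Summarise latency, traffic, errors, and saturation for one window."""
    ok = [r for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    # Errors "by policy": the request succeeded but broke the latency promise.
    too_slow = [r for r in ok if r.latency_s > slo_latency_s]
    total = len(requests)
    return {
        # Traffic: demand on the system, here requests per second.
        "traffic_qps": total / window_s,
        # Latency: tracked separately for successes and failures, so fast
        # 500s do not flatter the average and slow errors stay visible.
        "latency_ok_s": _mean([r.latency_s for r in ok]),
        "latency_error_s": _mean([r.latency_s for r in failed]),
        # Errors: explicit failures plus policy violations.
        "error_ratio": (len(failed) + len(too_slow)) / total if total else 0.0,
        # Saturation: fraction used of the most constrained resource.
        "saturation": memory_used / memory_total,
    }
```

Calling `golden_signals([Request(200, 0.1), Request(500, 0.01)], window_s=1.0, memory_used=6, memory_total=8)` returns a dictionary with one entry per signal; a real system would compute these continuously from its monitoring pipeline instead of from an in-memory list.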

Worrying About Your Tail (or, Instrumentation and Performance)

1. Record the current CPU utilization each second.
2. Using buckets of 5% granularity, increment the appropriate CPU utilization bucket each second.
3. Aggregate those values every minute.
This strategy allows you to observe brief CPU hotspots without incurring very high cost due to collection and retention.
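The bucketing steps above can be sketched in a few lines of Python. The names `BUCKET_WIDTH`, `bucket_for`, and `aggregate_minute` are hypothetical, and the per-second samples are assumed to arrive as percentages between 0 and 100; how they are actually collected is outside this sketch.

```python
from collections import Counter

BUCKET_WIDTH = 5  # percent, matching the 5% granularity in the text


def bucket_for(cpu_percent):
    """Return the lower bound of the 5%-wide bucket holding this sample."""
    clamped = min(max(cpu_percent, 0.0), 100.0)
    lower = int(clamped // BUCKET_WIDTH) * BUCKET_WIDTH
    # A reading of exactly 100% belongs in the top bucket (95 to 100).
    return min(lower, 100 - BUCKET_WIDTH)


def aggregate_minute(samples):
    """Turn one minute of per-second samples into bucket counts."""
    histogram = Counter()
    for sample in samples:
        histogram[bucket_for(sample)] += 1
    return dict(histogram)
```

A minute that is mostly idle with a two-second spike, for example `aggregate_minute([12.0] * 58 + [97.0, 99.5])`, keeps the spike visible as two counts in the 95 bucket, whereas a one-minute average would hide it. Only 20 counters per minute need to be stored.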

rules <<<
  # Compute a rate per task and per ‘code’ label
  {var=task:http_responses:rate10m,job=webserver} =
    rate by code({var=http_responses,job=webserver}[10m]);
  # Compute a cluster level response rate per ‘code’ label
  {var=dc:http_responses:rate10m,job=webserver} =
    sum without instance({var=task:http_responses:rate10m,job=webserver});
  # Compute a new cluster level rate summing all non 200 codes
  {var=dc:http_errors:rate10m,job=webserver} =
    sum without code({var=dc:http_responses:rate10m,job=webserver,code=!/200/});
  # Compute the ratio of the rate of errors to the rate of requests
  {var=dc:http_errors:ratio_rate10m,job=webserver} =
    {var=dc:http_errors:rate10m,job=webserver}
      /
    {var=dc:http_requests:rate10m,job=webserver};
>>>
Again,
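For readers unfamiliar with the rule language, the same chain of computations (per-task rate, sum across tasks per response code, sum of non-200 codes, ratio) can be approximated in Python. This is a sketch under stated assumptions and not how the rules are evaluated: the `rate` and `error_ratio` functions and the input layout are hypothetical, counter resets are ignored, non-200 codes are matched by exact string comparison, and the denominator is the sum of all response rates, whereas the rules above divide by a separate request-rate variable.

```python
def rate(samples):
    """Per-second rate of a monotonically increasing counter.

    `samples` is a time-ordered list of (timestamp_s, value) pairs covering
    the window of interest (10 minutes in the rules above).
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0) if t1 > t0 else 0.0


def error_ratio(task_counters):
    """Aggregate per-task response counters into a cluster-level error ratio.

    `task_counters` maps task name -> response code -> counter samples.
    Returns (rate per code, error rate, error ratio).
    """
    rate_by_code = {}
    for codes in task_counters.values():
        for code, samples in codes.items():
            # Sum the per-task rates, keeping the 'code' label.
            rate_by_code[code] = rate_by_code.get(code, 0.0) + rate(samples)
    total = sum(rate_by_code.values())
    # Sum away the 'code' label for everything that is not a 200.
    errors = sum(r for code, r in rate_by_code.items() if code != "200")
    return rate_by_code, errors, (errors / total if total else 0.0)
```

Each step mirrors one rule: the inner `rate` call is the per-task rate, the accumulation into `rate_by_code` is the cluster-level sum, and the final division is the ratio rule.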

rule evaluation cycles to ensure no missed collections cause a false alert.

The following example creates an alert when the error ratio over 10 minutes exceeds 1% and the total number of errors exceeds 1:
rules <<<
  {var=dc:http_errors:ratio_rate10m,job=webserver} > 0.01
    and by job, error
  {var=dc:http_errors:rate10m,job=webserver} > 1
    for 2m
    => ErrorRatioTooHigh
      details "webserver error ratio at trigger_value"
      labels {severity=page};
>>>
Our
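The behavior of the `for 2m` clause, where the condition must hold continuously before the alert fires, can be illustrated with a small Python sketch. The `ErrorRatioAlert` class and its parameter names are hypothetical, and the thresholds are copied from the rule above (note that the rule compares the 10-minute error rate against 1). This models only the hold-down logic, not how rules are scheduled or how pages are routed.

```python
class ErrorRatioAlert:
    """Fire only after both conditions have held for `hold_s` seconds."""

    def __init__(self, ratio_threshold=0.01, rate_threshold=1.0, hold_s=120.0):
        self.ratio_threshold = ratio_threshold
        self.rate_threshold = rate_threshold
        self.hold_s = hold_s
        self._pending_since = None  # when the condition first became true

    def evaluate(self, now_s, error_ratio, error_rate):
        """Run one evaluation cycle; return True if the alert should fire."""
        breaching = (error_ratio > self.ratio_threshold
                     and error_rate > self.rate_threshold)
        if not breaching:
            # Any healthy evaluation resets the pending timer.
            self._pending_since = None
            return False
        if self._pending_since is None:
            self._pending_since = now_s
        return now_s - self._pending_since >= self.hold_s
```

With evaluations at 0 s, 60 s, and 120 s that all breach both thresholds, only the third returns `True`. A single healthy evaluation in between restarts the two-minute wait, which is what keeps a brief blip from paging anyone.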