Site Reliability Engineer: Managing Cascading Failures

Overview/Description

Cascading failures are a concern for site reliability engineers (SREs) because they often stem from positive feedback and grow over time: when one part of a system fails, its load shifts to the remaining parts and makes them more likely to fail too. In this course, you'll examine the various cascading failure triggers, such as server overloads and CPU and memory issues. You'll also explore the resource exhaustion issues associated with cascading failures and their adverse effects on overall performance and stability.

You'll outline steps to prevent server overloads, ensure efficient queue management, deal with latency, and manage slow startups. You'll explore terms such as "load shedding" and "code retries." You'll also identify the benefits of setting deadlines and how propagating cancellations can reduce or eliminate unneeded work and preserve resources for other needs. Finally, you'll outline the steps involved in testing cascading failures and in addressing them immediately.

Expected Duration (hours)
1.2

Lesson Objectives

discover the key concepts covered in this course

define what is meant by cascading failures and identify situations in which this term is used

describe how server overloads can lead to cascading failures

define what is meant by resource exhaustion and describe its consequences

list CPU considerations as they relate to failures and overutilization

list factors that can contribute to memory exhaustion

recognize how file descriptors and threads can directly lead to failures

recognize how resource exhaustion can travel from one resource to another

recognize how resource exhaustion can lead to service unavailability

outline how to prevent server overloads

outline steps to ensure efficient queue management

differentiate between load shedding and graceful degradation

define what is meant by code retries and recognize why they are relevant to the topic of cascading failures

recognize the benefits of setting deadlines

recognize how propagating cancellations can reduce unneeded work

describe latency considerations, including bimodal latency, and how to address this class of problems

outline the steps involved in managing slow startups and working with cold caching

differentiate between the various cascading failure triggers

outline how to test cascading failures

list steps to immediately address cascading failures

summarize the key concepts covered in this course

Course Number
it_sreolcfdj_02_enus

Expertise Level
Intermediate