SRE Team Management: Managing Operational Loads





Overview/Description

To keep a system in a functional state, site reliability engineers (SREs) must learn how to identify, calculate, and manage its operational load, which generally falls into three categories: ongoing operational activities, tickets, and pages.

In this course, you'll explore these categories in detail. You'll start by outlining methods for managing operational loads at the team level, including the use of support ticketing systems and service level objectives (SLOs).

Next, you'll investigate 'toil,' a term used to describe the operational work associated with running and maintaining a production service. You'll outline steps for identifying, calculating, and eliminating toil and examine the adverse effects toil can have on a team.

Additionally, you'll outline how to work with interrupts and distinguish between crucial metrics used for managing them. Lastly, you'll identify the human element factors to consider when dealing with interrupts, including efficiency, distractibility, and respect. 



Expected Duration (hours)
0.9

Lesson Objectives

SRE Team Management: Managing Operational Loads

  • discover the key concepts covered in this course
  • describe what is meant by operational load and outline the three general categories of operational load
  • outline how on-call engineers depend on pages to respond to incidents and outages
  • outline the steps involved in responding to emergency incidents
  • outline the purpose of customer request support tickets and provide examples of simple and complex tickets
  • describe the essential components of a typical ticketing system
  • recognize how to use service level objectives (SLO) to ensure timely responses and resolutions
  • describe what is meant by toil and provide examples of toil, such as applying schema changes to a database
  • differentiate between types of toil including automatable, manual, repetitive, and tactical
  • outline steps to track and identify toil and describe why less toil is better
  • describe how to measure and calculate toil
  • outline steps to minimize or eliminate toil completely
  • differentiate between toil and complexity and describe approaches to address complexity
  • describe how toil can negatively affect staff, including through low morale and confusion among SREs
  • list key metrics used for managing interrupts, such as the severity of the interrupt
  • outline human element factors to consider when dealing with interrupts, such as distractibility
  • summarize the key concepts covered in this course

Course Number
it_sreinovdj_01_enus

Expertise Level
Intermediate