SRE Troubleshooting: SRE Troubleshooting Processes





Overview/Description

Troubleshooting is a critical skill for site reliability engineers (SREs). Using past experiences, a proper mindset, and a stable troubleshooting process, SREs can effectively report, triage, examine, diagnose, test, and cure system issues.

In this course, you'll explore troubleshooting approaches and best practices, while also learning how to avoid common pitfalls. You'll explore issue reporting, triaging, examination, diagnosis, and testing. You'll recognize how to simplify and reduce troubleshooting, use the "what, why, and where" technique, and examine negative results. You'll also investigate how to observe and interpret recent changes to identify what went wrong with a system. Lastly, you'll locate probable cause factors and outline the steps used to make troubleshooting more effective.



Expected Duration (hours)
1.0

Lesson Objectives


  • discover the key concepts covered in this course
  • describe how engineers think differently from "novices" when it comes to troubleshooting
  • outline best practices and approaches to troubleshooting and how to keep those skills sharp
  • outline an idealized troubleshooting model (report, triage, examine, diagnose, test/treat, and cure)
  • list potential pitfalls to avoid, such as looking for symptoms that are not relevant
  • outline how to manage operational loads
  • recognize the importance of an adequate initial problem report
  • recognize the importance of triaging problems from the outset
  • recognize the importance of examining each component of a system to understand whether it is functioning properly
  • identify the steps and approaches used to diagnose issues
  • describe methods for testing and treating possible causes to identify actual problems
  • recognize how to simplify and reduce troubleshooting using techniques such as dividing and conquering
  • describe the "what, why, where" technique and how it can be used to diagnose a malfunctioning system
  • interpret how determining who last touched a system can help identify what is going on with it
  • define what is meant by "negative results"
  • recognize that systems are complex and that often you can identify only probable cause factors when documenting what went wrong with a system
  • outline steps to make troubleshooting easier
  • summarize the key concepts covered in this course

Course Number
it_sreeftsdj_01_enus

Expertise Level
Intermediate