SRE Emergency & Incident Response: Responding to Emergencies


Overview/Description

Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to effectively deal with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs.

In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of test-induced, change-induced, and process-induced emergencies, how to respond to each, and what's involved in proactive approaches to emergency testing and planning.

You'll then outline the critical steps for correctly documenting emergencies, including the history of outages and mistakes. Next, you'll differentiate between business continuity and disaster recovery planning and outline how to create both types of plans and conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.



Expected Duration (hours)
1.2

Lesson Objectives

SRE Emergency & Incident Response: Responding to Emergencies

  • discover the key concepts covered in this course
  • outline the fundamental emergency response principles SREs need to be familiar with and recognize the critical steps to take when a system breaks
  • recognize the benefits of performing test-induced emergencies and outline what this involves
  • name the causes and outcomes of change-induced emergencies and outline how to respond to them
  • define what is meant by a process-induced emergency, describe the effects of such emergencies, and outline how to respond to them
  • describe why it is vital to keep a history of outages and mistakes and outline best practices when doing so
  • recognize the importance of asking relevant and challenging questions
  • define what is meant by proactive testing, compare it to reactive testing, recognize the importance of encouraging proactive testing, and name best practices when carrying out this type of testing
  • define what is meant by business continuity and describe why this type of planning matters
  • outline the six steps involved in developing a business continuity plan
  • outline methods to test a business continuity plan, recognize the importance of testing this type of plan, and describe some tips for testing it
  • recognize the importance of ongoing efforts to review and improve a business continuity plan and outline how to go about doing it
  • recognize the importance of having 'top-level' support for business continuity plans and promoting user awareness, and outline how to achieve these goals
  • define what is meant by a business impact analysis, outline how to conduct one and how it is typically structured, and name the possible effects on business operations
  • recognize the importance of developing an IT disaster recovery plan, list the goals of this type of plan, and describe what to consider when developing one
  • outline key steps to creating a working disaster recovery plan
  • name some types of IT recovery strategies and recognize the importance of recovery strategies developed for IT systems, applications, and data
  • summarize the key concepts covered in this course

Course Number
it_sreeriddj_01_enus

Expertise Level
Intermediate