Site Reliability Engineers (SREs) are responsible for assigning the appropriate resources and responsibilities to effectively deal with unexpected emergencies. To do this, SREs should ensure the proper processes and teams are in place before an emergency occurs.
In this course, you'll explore the different emergency types and outline how to plan for them. You'll examine the causes of test-induced, change-induced, and process-induced emergencies and how to respond to them, along with what's involved in proactive approaches to emergency testing and planning.
You'll then outline the critical steps for correctly documenting emergencies, including keeping a history of outages and mistakes. Next, you'll differentiate between business continuity and disaster recovery planning, outline how to create both types of plans, and learn how to conduct a business impact analysis. Lastly, you'll explore some IT recovery strategies.
SRE Emergency & Incident Response: Responding to Emergencies
discover the key concepts covered in this course
outline the fundamental emergency response principles SREs need to be familiar with and recognize the critical steps to take when a system breaks
recognize the benefits of performing test-induced emergencies and outline what this involves
name the causes and outcomes of change-induced emergencies and outline how to respond to these emergencies
define what is meant by a process-induced emergency, describe its effects, and outline how to respond to one
describe why it is vital to keep a history of outages and mistakes and outline best practices when doing so
recognize the value of asking important, relevant, and challenging questions
define what is meant by proactive testing, compare it to reactive testing, recognize the importance of encouraging proactive testing, and name best practices when carrying out this type of testing
define what is meant by business continuity and describe why this type of planning matters
outline the six steps involved in developing a business continuity plan
outline methods to test a business continuity plan, recognize the importance of testing this type of plan, and describe some tips for testing it
recognize the importance of ongoing efforts to review and improve a business continuity plan and outline how to go about doing it
recognize the importance of having 'top-level' support for business plans and promoting user awareness, and outline how to achieve these goals
define what is meant by a business impact analysis, outline its typical structure and how to conduct one, and name the possible effects on business operations
recognize the importance of developing an IT disaster recovery plan, list the goals of this type of plan, and describe what to consider when developing one
outline the key steps in creating a working disaster recovery plan
name some types of IT recovery strategies and recognize the importance of recovery strategies developed for IT systems, applications, and data