Resilient - Incident Response

Learn more about Well-Architected: Adaptable > Resilient > Incident Response > Ability to Triage

Product Area (where to look) | Location (where to look) | Pattern (what good looks like)
Platform | Business | ✅ SMEs or stakeholders who should be alerted to support complex issues are identified before an incident occurs
Platform | Business | ✅ The hand-off between delivery and support teams is a part of go-live
Platform | Business | ✅ If consulted, Salesforce architects respond quickly and help the team stay focused on recovery
Platform | Documentation | ✅ System and design patterns used in a given solution are discoverable and readable by support staff
Platform | Org | ✅ Logging and custom error messages are incorporated into execution paths throughout the system
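
The last pattern above (logging and custom error messages in every execution path) can be illustrated with a minimal, language-neutral sketch. It is written in Python only for brevity; in a Salesforce org the equivalent would normally live in Apex or Flow fault paths. The names `TriageableError`, `sync_order`, the logger name `order_sync`, and the error code `ORDER_SYNC_001` are all hypothetical and do not come from the source.

```python
import logging

# Hypothetical component name, used only for this illustration.
logger = logging.getLogger("order_sync")


class TriageableError(Exception):
    """Custom error carrying the details support staff need for triage."""

    def __init__(self, code: str, message: str, context: dict):
        super().__init__(f"[{code}] {message}")
        self.code = code          # stable identifier that can be searched in logs
        self.context = context    # record IDs and other data needed to reproduce


def sync_order(order: dict) -> str:
    """Hypothetical execution path that logs at entry, on failure, and at exit."""
    order_id = order.get("id", "<missing>")
    logger.info("sync_order start order_id=%s", order_id)

    if "account_id" not in order:
        err = TriageableError(
            code="ORDER_SYNC_001",
            message="Order has no account; check the upstream integration mapping.",
            context={"order_id": order_id},
        )
        # Log before raising so the failure is recorded even if a caller swallows it.
        logger.error("%s context=%s", err, err.context)
        raise err

    logger.info("sync_order done order_id=%s", order_id)
    return f"synced:{order_id}"
```

The point of the sketch is that the error message names a likely cause and the log line carries the identifiers needed to find the affected record, so support staff can triage without reading the implementation.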

Learn more about Well-Architected: Adaptable > Resilient > Incident Response > Monitoring and Alerting

Product Area (where to look) | Location (where to look) | Pattern (what good looks like)
Platform | Documentation | ✅ Entry criteria for prompt tuning alerts are defined based on direct and indirect generative AI feedback metrics
Platform | Org | ✅ Alerts are only used to inform users of scenarios that require human intervention; other failures are logged and reportable
Platform | Org | ✅ Alerts are sent to users who are capable of responding to them
Platform | Org | ✅ When possible, alerts are delivered in advance of a potential failure
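
The three Org patterns above can be combined into one routing decision, sketched below under stated assumptions. The `Event` fields, the `RESPONDERS` mapping, the fallback recipient, and the 80% warning threshold are all hypothetical values chosen for illustration; none of them come from the source or from a Salesforce API.

```python
from dataclasses import dataclass

# Hypothetical mapping from the skill an issue needs to who can act on it.
RESPONDERS = {
    "integration": ["integration-oncall"],
    "data": ["data-steward"],
}
FALLBACK = ["platform-admin"]  # assumed catch-all recipient
WARN_AT = 0.8                  # assumed share of a limit that triggers an early warning


@dataclass
class Event:
    name: str
    needs_human: bool           # does someone have to intervene?
    required_skill: str         # which kind of responder can fix it
    usage_ratio: float = 0.0    # share of some limit consumed, 0.0 to 1.0


def route(event: Event) -> dict:
    """Decide whether to alert or only log, and who should receive the alert."""
    # Early warning: the limit is close but has not been hit yet.
    early_warning = WARN_AT <= event.usage_ratio < 1.0

    if not event.needs_human and not early_warning:
        # Nothing for a person to do: keep it logged and reportable, send no alert.
        return {"action": "log", "recipients": [], "early_warning": False}

    recipients = RESPONDERS.get(event.required_skill, FALLBACK)
    return {"action": "alert", "recipients": recipients, "early_warning": early_warning}
```

For example, `route(Event("retry succeeded", False, "integration", 0.1))` only logs, while an event at 85% of a limit alerts the integration responders before anything has failed.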

Learn more about Well-Architected: Adaptable > Resilient > Incident Response > Time to Recover

Product Area (where to look) | Location (where to look) | Pattern (what good looks like)
Platform | Business | ✅ Teams know which production services they are responsible for
Platform | Business | ✅ Recovery protocols are practiced at regular intervals
Platform | Documentation | ✅ Recovery tactics are defined and classified by incident type and trigger
Platform | Documentation | ✅ Exit criteria for incident responses exist in your SLOs and are clear
Platform | Documentation | ✅ Activation criteria and assignment logic for elevated permissions during incidents are clear
Platform | Documentation | ✅ Incident response permission sets and authorizations are clearly listed
Platform | Org | ✅ Session-based permission sets for incident response exist and can be assigned to support staff during recovery
Platform | Org | ✅ Setup Audit Trail shows designated recovery testers logged in to the testing environment at the agreed-upon time and followed the recovery test scripts
Platform | Test Plans | ✅ Test scripts for recovery testing exist and are repeatable
Platform | Test Plans | ✅ Environments for incident simulations are clearly listed
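
Two of the patterns above, SLO-based exit criteria and clear activation criteria for time-bound elevated permissions, can be written down as explicit checks. The sketch below is a plain Python model for discussion, not a Salesforce API: the `Slo` and `ElevatedGrant` types, the "severity 1 or 2 and on call" activation rule, and every threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    max_error_rate: float          # assumed SLO threshold, e.g. 0.01 for 1%
    healthy_checks_required: int   # consecutive healthy checks needed to exit


def exit_criteria_met(error_rates: Sequence[float], slo: Slo) -> bool:
    """An incident may close once the most recent checks all meet the SLO."""
    recent = list(error_rates)[-slo.healthy_checks_required:]
    return (
        len(recent) == slo.healthy_checks_required
        and all(rate <= slo.max_error_rate for rate in recent)
    )


def may_activate(severity: int, on_call: bool) -> bool:
    """Hypothetical activation rule: on-call staff, severity 1 or 2 only."""
    return on_call and severity <= 2


@dataclass(frozen=True)
class ElevatedGrant:
    user: str
    permission_set: str    # hypothetical name of an incident response permission set
    granted_at: datetime
    ttl: timedelta         # elevated access expires on its own

    def is_active(self, now: datetime) -> bool:
        return self.granted_at <= now < self.granted_at + self.ttl
```

Writing the rules this explicitly, in whatever form the team prefers, makes them testable during recovery drills: a simulation can assert that access expires after the agreed window and that an incident is not closed before the exit criteria hold.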

Learn more about Well-Architected: Adaptable > Resilient > Incident Response > Ability to Triage

Product Area (where to look) | Location (where to look) | Anti-Pattern (what to avoid)
Platform | Business | ⚠️ SMEs or stakeholders who should be alerted aren't identified until an incident occurs
Platform | Business | ⚠️ The hand-off between delivery teams and support teams isn't a part of the release process
Platform | Business | ⚠️ Salesforce architects consider incident response to be outside their scope of work
Platform | Documentation | ⚠️ System and design patterns used in a given solution are not readily available to support staff
Platform | Org | ⚠️ Logging and custom error messages are not used

Learn more about Well-Architected: Adaptable > Resilient > Incident Response > Monitoring and Alerting

Product Area (where to look) | Location (where to look) | Anti-Pattern (what to avoid)
Platform | Documentation | ⚠️ There are no criteria defined for triggering prompt tuning alerts for generative AI apps
Platform | Org | ⚠️ Alerts are sent when any type of failure occurs, regardless of whether follow-on actions are required
Platform | Org | ⚠️ Alerts about issues requiring technical solutions are delivered to business users
Platform | Org | ⚠️ Alerts are only delivered in response to failures that have already occurred

Learn more about Well-Architected: Adaptable > Resilient > Incident Response > Time to Recover

Product Area (where to look) | Location (where to look) | Anti-Pattern (what to avoid)
Platform | Business | ⚠️ It is unclear which teams are responsible for which services in production
Platform | Business | ⚠️ Recovery protocols don't exist or aren't practiced at regular intervals
Platform | Documentation | ⚠️ Incident response is performed ad hoc
Platform | Documentation | ⚠️ Exit criteria for incident responses do not exist
Platform | Documentation | ⚠️ Elevated permissions are not assigned, or are assigned ad hoc
Platform | Documentation | ⚠️ Incident response permission sets and authorizations are not listed
Platform | Org | ⚠️ Session-based permission sets do not exist for incident response, or are not authorized for support staff to use
Platform | Org | ⚠️ Setup Audit Trail shows designated recovery testers have not logged in to the testing environment or did not follow the recovery test scripts
Platform | Test Plans | ⚠️ Environments are not established for incident simulations
Platform | Test Plans | ⚠️ Test scripts do not exist for recovery testing