
Introduction

Resilient solutions maintain a high quality of service even when failures occur. If performance degrades or service is interrupted, the solution quickly and effectively recovers.

The resilience of a solution is grounded in two key qualities:

To architect your solution for resilience, you must design for both toughness and elasticity, ensuring durability and rapid recovery in the face of planned and unplanned changes.

In technology contexts, consider a system or solution as a collection of interdependent components that coordinate to perform shared goals. Every component has the potential to fail. Problems within those components, from code and configuration defects to network and hardware issues, can cause unexpected, undesired behavior. A system demonstrates resilient behavior when one or more components fail, but the overall system continues to function or quickly returns to a stable state.

To improve the resilience of your Salesforce solutions, we recommend focusing on three key habits.

Application Lifecycle Management

Application lifecycle management (ALM) is a practice for holistically managing software throughout its lifecycle, from creation through retirement. ALM is a cornerstone of system resiliency and encompasses people, processes, tools, and disciplines related to the application lifecycle. Those disciplines include DevOps and delivery methodologies, observability, testing strategies, governance, and CI/CD.

When a business practices effective ALM, its teams react quickly to change, and its applications keep pace with evolving business requirements without compromising stability or quality.

On the other hand, without healthy ALM, teams struggle at every stage of the application lifecycle.

Symptoms of poor ALM include:

Because ALM touches nearly every aspect of a solution, establishing clear and effective ALM practices for your solution is a key part of your architectural work.

Build better ALM practices by focusing on three key areas.

Release Management

Release management involves planning, sequencing, controlling, and migrating changes into one or more environments. A single release is a group of planned changes that a team moves into a target environment at the same time.

Releasing a change to a system introduces risk to it. If the system is in a stable state before the change, it transitions to a new state, where it’s also more vulnerable to risks from future changes. If any future changes trigger an uncontrolled, unstable state in the system, they can cause a critical incident. In a solution architecture, designing for resilient releases is more than just testing individual changes effectively. It also involves planning how to introduce changes to your systems and their users safely.
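The state transitions described here can be made concrete with a small sketch. This is an illustrative Python model, not a Salesforce API: a release gate validates a candidate state and falls back to the last known stable state when validation fails.

```python
import copy

class ReleaseGate:
    """Treats each release as a guarded state transition: if post-deploy
    validation fails, the system returns to the last known stable state
    instead of remaining in an unstable one. Illustrative only."""

    def __init__(self, initial_state):
        self.stable_state = copy.deepcopy(initial_state)
        self.current_state = initial_state

    def release(self, changes, validate):
        candidate = {**self.current_state, **changes}
        if validate(candidate):
            self.current_state = candidate
            self.stable_state = copy.deepcopy(candidate)  # new stable baseline
            return "released"
        # Roll back: the failed change never becomes the new baseline.
        self.current_state = copy.deepcopy(self.stable_state)
        return "rolled back"

# Example: a validation rule that rejects releases without a version tag.
gate = ReleaseGate({"version": "1.0"})
print(gate.release({"version": "1.1"}, validate=lambda s: s.get("version")))   # released
print(gate.release({"version": None}, validate=lambda s: s.get("version")))   # rolled back
print(gate.current_state)  # back at the last stable state: {'version': '1.1'}
```

In practice, the validation step would be your deployment checks and smoke tests; the point is that a failed change never becomes the new baseline.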

The work that your teams do depends on predictable and accurate release information. In your change management and enablement processes, be clear about which changes can move into your system. In your release management and enablement processes, specify how—and how often—changes are released to your system.

Your business stakeholders also care about release information, especially if it’s related to features or bug fixes that they request. To build trust in your solution and demonstrate value to your stakeholders, establish release schedules that are consistent and clear and ship release artifacts that are stable.

To establish effective release management for Salesforce:

The best release mechanisms for your team are the most stable options that your team has the required skills for. These are the recommended release mechanisms, listed in order of stability. All of them are compatible with each other, so use several of them in tandem if that’s best for your company.

The patterns and anti-patterns for ALM show what proper and poor release management looks like for a Salesforce org. Use the patterns to validate your designs before you build, or to identify places in your system that need to be refactored.

To learn more about Salesforce tools for release management, see Salesforce Tools for Resiliency.

Environment Strategy

Salesforce provides a variety of environments for you to use during application development and testing cycles. An effective environment strategy for Salesforce requires understanding how to use the environments and what good management looks like. In ALM, how useful a development or testing environment is depends on its fidelity to and isolation from production.

A good environment strategy provides several benefits.

Teams often struggle to realize these benefits. Challenges to getting the most out of your development environments and strategy can come from several sources. One likely source is the type of development model that your teams follow.

In the older, org-based development approach, each environment needed to serve several functions. In addition to being where your team does its various kinds of work, it needed to be the source for your release artifacts (that is, the metadata that you wanted to deploy in a release). Because environments weren’t easy to set up or tear down, they were often overcrowded and full of metadata conflicts between teams, and they didn’t contribute meaningful speed or flexibility to ALM overall.

Using a source-based development model fundamentally shifts the relationship that environments have to your releases and release artifacts. In this model, source control is the source of the metadata that you want to release. Environments are just places where your teams do their work.
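As a minimal illustration of that shift, the release artifact can be assembled directly from version-controlled files rather than retrieved from an org. This Python sketch is purely conceptual; the directory layout and file names are hypothetical stand-ins, not the actual Salesforce source format.

```python
import pathlib
import tempfile

def build_manifest(source_root, suffix=".xml"):
    """In a source-based model, the release artifact is derived from
    version-controlled files, not from any environment. This collects
    the metadata files under a source directory into a sorted manifest."""
    root = pathlib.Path(source_root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob(f"*{suffix}"))

# Example with a throwaway directory standing in for a source repo.
with tempfile.TemporaryDirectory() as repo:
    objects = pathlib.Path(repo, "objects")
    objects.mkdir()
    (objects / "Account.object-meta.xml").write_text("<CustomObject/>")
    flows = pathlib.Path(repo, "flows")
    flows.mkdir()
    (flows / "Intake.flow-meta.xml").write_text("<Flow/>")
    print(build_manifest(repo))
    # → ['flows/Intake.flow-meta.xml', 'objects/Account.object-meta.xml']
```

Environments then become disposable workspaces: the manifest, not any org, defines what a release contains.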

However, following the source-based development model doesn’t alone guarantee a good environment strategy. Even with source control, teams can still struggle to set up conditions to test integrations with external systems, configurations that depend on metadata that isn’t in source control (such as managed packages or customizations that depend on data), and so on. In certain circumstances, the challenges of a source-based model are similar to the challenges that are typical of an org-based model.

To develop an effective environment strategy:

| | Scratch Org | Developer Sandbox | Developer Pro Sandbox | Partial Copy Sandbox | Full Sandbox |
| --- | --- | --- | --- | --- | --- |
| Supports Org Shape | Yes | No | No | No | No |
| Supports Source Tracking | Yes | Yes | Yes | No | No |
| Lifespan | 1–30 days | Manually controlled | Manually controlled | Manually controlled | Manually controlled |
| Refresh Interval | Not available | 1 day | 1 day | 5 days | 29 days |
| Release Preview Support | Developer controlled | Based on sandbox instance | Based on sandbox instance | Based on sandbox instance | Based on sandbox instance |
| Provisioning Time | > 5 minutes | Hours or days | Hours or days | Hours or days | Hours or days |
| Metadata Determined By | Source control | Production | Production | Production | Production |
| Data Determined By | Manual data load | Manual data load | Manual data load | Sandbox template | Production |
| Data Limit | 200 MB | 200 MB | 1 GB | 5 GB | Same as in production |

Refer to this table to learn which features and environments to use for several common development tasks.

Task Org Shape Source Tracking Frequent Refreshes Release Preview Support All metadata from production Partial Metadata from production Large datasets from production Partial datasets from production Compatible Environments
Prototyping X X X X X X X Scratch Orgs, Developer and Developer Pro Sandboxes
New Feature Investigations or Proof-of-Concept Development X X X X X X X Scratch Orgs, Developer and Developer Pro Sandboxes
User Acceptance Testing X X X X X X Developer, Developer Pro and Partial Copy sandboxes
Performance and Scale Testing X X X Full sandbox
User Training X X X X X* X Developer Pro, Partial Copy, and Full* sandboxes
*If required to complete a specific kind of work, otherwise use a less resource-intense environment

In addition, for Agentforce agents that use features such as the Einstein Data Library, knowledge articles, and unstructured data, comprehensive testing and accurate testing conditions require a Data 360 sandbox.

The list of patterns and anti-patterns for ALM shows what proper and poor environment management looks like in a Salesforce org. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools for environment management, see Salesforce Tools for Resiliency.

Signaling Strategy

A signaling strategy defines the critical signals and application instrumentation needed to detect, diagnose, and remediate failures before they cascade into system-wide degradation. Effective instrumentation transforms applications from passive victims of failure into active participants in their own resilience, capable of detecting problems, adapting their behavior, and coordinating graceful degradation when necessary.

When applications implement comprehensive instrumentation, they gain the ability to self-regulate under stress, communicate their health status to operators, and participate in coordinated recovery efforts. These capabilities allow systems to maintain service quality even as individual components experience distress. On the other hand, without proper instrumentation, applications become black boxes that fail silently until catastrophic symptoms appear. Teams react to problems only after users report them, and troubleshooting becomes an exercise in archaeology rather than observation.

These signals are exposed through standardized interfaces that allow both automated systems and human operators to assess application health. The instrumentation itself becomes part of the application's resilience strategy, enabling circuit breakers to trip based on error rates, autoscalers to respond to queue depths, and operators to make informed decisions during incidents.
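The circuit-breaker behavior mentioned here is a generic resilience pattern, and a minimal sketch shows the idea. This Python example is illustrative and assumes nothing about Salesforce APIs: after a threshold of consecutive failures, the breaker trips open and callers fail fast instead of piling load onto an unhealthy dependency.

```python
class CircuitBreaker:
    """Trips open after `threshold` consecutive failures so callers stop
    hammering an unhealthy dependency; any success resets the count.
    A minimal sketch: real breakers also add a half-open probe state."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.state = "closed"

    def call(self, operation):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"  # stop sending traffic downstream
            raise
        self.failures = 0  # a success resets the failure count
        return result

breaker = CircuitBreaker(threshold=2)
for _ in range(2):
    try:
        breaker.call(lambda: 1 / 0)  # a dependency that always fails
    except ZeroDivisionError:
        pass
print(breaker.state)  # → open
```

The error-rate threshold here is an assumption; in a real system it would be derived from the SLOs your teams agree on.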

The patterns and anti-patterns for ALM show what proper and poor signaling strategies look like in a Salesforce org. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools for a signaling strategy, see Salesforce Tools for Resiliency.

Testing Strategy

A test strategy is a set of guiding principles and standards for how to plan and run tests that gauge the success and failure of applications during ALM processes. A test strategy keeps every stakeholder who is involved in testing informed about and aligned with the priority, purpose, and scope of a given test. It also helps project teams create effective and thoughtful test plans.

Typically, developers or quality assurance and testing experts are involved in creating and executing specific tests. A test strategy helps ensure that these individuals know what kinds of tests need to be conducted for a given project and in what sequence to conduct them. A test strategy also helps ensure that teams have what they need to build well-formed tests, test plans, and artifacts (for example, test data sets, devices, and traffic or network simulators).

An effective testing strategy creates a clear picture of how, when, where, and why to run different test types—including unit tests, UI tests, and regression tests—in various combinations and conditions to uncover how your system and any in-flight changes will behave. An effective test strategy produces tests that show you how well a system conforms to non-functional requirements—such as scalability, reliability, and usability—which can be difficult to measure through a single kind of test.
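One concrete ingredient of such a strategy is isolating unit tests from live services by injecting a stub in place of a real client. This hedged Python sketch (not Apex; the function, endpoint, and payload are hypothetical) shows the shape of the technique:

```python
import json
import unittest
from unittest import mock

def fetch_account_status(http_get, account_id):
    """Looks up an account's status via an injected HTTP client so that
    tests can substitute a stub for the real service. The endpoint and
    payload shape here are illustrative assumptions."""
    response = http_get(f"/accounts/{account_id}")
    return json.loads(response)["status"]

class FetchAccountStatusTest(unittest.TestCase):
    def test_uses_stubbed_response_instead_of_live_service(self):
        # The stub plays the role of a mock/data factory: deterministic
        # test data, no network dependency.
        stub = mock.Mock(return_value='{"status": "active"}')
        self.assertEqual(fetch_account_status(stub, "001"), "active")
        stub.assert_called_once_with("/accounts/001")

unittest.main(argv=["tests"], exit=False)
```

The same principle appears in the Apex patterns later in this guide: data factories and stubs keep unit tests fast, repeatable, and independent of org data.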

To create effective testing strategies for Salesforce:

ALM Patterns and Anti-Patterns

The following table shows a selection of patterns to look for or build in your org, and anti-patterns to avoid or target for remediation.

✨ Discover more patterns for ALM in the Pattern & Anti-Pattern Explorer.

Patterns Anti-Patterns
Release Management In production:
- Metadata shows use of stable release mechanisms, such as:
-- Metadata being organized into unlocked packages
-- DevOps Center being active and installed
-- Deployments via the Metadata API using the source format
- Deployment logs show no failed deployments within the available history.
- Deployment history shows clear release cadences and fairly uniform deployment clusters within release windows.
In production:
- Metadata indicates use of org-based release mechanisms, such as:
-- Active use of change sets
-- Deployments via Metadata API use package.xml format
- Deployment logs show repeated instances of failed deployments within the available history.
- Deployments have no discernible cadence, or show uneven clusters of deployments, which are signs of hot fixes and ad hoc rollbacks.
- DevOps Center isn't enabled and installed.
In your roadmap and documentation:
- Release names are clear.
- Features are clearly tied to a specific, named release.
- Release names are searchable and discoverable.
- Teams can find and follow clear guidelines for tagging artifacts, development items, and other work with the correct release names.
- It's possible to pull together a clear view of a release manifest by a release name.
- Quality thresholds for generative AI apps are defined for different development stages.
In your roadmap and documentation:
- Release names aren't included.
- Features aren't clearly tied to a specific release.
- Release names are used ad hoc or don't exist.
- Teams refer to artifacts, development items, and other work in different ways.
- It's not possible to pull together a clear view of a release manifest using a release name.
- Quality thresholds for generative AI apps aren't defined, or if they are, aren't defined for different development stages.
Environment Strategy In your orgs:
- A source-driven development and release model is adopted.
- Source tracking is enabled for Developer and Developer Pro sandboxes.
- Metadata in a given environment is independent from release artifacts.
- Environments don't directly correspond to a release path.
- Release paths for a change depend on the type of the change (high risk, medium risk, or low risk).
- Overcrowded environments don't exist.
- Risky configuration changes are never made directly in production.
- No releases occur during peak business hours.
- Data 360 sandboxes are used to properly test agentic use cases that require the Einstein Data Library, knowledge articles, and unstructured data.
In your orgs:
- An org-based development and release model is adopted.
- Source tracking isn't enabled for Developer and Developer Pro sandboxes.
- Metadata in a given environment is a release artifact.
- Environments directly correspond to a release path.
- The release path for every change is the same, regardless of the type of change.
- Overcrowded environments exist.
- Risky configuration changes are made directly in production.
- Releases occur during peak business hours.
- Agentforce agents that require the Einstein Data Library, knowledge articles, and unstructured data aren't tested using Data 360 sandboxes.
Signaling Strategy In your orgs:
- Teams collaborate on defining and standardizing health check APIs and SLOs.
- The regular review and refinement of signaling strategies are part of post-mortems and operational readiness reviews.
In production:
- Health checks are implemented for all applications.
- Applications provide explicit signals about their health, such as their load and capabilities.
- Applications are designed to degrade gracefully when dependencies are unhealthy.
- Load shedding is used to prevent cascading failures.
In your design:
- Backpressure and load-shedding mechanisms prevent services from being overwhelmed by traffic.
- It's assumed that dependencies eventually fail. Signal handlers are built to ameliorate failures.
In your orgs:
- Teams operate in silos, creating inconsistent and incompatible health-signaling mechanisms.
- Signaling strategies are an afterthought, only addressed when an incident occurs.
In production:
- Components fail silently without signaling their health status.
- Applications retry requests to unhealthy services indefinitely.
- All requests are treated with the same priority, regardless of their importance.
- To identify problems, operators rely solely on reactive measures, such as user complaints or critical system failures.
In your design:
- It's assumed that all dependencies will always be available, and network partitions, latency spikes, or other common issues aren't accounted for.
- Applications accept all incoming requests, even when they are overloaded, leading to increased latency and a higher likelihood of failure.
Testing Strategy In your business:
- Usability tests employ a variety of devices and assistive technology.
- Simulators are used to replicate production-like conditions for scalability and performance testing.
- Tests are automated to run when changes come into source control.
- Endurance, stress, performance, and scale tests are run at several intervals in the application development cycle and considered ongoing tasks.
- You include scale testing as part of your QA process when you have B2C-scale apps, large volumes of users, or large volumes of data.
- Your scale tests are focused on high-priority aspects of the system.
- Your scale tests have well-defined criteria.
- You conduct scale testing in a Full sandbox.
- Prompt engineering includes a quality review by a human.
- Agentforce Testing Center is used for robust agent testing.
In your business:
- Usability tests aren't conducted, or if they are, are conducted on a limited set of devices.
- Production-like volumes of user requests, API traffic, and variations in network speed aren't tested.
- Test automation isn't in place.
- Endurance, stress, performance, and scale tests are considered a phase or stage of development rather than ongoing tasks.
- You don't conduct scale tests as a part of your QA process, and you have B2C-scale apps, large volumes of users, or large volumes of data.
- Your scale tests aren't prioritized.
- Your scale tests don't have well-defined criteria.
- You conduct scale tests in a Partial Copy or Developer sandbox.
- Prompt engineering doesn't include a quality review by a human.
- Agentforce agents aren't tested, or if they are, tested only ad hoc using Agent Builder.
In your org:
- All test data is scrubbed of sensitive and identifying data.
In your org:
- Test data is identical to production data.
In Apex:
- Data factory patterns are used for unit tests.
- Mocks and stubs are used to simulate API responses.
In Apex:
- Unit tests rely on org data.
- Mocks and stubs aren't used.
In your design standards and documentation:
- Environments are classified by the types of tests they can support.
- Appropriate test regimes are specified according to risk, use case, or complexity.
In your design standards and documentation:
- Which types of tests each environment supports isn't clear.
- Test regimes aren't categorized by risk, use case, or complexity.

Incident Response

In security and site reliability engineering (SRE), incident response is focused on how teams identify and address events that impact the overall availability or security of a system, as well as how teams work to address root causes and prevent future issues. Incident response involves the processes, tools, and organizational behaviors required to address issues in real time and after an issue occurs.

As an architect, you may not be the person monitoring your solution’s operations on a day-to-day basis once it goes live. Part of architecting for resilience is designing recovery capabilities that enable support teams to perform first-level diagnosis, stabilize systems, and effectively hand over the investigation and root cause mitigation to development or maintenance teams. Teams directly supporting users on a day-to-day basis may not have a deep understanding of or expertise in the architecture of the system. It’s essential for these teams to have the tools and processes that they need to monitor daily operations, access information from the system when diagnosing a potential incident, and serve as effective first-responders for any issues impacting availability.

You can improve how well teams respond to incidents in your Salesforce solutions by focusing on your time to recover, ability to triage, and monitoring and alerting.

Time to Recover

When an incident occurs, the first priority must be restoring systems to a stable operational state. Often, businesses think that the only way to recover from an incident is to “fix the problem.” This assumption is fair in that accurate root cause analysis and remediation is how you ultimately resolve critical issues in a system. However, “fixing the problem” during the early stages of crisis response isn’t the most practical approach. Depending on the severity of an incident, every second of downtime or degraded service can cost the business revenue and reputation.

Often, attempting to diagnose and address root causes delays efforts to restore a system to operation. Logistically, adopting an approach that asks incident responders to address root causes puts tremendous strain on the subject matter experts (SMEs) and support staff at your company. Working to find and fix root causes during an incident requires SMEs to be on call for every incident, which can block frontline, customer-facing support staff from taking action. It can also result in teams releasing changes that, in turn, create more incidents. Ultimately, such an approach increases costs, consumes bandwidth across teams, and leads to behaviors in times of crisis that can erode customer trust and brand reputation.

The right incident management paradigm is to prioritize and focus on recovery as a first step. After a system is restored to stability, you can follow up with blameless postmortems, incident investigations, root cause remediation, and similar activities. This order of operations better enables incident response staff to triage, diagnose, and execute recovery tactics, alerting relevant SMEs to assist only as necessary. It also enables SMEs to identify and fix the root causes of an incident with less pressure from a ticking clock.

To adopt a recovery-first mindset to incident response:

| Incident Type | Apparent Trigger | Recovery Tactics |
| --- | --- | --- |
| System Outage | Corrupted logins or issues with account access | An account recovery policy |
| System Outage | Service unavailability | Activating a redundant, backup service; manual workarounds |
| Production Bug | A recent change | Deployment rollback or redeployment of the previous version |
| Production Bug | An emergent, unexplained bug | Manual workarounds, disabling non-essential features, escalating to SMEs |
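Tactics like those above can be encoded as a runbook lookup that first responders, or automation, can consult during triage. A hedged Python sketch, with illustrative entries only:

```python
# A runbook keyed by (incident type, apparent trigger). The entries
# mirror the table above and are illustrative, not exhaustive.
RUNBOOK = {
    ("system outage", "account access"): ["apply account recovery policy"],
    ("system outage", "service unavailability"): [
        "activate redundant backup service",
        "enable manual workarounds",
    ],
    ("production bug", "recent change"): ["roll back deployment"],
    ("production bug", "unexplained"): [
        "enable manual workarounds",
        "disable non-essential features",
        "escalate to SMEs",
    ],
}

def recovery_tactics(incident_type, trigger):
    """Returns the first-response tactics for a triaged incident,
    defaulting to escalation when no runbook entry matches."""
    return RUNBOOK.get((incident_type, trigger), ["escalate to SMEs"])

print(recovery_tactics("production bug", "recent change"))  # → ['roll back deployment']
```

Keeping the lookup in one place also gives post-incident reviews a concrete artifact to refine after each recovery.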

The patterns and anti-patterns for incident response show what architecting to prioritize recovery looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools to help with time to recover, see Salesforce Tools for Resiliency.

Ability to Triage

In the context of technology, triaging involves assigning categories and levels of severity to issues and support requests. No matter how well planned your solution is, user support issues and requests will arise. These issues can stem from a lack of sufficient training or change management, gaps in UI/UX, unexpected end-user behaviors, and urgent system issues not caught by monitoring or alerting.

Support and operations teams need to be able to investigate user support queries efficiently and diagnose them quickly. Triaging issues to filter out less severe concerns and quickly spot critical system incidents is a key competency for these teams. Poor triaging slows all levels of user support, prolongs critical incidents, and increases the risk of further disruptions to your customers and your business.

Although you may not be involved in day-to-day operations and support, as an architect, it’s your responsibility to help ensure that your support and operations teams can effectively triage issues in any solution that you create on the Salesforce platform.
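To make the idea concrete, a triage rule can map a few coarse impact signals to a severity level. The signals and thresholds in this Python sketch are hypothetical, not a Salesforce standard:

```python
def triage(users_affected, business_critical, workaround_exists):
    """Assigns a severity level from a few coarse impact signals so that
    support teams route critical incidents ahead of routine requests.
    The thresholds are illustrative assumptions only."""
    if business_critical and not workaround_exists:
        return "sev1"  # page the on-call responder immediately
    if users_affected > 100 or business_critical:
        return "sev2"  # urgent, but a workaround buys time
    if users_affected > 1:
        return "sev3"  # routine support queue
    return "sev4"      # single-user question or training gap

print(triage(users_affected=500, business_critical=True, workaround_exists=False))  # → sev1
print(triage(users_affected=1, business_critical=False, workaround_exists=True))    # → sev4
```

Whatever the exact rules, writing them down and automating them keeps triage consistent across support staff and shifts.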

To enable teams to effectively triage issues within your Salesforce solutions:

The patterns and anti-patterns for incident response show what architecting for effective triaging looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools to help with triaging, see Salesforce Tools for Resiliency.

Monitoring and Alerting

Monitoring and alerting are widely used terms in site reliability engineering. In the context of system resiliency, monitoring is continuously assessing the current state of a system, and alerting is automatically notifying stakeholders about potential concerns with that state. Effective monitoring and alerting is a key part of decoupling the scale and growth of your system from the scale and growth of your support staff.
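The division of labor between monitoring and alerting can be sketched simply: record every failure, but notify a human only when intervention is required. This Python example is illustrative; the threshold and event shape are assumptions:

```python
import logging

logging.basicConfig(level=logging.INFO)

def handle_failure(event, needs_human, notify, log=logging.getLogger("ops")):
    """Routes a failure event: alert a person only when intervention is
    required; otherwise record it so it stays reportable. A generic
    sketch of the monitoring/alerting split, not a Salesforce API."""
    if needs_human(event):
        notify(event)  # page someone who is capable of acting on it
        return "alerted"
    log.info("recorded: %s", event)  # logged and reportable, but no page
    return "logged"

pages = []
result = handle_failure(
    {"error_rate": 0.35},
    needs_human=lambda e: e["error_rate"] > 0.25,  # illustrative threshold
    notify=pages.append,
)
print(result, pages)  # → alerted [{'error_rate': 0.35}]
```

This mirrors the patterns later in this section: alerts go only to people who can respond, and everything else is logged for reporting.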

Salesforce provides a variety of built-in capabilities to monitor behaviors in your system. Salesforce also offers real-time event monitoring as an add-on or as part of Salesforce Shield. In any Salesforce solution, designs architected for monitoring and alerting provide:

To architect for effective monitoring and alerting within your Salesforce solutions:

The list of patterns and anti-patterns shows what architecting for effective monitoring and alerting looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools for monitoring and alerting, see Salesforce Tools for Resiliency.

Incident Response Patterns and Anti-Patterns

This table shows a selection of patterns to look for or build in your org, and anti-patterns to avoid or target for remediation.

✨ Discover more patterns for incident response in the Pattern & Anti-Pattern Explorer.

Patterns Anti-Patterns
Time to Recover In your business:
- Recovery protocols are practiced at regular intervals.
- Teams know which services in production they own and are responsible for.
- Teams understand relevant tooling to support the diagnosis of issues.
In your business:
- Recovery protocols don't exist or aren't practiced at regular intervals.
- Which teams own and are responsible for the different services in production isn't clear.
- Teams have no guidance or standards on tooling to support the diagnosis of issues.
In your documentation:
- Recovery tactics are defined and classified by incident type and trigger.
- Exit criteria for incident responses are included in SLOs and are clear.
- Activation criteria and assignment logic for elevated permissions during incidents are clear.
- Incident response permission sets and authorizations are clearly listed.
- A troubleshooting guide to assist with identifying and diagnosing common issues exists.
In your documentation:
- Incident response is performed ad hoc.
- Exit criteria for incident responses don't exist.
- Elevated permissions aren't assigned, or if they are, are assigned ad hoc.
- Incident response permission sets and authorizations aren't listed.
In your org:
- Session-based permission sets for incident response exist and can be assigned to support staff during recovery.
- Setup Audit Trail shows that designated recovery testers logged into the testing environment at the agreed-upon time and followed recovery test scripts.
In your org:
- Session-based permission sets don't exist for incident response, or if they do, support staff aren't authorized to use them.
- Setup Audit Trail shows that designated recovery testers didn't log in to the testing environment or didn't follow recovery test scripts.
In your test plans:
- Test scripts for recovery testing exist and are repeatable.
- Environments for incident simulations are clearly listed.
In your test plans:
- Test scripts for recovery testing don't exist.
- Environments for incident simulations aren't established.
Ability to Triage In your business:
- SMEs or stakeholders who should be alerted to support complex issues are identified before an incident occurs.
- The handoff between delivery and support teams is a part of go-live.
- If consulted, Salesforce architects respond quickly and help the team stay focused on recovery.
In your business:
- SMEs or stakeholders who should be alerted aren't identified until an incident occurs.
- The handoff between delivery teams and support teams isn't a part of the release process.
- Salesforce architects consider incident response to be outside their scope of work.
In your documentation:
- System and design patterns used in a given solution are discoverable and readable by support staff.
In your documentation:
- System and design patterns used in a given solution aren't readily available to support staff.
In your org:
- Logging and custom error messages are incorporated into execution paths throughout the system.
In your org:
- Logging and custom error messages aren't used.
Monitoring and Alerting In your org:
- Alerts are used only to inform users of scenarios that require human intervention; other failures are logged and reportable.
- Alerts are sent to users who are capable of responding to them.
- When possible, alerts are delivered before a potential failure.
In your org:
- Alerts are sent when any type of failure occurs, regardless of whether follow-on actions are required.
- Alerts about issues requiring technical solutions are delivered to business users.
- Alerts are only delivered in response to failures that have already occurred.
In your documentation:
- Entry criteria for prompt-tuning alerts are defined based on direct and indirect generative AI feedback metrics.
In your documentation:
- There are no criteria defined for triggering prompt-tuning alerts for generative AI apps.

Continuity Planning

A key to business resilience is continuity planning, which focuses on how to enable people and systems to function through issues caused by an unplanned event. Business continuity plans (BCPs) take a people-oriented view of how to keep processes moving forward through crisis. Technical aspects of continuity planning are contained in the disaster-recovery portions of a BCP. See Technology Continuity.

Without adequate continuity plans, your organization may not know how to act during a crisis or system outage, and therefore not act at all. Ineffective continuity planning can have a catastrophic impact on customers, stakeholders, and the business. In the wake of an adverse event, each moment that passes without maintaining or recovering critical processes risks financial loss, reputational damage, threats to employee safety, and even regulatory noncompliance.

You can build better continuity planning into your systems by focusing your efforts in three areas: defining business continuity for Salesforce, planning for technology continuity, and building backup and restore capabilities.

Business Continuity

Your company may already have a BCP in place. If it does, make sure that Salesforce is included in it. If your company doesn’t have a BCP, work with your stakeholders to create one that covers your Salesforce orgs.

Salesforce is often relied upon to be a source of truth for customer data and essential business processes across many business divisions. As such, the role that Salesforce plays in a BCP may differ from the roles that other systems play. It’s likely that Salesforce will be involved in many high-priority areas for recovery.

To create relevant business continuity planning for Salesforce systems:

The patterns and anti-patterns for continuity planning show what proper and poor continuity planning looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify places in your system that need to be refactored.

To learn more about Salesforce tools for defining business continuity, see Salesforce Tools for Resiliency.

Technology Continuity

The goal of technology continuity is to make sure that issues with components in a system don’t prevent the business from maintaining essential operations. Salesforce prioritizes maintaining our services at the highest levels of availability and providing transparent information about any issues. You can see real-time information about Salesforce system performance and issues at trust.salesforce.com. As an architect building on Salesforce, your solutions benefit from the site reliability, security, and performance capabilities that Salesforce provides across the entire platform.

However, the overall continuity of your Salesforce solutions extends beyond the built-in services Salesforce provides. From an architectural perspective, Salesforce technology continuity planning has to begin with asking and answering questions about how Salesforce fits into your larger enterprise landscape. What kinds of systems integrate with Salesforce? How do external systems depend on processes or information in Salesforce? In your Salesforce orgs, what processes or functionality rely on AppExchange solutions? Do your users access Salesforce through third-party identity services or SSO?
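
Answering these questions systematically starts with a dependency inventory. Here's a minimal sketch in Python, where the `Dependency` class, the `continuity_gaps` helper, and all of the example systems are illustrative assumptions (not part of any Salesforce API), showing how an inventory can flag critical dependencies that lack a documented fallback:

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    """One external system in the Salesforce landscape (fields are illustrative)."""
    name: str
    kind: str           # e.g. "integration", "appexchange", "identity"
    critical: bool      # does an essential business process stop without it?
    has_fallback: bool  # is there a documented manual or redundant path?

def continuity_gaps(deps):
    """Return critical dependencies with no fallback: the single points of
    failure a technology continuity plan should address first."""
    return [d.name for d in deps if d.critical and not d.has_fallback]

landscape = [
    Dependency("ERP sync", "integration", critical=True, has_fallback=False),
    Dependency("SSO provider", "identity", critical=True, has_fallback=True),
    Dependency("CPQ package", "appexchange", critical=False, has_fallback=False),
]

print(continuity_gaps(landscape))  # ['ERP sync']
```

Even as a spreadsheet rather than code, this kind of inventory gives post-incident reviews a concrete starting point.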

To build better technology continuity in your Salesforce systems:

Treat any items that come out of your post-incident reviews like your other development items. Add them to your planning systems so that you can prioritize them and work on them.

The patterns and anti-patterns for continuity planning show what proper and poor technology continuity planning looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify places in your system that need to be refactored.

To learn more about Salesforce tools for technology continuity planning, see Salesforce Tools for Resiliency.

Building Backup and Restore Capabilities

Restoring backed-up copies of data or metadata can help return your org to its last known stable state. It can also provide a failover system during a catastrophic system failure or service interruption. Backing up your data and metadata regularly and storing your encrypted, backed-up copies in a secure location adds an additional layer of resilience to your architecture.
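
One simple way to verify that a stored copy is still the copy you backed up is to record a checksum alongside it and compare before restoring. A minimal sketch using Python's standard `hashlib`; the payload and workflow are invented for illustration:

```python
import hashlib

def checksum(payload: bytes) -> str:
    """SHA-256 digest recorded alongside each stored backup copy."""
    return hashlib.sha256(payload).hexdigest()

backup = b'{"records": [{"Id": "001", "Name": "Acme"}]}'
stored_digest = checksum(backup)

# Before restoring, verify the copy wasn't corrupted or tampered with in storage.
print(checksum(backup) == stored_digest)         # True: safe to restore
print(checksum(backup + b"x") == stored_digest)  # False: reject this copy
```

Commercial backup tools typically do this integrity checking for you; the point is that restore confidence depends on verifying copies, not just making them.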

Without backup and restore strategies, you can’t restore clean versions of your production data and metadata when they’re maliciously corrupted, when defects inadvertently make their way into production, or when a failure during a large data load corrupts production data. Any one of these scenarios can result in your business-critical production data becoming corrupt or even permanently lost. Setting up backup and restore technology offers a number of advantages in addition to continuity planning, including assisting with strategies for mitigating large data volumes and adhering to compliance-related retention policies.

To help ensure continuity with backup and restore strategies in your Salesforce solutions:

You may need a more granular backup strategy if your data volumes are so large that a full backup can't complete before the next backup is scheduled to start. You may also need a more granular strategy if your organization's data changes so frequently that capturing the updates between full backups is mission critical.
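
The first condition is a simple arithmetic check: a full backup fits only if data volume divided by throughput stays within the backup window. A sketch with purely illustrative numbers:

```python
def full_backup_fits(data_gb, throughput_gb_per_hr, window_hr):
    """Can a full backup finish inside its window before the next run starts?"""
    return data_gb / throughput_gb_per_hr <= window_hr

# 600 GB at 20 GB/hr takes 30 hours: too long for a nightly 8-hour window,
# which signals the need for a more granular partial or object-level strategy.
print(full_backup_fits(600, 20, 8))       # False: go granular
print(full_backup_fits(600, 20, 7 * 24))  # True: a weekly full backup still fits
```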

To make your backup strategy more granular:

Don’t ever stop performing full backups. It’s important to note that you should never eliminate full backups completely, even if data volumes result in long run times. For large data volumes, plan for regular but infrequent full backups (for example, weekly backups). Also plan for more frequent partial or object-specific backups (for example, nightly backups or backups every X number of hours). This approach gives you the flexibility to reconstruct the most complete and accurate dataset to use in your restore processes.

The patterns and anti-patterns for continuity planning show what proper and poor backup and restore capabilities look like in a Salesforce solution. Use them to validate your designs before you build, or to identify places in your system that need to be refactored.

Salesforce Backup and Recover

Salesforce Backup and Recover, an integrated Salesforce solution that includes Own Recover from the Own acquisition, protects important data from loss or corruption. Our highly secure, easy-to-set-up, always-available solution ensures business continuity and data resilience, and it simplifies compliance.

Use Salesforce Backup and Recover to prevent data loss, recover from data incidents quickly, and simplify your overall data management strategy. You can create backup policies for high-value and regulated data, and restore that data in just a few clicks.

Deploy Automated Backups

Automated daily backups protect all your crucial org data, including metadata, sandboxes, managed package data, file attachments, and more. Run backups as frequently as needed to meet your recovery point objective (RPO) goals and safeguard your deployments. Backups are always accessible and stored securely and compliantly. Continuous Data Protection is also available for even more sensitive or transactional data, allowing for faster recovery of rapidly changing, critical information.
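
The RPO relationship is worth making explicit: your worst-case data loss is the time between successive backups, so the backup interval must not exceed your RPO. A minimal sketch using Python's `timedelta` (the 4-hour RPO is an assumed example, not a recommendation):

```python
from datetime import timedelta

def meets_rpo(backup_interval, rpo):
    """The worst-case data loss equals the time between backups, so the
    backup interval must not exceed the recovery point objective."""
    return backup_interval <= rpo

rpo = timedelta(hours=4)
print(meets_rpo(timedelta(hours=24), rpo))  # False: daily backups miss a 4-hour RPO
print(meets_rpo(timedelta(hours=2), rpo))   # True: 2-hourly backups meet it
```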

Monitor Data Proactively

Detect unusual data activity, data loss, and corruption with proactive alerts that are sent directly to your email. Receive real-time alerts to identify statistical outliers or to create rules that notify you of unusual data activity, helping you detect incidents faster than ever before.
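
A "statistical outlier" rule like the ones described can be as simple as a z-score threshold over recent activity counts. A minimal illustration; the data and threshold are invented, and real alerting in Backup and Recover is configured in the product rather than hand-coded:

```python
from statistics import mean, stdev

def is_outlier(history, today, threshold=3.0):
    """Flag today's count if it sits more than `threshold` standard deviations
    from the historical mean (a simple statistical-outlier rule)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

deleted_per_day = [120, 135, 110, 128, 140, 125, 130]  # typical daily deletes
print(is_outlier(deleted_per_day, 131))   # False: within normal range
print(is_outlier(deleted_per_day, 5000))  # True: possible mass deletion
```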

Restore Precisely and Swiftly

Salesforce Backup and Recover expedites recovery by providing granular visibility into changes, allowing for the quick identification and restoration of affected data. Tools such as visual graphs highlight unwanted changes, while easy-to-use recovery features precisely restore affected objects, fields, and records.

Extend the Use of Your Backups

Our tools enable you to use backups for analytics, audits, and compliance, offering searchable historical data, open search for visibility into past data, and export capabilities for external analytics or warehousing. This lets you repurpose backups without additional Salesforce API usage.

Unified Backup Data Management

Backup and Recover offers a single console for consolidating all backups, management, operations, and compliance. This console allows you to access, manage, customize, and monitor backups for all your production orgs and sandboxes. With it, you can also execute data subject requests to ensure backup data compliance, and have full control to customize backup schedules, frequency, and retention policies.

Key Use Cases for Salesforce Backup and Recover

To learn more about Salesforce tools for backup and restore, see Salesforce Tools for Resiliency.

Continuity Planning Patterns and Anti-Patterns

This table shows a selection of patterns to look for or build in your org, and anti-patterns to avoid or target for remediation.

✨ Discover more patterns for continuity planning in the Pattern & Anti-Pattern Explorer.

Business Continuity

Patterns:

In your business:
- A "recovery first" mindset is adopted, with a focus on bringing the highest-priority business functions and capabilities out of impact as soon as possible.
- There is a maintenance schedule for the review of BCP test plans.

In your documentation:
- A BCP exists containing steps to continue processing or triage data if Salesforce becomes unavailable, a list of events that trigger the use of the BCP, and steps and intervals for BCP testing.
- The BCP includes upstream and downstream systems and dependencies.

In your test plans:
- The areas of your BCP related to processes and people are accounted for.

Anti-Patterns:

In your business:
- A "fix the problem" mentality is the only approach to incident management.
- BCP test plans aren't refreshed at regular intervals.

In your documentation:
- A BCP doesn't exist, isn't complete, or includes only Salesforce.

In your test plans:
- The areas of your BCP related to processes and people aren't accounted for.

Technology Continuity

Patterns:

In your business:
- You have evaluated whether you need to build intentional redundancy or failover systems.
- Incident recovery tactics are automated wherever possible.

In your documentation:
- The BCP accounts for additional resources or break-glass procedures that teams might need to respond to incidents effectively.

Anti-Patterns:

In your business:
- You haven't evaluated the need for intentional redundancy or failover systems.
- Incident recovery tactics are all manual.

In your documentation:
- The BCP doesn't include operational support needs.

Backup and Restore

Patterns:

In your documentation:
- A backup and restore strategy exists for both data and metadata.

At your company:
- Backups are stored in a secure location that only authorized users can access.
- Test plans and test logs show that data restores are tested in a Full or Partial Copy sandbox at least twice each year.

Anti-Patterns:

In your documentation:
- A backup and restore strategy doesn't exist or is incomplete, applying only to data or only to metadata, not both.

At your company:
- Backups aren't human readable.
- Backups are stored in locations that unauthorized business users can access.
- There is no data restoration process, or the data restoration process is untested.

Salesforce Tools for Resiliency

Tool | Description | Application Lifecycle Management | Incident Response | Continuity Planning
Apex Hammer Tests Learn about Salesforce Apex testing in current and new releases. X
Apex Stub API Build a mocking framework to streamline testing. X
Backup and Recover Automatically generate backups to prevent data loss. X
Big Objects Store and manage large volumes of data on the platform. X
Field History Tracking Track and display field history. X
Get Adoption and Security Insights for Your Organization Monitor the adoption and usage of Lightning Experience in your org. X
Manage Bulk Data Load Jobs Create, update, or delete large volumes of records with the Bulk API. X
Manage Real-Time Event Monitoring Events Manage event monitoring streaming and storage settings. X
Data and Storage Resources View your Salesforce org's storage limits and usage. X
Monitor Debug Logs Monitor logs and set flags to trigger logging. X
Monitor Login Activity with Login Forensics Identify behavior that may indicate identity fraud. X
Monitor Setup Changes with Setup Audit Trail Track recent setup changes made by admins. X
Monitor Training History View the Salesforce training classes that your users have taken. X
Monitoring Background Jobs Monitor background jobs in your organization. X
Monitoring Scheduled Jobs View report snapshots, scheduled Apex jobs, and dashboard refreshes. X
Scale Test Test system performance and interpret the results. X
Proactive Monitoring Minimize disruptions by using Salesforce monitoring services. X
Salesforce Data Mask Automatically mask data in a sandbox. X
The System Overview Page View usage data and limits for your organization. X
Use force:lightning:lint Analyze and validate code via the Salesforce CLI. X

Resources Relevant to Resilient

Resource | Description | Application Lifecycle Management | Incident Response | Continuity Planning
7 Anti-Patterns in Performance and Scale Testing Avoid common anti-patterns in performance and scale testing. X
Analyze Performance & Scale Hotspots in Complex Salesforce Apps Learn an approach for addressing performance and scalability issues in your org. X
Build a Disaster Recovery Plan (Trailhead) Build a disaster recovery plan. X
Business Continuity is More than Backup and Restore Get a comprehensive view of BCP. X
Design Standards Template Create design standards for your organization. X
Diagnostics and Monitoring tools in Salesforce Learn how to improve the quality and performance of your implementations. X
Guiding Principles for Business Continuity Planning Review the basic principles underlying effective BCP. X
How to Scale Test on Salesforce Learn the five phases of the scale testing lifecycle. X
Introduction to Performance Testing Learn how to develop a performance testing method. X
Monitor Your Organization Learn about self-service monitoring options. X
Test Strategy Template Create and customize scale and performance test plans. X
Test Strategy Template Ensure that your test strategy is complete. X
Understand Source-Driven Development (Trailhead) Learn about package development and scratch orgs. X
