
Introduction

Resilient solutions maintain a high quality of service even when failures occur. If performance degrades or service is interrupted, the solution quickly and effectively recovers.

The resilience of a solution is grounded in two key qualities:

To architect your solution for resilience, you must design for both toughness and elasticity, ensuring durability and rapid recovery in the face of planned and unplanned changes.

In technology contexts, consider a system or solution as a collection of interdependent components that coordinate to perform shared goals. Every component has the potential to fail. Problems within those components, from code and configuration defects to network and hardware issues, can cause unexpected, undesired behavior. A system demonstrates resilient behavior when one or more components fail, but the overall system continues to function or quickly returns to a stable state.

To improve the resilience of your Salesforce solutions, we recommend focusing on three key habits.

Application Lifecycle Management

Application lifecycle management (ALM) is a practice for holistically managing software throughout its lifecycle, from creation through retirement. ALM is a cornerstone of system resiliency and encompasses people, processes, tools, and disciplines related to the application lifecycle. Those disciplines include DevOps and delivery methodologies, observability, testing strategies, governance, and CI/CD.

When a business practices effective ALM, its teams react quickly to change, and its applications keep pace with evolving business requirements without compromising stability or quality.

On the other hand, without healthy ALM, teams struggle at every stage of the application lifecycle.

Symptoms of poor ALM include:

Because ALM touches nearly every aspect of a solution, establishing clear and effective ALM practices for your solution is a key part of your architectural work.

Build better ALM practices by focusing on three key areas.

Release Management

Release management involves planning, sequencing, controlling, and migrating changes into one or more environments. A single release is a group of planned changes that a team moves into a target environment at the same time.

Releasing a change to a system introduces risk to it. If the system is in a stable state before the change, it transitions to a new state, where it’s also more vulnerable to risks from future changes. If any future changes trigger an uncontrolled, unstable state in the system, they can cause a critical incident. In a solution architecture, designing for resilient releases is more than just testing individual changes effectively. It also involves planning how to introduce changes to your systems and their users safely.
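The state transitions described here can be made concrete with a small sketch. This is an illustrative Python model, not a Salesforce API: a release gate validates a candidate state and falls back to the last known stable state when validation fails.

```python
import copy

class ReleaseGate:
    """Treats each release as a guarded state transition: if post-deploy
    validation fails, the system returns to the last known stable state
    instead of remaining in an unstable one. Illustrative only."""

    def __init__(self, initial_state):
        self.stable_state = copy.deepcopy(initial_state)
        self.current_state = initial_state

    def release(self, changes, validate):
        candidate = {**self.current_state, **changes}
        if validate(candidate):
            self.current_state = candidate
            self.stable_state = copy.deepcopy(candidate)  # new stable baseline
            return "released"
        # Roll back: the failed change never becomes the new baseline.
        self.current_state = copy.deepcopy(self.stable_state)
        return "rolled back"

# Example: a validation rule that rejects releases without a version tag.
gate = ReleaseGate({"version": "1.0"})
print(gate.release({"version": "1.1"}, validate=lambda s: s.get("version")))   # released
print(gate.release({"version": None}, validate=lambda s: s.get("version")))   # rolled back
print(gate.current_state)  # back at the last stable state: {'version': '1.1'}
```

In practice, the validation step would be your deployment checks and smoke tests; the point is that a failed change never becomes the new baseline.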

The work that your teams do depends on predictable and accurate release information. In your change management and enablement processes, be clear about which changes can move into your system. In your release management and enablement processes, specify how—and how often—changes are released to your system.

Your business stakeholders also care about release information, especially if it’s related to features or bug fixes that they request. To build trust in your solution and demonstrate value to your stakeholders, establish release schedules that are consistent and clear and ship release artifacts that are stable.

To establish effective release management for Salesforce:

The best release mechanisms for your team are the most stable options that your team has the required skills for. These are the recommended release mechanisms, listed in order of stability. All of them are compatible with each other, so use several of them in tandem if that’s best for your company.

The patterns and anti-patterns for ALM show what proper and poor release management looks like for a Salesforce org. Use the patterns to validate your designs before you build, or to identify places in your system that need to be refactored.

To learn more about Salesforce tools for release management, see Salesforce Tools for Resiliency.

Environment Strategy

Salesforce provides a variety of environments for you to use during application development and testing cycles. An effective environment strategy for Salesforce requires understanding how to use the environments and what good management looks like. In ALM, how useful a development or testing environment is depends on its fidelity to and isolation from production.

A good environment strategy provides several benefits.

Teams often struggle to realize these benefits. Challenges to getting the most out of your development environments and strategy can come from several sources. One likely source is the type of development model that your teams follow.

In the older, org-based development approach, each environment needed to serve several functions. In addition to being where your team does its various kinds of work, it needed to be the source for your release artifacts (that is, the metadata that you wanted to deploy in a release). Because environments weren’t easy to set up or tear down, they were often overcrowded and full of metadata conflicts between teams, and they didn’t contribute meaningful speed or flexibility to ALM overall.

Using a source-based development model fundamentally shifts the relationship that environments have to your releases and release artifacts. In this model, source control is the source of the metadata that you want to release. Environments are just places where your teams do their work.
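As a minimal illustration of that shift, the release artifact can be assembled directly from version-controlled files rather than retrieved from an org. This Python sketch is purely conceptual; the directory layout and file names are hypothetical stand-ins, not the actual Salesforce source format.

```python
import pathlib
import tempfile

def build_manifest(source_root, suffix=".xml"):
    """In a source-based model, the release artifact is derived from
    version-controlled files, not from any environment. This collects
    the metadata files under a source directory into a sorted manifest."""
    root = pathlib.Path(source_root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob(f"*{suffix}"))

# Example with a throwaway directory standing in for a source repo.
with tempfile.TemporaryDirectory() as repo:
    objects = pathlib.Path(repo, "objects")
    objects.mkdir()
    (objects / "Account.object-meta.xml").write_text("<CustomObject/>")
    flows = pathlib.Path(repo, "flows")
    flows.mkdir()
    (flows / "Intake.flow-meta.xml").write_text("<Flow/>")
    print(build_manifest(repo))
    # → ['flows/Intake.flow-meta.xml', 'objects/Account.object-meta.xml']
```

Environments then become disposable workspaces: the manifest, not any org, defines what a release contains.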

However, following the source-based development model doesn’t alone guarantee a good environment strategy. Even with source control, teams can still struggle to set up conditions to test integrations with external systems, configurations that depend on metadata that isn’t in source control (such as managed packages or customizations that depend on data), and so on. In certain circumstances, the challenges of a source-based model are similar to the challenges that are typical of an org-based model.

To develop an effective environment strategy:

| | Scratch Org | Developer Sandbox | Developer Pro Sandbox | Partial Copy Sandbox | Full Sandbox |
| --- | --- | --- | --- | --- | --- |
| Supports Org Shape | Yes | No | No | No | No |
| Supports Source Tracking | Yes | Yes | Yes | No | No |
| Lifespan | 1–30 days | Manually controlled | Manually controlled | Manually controlled | Manually controlled |
| Refresh Interval | Not available | 1 day | 1 day | 5 days | 29 days |
| Release Preview Support | Developer controlled | Based on sandbox instance | Based on sandbox instance | Based on sandbox instance | Based on sandbox instance |
| Provisioning Time | > 5 minutes | Hours or days | Hours or days | Hours or days | Hours or days |
| Metadata Determined By | Source control | Production | Production | Production | Production |
| Data Determined By | Manual data load | Manual data load | Manual data load | Sandbox template | Production |
| Data Limit | 200 MB | 200 MB | 1 GB | 5 GB | Same as in production |

Refer to this table to learn which features and environments to use for several common development tasks.

Task Org Shape Source Tracking Frequent Refreshes Release Preview Support All metadata from production Partial Metadata from production Large datasets from production Partial datasets from production Compatible Environments
Prototyping X X X X X X X Scratch Orgs, Developer and Developer Pro Sandboxes
New Feature Investigations or Proof-of-Concept Development X X X X X X X Scratch Orgs, Developer and Developer Pro Sandboxes
User Acceptance Testing X X X X X X Developer, Developer Pro and Partial Copy sandboxes
Performance and Scale Testing X X X Full sandbox
User Training X X X X X* X Developer Pro, Partial Copy, and Full* sandboxes
*If required to complete a specific kind of work, otherwise use a less resource-intense environment

In addition, for Agentforce agents that use features such as the Einstein Data Library, knowledge articles, and unstructured data, comprehensive testing and accurate testing conditions require a Data 360 sandbox.

The list of patterns and anti-patterns for ALM shows what proper and poor environment management looks like in a Salesforce org. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools for environment management, see Salesforce Tools for Resiliency.

Signaling Strategy

A signaling strategy defines the critical signals and application instrumentation needed to detect, diagnose, and remediate failures before they cascade into system-wide degradation. Effective instrumentation transforms applications from passive victims of failure into active participants in their own resilience, capable of detecting problems, adapting their behavior, and coordinating graceful degradation when necessary.

When applications implement comprehensive instrumentation, they gain the ability to self-regulate under stress, communicate their health status to operators, and participate in coordinated recovery efforts. These capabilities allow systems to maintain service quality even as individual components experience distress. On the other hand, without proper instrumentation, applications become black boxes that fail silently until catastrophic symptoms appear. Teams react to problems only after users report them, and troubleshooting becomes an exercise in archaeology rather than observation.

These signals are exposed through standardized interfaces that allow both automated systems and human operators to assess application health. The instrumentation itself becomes part of the application's resilience strategy, enabling circuit breakers to trip based on error rates, autoscalers to respond to queue depths, and operators to make informed decisions during incidents.
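The circuit-breaker behavior mentioned here is a generic resilience pattern, and a minimal sketch shows the idea. This Python example is illustrative and assumes nothing about Salesforce APIs: after a threshold of consecutive failures, the breaker trips open and callers fail fast instead of piling load onto an unhealthy dependency.

```python
class CircuitBreaker:
    """Trips open after `threshold` consecutive failures so callers stop
    hammering an unhealthy dependency; any success resets the count.
    A minimal sketch: real breakers also add a half-open probe state."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.state = "closed"

    def call(self, operation):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"  # stop sending traffic downstream
            raise
        self.failures = 0  # a success resets the failure count
        return result

breaker = CircuitBreaker(threshold=2)
for _ in range(2):
    try:
        breaker.call(lambda: 1 / 0)  # a dependency that always fails
    except ZeroDivisionError:
        pass
print(breaker.state)  # → open
```

The error-rate threshold here is an assumption; in a real system it would be derived from the SLOs your teams agree on.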

The patterns and anti-patterns for ALM show what proper and poor signaling strategies look like in a Salesforce org. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools for a signaling strategy, see Salesforce Tools for Resiliency.

Testing Strategy

A test strategy is a set of guiding principles and standards for how to plan and run tests that gauge the success and failure of applications during ALM processes. A test strategy keeps every stakeholder who is involved in testing informed about and aligned with the priority, purpose, and scope of a given test. It also helps project teams create effective and thoughtful test plans.

Typically, developers or quality assurance and testing experts are involved in creating and executing specific tests. A test strategy helps ensure that these individuals know what kinds of tests need to be conducted for a given project and in what sequence to conduct them. A test strategy also helps ensure that teams have what they need to build well-formed tests, test plans, and artifacts (for example, test data sets, devices, and traffic or network simulators).

An effective testing strategy creates a clear picture of how, when, where, and why to run different test types—including unit tests, UI tests, and regression tests—in various combinations and conditions to uncover how your system and any in-flight changes will behave. An effective test strategy produces tests that show you how well a system conforms to non-functional requirements—such as scalability, reliability, and usability—which can be difficult to measure through a single kind of test.
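One concrete ingredient of such a strategy is isolating unit tests from live services by injecting a stub in place of a real client. This hedged Python sketch (not Apex; the function, endpoint, and payload are hypothetical) shows the shape of the technique:

```python
import json
import unittest
from unittest import mock

def fetch_account_status(http_get, account_id):
    """Looks up an account's status via an injected HTTP client so that
    tests can substitute a stub for the real service. The endpoint and
    payload shape here are illustrative assumptions."""
    response = http_get(f"/accounts/{account_id}")
    return json.loads(response)["status"]

class FetchAccountStatusTest(unittest.TestCase):
    def test_uses_stubbed_response_instead_of_live_service(self):
        # The stub plays the role of a mock/data factory: deterministic
        # test data, no network dependency.
        stub = mock.Mock(return_value='{"status": "active"}')
        self.assertEqual(fetch_account_status(stub, "001"), "active")
        stub.assert_called_once_with("/accounts/001")

unittest.main(argv=["tests"], exit=False)
```

The same principle appears in the Apex patterns later in this guide: data factories and stubs keep unit tests fast, repeatable, and independent of org data.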

To create effective testing strategies for Salesforce:

ALM Patterns and Anti-Patterns

The following table shows a selection of patterns to look for or build in your org, and anti-patterns to avoid or target for remediation.

✨ Discover more patterns for ALM in the Pattern & Anti-Pattern Explorer.

Patterns Anti-Patterns
Release Management In production:
- Metadata shows use of stable release mechanisms, such as:
-- Metadata being organized into unlocked packages
-- DevOps Center being active and installed
-- Deployments via the Metadata API using the source format
- Deployment logs show no failed deployments within the available history.
- Deployment history shows clear release cadences and fairly uniform deployment clusters within release windows.
In production:
- Metadata indicates use of org-based release mechanisms, such as:
-- Active use of change sets
-- Deployments via Metadata API use package.xml format
- Deployment logs show repeated instances of failed deployments within the available history.
- Deployments have no discernible cadence, or show uneven clusters of deployments, which are signs of hot fixes and ad hoc rollbacks.
- DevOps Center isn't enabled and installed.
In your roadmap and documentation:
- Release names are clear.
- Features are clearly tied to a specific, named release.
- Release names are searchable and discoverable.
- Teams can find and follow clear guidelines for tagging artifacts, development items, and other work with the correct release names.
- It's possible to pull together a clear view of a release manifest by a release name.
- Quality thresholds for generative AI apps are defined for different development stages.
In your roadmap and documentation:
- Release names aren't included.
- Features aren't clearly tied to a specific release.
- Release names are used ad hoc or don't exist.
- Teams refer to artifacts, development items, and other work in different ways.
- It's not possible to pull together a clear view of a release manifest using a release name.
- Quality thresholds for generative AI apps aren't defined, or if they are, aren't defined for different development stages.
Environment Strategy In your orgs:
- A source-driven development and release model is adopted.
- Source tracking is enabled for Developer and Developer Pro sandboxes.
- Metadata in a given environment is independent from release artifacts.
- Environments don't directly correspond to a release path.
- Release paths for a change depend on the type of the change (high risk, medium risk, or low risk).
- Overcrowded environments don't exist.
- Risky configuration changes are never made directly in production.
- No releases occur during peak business hours.
- Data 360 sandboxes are used to properly test agentic use cases that require the Einstein Data Library, knowledge articles, and unstructured data.
In your orgs:
- An org-based development and release model is adopted.
- Source tracking isn't enabled for Developer and Developer Pro sandboxes.
- Metadata in a given environment is a release artifact.
- Environments directly correspond to a release path.
- The release path for every change is the same, regardless of the type of change.
- Overcrowded environments exist.
- Risky configuration changes are made directly in production.
- Releases occur during peak business hours.
- Agentforce agents that require the Einstein Data Library, knowledge articles, and unstructured data aren't tested using Data 360 sandboxes.
Signaling Strategy In your orgs:
- Teams collaborate on defining and standardizing health check APIs and SLOs.
- The regular review and refinement of signaling strategies are part of post-mortems and operational readiness reviews.
In production:
- Health checks are implemented for all applications.
- Applications provide explicit signals about their health, such as their load and capabilities.
- Applications are designed to degrade gracefully when dependencies are unhealthy.
- Load shedding is used to prevent cascading failures.
In your design:
- Backpressure and load-shedding mechanisms prevent services from being overwhelmed by traffic.
- It's assumed that dependencies eventually fail. Signal handlers are built to ameliorate failures.
In your orgs:
- Teams operate in silos, creating inconsistent and incompatible health-signaling mechanisms.
- Signaling strategies are an afterthought, only addressed when an incident occurs.
In production:
- Components fail silently without signaling their health status.
- Applications retry requests to unhealthy services indefinitely.
- All requests are treated with the same priority, regardless of their importance.
- To identify problems, operators rely solely on reactive measures, such as user complaints or critical system failures.
In your design:
- It's assumed that all dependencies will always be available, and network partitions, latency spikes, or other common issues aren't accounted for.
- Applications accept all incoming requests, even when they are overloaded, leading to increased latency and a higher likelihood of failure.
Testing Strategy In your business:
- Usability tests employ a variety of devices and assistive technology.
- Simulators are used to replicate production-like conditions for scalability and performance testing.
- Tests are automated to run when changes come into source control.
- Endurance, stress, performance, and scale tests are run at several intervals in the application development cycle and considered ongoing tasks.
- You include scale testing as part of your QA process when you have B2C-scale apps, large volumes of users, or large volumes of data.
- Your scale tests are focused on high-priority aspects of the system.
- Your scale tests have well-defined criteria.
- You conduct scale testing in a Full sandbox.
- Prompt engineering includes a quality review by a human.
- Agentforce Testing Center is used for robust agent testing.
In your business:
- Usability tests aren't conducted, or if they are, are conducted on a limited set of devices.
- Production-like volumes of user requests, API traffic, and variations in network speed aren't tested.
- Test automation isn't in place.
- Endurance, stress, performance, and scale tests are considered a phase or stage of development rather than ongoing tasks.
- You don't conduct scale tests as a part of your QA process, and you have B2C-scale apps, large volumes of users, or large volumes of data.
- Your scale tests aren't prioritized.
- Your scale tests don't have well-defined criteria.
- You conduct scale tests in a Partial Copy or Developer sandbox.
- Prompt engineering doesn't include a quality review by a human.
- Agentforce agents aren't tested, or if they are, tested only ad hoc using Agent Builder.
In your org:
- All test data is scrubbed of sensitive and identifying data.
In your org:
- Test data is identical to production data.
In Apex:
- Data factory patterns are used for unit tests.
- Mocks and stubs are used to simulate API responses.
In Apex:
- Unit tests rely on org data.
- Mocks and stubs aren't used.
In your design standards and documentation:
- Environments are classified by the types of tests they can support.
- Appropriate test regimes are specified according to risk, use case, or complexity.
In your design standards and documentation:
- Which types of tests each environment supports isn't clear.
- Test regimes aren't categorized by risk, use case, or complexity.

Incident Response

In security and site reliability engineering (SRE), incident response is focused on how teams identify and address events that impact the overall availability or security of a system, as well as how teams work to address root causes and prevent future issues. Incident response involves the processes, tools, and organizational behaviors required to address issues in real time and after an issue occurs.

As an architect, you may not be the person monitoring your solution’s operations on a day-to-day basis once it goes live. Part of architecting for resilience is designing recovery capabilities that enable support teams to perform first-level diagnosis, stabilize systems, and effectively hand over the investigation and root cause mitigation to development or maintenance teams. Teams directly supporting users on a day-to-day basis may not have a deep understanding of or expertise in the architecture of the system. It’s essential for these teams to have the tools and processes that they need to monitor daily operations, access information from the system when diagnosing a potential incident, and serve as effective first-responders for any issues impacting availability.

You can improve how well teams respond to incidents in your Salesforce solutions by focusing on your time to recover, ability to triage, and monitoring and alerting.

Time to Recover

When an incident occurs, the first priority must be restoring systems to a stable operational state. Often, businesses think that the only way to recover from an incident is to “fix the problem.” This assumption is fair in that accurate root cause analysis and remediation is how you ultimately resolve critical issues in a system. However, “fixing the problem” during the early stages of crisis response isn’t the most practical approach. Depending on the severity of an incident, every second of downtime or degraded service can cost the business revenue and reputation.

Often, attempting to diagnose and address root causes delays efforts to restore a system to operation. Logistically, adopting an approach that asks incident responders to address root causes puts tremendous strain on the subject matter experts (SMEs) and support staff at your company. Working to find and fix root causes during an incident requires SMEs to be on call for every incident, which can block frontline, customer-facing support staff from taking action. It can also result in teams releasing changes that, in turn, create more incidents. Ultimately, such an approach increases costs, consumes bandwidth across teams, and leads to behaviors in times of crisis that can erode customer trust and brand reputation.

The right incident management paradigm is to prioritize and focus on recovery as a first step. After a system is restored to stability, you can follow up with blameless postmortems, incident investigations, root cause remediation, and similar activities. This order of operations better enables incident response staff to triage, diagnose, and execute recovery tactics, alerting relevant SMEs to assist only as necessary. It also enables SMEs to identify and fix the root causes of an incident with less pressure from a ticking clock.

To adopt a recovery-first mindset to incident response:

| Incident Type | Apparent Trigger | Recovery Tactics |
| --- | --- | --- |
| System Outage | Corrupted logins or issues with account access | An account recovery policy |
| System Outage | Service unavailability | Activating a redundant, backup service; manual workarounds |
| Production Bug | A recent change | Deployment rollback or redeployment of the previous version |
| Production Bug | An emergent, unexplained bug | Manual workarounds, disabling non-essential features, escalating to SMEs |
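Tactics like those above can be encoded as a runbook lookup that first responders, or automation, can consult during triage. A hedged Python sketch, with illustrative entries only:

```python
# A runbook keyed by (incident type, apparent trigger). The entries
# mirror the table above and are illustrative, not exhaustive.
RUNBOOK = {
    ("system outage", "account access"): ["apply account recovery policy"],
    ("system outage", "service unavailability"): [
        "activate redundant backup service",
        "enable manual workarounds",
    ],
    ("production bug", "recent change"): ["roll back deployment"],
    ("production bug", "unexplained"): [
        "enable manual workarounds",
        "disable non-essential features",
        "escalate to SMEs",
    ],
}

def recovery_tactics(incident_type, trigger):
    """Returns the first-response tactics for a triaged incident,
    defaulting to escalation when no runbook entry matches."""
    return RUNBOOK.get((incident_type, trigger), ["escalate to SMEs"])

print(recovery_tactics("production bug", "recent change"))  # → ['roll back deployment']
```

Keeping the lookup in one place also gives post-incident reviews a concrete artifact to refine after each recovery.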

The patterns and anti-patterns for incident response show what architecting to prioritize recovery looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools to help with time to recover, see Salesforce Tools for Resiliency.

Ability to Triage

In the context of technology, triaging involves assigning categories and levels of severity to issues and support requests. No matter how well planned your solution is, user support issues and requests will arise. These issues can stem from a lack of sufficient training or change management, gaps in UI/UX, unexpected end-user behaviors, and urgent system issues not caught by monitoring or alerting.

Support and operations teams need to be able to investigate user support queries efficiently and diagnose them quickly. Triaging issues to filter out less severe concerns and quickly spot critical system incidents is a key competency for these teams. Poor triaging slows all levels of user support, prolongs critical incidents, and increases the risk of further disruptions to your customers and your business.

Although you may not be involved in day-to-day operations and support, as an architect, it’s your responsibility to help ensure that your support and operations teams can effectively triage issues in any solution that you create on the Salesforce platform.
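To make the idea concrete, a triage rule can map a few coarse impact signals to a severity level. The signals and thresholds in this Python sketch are hypothetical, not a Salesforce standard:

```python
def triage(users_affected, business_critical, workaround_exists):
    """Assigns a severity level from a few coarse impact signals so that
    support teams route critical incidents ahead of routine requests.
    The thresholds are illustrative assumptions only."""
    if business_critical and not workaround_exists:
        return "sev1"  # page the on-call responder immediately
    if users_affected > 100 or business_critical:
        return "sev2"  # urgent, but a workaround buys time
    if users_affected > 1:
        return "sev3"  # routine support queue
    return "sev4"      # single-user question or training gap

print(triage(users_affected=500, business_critical=True, workaround_exists=False))  # → sev1
print(triage(users_affected=1, business_critical=False, workaround_exists=True))    # → sev4
```

Whatever the exact rules, writing them down and automating them keeps triage consistent across support staff and shifts.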

To enable teams to effectively triage issues within your Salesforce solutions:

The patterns and anti-patterns for incident response show what architecting for effective triaging looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools to help with triaging, see Salesforce Tools for Resiliency.

Monitoring and Alerting

Monitoring and alerting are widely used terms in site reliability engineering. In the context of system resiliency, monitoring is continuously assessing the current state of a system, and alerting is automatically notifying stakeholders about potential concerns with that state. Effective monitoring and alerting is a key part of decoupling the scale and growth of your system from the scale and growth of your support staff.
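The division of labor between monitoring and alerting can be sketched simply: record every failure, but notify a human only when intervention is required. This Python example is illustrative; the threshold and event shape are assumptions:

```python
import logging

logging.basicConfig(level=logging.INFO)

def handle_failure(event, needs_human, notify, log=logging.getLogger("ops")):
    """Routes a failure event: alert a person only when intervention is
    required; otherwise record it so it stays reportable. A generic
    sketch of the monitoring/alerting split, not a Salesforce API."""
    if needs_human(event):
        notify(event)  # page someone who is capable of acting on it
        return "alerted"
    log.info("recorded: %s", event)  # logged and reportable, but no page
    return "logged"

pages = []
result = handle_failure(
    {"error_rate": 0.35},
    needs_human=lambda e: e["error_rate"] > 0.25,  # illustrative threshold
    notify=pages.append,
)
print(result, pages)  # → alerted [{'error_rate': 0.35}]
```

This mirrors the patterns later in this section: alerts go only to people who can respond, and everything else is logged for reporting.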

Salesforce provides a variety of built-in capabilities to monitor behaviors in your system. Salesforce also offers real-time event monitoring as an add-on or as part of Salesforce Shield. In any Salesforce solution, designs architected for monitoring and alerting provide:

To architect for effective monitoring and alerting within your Salesforce solutions:

The list of patterns and anti-patterns shows what architecting for effective monitoring and alerting looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify areas of your system that need to be refactored.

To learn more about Salesforce tools for monitoring and alerting, see Salesforce Tools for Resiliency.

Incident Response Patterns and Anti-Patterns

This table shows a selection of patterns to look for or build in your org, and anti-patterns to avoid or target for remediation.

✨ Discover more patterns for incident response in the Pattern & Anti-Pattern Explorer.

Patterns Anti-Patterns
Time to Recover In your business:
- Recovery protocols are practiced at regular intervals.
- Teams know which services in production they own and are responsible for.
- Teams understand relevant tooling to support the diagnosis of issues.
In your business:
- Recovery protocols don't exist or aren't practiced at regular intervals.
- Which teams own and are responsible for the different services in production isn't clear.
- Teams have no guidance or standards on tooling to support the diagnosis of issues.
In your documentation:
- Recovery tactics are defined and classified by incident type and trigger.
- Exit criteria for incident responses are included in SLOs and are clear.
- Activation criteria and assignment logic for elevated permissions during incidents are clear.
- Incident response permission sets and authorizations are clearly listed.
- A troubleshooting guide to assist with identifying and diagnosing common issues exists.
In your documentation:
- Incident response is performed ad hoc.
- Exit criteria for incident responses don't exist.
- Elevated permissions aren't assigned, or if they are, are assigned ad hoc.
- Incident response permission sets and authorizations aren't listed.
In your org:
- Session-based permission sets for incident response exist and can be assigned to support staff during recovery.
- Setup Audit Trail shows that designated recovery testers logged into the testing environment at the agreed-upon time and followed recovery test scripts.
In your org:
- Session-based permission sets don't exist for incident response, or if they do, support staff aren't authorized to use them.
- Setup Audit Trail shows that designated recovery testers didn't log in to the testing environment or didn't follow recovery test scripts.
In your test plans:
- Test scripts for recovery testing exist and are repeatable.
- Environments for incident simulations are clearly listed.
In your test plans:
- Test scripts for recovery testing don't exist.
- Environments for incident simulations aren't established.
Ability to Triage In your business:
- SMEs or stakeholders who should be alerted to support complex issues are identified before an incident occurs.
- The handoff between delivery and support teams is a part of go-live.
- If consulted, Salesforce architects respond quickly and help the team stay focused on recovery.
In your business:
- SMEs or stakeholders who should be alerted aren't identified until an incident occurs.
- The handoff between delivery teams and support teams isn't a part of the release process.
- Salesforce architects consider incident response to be outside their scope of work.
In your documentation:
- System and design patterns used in a given solution are discoverable and readable by support staff.
In your documentation:
- System and design patterns used in a given solution aren't readily available to support staff.
In your org:
- Logging and custom error messages are incorporated into execution paths throughout the system.
In your org:
- Logging and custom error messages aren't used.
Monitoring and Alerting In your org:
- Alerts are used only to inform users of scenarios that require human intervention; other failures are logged and reportable.
- Alerts are sent to users who are capable of responding to them.
- When possible, alerts are delivered before a potential failure.
In your org:
- Alerts are sent when any type of failure occurs, regardless of whether follow-on actions are required.
- Alerts about issues requiring technical solutions are delivered to business users.
- Alerts are only delivered in response to failures that have already occurred.
In your documentation:
- Entry criteria for prompt-tuning alerts are defined based on direct and indirect generative AI feedback metrics.
In your documentation:
- There are no criteria defined for triggering prompt-tuning alerts for generative AI apps.

Continuity Planning

A key to business resilience is continuity planning, which focuses on how to enable people and systems to function through issues caused by an unplanned event. Business continuity plans (BCPs) take a people-oriented view of how to keep processes moving forward through crisis. Technical aspects of continuity planning are contained in the disaster-recovery portions of a BCP. See Technology Continuity.

Without adequate continuity plans, your organization may not know how to act during a crisis or system outage, and therefore not act at all. Ineffective continuity planning can have a catastrophic impact on customers, stakeholders, and the business. In the wake of an adverse event, each moment that passes without maintaining or recovering critical processes risks financial loss, reputational damage, threats to employee safety, and even regulatory noncompliance.

You can build better continuity planning into your systems by focusing your efforts in three areas: defining business continuity for Salesforce, planning for technology continuity, and building backup and restore capabilities.

Business Continuity

Your company may already have a BCP in place. If it does, make sure that Salesforce is included in it. If your company doesn’t have a BCP, work with your stakeholders to create one that covers your Salesforce orgs.

Salesforce is often relied upon to be a source of truth for customer data and essential business processes across many business divisions. As such, the role that Salesforce plays in a BCP may differ from the roles that other systems play. It’s likely that Salesforce will be involved in many high-priority areas for recovery.

To create relevant business continuity planning for Salesforce systems:

The patterns and anti-patterns for continuity planning show what proper and poor continuity planning looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify places in your system that need to be refactored.

To learn more about Salesforce tools for defining business continuity, see Salesforce Tools for Resiliency.

Technology Continuity

The goal of technology continuity is to make sure that issues with components in a system don’t prevent the business from maintaining essential operations. Salesforce prioritizes maintaining our services at the highest levels of availability and providing transparent information about any issues. You can see real-time information about Salesforce system performance and issues at trust.salesforce.com. As an architect building on Salesforce, your solutions benefit from the site reliability, security, and performance capabilities that Salesforce provides across the entire platform.

However, the overall continuity of your Salesforce solutions extends beyond the built-in services Salesforce provides. From an architectural perspective, Salesforce technology continuity planning has to begin with asking and answering questions about how Salesforce fits into your larger enterprise landscape. What kinds of systems integrate with Salesforce? How do external systems depend on processes or information in Salesforce? In your Salesforce orgs, what processes or functionality rely on AppExchange solutions? Do your users access Salesforce through third-party identity services or SSO?
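
Answering these questions systematically starts with a dependency inventory. Here's a minimal sketch in Python, where the `Dependency` class, the `continuity_gaps` helper, and all of the example systems are illustrative assumptions (not part of any Salesforce API), showing how an inventory can flag critical dependencies that lack a documented fallback:

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    """One external system in the Salesforce landscape (fields are illustrative)."""
    name: str
    kind: str           # e.g. "integration", "appexchange", "identity"
    critical: bool      # does an essential business process stop without it?
    has_fallback: bool  # is there a documented manual or redundant path?

def continuity_gaps(deps):
    """Return critical dependencies with no fallback: the single points of
    failure a technology continuity plan should address first."""
    return [d.name for d in deps if d.critical and not d.has_fallback]

landscape = [
    Dependency("ERP sync", "integration", critical=True, has_fallback=False),
    Dependency("SSO provider", "identity", critical=True, has_fallback=True),
    Dependency("CPQ package", "appexchange", critical=False, has_fallback=False),
]

print(continuity_gaps(landscape))  # ['ERP sync']
```

Even as a spreadsheet rather than code, this kind of inventory gives post-incident reviews a concrete starting point.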

To build better technology continuity in your Salesforce systems:

Treat any items that come out of your post-incident reviews like your other development items. Add them to your planning systems so that you can prioritize them and work on them.

The patterns and anti-patterns for continuity planning show what proper and poor technology continuity planning looks like in a Salesforce solution. Use them to validate your designs before you build, or to identify places in your system that need to be refactored.

To learn more about Salesforce tools for technology continuity planning, see Salesforce Tools for Resiliency.

Building Backup and Restore Capabilities

Restoring backed-up copies of data or metadata can help return your org to its last known stable state. It can also provide a failover system during a catastrophic system failure or service interruption. Backing up your data and metadata regularly and storing your encrypted, backed-up copies in a secure location adds an additional layer of resilience to your architecture.
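
One simple way to verify that a stored copy is still the copy you backed up is to record a checksum alongside it and compare before restoring. A minimal sketch using Python's standard `hashlib`; the payload and workflow are invented for illustration:

```python
import hashlib

def checksum(payload: bytes) -> str:
    """SHA-256 digest recorded alongside each stored backup copy."""
    return hashlib.sha256(payload).hexdigest()

backup = b'{"records": [{"Id": "001", "Name": "Acme"}]}'
stored_digest = checksum(backup)

# Before restoring, verify the copy wasn't corrupted or tampered with in storage.
print(checksum(backup) == stored_digest)         # True: safe to restore
print(checksum(backup + b"x") == stored_digest)  # False: reject this copy
```

Commercial backup tools typically do this integrity checking for you; the point is that restore confidence depends on verifying copies, not just making them.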

Without backup and restore strategies, you can’t restore clean versions of your production data and metadata when they’re maliciously corrupted, when defects inadvertently make their way into production, or when a failure during a large data load corrupts production data. Any one of these scenarios can result in your business-critical production data becoming corrupt or even permanently lost. Setting up backup and restore technology offers a number of advantages in addition to continuity planning, including assisting with strategies for mitigating large data volumes and adhering to compliance-related retention policies.

To help ensure continuity with backup and restore strategies in your Salesforce solutions:

You may need a more granular backup strategy if your data volumes are so large that a full backup can't complete before the next backup is scheduled to start. You may also need a more granular strategy if your organization's data changes so frequently that capturing the updates between full backups is mission critical.
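
The first condition is a simple arithmetic check: a full backup fits only if data volume divided by throughput stays within the backup window. A sketch with purely illustrative numbers:

```python
def full_backup_fits(data_gb, throughput_gb_per_hr, window_hr):
    """Can a full backup finish inside its window before the next run starts?"""
    return data_gb / throughput_gb_per_hr <= window_hr

# 600 GB at 20 GB/hr takes 30 hours: too long for a nightly 8-hour window,
# which signals the need for a more granular partial or object-level strategy.
print(full_backup_fits(600, 20, 8))       # False: go granular
print(full_backup_fits(600, 20, 7 * 24))  # True: a weekly full backup still fits
```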

To make your backup strategy more granular:

Don’t ever stop performing full backups. It’s important to note that you should never eliminate full backups completely, even if data volumes result in long run times. For large data volumes, plan for regular but infrequent full backups (for example, weekly backups). Also plan for more frequent partial or object-specific backups (for example, nightly backups or backups every X number of hours). This approach gives you the flexibility to reconstruct the most complete and accurate dataset to use in your restore processes.

The patterns and anti-patterns for continuity planning show what proper and poor backup and restore capabilities look like in a Salesforce solution. Use them to validate your designs before you build, or to identify places in your system that need to be refactored.

Salesforce Backup and Recover

Salesforce Backup and Recover, an integrated Salesforce solution that includes Own Recover from the Own acquisition, protects important data from loss or corruption. Our highly secure, easy-to-set-up, always-available solution ensures business continuity and data resilience, and it simplifies compliance.

Use Salesforce Backup and Recover to prevent data loss, recover from data incidents quickly, and simplify your overall data management strategy. You can create backup policies for high-value and regulated data, and restore that data in just a few clicks.

Deploy Automated Backups

Automated daily backups protect all your crucial org data, including metadata, sandboxes, managed package data, file attachments, and more. Run backups as frequently as needed to meet your recovery point objective (RPO) goals and safeguard your deployments. Backups are always accessible and stored securely and compliantly. Continuous Data Protection is also available for even more sensitive or transactional data, allowing for faster recovery of rapidly changing, critical information.
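
The RPO relationship is worth making explicit: your worst-case data loss is the time between successive backups, so the backup interval must not exceed your RPO. A minimal sketch using Python's `timedelta` (the 4-hour RPO is an assumed example, not a recommendation):

```python
from datetime import timedelta

def meets_rpo(backup_interval, rpo):
    """The worst-case data loss equals the time between backups, so the
    backup interval must not exceed the recovery point objective."""
    return backup_interval <= rpo

rpo = timedelta(hours=4)
print(meets_rpo(timedelta(hours=24), rpo))  # False: daily backups miss a 4-hour RPO
print(meets_rpo(timedelta(hours=2), rpo))   # True: 2-hourly backups meet it
```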

Monitor Data Proactively

Detect unusual data activity, data loss, and corruption with proactive alerts that are sent directly to your email. Receive real-time alerts to identify statistical outliers or to create rules that notify you of unusual data activity, helping you detect incidents faster than ever before.
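
A "statistical outlier" rule like the ones described can be as simple as a z-score threshold over recent activity counts. A minimal illustration; the data and threshold are invented, and real alerting in Backup and Recover is configured in the product rather than hand-coded:

```python
from statistics import mean, stdev

def is_outlier(history, today, threshold=3.0):
    """Flag today's count if it sits more than `threshold` standard deviations
    from the historical mean (a simple statistical-outlier rule)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

deleted_per_day = [120, 135, 110, 128, 140, 125, 130]  # typical daily deletes
print(is_outlier(deleted_per_day, 131))   # False: within normal range
print(is_outlier(deleted_per_day, 5000))  # True: possible mass deletion
```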

Restore Precisely and Swiftly

Salesforce Backup and Recover expedites recovery by providing granular visibility into changes, allowing for the quick identification and restoration of affected data. Tools such as visual graphs highlight unwanted changes, while easy-to-use recovery features precisely restore affected objects, fields, and records.

Extend the Use of Your Backups

Our tools enable you to use backups for analytics, audits, and compliance, offering searchable historical data, open search for visibility into past data, and export capabilities for external analytics or warehousing. This lets you repurpose backups without additional Salesforce API usage.

Unified Backup Data Management

Backup and Recover offers a single console for consolidating all backups, management, operations, and compliance. This console allows you to access, manage, customize, and monitor backups for all your production orgs and sandboxes. With it, you can also execute data subject requests to ensure backup data compliance, and have full control to customize backup schedules, frequency, and retention policies.

Key Use Cases for Salesforce Backup and Recover

To learn more about Salesforce tools for backup and restore, see Salesforce Tools for Resiliency.

Continuity Planning Patterns and Anti-Patterns

This table shows a selection of patterns to look for or build in your org, and anti-patterns to avoid or target for remediation.

✨ Discover more patterns for continuity planning in the Pattern & Anti-Pattern Explorer.

Business Continuity

Patterns:

In your business:
- A "recovery first" mindset is adopted, with a focus on bringing the highest-priority business functions and capabilities out of impact as soon as possible.
- There is a maintenance schedule for the review of BCP test plans.

In your documentation:
- A BCP exists containing steps to continue processing or triage data if Salesforce becomes unavailable, a list of events that trigger the use of the BCP, and steps and intervals for BCP testing.
- The BCP includes upstream and downstream systems and dependencies.

In your test plans:
- The areas of your BCP related to processes and people are accounted for.

Anti-Patterns:

In your business:
- A "fix the problem" mentality is the only approach to incident management.
- BCP test plans aren't refreshed at regular intervals.

In your documentation:
- A BCP doesn't exist, isn't complete, or includes only Salesforce.

In your test plans:
- The areas of your BCP related to processes and people aren't accounted for.

Technology Continuity

Patterns:

In your business:
- You have evaluated whether you need to build intentional redundancy or failover systems.
- Incident recovery tactics are automated wherever possible.

In your documentation:
- The BCP accounts for additional resources or break-glass procedures that teams might need to respond to incidents effectively.

Anti-Patterns:

In your business:
- You haven't evaluated the need for intentional redundancy or failover systems.
- Incident recovery tactics are all manual.

In your documentation:
- The BCP doesn't include operational support needs.

Backup and Restore

Patterns:

In your documentation:
- A backup and restore strategy exists for both data and metadata.

At your company:
- Backups are stored in a secure location that only authorized users can access.
- Test plans and test logs show that data restores are tested in a Full or Partial Copy sandbox at least twice each year.

Anti-Patterns:

In your documentation:
- A backup and restore strategy doesn't exist or is incomplete, applying only to data or only to metadata, not both.

At your company:
- Backups aren't human readable.
- Backups are stored in locations that unauthorized business users can access.
- There is no data restoration process, or the data restoration process is untested.

Salesforce Tools for Resiliency

Tool | Description | Application Lifecycle Management | Incident Response | Continuity Planning
Apex Hammer Tests Learn about Salesforce Apex testing in current and new releases. X
Apex Stub API Build a mocking framework to streamline testing. X
Backup and Recover Automatically generate backups to prevent data loss. X
Big Objects Store and manage large volumes of data on the platform. X
Field History Tracking Track and display field history. X
Get Adoption and Security Insights for Your Organization Monitor the adoption and usage of Lightning Experience in your org. X
Manage Bulk Data Load Jobs Create, update, or delete large volumes of records with the Bulk API. X
Manage Real-Time Event Monitoring Events Manage event monitoring streaming and storage settings. X
Data and Storage Resources View your Salesforce org's storage limits and usage. X
Monitor Debug Logs Monitor logs and set flags to trigger logging. X
Monitor Login Activity with Login Forensics Identify behavior that may indicate identity fraud. X
Monitor Setup Changes with Setup Audit Trail Track recent setup changes made by admins. X
Monitor Training History View the Salesforce training classes that your users have taken. X
Monitoring Background Jobs Monitor background jobs in your organization. X
Monitoring Scheduled Jobs View report snapshots, scheduled Apex jobs, and dashboard refreshes. X
Scale Test Test system performance and interpret the results. X
Proactive Monitoring Minimize disruptions by using Salesforce monitoring services. X
Salesforce Data Mask Automatically mask data in a sandbox. X
The System Overview Page View usage data and limits for your organization. X
Use force:lightning:lint Analyze and validate code via the Salesforce CLI. X

Resources Relevant to Resilient

Resource | Description | Application Lifecycle Management | Incident Response | Continuity Planning
7 Anti-Patterns in Performance and Scale Testing Avoid common anti-patterns in performance and scale testing. X
Analyze Performance & Scale Hotspots in Complex Salesforce Apps Learn an approach for addressing performance and scalability issues in your org. X
Build a Disaster Recovery Plan (Trailhead) Build a disaster recovery plan. X
Business Continuity is More than Backup and Restore Get a comprehensive view of BCP. X
Design Standards Template Create design standards for your organization. X
Diagnostics and Monitoring tools in Salesforce Learn how to improve the quality and performance of your implementations. X
Guiding Principles for Business Continuity Planning Review the basic principles underlying effective BCP. X
How to Scale Test on Salesforce Learn the five phases of the scale testing lifecycle. X
Introduction to Performance Testing Learn how to develop a performance testing method. X
Monitor Your Organization Learn about self-service monitoring options. X
Test Strategy Template Create and customize scale and performance test plans. X
Test Strategy Template Ensure that your test strategy is complete. X
Understand Source-Driven Development (Trailhead) Learn about package development and scratch orgs. X
