Resilient solutions handle change well. Resiliency is the ability to quickly and effectively recover from a problem or failure. Resilience is grounded in two key qualities: toughness and elasticity. Toughness in a system enables it to withstand and endure difficulties. Elasticity means a system is able to return to an ideal state or shape. Architecting for resilience means combining these two aspects into your systems to create strength and flexibility in the face of change — whether that change is intentional or unplanned.
In technology contexts, a system demonstrates resilient behavior by continuing to function or quickly returning to a stable state even if individual pieces in the system fail. Code and configuration defects can create unexpected (and unwanted) behavior, as can network and hardware issues. As a result, every component in your architecture has the potential to fail.
You can improve the resilience of your Salesforce solutions by focusing on three key habits: application lifecycle management, incident response, and continuity planning.
Application lifecycle management (ALM) is a software development practice that focuses on how software is created, delivered, and managed, from idea through end-of-life. Encompassing people, processes, and tools, ALM is a holistic way of looking at the big picture of how applications are conceived, approved, built, delivered, and managed, along with the more specific disciplines (including DevOps, specific delivery methodologies, testing strategies, governance, and CI/CD) that might be involved.
Healthy ALM means that the business can react quickly to changes, and applications can keep pace without compromising stability or quality. It is a cornerstone of resiliency. Without clear and practical ALM, teams will struggle at every stage of app creation, delivery, and maintenance. Symptoms of poor ALM include:
Because ALM touches nearly every aspect of a solution, establishing clear and effective ALM practices is a key part of architectural work.
You can build better ALM practices by focusing on three key areas: release management, environment strategy, and testing strategy.
Release management involves planning, sequencing, controlling, and migrating changes into different environments. Technically, a release can be any time you move changes into an environment. In this context, the term release refers to an intentional group of changes, moved into a given environment at the same time.
Introducing change into a stable system causes that system to transition from a stable state to a new state. During this transition, the system is vulnerable to further changes triggering an uncontrolled, unstable state — which can cause a critical incident. From an architectural point of view, designing for resilient releases is more than just ensuring individual changes undergo effective testing; it also includes planning for how changes are introduced into your systems (and to users) safely.
It’s critical to be clear about what, how, and how often changes will move into the system. If your project or business has official change management and enablement processes and teams, their work depends on predictable and accurate release information. Business stakeholders also care about release information — especially as it relates to features or bug fixes they’ve requested. Establishing consistent and clear release schedules, and shipping stable release artifacts, is an effective way to build trust in your solution and demonstrate value to your stakeholders.
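One way to make release information predictable is to derive it from how work items are tagged. The sketch below, in Python, groups changes by release name to produce a simple manifest per release. The record fields (`component`, `release`) and the function name `build_manifests` are hypothetical, chosen only for illustration; they do not correspond to any Salesforce API.

```python
from collections import defaultdict


def build_manifests(changes):
    """Group change records by release name.

    Each change is a dict with a "component" key and an optional "release" key
    (both hypothetical field names). Changes without a release name are
    returned separately so they can be tagged before anything ships.
    """
    manifests = defaultdict(list)
    untagged = []
    for change in changes:
        release = change.get("release")
        if release:
            manifests[release].append(change["component"])
        else:
            untagged.append(change["component"])
    return dict(manifests), untagged


# Illustrative data only.
changes = [
    {"component": "AccountTrigger", "release": "2024.06"},
    {"component": "Lead_Layout", "release": "2024.06"},
    {"component": "QuoteFlow", "release": None},
]
manifests, untagged = build_manifests(changes)
```

A non-empty `untagged` list is a useful signal in its own right: it points to work that cannot yet be tied to a specific, named release.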
To enable effective release management for Salesforce, consider:
Note: All of these release mechanisms use source-driven development and SFDX. You can mix and match these approaches to create the right release structures for your company and teams. You do not have to take an all-or-nothing approach. All of these options are fundamentally compatible with each other.
The list of patterns and anti-patterns below shows what proper (and poor) release management looks like for a Salesforce org. You can use these to validate your designs before you build, or identify places in your system that need to be refactored.
To learn more about Salesforce tools for release management, see Tools Relevant to Resilient.
Salesforce provides a variety of environments for you to use during application development and testing cycles. An effective environment strategy for Salesforce requires understanding how to use different environments and what good management looks like. This is a key competency for healthy ALM cycles. Environments are where work gets done. Their usefulness in ALM comes from the level of fidelity to production they provide, along with their isolation from production.
Compared to a poor environment strategy (or no strategy at all), a good environment strategy provides several benefits:
Teams often struggle to realize these benefits. Challenges to getting the most out of your development environments and strategy come from many sources. One key source is the type of development model your teams follow. In the older org-based development approach, environments had more than one role to fulfill. They had to be the place where various kinds of work happened, and they also had to be the source for your release artifacts (that is, the metadata that you want to deploy in a release). This often meant environments were not easy to set up or tear down, were often overcrowded and full of metadata conflicts between teams, and did not contribute meaningful speed or flexibility to ALM overall.
Using a source-based development model fundamentally shifts the relationship environments have to your releases and release artifacts. In this model, source control is the source of the metadata you want to release. Environments are just places where work gets done.
However, the source-based development model is not a guarantee of good environment strategy by itself. Even with source control, teams can still struggle to set up conditions to test external system integrations, configurations that depend on metadata not in source control (like managed packages or customizations that depend on data), and so on. This can lead to issues similar to those seen with an org-based model.
To develop an effective environment strategy, consider:
Features | Scratch Org | Developer Sandbox | Developer Pro Sandbox | Partial Copy Sandbox | Full Sandbox |
---|---|---|---|---|---|
Supports org shape | Yes | No | No | No | No |
Supports source tracking | Yes | Yes | Yes | No | No |
Lifespan | 1 - 30 days | Manually controlled | Manually controlled | Manually controlled | Manually controlled |
Refresh Interval | N/A | 1 day | 1 day | 5 days | 29 days |
Release Preview Support | Developer controlled | Based on sandbox instance | Based on sandbox instance | Based on sandbox instance | Based on sandbox instance |
Provisioning Time | >5 minutes | Hours - Days | Hours - Days | Hours - Days | Hours - Days |
Metadata determined by | Source control | Production | Production | Production | Production |
Data determined by | Manual data load | Manual data load | Manual data load | Sandbox template | Production |
Data limit | 200 MB | 200 MB | 1 GB | 5 GB | Matches Production |
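The comparison above can also be expressed as data, which makes it easier to shortlist environments for a given requirement. The following is a minimal sketch in Python. The `Environment` class and `shortlist` function are hypothetical names used for illustration, and the values are transcribed from the table rather than read from any Salesforce API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    name: str
    org_shape: bool
    source_tracking: bool
    refresh_interval_days: Optional[int]  # None means "N/A" in the table
    metadata_from: str
    data_from: str


# Values transcribed from the comparison table above.
ENVIRONMENTS = [
    Environment("Scratch Org", True, True, None, "Source control", "Manual data load"),
    Environment("Developer Sandbox", False, True, 1, "Production", "Manual data load"),
    Environment("Developer Pro Sandbox", False, True, 1, "Production", "Manual data load"),
    Environment("Partial Copy Sandbox", False, False, 5, "Production", "Sandbox template"),
    Environment("Full Sandbox", False, False, 29, "Production", "Production"),
]


def shortlist(source_tracking=None, max_refresh_days=None, data_from=None):
    """Return the names of environments that meet every requirement that is set."""
    matches = []
    for env in ENVIRONMENTS:
        if source_tracking is not None and env.source_tracking != source_tracking:
            continue
        # Assumption: an "N/A" refresh interval never disqualifies an environment.
        if (
            max_refresh_days is not None
            and env.refresh_interval_days is not None
            and env.refresh_interval_days > max_refresh_days
        ):
            continue
        if data_from is not None and env.data_from != data_from:
            continue
        matches.append(env.name)
    return matches
```

For example, `shortlist(source_tracking=True)` returns the scratch org and the two developer sandbox types, while `shortlist(data_from="Production")` returns only the Full sandbox.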
Here is how different features map to common development tasks, along with compatible environment recommendations:
Task | Org Shape | Source Tracking | Frequent Refreshes | Release Preview Support | All metadata from production | Partial Metadata from production | Large datasets from production | Partial datasets from production | Compatible Environments |
---|---|---|---|---|---|---|---|---|---|
Prototyping | X | X | X | X | X | X | X | Scratch Orgs, Developer and Developer Pro Sandboxes | |
New feature investigations or proof-of-concept development | X | X | X | X | X | X | X | Scratch Orgs, Developer and Developer Pro Sandboxes | |
User acceptance testing | X | X | X | X | X | X | Developer, Developer Pro and Partial Copy sandboxes | ||
Performance and scale testing | X | X | X | Full sandbox | |||||
User Training | X | X | X | X | X* | X | Developer Pro, Partial Copy and *Full sandboxes | ||
*If required to complete a specific kind of work, otherwise use a less resource-intense environment |
Decouple environments from release artifacts. Do not use org-based development. Treat environments as places where work happens for a fixed amount of time. View the state of metadata in an environment as orthogonal to your release artifacts. If a piece of code or configuration gets “figured out” in an environment, it should be committed to source control — that is what constitutes a release artifact.
Decouple environments from release paths. It is common to see mandatory release paths that require changes to be deployed to specific environments. Often, this is done to have some kind of proxy for validating application maturity or release stability. It can also be done in an attempt to minimize the number of environments where complex testing infrastructure has to be configured. In source-based paradigms, you have greater flexibility in how (and where) you can validate and test changes.
Build release paths for different types of changes. Not all changes require the same kinds of ALM work in the same order. It probably isn’t a valuable use of end-user time to perform acceptance testing for minor changes to back-end components of a system. User acceptance and scale testing might be tremendously valuable during early-stage development of a mobile application. Identify release paths for different kinds of change. Example categories include:
Do not allow overcrowded environments to exist. Lack of discipline in prioritizing, scoping, and sequencing work will inevitably lead to overloaded development environments — with volumes of work that are just too much, too many, too different. Overcrowded environments create high levels of stress, ambiguity, and conflict among development teams. They also create noise within your development pipelines and impede quality control efforts. In addition to these negative impacts, overcrowded development environments are serious threats to environment maintenance and security. View overcrowding as a symptom of potential problems in your ALM processes. Investigate for any root cause issues and address them. If you still face overcrowding, you can purchase additional sandboxes.
The list of patterns and anti-patterns below shows what proper (and poor) environment management looks like for a Salesforce org. You can use these to validate your designs before you build, or identify areas of your system that need to be refactored.
To learn more about Salesforce tools for environment management, see Tools Relevant to Resilient.
A test strategy is the set of guiding principles and standards for how you plan and run tests that gauge the success or failure of your applications during your ALM processes. Test strategy keeps every stakeholder involved in testing aligned with the priority, purpose, and scope of a given test, and helps project teams create effective and thoughtful test plans.
Typically, developers or quality assurance/testing experts will be involved in creating and executing specific tests. Test strategy helps ensure that these individuals know what kinds of tests need to be conducted for a given project, in what sequence tests should occur, and what is needed for building well-formed tests, test plans, and artifacts (for example, test data sets, devices, traffic or network simulators, and so on).
Effective testing strategy creates a clear picture of how, when, where, and why to run different test types (including unit tests, UI tests, and regression tests) in various combinations and conditions to uncover how your system (including any in-flight changes) will behave. An effective test strategy produces tests that better show you how well the system conforms to non-functional requirements (such as scalability, reliability, and usability) that can be difficult to measure through a single kind of test.
To create effective testing strategies for Salesforce, consider:
The following table shows a selection of patterns to look for (or build) in your org and anti-patterns to avoid or target for remediation.
✨ Discover more patterns for ALM in the Pattern & Anti-Pattern Explorer.
Patterns | Anti-Patterns | |
---|---|---|
Release Management | In production:
- Metadata shows use of stable release mechanisms, such as: -- Metadata organized into unlocked packages -- DevOps Center is active and installed -- Deployments via Metadata API use source format
- Deployment logs show no failed deployments within the available history - Deployment history shows clear release cadences and fairly uniform deployment clusters within release windows | In production:
- Metadata indicates use of org-based release mechanisms, such as: -- Active use of change sets -- Deployments via Metadata API use package.xml format
- Deployment logs show repeated instances of failed deployments within the available history - Deployments have no discernable cadence or show uneven clusters of deployments (signs of hot-fix and ad hoc rollbacks) - DevOps Center is not enabled and installed |
In your roadmap and documentation:
- Release names are clear - Features are tied clearly to a specific, named release - Release names are searchable and discoverable - Teams can find and follow clear guidelines for tagging artifacts, development items, and other work with the correct release names - It is possible to pull together a clear view of a release manifest by release name - Quality thresholds for generative AI apps are defined for different development stages |
In your roadmap and documentation:
- Release names are absent - Features are not tied clearly to a specific release - Release names are ad hoc or do not exist - Teams refer to artifacts, development items, and other work in different ways - It is not possible to pull together a clear view of a release manifest using a release name - Quality thresholds for generative AI apps are not defined, or are not defined at different development stages |
|
Environment Strategy | In your orgs:
- A source-driven development and release model is adopted - Source tracking is enabled for Developer and Developer Pro sandboxes - Metadata in a given environment is independent from your release artifacts - Environments do not directly correspond to a release path - Release paths for a change depend on the type of the change (high risk, medium risk, low risk) - Overcrowded environments do not exist - Risky configuration changes are never made directly in production - No releases occur during peak business hours |
In your orgs:
- An org-based development and release model is adopted - Source tracking is not enabled for Developer and Developer Pro sandboxes - Metadata in a given environment is your release artifact - Environments directly correspond to a release path - The release path for every change is the same - Overcrowded environments exist - Risky configuration changes are made directly in production - Releases occur during peak business hours |
Testing Strategy | Within your business:
- Usability tests employ a variety of devices and assistive technology - Simulators are used to replicate production-like conditions for scalability and performance testing - Tests are automated to run when changes come into source control - Endurance, stress, performance, and scale tests are run at several intervals in the application development cycle and considered on-going tasks - You include scale testing as part of your QA process when you have B2C-scale apps, large volumes of users, or large volumes of data - Your scale tests are focused on priority aspects of the system - Your scale tests have well-defined criteria - You conduct scale testing in a Full sandbox - Prompt engineering includes a quality review by a human |
Within your business:
- Usability tests are not conducted, or are conducted on a limited set of devices - Production-like volumes of user requests, API traffic, and variations in network speed are not tested - Test automation is not in place - Endurance, stress, performance, scale tests are considered a phase or stage of development - You don't conduct scale tests as a part of your QA process and you have B2C-scale apps, large volumes of users, or large volumes of data - Your scale tests aren't prioritized - Your scale tests don't have well-defined criteria - You conduct scale tests in a Partial Copy or Developer sandbox - Prompt engineering lacks a quality review by a human |
In your org:
- All test data is scrubbed of sensitive and identifying data | In your org:
- Test data is identical to production data |
|
In Apex:
- Data factory patterns are used for unit tests - Mock/stubs are used to simulate API responses | In Apex:
- Your unit tests are reliant on org data - Mocks/stubs are not used |
|
In your design standards and documentation:
- Environments are classified by what type of tests they can support - Appropriate test regimes are specified according to risk, use case, or complexity | In your design standards and documentation:
- It is not clear which environment can support what type of tests - Test regimes are not categorized by risk, use case, or complexity |
In security and site reliability engineering (SRE), incident response is focused on how teams identify and address events impacting the overall availability or security of a system, as well as how teams work to address root causes and prevent future issues. Incident response involves the processes and tools as well as the organizational behaviors required to address issues in real-time and in the period after an issue occurs.
As an architect, you may not be the person monitoring your solution’s operations on a day-to-day basis once it goes live. Part of architecting for resilience is designing capabilities that enable support teams to perform first-level diagnosis, stabilize systems, and effectively hand over the investigation and root cause mitigation to development or maintenance teams. Teams directly supporting users on a day-to-day basis may not have deep understanding or expertise in the architecture of the system. It is essential for these teams to have appropriate tools and processes for monitoring daily operations, accessing information from the system when diagnosing a potential incident, and helping them serve as effective first-responders for any issues impacting availability.
You can improve how well teams respond to incidents in your Salesforce solutions by focusing on time to recover, ability to triage, as well as monitoring and alerting.
When an incident occurs, the first priority must be restoring systems to a stable operational state. Often, businesses think the only way to recover from an incident is to “fix the problem”. This is directionally sound — accurate root cause analysis and remediation is how you ultimately resolve critical issues in a system. However, this approach is not the most practical in the early stages of crisis response. Depending on the severity of an incident, every second of an outage or incident could create revenue (or reputation) loss for the business.
Often, attempting to diagnose and address root causes will delay efforts to restore the system to operation. Logistically, adopting an approach that asks incident responders to address root causes puts tremendous strain on subject matter experts (SMEs) and support staff at your company. Working to find and fix root causes during an incident requires SMEs to be on-call for every incident, and can block front-line/customer-facing support staff from taking action. It can also result in teams releasing changes that, in turn, create more incidents. Ultimately, such an approach increases costs, consumes bandwidth across teams, and creates behaviors in times of crisis that can erode customer trust and brand reputation.
The right incident management paradigm is to prioritize and focus on recovery as a first step. After the system is restored to stability, then follow up with blameless postmortems, incident investigations, root cause remediation, and similar activities. This order of operations better enables incident response staff to triage, diagnose, and execute recovery tactics, alerting relevant SMEs to assist only as necessary. It also enables SMEs to identify and fix root causes with less pressure from a ticking incident clock.
To adopt a recovery-first mindset to incident response, consider:
Incident Type | Apparent Trigger | Recovery Tactics |
---|---|---|
System outage | Corrupted logins or issues with account access | Carry out account recovery policy |
System outage | Service unavailable | Activate redundant/backup service, manual workarounds |
Production bug | Recent change | Deployment rollback or prior version re-deploy |
Production bug | Emergent / unexplained bug | Manual workarounds, disable non-essential features, escalate to SMEs |
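A classification like the one in this table can be kept in a form that incident responders (or tooling) can look up quickly. The sketch below, in Python, stores the table's tactics keyed by incident type and apparent trigger. The names `RECOVERY_TACTICS` and `recovery_tactics` are hypothetical, and the fallback of escalating to SMEs for unclassified incidents is an assumption made for this example, not guidance from the table.

```python
# Recovery tactics keyed by (incident type, apparent trigger), taken from the table above.
RECOVERY_TACTICS = {
    ("System outage", "Corrupted logins or issues with account access"): [
        "Carry out account recovery policy",
    ],
    ("System outage", "Service unavailable"): [
        "Activate redundant/backup service",
        "Manual workarounds",
    ],
    ("Production bug", "Recent change"): [
        "Deployment rollback",
        "Prior version re-deploy",
    ],
    ("Production bug", "Emergent / unexplained bug"): [
        "Manual workarounds",
        "Disable non-essential features",
        "Escalate to SMEs",
    ],
}


def recovery_tactics(incident_type, trigger):
    """Return the documented tactics for an incident.

    Assumption: anything not yet classified is escalated to SMEs.
    """
    return RECOVERY_TACTICS.get((incident_type, trigger), ["Escalate to SMEs"])
```

Keeping the lookup explicit also makes gaps visible: an incident that falls through to the default is a candidate for a new row in your recovery documentation.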
The list of patterns and anti-patterns below shows what architecting to prioritize recovery looks like within a Salesforce solution. You can use these to validate your designs before you build, or identify areas of your system that need to be refactored.
To learn more about Salesforce tools to help with time to recover, see Tools Relevant to Resilient.
In the context of technology, triage involves assigning categories and levels of severity to issues and support requests. No matter how well planned your solution is, user support issues and requests will arise. These can range from issues that stem from lack of sufficient training or change management, gaps in UI/UX, and unexpected end-user behaviors, to urgent system issues not caught by monitoring or alerting.
Support and operations teams need the ability to investigate user support queries efficiently and diagnose them quickly. Triaging issues to filter out less severe concerns and quickly spot critical system incidents is a key competency for these teams. Poor triaging slows all levels of user support, prolongs critical incidents, and increases the risk of further disruptions to your customers and your business.
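To make the idea of assigning categories and severity levels concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `SupportRequest` fields, the category labels, the four severity levels, and the threshold of 10 affected users are illustrative assumptions, and real triage rules would come from your own support processes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class SupportRequest:
    summary: str
    category: str  # hypothetical labels: "training", "ui_ux", "user_behavior", "system"
    users_affected: int
    blocks_core_process: bool


def triage(request: SupportRequest) -> Severity:
    """Assign a severity using simple, illustrative rules."""
    if request.category == "system" and request.blocks_core_process:
        return Severity.CRITICAL
    if request.blocks_core_process:
        return Severity.HIGH
    if request.users_affected > 10:  # assumed threshold
        return Severity.MEDIUM
    return Severity.LOW


def sort_queue(requests):
    """Order the queue so the most severe requests are handled first."""
    return sorted(requests, key=triage, reverse=True)


# Illustrative requests only.
question = SupportRequest("How do I export a report?", "training", 1, False)
outage = SupportRequest("Checkout flow fails for all users", "system", 500, True)
queue = sort_queue([question, outage])
```

The point of the sketch is the shape of the decision, not the specific rules: less severe concerns are filtered down the queue so that critical system incidents surface first.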
Although you may not be involved in day-to-day operation and support, as an architect, it is your responsibility to help ensure support and operations teams can effectively triage issues in any solution you create on the Salesforce platform.
To enable teams to effectively triage issues within your Salesforce solutions, consider:
The list of patterns and anti-patterns below shows what architecting for effective triaging looks like within a Salesforce solution. You can use these to validate your designs before you build, or identify areas of your system that need to be refactored.
To learn more about Salesforce tools to help with triage, see Tools Relevant to Resilient.
Monitoring and alerting are widely used terms in site reliability engineering. In the context of system resiliency, monitoring is the ability to continuously assess the current state of a system and alerting is the ability to automate notifications to stakeholders about potential concerns about the state of the system. Effective monitoring and alerting is a key part of decoupling the scale and growth of your system from the scale and growth of your support staff.
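The distinction between monitoring (assessing state) and alerting (notifying someone) can be sketched in a few lines of Python. In this example, a reading is compared against a warning threshold and a failure threshold, and a notification is raised only when a person needs to act. The function name `evaluate`, the metric names, and the thresholds are hypothetical, and the `logging` call stands in for whatever notification channel you actually use.

```python
import logging

logger = logging.getLogger("monitoring_sketch")


def evaluate(metric_name, value, warn_at, fail_at, needs_human=True):
    """Classify a reading and decide whether to alert or only log.

    Values are ratios of a limit (for example, 0.85 means 85% consumed).
    Returns "ok", "log", or "alert".
    """
    if value < warn_at:
        return "ok"
    state = "failing" if value >= fail_at else "approaching limit"
    message = f"{metric_name} is {state}: {value:.0%} (warn at {warn_at:.0%})"
    if needs_human:
        # Stand-in for a real notification to someone able to respond.
        logger.warning("ALERT %s", message)
        return "alert"
    # No human action required: record it so it stays reportable.
    logger.info(message)
    return "log"
```

Because the warning threshold sits below the failure threshold, a call such as `evaluate("api_usage", 0.85, warn_at=0.80, fail_at=1.00)` raises an alert before the limit is actually reached.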
Salesforce provides a variety of built-in capabilities to monitor behaviors in your system. Salesforce also offers real-time event monitoring as an add-on or as part of Salesforce Shield. In any Salesforce solution, designs architected for monitoring and alerting provide:
To architect for effective monitoring and alerting within your Salesforce solutions, consider:
The list of patterns and anti-patterns below shows what architecting for effective monitoring and alerting looks like within a Salesforce solution. You can use these to validate your designs before you build, or identify areas of your system that need to be refactored.
To learn more about Salesforce tools for monitoring and alerting, see Tools Relevant to Resilient.
The following table shows a selection of patterns to look for (or build) in your org and anti-patterns to avoid or target for remediation.
✨ Discover more patterns for incident response in the Pattern & Anti-Pattern Explorer.
Patterns | Anti-Patterns | |
---|---|---|
Time to Recover | Within your business:
- Recovery protocols are practiced at regular intervals - Teams know which services in production they are responsible for owning | Within your business:
- Recovery protocols don't exist or aren't practiced at regular intervals - It is unclear which teams are responsible for different services in production |
In your documentation:
- Recovery tactics are defined and classified by incident type and trigger - Exit criteria for incident responses exist in your SLOs and are clear - Activation criteria and assignment logic for elevated permissions during incidents are clear - Incident response permission sets and authorizations are clearly listed |
In your documentation:
- Incident response is performed ad hoc - Exit criteria for incident responses do not exist - Elevated permissions are not assigned, or assigned ad hoc - Incident response permission sets and authorizations are not listed |
|
In your org:
- Session-based permission sets for incident response exist and can be assigned to support staff during recovery - Setup Audit Trail shows designated recovery testers have logged in to the testing environment at the agreed-upon time and have followed recovery test scripts |
In your org:
- Session-based permission sets do not exist for incident response, or are not authorized for support staff to use - Setup Audit Trail shows designated recovery testers have not logged into the testing environment or did not follow recovery test scripts |
|
In your test plans:
- Test scripts for recovery testing exist and are repeatable - Environments for incident simulations are clearly listed |
In your test plans:
- Test scripts do not exist for recovery testing - Environments are not established for incident simulations |
|
Ability to Triage | Within your business:
- SMEs or stakeholders who should be alerted to support complex issues are identified before an incident occurs - The hand-off between delivery and support teams is a part of go-live - If consulted, Salesforce architects respond quickly and help the team stay focused on recovery |
Within your business:
- SMEs or stakeholders who should be alerted aren't identified until an incident occurs - The hand-off between delivery teams and support teams isn't a part of the release process - Salesforce architects consider incident response to be outside their scope of work |
In your documentation:
- System and design patterns used in a given solution are discoverable and readable by support staff |
In your documentation:
- System and design patterns used in a given solution are not readily available to support staff |
|
In your org: - Logging and custom error messages are incorporated into execution paths throughout the system |
In your org: - Logging and custom error messages are not used |
|
Monitoring and Alerting | In your org:
- Alerts are only used to inform users of scenarios that require human intervention; other failures are logged and reportable - Alerts are sent to users who are capable of responding to them - When possible, alerts are delivered in advance of a potential failure |
In your org:
- Alerts are sent when any type of failure occurs, regardless of whether follow-on actions are required - Alerts about issues requiring technical solutions are delivered to business users - Alerts are only delivered in response to failures that have already occurred |
In your documentation:
- Entry criteria for prompt tuning alerts are defined based on direct and indirect generative AI feedback metrics |
In your documentation:
- There are no criteria defined for triggering prompt tuning alerts for generative AI apps |
A key to business resilience is continuity planning, which focuses on how to enable people and systems to function through issues caused by an unplanned event. Business continuity plans (BCPs) take a people-oriented view of how to keep processes moving forward through crisis. Technical aspects of continuity planning are contained in the disaster recovery portions of a BCP. For more on this topic, see Technology Continuity.
Without adequate continuity plans, your organization may be paralyzed in the event of a crisis or system outage. Ineffective continuity planning can have a catastrophic impact on customers, stakeholders, and the business. In the wake of an adverse event, each moment that passes without maintaining or recovering critical processes risks financial and reputational damage, and can put employee safety and even regulatory compliance in jeopardy.
You can build better continuity planning into your systems by focusing your efforts in three areas: defining business continuity for Salesforce, planning for technology continuity, and building backup and restore capabilities.
Your company may already have a BCP in place. If this is the case, make sure Salesforce is included. If your company doesn’t have a BCP, work with your stakeholders to create one that covers your Salesforce org(s).
Salesforce will likely play a unique role in business continuity plans, because of the role it occupies in the system landscape. Salesforce is often relied upon to be a source of truth for customer data and essential business processes, across many business divisions. As such, the role Salesforce plays in a BCP may differ from other systems. It is likely that Salesforce will be involved in many high-priority areas for recovery.
To create relevant business continuity planning for Salesforce systems, consider:
The list of patterns and anti-patterns below shows what proper (and poor) continuity planning looks like for a Salesforce solution. You can use these to validate your designs before you build, or identify places in your system that need to be refactored.
To learn more about Salesforce tools for defining business continuity, see Tools Relevant to Resilient.
The goal of technology continuity is to make sure the business won’t be prevented from maintaining essential operations due to issues with the components in a system. Salesforce prioritizes maintaining our services at the highest levels of availability, and providing transparent information about any issues. You can see real-time information about Salesforce system performance and issues at trust.salesforce.com. As an architect building on Salesforce, your solutions benefit from the site reliability, security, and performance capabilities that Salesforce provides across the entire platform.
However, the overall continuity of your Salesforce solutions extends beyond the built-in services Salesforce provides. From an architectural perspective, Salesforce technology continuity planning has to begin with asking (and answering) questions about how Salesforce fits into your larger enterprise landscape. What kind of systems integrate with Salesforce? How do external systems depend on processes or information in Salesforce? In your Salesforce orgs, what processes or functionality rely on AppExchange solutions? Do your users access Salesforce through third-party identity services or SSO?
To build better technology continuity in your Salesforce systems, consider:
Treat any items that come out of your post-incident reviews like other development items, and add them to your planning systems to be prioritized and worked on.
The list of patterns and anti-patterns below shows what proper (and poor) technology continuity planning looks like within a Salesforce solution. You can use these to validate your designs before you build, or identify places in your system that need to be refactored.
To learn more about Salesforce tools for technology continuity planning, see Tools Relevant to Resilient.
Restoring backed-up copies of data or metadata can help return your org to its last known stable state and provide a failover system for use during a catastrophic system failure or outage. Backing up your data and metadata regularly, and storing the encrypted copies in a secure location, adds another layer of resilience to your architecture.
Without backup and restore strategies, you will not be able to restore clean versions of your production data and metadata when data is maliciously corrupted, when defects inadvertently make their way into production, or when a failure during a large data load corrupts production data. Any one of these scenarios can leave your business-critical production data corrupted or even permanently lost. Backup and restore technology also offers advantages beyond continuity planning, including support for large data volume mitigation strategies and adherence to compliance-related retention policies.
To help ensure continuity with backup and restore strategies in your Salesforce solutions, consider:
Get started. The first step to having a good backup and restore strategy is to have one in the first place. Even something as simple as making nightly backups of all of your org’s data and metadata can save your business from losing critical information or functionality in the event of a disaster.
Restrict access to backups. Only system administrators should have access to backed-up copies of your data, so that business users cannot view records in a backup copy that they aren't authorized to view in your org.
Test your restore process regularly. Regardless of what backup and restore strategy you choose to implement, test your restore process in a Full or Partial Copy sandbox regularly to be sure that it will work correctly when you need it.
Align your backup and restore strategy with your data archival strategy. When records are archived or purged from your system, evaluate what should happen with that data in your backups or archives. (For more on this topic, see data volume).
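One way to make a restore test repeatable is to compare what the backup says it contains against what actually landed in the sandbox. The sketch below is a minimal illustration, not a Salesforce API: it assumes you have already collected per-object record counts from your backup tool and from the restored sandbox, and the function name `restore_discrepancies` and the dictionary shapes are hypothetical.

```python
from typing import Dict, List


def restore_discrepancies(expected: Dict[str, int], restored: Dict[str, int]) -> List[str]:
    """Compare per-object record counts from a backup against a restored sandbox.

    `expected` maps object names to the counts recorded for the backup;
    `restored` maps object names to the counts found after the restore.
    Both inputs are assumed to be gathered by your own tooling.
    """
    problems = []
    for obj in sorted(expected):
        want = expected[obj]
        got = restored.get(obj)
        if got is None:
            # The object was in the backup but nothing was restored for it.
            problems.append(f"{obj}: missing after restore (expected {want})")
        elif got != want:
            problems.append(f"{obj}: expected {want} records, restored {got}")
    return problems


if __name__ == "__main__":
    backup_counts = {"Account": 100, "Contact": 250, "Case": 40}
    sandbox_counts = {"Account": 100, "Contact": 245}
    for line in restore_discrepancies(backup_counts, sandbox_counts):
        print(line)
```

An empty result means the counts match. Matching counts do not prove that field values or relationships were restored correctly, so treat a check like this as a first gate in a restore test plan rather than the whole plan.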
You may require a more granular backup strategy if your data volumes are so large that a full backup doesn’t have time to complete before the next backup starts running or if your organization’s data changes so frequently that the updates are mission-critical to your organization.
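The first condition above is simple arithmetic: if a full backup takes close to, or longer than, the gap between scheduled backups, runs will begin to overlap. The sketch below illustrates that check. The function name and the `safety_margin` parameter (the fraction of the interval a full backup may use) are assumptions for illustration, not values from Salesforce guidance.

```python
def needs_granular_backups(
    full_backup_hours: float,
    backup_interval_hours: float,
    safety_margin: float = 0.8,  # assumed headroom; tune for your org
) -> bool:
    """Return True when a full backup uses more than `safety_margin`
    of the time available between scheduled backups."""
    if backup_interval_hours <= 0:
        raise ValueError("backup_interval_hours must be positive")
    return full_backup_hours > backup_interval_hours * safety_margin


if __name__ == "__main__":
    # A 30-hour full backup cannot finish inside a nightly (24-hour) window.
    print(needs_granular_backups(full_backup_hours=30, backup_interval_hours=24))
```

A measured run time from your own backup logs is a better input than an estimate, because data volume and API throughput both vary between orgs.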
Here are ways to shape a more granular backup strategy:
Always continue to perform full backups. Never eliminate full backups completely, even if data volumes result in long run times. For large data volumes, plan regular but infrequent full backups (weekly, for example) alongside more frequent partial or object-specific backups (nightly or every few hours, for example). This gives you the flexibility to reconstruct the most complete and accurate dataset for your restore processes.
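To show why full backups remain necessary alongside partial ones, the sketch below picks the set of backups needed to rebuild data as of a target time: the most recent full backup at or before that time, followed by every partial backup taken after it. The `Backup` class and `restore_chain` function are hypothetical names, and the model assumes each partial backup captures changes since the previous backup, which may not match how your backup tool works.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "partial"


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the latest full backup at or before `target`, followed by
    each partial backup taken after it and at or before `target`."""
    candidates = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in candidates if b.kind == "full"]
    if not fulls:
        # Partial backups alone cannot rebuild a complete dataset.
        raise ValueError("no full backup available at or before the target time")
    base = fulls[-1]
    partials = [
        b for b in candidates
        if b.kind == "partial" and b.taken_at > base.taken_at
    ]
    return [base] + partials


if __name__ == "__main__":
    history = [
        Backup(datetime(2024, 1, 7), "full"),
        Backup(datetime(2024, 1, 8), "partial"),
        Backup(datetime(2024, 1, 9), "partial"),
        Backup(datetime(2024, 1, 14), "full"),
        Backup(datetime(2024, 1, 15), "partial"),
    ]
    for backup in restore_chain(history, datetime(2024, 1, 15, 12)):
        print(backup.kind, backup.taken_at.date())
```

The longer the gap between full backups, the longer this chain becomes and the more steps a restore has to replay, which is one reason to keep full backups on a regular schedule.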
The list of patterns and anti-patterns below shows what proper (and poor) backup and restore capabilities look like within a Salesforce solution. You can use these to validate your designs before you build, or identify places in your system that need to be refactored.
To learn more about Salesforce tools for backup and restore, see Tools Relevant to Resilient.
The following table shows a selection of patterns to look for (or build) in your org and anti-patterns to avoid or target for remediation.
✨ Discover more patterns for continuity planning in the Pattern & Anti-Pattern Explorer.
Category | Patterns | Anti-Patterns |
---|---|---|
Business Continuity | Within your business: - A "recovery first" mindset is adopted, with a focus on bringing the highest-priority business functions and capabilities out of impact as soon as possible - There is a maintenance schedule for the review of BCP test plans | Within your business: - A "fix-the-problem" mentality is the only approach to incident management - BCP test plans are not refreshed at regular intervals |
Business Continuity | In your documentation: - A BCP exists containing: steps to continue processing or triage data if Salesforce becomes unavailable, a list of events that can trigger the use of the BCP, steps and intervals for BCP testing - Your BCP includes upstream and downstream systems and dependencies | In your documentation: - A BCP does not exist, is incomplete, or includes only Salesforce |
Business Continuity | In your test plans: - The areas of your BCP related to processes and people are accounted for | In your test plans: - The areas of your BCP related to processes and people are not accounted for |
Technology Continuity | Within your business: - You have evaluated if you need to build intentional redundancy or fail-over systems - Incident recovery tactics are automated wherever possible | Within your business: - You have not evaluated the need for intentional redundancy or fail-over systems - Incident recovery tactics are all manual |
Technology Continuity | In your documentation: - Your BCP accounts for additional resources or break-glass procedures teams might need to respond to incidents effectively | In your documentation: - Your BCP does not include operational support needs |
Backup and Restore | In your documentation: - A backup and restore strategy exists for both data and metadata | In your documentation: - A backup and restore strategy does not exist or the strategy is incomplete (it applies to only data or metadata, not both) |
Backup and Restore | At your company: - Backups are stored in a secure location accessible only by authorized users - Test plans and test logs show data restores are tested in a full or partial copy sandbox at least two times each year | At your company: - Backups are not human readable - Backups are stored in locations that unauthorized business users can access - There is no data restoration process or the data restoration process is untested |

Tool | Description | Application Lifecycle Management | Incident Response | Continuity Planning |
---|---|---|---|---|
Apex Hammer Tests | Learn about Salesforce Apex testing in current and new releases | X | ||
Apex Stub API | Build a mocking framework to streamline testing | X | ||
Backup and Restore | Automatically generate backups to prevent data loss | X | ||
Big Objects | Store and manage large volumes of data on-platform | X | ||
Field History Tracking | Track and display field history | X | ||
Get Adoption and Security Insights for Your Organization | Monitor adoption and usage of Lightning Experience in your org | X | ||
Manage Bulk Data Load Jobs | Create, update, or delete large volumes of records with the Bulk API | X | ||
Manage Real-Time Event Monitoring Events | Manage event monitoring streaming and storage settings | X | ||
Monitor Data and Storage Resources | View your Salesforce org’s storage limits and usage | X | ||
Monitor Debug Logs | Monitor logs and set flags to trigger logging | X | ||
Monitor Login Activity with Login Forensics | Identify behavior that may indicate identity fraud | X | ||
Monitor Setup Changes with Setup Audit Trail | Track recent setup changes made by admins | X | ||
Monitor Training History | View the Salesforce training classes your users have taken | X | ||
Monitoring Background Jobs | Monitor background jobs in your organization | X | ||
Monitoring Scheduled Jobs | View report snapshots, scheduled Apex jobs and dashboard refreshes | X | ||
Performance Assistant | Test system performance and interpret the results | X | ||
Proactive Monitoring | Minimize disruptions with Salesforce monitoring services | X | ||
Salesforce Data Mask | Automatically mask data in a sandbox | X | ||
The System Overview Page | View usage data and limits for your organization | X | ||
Use force:lightning:lint | Analyze and validate code via the CLI | X |
Resource | Description | Application Lifecycle Management | Incident Response | Continuity Planning |
---|---|---|---|---|
7 Anti-Patterns in Performance and Scale Testing | Avoid common anti-patterns in performance and scale testing | X | ||
Analyze Performance & Scale Hotspots in Complex Salesforce Apps | An approach to address performance and scalability issues in your org | X | ||
Build a Disaster Recovery Plan (Trailhead) | Build a disaster recovery plan | X | ||
Business Continuity is More than Backup and Restore | Take a comprehensive view of BCP | X | ||
Design Standards Template | Create design standards for your organization | X | ||
Diagnostics and Monitoring tools in Salesforce | Learn how to improve the quality and performance of your implementations | X | ||
Guiding Principles for Continuity Planning | Review the basic principles underlying effective BCP | X | ||
How to Scale Test on Salesforce | Approach scale testing in five steps | X | ||
Introduction to Business Continuity Planning for Architects (Trailhead) | Get started with business continuity planning | X | ||
Introduction to Performance Testing | Learn how to develop a performance testing method | X | ||
Monitor Your Organization | Learn about self service monitoring options | X | ||
Scale Test Strategy Checklist | Create and customize scale and performance test plans | X | ||
Site Reliability Engineering At Salesforce | Learn what Salesforce SRE does and how they do it | X | ||
Test Strategy Template | Ensure completeness of your test strategy | X | ||
Understand Source Driven Development (Trailhead) | Learn about package development and scratch orgs | X |