business-continuity-and-disaster-recovery: CDA.Wiki (Print)

# Business Continuity and Disaster Recovery

Definition

Business continuity planning (BCP) and disaster recovery (DR) are complementary disciplines that ensure an organization can maintain critical operations during a disruptive event and restore full operational capability afterward. BCP addresses the broader organizational response: which business functions are critical, how they continue operating during disruption, what alternative processes or locations are available, and how the organization communicates with stakeholders. DR addresses the specific technical recovery of IT systems and data: which systems are restored first, how they are restored, and how quickly.

Together, BC/DR ensures that a ransomware event, natural disaster, cloud provider outage, supply chain disruption, or infrastructure failure does not become an existential threat. What separates an organization that recovers from a disruption within hours from one that is disrupted for weeks, or permanently, is almost always whether it has a tested BC/DR plan.

BC/DR is not a cybersecurity-specific discipline. It predates the cybersecurity industry by decades, originating in physical disaster planning (fire, flood, earthquake, power failure). Cybersecurity events (ransomware, destructive malware, data center compromise) are now the most common trigger for BC/DR activation, which is why BC/DR sits in the RGA (Risk Governance and Assurance) domain of the Planetary Defense Model: it is a governance function that ensures the organization can sustain operations through any disruption, cyber or physical.

How It Works

Business Impact Analysis (BIA)

BC/DR planning begins with a Business Impact Analysis: a structured assessment that identifies which business functions are critical, what resources (systems, data, personnel, facilities) they depend on, and what the financial and operational impact of losing those functions is over time.

The BIA produces two outputs that drive every subsequent BC/DR decision:

Recovery Time Objective (RTO). The maximum acceptable time that a business function can be unavailable before the impact becomes unacceptable. RTO is defined per function, not per organization. The payment processing system may have a 4-hour RTO (every hour offline costs measurable revenue). The marketing website may have a 72-hour RTO (inconvenient but not revenue-impacting). The employee intranet may have a 24-hour RTO (productivity impact but not customer-facing).

Recovery Point Objective (RPO). The maximum acceptable amount of data loss, measured in time. An RPO of 1 hour means the organization can tolerate losing up to 1 hour of data (anything created or modified in the last hour before the disruption may be lost). RPO determines backup frequency: a 1-hour RPO requires backups at least hourly. A 15-minute RPO typically calls for continuous data protection or near-continuous replication. A 24-hour RPO can be met with daily backups.
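The link between RPO and backup frequency can be sketched in a few lines of Python. This is an illustrative check, not a prescribed method: the function names are hypothetical, and the `backup_duration` parameter reflects an added assumption that a backup still in progress when the disruption hits cannot be used for recovery.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Longest stretch of data that could be lost between usable backups.

    Assumption: a backup only counts once it has finished, so a failure
    just before completion falls back to the previous backup.
    """
    return interval + backup_duration


def meets_rpo(rpo: timedelta, interval: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(interval, backup_duration) <= rpo


# Hourly backups satisfy a 1-hour RPO only if the backup itself is near-instant.
print(meets_rpo(timedelta(hours=1), timedelta(hours=1)))
print(meets_rpo(timedelta(hours=1), timedelta(hours=1), timedelta(minutes=10)))
```

Under this assumption, a schedule that looks compliant on paper can still miss the RPO once backup run time is counted, which is one reason the RPO should be validated in a drill rather than inferred from the schedule.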

RTO and RPO are business decisions, not technical decisions. They are set by business leaders based on operational and financial impact, then translated into technical requirements by the IT and security teams. A CTO who sets a 4-hour RTO for the ERP system is committing the organization to the infrastructure, processes, and testing cadence required to achieve 4-hour restoration. If the infrastructure cannot support it, the RTO must be revised or the infrastructure must be upgraded.

Plan Development

A BC/DR plan documents the specific actions the organization will take before, during, and after a disruption.

Pre-incident preparation. Maintain the backup infrastructure that supports the defined RTOs and RPOs (see Backup and Recovery Architecture). Identify alternate processing locations (hot site, warm site, cold site, cloud-based DR). Maintain current contact lists for the BC/DR team, executive leadership, key vendors, and external stakeholders. Document step-by-step recovery procedures for each critical system. Keep the plan current (a plan that references a system architecture from two years ago is a plan that will fail during execution).

During-incident response. Declare the disaster (who has authority to invoke the BC/DR plan? what are the criteria?). Activate the BC/DR team. Communicate with stakeholders (employees, customers, partners, regulators, media). Execute recovery procedures in priority order (critical systems first, based on RTO tiers). Document actions taken and decisions made (for post-incident review and potential legal/insurance requirements).
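Executing recovery "in priority order" amounts to sorting business functions by RTO. The sketch below reuses the example functions and RTOs from the BIA discussion. The `BusinessFunction` class and the tier boundaries (4 and 24 hours) are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessFunction:
    name: str
    rto_hours: float


def rto_tier(fn: BusinessFunction) -> int:
    """Assign a recovery tier. Boundaries are assumed for illustration only."""
    if fn.rto_hours <= 4:
        return 1
    if fn.rto_hours <= 24:
        return 2
    return 3


def recovery_order(functions: list[BusinessFunction]) -> list[BusinessFunction]:
    """Shortest RTO first; ties broken by name so the order is deterministic."""
    return sorted(functions, key=lambda fn: (fn.rto_hours, fn.name))


functions = [
    BusinessFunction("marketing website", 72),
    BusinessFunction("payment processing", 4),
    BusinessFunction("employee intranet", 24),
]

for fn in recovery_order(functions):
    print(f"Tier {rto_tier(fn)}: {fn.name} (RTO {fn.rto_hours:g}h)")
```

A real plan would also account for dependencies between systems (for example, identity services that must be restored before the applications that rely on them), which a simple sort does not capture.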

Post-incident recovery. Restore full operations. Verify data integrity. Conduct a post-incident review to identify what worked, what did not, and what changes are needed. Update the plan based on lessons learned. Communicate recovery completion to all stakeholders.

Recovery Strategies

Recovery strategy selection depends on RTO requirements and budget:

Hot site. A fully equipped, continuously synchronized alternate location that can assume production operations within minutes to hours. Highest cost, lowest RTO. Used for critical systems that cannot tolerate extended downtime (financial trading, healthcare systems, emergency services).

Warm site. A partially equipped alternate location with infrastructure in place but not continuously synchronized. Recovery requires data restoration from backups and system configuration before operations can resume. RTO of hours to days. Moderate cost. The most common approach for mid-market organizations.

Cold site. An empty facility with power and connectivity but no pre-installed infrastructure. Everything must be provisioned after the disaster. RTO of days to weeks. Lowest cost. Used for non-critical functions or as a last resort.

Cloud-based DR. Replicate critical systems to a cloud provider (AWS, Azure, GCP) as standby instances that can be activated when needed. Cloud DR offers flexibility (scale up or down based on the disaster's scope), geographic distribution (replicate to a different region), and cost efficiency (pay for standby resources at reduced rates, pay full rates only when activated). Cloud DR has become the standard for organizations with cloud-native or hybrid infrastructure.

Failover clustering. For systems that require near-zero RTO, active-active or active-passive clustering provides automatic failover. If the primary system fails, the secondary system assumes operations without manual intervention. Used for databases, application servers, and network infrastructure where even minutes of downtime are unacceptable.
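As a rough illustration of how RTO drives strategy selection, the sketch below maps an RTO to one of the strategies described above. The numeric thresholds are assumptions picked to match the qualitative ranges in this section ("near-zero", "minutes to hours", "hours to days", "days to weeks"). A real selection would also weigh budget, and cloud-based DR can stand in for a hot or warm site depending on how the standby is configured.

```python
from datetime import timedelta

# Assumed upper bounds per strategy, ordered from most to least expensive.
STRATEGY_THRESHOLDS = [
    (timedelta(minutes=5), "failover clustering"),
    (timedelta(hours=4), "hot site"),
    (timedelta(days=3), "warm site"),
]


def suggest_strategy(rto: timedelta) -> str:
    """Return the cheapest strategy whose assumed recovery range covers the RTO."""
    for upper_bound, strategy in STRATEGY_THRESHOLDS:
        if rto <= upper_bound:
            return strategy
    return "cold site"


print(suggest_strategy(timedelta(hours=4)))   # payment processing example
print(suggest_strategy(timedelta(hours=72)))  # marketing website example
```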

Testing

A BC/DR plan that has never been tested is a hypothesis. Testing validates that the plan works under realistic conditions and reveals gaps that documentation review cannot identify.

Tabletop exercise. The BC/DR team walks through a disruption scenario verbally, discussing their response actions, decision points, and communication procedures. Low cost, low disruption, high value for identifying gaps in roles, communication chains, and decision authority. CDA's RGA-D02 mission (Business Continuity Exercise, 16 hours) includes tabletop exercises.

Walkthrough test. Team members physically walk through the recovery procedures (without actually restoring systems) to verify that documentation is accurate, resources are available, and the sequence of steps makes sense.

Simulation test. Execute the recovery procedures for a subset of systems in a non-production environment. Verify that backups restore correctly, that the restored systems function, and that the recovery timeline matches the defined RTO. The backup recovery drill (DPS-D02, 12 hours) is a simulation test for the data restoration component.
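The two checks a simulation test performs (did the data restore correctly, and did the restore finish within the RTO) can be sketched as follows. This is a minimal example under stated assumptions: the manifest format (relative path mapped to a SHA-256 hex digest recorded at backup time) and all function names are hypothetical, and a real drill would also verify that the restored application actually functions.

```python
import hashlib
from datetime import timedelta
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restored_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum differs."""
    failures = []
    for relative_path, expected in manifest.items():
        target = restored_dir / relative_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_path)
    return failures


def drill_passed(failures: list[str], elapsed: timedelta, rto: timedelta) -> bool:
    """A drill passes only if every file verified and the restore met the RTO."""
    return not failures and elapsed <= rto
```

Recording both the failure list and the elapsed time for each drill gives auditors the evidence of tested recovery that a plan document alone cannot provide.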

Full interruption test. Shut down production systems and execute the full recovery plan. The most realistic test and the most disruptive. Typically conducted annually for critical systems, during planned maintenance windows. Full interruption testing is the only test that proves the organization can actually recover under real conditions.

Test frequency. Tabletop exercises should occur quarterly. Simulation tests should occur semi-annually for critical systems. Full interruption tests should occur annually. CDA's engagement tier determines the testing cadence: Confidential tier clients test annually, Secret tier semi-annually, Top Secret tier quarterly.
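Tracking the recommended cadence is straightforward to automate. The sketch below flags overdue tests using the quarterly, semi-annual, and annual intervals given above. The day counts (91, 182, 365) approximate those intervals, and the names are hypothetical.

```python
from datetime import date, timedelta

# Approximate day counts for quarterly, semi-annual, and annual cadences.
CADENCE_DAYS = {
    "tabletop": 91,
    "simulation": 182,
    "full_interruption": 365,
}


def next_due(test_type: str, last_run: date) -> date:
    """Date by which the next test of this type should be completed."""
    return last_run + timedelta(days=CADENCE_DAYS[test_type])


def overdue_tests(last_runs: dict[str, date], today: date) -> list[str]:
    """Test types whose due date has already passed, in alphabetical order."""
    return sorted(
        test_type
        for test_type, last_run in last_runs.items()
        if next_due(test_type, last_run) < today
    )


last_runs = {
    "tabletop": date(2024, 1, 1),
    "simulation": date(2024, 1, 1),
    "full_interruption": date(2024, 1, 1),
}
print(overdue_tests(last_runs, today=date(2024, 5, 1)))
```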

Why It Matters

Ransomware Is the Primary Trigger

Ransomware is now the most common trigger for BC/DR activation in the private sector. A ransomware event that encrypts production systems, deletes backups, and demands payment is functionally equivalent to a physical disaster that destroys the data center: the systems are unavailable, the data is inaccessible, and the organization must execute recovery from whatever backup infrastructure survived.

The Change Healthcare incident (2024) demonstrated the consequences of inadequate BC/DR at scale. A single ransomware event disrupted healthcare payment processing across the United States for weeks, affecting millions of patients and thousands of providers. Total costs exceeded $1 billion. The incident exposed that the organization's recovery capability did not match the criticality of its role in the healthcare infrastructure.

Organizations that survive ransomware without paying the ransom are organizations with tested backup architecture (DPS) and executable recovery procedures (BC/DR). Both are required. Backups without recovery procedures are raw materials without a construction plan. Recovery procedures without backups are a plan without materials.

Regulatory Requirements

BC/DR appears in every major compliance framework. The NIST CSF 2.0 Recover (RC) function covers recovery plan execution and recovery communication. ISO 27001 A.5.29 (Information Security During Disruption) and A.5.30 (ICT Readiness for Business Continuity) require BC/DR planning and testing. PCI DSS Requirement 12.10 (incident response) includes business recovery and continuity procedures. HIPAA requires contingency planning including data backup, disaster recovery, and emergency mode operations. SOC 2 CC9.1 covers risk mitigation for business disruptions, and the Availability criteria (A1.2, A1.3) cover recovery infrastructure and recovery plan testing. Federal agencies follow NIST SP 800-34 (Contingency Planning Guide) to meet their contingency planning obligations.

Auditors evaluate not just whether a plan exists but whether it has been tested, when the last test occurred, what the results were, and whether gaps identified in testing were remediated. An untested plan is a finding in every serious audit.

Beyond Cyber

BC/DR is not exclusively a cybersecurity control. Natural disasters (hurricane, earthquake, flood, fire), infrastructure failures (power outage, ISP outage, cloud provider incident), supply chain disruptions, and pandemic-related workforce disruptions all trigger BC/DR. An organization with a comprehensive BC/DR program is resilient against every category of disruption.

The February 2021 Texas power crisis affected data centers, cloud regions, and thousands of businesses. The July 2024 CrowdStrike update incident caused a global outage affecting 8.5 million Windows devices, disrupting airlines, hospitals, banks, and government agencies. Neither was a cyberattack. Both required BC/DR activation. Resilience is category-agnostic.

CDA Perspective

BC/DR sits in the RGA (Risk Governance and Assurance) domain of the Planetary Defense Model. RGA is the strategic envelope: it ensures the governance structures exist to sustain operations through disruption. BC/DR is the specific RGA function that plans for, tests, and executes operational continuity.

CDA's Perpetual Compliance Assurance (PCA) methodology applies to BC/DR through continuous plan maintenance and recurring testing. "Compliance is not an event. It is a state." A BC/DR plan that was written in 2023 and not updated since is a plan for an environment that no longer exists. PCA ensures that BC/DR documentation stays current with infrastructure changes, that testing occurs on the defined cadence, and that test findings are remediated.

Two TOP missions connect directly to BC/DR:

RGA-B03 (Business Continuity Planning): Develop the BCP/DR plan. Conduct the BIA. Define RTOs and RPOs per system and business function. Document recovery procedures. Identify alternate processing strategies. Establish communication plans. 32 estimated hours.
RGA-D02 (Business Continuity Exercise): Test the plan. Execute tabletop exercises, simulation tests, and (for mature programs) full interruption tests. Identify gaps. Remediate. 16 estimated hours.

The interaction with adjacent DPS missions is direct. DPS-B04 (Backup and Recovery Architecture, 24 hours) builds the backup infrastructure that BC/DR depends on for data restoration. DPS-D02 (Backup Recovery Drill, 12 hours) tests the backup component specifically. BC/DR and backup are not the same thing (BC/DR is broader: it includes business process continuity, communication, alternate processing, and organizational recovery beyond just data restoration), but BC/DR without tested backups is a plan without a foundation.

The asteroid metaphor from the PDM applies directly. Asteroids are low-probability, high-impact events. Most miss. Some hit. The ones that hit can cause extinction-level damage. BC/DR is asteroid defense: track the risk, prepare the response, rehearse the recovery, and hope you never execute it for real. The organizations that have rehearsed survive the impact. The ones that have not do not.

Key Takeaways

BC/DR ensures the organization can maintain critical operations during disruption and restore full capability afterward. BCP addresses business function continuity. DR addresses IT system and data recovery.
The BIA defines RTOs (time to recover) and RPOs (data loss tolerance) per business function. These are business decisions that drive technical requirements.
Recovery strategies range from hot sites (minutes to hours, highest cost) to cold sites (days to weeks, lowest cost). Cloud-based DR offers flexibility and cost efficiency for most organizations.
Testing is mandatory. The recommended cadence is a tabletop exercise quarterly, a simulation test semi-annually, and a full interruption test annually. An untested plan is a hypothesis.
Ransomware is the most common BC/DR trigger. Organizations that survive ransomware without paying have tested backup architecture (DPS) and executable recovery procedures (BC/DR).

Sources

National Institute of Standards and Technology (NIST). "Contingency Planning Guide for Federal Information Systems: SP 800-34 Rev. 1." U.S. Department of Commerce, May 2010.
National Institute of Standards and Technology (NIST). "Cybersecurity Framework (CSF) 2.0: RC (Recover)." U.S. Department of Commerce, 2024.
International Organization for Standardization. "ISO 22301:2019: Security and Resilience: Business Continuity Management Systems: Requirements." ISO, 2019.
International Organization for Standardization. "ISO/IEC 27001:2022, Annex A.5.29 and A.5.30." ISO, 2022.
UnitedHealth Group. "Form 10-K: Annual Report, 2024." SEC Filing. (Change Healthcare incident timeline and cost disclosure.)
