
# TOP Mission TID-D01: Incident Post-Mortem Process

Definition

A structured incident post-mortem is a formal, time-bounded review conducted after a security incident closes. Its purpose is to extract accurate causal understanding, identify control failures, and produce documented improvements that reduce the probability or impact of similar events. The process exists because incident response, however effective, terminates with containment and recovery, not with learning. Without a deliberate review mechanism, organizations repeat failures, accumulate unresolved technical debt in their defenses, and lose institutional knowledge when personnel change.

TID-D01 operationalizes this review as a repeatable mission within CDA's Theater of Operations Playbook, ensuring that every significant incident produces actionable intelligence rather than a report that is filed and never acted on.

An incident post-mortem, also called an after-action review (AAR) in some frameworks, is a structured analytical process applied to a resolved security incident. Its scope spans the full incident lifecycle: the conditions that enabled the event, the detection timeline, the response actions taken, the decisions made under uncertainty, and the recovery steps executed.

This mission is specifically concerned with blameless post-mortems. The blameless model, closely associated with site reliability engineering and popularized in the technology industry by organizations including Google and Etsy, holds that individuals make decisions based on the information and tools available to them at the time. When failures occur, the subject of analysis is the system, the process, or the environment, not the individual. This distinction is critical because blame-oriented reviews suppress honest disclosure, and without honest disclosure the post-mortem loses the evidence it depends on.

How It Works

The post-mortem process begins the moment an incident is closed and a severity determination is made. TID-D01 structures the process into five sequential phases, each with specific deliverables and quality gates that prevent progression to the next phase until completion criteria are met.

Phase 1: Scheduling and Scoping

Within 24 to 72 hours of incident closure, the incident commander or a designated post-mortem facilitator schedules the review meeting and defines scope. Scope determination answers four questions: which systems were involved, which teams participated in response, what the business impact was, and whether external parties (regulators, customers, vendors) were affected. The facilitator role is critical. This person runs the meeting, enforces the blameless norm, and owns the final document. The facilitator should not be the primary incident responder: that person's own decisions are under review, so they cannot both lead the meeting objectively and participate in it candidly.

The scoping phase also establishes the post-mortem classification. Category A incidents (data exfiltration, ransomware deployment, regulatory notification triggers) require full post-mortems with executive review and 30-day follow-up cycles. Category B incidents (failed attacks with control gaps exposed, privilege escalations contained before lateral movement) warrant abbreviated post-mortems focused on the specific control failure. Category C incidents (reconnaissance activity, failed phishing attempts) may be aggregated for quarterly batch review to identify systematic patterns.
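The three-tier scoping rule lends itself to a small decision function, which helps different facilitators classify the same facts the same way. The sketch below is illustrative only: the `IncidentFacts` fields, the `classify` helper, and the rule that the most severe matching criterion wins are assumptions, not part of a published CDA template.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    A = "A"
    B = "B"
    C = "C"


@dataclass(frozen=True)
class IncidentFacts:
    # Hypothetical flags an incident commander might record at closure.
    data_exfiltrated: bool = False
    ransomware_deployed: bool = False
    regulatory_notification_required: bool = False
    control_gap_exposed: bool = False
    privilege_escalation_contained: bool = False


# Review requirements per category, paraphrasing the scoping rules above.
REQUIREMENTS = {
    Category.A: "full post-mortem, executive review, 30-day follow-up cycles",
    Category.B: "abbreviated post-mortem focused on the specific control failure",
    Category.C: "aggregate into quarterly batch review",
}


def classify(facts: IncidentFacts) -> Category:
    """Return the post-mortem category, checking the most severe criteria first."""
    if (facts.data_exfiltrated or facts.ransomware_deployed
            or facts.regulatory_notification_required):
        return Category.A
    if facts.control_gap_exposed or facts.privilege_escalation_contained:
        return Category.B
    # Anything else (reconnaissance, failed phishing) is batched for quarterly review.
    return Category.C


category = classify(IncidentFacts(privilege_escalation_contained=True))
print(category.value, "->", REQUIREMENTS[category])
```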

Phase 2: Timeline Construction

Before the meeting, the facilitator collects raw data from all available sources: SIEM logs, ticketing system entries, chat logs (Slack, Teams, or equivalent), email threads, firewall and EDR telemetry, and verbal accounts from responders. These data points are assembled into a chronological timeline with timestamps accurate to the minute where possible. The timeline construction phase is where most post-mortems succeed or fail. Organizations that skip this step and attempt to reconstruct events during the meeting produce timelines corrupted by hindsight bias and incomplete recollection.
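A minimal sketch of the merge step, assuming each source has already been parsed into timezone-aware timestamps. The `Event` type and `build_timeline` function are hypothetical names; real SIEM, EDR, and chat exports each need their own parser. Normalizing to UTC before sorting matters because chat platforms and ticketing systems often record local time, and mixing zones silently reorders events.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    timestamp: datetime  # must be timezone-aware
    source: str          # e.g. "SIEM", "EDR", "chat"
    description: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one chronological timeline.

    Timestamps are converted to UTC and truncated to the minute.
    """
    merged: list[Event] = []
    for events in sources:
        for event in events:
            if event.timestamp.tzinfo is None:
                # A naive timestamp cannot be ordered safely against other zones.
                raise ValueError(f"naive timestamp from {event.source}: {event.description}")
            minute = event.timestamp.astimezone(timezone.utc).replace(second=0, microsecond=0)
            merged.append(Event(minute, event.source, event.description))
    # sorted() is stable, so events in the same minute keep their input order.
    return sorted(merged, key=lambda e: e.timestamp)


edr = [Event(datetime.fromisoformat("2024-01-01T02:14:30+00:00"), "EDR", "File encryption begins")]
siem = [Event(datetime.fromisoformat("2024-01-01T04:47:00+00:00"), "SIEM", "First alert fires")]
chat = [Event(datetime.fromisoformat("2024-01-01T05:52:00+01:00"), "chat", "Analyst acknowledges alert")]

for e in build_timeline(siem, chat, edr):
    print(e.timestamp.strftime("%H:%M"), e.source, e.description)
```

Rejecting naive timestamps outright, rather than guessing a zone, keeps an unknown offset from quietly corrupting the order of events.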

A concrete example illustrates why precision matters. In a ransomware event where encryption began at 02:14 and detection did not occur until 04:47, the two-hour-and-thirty-three-minute gap is the central analytical fact. Every question in the post-mortem flows from that gap: why no alert fired, what telemetry existed but went unreviewed, and what the on-call analyst could see during that window. The timeline also establishes what information was available to decision-makers at each point. If backup verification took 45 minutes and began at 05:30, the decision at 06:15 to rebuild rather than restore came as soon as verification results were available and was reasonable given the information at hand. Without those timestamps, the same decision could easily be misread as a delay.
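The arithmetic in this example is simple, but writing it down removes ambiguity during the meeting. The sketch assumes `HH:MM` timestamps taken from the verified timeline and intervals shorter than 24 hours; the `gap` and `idle_after` helpers are illustrative names.

```python
from datetime import datetime, timedelta

FMT = "%H:%M"


def gap(start: str, end: str) -> timedelta:
    """Elapsed time between two HH:MM timeline entries less than 24 hours apart."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    if delta < timedelta(0):
        delta += timedelta(days=1)  # an earlier clock time means midnight was crossed
    return delta


def idle_after(step_start: str, step_minutes: int, decision: str) -> timedelta:
    """Time between a prerequisite step finishing and the decision that depended on it.

    A negative result means the decision was taken before the step finished.
    """
    finished = datetime.strptime(step_start, FMT) + timedelta(minutes=step_minutes)
    return datetime.strptime(decision, FMT) - finished


print(gap("02:14", "04:47"))             # 2:33:00 (encryption to detection)
print(idle_after("05:30", 45, "06:15"))  # 0:00:00 (decision made as verification finished)
```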

Phase 3: The Post-Mortem Meeting

The meeting follows a structured agenda designed to extract maximum information while maintaining psychological safety. The facilitator opens by stating the blameless norm explicitly: "We are here to understand how our systems and processes performed, not to evaluate individual performance. Every person in this room made the best decisions they could with the information available to them."

The timeline is reviewed collectively, with participants adding missing detail and correcting errors. The facilitator then guides the group through structured questions: What went well? What went poorly? Where did our tools fail us? Where did our processes fail us? Where did our documentation fail us? What would we do differently knowing what we know now?

The facilitator documents all inputs in real time, projecting the working document for the group. No conclusion is accepted without supporting evidence from the timeline. Speculative causation ("probably the firewall rule") is recorded separately as a hypothesis requiring verification, not as a finding. The "five whys" technique is applied to each identified failure. If the detection gap is the problem, the team asks why the alert did not fire. The answer might be that the detection rule was not enabled. Why was it not enabled? Because the rule was added to the standard configuration baseline during the last update but was never applied to hosts in the legacy segment. Why were legacy hosts excluded? Because no one owned the task of applying new rules to that segment.
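One way to enforce the evidence rule is to store each "why" step with references to timeline entries and route anything unsupported to a separate hypothesis list. This is a sketch under stated assumptions: the `CausalAnalysis` class and the evidence identifiers (`TL-07`, `INT-03`, and so on) are hypothetical, and the model is simplified to a single linear causal chain.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Why:
    question: str
    answer: str
    evidence: list[str]  # references to verified timeline entries or interviews


@dataclass
class CausalAnalysis:
    problem: str
    chain: list[Why] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)

    def ask(self, question: str, answer: str, evidence: Optional[list[str]] = None) -> None:
        """Accept an answer as a finding only if it cites evidence; otherwise log a hypothesis."""
        if evidence:
            self.chain.append(Why(question, answer, list(evidence)))
        else:
            self.hypotheses.append(answer)

    def root_cause(self) -> Optional[str]:
        """The deepest evidence-backed answer, or None if nothing is verified yet."""
        return self.chain[-1].answer if self.chain else None


analysis = CausalAnalysis("No alert fired between 02:14 and 04:47")
analysis.ask("Why did no alert fire?", "The detection rule was not enabled on the affected hosts", ["TL-07"])
analysis.ask("Why was it not enabled?", "The rule was never applied to legacy segment hosts", ["TL-02"])
analysis.ask("Why were legacy hosts excluded?", "No one owned applying new rules to that segment", ["INT-03"])
analysis.ask("What else could explain the gap?", "Probably the firewall rule")  # no evidence

print(analysis.root_cause())  # No one owned applying new rules to that segment
print(analysis.hypotheses)    # ['Probably the firewall rule']
```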

Meeting participants include all personnel who participated in response, the system owners for affected infrastructure, and a representative from management with authority to commit resources for corrective actions. The meeting duration should not exceed two hours for Category B incidents or four hours for Category A incidents. Longer meetings produce diminishing returns as participant attention degrades.

Phase 4: Documentation and Corrective Action Assignment

The post-mortem document is finalized within five business days. It contains: an executive summary (one paragraph), the verified timeline, the root cause statement, contributing factors, what went well, corrective actions with owners and due dates, and metrics that will confirm each corrective action is complete. The document quality standard requires that a technical professional unfamiliar with the incident could read the post-mortem and understand exactly what occurred, why it occurred, and what specific actions will prevent recurrence.
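Two parts of this standard are mechanical enough to check automatically: the five-business-day deadline and the presence of every required section. The sketch assumes the clock starts on the day of the post-mortem meeting and ignores public holidays; the section names and helper functions are illustrative, not a CDA schema.

```python
from datetime import date, timedelta

REQUIRED_SECTIONS = (
    "executive summary", "timeline", "root cause", "contributing factors",
    "what went well", "corrective actions", "verification metrics",
)


def add_business_days(start: date, days: int) -> date:
    """Count forward `days` weekdays (Monday to Friday); holidays are not considered."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            days -= 1
    return current


def missing_sections(document: dict[str, str]) -> list[str]:
    """Return required sections that are absent or contain only whitespace."""
    return [s for s in REQUIRED_SECTIONS if not document.get(s, "").strip()]


print(add_business_days(date(2024, 1, 5), 5))  # 2024-01-12 (meeting on a Friday)
draft = {"executive summary": "Ransomware on file servers...", "timeline": "02:14 ..."}
print(missing_sections(draft))
```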

Corrective actions must be specific and verifiable. "Improve detection" is not a corrective action. "Deploy detection rule CIS-SEC-004 to all hosts in network segment 172.16.8.0/24 by [date], verified by a scan of that segment confirming rule presence" is a corrective action. Each action requires four elements: a technical specification, a named owner (an individual, not a team), a due date, and a verification method that produces a pass/fail result.
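These requirements can be checked as a first-pass lint before a human reviewer reads each action: a specific technical description, a named individual as owner, a due date, and a verification method. The sketch below is hypothetical: the word lists used to flag vague verbs and team-style owners are assumptions that would need tuning, and no script can judge whether a verification method truly yields pass/fail, so it only confirms one is present.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical heuristics; tune both lists to local writing habits.
VAGUE_START = re.compile(r"^\s*(improve|enhance|strengthen|review|look into)\b", re.IGNORECASE)
TEAM_WORDS = re.compile(r"\b(team|group|department|squad)\b", re.IGNORECASE)


@dataclass
class CorrectiveAction:
    specification: str
    owner: str
    due: Optional[date]
    verification: str


def review(action: CorrectiveAction) -> list[str]:
    """Return a list of problems; an empty list means the action passes the lint."""
    issues = []
    if not action.specification.strip() or VAGUE_START.match(action.specification):
        issues.append("specification is vague or missing")
    if not action.owner.strip() or TEAM_WORDS.search(action.owner):
        issues.append("owner must be a named individual, not a team")
    if action.due is None:
        issues.append("due date is missing")
    if not action.verification.strip():
        issues.append("verification method is missing")
    return issues


good = CorrectiveAction(
    "Deploy detection rule CIS-SEC-004 to all hosts in 172.16.8.0/24",
    "Jane Doe",  # hypothetical owner
    date(2024, 3, 1),
    "Scan of 172.16.8.0/24 confirms rule presence on every host",
)
vague = CorrectiveAction("Improve detection", "SOC team", None, "")
print(review(good))   # []
print(review(vague))
```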

The root cause statement distinguishes between the immediate technical cause (the specific control that failed) and the systemic cause (the process or organizational factor that allowed the control to remain vulnerable). A SQL injection attack might have an immediate technical cause of unvalidated user input, but a systemic cause of no secure development lifecycle requirements for externally facing applications.

Phase 5: Tracking and Closure

Corrective actions are entered into the organization's ticketing or project management system with specific due dates. Progress is reviewed at a standing meeting, typically monthly, until all items close. Incomplete items at the 90-day mark are escalated to the executive sponsor defined in the mission prerequisites. This tracking loop is what separates a post-mortem process from a post-mortem report: a document that, on its own, generates no change.

The closure phase includes an effectiveness review conducted 60 days after all corrective actions are complete. This review asks whether the implemented changes would have prevented the original incident or detected it sooner. Organizations that skip this verification step often discover later that corrective actions were implemented but did not address the actual failure mode.
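The two time-based rules in this phase, escalation at 90 days and an effectiveness review 60 days after the last closure, can be computed from ticket dates. This sketch assumes the 90-day clock runs from the date each action was opened, since the process description does not specify the start point; `TrackedAction` and the ticket IDs are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

ESCALATION_AGE = timedelta(days=90)
EFFECTIVENESS_DELAY = timedelta(days=60)


@dataclass
class TrackedAction:
    ticket: str
    opened: date
    closed: Optional[date] = None


def needs_escalation(actions: list[TrackedAction], today: date) -> list[str]:
    """Open actions at or beyond the 90-day mark go to the executive sponsor."""
    return [a.ticket for a in actions
            if a.closed is None and today - a.opened >= ESCALATION_AGE]


def effectiveness_review_date(actions: list[TrackedAction]) -> Optional[date]:
    """60 days after the last action closes; None while any action is still open."""
    if not actions or any(a.closed is None for a in actions):
        return None
    return max(a.closed for a in actions) + EFFECTIVENESS_DELAY


actions = [
    TrackedAction("PM-101", date(2024, 1, 2), closed=date(2024, 2, 1)),
    TrackedAction("PM-102", date(2024, 1, 2)),
]
print(needs_escalation(actions, date(2024, 4, 1)))  # ['PM-102'] (exactly 90 days open)
print(effectiveness_review_date(actions))           # None (PM-102 still open)
```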

Why It Matters

Organizations that skip post-mortems or conduct them without follow-through pay compounding costs. Each unaddressed control gap from one incident remains available to the next threat actor. When a gap enables a second incident, the organization absorbs the full cost of that event in addition to the reputational damage of repeating a known failure. The business impact extends beyond immediate incident costs to include regulatory scrutiny, customer trust degradation, and board-level confidence erosion.

A well-documented example is the 2017 Equifax breach. The Apache Struts vulnerability (CVE-2017-5638) exploited in that breach had a patch available 63 days before exploitation began. Equifax's internal review, and the congressional investigation that followed, found that the organization's patch management process had a specific failure mode: it depended on manual notification and follow-up, and a breakdown in that process left this particular patch unapplied. An effective post-mortem process applied to a prior patching failure, even a minor one, might have surfaced that systemic weakness before it became a data exposure affecting roughly 147 million people.

The cost of that systemic weakness, measured after the breach, included a reported $1.4 billion in breach-related costs, a 2019 settlement with the Federal Trade Commission, the Consumer Financial Protection Bureau, and U.S. states of up to $700 million (including up to $425 million for consumer restitution), and reputational damage that is difficult to quantify. A post-mortem culture that identified and corrected the patch management failure mode after a minor incident could have avoided much of this compounding cost.

A common misconception is that post-mortems are valuable only for catastrophic incidents. This is incorrect. Minor incidents, including failed phishing attempts that nevertheless exposed a credential, or a brief unauthorized configuration change that was quickly reverted, often contain early indicators of systemic weaknesses. An aggregate post-mortem reviewing ten minor authentication events in a quarter may reveal that a specific application is consistently targeted because its login page is indexed by search engines, a correctable issue that no single incident review would surface.
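Aggregate review starts with simple counting. The sketch below groups minor events by calendar quarter, event category, and target, and surfaces any group that meets a threshold. The threshold of three events and the `MinorEvent` fields are assumptions chosen for illustration, not a CDA standard.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MinorEvent:
    day: date
    category: str  # e.g. "authentication", "config-change"
    target: str    # application or asset name


def quarter(d: date) -> str:
    """Calendar quarter label, e.g. 2024-Q1."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def clusters(events: list[MinorEvent], threshold: int = 3) -> dict[tuple[str, str, str], int]:
    """Count events per (quarter, category, target) and keep groups at or above threshold."""
    counts = Counter((quarter(e.day), e.category, e.target) for e in events)
    return {key: n for key, n in counts.items() if n >= threshold}


events = [MinorEvent(date(2024, 1, day), "authentication", "customer-portal") for day in range(1, 11)]
events.append(MinorEvent(date(2024, 2, 3), "authentication", "hr-app"))
print(clusters(events))  # {('2024-Q1', 'authentication', 'customer-portal'): 10}
```

A cluster like this one only says where to look; the aggregate post-mortem still has to establish why that application attracts repeated attempts.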

Another misconception is that the post-mortem document itself is the product. The document is a record. The product is the executed corrective actions. Organizations that file post-mortem reports without tracking completion produce documentation that creates regulatory liability (evidence that problems were identified and not remediated) without security benefit. This documentation risk is particularly acute in regulated industries where post-mortem findings can become evidence of negligence in litigation or regulatory enforcement actions.

Without this mission, threat intelligence programs accumulate knowledge about external threats while remaining blind to internal failure patterns. The most actionable intelligence an organization can possess is a precise understanding of how its own defenses fail under real-world conditions.

CDA Perspective

CDA addresses incident post-mortems through the Threat Intelligence and Defense (TID) domain of the Planetary Defense Model (PDM). The placement of TID-D01 in TID reflects a deliberate analytical position: post-mortem outputs are internal threat intelligence. The failure modes, detection gaps, and response delays documented in a post-mortem are as operationally significant as any external threat feed and often more actionable because they describe the specific attack surface of the organization rather than generalized threat patterns.

CDA's methodology, Predictive Defense Intelligence (PDI), is organized around the principle of seeing the threat before it sees you. Post-mortems are a direct input to that predictive capability. When a post-mortem documents that an attacker dwelled in an environment for 19 days before detection, that dwell time is a data point that PDI uses to calibrate detection thresholds, review logging coverage, and update the threat models applied to similar environments across the organization's attack surface.

What CDA does differently is treat the post-mortem as a structured intelligence collection event rather than an administrative requirement. Each post-mortem produces a set of structured findings that are classified by control domain, mapped to MITRE ATT&CK techniques where applicable, and ingested into CDA's threat modeling cycle. A detection failure in one incident becomes a hypothesis to test across all similar environments. A social engineering vector that succeeded against one team triggers awareness and simulation activities across adjacent teams.
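A structured finding might look like the record below: a control domain, a summary, and MITRE ATT&CK technique IDs, serializable for ingestion and expandable into one test hypothesis per similar environment. The field names, incident ID, environment names, and `hypotheses_for` method are hypothetical; only the ATT&CK ID format (`T` plus four digits, with an optional three-digit sub-technique) and technique T1486, Data Encrypted for Impact, come from ATT&CK itself.

```python
import json
import re
from dataclasses import asdict, dataclass, field

TECHNIQUE_ID = re.compile(r"T\d{4}(?:\.\d{3})?")  # e.g. T1566 or T1566.001


@dataclass
class Finding:
    incident_id: str
    control_domain: str  # e.g. "detection", "identity", "patching"
    summary: str
    attack_techniques: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject malformed IDs early so downstream mapping stays clean.
        bad = [t for t in self.attack_techniques if not TECHNIQUE_ID.fullmatch(t)]
        if bad:
            raise ValueError(f"not ATT&CK technique IDs: {bad}")

    def hypotheses_for(self, environments: list[str]) -> list[str]:
        """Turn this finding into one test hypothesis per similar environment."""
        techniques = ", ".join(self.attack_techniques) or "unmapped"
        return [
            f"{env}: does the {self.control_domain} gap from {self.incident_id} ({techniques}) exist here?"
            for env in environments
        ]


finding = Finding("INC-0042", "detection", "Encryption ran 2h33m before first alert", ["T1486"])
print(json.dumps(asdict(finding)))
for hypothesis in finding.hypotheses_for(["branch-office-hosts", "legacy-segment"]):
    print(hypothesis)
```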

CDA also enforces a quality standard for corrective actions that exceeds conventional industry practice. Vague corrective actions are rejected at the documentation stage. The facilitator, working from CDA's post-mortem template, is required to document each action with a specific technical description, a named owner, a completion date, and a verification method. This rigor is not administrative overhead; it is the mechanism by which post-mortem findings become permanent improvements rather than temporary intentions.

Finally, CDA integrates post-mortem findings into its quarterly threat briefings to executive stakeholders. This connection between incident-level detail and strategic security investment decisions is where TID-D01 produces its highest organizational value. Executive teams receive not just threat intelligence about external adversaries but precise intelligence about which internal controls fail under pressure and what investments are required to address those failures.

Key Takeaways

Assign a dedicated facilitator who was not the primary incident responder; this separation preserves objectivity and allows responders to participate honestly without managing the meeting
Build the incident timeline before the post-mortem meeting using raw log and communication data; a timeline constructed from memory during the meeting will contain gaps and distortions that corrupt root cause analysis
Write every corrective action with a named owner, a specific technical description, a due date, and a verification method; anything less is a wish, not an action item
Track corrective action completion in your existing ticketing system and review progress monthly; the 90-day mark is your escalation threshold, not your deadline extension
Conduct aggregate post-mortems on clusters of minor incidents at least quarterly; individual events rarely reveal systemic patterns, but clusters of similar minor events often do

Related CDA Missions

TOP Mission TID-D02: Threat Intelligence Feed Management
TOP Mission IRP-D01: Incident Response Plan Development and Testing
Predictive Defense Intelligence (PDI): See the Threat First
CDA PDM Domain: Threat Intelligence and Defense (TID) Overview
MITRE ATT&CK Integration for Internal Threat Intelligence

Sources

NIST Special Publication 800-61 Revision 2, "Computer Security Incident Handling Guide." National Institute of Standards and Technology, August 2012. https://csrc.nist.gov/publications/detail/sp/800-61/rev-2/final

Center for Internet Security, "CIS Controls Version 8." CIS Control 17: Incident Response Management, May 2021. https://www.cisecurity.org/controls/v8

MITRE ATT&CK Framework, Enterprise Matrix. The MITRE Corporation. https://attack.mitre.org/

ISO/IEC 27035-1:2023, "Information Technology -- Information Security Incident Management -- Part 1: Principles and Process." International Organization for Standardization. https://www.iso.org/standard/78973.html

Google Site Reliability Engineering, "Postmortem Culture: Learning from Failure." Google SRE Book, Chapter 15. https://sre.google/sre-book/postmortem-culture/
