# Collection Techniques (MITRE ATT&CK TA0009)

Definition

Collection is the tactic that bridges discovery and exfiltration. In MITRE ATT&CK, tactic TA0009 covers the techniques adversaries use to gather data before moving it out of the target environment. The attacker has completed discovery (TA0007), knows where data lives and what is worth taking, and now systematically acquires it.

Collection is distinct from exfiltration (TA0010). Collection is the act of identifying and assembling the target data. Exfiltration is the act of moving it outside the victim environment. The distinction matters operationally because many attacks involve an extended collection phase, sometimes spanning weeks or months, before any data leaves the network. Organizations that detect exfiltration have detected the end of the collection phase. Organizations that detect collection behaviors have an opportunity to intervene before any data leaves.

MITRE ATT&CK TA0009 covers a wide range of collection techniques: email harvesting, data theft from information repositories like SharePoint and Confluence, collection from cloud storage services, local file system collection, clipboard and screen capture, input capture via keylogging, and data staging in preparation for exfiltration. Each technique reflects a different data type or collection context.

Within the Planetary Defense Model (PDM), collection detection is a TID responsibility governed by Predictive Defense Intelligence (PDI). The operational mandate is to detect collection behaviors before exfiltration completes. The Change Healthcare breach (2024) demonstrated the consequences of missing the collection phase: attackers maintained access for approximately nine days, collecting over six terabytes of sensitive healthcare data before ransomware deployment exposed the breach.

How It Works

Email Collection

Email is one of the highest-value data targets in most organizations. Corporate email contains financial information, personnel data, legal communications, merger and acquisition planning, intellectual property discussions, credentials sent in cleartext, and the operational context that makes other data interpretable. Attackers treat email as a primary collection target.

MITRE ATT&CK T1114 covers three distinct email collection sub-techniques:

T1114.001 (Local Email Client Data) targets email stored on the local machine. Microsoft Outlook stores email in PST (Personal Storage Table) files and OST (Offline Storage Table) files. An attacker with access to the filesystem can copy these files directly and access the entire email history of the account, including deleted items and sent messages. This technique requires no special access beyond the user's file system permissions.

T1114.002 (Remote Mailbox Access) targets the mail server directly rather than the local client. Protocols including MAPI (Messaging Application Programming Interface), Exchange Web Services (EWS), and Microsoft Graph API allow authenticated access to mailboxes hosted in Exchange or Microsoft 365. An attacker with valid credentials, including OAuth tokens obtained through consent phishing or token theft, can programmatically access and download entire mailboxes without touching the local machine. This technique scales: a single set of administrative credentials can be used to access and download the mailboxes of every user in the organization.

T1114.003 (Email Forwarding Rule) is particularly insidious because it converts the victim's email system into an ongoing collection mechanism. The attacker creates an inbox rule that automatically forwards copies of incoming email to an external address. Once the rule is established, the attacker no longer needs active access to the environment: every email the victim receives is copied to the attacker's collection point. Forwarding rules can be configured to forward all email, email from specific senders, or email matching keywords.
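From the defender's side, the forwarding behavior described above can be checked mechanically once inbox rules have been exported. The following is a minimal sketch, assuming rules are available as dictionaries shaped like Microsoft Graph messageRule objects (forwarding actions holding lists of recipients with an emailAddress.address field). The function name, the sample rule, and the domain names are hypothetical and used only for illustration.

```python
from typing import Iterable

# Graph messageRule action fields that send mail on to other recipients.
FORWARDING_ACTIONS = ("forwardTo", "forwardAsAttachmentTo", "redirectTo")


def external_forward_targets(rule: dict, internal_domains: Iterable[str]) -> list[str]:
    """Return forwarding addresses in `rule` whose domain is not internal."""
    internal = {domain.lower() for domain in internal_domains}
    actions = rule.get("actions") or {}
    external = []
    for field in FORWARDING_ACTIONS:
        for recipient in actions.get(field) or []:
            address = (recipient.get("emailAddress") or {}).get("address", "")
            # Everything after the last "@"; a malformed address counts as external.
            domain = address.rpartition("@")[2].lower()
            if domain and domain not in internal:
                external.append(address)
    return external


# Hypothetical rule: forwards keyword-matching mail externally, then deletes it.
sample_rule = {
    "displayName": ".",
    "isEnabled": True,
    "conditions": {"subjectContains": ["invoice", "wire"]},
    "actions": {
        "forwardTo": [{"emailAddress": {"address": "drop@attacker.example"}}],
        "delete": True,
    },
}

print(external_forward_targets(sample_rule, ["corp.example"]))
```

A rule that both forwards externally and deletes the message, as in the sample, is a stronger signal than external forwarding alone, since some users legitimately forward mail to personal accounts.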

Data from Information Repositories

Modern organizations centralize substantial amounts of institutional knowledge in collaboration and documentation platforms. SharePoint, Confluence, internal wikis, code repositories (GitHub, GitLab, Bitbucket, Azure DevOps), and ticketing systems (Jira, ServiceNow) contain operational documentation, architecture diagrams, runbooks, source code, API keys committed in error, credentials in ticket comments, and strategic planning documents.

MITRE T1213 (Data from Information Repositories) covers collection from these platforms. The attacker uses legitimate authenticated access, often through compromised credentials or a valid OAuth token, to systematically download or export content. Source code repositories are particularly high-value targets because they frequently contain hardcoded secrets (API keys, database connection strings, private keys) that were committed unintentionally.
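Because hardcoded secrets are what make repositories so valuable to an attacker, defenders often scan their own repositories for them first. The sketch below illustrates the idea with three regular expressions. The pattern set is an illustrative assumption, far smaller than what dedicated secret scanners ship, and the function name and sample text are hypothetical.

```python
import re

# Illustrative patterns only; dedicated scanners use much larger rule sets.
SECRET_PATTERNS = {
    # AWS access key IDs: AKIA (long-term) or ASIA (temporary) plus 16 characters.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # PEM private key headers (RSA, EC, OPENSSH, or unqualified).
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----"),
    # Quoted value assigned to a name ending in password/passwd/pwd.
    "password_assignment": re.compile(
        r"(?i)(?:password|passwd|pwd)\s*[=:]\s*['\"][^'\"]{6,}['\"]"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for each pattern match, line by line."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits


# Hypothetical config file content; the key is AWS's documented example value.
sample = '''\
region = "us-east-1"
aws_key = "AKIAIOSFODNN7EXAMPLE"
db_password = "hunter2-example"
'''

print(find_secrets(sample))
```

Pattern matching of this kind produces false positives (example keys, test fixtures), so findings are a starting point for review and credential rotation, not proof of exposure.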

The MOVEit breach (2023) is a relevant example. The Cl0p ransomware group exploited a SQL injection vulnerability (CVE-2023-34362) in Progress Software's MOVEit Transfer, a managed file transfer platform that organizations used to move sensitive data. By compromising MOVEit, the attackers gained access to the files organizations were actively sharing through the platform, effectively turning the file transfer system into a collection mechanism.

Data from Cloud Storage

Cloud storage services including Amazon S3, Microsoft OneDrive, SharePoint Online, and Google Drive have become primary data repositories for most organizations. An attacker with access to cloud credentials, particularly credentials with broad storage permissions, can enumerate and download large quantities of data through standard cloud provider APIs.

MITRE T1530 (Data from Cloud Storage) covers this collection vector. In AWS environments, S3 buckets with permissive IAM policies are a common target. AWS CLI commands like aws s3 sync can copy entire bucket contents to attacker-controlled storage. In Microsoft 365 environments, the Microsoft Graph API provides programmatic access to OneDrive and SharePoint, meaning that any OAuth application with Files.Read.All or Sites.Read.All permission can download data from the entire tenant.

Cloud storage collection is particularly difficult to detect based on access patterns alone because the same API calls used in legitimate bulk operations (backup, archival, migration) are used in collection. Volume, timing, and destination are the primary detection signals.

Data from Local System

T1005 (Data from Local System) is the most direct collection technique: the attacker browses the local file system and copies files of interest to a staging location or directly to an exfiltration channel. Files targeted include documents, spreadsheets, database files, configuration files, private keys, and browser credential stores.

Browser credential stores are a frequently targeted local data source. Chrome, Firefox, Edge, and other browsers store saved passwords in local files that can be accessed with user-level permissions. Credential stealer malware routinely targets browser credential stores as part of automated collection, often within minutes of initial access.

Data Staged

T1074 (Data Staged) covers the staging behavior that often precedes exfiltration: the attacker aggregates collected data in a single location before moving it out. Common staging patterns include creating compressed archives (ZIP, RAR, 7z), sometimes with encryption, in temporary directories or locations with high write volume where archive creation would be less anomalous.

Staging behavior is an important detection signal. Large archive files created by non-administrative accounts in temporary directories, or unusually large data transfers between internal hosts to a single staging host, indicate that collection is being prepared for exfiltration. The staging step often produces more detectable signals than the collection step because it involves moving data in bulk between internal locations before the single exfiltration event.

Clipboard Data and Screen Capture

T1115 (Clipboard Data) and T1113 (Screen Capture) are collection techniques that target data in use rather than data at rest. Clipboard monitoring captures any data the user copies: passwords copied from a password manager, sensitive data copied from a document, API keys copied from a browser. Screen capture records the user's activity at regular intervals or in response to specific triggers (window titles containing keywords like "banking" or "password").

These techniques are common components of commodity information stealers (malware designed to collect credentials and other sensitive data). They are also used in targeted attacks where the attacker wants to capture data the user accesses but does not store in files they can directly enumerate.

Input Capture

T1056 (Input Capture) includes keylogging: recording every keystroke the user enters. Keyloggers capture credentials as they are typed (bypassing browser-based credential stores entirely), capture data entered into web forms, and capture any information the user types including email content, document drafts, and chat messages.

Keyloggers can be implemented at multiple levels: user-space software keyloggers (the most common), kernel-level drivers (more persistent, harder to detect), and hardware keyloggers (physical devices interposed between keyboard and computer, relevant to physical access scenarios).

Why It Matters

The Change Healthcare Collection Phase

The Change Healthcare ransomware attack (2024) is the canonical example of extended collection preceding catastrophic exfiltration. ALPHV/BlackCat affiliates used stolen credentials to access Change Healthcare's network through a Citrix portal with no multi-factor authentication. After gaining initial access, attackers maintained persistent access and moved laterally over a period of approximately nine days before deploying ransomware.

During that nine-day window, the attackers collected over six terabytes of data containing protected health information (PHI) for approximately one-third of all Americans. The data included medical records, insurance information, billing data, and personally identifiable information. The collection phase was completed before the ransomware deployment made the breach apparent.

The lesson is direct: the organization had nine days to detect collection activity and interrupt the attack before the most damaging phase (ransomware detonation and potential data publication). Detection capabilities focused only on ransomware detonation missed the entire collection phase.

Email Forwarding Rules Are Silent and Persistent

Email forwarding rule attacks (T1114.003) are particularly difficult to detect because the malicious behavior is invisible in normal email monitoring. The rule runs server-side, automatically forwarding email without any indication in the victim's sent mail or client-side monitoring. The most reliable detection mechanism is auditing inbox rule creation and modification events.

In Microsoft 365 environments, the Unified Audit Log records inbox rule creation events. Without monitoring these logs, organizations have no visibility into whether forwarding rules are being used to exfiltrate email data. The attacker can create a forwarding rule and remove their active access to the environment. The rule continues operating independently, providing ongoing collection without requiring further attacker presence.

Collection Determines Breach Scope

The scope of a data breach is determined by collection success, not by the initial access technique. An organization that detects collection early interrupts the breach before large quantities of data are assembled and staged. An organization that only detects exfiltration allows the full collected dataset to leave. The difference between a breach involving ten gigabytes and one involving six terabytes is determined by how long the collection phase operated before detection.

Technical Details

Exchange and Microsoft 365 Audit Log Monitoring for Email Collection

Exchange Online and Microsoft 365 Unified Audit Log events relevant to email collection monitoring include: MailboxLogin events from unexpected IP addresses, New-InboxRule creation events (especially rules with external forwarding addresses or conditions that delete email after forwarding to hide the rule's existence), and ExportMailbox or eDiscovery operations initiated from non-standard accounts.
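The first indicator, mailbox logins from unexpected IP addresses, can be sketched as a comparison against known corporate network ranges. This is a minimal illustration, assuming audit records have already been parsed into dictionaries with Operation, UserId, and ClientIP keys. That simplified record shape, the function name, and the addresses (taken from documentation-reserved ranges) are assumptions; real audit records carry more fields and may format the client address differently.

```python
import ipaddress


def logins_outside_expected_networks(records, expected_cidrs):
    """Return (user, ip) pairs for MailboxLogin events outside the expected networks."""
    networks = [ipaddress.ip_network(cidr) for cidr in expected_cidrs]
    flagged = []
    for record in records:
        if record.get("Operation") != "MailboxLogin":
            continue
        client_ip = ipaddress.ip_address(record["ClientIP"])
        # Membership is False when address and network IP versions differ.
        if not any(client_ip in network for network in networks):
            flagged.append((record["UserId"], str(client_ip)))
    return flagged


# Hypothetical, simplified audit records.
sample_records = [
    {"Operation": "MailboxLogin", "UserId": "alice@corp.example", "ClientIP": "192.0.2.15"},
    {"Operation": "MailboxLogin", "UserId": "alice@corp.example", "ClientIP": "203.0.113.77"},
    {"Operation": "New-InboxRule", "UserId": "bob@corp.example", "ClientIP": "203.0.113.77"},
]

print(logins_outside_expected_networks(sample_records, ["192.0.2.0/24"]))
```

A static allow-list is the simplest possible baseline. Organizations with remote workforces would likely need per-user location history instead of fixed network ranges.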

In Microsoft 365, monitoring for OAuth application consent grants is also important: T1114.002 attacks frequently use malicious OAuth applications that request email access permissions and obtain them through consent phishing. Application consent events are recorded in the Microsoft Entra ID audit log (the AuditLogs table in Microsoft Sentinel) and in the Unified Audit Log.
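Consent monitoring can be sketched as a check of granted scopes against a list of permissions that expose mail or files. The example assumes the granted scopes arrive as a space-separated string, which is how delegated permission grants represent them in Microsoft Graph. The high-risk list, the approved-application catalog, and the application names are hypothetical choices for illustration, not a recommended policy.

```python
# Illustrative selection of Graph permissions that expose mail or files.
HIGH_RISK_SCOPES = {
    "Mail.Read",
    "Mail.ReadWrite",
    "MailboxSettings.ReadWrite",
    "Files.Read.All",
    "Sites.Read.All",
}


def review_consent(app_name: str, scope: str, approved_apps: set[str]) -> dict:
    """Summarize a consent grant and flag high-risk scopes held by unapproved apps."""
    flagged = sorted(set(scope.split()) & HIGH_RISK_SCOPES)
    approved = app_name in approved_apps
    return {
        "app": app_name,
        "approved": approved,
        "high_risk_scopes": flagged,
        "alert": bool(flagged) and not approved,
    }


# Hypothetical catalog and consent event.
approved_catalog = {"Corp Backup Service"}
result = review_consent(
    "Free PDF Converter",
    "User.Read offline_access Mail.Read Files.Read.All",
    approved_catalog,
)
print(result)
```

Keeping the approval check separate from the scope check means an approved application that suddenly gains new high-risk scopes still shows them in the summary, even though it does not raise an alert under this simple policy.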

Data Staging Detection Signals

Detection rules for data staging target several behavioral indicators:

Volume anomalies: a host creating archive files significantly larger than its historical baseline.
Time-of-day anomalies: large archive creation outside business hours from user accounts.
Location anomalies: archive files created in temporary directories or locations not associated with legitimate backup or archive operations.
Process anomalies: archive creation by processes (cmd.exe, powershell.exe, or non-standard archive utilities) rather than legitimate backup software.
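The four indicator types can be combined into a single evaluation over file-creation telemetry. The sketch below assumes each event carries a Windows path, a size, a creation time, and the creating process name. The FileEvent type, the business-hours window, the temp-directory names, and the baseline value are all hypothetical parameters chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PureWindowsPath

ARCHIVE_SUFFIXES = {".zip", ".rar", ".7z"}
SCRIPTING_PROCESSES = {"cmd.exe", "powershell.exe"}
TEMP_DIRECTORY_NAMES = {"temp", "tmp"}  # assumed naming convention


@dataclass
class FileEvent:
    path: str
    size_bytes: int
    created: datetime
    process: str


def staging_signals(event: FileEvent, baseline_max_bytes: int,
                    business_hours: tuple[int, int] = (8, 18)) -> list[str]:
    """Return the staging indicators an archive-creation event matches."""
    path = PureWindowsPath(event.path)
    if path.suffix.lower() not in ARCHIVE_SUFFIXES:
        return []  # not an archive, so out of scope for this rule
    signals = []
    if event.size_bytes > baseline_max_bytes:
        signals.append("volume")
    start, end = business_hours
    if not (start <= event.created.hour < end):
        signals.append("time_of_day")
    if any(part.lower() in TEMP_DIRECTORY_NAMES for part in path.parts):
        signals.append("location")
    if event.process.lower() in SCRIPTING_PROCESSES:
        signals.append("process")
    return signals


# Hypothetical event: a 5 GiB archive written to Temp at 02:30 by PowerShell.
event = FileEvent(
    path=r"C:\Users\alice\AppData\Local\Temp\out.7z",
    size_bytes=5 * 1024**3,
    created=datetime(2024, 2, 18, 2, 30),
    process="powershell.exe",
)
print(staging_signals(event, baseline_max_bytes=200 * 1024**2))
```

Returning the list of matched indicators, instead of a single boolean, lets downstream logic alert only when several indicators coincide, which should reduce noise from legitimate one-off archive creation.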

File entropy analysis provides a complementary detection signal: encrypted archives have high entropy, and the sudden creation of high-entropy files in user profile directories is anomalous when the user does not normally create encrypted archives.
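The entropy signal refers to Shannon entropy over the byte values of a file, which ranges from 0 bits per byte (a single repeated value) to 8 bits per byte (all values equally likely). A minimal sketch of the calculation follows. The function name and the sample inputs are illustrative, and any alerting threshold would need tuning against real data.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


plain_text = b"Quarterly revenue summary for internal distribution only. " * 200
random_like = os.urandom(65536)  # stands in for encrypted content

print(f"text:      {shannon_entropy(plain_text):.2f} bits/byte")
print(f"encrypted: {shannon_entropy(random_like):.2f} bits/byte")
```

Compressed but unencrypted archives also score close to 8 bits per byte, so entropy alone cannot separate encrypted from merely compressed data. It is most useful alongside the location and process indicators.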

Cloud Storage Collection Detection

Cloud provider audit logs (AWS CloudTrail, Azure Activity Logs, GCP Cloud Audit Logs) capture the API calls associated with cloud storage collection. Relevant indicators include: S3 GetObject calls at volumes significantly above baseline, ListBuckets combined with bulk GetObject calls from a single credential, Microsoft Graph API file access by OAuth applications that hold Files.Read.All but are not in the approved application catalog, and data download volumes from cloud storage that exceed typical user activity. Note that S3 object-level operations such as GetObject are CloudTrail data events, which are not logged by default and must be enabled explicitly.

Behavioral baselines are essential because the API calls themselves are identical to legitimate operations. The detection relies on identifying activity that deviates from the established patterns of legitimate use.
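One simple form of behavioral baseline is a per-credential daily count of GetObject calls compared against that credential's own history. The sketch below assumes CloudTrail records have been loaded as dictionaries with eventSource, eventName, and userIdentity.arn fields, and that S3 data events are being logged. The three-standard-deviation threshold, the function names, and the sample ARN are illustrative assumptions, not tuned values.

```python
from collections import Counter
from statistics import mean, pstdev


def getobject_counts(events: list[dict]) -> Counter:
    """Count S3 GetObject calls per caller ARN in a list of CloudTrail records."""
    counts = Counter()
    for event in events:
        if (event.get("eventSource") == "s3.amazonaws.com"
                and event.get("eventName") == "GetObject"):
            arn = event.get("userIdentity", {}).get("arn", "unknown")
            counts[arn] += 1
    return counts


def exceeds_baseline(today: int, history: list[int], threshold: float = 3.0) -> bool:
    """True when today's count is more than `threshold` std devs above the mean."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return today > mu
    return (today - mu) / sigma > threshold


# Hypothetical records for one credential.
arn = "arn:aws:iam::111122223333:user/build-agent"
events = (
    [{"eventSource": "s3.amazonaws.com", "eventName": "GetObject",
      "userIdentity": {"arn": arn}}] * 3
    + [{"eventSource": "s3.amazonaws.com", "eventName": "ListBuckets",
        "userIdentity": {"arn": arn}}]
)
print(getobject_counts(events)[arn])

daily_history = [40, 55, 38, 61, 47, 52, 44]  # hypothetical prior daily counts
print(exceeds_baseline(4200, daily_history))
print(exceeds_baseline(50, daily_history))
```

Counting calls is only one dimension. The same comparison can be applied to bytes transferred or to the number of distinct buckets touched, which may catch collection that uses a small number of very large downloads.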

CDA Perspective

TID Detection Must Cover the Full Collection Lifecycle

The Predictive Defense Intelligence (PDI) methodology requires detection coverage across the full kill chain, not just at initial access or exfiltration endpoints. The Change Healthcare incident demonstrated that a nine-day collection window with no detection is operationally indistinguishable from having no detection capability at all. The attacker completed their objective before any detection fired.

CDA's TID domain mission TID-H01 (threat detection engineering) builds detection coverage specifically for TA0009 behaviors as part of a comprehensive kill chain detection library. The prioritization framework for TA0009 detection rule development is based on the technique's prevalence in confirmed breaches and the availability of reliable detection signals. Email forwarding rule creation (T1114.003) is high-priority because it is both common in confirmed attacks and reliably detectable through audit log monitoring. Large-scale cloud storage access (T1530) is high-priority because the volume signals are detectable through cloud audit logs.

The DPS (Data Protection and Sovereignty) domain is the complementary control layer. Data Loss Prevention (DLP) policies, data classification, and access controls limit what an attacker can collect even if initial access succeeds. TID detection and DPS data controls operate in combination: detection aims to catch the attacker during collection; DPS controls limit the blast radius if collection succeeds before detection fires.

The Insider Threat Problem Shares Detection Infrastructure

Collection technique detection is not exclusively focused on external attackers. The same behaviors (bulk email download, large archive creation, bulk file copying from repositories, cloud storage download) appear in insider threat scenarios. An employee preparing to leave who downloads client data before resignation exhibits T1074 and T1005 behaviors. An employee exfiltrating intellectual property to a competitor exhibits T1530 behaviors using authorized credentials.

This overlap is operationally significant: the detection rules built for external threat actor collection scenarios provide coverage for insider threat collection scenarios without requiring separate detection engineering investment. CDA's approach treats the behavioral signal as the detection target regardless of whether the actor is external or internal.

Key Takeaways

Collection (MITRE ATT&CK TA0009) is the phase between discovery and exfiltration where attackers systematically gather data. Detecting collection before exfiltration completes limits breach scope and provides containment opportunity.
Email forwarding rule creation (T1114.003) converts the victim's email system into an ongoing collection mechanism. Detection requires monitoring Exchange and Microsoft 365 audit logs for inbox rule creation, particularly rules with external forwarding addresses.
The Change Healthcare breach (2024) illustrates the cost of missing the collection phase: nine days of access produced more than six terabytes of collected data before ransomware deployment made the breach apparent.
Data staging behaviors (T1074) produce detectable signals: large archive file creation, high-entropy files in unexpected locations, and bulk data movement between internal hosts before exfiltration.
Cloud storage collection (T1530) requires cloud audit log monitoring with behavioral baselines. The API calls are identical to legitimate operations; detection depends on volume and pattern deviation from established baselines.
TID detection for TA0009 and DPS data controls are complementary: detection aims to catch collection in progress, DPS controls limit what an attacker can collect if detection is delayed.

Discovery Techniques (MITRE ATT&CK TA0007)
Exfiltration Techniques (MITRE ATT&CK TA0010)
Initial Access Techniques (MITRE ATT&CK TA0001)
Change Healthcare Ransomware Attack (Case Study)
Data Protection and Sovereignty (DPS Domain Overview)
Detection Engineering for Threat Intelligence Teams

Sources

MITRE ATT&CK. "Collection (TA0009)." MITRE Corporation, 2024. https://attack.mitre.org/tactics/TA0009/

CISA. "#StopRansomware: ALPHV Blackcat (AA23-353A)." CISA, 2023 (updated 2024). https://www.cisa.gov/news-events/cybersecurity-advisories/aa23-353a

Microsoft. "Hunt for threats using Microsoft Sentinel." Microsoft Learn, 2024. https://learn.microsoft.com/en-us/azure/sentinel/

Mandiant. "M-Trends 2024 Special Report." Google Cloud, 2024. https://www.mandiant.com/m-trends

Verizon. "2024 Data Breach Investigations Report." Verizon Business, 2024. https://www.verizon.com/business/resources/reports/dbir/

Amazon Web Services. "Security best practices in AWS CloudTrail." AWS Documentation, 2024. https://docs.aws.amazon.com/awscloudtrail/latest/userguide/

CDA, LLC. "Predictive Defense Intelligence (PDI) Methodology Reference." CDA Canon, 2026.
