# Security Data Lake Architecture

Definition

A Security Data Lake Architecture is a purpose-built data storage and processing framework that centralizes security telemetry from across an organization's infrastructure into a scalable, schema-flexible repository. Unlike traditional security information and event management (SIEM) systems, which typically ingest only pre-filtered, normalized log streams, a security data lake retains raw, high-fidelity data at scale, enabling retrospective analysis, long-horizon threat hunting, and machine learning model training.

The architecture exists because modern threat detection requires more data, held longer, in more varied formats than legacy security tools can accommodate. Advanced persistent threats operate across timelines measured in months, not days. Supply chain compromises may remain dormant for over a year before activation. Nation-state actors establish persistent access and move laterally over extended periods. Traditional SIEM architectures with 30- to 90-day retention windows discard the very evidence needed to detect and investigate these threats.

Security data lakes solve three fundamental problems. First, they eliminate data loss caused by aggressive filtering at ingestion time. Second, they extend retention periods from months to years without a proportional increase in cost. Third, they apply schema at query time rather than at ingestion, allowing analysts to ask questions that were not anticipated when the data was originally collected. This capability is critical for threat hunting, where the investigative process discovers attack patterns that existing detection rules missed.

The architecture combines principles from big data engineering with the operational requirements of a security operations center (SOC), including low-latency alerting, high-throughput ingestion, controlled access, and auditability. It is built on cloud-native or hybrid object storage, layered with metadata catalogs, query engines, and integration points for detection logic and response orchestration.

How It Works

Security data lake architecture follows a multi-stage pipeline pattern with six functional components: ingestion infrastructure, landing zones, transformation pipelines, storage tiers, query engines, and detection frameworks.

Ingestion Infrastructure

Data enters the lake from multiple source types operating at different scales and reliability requirements. Endpoint detection and response (EDR) agents generate high-volume, high-frequency telemetry including process creation events, file access logs, and network connections. Network infrastructure produces flow data, DNS queries, and firewall decisions. Cloud providers generate API audit logs, identity authentication events, and resource configuration changes. Third-party services contribute threat intelligence feeds, vulnerability scan results, and security assessment data.

Ingestion pipelines, typically implemented using Apache Kafka, AWS Kinesis, or Azure Event Hubs, provide buffering, ordering, and delivery guarantees. These systems absorb backpressure when downstream processing cannot keep pace with data arrival rates, which reduces the risk of data loss during peak traffic periods or system maintenance windows.
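The buffering and backpressure behavior described above can be illustrated with a bounded in-memory queue. This is a minimal sketch and not how Kafka, Kinesis, or Event Hubs are implemented; the names `BoundedBuffer`, `offer`, and `drain` are hypothetical and exist only for illustration.

```python
from collections import deque


class BoundedBuffer:
    """Toy stand-in for a durable ingestion buffer (hypothetical, in-memory only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._events: deque = deque()

    def offer(self, event: dict) -> bool:
        """Accept the event if there is room; otherwise signal backpressure."""
        if len(self._events) >= self.capacity:
            return False  # the producer should hold and retry, not drop the event
        self._events.append(event)
        return True

    def drain(self, max_events: int) -> list:
        """Hand up to max_events to the downstream consumer, oldest first."""
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch


buffer = BoundedBuffer(capacity=3)
# Five events arrive while the consumer is stalled; only three fit.
accepted = [buffer.offer({"seq": i}) for i in range(5)]
# The consumer catches up and takes the two oldest events.
first_batch = buffer.drain(max_events=2)
# Space is available again, so the producer can resend a refused event.
retried = buffer.offer({"seq": 3})
```

The point of the sketch is that a refused write is an explicit signal to the producer. Data is lost only if the producer ignores that signal or has nowhere to hold the event in the meantime.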

A concrete scale example: an organization with 10,000 endpoints running comprehensive EDR telemetry generates approximately 2 TB of raw security data daily. Network flow collection from a campus with 50,000 users adds another 500 GB per day. Cloud audit logs from a multi-account AWS deployment contribute 100 GB daily. Without proper ingestion architecture, this combined volume of roughly 2.6 TB per day can overwhelm traditional SIEM platforms and force filtering decisions that eliminate critical forensic evidence.
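The arithmetic behind the example can be checked, and extended to retention sizing, with a few lines of Python. This sketch assumes decimal units (1 TB = 1000 GB) and ignores compression, so the retained figures are upper bounds on raw volume rather than storage estimates.

```python
# Daily volumes from the example above, in gigabytes.
DAILY_GB = {
    "edr_telemetry": 2000,  # 10,000 endpoints
    "network_flow": 500,    # campus with 50,000 users
    "cloud_audit": 100,     # multi-account AWS deployment
}


def total_daily_gb(daily_gb: dict) -> int:
    """Sum the per-source daily volumes."""
    return sum(daily_gb.values())


def retained_tb(daily_gb: dict, retention_days: int) -> float:
    """Raw, uncompressed volume held at steady state for a retention window."""
    return total_daily_gb(daily_gb) * retention_days / 1000


daily_tb = total_daily_gb(DAILY_GB) / 1000      # 2.6 TB per day
ninety_day_tb = retained_tb(DAILY_GB, 90)       # 234.0 TB
seven_year_tb = retained_tb(DAILY_GB, 7 * 365)  # 6643.0 TB
```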

Landing Zones and Raw Storage

Raw data lands in an immutable bronze layer implemented on object storage such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. This layer preserves data exactly as received, without modification, normalization, or filtering. Retention in the bronze layer is governed by legal, compliance, and investigative requirements, typically ranging from 2 to 7 years depending on industry and jurisdiction.

The bronze layer serves as the authoritative forensic record. When detection logic produces false positives, when normalization introduces errors, or when new threat intelligence requires historical analysis, the bronze layer provides the ground truth for reprocessing and investigation.
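The write-once property of the bronze layer can be sketched as follows. This is an in-memory illustration under stated assumptions: `BronzeStore` and the object key are hypothetical, and a real deployment would rely on the object store's own immutability and versioning features rather than application code.

```python
import hashlib


class BronzeStore:
    """In-memory stand-in for a write-once object store (hypothetical)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, raw: bytes) -> str:
        """Store raw bytes exactly as received and return their SHA-256 digest."""
        if key in self._objects:
            raise PermissionError(f"{key} already exists; bronze objects are write-once")
        self._objects[key] = raw
        return hashlib.sha256(raw).hexdigest()

    def verify(self, key: str, expected_digest: str) -> bool:
        """Check that stored bytes still match the digest recorded at ingestion."""
        return hashlib.sha256(self._objects[key]).hexdigest() == expected_digest


store = BronzeStore()
key = "edr/2024-01-15/batch-0001.json"  # hypothetical object key
raw = b'{"EventID": 4688, "Image": "powershell.exe"}'
digest = store.put(key, raw)

# A second write to the same key is rejected instead of replacing the record.
try:
    store.put(key, b"tampered")
    overwritten = True
except PermissionError:
    overwritten = False
```

Recording the digest at ingestion time gives later reprocessing jobs a way to confirm that the bytes they read are the bytes that were received.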

Transformation and Enrichment Pipelines

Transformation pipelines read from the bronze layer and produce normalized, enriched records in the silver layer. Normalization maps vendor-specific field names to common schemas such as the Elastic Common Schema (ECS) or Open Cybersecurity Schema Framework (OCSF). This standardization enables portable detection rules that work consistently across different data sources and vendor platforms.
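A minimal sketch of field-name normalization follows. The target names (`source.ip`, `process.name`, `user.name`) are ECS field names written in flat dotted form; the two vendor layouts and the `unmapped` catch-all key are hypothetical and chosen only to show the mechanism.

```python
# Hypothetical vendor layouts mapped to ECS-style dotted field names.
FIELD_MAPS = {
    "vendor_a": {"SrcAddr": "source.ip", "Image": "process.name", "User": "user.name"},
    "vendor_b": {"client_ip": "source.ip", "proc_name": "process.name", "account": "user.name"},
}


def normalize(record: dict, vendor: str) -> dict:
    """Rename known vendor fields; keep unknown fields under 'unmapped'."""
    mapping = FIELD_MAPS[vendor]
    normalized: dict = {}
    unmapped: dict = {}
    for field, value in record.items():
        if field in mapping:
            normalized[mapping[field]] = value
        else:
            unmapped[field] = value
    if unmapped:
        normalized["unmapped"] = unmapped  # nothing is silently discarded
    return normalized


def is_powershell(event: dict) -> bool:
    """One detection rule, written once against the common schema."""
    return event.get("process.name", "").lower() == "powershell.exe"


event_a = normalize(
    {"Image": "powershell.exe", "User": "jdoe", "SrcAddr": "10.0.0.5", "Pid": 4242},
    "vendor_a",
)
event_b = normalize(
    {"proc_name": "PowerShell.exe", "account": "asmith", "client_ip": "10.0.0.9"},
    "vendor_b",
)
```

Because both records end up with the same field names, `is_powershell` matches events from either vendor without any vendor-specific branches.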

Enrichment appends contextual information that increases analytical value. IP addresses receive geolocation data, autonomous system numbers, and threat intelligence reputation scores. Hostnames are matched against asset inventory systems to append department ownership, criticality ratings, and software inventories. User identities are enriched with organizational context, access privilege levels, and behavioral risk scores.

A specific enrichment scenario: when a Windows endpoint reports a process creation event for "powershell.exe", the enrichment pipeline appends the user's department (from Active Directory), the host's criticality rating (from the asset database), recent authentication anomalies for that user (from the identity security system), and known malware associations for any command-line arguments (from threat intelligence feeds). This enriched record provides analysts with immediate context that would otherwise require manual research across multiple systems.
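The scenario above can be sketched with dictionary lookups standing in for the four external systems. All table contents, the host name, and the output field names (`user.department`, `host.criticality`, `user.auth_anomalies`, `threat.matches`) are hypothetical; a production pipeline would query the real systems and follow its chosen schema.

```python
# Hypothetical lookup tables standing in for external systems.
AD_DEPARTMENTS = {"jdoe": "Finance"}                         # Active Directory
ASSET_CRITICALITY = {"FIN-WS-042": "high"}                   # asset database
AUTH_ANOMALIES = {"jdoe": ["first login from new device"]}   # identity security system
SUSPICIOUS_ARGUMENTS = {"-encodedcommand": "encoded PowerShell command"}  # threat intel


def enrich(event: dict) -> dict:
    """Return a copy of the event with context appended; the input is unchanged."""
    user = event.get("user.name")
    host = event.get("host.name")
    command_line = event.get("process.command_line", "").lower()

    enriched = dict(event)
    enriched["user.department"] = AD_DEPARTMENTS.get(user, "unknown")
    enriched["host.criticality"] = ASSET_CRITICALITY.get(host, "unknown")
    enriched["user.auth_anomalies"] = AUTH_ANOMALIES.get(user, [])
    enriched["threat.matches"] = [
        label
        for argument, label in SUSPICIOUS_ARGUMENTS.items()
        if argument in command_line
    ]
    return enriched


event = {
    "process.name": "powershell.exe",
    "process.command_line": "powershell.exe -EncodedCommand SQBFAFgA",
    "user.name": "jdoe",
    "host.name": "FIN-WS-042",
}
enriched = enrich(event)
```

Returning a copy rather than mutating the input keeps the silver-layer record separate from the original event, which stays recoverable from the bronze layer.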

Storage Tiering and Optimization

The gold layer contains aggregated, analytics-ready datasets optimized for specific detection and hunting use cases. Data is organized into hot, warm, and cold storage tiers based on query frequency and access patterns. Hot storage holds the most recent 30 days of data in high-performance storage optimized for interactive queries. Warm storage contains data from 30 to 365 days in standard storage with moderate query performance. Cold storage archives data beyond one year in compressed, infrequently accessed storage with higher latency but significantly lower cost.
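The age-based tiering rule can be expressed as a small function. The thresholds come from the paragraph above; treating day 30 as hot and day 365 as warm is an assumption, since the text does not say which tier owns the boundary days.

```python
HOT_MAX_DAYS = 30    # most recent 30 days
WARM_MAX_DAYS = 365  # up to one year


def storage_tier(age_days: int) -> str:
    """Map a record's age in days to the tier it should live in."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"


tiers = [storage_tier(age) for age in (0, 30, 31, 365, 366)]
```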

Partitioning strategies organize data by time windows, data source types, and organizational boundaries to optimize query performance and control access. A common partition scheme organizes data by day and source type, enabling queries that scan specific date ranges or data sources to read only relevant partitions, reducing query time and cost.
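A day-and-source partition scheme is often laid out as Hive-style `key=value` path segments. The sketch below assumes that convention; the key names `dt` and `source` and the source labels are hypothetical. It shows how a query's date range and source filter translate into the short list of partitions that must be read.

```python
from datetime import date, timedelta


def partition_path(day: date, source: str) -> str:
    """Build the path prefix for one day of one source type."""
    return f"dt={day.isoformat()}/source={source}/"


def partitions_to_scan(start: date, end: date, sources: list[str]) -> list[str]:
    """List only the partitions a query over [start, end] needs to read."""
    day_count = (end - start).days + 1
    return [
        partition_path(start + timedelta(days=offset), source)
        for offset in range(day_count)
        for source in sources
    ]


# A three-day DNS hunt touches three partitions, however large the lake is.
paths = partitions_to_scan(date(2024, 1, 1), date(2024, 1, 3), ["dns"])
```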

Query Engines and Detection Frameworks

Query engines such as Apache Spark, Presto, Amazon Athena, or Azure Synapse provide SQL interfaces for interactive analysis and scheduled detection jobs. Two detection patterns operate simultaneously: batch processing for historical analysis and threat hunting, and streaming processing for near-real-time detection.

Batch detection jobs run on scheduled intervals, scanning historical data for indicators of compromise, behavioral anomalies, and attack patterns that span extended time periods. These jobs can correlate events across weeks or months to identify slow-moving threats such as insider attacks or advanced persistent threats.
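As an illustration of a batch job that correlates across weeks, the sketch below flags accounts whose failed logins stay under a per-day alert threshold but recur on many distinct days. The event layout, thresholds, and account names are hypothetical, and a real job would run as a query against the lake rather than over a Python list.

```python
from collections import defaultdict
from datetime import date, timedelta


def low_and_slow_accounts(events: list, max_per_day: int = 3, min_days: int = 5) -> list:
    """Users with few failures on any one day but failures on many distinct days."""
    failures = defaultdict(lambda: defaultdict(int))
    for event in events:
        if event["outcome"] == "failure":
            failures[event["user"]][event["day"]] += 1
    flagged = [
        user
        for user, per_day in failures.items()
        if len(per_day) >= min_days and max(per_day.values()) <= max_per_day
    ]
    return sorted(flagged)


start = date(2024, 1, 1)
events = []
# "svc-backup": two failures once a week for ten weeks, quiet on any single day.
for week in range(10):
    day = start + timedelta(weeks=week)
    events += [{"user": "svc-backup", "day": day, "outcome": "failure"}] * 2
# "alice": one noisy burst, which a per-day threshold rule would already catch.
events += [{"user": "alice", "day": start, "outcome": "failure"}] * 20
# "bob": successful logins only.
events += [{"user": "bob", "day": start, "outcome": "success"}] * 5

flagged = low_and_slow_accounts(events)
```

The pattern only becomes visible when the job can see all ten weeks at once, which is why this kind of rule depends on retention longer than a typical SIEM window.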

Streaming detection evaluates incoming events in near-real-time, typically within seconds of ingestion. These rules focus on high-confidence indicators that require immediate response, such as known malware signatures, impossible travel scenarios, or privilege escalation attempts.
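One of the rules named above, impossible travel, can be sketched as a stateful check that runs once per incoming login event. The class name, the 900 km/h speed limit, and the one-second floor on elapsed time are assumptions for illustration; production rules usually also account for VPN egress points and geolocation error.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class ImpossibleTravelDetector:
    """Keeps each user's last login location and flags implausible jumps."""

    def __init__(self, max_speed_kmh: float = 900.0) -> None:
        self.max_speed_kmh = max_speed_kmh
        self._last_seen: dict = {}

    def observe(self, user: str, when: datetime, lat: float, lon: float) -> bool:
        """Return True if this login implies travel faster than max_speed_kmh."""
        previous = self._last_seen.get(user)
        self._last_seen[user] = (when, lat, lon)
        if previous is None:
            return False
        prev_when, prev_lat, prev_lon = previous
        # Floor elapsed time at one second so simultaneous or out-of-order
        # events do not cause a division by zero or a negative speed.
        hours = max((when - prev_when).total_seconds(), 1.0) / 3600
        distance = haversine_km(prev_lat, prev_lon, lat, lon)
        return distance / hours > self.max_speed_kmh


detector = ImpossibleTravelDetector()
t0 = datetime(2024, 1, 15, 9, 0)
london_login = detector.observe("jdoe", t0, 51.5074, -0.1278)
new_york_login = detector.observe("jdoe", t0 + timedelta(hours=1), 40.7128, -74.0060)
```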

Integration and Response Orchestration

Detection results integrate with existing SIEM platforms, security orchestration, automation, and response (SOAR) systems, and ticketing platforms through APIs and standard formats. The data lake does not replace these operational systems but rather enhances them with deeper analytical capabilities and longer retention horizons.

Why It Matters

Security data lake architecture addresses three critical gaps in traditional security infrastructure: detection latency for sophisticated attacks, investigative depth for incident response, and compliance evidence for regulatory requirements.

Detection Capability and Threat Dwell Time

The Mandiant M-Trends 2023 report documents a global median attacker dwell time of 16 days from initial compromise to detection. However, this statistic reflects only intrusions that were detected and investigated. Advanced persistent threats and supply chain compromises often remain undetected for months or years. Organizations with 90-day log retention cannot fully investigate attacks that began before their retention window, creating a structural detection gap for the most sophisticated threats.

The 2020 SolarWinds supply chain attack demonstrates this gap's real-world consequences. The malicious code was first distributed in March 2020 but was not discovered until December 2020. Organizations with standard 90-day retention windows were poorly placed to determine when the compromise began, which systems were affected, or what data was accessed during the nine-month window. Organizations with data lake architectures and multi-year retention were far better positioned to trace the attack timeline, scope the impact, and provide detailed forensic evidence for legal and regulatory proceedings.
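The size of such a gap is simple date arithmetic. In the sketch below the function name is hypothetical, and because the text gives only months, the first day of March and December 2020 are used as placeholder dates rather than as claims about the actual timeline.

```python
from datetime import date, timedelta


def evidence_gap_days(compromise: date, discovery: date, retention_days: int) -> int:
    """Days of attacker activity older than the retention window at discovery."""
    oldest_retained = discovery - timedelta(days=retention_days)
    return max(0, (oldest_retained - compromise).days)


# Placeholder dates: the source states only "March 2020" and "December 2020".
compromise = date(2020, 3, 1)
discovery = date(2020, 12, 1)

intrusion_days = (discovery - compromise).days                  # 275
gap_at_90_days = evidence_gap_days(compromise, discovery, 90)   # 185
gap_at_2_years = evidence_gap_days(compromise, discovery, 730)  # 0
```

Under these placeholder dates, a 90-day window leaves roughly two thirds of the intrusion period without any retained telemetry, while a two-year window covers all of it.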

Incident Response and Forensic Analysis

Modern incident response requires the ability to correlate events across multiple data sources and extended time periods. A comprehensive investigation might need to trace user authentication patterns over six months, correlate network traffic with endpoint activity across multiple compromised systems, and identify data exfiltration patterns that span weeks of attacker activity.

Traditional SIEM architectures force investigators to work within retention windows and pre-filtered data sets. Critical evidence such as process creation events, detailed network flows, or cloud API calls may have been filtered out during ingestion to manage storage costs and query performance. Data lake architectures preserve this evidence in its original form, enabling retrospective analysis with full fidelity.

Compliance and Regulatory Requirements

Regulatory frameworks including GDPR, HIPAA, PCI DSS, SOX, and sector-specific requirements such as NERC CIP mandate the ability to audit access to sensitive data, investigate security events, and retain evidence for specified periods. The specifics vary by framework and jurisdiction, but retention periods of 3 to 7 years are common, and auditors generally expect records to be protected against tampering.

A security data lake with proper access controls, encryption, and immutable storage provides the infrastructure these frameworks require. Organizations can demonstrate to auditors that they have complete, tamper-evident records of security events and can investigate incidents that occurred years earlier.

Cost Structure and Scalability

A common misconception is that comprehensive data retention is prohibitively expensive. Cloud-native storage with automatic tiering reduces long-term storage costs to pennies per gigabyte per month for cold storage. The total cost of ownership for a security data lake is often lower than scaling traditional SIEM platforms to equivalent data volumes, particularly when query costs are managed through appropriate partitioning and access patterns.
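A rough steady-state storage estimate shows how tiering shapes the bill. Every number in this sketch is a placeholder: the per-gigabyte prices are hypothetical and not any provider's price list, the 30-day and 365-day tier boundaries are assumed, and compression, request charges, and query compute are all ignored.

```python
# Hypothetical monthly prices per gigabyte; substitute real provider pricing.
PRICE_PER_GB_MONTH = {"hot": 0.02, "warm": 0.01, "cold": 0.002}


def monthly_storage_cost(daily_gb: float, retention_days: int, prices: dict) -> float:
    """Steady-state monthly storage cost with hot/warm/cold tiering by age."""
    hot_days = min(retention_days, 30)
    warm_days = min(max(retention_days - 30, 0), 335)
    cold_days = max(retention_days - 365, 0)
    gb_months = {
        "hot": daily_gb * hot_days,
        "warm": daily_gb * warm_days,
        "cold": daily_gb * cold_days,
    }
    return sum(gb_months[tier] * prices[tier] for tier in gb_months)


# 100 GB per day kept for two years, with and without tiering.
tiered = monthly_storage_cost(100, 730, PRICE_PER_GB_MONTH)
all_hot = 100 * 730 * PRICE_PER_GB_MONTH["hot"]
```

With these placeholder prices the tiered layout costs about 468 per month against 1460 for keeping everything in the hot tier, which illustrates why the second year of retention adds comparatively little to the total.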

Another misconception is that data lakes are only suitable for large enterprises. Pay-per-query cloud services make security data lake capabilities accessible to mid-sized organizations without the capital expenditure of dedicated infrastructure. A 1,000-employee organization may be able to implement a comprehensive security data lake for less than the annual cost of expanding its existing SIEM to handle equivalent data volumes.

CDA Perspective

Within the Planetary Defense Model (PDM), security data lake architecture spans both the SPH (Security Posture and Hygiene) and TID (Threat Intelligence and Detection) domains. CDA's Autonomous Posture Command (APC) methodology operates on the principle that posture adapts and hygiene never sleeps. A security data lake provides the telemetry foundation that enables both adaptive posture assessment and continuous hygiene monitoring.

In the SPH domain, CDA treats data availability as a fundamental hygiene requirement, not an advanced capability. An organization that cannot query 12 months of authentication logs cannot assess whether its access control posture has degraded over time. An organization that discards DNS logs after 30 days cannot evaluate the effectiveness of its network segmentation controls. CDA conducts SPH assessments that explicitly score organizations on their ability to answer specific investigative questions using their existing data infrastructure.

CDA's approach to data lake implementation differs from conventional thinking in three areas. First, CDA mandates schema standardization as a security control, not an engineering preference. Detection rules that depend on vendor-specific field names create technical debt and break when data sources are replaced or upgraded. CDA requires adoption of OCSF or ECS as a condition of detection coverage validation.

Second, CDA enforces separation between storage and compute layers to ensure that compromise of query infrastructure cannot result in tampering with the immutable evidence record. This architectural principle is non-negotiable for clients operating under SOC 2 Type II or FedRAMP requirements.

Third, CDA structures data lake implementations around detection coverage mapping rather than storage optimization. This means identifying which MITRE ATT&CK techniques current data sources can detect, which techniques require additional telemetry, and how storage tiering decisions affect retrospective detection capability. CDA designs data lakes against specific threat models and adversary profiles, not against abstract technical requirements.
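Detection coverage mapping can be sketched as a comparison between the telemetry each technique needs and the telemetry the lake actually holds. The technique IDs are real MITRE ATT&CK identifiers, but the telemetry labels and the one-source-per-technique mapping are simplified, hypothetical stand-ins, and this is not a representation of CDA's assessment tooling.

```python
# Hypothetical telemetry requirements per ATT&CK technique (simplified).
TECHNIQUE_REQUIREMENTS = {
    "T1059.001": {"process_creation"},  # Command and Scripting Interpreter: PowerShell
    "T1078": {"authentication"},        # Valid Accounts
    "T1071.004": {"dns_query"},         # Application Layer Protocol: DNS
}


def coverage_report(available_sources: set) -> dict:
    """For each technique, report whether it is covered and what is missing."""
    report = {}
    for technique, required in sorted(TECHNIQUE_REQUIREMENTS.items()):
        missing = required - available_sources
        report[technique] = {"covered": not missing, "missing": sorted(missing)}
    return report


# A lake that ingests endpoint and identity telemetry but no DNS logs.
report = coverage_report({"process_creation", "authentication"})
gaps = [technique for technique, row in report.items() if not row["covered"]]
```

The same comparison can be rerun per storage tier to see which techniques lose retrospective coverage once a data source ages out of queryable storage.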

Within the TID domain, CDA emphasizes that data lakes are force multipliers for threat detection, not standalone solutions. The architecture provides the analytical substrate for threat hunting programs, machine learning model training, and behavioral analysis that would be impossible with traditional SIEM retention windows. However, real-time alerting, case management, and analyst workflow remain the domain of traditional SIEM platforms. The two systems are complementary components of a complete detection capability.

CDA operationalizes this integration through quarterly detection coverage assessments that validate the organization's ability to detect specific attack techniques using data lake queries, measure query performance against investigative timelines, and ensure that detection rules remain effective as data sources evolve.

Key Takeaways

Map retention requirements to threat dwell times, not storage costs: Define minimum retention periods based on the timeline characteristics of threats in your specific threat model. A 90-day retention window creates a known detection gap for advanced persistent threats and supply chain compromises.

Preserve raw data in an immutable bronze layer: Maintain unmodified source data to provide a forensic record that can be reprocessed when normalization errors are discovered or when new analytical techniques require different data transformations.

Standardize on a common schema before implementing detection logic: Schema standardization using ECS or OCSF is a prerequisite for portable, maintainable detection rules that work consistently across multiple data sources and survive vendor changes.

Design storage tiering around actual query patterns: In cloud-native implementations, compute costs for queries often exceed storage costs. Partition and tier data based on real access patterns to control operational expenditure.

Integrate with existing security operations, do not replace them: Security data lakes provide analytical depth and historical perspective that complement SIEM capabilities for real-time alerting and case management. Design the architecture to enhance existing workflows rather than disrupting them.

SIEM Architecture and Deployment Patterns
Open Cybersecurity Schema Framework (OCSF) Implementation Guide
Threat Hunting Program Design and Methodology
Cloud Security Logging and Audit Requirements
MITRE ATT&CK Detection Coverage Assessment

Sources

National Institute of Standards and Technology. NIST SP 800-137: Information Security Continuous Monitoring for Federal Information Systems and Organizations. https://csrc.nist.gov/publications/detail/sp/800-137/final

MITRE Corporation. ATT&CK Framework: Data Sources Matrix. https://attack.mitre.org/datasources/

Mandiant. M-Trends 2023: A View from the Frontlines. https://www.mandiant.com/resources/reports/m-trends-2023

Open Cybersecurity Schema Framework. OCSF Schema Specification v1.0.0. https://schema.ocsf.io/

Center for Internet Security. CIS Controls Version 8: Control 8 - Audit Log Management. https://www.cisecurity.org/controls/audit-log-management
