security-data-lakehouse: CDA.Wiki (Print)

# Security Data Lakehouse

## Definition

A security data lakehouse is a unified data architecture that combines the scalability and cost-efficiency of a data lake with the structured query performance and ACID compliance of a data warehouse, purpose-built for security operations. It serves as the central repository for all security telemetry (logs, alerts, network flows, endpoint data, cloud events, threat intelligence) in a format optimized for both real-time detection and long-term historical analysis. The lakehouse paradigm eliminates the need to choose between cheap storage and fast queries by using open table formats like Apache Iceberg, Delta Lake, or Apache Hudi.

## How It Works

The security data lakehouse architecture consists of several layers:

Ingestion Layer: Security telemetry from diverse sources (SIEM, EDR, cloud providers, network devices, identity platforms) is ingested through streaming pipelines (Apache Kafka, AWS Kinesis, Azure Event Hubs) and batch connectors. Data arrives in raw format and is stored in object storage (S3, GCS, Azure Blob).
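
As a rough illustration of how raw batches might land in object storage, the sketch below builds a date-partitioned object key for one batch of events. The `raw/source=.../dt=.../hour=...` layout, the `raw_object_key` helper, and the sample records are all assumptions made for this example; real pipelines choose their own layout.

```python
from datetime import datetime, timezone
import hashlib
import json


def raw_object_key(source: str, event_time: datetime, payload: bytes) -> str:
    """Build a date-partitioned object key for one raw batch (hypothetical layout)."""
    ts = event_time.astimezone(timezone.utc)
    # A content hash keeps the key deterministic, so re-uploading the same batch
    # overwrites the same object instead of creating a duplicate.
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"raw/source={source}/dt={ts:%Y-%m-%d}/hour={ts:%H}/{digest}.jsonl"


# Made-up sample events, serialized as newline-delimited JSON.
batch = [
    {"ts": "2024-03-01T12:30:05Z", "host": "web-01", "msg": "login failed"},
    {"ts": "2024-03-01T12:30:09Z", "host": "web-01", "msg": "login ok"},
]
payload = "\n".join(json.dumps(e, sort_keys=True) for e in batch).encode("utf-8")
key = raw_object_key("edr", datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc), payload)
print(key)
```

Partitioning keys by source and date is one common way to let later queries skip objects outside the time range they care about.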

Storage Layer: An open table format (Iceberg, Delta Lake, or Hudi) provides ACID transactions, schema evolution, and time travel (querying the table as it existed at an earlier point) on top of object storage. The result is warehouse-grade reliability on inexpensive lake infrastructure.
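
The idea behind atomic commits and time travel can be sketched with a toy model: every commit publishes a new immutable snapshot listing the data files that make up the table, and older snapshots stay readable. This is a conceptual illustration only. `ToyTable` and `Snapshot` are hypothetical names, not the API of Iceberg, Delta Lake, or Hudi, and real formats add manifests, concurrency control, and much more.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple[str, ...]  # immutable data files visible in this snapshot


@dataclass
class ToyTable:
    # Snapshot 0 is the empty table.
    snapshots: list[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    def commit_append(self, new_files: list[str]) -> int:
        """Publish a new snapshot in one step; readers see all new files or none."""
        current = self.snapshots[-1]
        snap = Snapshot(current.snapshot_id + 1, current.files + tuple(new_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files_as_of(self, snapshot_id: int | None = None) -> tuple[str, ...]:
        """Return the file list for a given snapshot (latest if omitted)."""
        if snapshot_id is None:
            return self.snapshots[-1].files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
first = table.commit_append(["auth/2024-03-01.parquet"])
second = table.commit_append(["auth/2024-03-02.parquet"])
print(table.files_as_of())       # current state of the table
print(table.files_as_of(first))  # "time travel" to the earlier snapshot
```

For investigations, this property means an analyst can re-run a query against the table as it looked when an alert originally fired.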

Processing Layer: A compute engine or managed platform (Apache Spark, Databricks, Snowflake, or DuckDB) processes data for transformations, enrichments, and analytics. Security-specific processing includes log normalization to a common schema such as the Open Cybersecurity Schema Framework (OCSF) or the Elastic Common Schema (ECS), threat intelligence enrichment, and detection rule evaluation.
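
The three processing steps named above (normalization, enrichment, rule evaluation) can be sketched in plain Python. The raw record format, the `KNOWN_BAD_IPS` set, and the `intel_match` flag are invented for this example. The numeric IDs are assumed to follow the OCSF Authentication class and should be checked against the published schema before use.

```python
from datetime import datetime, timezone

# Hypothetical threat-intelligence set; 203.0.113.0/24 is a documentation range.
KNOWN_BAD_IPS = {"203.0.113.7"}


def normalize_auth_event(raw: dict) -> dict:
    """Map a made-up vendor record onto OCSF-style Authentication fields."""
    ts = datetime.strptime(raw["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "class_uid": 3002,  # assumed: OCSF Authentication class
        "activity_id": 1,   # assumed: Logon
        "status_id": 1 if raw["result"] == "ok" else 2,  # assumed: 1 Success, 2 Failure
        "time": int(ts.timestamp() * 1000),  # epoch milliseconds
        "user": {"name": raw["user"]},
        "src_endpoint": {"ip": raw["src"]},
    }


def enrich(event: dict, bad_ips: set) -> dict:
    """Return a copy with a hypothetical intel_match flag (not an OCSF attribute)."""
    return {**event, "intel_match": event["src_endpoint"]["ip"] in bad_ips}


def failed_logon_from_known_bad_ip(event: dict) -> bool:
    """Toy detection rule: a failed logon whose source IP matched threat intel."""
    return event["class_uid"] == 3002 and event["status_id"] == 2 and event["intel_match"]


raw = {"ts": "2024-03-01T12:30:05Z", "user": "alice", "src": "203.0.113.7", "result": "fail"}
event = enrich(normalize_auth_event(raw), KNOWN_BAD_IPS)
print(event["time"], failed_logon_from_known_bad_ip(event))
```

Because the rule reads only normalized fields, the same rule can apply to any source that has been mapped to the common schema.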

Query Layer: Analysts and detection engines query the lakehouse using SQL or dataframe APIs. Real-time detection runs on streaming data; threat hunting and investigations query historical data, sometimes spanning years.
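
A hunting query over historical data might look like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in for a lakehouse query engine, so the example runs anywhere. The `auth_events` table, its columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_events (time_ms INTEGER, user_name TEXT, src_ip TEXT, status_id INTEGER)"
)
# Made-up normalized events; status_id 2 is treated as a failed logon.
rows = [
    (1709296205000, "alice", "203.0.113.7", 2),
    (1709296206000, "bob", "203.0.113.7", 2),
    (1709296207000, "carol", "203.0.113.7", 2),
    (1709296300000, "dave", "198.51.100.4", 2),
    (1709296301000, "dave", "198.51.100.4", 1),
]
conn.executemany("INSERT INTO auth_events VALUES (?, ?, ?, ?)", rows)

# Source IPs with many failed logons in a time window: a simple password-spray hunt.
HUNT_SQL = """
SELECT src_ip,
       COUNT(*) AS failures,
       COUNT(DISTINCT user_name) AS users_targeted
FROM auth_events
WHERE status_id = 2 AND time_ms >= ? AND time_ms < ?
GROUP BY src_ip
HAVING COUNT(*) >= ?
ORDER BY failures DESC
"""

window_start = 1709251200000               # 2024-03-01T00:00:00Z in epoch ms
window_end = window_start + 86_400_000     # one day later
suspects = conn.execute(HUNT_SQL, (window_start, window_end, 3)).fetchall()
print(suspects)
```

Widening the window from one day to several years changes only the two time parameters, which is the practical benefit of keeping long retention queryable.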

Governance Layer: Access controls, data classification, retention policies, and audit logging ensure the lakehouse itself is secured and compliant.
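
Retention policies tied to data classification can be expressed as a small rule table. The sketch below is one possible shape, with hypothetical classification names, retention periods, and partition paths. It only selects partitions that are past retention and leaves the actual deletion to whatever tooling the platform provides.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data classification, in days.
RETENTION_DAYS = {"restricted": 7 * 365, "internal": 365, "public": 90}


def partitions_to_expire(partitions: list, today: date) -> list:
    """Return paths of partitions older than their classification's retention."""
    expired = []
    for part in partitions:
        keep_for = timedelta(days=RETENTION_DAYS[part["classification"]])
        if today - part["dt"] > keep_for:
            expired.append(part["path"])
    return expired


# Made-up partition catalog entries.
partitions = [
    {"path": "auth/dt=2016-01-01", "dt": date(2016, 1, 1), "classification": "restricted"},
    {"path": "dns/dt=2023-06-01", "dt": date(2023, 6, 1), "classification": "internal"},
    {"path": "web/dt=2023-11-01", "dt": date(2023, 11, 1), "classification": "public"},
]
print(partitions_to_expire(partitions, today=date(2024, 3, 1)))
```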

Key differentiators from traditional SIEM:

- Cost: Object storage typically costs a few cents per GB per month, whereas traditional SIEM ingestion and indexing is commonly priced in dollars per GB
- Retention: Store years of telemetry affordably instead of months
- Flexibility: Query using any compute engine, not just the SIEM vendor's interface
- Ownership: Data stays in your infrastructure in open formats, avoiding vendor lock-in
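
The cost difference can be made concrete with back-of-the-envelope arithmetic. Every number below is an assumption chosen for illustration, not a quoted price: 500 GB of telemetry per day, one year of retention, object storage at 0.023 USD per GB per month, and SIEM ingestion at 2 USD per GB. The sketch ignores compression, compute, and request charges.

```python
def lake_storage_cost_per_month(daily_gb: float, retention_days: int, usd_per_gb_month: float) -> float:
    """Monthly storage bill once the retention window is full (steady state)."""
    stored_gb = daily_gb * retention_days
    return stored_gb * usd_per_gb_month


def siem_ingest_cost_per_month(daily_gb: float, usd_per_gb_ingested: float, days: int = 30) -> float:
    """Monthly ingestion bill under a simple per-GB-ingested pricing model."""
    return daily_gb * days * usd_per_gb_ingested


# Assumed inputs; replace with real volumes and contract prices.
lake = lake_storage_cost_per_month(daily_gb=500, retention_days=365, usd_per_gb_month=0.023)
siem = siem_ingest_cost_per_month(daily_gb=500, usd_per_gb_ingested=2.0)
print(f"lake storage: {lake:,.2f} USD/month, SIEM ingest: {siem:,.2f} USD/month")
```

Query compute is billed separately in a lakehouse, so a fair comparison for a specific environment would add an estimate of that as well.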

## Why It Matters

Traditional SIEMs typically charge by data volume, creating a perverse incentive to ingest less data. Security teams are forced to choose which logs to keep and which to discard, directly undermining detection capabilities. When an incident occurs, the data investigators need is often exactly the data that was never ingested.

The security data lakehouse removes this trade-off. Because storage is decoupled from compute, organizations pay object-storage prices to keep data and pay for compute only when they query it, so they can ingest everything, retain it for years, and query it on demand. This fundamentally changes the economics of security operations:

- Detection coverage improves because no telemetry is excluded for cost reasons
- Threat hunting becomes practical over multi-year datasets
- Incident investigation can trace attacker activity across months of historical data
- Compliance requirements for log retention (often 1-7 years) are met affordably

Major vendors have recognized this shift. Snowflake, Databricks, and AWS (with Amazon Security Lake) all offer security-specific lakehouse solutions. The Open Cybersecurity Schema Framework (OCSF) provides a standard schema for normalizing security data across sources.

## Real-World Applications

- Large Enterprises: Replace SIEM as the primary data store, using the SIEM only for real-time alerting while the lakehouse handles investigation, hunting, and long-term analytics.
- MSSPs/MDR Providers: Manage security telemetry from hundreds of clients in a multi-tenant lakehouse, reducing per-client costs dramatically.
- Compliance-Heavy Industries: Financial institutions and healthcare organizations store 7+ years of security telemetry for regulatory compliance at a fraction of SIEM costs.
- Threat Intelligence: Correlate years of network flow data against emerging threat intelligence to identify historical compromises.
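
The threat intelligence use case above, often called a retro-hunt, amounts to joining old flow records against newly published indicators. The sketch below shows the idea on in-memory data. The flow record shape, host names, and indicator set are invented for the example, and a real lakehouse would run this as a query over the stored tables instead of a Python loop.

```python
from datetime import date


def retro_hunt(flows: list, new_iocs: set) -> dict:
    """For each newly published indicator, find first contact date and affected hosts."""
    hits = {}
    for flow in flows:
        ip = flow["dst_ip"]
        if ip not in new_iocs:
            continue
        hit = hits.setdefault(ip, {"first_seen": flow["dt"], "hosts": set()})
        hit["first_seen"] = min(hit["first_seen"], flow["dt"])
        hit["hosts"].add(flow["src_host"])
    # Sort host names so the result is stable and easy to compare.
    return {
        ip: {"first_seen": h["first_seen"], "hosts": sorted(h["hosts"])}
        for ip, h in hits.items()
    }


# Made-up historical flow records and a hypothetical indicator published today.
flows = [
    {"dt": date(2023, 9, 2), "src_host": "build-03", "dst_ip": "203.0.113.7"},
    {"dt": date(2023, 1, 15), "src_host": "hr-laptop-12", "dst_ip": "203.0.113.7"},
    {"dt": date(2024, 2, 20), "src_host": "web-01", "dst_ip": "198.51.100.4"},
]
new_iocs = {"203.0.113.7"}
print(retro_hunt(flows, new_iocs))
```

The earliest contact date gives investigators a starting point for scoping how long a compromise may have gone unnoticed.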

## CDA Perspective

CDA positions the security data lakehouse within the Threat Intelligence & Defense (TID) domain under the Predictive Defense Intelligence (PDI) methodology. Our principle: you cannot defend against what you cannot see, and you cannot see what you did not store.

Operational integration:

- M-TID-B02 architects the security data lakehouse, selecting the appropriate open table format and compute engine based on client scale and existing infrastructure
- M-TID-H01 configures detection rules and threat hunting queries against the lakehouse
- M-TID-R01 assesses current telemetry coverage gaps and data retention limitations during reconnaissance

CDA.Now integrates with client lakehouses to surface threat intelligence without requiring data to leave the client's environment, a direct application of Zero Possession Architecture.

## Key Takeaways

- Security data lakehouses combine data lake cost efficiency with data warehouse query performance
- Open table formats (Iceberg, Delta Lake, Hudi) provide ACID transactions on object storage
- Decoupling storage from compute eliminates the SIEM cost versus coverage trade-off
- Organizations can ingest all telemetry and retain it for years at affordable costs
- OCSF provides a standard schema for normalizing security data across diverse sources
- The lakehouse paradigm enables practical threat hunting over multi-year historical datasets