# Security Data Lakehouse
A security data lakehouse is a unified data architecture that combines the scalability and cost-efficiency of a data lake with the structured query performance and ACID compliance of a data warehouse, purpose-built for security operations. It serves as the central repository for all security telemetry (logs, alerts, network flows, endpoint data, cloud events, threat intelligence) in a format optimized for both real-time detection and long-term historical analysis. The lakehouse paradigm eliminates the need to choose between cheap storage and fast queries by using open table formats like Apache Iceberg, Delta Lake, or Apache Hudi.
The security data lakehouse architecture consists of several layers:
- **Ingestion Layer:** Security telemetry from diverse sources (SIEM, EDR, cloud providers, network devices, identity platforms) is ingested through streaming pipelines (Apache Kafka, Amazon Kinesis, Azure Event Hubs) and batch connectors. Data arrives in raw format and is stored in object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage).
- **Storage Layer:** An open table format (Iceberg, Delta Lake, or Hudi) provides ACID transactions, schema evolution, and time-travel capabilities on top of object storage. The result is warehouse-like reliability on lake-like infrastructure at lake-like cost.
- **Processing Layer:** A compute engine (Apache Spark, Databricks, Snowflake, or DuckDB) processes data for transformations, enrichments, and analytics. Security-specific processing includes log normalization (to the OCSF or ECS schema), threat intelligence enrichment, and detection rule evaluation.
- **Query Layer:** Analysts and detection engines query the lakehouse using SQL or dataframe APIs. Real-time detection runs on streaming data; threat hunting and investigations query historical data, sometimes spanning years.
- **Governance Layer:** Access controls, data classification, retention policies, and audit logging ensure the lakehouse itself is secured and compliant.
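
The storage layer's atomic commits and time travel can be pictured with a toy model. The sketch below is not Iceberg, Delta Lake, or Hudi code. `Snapshot`, `ToyTable`, `commit`, and `scan` are hypothetical names for an in-memory stand-in, and the file names are invented. It only mimics the idea these formats share: data files are immutable, and each commit publishes a new snapshot listing the files that make up the table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One committed table version: an immutable tuple of data file names."""
    snapshot_id: int
    files: tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table format's commit log."""
    snapshots: list = field(default_factory=list)

    def commit(self, added_files):
        # A commit never edits earlier snapshots. It appends a new snapshot
        # listing the previous files plus the new ones, so a reader sees
        # either the old version or the new one, never a partial write.
        previous = self.snapshots[-1].files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, previous + tuple(added_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def scan(self, as_of=None):
        # Time travel: return the file list as of an earlier snapshot id.
        if not self.snapshots:
            return ()
        if as_of is None:
            return self.snapshots[-1].files
        if not 1 <= as_of <= len(self.snapshots):
            raise ValueError(f"unknown snapshot id: {as_of}")
        return self.snapshots[as_of - 1].files


table = ToyTable()
v1 = table.commit(["auth-2024-01-01.parquet"])
v2 = table.commit(["auth-2024-01-02.parquet", "dns-2024-01-02.parquet"])

current_files = table.scan()        # files from both commits
files_at_v1 = table.scan(as_of=v1)  # only the file from the first commit
```

For an investigation, the useful property is that `scan(as_of=...)` reproduces exactly what the table contained at an earlier point, even after later writes.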
Key differentiators from traditional SIEM:
Traditional SIEMs typically price by ingested data volume, which creates a perverse incentive to ingest less. Security teams must choose which logs to keep and which to discard, and every discarded source is a gap in detection coverage. When an incident occurs, the evidence investigators need is often in exactly the logs that were dropped to save cost.
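The security data lakehouse largely removes this trade-off. Because storage is decoupled from compute, organizations can keep all telemetry in low-cost object storage, retain it for years, and query it on demand. This changes the economics of security operations: the marginal cost of keeping a log source drops to the cost of storing it, so ingest pricing no longer drives retention decisions.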
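
Querying years of retained telemetry on demand might look like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in for a lakehouse SQL engine. The `auth_events` table, its columns, the thresholds, and the sample rows are all invented for illustration.

```python
import sqlite3

# In-memory database standing in for a lakehouse table of login events.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE auth_events (
        event_time TEXT,   -- ISO 8601 timestamp in UTC
        user_name  TEXT,
        src_ip     TEXT,
        outcome    TEXT    -- 'success' or 'failure'
    )
""")
rows = [
    ("2023-03-01T08:00:00Z", "alice", "198.51.100.7", "failure"),
    ("2023-03-01T08:00:05Z", "alice", "198.51.100.7", "failure"),
    ("2023-03-01T08:00:09Z", "bob",   "198.51.100.7", "failure"),
    ("2023-03-01T09:15:00Z", "carol", "192.0.2.10",   "success"),
    ("2024-06-12T14:30:00Z", "alice", "192.0.2.10",   "failure"),
]
conn.executemany("INSERT INTO auth_events VALUES (?, ?, ?, ?)", rows)

# Hunt: source IPs with at least three failed logins against more than one
# account inside a chosen historical window.
query = """
    SELECT src_ip,
           COUNT(*)                  AS failures,
           COUNT(DISTINCT user_name) AS distinct_users
    FROM auth_events
    WHERE outcome = 'failure'
      AND event_time >= ? AND event_time < ?
    GROUP BY src_ip
    HAVING COUNT(*) >= 3 AND COUNT(DISTINCT user_name) > 1
    ORDER BY failures DESC
"""
suspects = conn.execute(
    query, ("2023-01-01T00:00:00Z", "2024-01-01T00:00:00Z")
).fetchall()
print(suspects)  # [('198.51.100.7', 3, 2)]
```

The time window is a query parameter, so the same hunt can be rerun over any retained period without re-ingesting data.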
Major vendors have recognized this shift. Snowflake, Databricks, and AWS all offer security-specific lakehouse solutions. The OCSF (Open Cybersecurity Schema Framework) provides a standard schema for normalizing security data across sources.
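
As an illustration of normalization, the sketch below maps two differently shaped raw login records onto one common shape, tags them against a threat-intelligence list, and evaluates a simple detection rule. The field names are loosely modelled on OCSF attributes (`time`, `user.name`, `src_endpoint.ip`, `status`) but are simplified and should be checked against the published schema. The raw record formats, `KNOWN_BAD_IPS`, and all function names are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical threat-intelligence feed: IPs flagged as malicious.
KNOWN_BAD_IPS = {"203.0.113.50"}


def normalize_vpn(raw):
    """Map a made-up VPN log record onto the common shape."""
    return {
        "time": raw["ts"],  # already epoch seconds
        "user": {"name": raw["username"]},
        "src_endpoint": {"ip": raw["client_ip"]},
        "status": "Success" if raw["result"] == "OK" else "Failure",
        "source": "vpn",  # provenance tag for this sketch, not an OCSF field
    }


def normalize_idp(raw):
    """Map a made-up identity-provider record onto the same shape."""
    ts = datetime.strptime(raw["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "time": int(ts.replace(tzinfo=timezone.utc).timestamp()),
        "user": {"name": raw["actor"]},
        "src_endpoint": {"ip": raw["ipAddress"]},
        "status": "Success" if raw["outcome"] == "SUCCESS" else "Failure",
        "source": "idp",
    }


def enrich(event):
    """Return a copy of the event with a threat-intelligence match flag."""
    ip = event["src_endpoint"]["ip"]
    return {**event, "ti_match": ip in KNOWN_BAD_IPS}


def detect(event):
    """Toy rule: alert on a successful login from a flagged IP."""
    return event["status"] == "Success" and event["ti_match"]


vpn_raw = {"ts": 1700000000, "username": "alice",
           "client_ip": "203.0.113.50", "result": "OK"}
idp_raw = {"eventTime": "2023-11-14T22:13:20Z", "actor": "bob",
           "ipAddress": "192.0.2.10", "outcome": "SUCCESS"}

events = [enrich(normalize_vpn(vpn_raw)), enrich(normalize_idp(idp_raw))]
alerts = [e for e in events if detect(e)]  # only alice's VPN login matches
```

Once both sources share one shape, the enrichment step and the detection rule are written once rather than once per log source.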
CDA positions the security data lakehouse within the Threat Intelligence & Defense (TID) domain under the Predictive Defense Intelligence (PDI) methodology. Our principle: you cannot defend against what you cannot see, and you cannot see what you did not store.
Operational integration:
CDA.Now integrates with client lakehouses to surface threat intelligence without requiring data to leave the client's environment, a direct application of Zero Possession Architecture.