data-classification: CDA.Wiki (Print)

# Data Classification

Definition

Data classification is the process of organizing data into categories based on its sensitivity, regulatory requirements, and business value, then applying appropriate protection controls to each category. It is the foundational operation of the Data Protection and Sovereignty (DPS) domain in the Planetary Defense Model because every DPS control depends on it: you cannot encrypt what you have not identified, you cannot restrict access to what you have not labeled, and you cannot monitor exfiltration of what you have not classified.

The U.S. government has practiced data classification for decades. National security information is classified as Confidential, Secret, or Top Secret, and everything else is unclassified. Sensitive Compartmented Information (SCI) is not a fourth level: it adds compartmented access restrictions on top of a classification level. Each level carries specific handling requirements, access controls, and storage mandates. The system works because it is universal (every piece of government information has a classification), enforceable (violations have consequences), and operational (the classification determines the handling, not the other way around).

Most private-sector organizations do not have a functioning data classification program. They treat all data the same, which means either every document gets the same nominal protection (insufficient for sensitive data) or every document gets the same maximum protection (operationally impractical and unsustainable). The result is that sensitive data receives less protection than it needs while non-sensitive data consumes more resources than it deserves.

How It Works

Classification Tiers

A standard enterprise data classification scheme uses four tiers:

Public. Information intended for public consumption. Marketing materials, published financial reports, press releases, job postings. No access restrictions. No special handling. Loss or exposure has no business impact.

Internal. Information intended for use within the organization but not public. Internal memos, meeting notes, general project documentation, organizational charts. Access restricted to employees and authorized contractors. Loss or exposure causes minor business impact (embarrassment, competitive disadvantage) but no regulatory consequence.

Confidential. Information that would cause significant harm if disclosed. Customer personally identifiable information (PII), employee records, financial data, intellectual property, legal documents, strategic plans, merger and acquisition details. Access restricted to specific roles with a business need. Loss or exposure causes regulatory consequences (GDPR fines, breach notification obligations), competitive damage, or legal liability.

Restricted. The most sensitive information the organization holds. Trade secrets, source code (for software companies), classified government data (for contractors), payment card data (for PCI DSS scope), protected health information (for HIPAA scope), cryptographic key material. Access restricted to named individuals with explicit authorization. Loss or exposure causes severe regulatory penalties, existential business damage, or national security impact.

Some organizations use three tiers (combining Internal and Confidential). Others use five or more (adding sub-tiers for regulatory-specific data). The number of tiers matters less than the consistency of application: every data asset must be classified, and the classification must determine the handling.
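A minimal sketch of the four tiers as an ordered type, so that "more sensitive" becomes a comparison code can make. The `effective_tier` helper and its rule (an aggregate asset takes the tier of its most sensitive component) are illustrative assumptions, not something the scheme above prescribes.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Four-tier scheme; a higher value means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def effective_tier(component_tiers):
    """Assumed rule: an aggregate is as sensitive as its most sensitive part."""
    tiers = list(component_tiers)
    if not tiers:
        # Mirrors the requirement that every data asset must be classified.
        raise ValueError("every data asset must be classified")
    return max(tiers)


# Hypothetical export that mixes marketing copy with customer PII.
export_tier = effective_tier([Tier.PUBLIC, Tier.CONFIDENTIAL])
print(export_tier.name)  # CONFIDENTIAL
```

An organization using three or five tiers would change only the enum members; the comparison logic stays the same.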

The Classification Process

Classification operates in three phases:

Discovery. Identify where data exists across the environment: databases, file shares, cloud storage, SaaS applications, email, endpoints, backup systems, development environments, analytics platforms. Data spreads. It copies. It migrates to places nobody intended. Discovery must be comprehensive and recurring, because new data is created daily and existing data moves continuously.

Automated discovery tools (data discovery and classification platforms such as Varonis, Microsoft Purview, Spirion, and BigID) scan repositories and identify sensitive data patterns: Social Security numbers, credit card numbers, health record identifiers, passport numbers, and custom patterns defined by the organization. Manual discovery supplements automation: interviewing data owners, reviewing application architectures, and mapping data flows to identify sensitive data that automated tools miss (intellectual property, strategic documents, and context-dependent sensitivity).
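Pattern-based detection can be sketched in a few lines. This is an illustration only, not how any of the named products work internally: the two regular expressions and the `scan_text` function are assumptions, and the Luhn checksum is used to discard digit runs that cannot be real card numbers.

```python
import re

# Illustrative patterns; production tools use far richer rule sets.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str) -> dict:
    """Return candidate SSNs and Luhn-valid card numbers found in text."""
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            cards.append(digits)
    return {"ssn": SSN_RE.findall(text), "card": cards}


# 4111 1111 1111 1111 is a well-known test number; the second one fails Luhn.
sample = "SSN 123-45-6789, card 4111 1111 1111 1111, order 4111 1111 1111 1112"
print(scan_text(sample))
```

The checksum step is the design point: a bare digit-count pattern would flag order numbers and timestamps, which is one source of the false positives that make unscoped monitoring impractical.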

Labeling. Apply the classification label to each data asset. Labeling can be automated (the discovery tool applies labels based on content patterns), manual (the data owner assigns the classification), or hybrid (automation suggests, humans confirm). Labels should be persistent (they travel with the data when it moves), visible (users know the classification of the data they are handling), and machine-readable (DLP tools, access controls, and encryption policies can consume the label and enforce appropriate controls).
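The hybrid mode (automation suggests, humans confirm) and the machine-readable requirement can be sketched together. The `Label` record, its field names, and the JSON layout are hypothetical; real labeling tools define their own metadata formats.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

TIERS = ("Public", "Internal", "Confidential", "Restricted")


@dataclass
class Label:
    asset: str
    suggested: str                      # tier proposed by automated scanning
    confirmed: Optional[str] = None     # tier set by the data owner, if any
    confirmed_by: Optional[str] = None

    @property
    def effective(self) -> str:
        """The owner's decision wins; otherwise fall back to the suggestion."""
        return self.confirmed or self.suggested

    def confirm(self, tier: str, owner: str) -> None:
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        self.confirmed = tier
        self.confirmed_by = owner

    def to_json(self) -> str:
        """Serialize so downstream tools can read the label."""
        record = asdict(self)
        record["effective"] = self.effective
        return json.dumps(record, sort_keys=True)


label = Label("q3-forecast.xlsx", suggested="Internal")
label.confirm("Confidential", owner="finance-data-owner")
print(label.to_json())
```

Keeping both the suggestion and the confirmation in the record makes disagreements between the scanner and the owner visible later.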

Microsoft Purview Information Protection (formerly Microsoft Information Protection, MIP), Google Workspace labels, and similar tools embed classification labels directly into files, emails, and documents. These labels can drive downstream controls: for example, policy can automatically encrypt a file labeled "Restricted", prevent it from being forwarded outside the organization, and raise a DLP alert if it is copied to a USB drive.

Control application. Each classification tier maps to a set of protection controls:

| Tier | Encryption | Access Control | DLP Monitoring | Retention | Disposal |
|------|-----------|---------------|----------------|-----------|---------|
| Public | Optional | None | None | Indefinite | Standard deletion |
| Internal | In transit (TLS) | Employee access | Basic monitoring | Per policy | Standard deletion |
| Confidential | At rest and in transit | Role-based, need-to-know | Active monitoring and blocking | Regulated retention | Verified secure disposal |
| Restricted | At rest, in transit, and in use where possible | Named individuals, explicit authorization, MFA-required | Aggressive monitoring, real-time blocking, alert to security team | Minimum necessary, strict retention | Cryptographic erasure or physical destruction, chain of custody documented |

The control matrix is the operational output of classification. Without it, classification is a labeling exercise that changes nothing. With it, classification drives encryption decisions, access control policies, DLP rules, retention schedules, and disposal procedures.
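One way to make the matrix operational is to hold it as data that other tooling can query. A minimal sketch: the values are copied from the control matrix above, while the dictionary key names and the `controls_for` function are illustrative.

```python
# Values copied from the control matrix; key names are illustrative.
CONTROL_MATRIX = {
    "Public": {
        "encryption": "Optional",
        "access_control": "None",
        "dlp": "None",
        "retention": "Indefinite",
        "disposal": "Standard deletion",
    },
    "Internal": {
        "encryption": "In transit (TLS)",
        "access_control": "Employee access",
        "dlp": "Basic monitoring",
        "retention": "Per policy",
        "disposal": "Standard deletion",
    },
    "Confidential": {
        "encryption": "At rest and in transit",
        "access_control": "Role-based, need-to-know",
        "dlp": "Active monitoring and blocking",
        "retention": "Regulated retention",
        "disposal": "Verified secure disposal",
    },
    "Restricted": {
        "encryption": "At rest, in transit, and in use where possible",
        "access_control": "Named individuals, explicit authorization, MFA-required",
        "dlp": "Aggressive monitoring, real-time blocking, alert to security team",
        "retention": "Minimum necessary, strict retention",
        "disposal": "Cryptographic erasure or physical destruction, chain of custody documented",
    },
}


def controls_for(tier: str) -> dict:
    """Look up required controls; an unknown tier is an error, not a default."""
    try:
        return CONTROL_MATRIX[tier]
    except KeyError:
        raise ValueError(f"unclassified or unknown tier: {tier!r}") from None


print(controls_for("Confidential")["encryption"])  # At rest and in transit
```

Raising on an unknown tier rather than falling back to a default keeps unclassified data from silently inheriting the weakest controls.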

Why It Matters

Every DPS Control Depends On It

Data Loss Prevention cannot function without classification. DLP rules need to know what data to monitor: "Block transmission of Restricted data to external recipients" requires knowing which data is Restricted. Without classification labels, DLP either monitors everything (generating overwhelming false positives) or monitors nothing specific (missing actual exfiltration).
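The quoted rule can be sketched as a function of the label and the destination. Everything named here is hypothetical: the `Transfer` record, the `INTERNAL_DOMAINS` set, and the three outcome strings. The point is that the rule has nothing to evaluate when the label is missing.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical: domains treated as inside the organization.
INTERNAL_DOMAINS = {"example.com"}


@dataclass(frozen=True)
class Transfer:
    label: Optional[str]   # classification label on the data, if any
    recipient: str         # destination email address


def is_external(recipient: str) -> bool:
    domain = recipient.rsplit("@", 1)[-1].lower()
    return domain not in INTERNAL_DOMAINS


def dlp_decision(transfer: Transfer) -> str:
    """Apply 'block Restricted data to external recipients'."""
    if transfer.label is None:
        # Without a label the rule cannot decide; surface the gap instead.
        return "unlabeled"
    if transfer.label == "Restricted" and is_external(transfer.recipient):
        return "block"
    return "allow"


print(dlp_decision(Transfer("Restricted", "partner@other.example")))  # block
print(dlp_decision(Transfer("Restricted", "colleague@example.com")))  # allow
print(dlp_decision(Transfer(None, "partner@other.example")))          # unlabeled
```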

Encryption prioritization depends on classification. An organization that encrypts everything equally wastes resources on public data while potentially under-investing in key management for restricted data. Classification enables targeted encryption: Restricted data gets hardware security module (HSM) managed keys with automated rotation. Internal data gets standard TLS in transit. The investment matches the risk.

Access control scoping depends on classification. Role-based access should map to data classification tiers. A finance analyst needs access to Confidential financial data. They do not need access to Restricted source code. Without classification, access control is either too broad (everyone can access everything) or too narrow (nobody can access anything without an exception request, which creates operational friction and shadow IT workarounds).
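The finance analyst example can be expressed as grants scoped by tier and data category. This is a sketch under stated assumptions: the role names, the category names, and the rule that every known role may read Public and Internal data are all hypothetical.

```python
# Hypothetical grants: role -> set of (tier, category) pairs it may read.
GRANTS = {
    "finance_analyst": {("Confidential", "financial")},
    "platform_engineer": {("Restricted", "source_code")},
}

# Assumption: every known role is an employee, so these tiers are open to all of them.
OPEN_TIERS = {"Public", "Internal"}


def can_access(role: str, tier: str, category: str) -> bool:
    """Allow access only when the role holds an explicit grant for the tier and category."""
    if role not in GRANTS:
        return False
    if tier in OPEN_TIERS:
        return True
    return (tier, category) in GRANTS[role]


print(can_access("finance_analyst", "Confidential", "financial"))   # True
print(can_access("finance_analyst", "Restricted", "source_code"))   # False
```

Scoping grants by tier and category avoids both failure modes described above: access is neither open to everyone nor dependent on one-off exception requests.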

Regulatory Mandates

GDPR requires organizations to identify personal data and apply appropriate safeguards. HIPAA requires identification and protection of Protected Health Information (PHI). PCI DSS requires identifying where cardholder data is stored, processed, and transmitted in order to define the cardholder data environment; segmentation is recommended to reduce that scope rather than mandated. NIST SP 800-171 requires identification and marking of Controlled Unclassified Information (CUI). Every major regulation begins with "know what sensitive data you have" because you cannot comply with protection requirements for data you have not identified.

Organizations that cannot demonstrate a functioning classification program during regulatory audits face findings, remediation requirements, and in severe cases, fines. GDPR fines for inadequate data protection have exceeded 1 billion euros in aggregate. The classification program is the evidentiary foundation that demonstrates the organization knows what data it holds and how it protects it.

The Shadow Data Problem

The most dangerous data in any organization is the data nobody knows about. A copy of the production database in a developer's personal cloud storage. A spreadsheet of customer Social Security numbers emailed to a personal account "for working from home." An analytics export containing health records downloaded to an unencrypted laptop. A backup of the financial system stored on a NAS device in a closet that nobody has inventoried.

Shadow data is unclassified, unprotected, and invisible to security controls. It is the DPS equivalent of a geological fault line: invisible under normal conditions, catastrophic when stressed. Classification programs that include recurring discovery (not just a one-time project) surface shadow data before attackers do.
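At its simplest, surfacing shadow data is a comparison between what the inventory says exists and what a discovery scan actually finds. A minimal sketch, assuming a hypothetical inventory of known locations and a scan result that maps each location to the sensitive patterns found there:

```python
def find_shadow_data(inventory, findings):
    """Return scanned locations that hold sensitive patterns but are not inventoried."""
    return {
        location: patterns
        for location, patterns in findings.items()
        if patterns and location not in inventory
    }


# Hypothetical inventory and scan results.
inventory = {"db://prod/customers", "s3://corp-backups/finance"}
findings = {
    "db://prod/customers": ["ssn"],
    "s3://dev-scratch/prod-copy.sql": ["ssn", "card"],
    "nas://closet-01/finance.bak": ["card"],
    "share://marketing/press-kit": [],
}

for location in sorted(find_shadow_data(inventory, findings)):
    print(location)
```

Running this comparison on every discovery cycle, rather than once, is what turns a one-time project into the recurring discovery described above.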

CDA Perspective

Data classification is the first operation in the DPS domain of the Planetary Defense Model. CDA's Sovereign Data Protocol (SDP) begins with classification because every subsequent SDP operation depends on it. "Your data lives where you decide. Period." That decision is meaningless if you do not know what data you have, where it lives, and how sensitive it is.

The geological parallel is precise. The Earth's geological strata are layers of different material at different depths, each with different properties. Sedimentary rock near the surface behaves differently from the metallic core. Data classification creates the same layered structure: Public data sits at the surface, accessible to all. Restricted data sits at the core, accessible only to those who have specific authorization and the capability to reach it. The protection architecture matches the layer.

CDA's own access model mirrors government classification. CDA.Wiki content is classified by clearance level: Unclassified (free, public), CUI (requires free Nexus ID), Confidential (requires Cadet membership), Secret (requires Enlisted), Top Secret (requires Officer), and TS/SCI (Crew only). The classification determines the access control. The access control enforces the classification. The system is consistent from the U.S. government's classification framework through CDA's content model through the DPS controls CDA deploys for clients.

Three TOP missions connect directly to data classification:

DPS-R01 (Data Asset Discovery): Discover where sensitive data exists across the environment. 16 estimated hours. This is the prerequisite for everything else.
DPS-R02 (Data Classification Assessment): Assess the current classification state. Is data classified? Is the classification accurate? Are controls aligned to classification tiers? 12 estimated hours.
DPS-B01 (Data Classification Policy): Build the classification policy: define tiers, handling requirements, labeling procedures, control matrices, and accountability. 24 estimated hours.

Classification also interacts with the adjacent domains. VSD discovers exposed data assets (a database visible to the internet is a VSD finding with DPS implications). SPH maintains the configuration of classification tools and DLP policies. IAT controls who can access data at each classification tier. TID detects unauthorized access or exfiltration of classified data. RGA mandates the classification program through compliance frameworks and tracks coverage as a governance metric.

CDA's approach to classification differs from conventional consultancies in one way: we treat classification as an operational program, not a project. A conventional consultancy runs a data discovery scan, produces a report, and leaves. CDA's DPS-R01 and DPS-R02 establish the baseline. DPS-B01 builds the program. The program runs continuously through C-COMMAND (DPS-C01: Data Protection Operations, DPS-C02: Data Governance Program). Discovery is recurring. Classification is maintained. Controls are enforced. Drift is detected. The program operates.
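Drift detection can be sketched as a recurring comparison between the label an asset carries and what the latest scan implies. This is an assumption about what "drift" means in practice, not a description of C-COMMAND: the `detect_drift` function and its inputs are hypothetical, and only under-classification and missing labels are reported.

```python
RANK = {"Public": 0, "Internal": 1, "Confidential": 2, "Restricted": 3}


def detect_drift(labels, detected):
    """Return (asset, current_label, detected_tier) where content outgrew its label.

    labels:   asset -> tier currently applied
    detected: asset -> tier implied by the most recent discovery scan
    """
    drift = []
    for asset, found in sorted(detected.items()):
        current = labels.get(asset)
        if current is None:
            drift.append((asset, "unlabeled", found))
        elif RANK[found] > RANK[current]:
            drift.append((asset, current, found))
    return drift


# Hypothetical state: one asset gained PII, one was never labeled.
labels = {"wiki/onboarding.md": "Internal", "exports/q3.csv": "Internal"}
detected = {
    "wiki/onboarding.md": "Internal",
    "exports/q3.csv": "Confidential",
    "tmp/customers-copy.csv": "Confidential",
}
print(detect_drift(labels, detected))
```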

Key Takeaways

Data classification organizes data by sensitivity and applies appropriate controls to each tier. Every other DPS control (encryption, DLP, access control, retention, disposal) depends on classification.
A standard four-tier scheme (Public, Internal, Confidential, Restricted) with a defined control matrix is sufficient for most organizations. Consistency of application matters more than number of tiers.
Discovery must be recurring, not one-time. Shadow data (copies, exports, backups in uncontrolled locations) is the most dangerous data because it is unclassified and unprotected.
Every major regulation (GDPR, HIPAA, PCI DSS, NIST SP 800-171) begins with "identify your sensitive data." Classification is the evidentiary foundation for compliance.
CDA treats classification as a continuous operational program, not a one-time project. Discovery recurs. Labels persist. Controls enforce. Drift is detected.

Sources

National Institute of Standards and Technology (NIST). "Guide to Protecting the Confidentiality of Personally Identifiable Information: SP 800-122." U.S. Department of Commerce, April 2010.
International Organization for Standardization. "ISO/IEC 27001:2022: Information Security Management Systems, Annex A.5.12 (Classification of Information)." ISO, 2022.
National Archives and Records Administration. "Controlled Unclassified Information (CUI) Registry." CUI.gov, updated continuously.
European Parliament and Council. "General Data Protection Regulation (GDPR): Regulation (EU) 2016/679, Articles 5 and 32." Official Journal of the European Union, 2016.
PCI Security Standards Council. "PCI DSS v4.0: Requirement 3 (Protect Stored Account Data)." PCI SSC, March 2022.

