Semi-Supervised Learning Approach for Automated Sensitive Data Classification in Unstructured Text Documents
Keywords:
Sensitive Data Classification, Semi-Supervised Learning, Unstructured Text Processing, Privacy Compliance
Abstract
Effective data protection depends on accurately identifying and categorizing sensitive information across organizational repositories. Manual classification is slow and produces inconsistent category assignments that undermine governance frameworks. This research develops a semi-supervised learning architecture that automates sensitive data classification in unstructured text, addressing the fundamental challenge of scarce labeled training data. We construct a probabilistic framework that integrates self-training with confidence-weighted label propagation and achieves 87.3% classification accuracy with only 12% of the data labeled. The methodology applies natural language processing techniques to extract features from heterogeneous document types, including emails, reports, and collaborative workspace content. Experimental evaluation on three organizational datasets shows a 14.6% precision improvement over baseline supervised approaches under constrained labeling scenarios. We establish a four-tier sensitivity taxonomy aligned with regulatory compliance requirements and implement adaptive decision boundaries to support human validation workflows. Performance analysis shows recall consistently above 82% across the personally identifiable information, financial records, and proprietary intelligence categories. The framework reduces manual annotation requirements by 73% while maintaining classification fidelity sufficient for regulatory compliance auditing.
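
As an illustration of the general idea of self-training with confidence-weighted pseudo-labels, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a weighted multinomial naive Bayes base classifier over pre-tokenized documents, a fixed confidence threshold, and a policy of returning low-confidence documents for human review; the function names (`train_nb`, `predict_proba`, `self_train`), the threshold value, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Fit a weighted multinomial naive Bayes model.

    examples: list of (tokens, label, weight) triples, where weight is 1.0
    for human-labeled documents and the model confidence for pseudo-labels.
    """
    class_weight = defaultdict(float)
    token_weight = defaultdict(Counter)
    vocab = set()
    for tokens, label, weight in examples:
        class_weight[label] += weight
        for tok in tokens:
            token_weight[label][tok] += weight
            vocab.add(tok)
    return class_weight, token_weight, vocab


def predict_proba(model, tokens):
    """Return a dict mapping each label to its posterior probability."""
    class_weight, token_weight, vocab = model
    total = sum(class_weight.values())
    log_scores = {}
    for label, cw in class_weight.items():
        # Laplace-smoothed token likelihoods; unseen tokens are ignored.
        denom = sum(token_weight[label].values()) + len(vocab)
        score = math.log(cw / total)
        for tok in tokens:
            if tok in vocab:
                score += math.log((token_weight[label][tok] + 1.0) / denom)
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp_scores.values())
    return {k: v / z for k, v in exp_scores.items()}


def self_train(labeled, unlabeled, threshold=0.75, max_rounds=5):
    """Iteratively pseudo-label confident documents.

    Returns the final model and the documents that never reached the
    confidence threshold (assumed here to be sent to human validation).
    """
    train = [(tokens, label, 1.0) for tokens, label in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(train)
        remaining, added = [], 0
        for tokens in pool:
            probs = predict_proba(model, tokens)
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                # The pseudo-label contributes in proportion to its confidence.
                train.append((tokens, label, conf))
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:
            break
    return train_nb(train), pool


# Toy usage with hypothetical documents and labels.
labeled = [
    ("ssn passport birthdate".split(), "pii"),
    ("invoice revenue ledger".split(), "financial"),
]
unlabeled = [
    "passport ssn address".split(),
    "ledger invoice quarterly".split(),
    "lunch menu friday".split(),
]
model, needs_review = self_train(labeled, unlabeled)
print(needs_review)
print(predict_proba(model, ["passport", "address"]))
```

Weighting each pseudo-label by its confidence limits the influence of uncertain assignments on later rounds, and documents left in `needs_review` mark where human validation effort would be concentrated under these assumptions.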

