Semi-Supervised Learning Approach for Automated Sensitive Data Classification in Unstructured Text Documents

Juan Li; Wenkun Ren; Xiaolan Wu

Authors

Juan Li, Shanghai Jiao Tong University, Master of Science in Communication and Information Systems
Wenkun Ren, Information Technology and Management, Illinois Institute of Technology, Chicago, IL
Xiaolan Wu, Northeastern University, Computer Science

Keywords:

Sensitive Data Classification, Semi-Supervised Learning, Unstructured Text Processing, Privacy Compliance

Abstract

Effective data protection depends on accurately identifying and categorizing sensitive information across organizational repositories. Manual classification is slow and produces inconsistent category assignments that undermine governance frameworks. This research develops a semi-supervised learning architecture for automating sensitive data classification in unstructured text, addressing the fundamental challenge of limited labeled training instances. We construct a probabilistic framework that integrates self-training with confidence-weighted label propagation, achieving 87.3% classification accuracy with only 12% of the data labeled. The methodology applies natural language processing techniques to extract features from heterogeneous document formats, including emails, reports, and collaborative workspace content. Our experimental evaluation across three organizational datasets demonstrates precision improvements of 14.6% over baseline supervised approaches under constrained labeling scenarios. We establish a four-tier sensitivity taxonomy aligned with regulatory compliance requirements and implement adaptive decision boundaries for human validation workflows. Performance analysis shows recall consistently exceeding 82% across the personally identifiable information, financial records, and proprietary intelligence categories. The framework reduces manual annotation requirements by 73% while maintaining classification fidelity sufficient for regulatory compliance auditing.
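
The self-training idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a base classifier, a confidence threshold, or a weighting scheme, so everything here is an assumption made for illustration. The sketch uses a small multinomial naive Bayes model, a hypothetical `threshold` parameter, and treats "confidence-weighted" as meaning that each pseudo-labeled document contributes to retraining in proportion to the model's confidence in its label. The function names (`train_nb`, `predict_proba`, `self_train`) are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for real NLP feature extraction)."""
    return text.lower().split()


def train_nb(docs, labels, weights):
    """Fit a multinomial naive Bayes model; each document counts with its weight."""
    class_weight = defaultdict(float)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label, w in zip(docs, labels, weights):
        class_weight[label] += w
        for tok in tokenize(doc):
            word_counts[label][tok] += w
            vocab.add(tok)
    return class_weight, word_counts, vocab


def predict_proba(model, doc):
    """Return a dict of class -> posterior probability (Laplace smoothing)."""
    class_weight, word_counts, vocab = model
    total = sum(class_weight.values())
    log_scores = {}
    for label, cw in class_weight.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(cw / total)
        for tok in tokenize(doc):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1.0) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Self-training loop with confidence-weighted pseudo-labels.

    labeled:   list of (text, label) pairs
    unlabeled: list of texts
    Returns the final model and the texts that never reached the threshold.
    """
    docs = [d for d, _ in labeled]
    labels = [l for _, l in labeled]
    weights = [1.0] * len(labeled)  # human labels count fully
    pool = list(unlabeled)
    model = train_nb(docs, labels, weights)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for doc in pool:
            probs = predict_proba(model, doc)
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                # Accept the pseudo-label, weighted by the model's confidence.
                docs.append(doc)
                labels.append(label)
                weights.append(conf)
                added += 1
            else:
                remaining.append(doc)
        pool = remaining
        if added == 0:  # nothing new was learned; stop early
            break
        model = train_nb(docs, labels, weights)
    return model, pool
```

In this sketch the documents left in the returned pool are those the model never labeled with enough confidence. One plausible use, again an assumption rather than something the abstract specifies, is to route that residue to the human validation workflow.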

Author Biography

Xiaolan Wu, Northeastern University Computer Science
