Semi-Supervised Learning Approach for Automated Sensitive Data Classification in Unstructured Text Documents

Authors

  • Juan Li, Shanghai Jiao Tong University, Master of Science in Communication and Information Systems
  • Wenkun Ren, Information Technology and Management, Illinois Institute of Technology, Chicago, IL
  • Xiaolan Wu, Northeastern University, Computer Science

Keywords:

Sensitive Data Classification, Semi-Supervised Learning, Unstructured Text Processing, Privacy Compliance

Abstract

Effective data protection demands accurate identification and categorization of sensitive information across organizational repositories. Manual classification methodologies introduce substantial temporal overhead while generating inconsistent taxonomic assignments that undermine governance frameworks. This research develops a semi-supervised learning architecture for automating sensitive data classification within unstructured textual environments, addressing the fundamental challenge of limited labeled training instances. We construct a probabilistic framework integrating self-training mechanisms with confidence-weighted label propagation, achieving 87.3% classification accuracy using merely 12% labeled data. The methodology applies natural language processing techniques for feature extraction from heterogeneous document formats including emails, reports, and collaborative workspace content. Our experimental evaluation across three organizational datasets demonstrates precision improvements of 14.6% over baseline supervised approaches under constrained labeling scenarios. We establish a four-tier sensitivity taxonomy aligned with regulatory compliance requirements and implement adaptive decision boundaries for human validation workflows. Performance analysis reveals consistent recall rates exceeding 82% across personally identifiable information, financial records, and proprietary intelligence categories. The framework reduces manual annotation requirements by 73% while maintaining classification fidelity sufficient for regulatory compliance auditing.
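
The self-training idea named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the abstract does not name the base classifier, so a small multinomial Naive Bayes stands in for it; the 0.7 confidence threshold, the function names, and the two label names are hypothetical; and the confidence-weighted label propagation step is reduced here to a plain confidence cutoff. Documents whose top predicted probability clears the threshold are pseudo-labeled and added to the training set, and the rest are left in a queue for human validation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized text."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_proba(model, text):
    """Return {label: posterior probability} for a single document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in text.lower().split():
            # Laplace (add-one) smoothing so unseen tokens get nonzero mass.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalize in a numerically stable way.
    peak = max(log_scores.values())
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}

def self_train(labeled, unlabeled, threshold=0.7, max_rounds=5):
    """Iteratively pseudo-label confident documents (hypothetical helper).

    Returns the final model and the documents that never reached the
    threshold, which would go to human reviewers.
    """
    docs = [text for text, _ in labeled]
    labels = [label for _, label in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(docs, labels)
        remaining = []
        for text in pool:
            probs = predict_proba(model, text)
            label, confidence = max(probs.items(), key=lambda kv: kv[1])
            if confidence >= threshold:
                docs.append(text)       # accept the pseudo-label
                labels.append(label)
            else:
                remaining.append(text)  # keep for a later round or review
        if len(remaining) == len(pool):
            break                       # nothing new was labeled; stop
        pool = remaining
    return train_nb(docs, labels), pool

# Toy data; the label names are illustrative, not the paper's taxonomy.
labeled = [
    ("ssn passport date of birth", "pii"),
    ("invoice account balance payment", "financial"),
]
unlabeled = [
    "passport ssn renewal",
    "payment invoice overdue",
    "lunch menu friday",
]
model, review_queue = self_train(labeled, unlabeled)
```

In this toy run the first two unlabeled documents are absorbed as pseudo-labeled examples, while the third shares no vocabulary with either class and stays in `review_queue`. Raising the threshold trades fewer pseudo-labeling errors for a longer review queue, which is the kind of trade-off an adaptive decision boundary would tune.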

Published

2024-07-07

How to Cite

Li, J., Ren, W., & Wu, X. (2024). Semi-Supervised Learning Approach for Automated Sensitive Data Classification in Unstructured Text Documents. Journal of Global Engineering Review, 2(2), 1-17. https://gereview.com/index.php/jger/article/view/4