Semi-Supervised Learning for Criminal Records Classification
Government Agency
85%
Reduction in labeling effort
92%
Classification accuracy
47
Offense categories classified
The Challenge
A government agency responsible for managing criminal justice records was struggling with a massive backlog of unclassified documents. Its records system contained millions of entries spanning decades: arrest reports, court filings, sentencing documents, and parole records. Each document needed to be classified by offense type, severity, and jurisdiction to support searchability, compliance reporting, and inter-agency data sharing.

The problem: only about 5% of the records had ever been manually labeled by staff. Hiring a team to label the remaining millions would have taken years and cost millions of dollars, yet traditional supervised learning could not reach acceptable accuracy when trained on such a small labeled subset. The agency could not afford to wait for a larger labeled dataset.
Our Solution
We designed a semi-supervised learning system that used the small pool of labeled records to bootstrap a much larger training signal from the unlabeled data. The approach combined self-training with consistency regularization: we first trained a teacher model on the labeled subset, used it to generate pseudo-labels for high-confidence unlabeled records, and iteratively retrained the model on the expanding labeled pool.

To keep the pseudo-labels reliable, we enforced a confidence threshold with human-in-the-loop verification for borderline cases, keeping the agency's domain experts involved without overwhelming them. We also applied text augmentation (synonym replacement, back-translation) to increase the effective diversity of the training set.

The final model was a fine-tuned transformer that classified documents across 47 offense categories with high accuracy. An active learning component identified the most informative unlabeled records to send for human review, maximizing the value of every hour of manual labeling. We deployed the pipeline on AWS with a simple review interface so agency staff could approve, correct, or flag classifications as they came through.
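The self-training loop described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the `self_train` helper and the toy nearest-centroid classifier stand in for the fine-tuned transformer, and the 0.9 confidence threshold is an assumed value for demonstration.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, fit, predict_proba,
               threshold=0.9, max_rounds=5):
    """Iteratively absorb high-confidence pseudo-labels into the training pool.

    Records the model is never confident about are returned so they can be
    routed to human reviewers instead of being force-labeled.
    """
    X_pool, y_pool = X_lab.copy(), y_lab.copy()
    unlab = X_unlab.copy()
    for _ in range(max_rounds):
        model = fit(X_pool, y_pool)          # retrain on the expanding pool
        if len(unlab) == 0:
            break
        proba = predict_proba(model, unlab)
        keep = proba.max(axis=1) >= threshold  # only trust confident predictions
        if not keep.any():
            break                             # borderline cases go to humans
        X_pool = np.vstack([X_pool, unlab[keep]])
        y_pool = np.concatenate([y_pool, proba[keep].argmax(axis=1)])
        unlab = unlab[~keep]
    return fit(X_pool, y_pool), unlab

# Toy classifier: class centroids + softmax over negative distances.
def fit(X, y):
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_proba(model, X):
    _, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None], axis=2)
    e = np.exp(-d + d.min(axis=1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=1, keepdims=True)
```

In the real system the `fit`/`predict_proba` pair would be the transformer training and inference steps, and the leftover low-confidence records feed the human review queue.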
Key Implementation Highlights
- Used self-training with confidence thresholds to generate reliable pseudo-labels
- Built active learning pipeline to prioritize the most valuable records for human review
- Classified documents across 47 offense categories with 92% accuracy
- Reduced manual labeling effort by 85%, saving thousands of staff hours
- Deployed review interface for agency staff to verify and correct edge cases
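The active-learning prioritization in the highlights above can be illustrated with least-confidence uncertainty sampling, one common strategy for picking the records a human should label next. The `select_for_review` helper is a hypothetical sketch, not the deployed code.

```python
import numpy as np

def select_for_review(proba, k):
    """Rank unlabeled records by model uncertainty and return the k
    most uncertain row indices (least-confidence sampling)."""
    uncertainty = 1.0 - proba.max(axis=1)   # low top-class probability = uncertain
    return np.argsort(-uncertainty)[:k]     # most uncertain first
```

Sending these records, rather than random ones, to reviewers is what lets each hour of manual labeling shrink the model's error fastest.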