The AI revolution has created a paradox for companies handling sensitive data. We need the power of advanced language models, but we can’t risk exposing confidential information. For Ingedata, working with healthcare and defence clients, this isn’t just a technical challenge – it’s a compliance requirement.
The solution? Smart data redaction using Named Entity Recognition (NER) before anything reaches external AI services.
Traditional approaches to data privacy often involve either:
None of these work when you need to leverage cutting-edge LLMs while maintaining strict data protection standards like those required in the healthcare or defence sectors.
Named Entity Recognition is a natural language processing technique that identifies and classifies entities in text. Think of it as an intelligent pattern matcher that can recognize:
Unlike simple regex patterns that look for specific formats, NER uses machine learning models trained on language patterns to understand context. It knows that “Apple” in “Apple Inc.” is an organization, while “apple” in “ate an apple” is just fruit.
The magic happens through MITIE (MIT Information Extraction), an open-source library developed at MIT for information extraction, including named entity recognition. MITIE uses:
What makes MITIE particularly valuable is its balance of accuracy and speed: it is fast enough for real-time processing, yet accurate enough to handle complex text.
Here’s the workflow we use at Ingedata:
For example:
Original: "Dr. Sarah Johnson at Boston Medical needs the lab results for patient ID 12345"
Redacted: "[PERSON_1] at [LOCATION_1] needs the lab results for patient ID [ID_1]"
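The substitution step in this example can be sketched in a few lines of Python. This is a minimal illustration, not our production code: it assumes the NER step has already produced entity spans as `(start, end, label)` tuples, and the names `redact`, `restore` and `span` are hypothetical helpers introduced here. The spans are located by hand for the demo (a generic NER model would not necessarily tag a patient ID).

```python
from typing import Dict, List, Tuple

# Hypothetical entity shape for this sketch: (start offset, end offset, label).
Entity = Tuple[int, int, str]


def span(text: str, phrase: str, label: str) -> Entity:
    """Demo helper: locate a phrase in the text and tag it with a label."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def redact(text: str, entities: List[Entity]) -> Tuple[str, Dict[str, str]]:
    """Replace each entity span with a numbered placeholder.

    Returns the redacted text and a placeholder -> original mapping.
    """
    counters: Dict[str, int] = {}
    seen: Dict[Tuple[str, str], str] = {}  # (label, original) -> placeholder
    mapping: Dict[str, str] = {}           # placeholder -> original
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            continue  # skip spans that overlap one already redacted
        original = text[start:end]
        key = (label, original)
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = f"[{label}_{counters[label]}]"
            mapping[seen[key]] = original
        pieces.append(text[cursor:start])
        pieces.append(seen[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back into a response that uses placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


original = "Dr. Sarah Johnson at Boston Medical needs the lab results for patient ID 12345"
entities = [
    span(original, "Dr. Sarah Johnson", "PERSON"),
    span(original, "Boston Medical", "LOCATION"),
    span(original, "12345", "ID"),
]
redacted, mapping = redact(original, entities)
print(redacted)
```

The mapping never leaves our side: only the redacted string is sent out, and `restore` is applied to whatever comes back.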
The key advantage of NER over basic regex filtering is contextual understanding. Consider these examples:
Traditional regex would either miss these variations or create too many false positives.
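To make the limitation concrete, here is a small sketch of a naive regex approach. The pattern is a hypothetical one we wrote for illustration (two consecutive capitalised words treated as a person name); it is not taken from any real redaction tool.

```python
import re

# Hypothetical naive rule: two consecutive capitalised words are a person name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def regex_candidates(text: str) -> list:
    """Return every substring the naive pattern would treat as a name."""
    return NAME_PATTERN.findall(text)


# False positive: a place is flagged alongside the actual name.
print(regex_candidates("Sarah Johnson visited New York"))
# Miss: the same name in lowercase is not detected at all.
print(regex_candidates("forwarded to sarah johnson yesterday"))
```

The pattern has no way to tell a person from a place or a company, because it only sees the shape of the text, not its context.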
When implementing NER-based redaction, several factors matter:
Model Selection: MITIE provides good general-purpose models, but domain-specific training can improve accuracy for specialized terminology (medical terms, technical jargon).
Confidence Thresholds: Setting appropriate confidence thresholds reduces false positives. We typically use higher thresholds (0.75+) for critical data types.
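A per-label threshold filter might look like the sketch below. The `Detection` record, the label names and the 0.5 default are assumptions made for illustration; only the 0.75 figure comes from our own practice, and real NER libraries may report scores on a different scale.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one entity reported by the NER step."""
    label: str
    text: str
    score: float


# Assumed values: 0.75 for critical types, a lower default for the rest.
CRITICAL_THRESHOLDS = {"PERSON": 0.75, "ID": 0.75}
DEFAULT_THRESHOLD = 0.5


def accept(
    detections: List[Detection],
    thresholds: Mapping[str, float] = CRITICAL_THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> List[Detection]:
    """Keep only detections whose score meets the threshold for their label."""
    return [d for d in detections if d.score >= thresholds.get(d.label, default)]


detections = [
    Detection("PERSON", "Sarah Johnson", 0.82),
    Detection("PERSON", "May", 0.60),        # below the 0.75 cut-off
    Detection("LOCATION", "Boston", 0.60),   # passes the 0.5 default
    Detection("LOCATION", "Lab", 0.40),
]
kept = accept(detections)
```

Anything dropped by the filter is left unredacted, so the cut-off trades false positives against missed redactions and should be tuned per data type.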
Performance vs. Accuracy: NER processing adds latency compared to regex. For high-volume applications, consider batch processing or async workflows.
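One way to batch the work is a small thread pool, sketched below under stated assumptions: `redact_one` stands in for whatever single-document redaction function is in use, and `fake_redact` is a placeholder for the demo only. Whether threads actually speed things up depends on the NER library releasing the interpreter lock; a process pool is the usual alternative when it does not.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def redact_batch(
    documents: Iterable[str],
    redact_one: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Redact many documents concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(redact_one, documents))


def fake_redact(text: str) -> str:
    """Stand-in for the real NER redaction step (demo only)."""
    return text.replace("Sarah Johnson", "[PERSON_1]")


docs = ["Call Sarah Johnson.", "No names here.", "Sarah Johnson signed off."]
results = redact_batch(docs, fake_redact)
```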
Consistency: When processing multiple related documents, maintaining consistent placeholder mapping ensures coherent responses from AI services.
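A shared registry is one simple way to keep placeholders stable across related documents. The sketch below is illustrative: `PlaceholderRegistry` is a hypothetical class, and the case-insensitive matching is an assumption that may be too loose or too strict for a given dataset.

```python
from typing import Dict, Tuple


class PlaceholderRegistry:
    """Hands out one stable placeholder per (label, value) across documents."""

    def __init__(self) -> None:
        self._by_value: Dict[Tuple[str, str], str] = {}
        self._originals: Dict[str, str] = {}
        self._counters: Dict[str, int] = {}

    def placeholder_for(self, label: str, value: str) -> str:
        # Assumption: values that differ only by case or padding are the same entity.
        key = (label, value.strip().casefold())
        if key not in self._by_value:
            self._counters[label] = self._counters.get(label, 0) + 1
            placeholder = f"[{label}_{self._counters[label]}]"
            self._by_value[key] = placeholder
            self._originals[placeholder] = value.strip()  # first-seen spelling
        return self._by_value[key]

    def original_for(self, placeholder: str) -> str:
        return self._originals[placeholder]


registry = PlaceholderRegistry()
first = registry.placeholder_for("PERSON", "Sarah Johnson")   # document 1
second = registry.placeholder_for("PERSON", "sarah johnson")  # document 2
other = registry.placeholder_for("PERSON", "Mark Lee")
place = registry.placeholder_for("LOCATION", "Boston")
```

Because the same registry instance is reused for every document in a batch, the AI service sees one placeholder per real-world entity and can reason about them coherently.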
In our healthcare projects, this approach has enabled:
The defence sector applications show similar benefits, particularly for processing classified documents where even location names or project codes need protection.
NER-based redaction isn’t perfect; it requires ongoing model maintenance and careful threshold tuning. But it represents the practical middle ground between complete AI avoidance and costly on-premises solutions.
As language models become more sophisticated, so do the privacy protection techniques we need. NER gives us a foundation that can evolve with both the threats and the opportunities ahead.
For organisations handling sensitive data, the question isn’t whether to use AI – it’s how to use it safely. Intelligent redaction through NER provides that path forward.