Protecting Sensitive Data

Using NER for Intelligent Redaction

The AI revolution has created a paradox for companies handling sensitive data. We need the power of advanced language models, but we can’t risk exposing confidential information. For Ingedata, working with healthcare and defense clients, this isn’t just a technical challenge; it’s a compliance requirement.

The solution? Smart data redaction using Named Entity Recognition (NER) before anything reaches external AI services.

The Privacy Problem

Traditional approaches to data privacy often involve one of the following:

Avoiding AI altogether (limiting capabilities)
Building expensive on-premise AI infrastructure (often impractical)
Manual data scrubbing (slow and error-prone)

None of these works when you need cutting-edge LLMs and must also meet strict data protection standards such as those required in the healthcare and defense sectors.

What is Named Entity Recognition?

Named Entity Recognition is a natural language processing technique that identifies and classifies entities in text. Think of it as an intelligent pattern matcher that can recognize:

People’s names (“Dr. Smith” or “Patient Johnson”)
Locations (“Boston Medical Center” or “Room 302”)
Organizations (“Ingedata” or “Massachusetts General Hospital”)
Structured data (credit cards, phone numbers, social security numbers), which in practice is often caught by format rules running alongside the statistical model

Unlike simple regex patterns that look for specific formats, NER uses machine learning models trained on language patterns to understand context. It knows that “Apple” in “Apple Inc.” is an organization, while “apple” in “ate an apple” is just fruit.

MITIE: The Engine Behind the Intelligence

The magic happens through MITIE (MIT Information Extraction), an open-source library developed at MIT for named entity recognition and binary relation detection. MITIE uses:

Structural Support Vector Machines for classification
Pre-trained models built on distributional word embeddings, which capture context
Confidence scoring to reduce false positives

What makes MITIE particularly valuable is its balance of accuracy and speed: it is fast enough for real-time processing while still handling complex text reliably.
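To make the extractor's output concrete, here is a minimal Python sketch of how detections might be turned into readable records. It assumes results shaped as (token range, tag, score) tuples, which is the shape shown in MITIE's published Python example for `extract_entities`. The `load_extractor` helper, the sample tokens, and the sample scores are illustrative assumptions; the sample entities are hard-coded, not produced by a real model.

```python
from typing import List, Sequence, Tuple, Union

Token = Union[str, bytes]
# Assumed result shape: (range of token indices, tag, score).
RawEntity = Tuple[range, str, float]


def load_extractor(model_path: str):
    """Load a MITIE NER model (needs the `mitie` package and a model file).

    Assumption: mirrors MITIE's published Python example; verify it
    against the version you have installed.
    """
    from mitie import named_entity_extractor  # lazy import: optional dependency
    return named_entity_extractor(model_path)


def _as_text(token: Token) -> str:
    # Tokens may arrive as bytes or str depending on the binding; accept both.
    return token.decode("utf-8") if isinstance(token, bytes) else token


def describe_entities(
    tokens: Sequence[Token], raw_entities: Sequence[RawEntity]
) -> List[Tuple[str, str, float]]:
    """Turn token-index ranges into (entity text, tag, score) records."""
    described = []
    for token_range, tag, score in raw_entities:
        text = " ".join(_as_text(tokens[i]) for i in token_range)
        described.append((text, tag, score))
    return described


# Hard-coded stand-in for real extractor output (illustrative values only).
sample_tokens = ["Dr.", "Sarah", "Johnson", "works", "at", "Boston", "Medical"]
sample_entities = [
    (range(1, 3), "PERSON", 1.2),
    (range(5, 7), "ORGANIZATION", 0.6),
]
described = describe_entities(sample_tokens, sample_entities)
```

With a real model, `sample_entities` would instead come from calling the extractor on tokenized text. The score scale depends on the model, so any cut-off should be calibrated on your own data.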

How It Works in Practice

Here’s the workflow we use at Ingedata:

Incoming text arrives containing sensitive information
NER processing identifies and categorizes entities
Redaction replaces sensitive data with placeholders like [PERSON_1] or [LOCATION_2]
Safe processing sends the cleaned text to external AI services
Response restoration maps placeholders back to original values

For example:

Original: "Dr. Sarah Johnson at Boston Medical needs the lab results for patient ID 12345"
Redacted: "[PERSON_1] at [LOCATION_1] needs the lab results for patient ID [ID_1]"
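The redaction and restoration steps can be sketched in a few lines of Python. This is a minimal illustration, not our production code: it assumes the NER step has already produced character-offset spans, and the `redact`, `restore`, and `span_of` helpers are hypothetical names. The spans here are located by hand with `str.index` to stand in for a real detector.

```python
import re
from typing import Dict, List, Tuple

# (start, end, tag): character offsets into the text plus an entity label.
Span = Tuple[int, int, str]

PLACEHOLDER = re.compile(r"\[[A-Z]+_\d+\]")


def redact(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each span with a numbered placeholder; return text and mapping."""
    counters: Dict[str, int] = {}
    by_entity: Dict[Tuple[str, str], str] = {}  # (tag, original) -> placeholder
    mapping: Dict[str, str] = {}                # placeholder -> original
    pieces: List[str] = []
    cursor = 0
    for start, end, tag in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already redacted
        original = text[start:end]
        key = (tag, original)
        if key not in by_entity:
            counters[tag] = counters.get(tag, 0) + 1
            by_entity[key] = f"[{tag}_{counters[tag]}]"
            mapping[by_entity[key]] = original
        pieces.append(text[cursor:start])
        pieces.append(by_entity[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Swap known placeholders back; unknown ones are left untouched."""
    return PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


def span_of(text: str, phrase: str, tag: str) -> Span:
    # Stand-in for a detector: find the phrase by exact match.
    start = text.index(phrase)
    return (start, start + len(phrase), tag)


text = "Dr. Sarah Johnson at Boston Medical needs the lab results for patient ID 12345"
spans = [
    span_of(text, "Dr. Sarah Johnson", "PERSON"),
    span_of(text, "Boston Medical", "LOCATION"),
    span_of(text, "12345", "ID"),
]
redacted, mapping = redact(text, spans)

# Simulated reply from an external AI service that only saw placeholders.
reply = "Lab results were sent to [PERSON_1] at [LOCATION_1]."
restored_reply = restore(reply, mapping)
```

The mapping never leaves your side of the boundary: only `redacted` is sent out, and `restore` is applied to whatever comes back.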

Beyond Simple Pattern Matching

The key advantage of NER over basic regex filtering is contextual understanding. Consider these examples:

Context matters: “Will Smith” (person) vs “will smith the metal” (action)
Variations: “Dr. Johnson”, “Johnson, MD”, “Sarah Johnson” all refer to people
Partial matches: “Johnson called about…” where only the surname appears

Traditional regex would either miss these variations or create too many false positives.
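A small Python experiment shows both failure modes. The pattern below is a deliberately naive, hypothetical "person name" rule (two adjacent capitalized words), chosen only to illustrate the point; real regex filters are usually more elaborate but hit the same wall.

```python
import re

# Naive "person name" rule: two adjacent capitalized words.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

samples = [
    "Sarah Johnson reviewed the chart.",  # full name: matched
    "Johnson called about the results.",  # surname only: missed
    "Send it to Boston Medical today.",   # not a person: false positive
]
matches = [NAME_PATTERN.findall(sample) for sample in samples]

for sample, found in zip(samples, matches):
    print(f"{sample!r} -> {found}")
```

Tightening the pattern to drop "Boston Medical" tends to lose real names, and loosening it to catch a lone "Johnson" flags every capitalized sentence opener. That trade-off is what a context-aware model is meant to avoid.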

Implementation Considerations

When implementing NER-based redaction, several factors matter:

Model Selection: MITIE provides good general-purpose models, but domain-specific training can improve accuracy for specialized terminology (medical terms, technical jargon).

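Confidence Thresholds: Setting appropriate confidence thresholds reduces false positives. We typically use higher thresholds (0.75+) for critical data types. Because a stricter threshold can also let genuine entities slip through unredacted, thresholds are best validated against labeled samples.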
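Per-type thresholds might be applied along these lines. This is a sketch under assumptions: the `Entity` record, the tag names, and the 0.5 default are hypothetical, and only the 0.75 figure comes from the paragraph above. Detections that fall below the threshold are kept in a separate review list here instead of being discarded, which is one possible design and not a description of our pipeline.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    text: str
    tag: str
    score: float


# 0.75 for critical types comes from the article; the tag names and
# the default value are hypothetical.
THRESHOLDS: Dict[str, float] = {"PERSON": 0.75, "ID": 0.75}
DEFAULT_THRESHOLD = 0.5


def split_by_confidence(
    entities: List[Entity],
    thresholds: Dict[str, float] = THRESHOLDS,
    default: float = DEFAULT_THRESHOLD,
) -> Tuple[List[Entity], List[Entity]]:
    """Separate detections that meet their type's threshold from the rest."""
    accepted: List[Entity] = []
    needs_review: List[Entity] = []
    for entity in entities:
        if entity.score >= thresholds.get(entity.tag, default):
            accepted.append(entity)
        else:
            needs_review.append(entity)
    return accepted, needs_review


detected = [
    Entity("Sarah Johnson", "PERSON", 0.92),
    Entity("Will", "PERSON", 0.40),
    Entity("Boston Medical", "LOCATION", 0.61),
]
accepted, needs_review = split_by_confidence(detected)
```

Keeping the low-confidence list around makes it possible to audit what the threshold is filtering out before trusting it on sensitive material.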

Performance vs. Accuracy: NER processing adds latency compared to regex. For high-volume applications, consider batch processing or async workflows.
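One way to keep NER latency off the request path is to run redaction calls concurrently with a cap on parallelism. The sketch below uses Python's `asyncio.to_thread` (Python 3.9+); `redact_document` is a hypothetical stand-in for a blocking NER and redaction call, and the concurrency limit is an arbitrary example value.

```python
import asyncio
from typing import List


def redact_document(text: str) -> str:
    """Hypothetical stand-in for a blocking NER + redaction call."""
    return text.replace("Sarah Johnson", "[PERSON_1]")


async def redact_batch(documents: List[str], max_concurrency: int = 4) -> List[str]:
    """Redact documents concurrently, preserving input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(document: str) -> str:
        async with semaphore:  # cap the number of simultaneous NER calls
            return await asyncio.to_thread(redact_document, document)

    return list(await asyncio.gather(*(worker(d) for d in documents)))


documents = [
    "Sarah Johnson requested the scan.",
    "No names in this one.",
    "Report signed by Sarah Johnson.",
]
redacted_docs = asyncio.run(redact_batch(documents))
```

Threads only help if the underlying NER call releases the interpreter lock or waits on I/O; for purely CPU-bound extraction, a process pool or a separate worker service is likely the better fit.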

Consistency: When processing multiple related documents, maintaining consistent placeholder mapping ensures coherent responses from AI services.
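Consistent mapping can be sketched as a small registry shared across all documents in a batch. The `PlaceholderRegistry` class and its normalization rule (case and whitespace folding) are assumptions made for illustration, not a description of any particular library.

```python
from typing import Dict, Tuple


class PlaceholderRegistry:
    """Hands out stable placeholders shared across related documents."""

    def __init__(self) -> None:
        self._by_entity: Dict[Tuple[str, str], str] = {}  # (tag, normalized) -> placeholder
        self._originals: Dict[str, str] = {}              # placeholder -> first-seen text
        self._counters: Dict[str, int] = {}

    def placeholder_for(self, tag: str, text: str) -> str:
        # Fold case and whitespace so trivial variants share one placeholder.
        key = (tag, " ".join(text.split()).casefold())
        if key not in self._by_entity:
            self._counters[tag] = self._counters.get(tag, 0) + 1
            placeholder = f"[{tag}_{self._counters[tag]}]"
            self._by_entity[key] = placeholder
            self._originals[placeholder] = text
        return self._by_entity[key]

    def original_for(self, placeholder: str) -> str:
        return self._originals[placeholder]


registry = PlaceholderRegistry()
first = registry.placeholder_for("PERSON", "Sarah Johnson")    # seen in document A
second = registry.placeholder_for("PERSON", "sarah  johnson")  # seen in document B
other = registry.placeholder_for("PERSON", "Mark Lee")
```

This only unifies near-identical spellings. Recognizing that "Dr. Johnson" and "Sarah Johnson" are the same person would need an entity-linking step on top.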

Real-World Benefits

In our healthcare projects, this approach has enabled:

Continued compliance with HIPAA and other regulations
Access to full AI capabilities without exposing the underlying sensitive data
Scalable processing of large document volumes
Audit trails showing exactly what data was protected

The defense sector applications show similar benefits, particularly for processing classified documents where even location names or project codes need protection.

Looking Forward

NER-based redaction isn’t perfect; it requires ongoing model maintenance and careful threshold tuning. But it represents the practical middle ground between complete AI avoidance and costly on-premise solutions.

As language models become more sophisticated, so do the privacy protection techniques we need. NER gives us a foundation that can evolve with both the threats and the opportunities ahead.

For organizations handling sensitive data, the question isn’t whether to use AI; it’s how to use it safely. Intelligent redaction through NER provides that path forward.