Introduction
This text is a summary of notes I wrote for myself while working on removing personal information from unstructured text, primarily to document my own adventures. As such, they're disorganized and intentionally skip over details. For all intents and purposes, they belong to the pre-LLM era of this field, an era that is now changing radically as it becomes feasible to run smaller models on managed GPUs.
Tools
Microsoft Presidio is an open-source SDK that detects and anonymizes PII in text (and images). Under the hood, it combines two kinds of recognizers. Pattern-based recognizers handle the parts that are usually trivial:
- Emails
- Phone numbers
- Credit cards
NLP-based recognizers, backed by spaCy's NER models, handle entities like names and locations. It is designed to be customizable and extensible.
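For reference, a minimal sketch of the basic flow (the sample text is made up, and which entities get detected depends on the recognizers and language models you have configured):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Victoria, my email is victoria@example.com."

# Pattern-based and spaCy-backed recognizers both run behind this single call.
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

# By default, detected spans are replaced with placeholders like <PERSON>.
anonymizer = AnonymizerEngine()
print(anonymizer.anonymize(text=text, analyzer_results=results).text)
```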
Expectations
This part might be a bit jarring. Given how mature many of today's tech solutions have become, it can be hard to internalize, and even harder to communicate, their limitations.
There is no guarantee of finding all personal information in an unstructured text. There's an easy part: entities that follow fixed patterns, like email addresses. And there's a part that remains fundamentally unsolved, and theoretically unsolvable, such as classifying names and addresses, due to the inherent ambiguities of natural language.
We can throw different approaches at the problem, but this needs to be crystal clear: there is no perfect solution. It becomes a risk management exercise, and a human reviewer is essential if the task is sensitive. On the flip side, with enough effort, you can get pretty good at this.
The reason this is theoretically unsolvable comes down to ambiguity: "Victoria" can be a first name, a city, or a queen; "Jordan" can be a person, a country, or a brand, and no amount of clever engineering can resolve every case without context that often simply isn't there. And that's before we even consider adversarial cases, users who deliberately obfuscate their PII to slip past detection, writing "my name at gmail dot com" or splitting a phone number across sentences. If someone actively tries to evade your filters, they probably will manage.
Machine learning models
Hugging Face is a model hosting provider that also maintains a Python library for downloading and using models with just a few lines of code. It has massively democratized access to ML models. spaCy provides (at least for the languages I was interested in) classical models, whereas on Hugging Face I found transformer-based models backed by academic institutions.

Transformer models are trained in two steps. First, they're pre-trained on a massive amount of data to learn patterns, context, and relationships. Then they're fine-tuned on annotated data in a second step. When looking for NER models, choose ones that say "finetuned" or have "NER" in the name, or check their model card to confirm they do token classification. You'll also need to look at the label types they return and map them to Presidio's expected format.
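A minimal sketch of what that wiring can look like (the model name is just one example of an English NER model on Hugging Face, and the label mapping depends entirely on what the model card says it emits):

```python
from transformers import pipeline

# Example model; pick one for your language and check its model card.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

# Map the model's label scheme to the entity names Presidio expects.
LABEL_TO_PRESIDIO = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}

for entity in ner("Victoria moved from Jordan to London last year."):
    mapped = LABEL_TO_PRESIDIO.get(entity["entity_group"])
    print(entity["word"], entity["start"], entity["end"], mapped)
```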
Why casing matters for NER
Case information is one of the strongest signals NER models rely on to identify entities. The word "pond" is almost always lowercase, but when it's "Pond" after "James," it's a person's name. Without proper casing, NER systems lose this critical feature and their performance tanks.
Simple heuristics go a long way
The IBM paper "tRuEcasIng" proposes a sophisticated approach using trigram language models and Viterbi decoding at the sentence level. But here's the thing: even their trivial unigram baseline (just pick the most frequent casing for each word) gets you surprisingly far. And before you even touch statistical models, dead-simple heuristics can recover a lot of signal.
Say you know you are working with a dataset that contains letters. The recipient block at the top usually follows this format:
MR. NAME SURNAME
ALMOND STREET 123
Other text, With different structure
Converting it to:
Mr. Name Surname
Almond Street 123
Other text, With different structure
This is a trivial string manipulation but hugely beneficial for downstream NER.
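As a sketch, one such heuristic (the all-caps test and the use of str.title() are illustrative; real letters will need exceptions for things like acronyms or ordinals):

```python
def recase_line(line: str) -> str:
    # If a line is written entirely in upper case, assume it's part of a
    # recipient block and title-case it so downstream NER sees normal casing.
    letters = [c for c in line if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return line.title()
    return line

letter = "MR. NAME SURNAME\nALMOND STREET 123\nOther text, With different structure"
print("\n".join(recase_line(line) for line in letter.splitlines()))
# Mr. Name Surname
# Almond Street 123
# Other text, With different structure
```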
Subword tokenization alignment
Transformer models don't tokenize by whole words, but by subwords. This solves several problems:
- Vocabulary explosion: A word-level tokenizer would need millions of entries to cover English reasonably well, and the embedding matrix would become computationally prohibitive.
- Novel words: The model can reason about words it has never seen before.
- Morphological structure: Words like "uncomfortable" can be split into "un" (negation), "comfort", and "able" (capable of doing something).
BERT uses WordPiece; others use BPE (Byte Pair Encoding). In WordPiece, the ## prefix indicates that the token is a continuation of the previous one, not a new word.
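To see this in action (bert-base-cased is used here purely as a familiar WordPiece tokenizer; the exact split depends on its vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("James Pond felt uncomfortable"))
# Continuation pieces carry the "##" prefix, e.g. something like
# ['James', 'Pond', 'felt', 'un', '##com', '##fort', '##able']
```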
The model returns a separate prediction for each subtoken, and sometimes these are inconsistent with each other. To reconstruct a single label per word, you need an aggregation strategy:
- First token (most common: use the first subtoken's label)
- Max (use the label with the highest score across subtokens)
- Average (average the probabilities across subtokens, then pick the best label)
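Hugging Face pipelines implement these via the aggregation_strategy argument ("first", "max", "average"), but the idea fits in a few lines. A sketch of the first-token strategy, assuming WordPiece-style "##" continuations:

```python
def aggregate_first(tokens: list[str], labels: list[str]) -> list[tuple[str, str]]:
    """Merge subtokens back into words, keeping the first subtoken's label."""
    words: list[tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)  # keep the first label
        else:
            words.append((token, label))
    return words

print(aggregate_first(["Vic", "##toria"], ["B-PER", "I-LOC"]))
# [('Victoria', 'B-PER')]: the inconsistent second label is simply ignored
```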
Partitioning texts for our RAG
We partition the texts we index in our RAG primarily to achieve these goals:
- Embedding quality: A single vector trying to represent a very long text becomes a semantically diluted "average." It loses precision. A smaller chunk captures a concrete idea much better.
- Retrieval granularity: If you embed entire documents, when someone asks something specific you'll retrieve a 10-page document where the answer is buried somewhere. With smaller chunks, you retrieve the relevant fragment directly.
Since we're anonymizing this information, it is absolutely vital not to cut in the wrong place. We must not split proper names, because that would make it impossible for the model to recognize them later.
The algorithm I wrote partitions anonymized text while treating anonymization tags (like [[[PERSON]]]) as atomic units that must never be split across chunks. It goes as follows:
- First, it splits the input into segments, either regular text or tags.
- Then, it tokenizes each segment.
- Finally, it greedily packs segments into partitions up to the token limit.
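A minimal sketch of that idea (the tag pattern matches the [[[PERSON]]]-style markers above; whitespace counting stands in for the real subword tokenizer, and oversized plain-text segments would still need their own splitting):

```python
import re

TAG_PATTERN = re.compile(r"(\[\[\[[A-Z_]+\]\]\])")  # capturing group keeps the tags

def count_tokens(segment: str) -> int:
    # Illustrative stand-in for measuring length with the model's tokenizer.
    return len(segment.split())

def partition(text: str, max_tokens: int) -> list[str]:
    # 1) Split into alternating plain-text and tag segments; tags stay whole.
    segments = [s for s in TAG_PATTERN.split(text) if s]
    # 2) Greedily pack segments into chunks without exceeding max_tokens.
    chunks, current, current_len = [], [], 0
    for segment in segments:
        length = count_tokens(segment)
        if current and current_len + length > max_tokens:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(segment)
        current_len += length
    if current:
        chunks.append("".join(current))
    return chunks
```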
Testing pipeline
Ultimately, one of the things I found to be fundamental in this process is maintaining a test suite based on real cases to verify effectiveness as the work progresses. In the end, it's the only way to know if you're actually making progress. Maintaining this test suite is a problem in itself: it's hard to scale, it requires human intervention, and if it contains private data, you're suddenly facing liability issues around who can run it and who is responsible for its custody (even if the data is yours).
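A sketch of what such a suite can look like with pytest against the Presidio analyzer (the cases here are made up; in practice they come from curated, access-controlled real examples):

```python
import pytest
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

# (text, entity types we expect to be detected): illustrative cases only.
CASES = [
    ("Contact me at jane.doe@example.com", {"EMAIL_ADDRESS"}),
    ("My name is Victoria and I live in London", {"PERSON", "LOCATION"}),
]

@pytest.mark.parametrize("text, expected_types", CASES)
def test_expected_entities_are_detected(text, expected_types):
    found = {r.entity_type for r in analyzer.analyze(text=text, language="en")}
    assert expected_types <= found  # every expected type appears at least once
```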