
Nov 14, 2025
Redaction for Real Documents
The PII Problem
One exposed phone number can cost you 20 million euros or 4% of revenue. GDPR gives you 72 hours to report a breach. Organizations scrub everything before sharing. Personal data is broader than you think: claim identifiers, license plates, handwritten addresses, signatures, the story someone wrote about their accident. All of it is Personally Identifiable Information (PII).
Current Tools Break
OCR, regex, named-entity recognition, LLMs, specialized detectors: they all stumble on real documents. Handwriting. Bad scans. Multilingual content. The messy stuff that actually matters.
Pleias Cypher Is a One-Pass Solution
Our system processes each page once. It finds every sensitive region, labels it, and masks it according to your rules. Clean documents. Consistent redaction. Fraud teams can still work. Finance can still audit. And you have proof of every decision you made.
We built Pleias Stratum as a universal layer for any document-automation workflow: it speeds your time to market, takes data out of the silos where it is stuck, and makes it ready for whatever automation you plan to apply. Naturally, automated PII redaction is part of Stratum.
PII in Real Workflows is Broader Than “Name + Phone Number”
In regulated environments such as insurance, automotive, healthcare, and the public sector, personal data appears in many forms. It is rarely limited to clearly labeled fields. Some representative examples follow.

The typical PII surface includes:
Structured Identifiers
Policy Numbers
Internal routing numbers
Claim IDs
These values are often unique to an individual incident or policyholder and are therefore considered identifying.
Contact Details
Phone numbers
Personal email addresses
Home addresses in invoice footers
Narrative Free Text
“Passenger: Anna Dubois, DOB 02.11.1979, notified at 22:15”
This is often the most sensitive content in the file
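The structured categories above lend themselves to pattern matching, but the narrative example does not. A minimal sketch in Python makes the gap visible (the patterns are illustrative; real claim-number and phone formats vary by insurer and country):

```python
import re

# Illustrative patterns for structured identifiers (hypothetical formats,
# not any specific insurer's schema).
PATTERNS = {
    "CLAIM_ID": re.compile(r"\bCLM-\d{8}\b"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s\-]{7,14}\d"),
    "DOB": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),
}

def find_structured_pii(text: str) -> list[tuple[str, str]]:
    """Return (category, match) pairs for pattern-matchable PII."""
    hits = []
    for category, pattern in PATTERNS.items():
        hits += [(category, m.group()) for m in pattern.finditer(text)]
    return hits

narrative = "Passenger: Anna Dubois, DOB 02.11.1979, notified at 22:15"
print(find_structured_pii(narrative))
# → [('DOB', '02.11.1979')]
# The date of birth is caught; the name is not. Free-text PII
# needs more than regular expressions.
```

The structured values fall out of a few lines of regex; the name in the narrative sentence never will, which is why narrative free text is often the most dangerous part of the file.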
From a regulatory standpoint, any one of these items can create exposure if it leaves the organization unredacted—including low-resolution scans and handwritten notes. The standard is outcome-based: if the content can identify a person, it must not leak.
The Regulatory Reality
Since May 2018, GDPR fines have reached 5.88 billion euros. Enforcement spans all sectors, not just big tech. Organizations sharing documents externally must secure personal data or face substantial penalties.
The Core Requirements
Article 32 mandates ‘appropriate technical and organisational measures’ to protect data, and redaction qualifies as such a measure. The UK ICO (https://ico.org.uk/media2/migrated/2617736/cyber-security-trends-q4201920-csv.csv) is clear: inadequate redaction is a data breach.
In Q4 2019 alone, the ICO recorded 97 cases of failure to redact. Italy’s authorities have fined organizations specifically for manual redaction that left content recoverable by basic technical means.
The maximum penalty: 20 million euros or 4% of global annual turnover, whichever is higher. This applies when organizations fail to properly anonymise or redact personal data, whether responding to access requests or sharing documents externally.
Real-world consequences include:
Danish taxi company Taxa 4x35 was fined for retaining customer phone numbers longer than necessary and for failed anonymisation attempts—the authority found it was still possible to connect individuals with their personal data despite anonymisation efforts. (https://www.edpb.europa.eu/news/national-news/2019/danish-data-protection-agency-proposes-dkk-12-million-fine-danish-taxi_en)
Avast (cybersecurity firm) was fined for claiming full anonymisation of 100 million users' browsing data when re-identification remained technically possible through combined datasets. (https://www.edpb.europa.eu/news/news/2024/czech-sa-imposed-fine-139-million-eur-infringement-art-6-and-art-13-gdpr_en)
The Standard Approach, and Why It Fails at Scale
Most in-house and off-the-shelf solutions follow a version of this pipeline:
Run OCR on the document to extract text
Apply a named-entity recognizer (NER) or an LLM to identify entities such as PERSON, PHONE_NUMBER, and ADDRESS
Apply regular expressions and rule-based logic to capture structured values such as policy numbers, VINs, and IBANs
Run a separate vision model to detect faces, signatures, and license plates in photos
Merge all these detections and draw black boxes on top of the original PDF or image.
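The five steps above can be sketched as follows. Every stage function here is a toy stand-in for a real component (an OCR engine, an NER model, a regex bank, a face detector); the point is the shape of the pipeline and its silent-failure mode, not the internals:

```python
# Toy stand-ins for the real pipeline components.
def run_ocr(page):
    # Real OCR silently drops what it cannot read (handwriting, faint print).
    return [(t, box) for t, box, readable in page["text_regions"] if readable]

def run_ner(text):
    return ["PERSON"] if "Anna" in text else []             # toy NER stage

def run_rules(text):
    return ["CLAIM_ID"] if text.startswith("CLM-") else []  # toy regex bank

def run_face_detector(page):
    return [("FACE", box) for box in page["photo_faces"]]   # toy vision model

def conventional_redaction(page):
    detections = []
    text_boxes = run_ocr(page)                         # step 1: OCR
    for text, box in text_boxes:
        detections += [(l, box) for l in run_ner(text)]    # step 2: NER/LLM
        detections += [(l, box) for l in run_rules(text)]  # step 3: regex/rules
    detections += run_face_detector(page)              # step 4: vision model
    return detections                                  # step 5 would draw boxes

page = {
    "text_regions": [
        ("Anna Dubois", (10, 10), True),
        ("CLM-00482913", (10, 40), True),
        ("+33 6 12 34 56 78 (handwritten)", (10, 70), False),  # OCR fails here
    ],
    "photo_faces": [(120, 60)],
}
print(conventional_redaction(page))
# → [('PERSON', (10, 10)), ('CLAIM_ID', (10, 40)), ('FACE', (120, 60))]
# The handwritten phone number never reaches steps 2-3: a silent miss.
```

Note that the pipeline reports three successful detections and nothing looks wrong; the miss leaves no trace anywhere in the output.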
This architecture appears reasonable on paper, but it tends to break under real operating conditions for reasons that compound over time.
The fundamental issue with OCR-dependent pipelines is that failures are silent. A redaction report may show 47 phone numbers masked, 12 claim IDs protected, and 8 faces obscured, but it cannot account for content that OCR never extracted: the handwritten mobile number in the margin, the barely visible contact information. Downstream models cannot redact what they were never given to evaluate. The document appears processed, but unredacted personal data remains.
Template variations compound this problem. Systems tuned for standard forms collapse when the format changes. Smartphone photos instead of scans. Non-standard layouts. New jurisdictions. Positional heuristics fail immediately. “Name appears in the upper-right table” holds only when that table exists. Each variation requires rule updates. Each update risks breaking something else. IT teams spend recurring hours on adjustments without ever gaining confidence that coverage is complete.
Multilingual documents amplify all of these failure modes. A German NER model misses the Polish name “Wojciech Kowalski”. A single European claims file contains German, English, French, and Polish across different pages. Each language needs a separate OCR configuration. Each NER model performs differently. Detection rates drop. Under-supported languages? Forget them. The system wasn’t built for this complexity.
When a complaint arrives alleging improper disclosure, a PDF with black boxes proves masking happened. It doesn’t prove which policies triggered each decision, what was removed, or whether low-confidence detections were escalated or ignored. You’re defending against accusations with no evidence of your own.
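What would count as evidence is a structured record per redaction decision. A hypothetical shape (every field name here is illustrative, not an actual compliance schema):

```python
import json

# A hypothetical per-decision audit record: enough to answer "which policy
# fired, what was removed, and what happened to low-confidence detections".
decision = {
    "document_id": "claims/2025/file-0042.pdf",
    "page": 3,
    "category": "PHONE_NUMBER",
    "bbox": [412, 880, 590, 910],
    "policy_rule": "mask_full/contact_details",
    "action": "masked",
    "confidence": 0.91,
    "escalated": False,
}
print(json.dumps(decision, indent=2))
```

A log of such records, one per detection, is what turns “we drew black boxes” into a defensible account of the redaction run.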
Performance Reality on Operational Documents
To understand the challenge this task presents, we tested state-of-the-art open-weight language models using zero-shot prompting on real-world, visually rich documents from our internal dataset.
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Qwen3-A3B-30B | 0.75 | 0.80 | 0.76 |
| Qwen3-4B | 0.65 | 0.73 | 0.65 |
| Llama-3.3-70B | 0.60 | 0.69 | 0.60 |
| Gemma-3-27B | 0.60 | 0.74 | 0.63 |
Even the strongest performer, Qwen3-A3B-30B, achieves only a 0.76 F1 score on our test set of invoices, forms, and corporate emails, and this aggregate figure masks substantial variation at the entity level: performance ranges from strong (names, phone numbers) to weak (locations, dates on certain document formats). The results reflect the inherent difficulty of the task, and the variance reflects the fundamental challenge: production environments are complex. Your invoices can have inconsistent layouts. Your multilingual forms can shift languages mid-document. Your degraded scans can carry critical information faintly or in the margins.
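For reference, the precision, recall, and F1 figures above are standard entity-level metrics, computed over (category, span) pairs; a minimal sketch:

```python
def prf1(truth: set, predicted: set) -> tuple[float, float, float]:
    """Entity-level precision, recall, and F1 over (category, span) pairs."""
    tp = len(truth & predicted)                      # exact matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 4 true entities, 3 predictions, 2 correct.
truth = {("PERSON", "Anna Dubois"), ("DOB", "02.11.1979"),
         ("PHONE_NUMBER", "+33 6 12 34 56 78"), ("ADDRESS", "12 Rue Verte")}
pred = {("PERSON", "Anna Dubois"), ("DOB", "02.11.1979"),
        ("PERSON", "Rue Verte")}                     # one spurious, two missed
p, r, f1 = prf1(truth, pred)
print(round(p, 2), round(r, 2), round(f1, 2))
# → 0.67 0.5 0.57
```

Because both misses (hurting recall) and spurious hits (hurting precision) are common on messy documents, F1 drops quickly even when most entities are found.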
Our Redaction Process
We approached the problem with three requirements in mind: full coverage across modalities, controllable masking policy and traceability.
Unified Detection of Sensitive Regions
For every page, whether it is a PDF page, a scan, or a smartphone photo, our system identifies all regions that may contain personally identifying information. This includes:
Contact details (names, addresses, phone numbers, emails)
Policy/claim identifiers (claim ID, policy number)
Vehicle identifiers (VINs, license plates)
Quasi-identifiers
Faces
Signatures
Each detected region is returned with its location on the page (a bounding box) and its category (for example: PHONE_NUMBER, CLAIM_ID, FACE).
Our approach uses a custom vision-language model trained specifically for PII detection across document types. The model jointly reasons over the visual and textual information in a document, analyzing the entire page as a unified input in a single pass. This means that handwritten phone numbers, printed claim IDs, faces in photographs, license plates, text layout, and embedded annotations are all detected through the same underlying process.
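The per-page output described above can be represented as a flat list of labeled regions; one plausible shape (the field names are illustrative):

```python
from dataclasses import dataclass

# Hypothetical shape of one detected region: a bounding box plus a category,
# regardless of whether the content was printed text, handwriting, or a face.
@dataclass(frozen=True)
class Detection:
    category: str                      # e.g. "PHONE_NUMBER", "CLAIM_ID", "FACE"
    bbox: tuple[int, int, int, int]    # x0, y0, x1, y1 in page pixels
    confidence: float                  # detector score in [0, 1]

page_detections = [
    Detection("PHONE_NUMBER", (412, 880, 590, 910), 0.93),
    Detection("FACE", (1020, 140, 1180, 320), 0.99),
]
```

Because every modality lands in the same structure, the downstream masking step never needs to know whether a region came from text analysis or visual detection.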
Policy-Based Masking
After detection, masking is driven by policy, not by ad hoc scripts.
Different categories of PII are subject to different treatment rules. For example:
Faces are fully masked
Phone numbers, personal emails, personal addresses are fully masked
VINs or claim IDs may be partially masked (e.g., retaining the last 4 characters) if downstream fraud or audit teams still require linkage.
This is important in practice: redaction must satisfy compliance and legal requirements, but leaving documents completely blacked out makes them unusable for internal teams. Policy-driven masking reconciles both obligations and provides two benefits:
Compliance teams gain evidence of what was removed and why, rather than relying on visual inspection after the fact.
Operational teams gain a structured way to review false negatives or edge cases and improve policies over time.
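Rules of this kind can be expressed as a small declarative policy; a sketch in Python (the category names and the `keep_last` convention are illustrative, and for visual categories such as faces the same rule would drive pixel masking rather than character masking):

```python
# Illustrative masking policy: category -> rule. "keep_last" preserves a
# suffix for linkage (e.g. fraud teams matching on the last 4 of a claim ID).
POLICY = {
    "FACE":         {"action": "mask_full"},
    "PHONE_NUMBER": {"action": "mask_full"},
    "EMAIL":        {"action": "mask_full"},
    "ADDRESS":      {"action": "mask_full"},
    "CLAIM_ID":     {"action": "mask_partial", "keep_last": 4},
    "VIN":          {"action": "mask_partial", "keep_last": 4},
}

def apply_policy(category: str, value: str) -> str:
    """Mask a text value according to its category's rule."""
    rule = POLICY.get(category, {"action": "mask_full"})  # default: mask fully
    if rule["action"] == "mask_partial":
        keep = rule["keep_last"]
        return "\u2588" * (len(value) - keep) + value[-keep:]
    return "\u2588" * len(value)

print(apply_policy("CLAIM_ID", "CLM-00482913"))   # ends in 2913
print(apply_policy("PHONE_NUMBER", "+33 6 12 34 56 78"))
```

Because the policy is data rather than code, compliance can review and version it, and changing a rule (say, masking VINs fully in one jurisdiction) never requires touching the detection logic.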
Conclusion
Redacting PII in operational documents is not simply a matter of hiding a few names. For many organizations, identifying information can appear as a VIN in a scanned form, a phone number written in the margin of a workshop invoice, a signature on top of an address block, or a license plate in an accident photo.
Traditional pipelines (OCR, regex, NER, a separate face detector, and a merge script) tend to miss handwritten or low-quality content, struggle with multilingual inputs, and produce limited auditability. They also force teams to choose between “compliant” and “usable”.
Our process is designed to address those gaps directly.
A single pass over each page identifies all potentially sensitive regions, including both text and visual identifiers.
Masking behaviour is governed by explicit policy per PII type, so compliance and business needs can be balanced.
CONTACT US