Privacy filter: OpenAI's open-source PII scrubber
OpenAI's Apache-licensed PII scrubber and a uv/Python script to run it.
Yesterday OpenAI released Privacy Filter under Apache 2.0 on Hugging Face and GitHub (announcement). It detects and masks eight categories of PII in text: names, addresses, emails, phone numbers, URLs, dates, account numbers, and secrets like API keys. 1.5B total parameters, 50M active (mixture-of-experts), 128k context, 96% F1 on PII-Masking-300k.
You can run it pretty easily with uv, as described in a previous post.
Here’s a python script that runs using uv with dependencies declared inline (also copied below).
First run pulls ~2.8 GB to your HF cache. After that it’s pretty fast.
Use cases: pre-scrubbing free-text fields, clinical notes, or support transcripts before they hit a frontier model provider. Caveat: this isn’t a standalone anonymization guarantee. OpenAI says so plainly in the model card. Missed identifiers and over-redaction both happen.
Here’s an example:
./privacy-filter.py “My name is Stephen Turner. Social is 123-45-6789. You can reach me at 434-555-1234 or notmyrealemail@example.com.”
Output:
My name is [PRIVATE_PERSON]. Social is [ACCOUNT_NUMBER]. You can reach me at [PRIVATE_PHONE] or [PRIVATE_EMAIL].
You might try to reach for Ollama or llama.cpp. Don’t. Privacy Filter is not a generative model. It’s a bidirectional token classifier with a Viterbi decoder on top, built on a gpt-oss-style backbone with the LM head swapped for a BIOES span classifier over 33 labels. You feed it text, it returns labeled spans. No chat, no completions. In other words, Ollama and llama.cpp can’t run it. No GGUF exists because GGUF is for causal generative models. You need the transformers library.
Here’s the privacy-filter.py script used above:
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "transformers>=4.50",
# "torch",
# ]
# ///
"""
Run openai/privacy-filter over text passed as the first argument.
Usage: ./privacy_filter.py "My name is Stephen and my phone is 555-1234"
"""
import sys
from transformers import pipeline
if len(sys.argv) < 2:
sys.exit("usage: privacy_filter.py TEXT")
text = sys.argv[1]
classifier = pipeline(
task="token-classification",
model="openai/privacy-filter",
aggregation_strategy="simple",
)
spans = classifier(text)
print(f"Input: {text}\n")
if not spans:
print("No PII detected.")
sys.exit(0)
print(f"Detected {len(spans)} span(s):")
for s in spans:
span_text = text[s["start"]:s["end"]]
print(f" [{s['entity_group']}] {span_text!r} (score={s['score']:.3f})")
redacted = text
for s in sorted(spans, key=lambda x: x["start"], reverse=True):
redacted = redacted[:s["start"]] + f"[{s['entity_group'].upper()}]" + redacted[s["end"]:]
print(f"\nRedacted: {redacted}")
