Bottom line first — here’s what the OpenAI Privacy Filter (OPF) actually costs per call at each trust boundary in our agent, on a laptop CPU:
| Case | Tokens | Mean (ms) | p95 (ms) |
|---|---|---|---|
| tool_arg.clean.readfile | 14 | 98 | 107 |
| tool_arg.webfetch.url_with_token | 25 | 281 | 440 |
| tool_arg.bash.curl_with_bearer | 29 | 294 | 385 |
| tool_arg.write_file.env_snippet | 35 | 402 | 692 |
| host_bcast.batch.mixed_update | 83 | 577 | 679 |
| tool_res.read.env_file | 120 | 1108 | 1353 |
| tool_res.bash.shell_log | 155 | 1573 | 1837 |
| model_req.system_prompt | 601 | 3493 | 3809 |
| tool_res.http.api_json_response | 781 | 8110 | 9160 |
| tool_res.read.code_with_hardcoded_key | 838 | 6217 | 7010 |
| tool_res.clean.typescript_file | 1627 | 10395 | 10767 |
| tool_res.read.large_file | 2696 | 15564 | 17165 |
| model_req.message_history_8k | 3847 | 25377 | 27075 |
Small stuff is 100-400 ms. 10 KB of tool output is 15 seconds. An 8 KB message history is 25 seconds. Running OPF on every agent turn, on CPU, is not happening. Running it at a few specific small boundaries is absolutely fine.
The rest of this post is how I got to that table: what I was trying to do, how I measured, what torch.compile changed, and what an autoresearch loop found when I pointed it at OPF’s runtime knobs.
I’ve been looking at adding OPF to our agent stack, at every place data crosses a trust boundary involving the model: tool arguments going out, tool results coming back, the outbound payload to Anthropic/OpenAI, and the batched updates we stream to the renderer.
The design doc writes itself. The real question was what this actually costs per turn.
So I pulled OPF down, ran it locally (it’s a Python package, weights cached to ~/.opf/privacy_filter on first call, no network after that), wrote a small benchmark harness with realistic fixtures for each boundary, and measured. Then I flipped on torch.compile and measured again. Then I pointed an autoresearch loop at it and let it tune knobs for a while.
What OPF is, quickly
A bidirectional token classifier, ~1.5B params, MoE with ~50M active. Eight labels — names, emails, phones, addresses, dates, URLs, account numbers, secrets. One forward pass over the whole input, then a Viterbi decode over BIOES tags. Not autoregressive, so cost is linear in input tokens. Runs on CPU or CUDA. I ran it on Apple Silicon via Metal (MPS) once I got past the basic CPU numbers.
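To make the decode step concrete, here is a minimal sketch of constrained Viterbi over BIOES tags for a single entity type. The tag set, the transition rules, and the per-token score format are illustrative assumptions, not OPF's actual implementation (which handles eight labels and has its own scoring).

```python
import math

TAGS = ["O", "B", "I", "E", "S"]

# Legal BIOES transitions for one entity type (illustrative assumption).
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
START = {"O", "B", "S"}  # a sequence cannot open in the middle of a span
END = {"O", "E", "S"}    # and it cannot close in the middle of one


def viterbi_bioes(scores):
    """scores: one dict per token mapping tag -> log-score.

    Returns the highest-scoring tag path that respects ALLOWED, START and END.
    """
    if not scores:
        return []
    best = [{t: (scores[0][t] if t in START else -math.inf) for t in TAGS}]
    back = []
    for tok in scores[1:]:
        row, ptr = {}, {}
        for t in TAGS:
            # Pick the best predecessor that is allowed to transition into t.
            prev = max((p for p in TAGS if t in ALLOWED[p]), key=lambda p: best[-1][p])
            row[t] = best[-1][prev] + tok[t]
            ptr[t] = prev
        best.append(row)
        back.append(ptr)
    last = max((t for t in TAGS if t in END), key=lambda t: best[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

The point of the constraint is visible on a two-token input where per-token argmax would pick B then O, which is not a legal sequence. Viterbi trades a slightly lower score on the second token for a well-formed B, E span.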
A few links worth having open:
- openai/privacy-filter — the repo
- README — labels, policy knobs, operating points, and the caveats OpenAI would like you to respect
- Weights on HuggingFace (~3 GB, auto-downloaded to ~/.opf/privacy_filter on first call)
The “linear in input tokens” part is worth holding onto. For most LLM work you think of latency as proportional to the output. Here it’s proportional to whatever you hand it, and the input can easily be 10x larger than the output you’d normally generate.
What I measured
Fixtures grouped by the boundary they simulate:
- TOOL_ARG: short model→tool argument strings (curl with bearer, webfetch URL with a token in the query string, write_file env snippet, plus a clean control).
- TOOL_RES: tool output coming back into the model’s context (.env file, ~4 KB of TS with a hardcoded key, shell log, JSON customer list with PII, ~10 KB mixed file, clean control).
- MODEL_REQ: outbound payload to the model provider (system prompt, ~8 KB prior-turn history).
- HOST_BCAST: one batched update to the renderer.
All PII and credentials in the fixtures are synthetic (obviously).
Per fixture I recorded cold ms, p50/p95/p99, tokens/sec, and what OPF flagged vs. what I expected it to flag. That last one is a sanity check, not a quality eval — real quality work belongs on a labeled set, not on fixtures I made up.
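A measurement loop of this shape can be sketched in a few lines. This is not the actual bench.py: `filter_fn` is a stand-in for whatever callable wraps the filter, `n_tokens` is supplied by the caller, and the nearest-rank percentile is one reasonable choice among several.

```python
import math
import statistics
import time


def percentile(sorted_values, q):
    """Nearest-rank percentile on an already sorted list; q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def bench(filter_fn, text, n_tokens, iterations=5):
    """Time filter_fn(text): one cold call, then `iterations` warm calls."""
    t0 = time.perf_counter()
    filter_fn(text)
    cold_ms = (time.perf_counter() - t0) * 1000.0

    warm = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        filter_fn(text)
        warm.append((time.perf_counter() - t0) * 1000.0)
    warm.sort()

    mean_ms = statistics.fmean(warm)
    return {
        "cold_ms": cold_ms,
        "mean_ms": mean_ms,
        "p50_ms": percentile(warm, 50),
        "p95_ms": percentile(warm, 95),
        "p99_ms": percentile(warm, 99),
        "tokens_per_s": n_tokens / (mean_ms / 1000.0) if mean_ms > 0 else float("inf"),
    }
```

With only 5 iterations per case, p95 and p99 both collapse to the slowest sample, so tail figures at this sample size are best read as "worst of five" rather than true percentiles.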
The harness lives in scripts/privacy-filter-bench/. bench.py runs against the Python package directly, compare.py diffs two runs, and the results below came out of comparison.html after aggregating four PyTorch eager runs and one compiled run.
The raw numbers
The table at the top is PyTorch eager, Apple M-series, CPU only, 5 iterations per case, 4 runs averaged. Two things worth reading off it before moving on:
On the small stuff (TOOL_ARG, HOST_BCAST) we’re at roughly 100-600 ms per call. That’s fine for a boundary that’s gating a network call or a disk write anyway — the filter’s latency gets lost in what’s happening on the other side of it.
On bigger inputs it falls apart. ~150 tok/s on CPU means a 4K-token history takes ~25 seconds, and the cost is linear — for each extra 1K tokens you pay another ~7 seconds. You can’t put that in front of every outbound message without the whole thing feeling broken.
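The linear model is simple enough to turn into a budget check. The sketch below assumes the ~150 tok/s figure and ignores fixed per-call overhead, so it is a rough planning tool rather than a prediction. The function names are made up for illustration.

```python
def estimated_seconds(n_tokens, tokens_per_second=150.0):
    """Linear latency estimate; ignores fixed per-call overhead."""
    return n_tokens / tokens_per_second


def max_tokens_for_budget(budget_seconds, tokens_per_second=150.0):
    """Largest input that fits a latency budget under the same linear model."""
    return int(budget_seconds * tokens_per_second)
```

Under these assumptions a 3847-token history comes out at roughly 25.6 seconds, each extra 1000 tokens adds about 6.7 seconds, and a 500 ms budget allows about 75 tokens.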
Does torch.compile help?
Yes on the small stuff, barely on the big stuff. Which, once you stare at it for a minute, is what you’d expect: at small input sizes a lot of the wall clock is Python and dispatch overhead, and a compiled graph wipes a bunch of that out. At several thousand tokens you’re just computing the transformer, and the BLAS kernels under both backends are the same.
Per boundary:
| Boundary | Eager (ms) | Compiled (ms) | Speedup |
|---|---|---|---|
| HOST_BCAST | 577 | 567 | 1.02× |
| MODEL_REQ | 14435 | 12820 | 1.13× |
| TOOL_ARG | 269 | 73 | 3.66× |
| TOOL_RES | 7161 | 4697 | 1.52× |
The biggest win was tool_arg.write_file.env_snippet at 4.98× (402 → 81 ms). The smallest was host_bcast.batch.mixed_update at 1.02× (577 → 567). Overall ~1.34× on mean latency across the 13 cases, a figure dominated by the long inputs, where the gain is smallest.
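The overall figure is easy to misread next to a 3.66× row, so here is one way to reproduce it from the per-boundary means. This assumes the ~1.34× number is the ratio of summed (equivalently, mean) latency across all cases rather than an average of per-case ratios. The case counts per boundary are taken from the table at the top.

```python
# (eager_ms, compiled_ms, number_of_cases) per boundary, from the tables above.
BOUNDARIES = {
    "HOST_BCAST": (577, 567, 1),
    "MODEL_REQ": (14435, 12820, 2),
    "TOOL_ARG": (269, 73, 4),
    "TOOL_RES": (7161, 4697, 6),
}


def boundary_speedup(eager_ms, compiled_ms):
    """Speedup for one boundary: how many times faster compiled is than eager."""
    return eager_ms / compiled_ms


def overall_speedup(boundaries):
    """Ratio of total eager time to total compiled time across all cases."""
    eager_total = sum(e * n for e, _, n in boundaries.values())
    compiled_total = sum(c * n for _, c, n in boundaries.values())
    return eager_total / compiled_total
```

Because the long MODEL_REQ and TOOL_RES cases carry almost all of the total time, their modest speedups set the aggregate, even though the short TOOL_ARG cases improve far more in relative terms.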
One honest caveat: on long inputs Dynamo kept hitting config.recompile_limit=8 because every distinct window length is a new trace, and some late iterations fell back to eager. dynamic=True would probably buy some of that back.
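A different mitigation, which I did not test here, is to bound the number of distinct shapes the compiler ever sees by padding inputs up to a small set of length buckets. The sketch below is plain Python and makes several assumptions: the bucket sizes are arbitrary, `pad_id=0` is a placeholder, and it presumes the model accepts an attention mask so padded positions are ignored.

```python
import bisect

# Eight buckets means at most eight distinct sequence lengths (illustrative sizes).
BUCKETS = [32, 64, 128, 256, 512, 1024, 2048, 4096]


def bucket_length(n_tokens, buckets=BUCKETS):
    """Smallest bucket that holds n_tokens."""
    i = bisect.bisect_left(buckets, n_tokens)
    if i == len(buckets):
        raise ValueError(f"{n_tokens} tokens exceeds the largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Return (padded_ids, attention_mask) with length equal to the chosen bucket."""
    target = bucket_length(len(token_ids), buckets)
    n_pad = target - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask
```

The trade-off is that cost is linear in tokens, so padding a 35-token input to 64 nearly doubles the compute for that call. Whether that beats the recompiles is something to measure, not assume.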
Tuning with autoresearch on pi
torch.compile is the easy lever. After that I wanted to know which of OPF’s runtime knobs actually help on our workload, and specifically on the small TOOL_ARG fixtures — that’s the boundary I’d ship first, so that’s the one worth tuning.
I wired the bench into pi with the autoresearch plugin. The loop is simple:
- ./autoresearch.sh opf-small runs the small suite and emits METRIC privacy_filter_opf_tool_arg_p50_ms=… lines.
- The plugin proposes a candidate change (env var, flag, runtime setting), applies it, re-runs, compares against the kept baseline.
- Keep if the metric improves, auto-revert otherwise. Everything is logged to JSONL.
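The keep-or-revert logic in those steps can be sketched as a greedy loop. This is not the plugin's code: `measure` is a hypothetical callable that runs the bench for a config and returns the metric (lower is better), and the JSONL field names are invented for illustration.

```python
import json


def autoresearch_loop(measure, candidates, log_path, baseline_config=None):
    """Greedy keep-or-revert search over candidate config changes.

    measure(config) -> float, lower is better.
    candidates: iterable of dicts, each a change to overlay on the kept config.
    """
    kept = dict(baseline_config or {})
    best = measure(kept)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps({"run": 0, "change": None, "metric": best, "kept": True}) + "\n")
        for run, change in enumerate(candidates, start=1):
            trial = {**kept, **change}  # apply the candidate on top of what was kept
            metric = measure(trial)
            keep = metric < best
            if keep:
                kept, best = trial, metric  # new baseline
            # When not kept, the trial config is simply dropped, which is the revert.
            log.write(json.dumps({"run": run, "change": change, "metric": metric, "kept": keep}) + "\n")
    return kept, best
```

A strict "less than" comparison will happily keep changes that are only jitter. A real loop would probably want a minimum improvement threshold or repeated measurements before accepting a candidate.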
33 runs, six kept:
| Run | Change | p50 (ms) | Δ |
|---|---|---|---|
| 6 | Baseline (CPU, Viterbi, typed output) | 190.23 | — |
| 10 | Device = MPS (Apple Metal) | 185.19 | −2.6% |
| 13 | Decode mode = argmax (was Viterbi) | 180.83 | −4.9% |
| 16 | discard_overlapping_predicted_spans | 179.10 | −5.9% |
| 25 | OPF_EXPERTS_PER_TOKEN=3 (top-3 routing, MoE default is top-4) | 137.67 | −27.6% |
| 29 | + OPF_ATTN_LOW_PRECISION=1 | 136.48 | −28.3% |
And a few things I thought would help but didn’t:
- OPF_ALLOW_TF32=1 was slightly worse on small inputs (run 32: 142.6 ms). TF32 helps when you’re compute-bound on matmul, and TOOL_ARG just isn’t that.
- OPF_MOE_FUSED_SWIGLU_W2=0 was noise (runs 28, 31).
- A sweep of context_window_length from 64 to 2048 was all within run-to-run jitter (runs 17, 21-24). Short inputs don’t care.
The big win was dropping MoE top-k from 4 to 3. It’s a classifier, not a generator, so you don’t really need the full expert mixture on every token — top-3 covered basically the same ground, and the recall on the fixtures stayed in the same band. Combined with MPS + argmax decoding + overlap discard + low-precision attention, the TOOL_ARG p50 went from ~190 ms to ~136 ms. Stack torch.compile on top of that and the cost per call starts to genuinely disappear.
I wouldn’t have found the top-k knob on my own. I’d have tried MPS, maybe one more thing, called it a day. Having the loop running in the background (“come back in an hour, see which candidates were kept”) is a genuinely different way to optimize.
Where this leaves me
Running OPF on every agent turn, on a laptop CPU, is not happening. A typical turn (1 MODEL_REQ + ~2 TOOL_ARG + ~2 TOOL_RES) burns ~29 seconds of filter time on eager CPU, ~22 seconds with torch.compile. You can’t put that on a user and call it shipped.
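The per-turn figures follow directly from the per-boundary means. A small worked version, using the numbers from the torch.compile table and the turn mix stated above:

```python
# Mean ms per call by boundary, from the per-boundary table.
EAGER_MS = {"HOST_BCAST": 577, "MODEL_REQ": 14435, "TOOL_ARG": 269, "TOOL_RES": 7161}
COMPILED_MS = {"HOST_BCAST": 567, "MODEL_REQ": 12820, "TOOL_ARG": 73, "TOOL_RES": 4697}

# Calls per boundary in a typical turn, as described in the text.
TYPICAL_TURN = {"MODEL_REQ": 1, "TOOL_ARG": 2, "TOOL_RES": 2}


def turn_cost_seconds(per_call_ms, calls):
    """Total filter time for one turn if every call runs serially."""
    return sum(per_call_ms[boundary] * n for boundary, n in calls.items()) / 1000.0


eager_s = turn_cost_seconds(EAGER_MS, TYPICAL_TURN)
compiled_s = turn_cost_seconds(COMPILED_MS, TYPICAL_TURN)
```

Nearly all of that total is the single MODEL_REQ call plus the two TOOL_RES calls. The two TOOL_ARG calls contribute about half a second eager and well under a fifth of a second compiled.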
But when I look at it by boundary, the story’s different:
1. TOOL_ARG is the one I’m most excited about. Small inputs (14-35 tokens in the fixtures), ~73 ms with torch.compile, ~136 ms without on the tuned MPS config. Catching a bearer token the model hallucinated into a bash or web_fetch arg before the call happens, for a latency that’s lost in the noise of the network round-trip anyway, is a real win.
2. HOST_BCAST is fine too. ~570 ms for display-layer redaction on a batched update is not something a user will notice.
3. TOOL_RES is the one with a sharp edge. Roughly 1-1.5 seconds on a 120-155 token result is tolerable; ~15 seconds on the ~10 KB file (eager CPU) is not. The answer is probably bounding result size before filtering, or running a cheap regex/entropy pre-pass and only invoking OPF on suspicious spans.
4. MODEL_REQ I’m going to treat as GPU-only for now. If you’re doing (1), (2), and (3) right, the history going to the provider is already mostly clean; the outbound sweep is belt-and-suspenders, not the primary line of defence.
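For the regex/entropy pre-pass idea, a minimal sketch might look like the following. The character class, the 20-character minimum, and the 4.0 bits-per-character threshold are all untuned assumptions, and a real pre-pass would need checking against real traffic for both misses and false alarms.

```python
import math
import re
from collections import Counter

# Long unbroken runs of characters that commonly appear in keys and tokens.
CANDIDATE_RE = re.compile(r"[A-Za-z0-9_\-+/=.]{20,}")


def shannon_entropy(s):
    """Shannon entropy of s in bits per character."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def suspicious_spans(text, min_entropy=4.0):
    """Return (start, end) offsets of long, high-entropy runs in text."""
    return [
        m.span()
        for m in CANDIDATE_RE.finditer(text)
        if shannon_entropy(m.group()) >= min_entropy
    ]
```

The idea would be to hand only these spans, plus some surrounding context, to the filter. This only targets secret-like strings: names, addresses, and other PII have ordinary entropy and would pass straight through, so it is a cost reducer for the secrets label and not a replacement for the classifier.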
Next things on the list are batching (the filter is independent across inputs, so a turn with three tool calls and one Read should be one batched call, not four serial ones), and running in shadow mode for a week before turning on redaction — the recall varied from 0.25 to 1.0 across the fixtures, and that’s just a wiring check, not an eval. I’d like real numbers from real traffic before I start dropping spans.
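Shadow mode can be as small as a wrapper that runs detection, records what it would have done, and returns the text untouched. In the sketch below, `detect` is a hypothetical callable returning `(start, end, label)` tuples, and the record fields are invented. It deliberately logs labels and sizes only, never the matched text, so the shadow log does not itself become a store of secrets.

```python
import json
import time


def shadow_check(detect, boundary, text, sink):
    """Run detect(text) without modifying text; write one JSON line of metadata to sink."""
    t0 = time.perf_counter()
    try:
        spans = detect(text)
        error = None
    except Exception as exc:  # shadow mode must never break the real call path
        spans, error = [], repr(exc)
    record = {
        "boundary": boundary,
        "chars": len(text),
        "latency_ms": round((time.perf_counter() - t0) * 1000.0, 2),
        "labels": sorted(label for _, _, label in spans),
        "error": error,
    }
    sink.write(json.dumps(record) + "\n")
    return text  # unchanged: no redaction in shadow mode
```

A week of these records gives per-boundary hit rates and latency on real traffic, which is the data needed before deciding where redaction can be switched on.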
If you’re doing something similar and want to skip a few days of figuring out where the cliff is, the summary is: measure at each boundary separately, the backend matters most where you think it matters least, and let an autoresearch loop try the knobs you wouldn’t have thought to try.