
PrivacyAlign Trains Agent Privacy Alignment From 599 Annotators

TL;DR

  • PrivacyAlign contains 1,350 scenarios, annotated by 599 people, that document contexts in which LLM agents leak private information.
  • Annotation-conditioned reward modeling uses human labels to score agent responses during reinforcement learning.
  • Small open-weight agents trained with this method showed strong gains on PrivacyAlign and existing privacy benchmarks.

When an AI agent passes along information it shouldn't, the failure rarely looks like a breach. It looks like a helpful response. A new paper on arXiv argues that this is precisely why privacy needs to be treated as an alignment problem for agents, not just a policy flag or a downstream filter.

The paper, "PrivacyAlign: Contextual Privacy Alignment for LLM Agents," introduces a dataset of 1,350 scenarios that document cases where current large language models actually leak private information. Building it took 599 unique annotators, who contributed 3,516 annotations, giving the benchmark genuine human grounding in the contextual social norms that determine whether sharing information is appropriate in a given situation.

The technical method pairs that dataset with what the authors call annotation-conditioned reward modeling. Rather than training a reward model on abstract criteria, this approach conditions LLM judges on human annotations during reinforcement learning. The paper reports that small open-weight agents trained this way showed strong gains on both PrivacyAlign and existing privacy benchmarks, which suggests the training signal generalizes beyond the new dataset's specific scenarios.
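To make the idea concrete, here is a minimal Python sketch of what conditioning a judge on human annotations could look like. It is an illustration under stated assumptions, not the paper's implementation: the annotation schema, the prompt format, the 0-to-1 reward scale, and every name below (`Annotation`, `Scenario`, `build_judge_prompt`, `annotation_conditioned_reward`, `toy_judge`) are hypothetical. In practice the judge would be an LLM call; a simple string-matching stand-in is used here so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Annotation:
    """One human judgment about a scenario (hypothetical schema)."""
    item: str                   # the piece of information being judged
    appropriate_to_share: bool  # whether sharing it fits this context


@dataclass
class Scenario:
    context: str
    annotations: List[Annotation]


def build_judge_prompt(scenario: Scenario, response: str) -> str:
    """Put the human annotations into the judge's prompt (assumed format)."""
    lines = [f"Context: {scenario.context}", "Human annotations:"]
    for a in scenario.annotations:
        verdict = "OK to share" if a.appropriate_to_share else "do NOT share"
        lines.append(f"- {a.item}: {verdict}")
    lines.append(f"Agent response: {response}")
    lines.append("Score from 0 (leaks) to 1 (respects the annotations).")
    return "\n".join(lines)


def annotation_conditioned_reward(
    scenario: Scenario, response: str, judge: Callable[[str], float]
) -> float:
    """Ask the judge for a score and clamp it to [0, 1] for use as an RL reward."""
    score = judge(build_judge_prompt(scenario, response))
    return max(0.0, min(1.0, score))


def toy_judge(prompt: str) -> float:
    """Stand-in for an LLM judge: returns 0 if the response repeats any
    item marked 'do NOT share', otherwise 1. Assumes a single-line response."""
    lines = prompt.splitlines()
    prefix = "Agent response: "
    response = next(l for l in lines if l.startswith(prefix))[len(prefix):].lower()
    forbidden = [
        l[2:].rsplit(": ", 1)[0].lower()
        for l in lines
        if l.startswith("- ") and l.endswith(": do NOT share")
    ]
    return 0.0 if any(item in response for item in forbidden) else 1.0


scenario = Scenario(
    context="A colleague asks the scheduling agent why Sam missed the meeting.",
    annotations=[
        Annotation("medical appointment", appropriate_to_share=False),
        Annotation("new meeting time", appropriate_to_share=True),
    ],
)
leaky = "Sam had a medical appointment, so the new meeting time is 3pm."
safe = "Sam was unavailable; the new meeting time is 3pm."

print(annotation_conditioned_reward(scenario, leaky, toy_judge))
print(annotation_conditioned_reward(scenario, safe, toy_judge))
```

The point of the sketch is that the reward depends on what annotators said about this specific context rather than on a fixed rulebook: the same sentence about a medical appointment could score well in a scenario where annotators marked it as appropriate to share.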

The honest caveat is that benchmark gains and reliable real-world behavior are different things. The 1,350 scenarios cover situations the annotators chose to document, and agentic applications encounter edge cases no fixed dataset fully anticipates. The paper does not address the composition of its annotator pool, which matters because privacy norms differ substantially across cultural and demographic contexts.

For teams building privacy-sensitive agentic applications, the practical question is whether annotation-conditioned RL can close the gap between small, locally run models and large hosted APIs without sacrificing the privacy properties that motivated running a smaller model in the first place. The results suggest it moves meaningfully in that direction.