Realistic Prompt Generators Fail to Boost Petri Audit Realism

Trained models generate highly realistic single-turn prompts that are hard to distinguish from WildChat data, outperforming chat model baselines. However, integrating these generators as tools into the Petri auditing agent improves only first-turn prompt realism (23% to 47% win rate), not the realism of full multi-turn audit transcripts. Analysis reveals that audit realism is bottlenecked by higher-level features, such as harmful scenarios and unnatural conversation structure, rather than by individual prompt quality.
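The win rates above come from pairwise comparisons: an LLM judge (Sonnet 4.5 with thinking) is shown an audit transcript and a real WildChat transcript and picks the one it finds more realistic. A minimal sketch of that metric, with the judge call stubbed out (`judge_pick_more_realistic` is a hypothetical stand-in, not the authors' actual judging harness):

```python
import random

def judge_pick_more_realistic(transcript_a: str, transcript_b: str) -> str:
    """Stub standing in for an LLM judge call that returns whichever
    transcript it deems more realistic. A real implementation would
    query a judge model (e.g. Sonnet 4.5 with extended thinking)."""
    return random.choice([transcript_a, transcript_b])

def realism_win_rate(audit_transcripts, wildchat_transcripts,
                     judge=judge_pick_more_realistic) -> float:
    """Fraction of pairs where the judge picks the audit transcript
    over the paired real WildChat transcript.

    Presentation order is randomized per pair to avoid position bias.
    """
    pairs = list(zip(audit_transcripts, wildchat_transcripts))
    wins = 0
    for audit, real in pairs:
        # Randomize which transcript the judge sees first.
        a, b = (audit, real) if random.random() < 0.5 else (real, audit)
        if judge(a, b) == audit:
            wins += 1
    return wins / len(pairs)
```

A win rate near 0.5 would mean the judge cannot tell audits from real transcripts; the 23% and 47% figures above mean audits were still usually identified as less realistic.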

Tools to generate realistic prompts help surprisingly little with Petri audit realism
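The post below describes many-shot prompting base models with real WildChat examples to generate realistic user prompts. A minimal sketch of assembling such a many-shot context, assuming access to (description, prompt) example pairs; the delimiter format and field names here are illustrative assumptions, not the authors' exact template:

```python
def build_many_shot_prompt(examples, target_description: str) -> str:
    """Concatenate real (description, prompt) pairs as in-context
    examples, then leave a trailing stub so a base model completes
    it with one more prompt matching the target description."""
    shots = "\n\n".join(
        f"Description: {desc}\nPrompt: {prompt}" for desc, prompt in examples
    )
    return f"{shots}\n\nDescription: {target_description}\nPrompt:"
```

The returned string would be fed to a base (non-chat) model as a completion prompt, so the generated text continues the pattern of the examples rather than responding as an assistant.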
<h1>TLDR</h1><ul><li>We train and many-shot prompt base models to generate user prompts that are harder to distinguish from deployment (WildChat) prompts. </li><li>Then we give <a href="https://alignment.anthropic.com/2025/petri/">Petri</a>, an automated auditing agent, a tool to use a prompt generator model for sycophancy audits. It doesn’t help with making the full audit transcripts harder to distinguish from deployment transcripts, despite producing more realistic single-turn prompts. </li><li>We find evidence that full audit realism is bottlenecked not by prompt quality, but by higher-level features: the harmful nature of the scenarios and the unnatural structure of the multi-turn conversations.</li></ul><p>💻<a href="https://github.com/ckkissane/realistic-prompt-generators"><u>Code</u></a></p><p>Research done as part of the Anthropic Fellows Program.</p><h1>Introduction</h1><p>Automated auditing agents like <a href="https://alignment.anthropic.com/2025/petri/"><u>Petri</u></a> are increasingly used to evaluate the propensities of LLMs. But one growing challenge is evaluation awareness: frontier models can often detect when they’re in an eval vs. deployment, making audits harder to trust. As models improve at detecting LLM-generated text, they may identify some audit scenarios by noticing that their input prompts are synthetic rather than human-written. This motivates the need for better methods to generate realistic synthetic prompts.</p><p>In this post we:</p><ol><li>Train and many-shot prompt base models to generate realistic user prompts, and show that these produce prompts that are hard to distinguish from real WildChat data, substantially outperforming the simple baseline of asking a chat model (Claude Haiku 4.5).</li><li>Give one of these generators as a tool to a Petri auditor, but fail to improve audit realism.
The tool was not helpful at generating long, realistic sycophancy audits, despite individual prompts being harder to distinguish from deployment.</li><li>Analyze the reasoning of the realism judge to study this discrepancy. We find evidence that while the prompt generator tool does reduce rates of unrealistic user behavior, audit realism is bottlenecked by high-level features: the narrative structure of the audit, and the harmful nature of the scenario itself. </li></ol><p>Overall, we think this method shows signs of life, but in its current form we do not find it worth integrating into automated auditing agents like Petri. Our findings suggest that future work on Petri realism might be best targeted at improving structural and scenario-level aspects of the audits, rather than individual turns. For example: grounding agentic audits with real deployment data (which we explore in forthcoming work), encouraging the auditor to plan more natural multi-turn interactions, or designing new scenarios as seeds.</p><figure class="image"><img src="https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/jdoDvKGLbaaJWnDpE/z1vgwkb5ps9xy42xyji8"><figcaption>Figure 1: Giving the auditor a realistic prompt generator tool improves realism of the first user prompt of an audit (left), but not entire audit transcripts (right). Realism win rate measures how often an LLM judge (Sonnet 4.5 with thinking) selects the audit transcript over a real WildChat transcript as more realistic.</figcaption></figure><h1>Methods</h1><h2>Training realistic user prompt generators</h2><p>We want to produce generator models that take as input a text description of a user prompt, and output a prompt that 1) matches thi

... (truncated)