Agent policy A/B testing

Know which agent policy actually works.

Run a controlled experiment between two prompt versions, models, or routing rules. Measure resolution rate, escalation rate, and override rate — not just output quality.


Experiment · Support agent CoT vs baseline · 847 sessions
condition         resolution   escalation
baseline_policy   61.2%        18.4%
cot_policy        74.8%        11.1%
P(cot_policy better) = 97.3% · lift +13.6pp resolution

The problem

Agents deployed. Improvement unclear.

Most teams ship a new prompt, watch aggregate metrics, and call it a win or a wash. There is no controlled comparison — just noise.

No business-level outcome tracking

Evals check output quality, not whether the agent actually resolved the ticket, closed the deal, or reduced rework.

Prompt changes are uncontrolled

A new system prompt ships and you watch the dashboards, hoping the numbers went up. There is no comparison group.

You don't know which version is running

Multiple engineers editing prompts, no version on the logged outcome, no way to attribute results to a specific policy.

How it works

Three steps. Two API calls.

Step 1

Create the experiment

Define your two policies as conditions, pick the metrics you care about, and set a target observation count. You get back an experiment ID.

curl -X POST https://api.dooperator.ai/api/v1/orgs/{org_id}/experiments \
  -H "Authorization: Bearer sk_live_…" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Support agent — CoT vs baseline",
    "hypothesis": "Chain-of-thought prompt reduces escalation rate",
    "domain": "agentops",
    "design_type": "ab",
    "subject_type": "session",
    "conditions": [
      { "name": "baseline_policy",  "is_control": true },
      { "name": "cot_policy",       "is_control": false }
    ],
    "outcome_metrics": ["resolution_rate", "escalation_rate", "override_rate"],
    "target_observations": 200
  }'

Step 2

Log each outcome

After each agent run, POST the condition it used and the observed outcomes. Binary metrics (0/1), rates, latency, cost — all supported.

# After each agent run, log the outcome via API key
curl -X POST https://api.dooperator.ai/api/v1/orgs/{org_id}/experiments/{exp_id}/ingest \
  -H "X-Api-Key: dop_live_…" \
  -H "Content-Type: application/json" \
  -d '{
    "condition_followed": "cot_policy",
    "observations": {
      "resolution_rate": 1,
      "escalation_rate": 0,
      "override_rate": 0
    },
    "notes": "ticket_id:t-4821"
  }'
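The same ingest call could be wrapped in a small helper that your agent code calls at the end of each run. The sketch below is an illustration using only the Python standard library. The endpoint path, header name, and payload fields are copied from the curl example above. The function names, the `notes` default, and the choice to return the HTTP status are assumptions, since the response format is not documented here.

```python
import json
import urllib.request

API_BASE = "https://api.dooperator.ai/api/v1"


def build_ingest_request(org_id, exp_id, api_key, condition, observations, notes=""):
    """Build (but do not send) the POST request for one logged outcome."""
    url = f"{API_BASE}/orgs/{org_id}/experiments/{exp_id}/ingest"
    payload = {"condition_followed": condition, "observations": observations}
    if notes:
        payload["notes"] = notes
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def log_outcome(org_id, exp_id, api_key, condition, observations, notes=""):
    """Send one outcome and return the HTTP status code (assumed useful to the caller)."""
    request = build_ingest_request(org_id, exp_id, api_key, condition, observations, notes)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Splitting request construction from sending keeps the payload easy to inspect or unit-test without touching the network.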

Step 3

Read the results

The Decision Process dashboard shows Bayesian posterior probabilities, effect sizes, and credible intervals per metric. Share a read-only link with stakeholders.

GET /experiments/{exp_id}/results
{
  "metric": "resolution_rate",
  "conditions": {
    "baseline_policy": 0.612,
    "cot_policy": 0.748
  },
  "prob_treatment_better": 0.973,
  "effect_size_d": 0.84
}
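For intuition about a figure like `prob_treatment_better`, here is one common way such a probability can be estimated for a binary metric: a Beta-Binomial model with Monte Carlo sampling. This is a sketch under stated assumptions, not a description of DoOperator's actual computation. The uniform Beta(1, 1) prior, the function name, and the number of draws are all illustrative choices.

```python
import random


def prob_treatment_better(control_successes, control_total,
                          treat_successes, treat_total,
                          draws=20000, seed=0):
    """Estimate P(treatment rate > control rate) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so repeated calls give the same estimate
    wins = 0
    for _ in range(draws):
        # Posterior for each condition is Beta(1 + successes, 1 + failures).
        p_control = rng.betavariate(1 + control_successes,
                                    1 + control_total - control_successes)
        p_treat = rng.betavariate(1 + treat_successes,
                                  1 + treat_total - treat_successes)
        if p_treat > p_control:
            wins += 1
    return wins / draws
```

With identical counts in both conditions the estimate sits near 0.5. It moves toward 1.0 as the treatment's observed rate pulls ahead and the sample grows.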

What you measure

Business outcomes, not just output quality.

Any numeric metric works. These are the ones that matter most for production agents.

resolution_rate

Did the agent resolve without human intervention?

escalation_rate

How often did the agent hand off to a human?

override_rate

How often did a reviewer correct the agent's action?

reopen_rate

Did the task come back after the agent marked it done?

cost_per_task

Token cost for the full agent run.

latency_p95

95th-percentile end-to-end response time.
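As a local sanity check, the metrics above can be computed from your own run records before comparing them with the dashboard. The sketch below assumes a hypothetical record shape (`resolved`, `escalated`, `overridden`, `reopened` as 0/1 flags, plus `cost_usd` and `latency_s`). None of these field names come from the DoOperator API, and the nearest-rank percentile is just one of several common definitions of p95.

```python
def nearest_rank_p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # ceil(0.95 * n) computed with integers to avoid floating-point surprises
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize_runs(runs):
    """Aggregate per-run records (hypothetical field names) into the six metrics."""
    n = len(runs)
    return {
        "resolution_rate": sum(r["resolved"] for r in runs) / n,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
        "override_rate": sum(r["overridden"] for r in runs) / n,
        "reopen_rate": sum(r["reopened"] for r in runs) / n,
        "cost_per_task": sum(r["cost_usd"] for r in runs) / n,
        "latency_p95": nearest_rank_p95([r["latency_s"] for r in runs]),
    }
```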

Scope

A policy experiment tool, not an agent framework.

DoOperator does not build agents, host prompts, or run inference. It sits alongside your existing agent infrastructure — LangChain, LlamaIndex, OpenAI Agents, Bedrock, whatever — and gives you a controlled experiment layer on top of it.

You route traffic to policies however you like. After each run, you POST the outcome. DoOperator accumulates the evidence and tells you which policy wins.
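Since routing is left to you, one simple option is deterministic hash-based assignment, so a given session always lands in the same condition without storing any state. The sketch below is an illustration, not part of DoOperator. The salt value and function name are hypothetical, and the condition names are taken from the example experiment above.

```python
import hashlib

CONDITIONS = ["baseline_policy", "cot_policy"]


def assign_condition(session_id, experiment_salt="support-cot-v1", conditions=CONDITIONS):
    """Map a session ID to a condition with a stable, roughly even split."""
    key = f"{experiment_salt}:{session_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and bucket it by condition count.
    bucket = int.from_bytes(digest[:8], "big") % len(conditions)
    return conditions[bucket]
```

Whatever `assign_condition` returns is the value to send as `condition_followed` when you log that session's outcome. Changing the salt for each experiment reshuffles which sessions fall into which condition.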