Run a controlled experiment between two prompt versions, models, or routing rules. Measure resolution rate, escalation rate, and override rate — not just output quality.
The problem
Most teams ship a new prompt, watch aggregate metrics, and call it a win or a wash. There is no controlled comparison — just noise.
Evals check output quality, not whether the agent actually resolved the ticket, closed the deal, or reduced rework.
A new system prompt ships and you watch the dashboards, hoping the numbers went up. Without a comparison group, you cannot tell whether the prompt caused the change.
When multiple engineers edit prompts and the logged outcome carries no version, there is no way to attribute results to a specific policy.
How it works
Step 1
Define your two policies as conditions, pick the metrics you care about, and set a target observation count. You get back an experiment ID.
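If you create experiments from code rather than from the command line, it can help to assemble and sanity-check the request body first. The following is a minimal Python sketch that mirrors the fields used in this step's request. The helper name `build_experiment_body` and its validation rules are illustrative assumptions, not part of the DoOperator API.

```python
import json


def build_experiment_body(title, hypothesis, control, treatment,
                          metrics, target_observations):
    """Assemble the JSON body for a two-condition experiment.

    Hypothetical helper: the checks below are local sanity checks,
    not rules the API is known to enforce.
    """
    if control == treatment:
        raise ValueError("control and treatment need distinct names")
    if not metrics:
        raise ValueError("at least one outcome metric is required")
    if target_observations <= 0:
        raise ValueError("target_observations must be positive")
    return {
        "title": title,
        "hypothesis": hypothesis,
        "domain": "agentops",
        "design_type": "ab",
        "subject_type": "session",
        "conditions": [
            {"name": control, "is_control": True},
            {"name": treatment, "is_control": False},
        ],
        "outcome_metrics": list(metrics),
        "target_observations": target_observations,
    }


body = build_experiment_body(
    title="Support agent: CoT vs baseline",
    hypothesis="Chain-of-thought prompt reduces escalation rate",
    control="baseline_policy",
    treatment="cot_policy",
    metrics=["resolution_rate", "escalation_rate", "override_rate"],
    target_observations=200,
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the request as its body.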
curl -X POST https://api.dooperator.ai/api/v1/orgs/{org_id}/experiments \
-H "Authorization: Bearer sk_live_…" \
-H "Content-Type: application/json" \
-d '{
"title": "Support agent — CoT vs baseline",
"hypothesis": "Chain-of-thought prompt reduces escalation rate",
"domain": "agentops",
"design_type": "ab",
"subject_type": "session",
"conditions": [
{ "name": "baseline_policy", "is_control": true },
{ "name": "cot_policy", "is_control": false }
],
"outcome_metrics": ["resolution_rate", "escalation_rate", "override_rate"],
"target_observations": 200
}'
Step 2
After each agent run, POST the condition it used and the observed outcomes. Binary metrics (0/1), rates, latency, and cost are all supported.
# After each agent run, log the outcome via API key
curl -X POST https://api.dooperator.ai/api/v1/orgs/{org_id}/experiments/{exp_id}/ingest \
-H "X-Api-Key: dop_live_…" \
-H "Content-Type: application/json" \
-d '{
"condition_followed": "cot_policy",
"observations": {
"resolution_rate": 1,
"escalation_rate": 0,
"override_rate": 0
},
"notes": "ticket_id:t-4821"
}'
Step 3
The Decision Process dashboard shows Bayesian posterior probabilities, effect sizes, and credible intervals per metric. Share a read-only link with stakeholders.
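To build intuition for what a posterior probability and a credible interval mean for a binary metric, here is a minimal Python sketch. It is not the dashboard's method, which this page does not describe. It assumes a uniform Beta(1, 1) prior on each condition's rate and estimates the difference by Monte Carlo sampling. The counts in the example are hypothetical.

```python
import random


def compare_binary_metric(control_successes, control_n,
                          treat_successes, treat_n,
                          draws=20000, seed=0):
    """Beta-Binomial comparison of two rates (illustrative assumption).

    Returns the posterior probability that the treatment rate exceeds
    the control rate, the mean difference, and a 95% credible interval
    for the difference (treatment minus control).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(draws):
        # Posterior under a Beta(1, 1) prior: Beta(1 + successes, 1 + failures)
        p_control = rng.betavariate(1 + control_successes,
                                    1 + control_n - control_successes)
        p_treat = rng.betavariate(1 + treat_successes,
                                  1 + treat_n - treat_successes)
        diffs.append(p_treat - p_control)
    diffs.sort()
    return {
        "prob_treatment_higher": sum(d > 0 for d in diffs) / draws,
        "mean_difference": sum(diffs) / draws,
        "credible_interval_95": (diffs[int(0.025 * draws)],
                                 diffs[int(0.975 * draws) - 1]),
    }


# Hypothetical counts: 62 of 100 sessions resolved under the control,
# 74 of 100 under the treatment.
result = compare_binary_metric(62, 100, 74, 100)
print(result)
```

For a metric where lower is better, such as escalation rate, the quantity of interest is `1 - prob_treatment_higher`.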
What you measure
Any numeric metric works. These are the ones that matter most for production agents.
Resolution rate: Did the agent resolve without human intervention?
Escalation rate: How often did the agent hand off to a human?
Override rate: How often did a reviewer correct the agent's action?
Reopen rate: Did the task come back after the agent marked it done?
Cost: Token cost for the full agent run.
Latency: 95th-percentile end-to-end response time.
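If you want to sanity-check these numbers locally before comparing them with the dashboard, the metrics above can be computed from your own run logs. The following is a minimal Python sketch. The record fields (`resolved`, `escalated`, `overridden`, `reopened`, `token_cost`, `latency_ms`) and the output keys other than `resolution_rate`, `escalation_rate`, and `override_rate` are hypothetical names chosen for illustration.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(runs):
    """Aggregate per-run records into the six metrics listed above."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "resolution_rate": sum(r["resolved"] for r in runs) / n,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
        "override_rate": sum(r["overridden"] for r in runs) / n,
        "reopen_rate": sum(r["reopened"] for r in runs) / n,
        "mean_token_cost": sum(r["token_cost"] for r in runs) / n,
        "p95_latency_ms": p95([r["latency_ms"] for r in runs]),
    }


# Hypothetical run records for one condition.
runs = [
    {"resolved": 1, "escalated": 0, "overridden": 0, "reopened": 0,
     "token_cost": 1200, "latency_ms": 800},
    {"resolved": 1, "escalated": 0, "overridden": 1, "reopened": 0,
     "token_cost": 1500, "latency_ms": 950},
    {"resolved": 0, "escalated": 1, "overridden": 0, "reopened": 0,
     "token_cost": 900, "latency_ms": 2100},
    {"resolved": 1, "escalated": 0, "overridden": 0, "reopened": 1,
     "token_cost": 1400, "latency_ms": 700},
]
summary = summarize(runs)
print(summary)
```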
Scope
DoOperator does not build agents, host prompts, or run inference. It sits alongside your existing agent infrastructure — LangChain, LlamaIndex, OpenAI Agents, Bedrock, whatever — and gives you a controlled experiment layer on top of it.
You route traffic to policies however you like. After each run, you POST the outcome. DoOperator accumulates the evidence and tells you which policy wins.
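One way to wire this up is sketched below in Python: a deterministic hash splits sessions between the two policies, and a helper builds the outcome request for the ingest endpoint shown in Step 2. The hash-based split is only one possible routing choice. The function names and the placeholder IDs and key are assumptions for illustration. The request is built but not sent.

```python
import hashlib
import json
import urllib.request

API_BASE = "https://api.dooperator.ai/api/v1"


def assign_condition(session_id,
                     conditions=("baseline_policy", "cot_policy")):
    """Map a session ID to a condition, stable across calls and processes."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(conditions)
    return conditions[bucket]


def build_ingest_request(org_id, exp_id, api_key,
                         condition, observations, notes=""):
    """Build (but do not send) the POST that logs one run's outcome."""
    url = f"{API_BASE}/orgs/{org_id}/experiments/{exp_id}/ingest"
    body = {
        "condition_followed": condition,
        "observations": observations,
        "notes": notes,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder identifiers; substitute your own org, experiment, and key.
condition = assign_condition("session-123")
request = build_ingest_request(
    org_id="org_demo",
    exp_id="exp_demo",
    api_key="dop_live_placeholder",
    condition=condition,
    observations={"resolution_rate": 1, "escalation_rate": 0,
                  "override_rate": 0},
    notes="ticket_id:t-4821",
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

Hashing the session ID keeps a session on the same policy for its whole lifetime, which keeps each logged outcome attributable to a single condition.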
Create a free org in Decision Process. Define two conditions, pick your metrics, and start logging outcomes. Results appear as data comes in.