Essays
4 posts
Accessible takes on why rigorous experimentation matters, for researchers, builders, and anyone trying to understand what actually works.
"Correlation is not causation" is one of the most-repeated phrases in empirical research. It is also, as usually understood, a dramatic understatement of the actual difficulty. The real challenge is not distinguishing correlation from causation — it is identifying which causal story is correct when several are consistent with the same data.
DoOperator Research, May 29, 2026
Organizations run thousands of A/B tests every year and congratulate themselves on being data-driven. Most of those tests are statistically invalid. Here is why, and what rigorous experimentation actually requires.
DoOperator Research, May 27, 2026
Randomized trials on populations measure average effects in heterogeneous groups. N-of-1 trials measure what actually happens to one specific person. For individual decision-making, the latter is usually more relevant.
DoOperator Research, May 26, 2026
Most findings from nutrition, psychology, and medicine that shaped your beliefs about health and behavior have not replicated. This is not a footnote: it changes what you should believe and how you should act.
DoOperator Research, May 24, 2026
Practitioner's Guides
10 posts
Concise, applied guides to specific methods and concepts: what they are, when to use them, and what the evidence says.
You've spent three weeks tuning a 12-layer policy network for a continuous control task. The reward signal is sparse—your agent gets a non-zero reward maybe once every 200 steps. Policy gradients are producing gradient estimates with variance so high that your loss curve looks li
DoOperator Research, May 11, 2026
You've just launched a new feature on your platform. Your product team is confident it will increase engagement. The A/B test shows a statistically significant 0.3% lift in daily active users (p = 0.04, N = 500,000). Your VP wants to ship it tomorrow. But you've been burned befor
DoOperator Research, May 10, 2026
You're the CTO of a robotics startup. Your warehouse robot has logged 10,000 hours of pick-and-place trajectories, stored as state-action-reward-next_state tuples on a NAS drive. Your competitor is deploying a new fleet next quarter. You need a policy that outperforms your curren
DoOperator Research, May 9, 2026
You're the head of experimentation at a large e-commerce platform. Your team has just launched a new recommendation algorithm, but you can't run a standard A/B test because the algorithm needs to learn from user behavior in real time, and the optimal recommendation changes as user
DoOperator Research, May 8, 2026
Your team has been running A/B tests for six months on a recommendation system. Each week, you launch a new feature variant, monitor the p-value dashboard, and stop when it crosses 0.05. Your boss wants to know: how many of your "significant" results are real? The answer, from Jo
DoOperator Research, May 7, 2026
You're the head of clinical analytics at a health system. Your team has just rolled out a new remote monitoring program for heart failure patients. The average treatment effect is positive: readmissions dropped 8% overall. Your CEO wants to expand it system-wide. But you have a na
DoOperator Research, May 6, 2026
You've just finished collecting data from a randomized experiment with 1,200 participants. Your treatment effect estimate looks promising: a 3.2 percentage point reduction in the primary outcome. But when you compute the 95% confidence interval using the standard Wald formula, you
DoOperator Research, May 5, 2026
You're a product leader at a health-tech company. Your team just ran an A/B test on a new onboarding flow: 10,000 users randomized, treatment gets the new flow, control gets the old one. The treatment group shows a 12% relative lift in 7-day retention. Your CEO wants to know: sho
DoOperator Research, May 4, 2026
You're a product manager at a major e-commerce platform. Your team wants to test a new recommendation algorithm. Engineering can deploy it to 5% of users. Your analytics team runs the A/B test, gets a statistically significant 0.3% revenue lift (p=0.04, n=500,000), and recommends
DoOperator Research, May 3, 2026
Your product team just ran an A/B test. The treatment group saw a 12% lift in conversion. You're about to ship it. But your data scientist says: "The lift only appeared in the last three days of the experiment, and those days had a holiday promotion running for the control group
DoOperator Research, May 2, 2026