These are the questions we plan to investigate using DoOperator's pooled experimental data. Each study is designed to close a gap in the existing literature on behavioral interventions, adaptive experiment design, or causal inference in N-of-1 settings.
Why DoOperator can do this research
Most behavioral science studies are underpowered single experiments run once in a single lab. DoOperator generates a continuous stream of pre-registered, within-person crossover experiments across thousands of users — creating a data asset that makes previously intractable questions answerable. Users opt into pooled analysis; every experiment run on the platform is a potential contribution to the shared research record.
Gap in existing literature
The literature on heterogeneous treatment effects (HTE) has powerful estimation methods (X-learner, causal forests, R-learner), but almost no empirical benchmarks on how often meaningful heterogeneity exists in real behavioral interventions or on which covariate dimensions drive it. Most field experiments are underpowered to detect heterogeneity even when it is present.
Our approach
Use DoOperator's pooled experiment data to estimate treatment effect distributions across user subgroups (by baseline metric level, time-of-day, consistency of prior behavior). Compare R-learner and causal forest estimates against oracle subgroup effects. Build a public benchmark dataset.
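As a rough illustration of what comparing estimates against oracle subgroup effects could look like, the sketch below uses synthetic data in which the true effect per subgroup is known. It estimates each subgroup's effect with a plain difference in means, which is a naive baseline and not the R-learner or causal forest named above. The subgroup labels, effect sizes, and the `subgroup_effects` helper are all hypothetical.

```python
import random
import statistics


def subgroup_effects(records):
    """Difference in mean outcome (treated minus control) per subgroup.

    records: iterable of (subgroup, treated, outcome) tuples.
    Returns {subgroup: (effect, standard_error)}.
    """
    by_group = {}
    for subgroup, treated, outcome in records:
        arms = by_group.setdefault(subgroup, {True: [], False: []})
        arms[bool(treated)].append(outcome)

    effects = {}
    for subgroup, arms in by_group.items():
        t, c = arms[True], arms[False]
        if len(t) < 2 or len(c) < 2:
            continue  # too few observations to estimate a variance
        effect = statistics.mean(t) - statistics.mean(c)
        se = (statistics.variance(t) / len(t)
              + statistics.variance(c) / len(c)) ** 0.5
        effects[subgroup] = (effect, se)
    return effects


# Synthetic data with a known ("oracle") effect per hypothetical subgroup.
rng = random.Random(0)
oracle = {"low_baseline": 1.0, "high_baseline": 0.0}
records = []
for subgroup, true_effect in oracle.items():
    for _ in range(2000):
        treated = rng.random() < 0.5
        outcome = rng.gauss(0.0, 1.0) + (true_effect if treated else 0.0)
        records.append((subgroup, treated, outcome))

estimates = subgroup_effects(records)
for subgroup, (effect, se) in estimates.items():
    print(f"{subgroup}: estimate={effect:.2f} (SE {se:.2f}), "
          f"oracle={oracle[subgroup]:.2f}")
```

A benchmark along these lines would swap the difference-in-means step for each HTE estimator under test and score it against the same oracle values.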
Hypothesis
“Meaningful HTE exists in >40% of behavioral interventions but most individual experiments are too small to detect it.”
Gap in existing literature
Bandit algorithms have strong theoretical guarantees for sequential exploration, but the standard regret bounds assume stationary reward distributions. Human behavioral outcomes are highly non-stationary: they depend on circadian rhythms, adaptation, and carryover effects that violate the i.i.d. assumption. To our knowledge, no empirical study has compared adaptive and fixed designs in a within-person behavioral experiment setting.
Our approach
Run a multi-arm experiment where one cohort uses a 7-day fixed crossover design and another uses Thompson sampling with 7-day arms. Compare cumulative regret, time-to-convergence, and final policy quality. Use the DoOperator platform to deploy both designs to matched user groups.
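To make the cumulative-regret comparison concrete, here is a minimal simulation sketch. It assumes stationary Gaussian outcomes with known noise, treats each 7-day arm as a single period, and ignores carryover entirely, so it covers only the setting in which bandit guarantees apply. The arm means, prior variance, and function names are illustrative assumptions, not platform parameters.

```python
import random


def thompson_regret(true_means, n_periods, noise_sd=1.0, prior_var=100.0, seed=0):
    """Cumulative regret of Gaussian Thompson sampling (known noise variance)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(true_means)
    regret = 0.0
    for _ in range(n_periods):
        draws = []
        for a in range(k):
            # Normal posterior for the arm mean under a N(0, prior_var) prior.
            precision = 1.0 / prior_var + counts[a] / noise_sd ** 2
            post_mean = (sums[a] / noise_sd ** 2) / precision
            draws.append(rng.gauss(post_mean, precision ** -0.5))
        arm = max(range(k), key=draws.__getitem__)  # play the best sampled arm
        reward = rng.gauss(true_means[arm], noise_sd)
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]
    return regret


def fixed_crossover_regret(true_means, n_periods):
    """Cumulative regret of cycling through the arms in a fixed order."""
    best = max(true_means)
    k = len(true_means)
    return sum(best - true_means[t % k] for t in range(n_periods))


true_means = [0.0, 0.5, 1.0]  # hypothetical arm effects, in outcome SD units
n_periods = 200
ts_runs = [thompson_regret(true_means, n_periods, seed=s) for s in range(20)]
ts_mean_regret = sum(ts_runs) / len(ts_runs)
fixed_regret = fixed_crossover_regret(true_means, n_periods)
print(f"Thompson sampling mean regret: {ts_mean_regret:.1f}")
print(f"Fixed crossover regret:        {fixed_regret:.1f}")
```

Testing the carryover part of the hypothesis would require adding a lagged term to the reward, which this sketch deliberately leaves out.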
Hypothesis
“Thompson sampling reaches 80% of optimal policy quality in half the trials required by fixed crossover when outcomes are stationary, but underperforms on outcomes with >3-day carryover.”
Gap in existing literature
Standard power analysis for between-subjects designs is well understood, but within-person N-of-1 designs have fundamentally different power curves: autocorrelation between measurements can either inflate or deflate Type I error depending on arm ordering and carryover. The field lacks empirical calibration of how many trials are actually needed for 80% power for common behavioral outcomes (sleep, HRV, focus).
Our approach
Bootstrap from existing DoOperator experiment data to construct empirical power curves for sleep score, HRV RMSSD, and focus rating. Vary: number of trials per arm (4–30), washout period (0–3 days), block structure (random vs. alternating). Publish calibration tables practitioners can use directly.
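The sketch below shows the general shape of such a power calculation. It simulates instead of bootstrapping, because no platform data is available here. It assumes AR(1) noise with unit variance, an effect expressed in SD units, and a naive two-sample z-test that ignores autocorrelation; washout is not modeled. The `simulate_power` function and its defaults are hypothetical.

```python
import random
import statistics


def simulate_power(n_per_arm, effect_sd, rho=0.3, n_sims=1000,
                   alternating=True, seed=0):
    """Share of simulated experiments with |z| > 1.96 under a naive z-test.

    Noise follows an AR(1) process with lag-1 correlation rho.
    Arms are either strictly alternated or randomly ordered.
    """
    rng = random.Random(seed)
    n = 2 * n_per_arm
    rejections = 0
    for _ in range(n_sims):
        if alternating:
            arms = [t % 2 for t in range(n)]
        else:
            arms = [0] * n_per_arm + [1] * n_per_arm
            rng.shuffle(arms)
        noise = rng.gauss(0.0, 1.0)
        treated, control = [], []
        for arm in arms:
            (treated if arm else control).append(noise + effect_sd * arm)
            # Advance the AR(1) process while keeping unit marginal variance.
            noise = rho * noise + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        se = (statistics.variance(treated) / n_per_arm
              + statistics.variance(control) / n_per_arm) ** 0.5
        z = (statistics.mean(treated) - statistics.mean(control)) / se
        if abs(z) > 1.96:
            rejections += 1
    return rejections / n_sims


null_rate = simulate_power(20, 0.0)    # rejection rate with no true effect
power_large = simulate_power(20, 1.0)  # power for a 1 SD effect
print(f"rejection rate at zero effect: {null_rate:.3f}")
print(f"power at 1 SD effect:          {power_large:.3f}")
```

Sweeping `n_per_arm`, `rho`, and `alternating` over a grid yields the kind of table the approach describes. The rejection rate at zero effect shows how far the naive test drifts from its nominal 5% level under a given ordering.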
Hypothesis
“80% power for detecting a 0.3 SD effect in sleep score requires ≥10 trials per arm with washout ≥1 day, versus the standard rule-of-thumb of 5.”
Gap in existing literature
Empirical Bayes and hierarchical shrinkage are theoretically attractive when individual experiments are small, but the magnitude of improvement in realistic behavioral experiment settings is unknown. The James-Stein result guarantees lower total squared error than independent estimation when k ≥ 3 units are estimated jointly, but the practical gains depend on the intraclass correlation (ICC) and the signal-to-noise ratio of the behavioral outcome. Both parameters are empirically unconstrained in this domain.
Our approach
Using DoOperator's community posterior infrastructure, compare per-user effect estimates under (1) independent Bayesian analysis, (2) empirical Bayes shrinkage, and (3) full hierarchical model with horseshoe prior. Evaluate against held-out trials. Quantify the ICC for common behavioral outcomes.
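For intuition about option (2), the following sketch applies a simple method-of-moments empirical Bayes shrinkage to simulated per-user effect estimates and compares squared error against the raw estimates. It assumes every user has the same known sampling variance and a normal distribution of true effects. It does not implement the horseshoe hierarchical model or touch the community posterior infrastructure. All numbers and names are illustrative.

```python
import random
import statistics


def empirical_bayes_shrink(estimates, sampling_var):
    """Shrink per-user estimates toward the grand mean.

    sampling_var: known variance of each user's estimate (must be > 0).
    Returns (shrunk_estimates, weight), where weight is the share of each
    estimate's deviation from the grand mean that is retained.
    """
    grand_mean = statistics.mean(estimates)
    # Method of moments: observed variance = between-user variance + noise.
    between_var = max(0.0, statistics.variance(estimates) - sampling_var)
    weight = between_var / (between_var + sampling_var)
    shrunk = [grand_mean + weight * (y - grand_mean) for y in estimates]
    return shrunk, weight


rng = random.Random(1)
n_users = 200
true_effects = [rng.gauss(0.3, 0.2) for _ in range(n_users)]  # hypothetical
noise_sd = 0.5                                                # hypothetical
raw = [mu + rng.gauss(0.0, noise_sd) for mu in true_effects]

shrunk, weight = empirical_bayes_shrink(raw, noise_sd ** 2)


def mse(estimates):
    return statistics.mean((e - mu) ** 2 for e, mu in zip(estimates, true_effects))


mse_raw, mse_shrunk = mse(raw), mse(shrunk)
print(f"retained weight: {weight:.2f}")
print(f"MSE raw: {mse_raw:.3f}   MSE shrunk: {mse_shrunk:.3f}")
```

The retained weight plays a role similar to an ICC: it is the estimated share of total variance that comes from real between-user differences. In the study itself, the evaluation would be against held-out trials, since true effects are not observable.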
Hypothesis
“Partial pooling reduces posterior variance by 30–60% for sleep outcomes and 15–30% for cognitive outcomes, with gains inversely proportional to experiment length.”
Gap in existing literature
Behavioral health outcomes (sleep, HRV, mood, performance) are causally entangled in ways that affect which experiments are valid and which confounders must be controlled. The causal graph connecting these variables is assumed but never estimated from observational within-person data. Without it, experimental designs may inadvertently block causal pathways or ignore mediators that explain heterogeneity.
Our approach
Apply the PC algorithm and LiNGAM to DoOperator's longitudinal within-person observational data (continuous passive metrics from Garmin/Oura integrations). Estimate causal structure separately for each user and pool via a majority-vote meta-graph. Identify stable edges (present in >70% of users) as structural priors for future experimental designs.
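The pooling step can be sketched independently of the discovery algorithm. The example below assumes each user's graph has already been estimated and is available as a list of directed (cause, effect) edges. It does not run PC or LiNGAM, and the variable names and toy graphs are made up for illustration.

```python
from collections import Counter


def stable_edges(user_graphs, threshold=0.7):
    """Return {edge: share_of_users} for edges present in more than
    `threshold` of the per-user graphs.

    user_graphs: list of iterables of (cause, effect) tuples, one per user.
    """
    counts = Counter()
    for edges in user_graphs:
        counts.update(set(edges))  # count each edge at most once per user
    n_users = len(user_graphs)
    return {
        edge: count / n_users
        for edge, count in counts.items()
        if count / n_users > threshold
    }


# Toy per-user graphs (hypothetical output of a causal discovery step).
user_graphs = [
    [("activity", "sleep"), ("sleep", "hrv")],
    [("activity", "sleep"), ("sleep", "hrv"), ("sleep", "mood")],
    [("activity", "sleep"), ("sleep", "hrv")],
    [("activity", "sleep"), ("hrv", "sleep")],
]

meta_graph = stable_edges(user_graphs, threshold=0.7)
for (cause, effect), share in sorted(meta_graph.items()):
    print(f"{cause} -> {effect}: present in {share:.0%} of users")
```

Because edges are directed tuples, a user whose graph contains the reverse edge counts against the forward edge's share, which is how direction disagreements across users show up in the pooled result.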
Hypothesis
“Sleep quality causally precedes next-day HRV (not the reverse); physical activity affects mood primarily through sleep quality (mediated path).”
Contribute data
When you run an experiment on DoOperator and opt into pooled analysis, your data counts toward the thresholds above. The platform is designed so that individual self-experimentation and original research reinforce each other.