Research Agenda — DoOperator

Why DoOperator can do this research

Most behavioral science studies are underpowered single experiments run once in a single lab. DoOperator generates a continuous stream of pre-registered, within-person crossover experiments across thousands of users — creating a data asset that makes previously intractable questions answerable. Users opt into pooled analysis; every experiment run on the platform is a potential contribution to the shared research record.

Q1 · Heterogeneous Treatment Effects · Active

When does treatment effect heterogeneity matter in practice?

Gap in existing literature

The HTE literature has powerful estimation methods (X-learner, causal forests, R-learner), but almost no empirical benchmarks on how often meaningful heterogeneity exists in real behavioral interventions, or on which covariate dimensions drive it. Most field experiments are underpowered to detect heterogeneity even when it is present.

Our approach

Use DoOperator's pooled experiment data to estimate treatment effect distributions across user subgroups (by baseline metric level, time-of-day, consistency of prior behavior). Compare R-learner and causal forest estimates against oracle subgroup effects. Build a public benchmark dataset.

Hypothesis

“Meaningful HTE exists in >40% of behavioral interventions but most individual experiments are too small to detect it.”

Method: Causal forest + R-learner on pooled crossover data; honest sample-splitting for inference.

Data needed: ~200 users × 14-day experiments across 3+ behavioral domains.
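
To make the R-learner step in the method above concrete, here is a minimal sketch, not the platform's implementation. It assumes a linear effect model tau(x) = b0 + b1 * x, a known randomization probability of 0.5 (as in a balanced crossover), and synthetic data. The function names, the data generator, and all parameter values are hypothetical, and rows are treated as independent, which pooled within-person data would violate without further modeling.

```python
import numpy as np


def r_learner_linear(X, w, y, propensity=0.5, n_folds=2, seed=0):
    """Linear R-learner sketch: returns beta with tau(x) = [1, x] @ beta.

    X: (n, p) covariates, w: (n,) binary treatment, y: (n,) outcome.
    The propensity is assumed known (randomized design), so only the
    outcome model m(x) = E[y | x] is estimated, with cross-fitting.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])

    # Cross-fitted outcome model: predict each fold from the other folds.
    folds = rng.permutation(n) % n_folds
    m_hat = np.empty(n)
    for k in range(n_folds):
        held_out = folds == k
        coef, *_ = np.linalg.lstsq(design[~held_out], y[~held_out], rcond=None)
        m_hat[held_out] = design[held_out] @ coef

    # Residualize outcome and treatment, then solve the R-loss
    # sum((y_res - w_res * tau(x))^2) for a linear tau by least squares.
    y_res = y - m_hat
    w_res = w - propensity
    beta, *_ = np.linalg.lstsq(design * w_res[:, None], y_res, rcond=None)
    return beta


def simulate_pooled_data(n=4000, seed=1):
    """Synthetic data (hypothetical): the true effect is 0.2 + 0.3 * x."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 1))          # e.g. standardized baseline metric
    w = rng.integers(0, 2, size=n)       # randomized treatment indicator
    y = 1.0 + 0.5 * X[:, 0] + w * (0.2 + 0.3 * X[:, 0]) + rng.normal(size=n)
    return X, w, y


X, w, y = simulate_pooled_data()
beta = r_learner_linear(X, w, y)
print("estimated tau(x) = %.2f + %.2f * x" % (beta[0], beta[1]))
```

A nonzero slope coefficient is what "meaningful heterogeneity along this covariate" would look like in this simplified setting. A causal forest would replace the linear tau with a nonparametric one.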

Q2 · Sequential Decision-Making · Active

Can Thompson sampling outperform fixed crossover designs for individual behavior optimization?

Gap in existing literature

Bandit algorithms such as Thompson sampling come with strong regret guarantees for sequential exploration, but the standard guarantees assume stationary reward distributions. Human behavioral outcomes are highly non-stationary: they depend on circadian rhythms, adaptation, and carryover effects that violate the i.i.d. assumption. To our knowledge, no empirical study has compared adaptive and fixed designs in a within-person behavioral experiment setting.

Our approach

Run a multi-arm experiment where one cohort uses a 7-day fixed crossover design and another uses Thompson sampling with 7-day arms. Compare cumulative regret, time-to-convergence, and final policy quality. Use the DoOperator platform to deploy both designs to matched user groups.

Hypothesis

“Thompson sampling reaches 80% of optimal policy quality in half the trials required by fixed crossover when outcomes are stationary, but underperforms on outcomes with >3-day carryover.”

Method: Beta-Bernoulli Thompson sampling vs. fixed block crossover; Bayesian regret estimation.

Data needed: 50+ users per arm, 21+ days per user.
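
For readers unfamiliar with the adaptive arm of this comparison, the following is a minimal sketch of Beta-Bernoulli Thompson sampling. It assumes binary rewards (for example, "outcome beat personal baseline", a hypothetical coding), uniform Beta(1, 1) priors, and stationary success probabilities, which is exactly the assumption this question puts under test. The function name and the example probabilities are illustrative only.

```python
import numpy as np


def thompson_sampling(true_success_probs, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling on stationary arms.

    Returns the number of pulls per arm and the cumulative expected regret.
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(true_success_probs, dtype=float)
    n_arms = len(probs)
    successes = np.ones(n_arms)   # Beta(1, 1) prior: alpha parameters
    failures = np.ones(n_arms)    # Beta(1, 1) prior: beta parameters
    pulls = np.zeros(n_arms, dtype=int)
    regret = 0.0

    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its posterior,
        # then play the arm whose draw is largest.
        sampled = rng.beta(successes, failures)
        arm = int(np.argmax(sampled))
        reward = rng.random() < probs[arm]

        # Conjugate posterior update for the played arm only.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        regret += probs.max() - probs[arm]

    return pulls, regret


pulls, regret = thompson_sampling([0.3, 0.5, 0.7], n_rounds=2000)
print("pulls per arm:", pulls, "cumulative regret: %.1f" % regret)
```

A fixed crossover would instead split rounds evenly across arms regardless of observed outcomes. Carryover between consecutive rounds is not modeled here.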

Q3 · Experiment Design · Planned

What is the minimum viable experiment for reliable within-person causal inference?

Gap in existing literature

Standard power analysis for between-subjects designs is well understood, but within-person N-of-1 designs have fundamentally different power curves: autocorrelation between measurements can either inflate or deflate Type I error depending on arm ordering and carryover. The field lacks empirical calibration of how many trials are actually needed for 80% power in common behavioral outcomes (sleep, HRV, focus).

Our approach

Bootstrap from existing DoOperator experiment data to construct empirical power curves for sleep score, HRV RMSSD, and focus rating. Vary: number of trials per arm (4–30), washout period (0–3 days), block structure (random vs. alternating). Publish calibration tables practitioners can use directly.

Hypothesis

“80% power for detecting a 0.3 SD effect in sleep score requires ≥10 trials per arm with washout ≥1 day, versus the standard rule-of-thumb of 5.”

Method: Parametric bootstrap + Bayesian hierarchical model; compare to analytical power formulas.

Data needed: 100+ completed experiments with ≥10 trials per arm.
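
The idea of an empirical power curve can be illustrated with a purely synthetic sketch. It swaps the bootstrap from real experiment data for simulated AR(1) noise, assumes an alternating A/B schedule with no washout, and calibrates the rejection cutoff by simulation under the null so that autocorrelation does not distort the Type I error. The AR(1) coefficient, effect size, and function names are assumptions chosen for illustration.

```python
import numpy as np


def simulate_diffs(n_per_arm, effect_sd, rho, n_sims, rng):
    """Mean difference (B minus A) for alternating trials with AR(1) noise."""
    n = 2 * n_per_arm
    arm = np.arange(n) % 2                     # schedule: A, B, A, B, ...
    noise = np.empty((n_sims, n))
    noise[:, 0] = rng.normal(size=n_sims)      # unit marginal variance
    innovation_sd = np.sqrt(1 - rho**2)
    for t in range(1, n):
        noise[:, t] = rho * noise[:, t - 1] + innovation_sd * rng.normal(size=n_sims)
    y = noise + effect_sd * arm                # effect expressed in SD units
    return y[:, arm == 1].mean(axis=1) - y[:, arm == 0].mean(axis=1)


def empirical_power(n_per_arm, effect_sd=0.3, rho=0.3, alpha=0.05,
                    n_sims=4000, seed=0):
    """Share of simulated experiments that exceed a null-calibrated cutoff."""
    rng = np.random.default_rng(seed)
    null_diffs = simulate_diffs(n_per_arm, 0.0, rho, n_sims, rng)
    cutoff = np.quantile(np.abs(null_diffs), 1 - alpha)   # two-sided
    alt_diffs = simulate_diffs(n_per_arm, effect_sd, rho, n_sims, rng)
    return float(np.mean(np.abs(alt_diffs) > cutoff))


for n_trials in (4, 10, 30):
    print(n_trials, "trials per arm -> power", round(empirical_power(n_trials), 2))
```

Because the cutoff is calibrated with knowledge of the true noise process, these numbers are an optimistic upper reference, not a substitute for the calibration tables described above.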

Q4 · Statistical Foundations · Planned

How much does partial pooling improve effect estimates in small N-of-1 experiments?

Gap in existing literature

Empirical Bayes and hierarchical shrinkage are theoretically well motivated when individual experiments are small, but the magnitude of improvement in realistic behavioral experiment settings is unknown. The James-Stein result (Stein's paradox) guarantees lower total squared-error risk than unpooled estimates when k ≥ 3 units under Gaussian noise, but the practical gains depend on the ICC and the signal-to-noise ratio of the behavioral outcome, parameters that are poorly constrained empirically in this domain.

Our approach

Using DoOperator's community posterior infrastructure, compare per-user effect estimates under (1) independent Bayesian analysis, (2) empirical Bayes shrinkage, and (3) full hierarchical model with horseshoe prior. Evaluate against held-out trials. Quantify the ICC for common behavioral outcomes.

Hypothesis

“Partial pooling reduces posterior variance by 30–60% for sleep outcomes and 15–30% for cognitive outcomes, with gains inversely proportional to experiment length.”

Method: Normal-Gamma conjugate hierarchy; NUTS sampler for full hierarchical model; leave-one-out CV.

Data needed: 50+ users with ≥7 trials per arm on the same experimental intervention.
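
Option (2) above, empirical Bayes shrinkage, can be sketched with a normal-normal model and a method-of-moments estimate of the between-user variance. This is a simplified stand-in for the hierarchical models named in the method line, not the platform's community posterior code. The function name and the simulated effect sizes are hypothetical.

```python
import numpy as np


def empirical_bayes_shrinkage(estimates, std_errors):
    """Shrink per-user effect estimates toward a pooled mean.

    Assumes estimate_i ~ Normal(theta_i, se_i^2) and
    theta_i ~ Normal(mu, tau2), with mu and tau2 estimated from the data.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2

    # Method-of-moments between-user variance, truncated at zero.
    tau2 = max(np.var(estimates, ddof=1) - variances.mean(), 0.0)

    # Pooled mean weighted by total (between + within) precision.
    mu = np.average(estimates, weights=1.0 / (tau2 + variances))

    # Weight on each user's own data: 1 = no pooling, 0 = complete pooling.
    own_weight = tau2 / (tau2 + variances)
    return mu + own_weight * (estimates - mu)


# Synthetic check (hypothetical values): small true spread, noisy estimates.
rng = np.random.default_rng(0)
true_effects = rng.normal(0.2, 0.1, size=200)
std_errors = np.full(200, 0.3)
raw = true_effects + rng.normal(0.0, std_errors)
shrunk = empirical_bayes_shrinkage(raw, std_errors)
print("raw MSE:    %.4f" % np.mean((raw - true_effects) ** 2))
print("shrunk MSE: %.4f" % np.mean((shrunk - true_effects) ** 2))
```

The gain shrinks as experiments get longer, because smaller standard errors push the own-data weight toward 1.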

Q5 · Causal Inference · Planned

What is the causal graph structure of common behavioral outcomes?

Gap in existing literature

Behavioral health outcomes (sleep, HRV, mood, performance) are causally entangled in ways that affect which experiments are valid and which confounders must be controlled. The causal graph connecting these variables is usually assumed rather than estimated from observational within-person data. Without it, experimental designs may inadvertently block causal pathways or ignore mediators that explain heterogeneity.

Our approach

Apply PC algorithm and LiNGAM to DoOperator's longitudinal within-person observational data (continuous passive metrics from Garmin/Oura integrations). Estimate causal structure separately for each user and pool via a majority-vote meta-graph. Identify stable edges (present in >70% of users) as structural priors for future experimental designs.

Hypothesis

“Sleep quality causally precedes next-day HRV (not the reverse); physical activity affects mood primarily through sleep quality (mediated path).”

Method: PC algorithm with Fisher Z test; LiNGAM for orientation; bootstrap for edge stability.

Data needed: 30+ days of continuous passively collected data per user, ≥100 users.
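
The PC algorithm removes an edge whenever a conditional independence test fails to reject. The Fisher Z test named in the method line can be sketched as follows. It assumes jointly Gaussian variables and independent rows, which daily within-person series only approximate because of autocorrelation. The function name, column layout, and the synthetic chain are hypothetical.

```python
import math

import numpy as np


def fisher_z_test(data, i, j, cond=()):
    """Two-sided p-value for H0: column i is independent of column j
    given the columns listed in cond (Gaussian partial correlation)."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    precision = np.linalg.inv(corr)
    # Partial correlation of i and j given cond, from the precision matrix.
    r = -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])
    r = min(max(r, -0.999999), 0.999999)       # guard the log below
    z = 0.5 * math.log((1 + r) / (1 - r))      # Fisher z-transform
    n = data.shape[0]
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))      # equals 2 * (1 - Phi(stat))


# Synthetic chain (hypothetical): sleep -> hrv -> mood.
rng = np.random.default_rng(0)
sleep = rng.normal(size=2000)
hrv = 0.8 * sleep + rng.normal(size=2000)
mood = 0.8 * hrv + rng.normal(size=2000)
data = np.column_stack([sleep, hrv, mood])

print("sleep vs mood:       p =", fisher_z_test(data, 0, 2))
print("sleep vs mood | hrv: p =", fisher_z_test(data, 0, 2, cond=(1,)))
```

In this chain, sleep and mood are marginally dependent but independent given HRV, so PC would drop the direct sleep-mood edge. Orienting the remaining edges is the part delegated to LiNGAM in the method line.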

Contribute data

Every experiment contributes

When you run an experiment on DoOperator and opt into pooled analysis, your data counts toward the thresholds above. The platform is designed so that individual self-experimentation and original research reinforce each other.
