1. Define the sample and the potential outcomes function.
2. Define the treatment assignment procedure.
3. Create data.
4. Assign treatment, then estimate the effect.
5. Do this many times.
Examples
Complete randomization
With covariates
With cluster randomization
Example: Simulation-based power for complete randomization
# install.packages("randomizr")## Y0 is fixed in most field experiments.## So we only generate it once:make_Y0 <-function(N){ rnorm(n = N) }repeat_experiment_and_test <-function(N, Y0, tau){ Y1 <- Y0 + tau Z <-complete_ra(N = N) Yobs <- Z * Y1 + (1- Z) * Y0 estimator <-lm_robust(Yobs ~ Z) pval <- estimator$p.value[2]return(pval)}
Example: Simulation-based power for complete randomization
```r
library(DeclareDesign)

P0 <- declare_population(N = N, u0 = rnorm(N))
# declare Y(Z=1) and Y(Z=0)
O0 <- declare_potential_outcomes(Y_Z_0 = 5 + u0, Y_Z_1 = Y_Z_0 + tau)
# design is to assign m units to treatment
A0 <- declare_assignment(Z = conduct_ra(N = N, m = round(N / 2)))
# estimand is the average difference between Y(Z=1) and Y(Z=0)
estimand_ate <- declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0))
R0 <- declare_reveal(Y, Z)

design0_base <- P0 + A0 + O0 + R0

## For example:
design0_N100_tau25 <- redesign(design0_base, N = 100, tau = .25)
dat0_N100_tau25 <- draw_data(design0_N100_tau25)
head(dat0_N100_tau25)
```
```r
# Estimate the treatment effect using a t-test
lm_robust(Y ~ Z, data = dat0_N100_tau25)$coef # estimate
```
(Intercept) Z
4.8457631 0.5569162
```r
E0 <- declare_estimator(Y ~ Z, model = lm_robust, label = "t test 1",
  inquiry = "ATE")

t_test <- function(data) {
  test <- with(data, t.test(x = Y[Z == 1], y = Y[Z == 0]))
  data.frame(statistic = test$statistic, p.value = test$p.value)
}
T0 <- declare_test(handler = label_test(t_test), label = "t test 2")

design0_plus_tests <- design0_base + E0 + T0
design0_N100_tau25_plus <- redesign(design0_plus_tests, N = 100, tau = .25)

## Only repeat the random assignment, not the creation of Y0. Ignore warning.
names(design0_N100_tau25_plus)
```
[1] "P0" "A0" "O0" "R0" "t test 1" "t test 2"
```r
design0_N100_tau25_sims <- simulate_design(design0_N100_tau25_plus,
  sims = c(1, 100, 1, 1, 1, 1)) # only repeat the random assignment
# design0_N100_tau25_sims has 200 rows (2 tests * 100 random assignments)
# just look at the first 6 rows
head(design0_N100_tau25_sims)
```
design N tau sim_ID estimator term estimate std.error
1 design0_N100_tau25_plus 100 0.25 1 t test 1 Z 0.1107677 0.2149666
2 design0_N100_tau25_plus 100 0.25 1 t test 2 <NA> NA NA
3 design0_N100_tau25_plus 100 0.25 2 t test 1 Z 0.2458211 0.2154258
4 design0_N100_tau25_plus 100 0.25 2 t test 2 <NA> NA NA
5 design0_N100_tau25_plus 100 0.25 3 t test 1 Z 0.5463041 0.2133367
6 design0_N100_tau25_plus 100 0.25 3 t test 2 <NA> NA NA
statistic p.value conf.low conf.high df outcome inquiry step_1_draw
1 0.5152789 0.6075185 -0.3158265 0.5373619 98 Y ATE 1
2 0.5152789 0.6075435 NA NA NA <NA> <NA> 1
3 1.1410944 0.2566114 -0.1816843 0.6733266 98 Y ATE 1
4 1.1410944 0.2566230 NA NA NA <NA> <NA> 1
5 2.5607596 0.0119701 0.1229443 0.9696640 98 Y ATE 1
6 2.5607596 0.0120273 NA NA NA <NA> <NA> 1
step_2_draw
1 1
2 1
3 2
4 2
5 3
6 3
Example: Using DeclareDesign
Now count the number of simulations where the p-value is less than 0.05.
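The counting code is not shown in this excerpt; a sketch of what it would look like, mirroring the `summarize` call used in the clustered example below:

```r
library(dplyr)

design0_N100_tau25_sims %>%
  group_by(estimator) %>%
  summarize(pow = mean(p.value < .05), .groups = "drop")
```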
# A tibble: 2 × 2
estimator pow
<chr> <dbl>
1 t test 1 0.2
2 t test 2 0.2
Power with covariate adjustment
Covariate adjustment and power
Covariate adjustment can improve power because it mops up variation in the outcome variable.
If prognostic, covariate adjustment can reduce variance dramatically. Lower variance means higher power.
If non-prognostic, power gains are minimal.
All covariates must be pre-treatment. Do not drop observations on account of missingness.
Beware Freedman’s bias: regression adjustment can bias estimates, and the bias grows as the number of observations \(n\) decreases and the number of covariates \(K\) increases.
Blocking
Blocking: randomly assign treatment within blocks
“Ex-ante” covariate adjustment
Higher precision/efficiency implies more power
Reduce “conditional bias”: association between treatment assignment and potential outcomes
Benefits of blocking over covariate adjustment are clearest in small experiments (see the sketch below).
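A minimal sketch of blocked assignment and estimation (our illustration, not from the slides; the blocking variable `b` and the effect size are made up):

```r
library(randomizr)
library(estimatr)

set.seed(1234)
N <- 100
b <- rep(c("low", "high"), each = N / 2)  # hypothetical blocking variable
Y0 <- rnorm(N) + (b == "high")            # b is prognostic for Y0
Z <- block_ra(blocks = b)                 # randomize within each block
Y <- Y0 + .25 * Z
lm_robust(Y ~ Z, fixed_effects = ~ b)     # estimate with block fixed effects
```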
Example: Simulation-based power with a covariate
```r
## Y0 is fixed in most field experiments. So we only generate it once.
make_Y0_cov <- function(N){
  u0 <- rnorm(n = N)
  x <- rpois(n = N, lambda = 2)
  Y0 <- .5 * sd(u0) * x + u0
  return(data.frame(Y0 = Y0, x = x))
}

## x is moderately predictive of Y0.
test_dat <- make_Y0_cov(100)
test_lm <- lm_robust(Y0 ~ x, data = test_dat)
summary(test_lm)
```
Call:
lm_robust(formula = Y0 ~ x, data = test_dat)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.1100 0.18801 0.5852 5.598e-01 -0.2631 0.4831 98
x 0.4404 0.08137 5.4128 4.411e-07 0.2790 0.6019 98
Multiple R-squared: 0.231 , Adjusted R-squared: 0.2232
F-statistic: 29.3 on 1 and 98 DF, p-value: 4.411e-07
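To see the precision gain in a single draw, compare the unadjusted and covariate-adjusted estimators on data from `make_Y0_cov` (our illustration; \(\tau = .25\) and the assignment step are our assumptions, not from the slides):

```r
dat_cov <- make_Y0_cov(100)
dat_cov$Z <- randomizr::complete_ra(N = 100)
dat_cov$Y <- dat_cov$Y0 + .25 * dat_cov$Z  # constant additive effect

lm_robust(Y ~ Z, data = dat_cov)$std.error["Z"]      # unadjusted
lm_robust(Y ~ Z + x, data = dat_cov)$std.error["Z"]  # adjusted: smaller if x is prognostic
```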
Power and clustered designs

Given a fixed \(N\), a clustered design is weakly less powered than a non-clustered design. The difference is often substantial.

We have to estimate the variance correctly:
- Clustered standard errors (the usual approach)
- Randomization inference

To increase power, it is better to increase the number of clusters than the number of units per cluster.

How much clustering reduces power depends critically on the intra-cluster correlation, the share of total variance that is between clusters (see the design-effect sketch below).
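One standard way to quantify the power cost of clustering is the design effect for equal-sized clusters, \(1 + (m - 1)\rho\), where \(m\) is the cluster size and \(\rho\) the intra-cluster correlation (our illustration, not from the slides):

```r
## Design effect: variance inflation from clustering, equal cluster sizes
deff <- function(m, icc){ 1 + (m - 1) * icc }
deff(m = 100, icc = .1)  # 10.9: few large clusters are very costly
deff(m = 10, icc = .1)   # 1.9: many small clusters cost much less
```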
A note on clustering in observational research
Clustering is often overlooked, leading to (possibly) wildly understated uncertainty.
Frequentist inference is based on the ratio \(\hat{\beta}/\hat{se}\).
If we underestimate \(\hat{se}\), we are much more likely to reject \(H_0\). (The Type-I error rate is too high.)
Many observational designs are much less powered than we think they are.
Example: Simulation-based power for cluster randomization
```r
## Y0 is fixed in most field experiments. So we only generate it once.
make_Y0_clus <- function(n_indivs, n_clus){
  # n_indivs is the number of people per cluster
  # n_clus is the number of clusters
  clus_id <- gl(n_clus, n_indivs)
  N <- n_clus * n_indivs
  u0 <- fabricatr::draw_normal_icc(N = N, clusters = clus_id, ICC = .1)
  Y0 <- u0
  return(data.frame(Y0 = Y0, clus_id = clus_id))
}
```
```r
test_dat <- make_Y0_clus(n_indivs = 10, n_clus = 100)
# confirm that this produces data with 10 in each of 100 clusters
table(test_dat$clus_id)
```
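`design1_base` is used below but its declaration is not shown in this excerpt. A sketch of how it might be declared, mirroring `design0_base` (an assumption on our part; note the cluster variable must be named `clusters` to match the estimators below, and `n_clus`, `n_indivs`, and `tau` are design parameters that `redesign()` overrides):

```r
## defaults; redesign() below can override these design parameters
n_clus <- 10; n_indivs <- 100; tau <- .25

P1 <- declare_population(
  clusters = add_level(N = n_clus),
  individuals = add_level(N = n_indivs,
    u0 = draw_normal_icc(clusters = clusters, ICC = .1))
)
O1 <- declare_potential_outcomes(Y_Z_0 = u0, Y_Z_1 = Y_Z_0 + tau)
A1 <- declare_assignment(Z = cluster_ra(clusters = clusters))
estimand_ate <- declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0))
R1 <- declare_reveal(Y, Z)

design1_base <- P1 + A1 + O1 + estimand_ate + R1
```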
```r
# three estimators that differ in se_type:
E1a <- declare_estimator(Y ~ Z, model = lm_robust, clusters = clusters,
  se_type = "CR2", label = "CR2 cluster t test", inquiry = "ATE")
E1b <- declare_estimator(Y ~ Z, model = lm_robust, clusters = clusters,
  se_type = "CR0", label = "CR0 cluster t test", inquiry = "ATE")
E1c <- declare_estimator(Y ~ Z, model = lm_robust, clusters = clusters,
  se_type = "stata", label = "stata RCSE t test", inquiry = "ATE")

design1_plus <- design1_base + E1a + E1b + E1c
design1_plus_tosim <- redesign(design1_plus, n_clus = 10, n_indivs = 100, tau = .25)
```
Example: Simulation-based power for cluster randomization (DeclareDesign)
```r
## Only repeat the random assignment, not the creation of Y0. Ignore warning.
## We would want more simulations in practice.
library(dplyr)

set.seed(12355)
design1_sims <- simulate_design(design1_plus_tosim,
  sims = c(1, 1000, rep(1, length(design1_plus_tosim) - 2)))

design1_sims %>%
  group_by(estimator) %>%
  summarize(pow = mean(p.value < .05),
    coverage = mean(estimand <= conf.high & estimand >= conf.low),
    .groups = "drop")
```
# A tibble: 3 × 3
estimator pow coverage
<chr> <dbl> <dbl>
1 CR0 cluster t test 0.155 0.911
2 CR2 cluster t test 0.105 0.936
3 stata RCSE t test 0.131 0.918
Example: Simulation-based power for cluster randomization (DeclareDesign)
```r
## This may be simpler than the above:
library(DesignLibrary) # for block_cluster_two_arm_designer

d1 <- block_cluster_two_arm_designer(
  N_blocks = 1,
  N_clusters_in_block = 10,
  N_i_in_cluster = 100,
  sd_block = 0,
  sd_cluster = .3,
  ate = .25
)
# d1_plus <- d1 + E1b + E1c
# d1_sims <- simulate_design(d1_plus, sims = c(1, 1, 1000, 1, 1, 1, 1, 1))
```
Comments
Know your outcome variable.
What effects can you realistically expect from your treatment?
What is the plausible range of variation of the outcome variable?