10 Machine learning methods re-engineer how we design

10.1 Adaptive experimental designs

In a standard randomized controlled trial (RCT), also called an A/B test, the researcher decides before the experiment how treatments will be allocated among the experimental subjects. Most often, treatments are simply assigned uniformly at random, so that each experimental subject has an equal chance of receiving any treatment. RCTs are standard and easy to implement, but depending on the question of interest they can also be costly and wasteful. For example, suppose that the goal of the experiment is to obtain precise estimates about “good” treatment arms – that is, those that have the strongest effects. Once the experiment is over, the researcher will typically discover that many experimental subjects were assigned to “bad” arms, those that had little or even negative effect. These observations are largely wasted, since they reveal little about the parameters we care about.

In contrast, in an adaptive experimental design, we optimize the data collection process to target a particular question of interest, dynamically updating the allocation probabilities of each treatment during the experiment depending on the data we have already seen. The most common class of adaptive experimental designs is multi-armed bandit algorithms, whose goal is to allocate as many experimental subjects as possible to the best treatment arm. Multi-armed bandit algorithms must strike a balance between allocating treatments at random to learn which arm is optimal, and allocating treatments to the arm that currently appears most likely to be optimal. This tension is known as the “exploration-exploitation” trade-off.

An example of a multi-armed bandit algorithm is Thompson sampling; see Thompson (1933) and Russo et al. (2018). This Bayesian algorithm starts with a prior over the value of each arm and divides the experiment into a sequence of batches. At the end of each batch, posterior beliefs about the value of each arm are updated using the most recent data. Then, in the subsequent batch, each arm is selected with probability equal to the posterior probability that it is the best arm. That is, if the model estimates that a treatment arm has a 30% chance of being the best one, it will select this treatment arm with 30% probability, and then update its estimates depending on how the experimental subjects actually responded to the treatment. This design ensures that as we collect more data, the best-performing arms are selected more and more often.
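
To make the mechanics concrete, below is a minimal sketch of batched Thompson sampling for binary outcomes with Beta(1, 1) priors on each arm. The arm means, batch sizes, and number of posterior draws are hypothetical choices made purely for illustration.

```python
# Illustrative sketch of batched Thompson sampling with binary outcomes and
# Beta(1, 1) priors. The true arm means and batch sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.10, 0.15, 0.30])   # unknown to the experimenter
n_arms = len(true_means)
alpha = np.ones(n_arms)                     # Beta posterior parameters per arm
beta = np.ones(n_arms)

def prob_best(alpha, beta, n_draws=10_000):
    """Monte Carlo estimate of the posterior probability that each arm is best."""
    draws = rng.beta(alpha, beta, size=(n_draws, len(alpha)))
    best = draws.argmax(axis=1)
    return np.bincount(best, minlength=len(alpha)) / n_draws

for batch in range(10):
    p_best = prob_best(alpha, beta)
    # Assign each subject in the batch to arm k with probability p_best[k].
    arms = rng.choice(n_arms, size=100, p=p_best)
    outcomes = rng.binomial(1, true_means[arms])
    # Conjugate posterior update at the end of the batch.
    for k in range(n_arms):
        successes = outcomes[arms == k].sum()
        alpha[k] += successes
        beta[k] += (arms == k).sum() - successes
    print(f"batch {batch}: assignment shares {np.round(p_best, 2)}")
```

Running the loop, the assignment shares drift toward the arm with the highest posterior probability of being best, while arms that still look plausible continue to receive some subjects.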

An extension of multi-armed bandits called contextual bandit algorithms also uses information about the experimental subjects’ observable features, which in this literature are called “contexts.” These algorithms seek to allocate the treatment arm that works best for each experimental subject, as opposed to the treatment arm that works well on average for everyone.
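
The following is a small sketch of a contextual bandit in the simplest possible setting, where contexts are a handful of discrete groups and a separate Beta posterior is kept for each (context, arm) pair. The contexts, arm effects, and horizon below are hypothetical; real applications typically model outcomes as a function of richer context features rather than one independent posterior per cell.

```python
# Toy contextual bandit: Thompson sampling run separately within each
# discrete context. The best arm differs by context, so the algorithm
# learns a context-specific allocation rather than a single average one.
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([[0.2, 0.5],    # context 0: arm 1 is best
                       [0.6, 0.1]])   # context 1: arm 0 is best
n_contexts, n_arms = true_means.shape
alpha = np.ones((n_contexts, n_arms))
beta = np.ones((n_contexts, n_arms))

for t in range(2_000):
    x = rng.integers(n_contexts)              # observe the subject's context
    sampled = rng.beta(alpha[x], beta[x])     # Thompson draw within that context
    k = int(sampled.argmax())
    y = rng.binomial(1, true_means[x, k])
    alpha[x, k] += y
    beta[x, k] += 1 - y

# Posterior means per (context, arm): allocation concentrates on the arm
# that works best for each context, not the arm that is best on average.
print(np.round(alpha / (alpha + beta), 2))
```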

A related class of adaptive experimental design methods concerns best-arm identification, where the goal is to output the best treatment arm with a high degree of certainty at the end of the experiment. A closely related problem is minimizing “policy regret,” also called “simple regret,” where the goal is to maximize the expected value of the arm that is selected at the end of the experiment. An example of an algorithm that targets the latter objective is exploration sampling (Kasy & Sautmann (2019)), which reweights the Thompson sampling probabilities with a concave transformation, ensuring that no arm receives more than a 50% probability of being assigned. This restriction prevents the allocation probabilities from concentrating on a single arm and ensures that there is enough exploration during the experiment.
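
The reweighting step can be written in a few lines. The sketch below assumes the transformation q_k proportional to p_k(1 − p_k) used by exploration sampling, applied to the posterior probabilities p_k that each arm is best (for example, the assignment shares produced by the Thompson sampling sketch above).

```python
# Exploration-sampling reweighting of Thompson probabilities, assuming the
# transformation q_k proportional to p_k * (1 - p_k) from Kasy & Sautmann (2019).
import numpy as np

def exploration_sampling_shares(p_best):
    """Concave reweighting of Thompson probabilities; no arm exceeds 1/2."""
    w = p_best * (1.0 - p_best)
    return w / w.sum()

p_best = np.array([0.80, 0.15, 0.05])        # hypothetical Thompson probabilities
print(exploration_sampling_shares(p_best))   # ~ [0.48, 0.38, 0.14] -- capped below 0.5
```

Even when one arm has an 80% posterior probability of being best, its assignment share stays below one half, so the remaining arms keep receiving subjects and their values continue to be learned.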

One disadvantage of adaptive experiments is that they often require more sophisticated statistical tools to analyze, because the collected observations are no longer independent: treatment assignment in later batches depends on outcomes observed in earlier ones. In recent years there has been growing interest in developing statistical methods that yield valid frequentist inference for adaptively collected data; see Luedtke & van der Laan (2016), Hadad et al. (2019), and Howard et al. (2018).
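
As a simplified illustration of one basic building block behind such methods (not the specific estimators proposed in the papers cited above), each observation can be reweighted by the inverse of the assignment probability that was in force when it was collected. This keeps the arm-mean estimate unbiased even though the assignment probabilities depended on earlier data, although valid confidence intervals require the more careful constructions developed in the cited work.

```python
# Inverse-propensity-weighted estimate of an arm's mean outcome from
# adaptively collected data. A naive sample mean over-represents periods
# when the arm happened to be assigned often; reweighting corrects for this.
import numpy as np

def ipw_arm_mean(arms, outcomes, assign_probs, k):
    """IPW estimate of arm k's mean outcome.

    arms:         (T,)   arm assigned to each subject
    outcomes:     (T,)   observed outcomes
    assign_probs: (T, K) assignment probabilities in force at each period
    """
    w = (arms == k) / assign_probs[:, k]
    return np.mean(w * outcomes)

# Example with made-up data: three subjects, two arms.
arms = np.array([0, 1, 1])
outcomes = np.array([1.0, 0.0, 1.0])
assign_probs = np.array([[0.5, 0.5],
                         [0.3, 0.7],
                         [0.2, 0.8]])
print(ipw_arm_mean(arms, outcomes, assign_probs, k=1))
```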