12 Existing practices for experimental design fall short

With limited resources, researchers and implementing partners often choose pilot settings and interventions that are most likely to have the largest effect for the largest number of people. We then calculate the smallest sample that gives us enough statistical power to detect this effect. These experiments deliver exactly what they are designed for: they often produce a small average treatment effect, and their design and size make it hard to detect how treatments work for different people (i.e., heterogeneous effects).
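To make the sample-size trade-off concrete, here is a minimal sketch based on the standard normal-approximation formula for comparing two means. It assumes equal arm sizes, a common outcome standard deviation, a two-sided test, and two equally sized subgroups. The function names and the example effect of 0.2 standard deviations are hypothetical choices for illustration, not values from any particular study.

```python
from math import ceil
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """Sum of the normal quantiles for a two-sided test at `alpha` and the target power."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)


def n_per_arm_main_effect(effect, sd=1.0, alpha=0.05, power=0.80):
    """Participants per arm needed to detect an average treatment effect of size `effect`."""
    return ceil(2 * (_z_sum(alpha, power) * sd / effect) ** 2)


def n_per_cell_interaction(effect_gap, sd=1.0, alpha=0.05, power=0.80):
    """Participants per cell (arm x subgroup) needed to detect a gap of
    `effect_gap` between the treatment effects of two equally sized subgroups."""
    return ceil(4 * (_z_sum(alpha, power) * sd / effect_gap) ** 2)


if __name__ == "__main__":
    # Hypothetical effect size of 0.2 standard deviations.
    main_total = 2 * n_per_arm_main_effect(0.2)          # two arms
    interaction_total = 4 * n_per_cell_interaction(0.2)  # two arms x two subgroups
    print(main_total, interaction_total)
```

Under these assumptions, detecting a difference between two subgroups takes roughly four times the total sample needed to detect an average effect of the same size. An experiment powered only for the average effect will therefore rarely reveal heterogeneity.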

We could stop there and conclude that heterogeneous treatment effects simply don’t exist. But theories and empirical results from behavioral science suggest that people vary in their reactions to interventions (e.g., Kosorok & Moodie, 2015; Hainmueller et al., 2014; Ascarza, 2018).

Instead, we ask whether we have the right data to describe heterogeneous treatment effects. If we are missing the variables that predict heterogeneous behavior, we will not be able to identify the subgroups with higher or lower effects. And even with the best covariates, the dataset may have too few observations for current methods to describe this behavior.
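A toy example may help show why the right covariate matters. The sketch below uses made-up records (not data from any study) in which a pooled estimate hides two subgroups that respond in opposite directions. The names `ROWS`, `treatment_effect`, and `effects_by_subgroup` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration only: (treated, subgroup, outcome).
ROWS = [
    (0, "A", 10), (0, "A", 12), (1, "A", 14), (1, "A", 16),
    (0, "B", 10), (0, "B", 12), (1, "B", 9), (1, "B", 11),
]


def treatment_effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)


def effects_by_subgroup(rows):
    """Estimate the treatment effect separately within each subgroup."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[1]].append(row)
    return {group: treatment_effect(members) for group, members in by_group.items()}


if __name__ == "__main__":
    print(treatment_effect(ROWS))
    print(effects_by_subgroup(ROWS))
```

In this toy dataset the pooled estimate is +1.5, while subgroup A gains 4 and subgroup B loses 1. Without the subgroup column, only the pooled number would be visible.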

To overcome these limitations, we’ve worked with our advisors to create new principles for designing interventions for machine learning, outlined below: (1) find a data-rich context, (2) build variation into designs, and (3) consider ethics, transparency, and accountability.

12.1 Machine learning requires a data-rich environment with the ability to tailor treatments

Size: think big. The population must be large, and the experiment must be able to include a large sample of that population.
Quality: wide and complete. Wide means the data include variables that are indicative of variation in behavior; in other words, the population is not homogeneous.
Measure of outcomes, or at least a proxy. We must be able to measure outcome variables that capture social impact, or at least a proxy for it.
Meaningful covariates. The data describe the context and people’s previous behavior, not just a person’s demographic details.
Tailored treatments. The promise of personalization is that we can deliver the right treatment to the right individual. The context needs to support diverse treatments, rather than only generic treatments that work the same for everyone, even if there is a risk of an occasional mismatch between individual and treatment.

To design personalized experiments effectively, we need to choose settings that are conducive to a variety of intervention designs. The best environments for this approach are those in which previous research and theory suggest that different treatment arms will produce different responses across groups within the population. We should scope projects where it is feasible to vary the delivery channel and to intervene across multiple touchpoints: for example, some groups might get an email while others get a letter in the mail, or we might intervene at different moments during a long online registration process. It is also important to vary the intervention design itself within the same channel or touchpoint. When deciding whether a context is suitable for this approach, we might ask, for example: can we craft several different messages and send each one to a different subgroup at the same time?

This further highlights the need to observe or collect the relevant information about individuals, so that we can reliably assign treatments that may work well for some individuals but not so well for others. For example, a more graphical interface for a digital intervention might work well for people who find English text difficult to read, but it might require more bandwidth or take more time for native English speakers. The design should therefore gather the relevant information, or give users a choice, so that the designer can customize the experience for one group without making it less effective for other groups.
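The built-in variation described above, crossing delivery channels with message variants, can be sketched as a simple factorial random assignment. This is a minimal illustration, not a prescribed procedure: the function name `assign_arms`, the factor names, and the levels are hypothetical, and a real design would usually also stratify on key covariates.

```python
import random
from itertools import product


def assign_arms(participant_ids, factors, seed=0):
    """Randomly assign participants to every combination of factor levels.

    `factors` maps a factor name (e.g. "channel") to its list of levels.
    Participants are shuffled, then dealt round-robin across the arms, so
    arm sizes differ by at most one.
    """
    names = list(factors)
    arms = list(product(*(factors[name] for name in names)))
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # seeded so the assignment is reproducible
    return {
        pid: dict(zip(names, arms[i % len(arms)]))
        for i, pid in enumerate(ids)
    }


if __name__ == "__main__":
    # Hypothetical factors: 2 channels x 3 message variants = 6 arms.
    design = {"channel": ["email", "letter"], "message": ["A", "B", "C"]}
    for pid, arm in sorted(assign_arms(range(12), design, seed=1).items()):
        print(pid, arm)
```

Recording the seed and the resulting assignment table also provides documentation of how each person came to receive a given treatment.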

12.2 Scaling machine learning requires ethical consideration and accountability

Machine learning poses an intricate problem. Much like humans, machines learn by identifying patterns in large amounts of data. Data, however, are not neutral. Each dataset is gathered in a specific context that may or may not apply accurately to other contexts. Data may contain imperfections that leave them inadvertently imbalanced, incomplete, and/or unrepresentative of the population they purport to represent. Therefore, one does not need to explicitly program a machine to make biased decisions in order for the outcome to be unfair.

For example, the extent to which various groups are represented in a dataset affects the algorithm’s prediction accuracy for those groups; if facial recognition algorithms are trained on data that contain more white faces, they will be more accurate at identifying white faces. Some think of this principle as “bias-in, bias-out.”

This means that as researchers and practitioners, we are responsible for a series of important decisions about which data we use, how we use them, and how we evaluate outcomes.
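One simple way to evaluate outcomes with representation in mind is to report a model’s accuracy separately for each group, alongside each group’s share of the data, instead of a single overall number. The sketch below is a minimal illustration with made-up labels. The function name `accuracy_by_group` and the group labels are hypothetical, and accuracy is only one of many possible criteria.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: (n_examples, accuracy)} for parallel lists of
    true labels, predicted labels, and group membership."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: (total[group], correct[group] / total[group]) for group in total}


if __name__ == "__main__":
    # Made-up labels for illustration: group "a" is over-represented.
    groups = ["a", "a", "a", "a", "b", "b"]
    y_true = [1, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 0]
    for group, (n, acc) in accuracy_by_group(y_true, y_pred, groups).items():
        print(group, n, acc)
```

A large gap between groups, or a group with very few examples, is a signal to revisit data collection before the model is deployed.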

Bias-in, bias-out. Make sure data are as complete and representative of the full population as possible. An algorithm’s designers are tasked with choosing a “label,” or the variable that the model will output as its prediction. Some consider label selection “the single most important decision made in the development of a prediction algorithm.” The term “problem formulation” in data science refers to the task of turning an abstract concept into a concrete variable that can be predicted in a given dataset. At each stage of the development process – including data selection and problem formulation – researchers’ choices play a critical role in avoiding (or perpetuating) bias.

Fairness. Involve key stakeholders in the development of a model and its accuracy and fairness criteria. Questions about fairness in machine learning can be especially controversial because the notion of a “fair” algorithm is itself subjective. While the question of how we should define fairness in algorithms is an important one, perhaps even more pressing is the question of who gets to determine what “fair” means in a given context. Machine learning has the potential to be problematic when designers make decisions about fairness without being required to corroborate those decisions with the communities that the algorithms affect.

Transparency. Pursue computational solutions that make machine learning models easier for humans to interpret. Increasing the level of transparency in algorithms is a strategy to promote both adoption and accountability. Behavioral research shows that humans are more likely to trust algorithmic systems when given robust explanations of how those systems work. By providing community stakeholders with a lens into researchers’ decisions, transparency also holds algorithm designers accountable. Even when models are difficult to explain, keep careful documentation of decision-making throughout the process of creating and deploying algorithms. Share that documentation with community stakeholders as you gather their input.

Accountability. Even when community stakeholders have input and the algorithm is considered both “fair” and “transparent,” there should also be a clear process for appealing and correcting decisions. There is often no recourse for automated decisions, especially when there is limited human involvement in decision-making. Accountability, therefore, requires routine external evaluations that test the outcomes of the model. It also requires that we create systems for people affected by algorithmic decisions to appeal the outcomes, and that we document the appeals and the changes made over time, altering more fundamental aspects of the model if needed.

ideas42 has developed a set of guiding principles that help teams ask the right questions; involve community stakeholders in asking and answering those questions; and implement fair, accountable, and transparent policies.