In an impact evaluation study, researchers attempt to estimate the average effect of exposure to a programme or treatment by comparing outcomes for treatment and control (non-treated) groups to which individuals are randomly assigned (Randomized Controlled Trials, RCTs). The Average Treatment Effect (ATE) is the difference in average outcomes between the individuals/units assigned to treatment and those assigned to control. The main emphasis here is on random assignment, which does two things. First, it increases the probability that the covariates (the variables that describe the characteristics of the sample population) are balanced across the two groups. Second, it ensures that neither the characteristics of the individual/unit considered for the study nor the potential outcomes influence the probability of being assigned to treatment. These two components, similar distributions of the covariates and random treatment assignment, make possible the unbiased and consistent estimation of the sample average treatment effect.
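To make the definition concrete, here is one common way of writing it in potential-outcomes notation. The symbols are introduced here only for illustration: $Y_i(1)$ and $Y_i(0)$ are unit $i$'s outcomes with and without treatment, $T_i$ is the treatment indicator, and $n_1$ and $n_0$ are the numbers of treated and control units.

```latex
\mathrm{ATE} = E\left[ Y_i(1) - Y_i(0) \right],
\qquad
\widehat{\mathrm{ATE}} = \frac{1}{n_1} \sum_{i : T_i = 1} Y_i \;-\; \frac{1}{n_0} \sum_{i : T_i = 0} Y_i
```

Only one of the two potential outcomes is ever observed for a given unit. The simple difference in group means on the right recovers the ATE because random assignment makes $T_i$ independent of the potential outcomes.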

Besides RCTs, researchers also use ‘quasi-experimental approaches’ to estimate the causal impact of a programme, a policy change or any other intervention. These approaches are used when a researcher wants to understand the causal impact of an intervention but random assignment of individuals to treatment and control groups is not possible. For example, suppose the research objective is to understand the effect of smoking on an individual’s well-being. Random assignment is not viable here, as it would be unethical to ask an individual to smoke, or not to smoke, because he/she belongs to the treatment or control group respectively. Instead, researchers have used the exogenous variation introduced by policy changes intended to reduce smoking, such as a drastic increase in price or a restriction on the number of cigarettes a person can buy, to define a target population and its comparison group. Another example is a research objective of estimating the causal impact of a programme targeting women from low-income households that was implemented, say, ten years earlier, which rules out random assignment of individuals to treatment and control groups. In this scenario, the researcher can use covariates, such as the income earned by household members, the education status of women and so on, to predict the treatment-assignment function. Using this function, the researcher can create a comparison group from the population that has not been exposed to the intervention. Since the comparison group is similar to the target (treatment) population, it enables estimation of the causal impact of the intervention.

Quasi-experimental designs include propensity score matching, regression discontinuity design, synthetic controls, difference-in-differences (pre-post with a comparison group) and instrumental variables. Here, I want to explore the Propensity Score Matching technique, owing to my recent experience of using it to match Mahila Samakhya programme participants to non-participants. During this process, I kept pondering the selection of covariates and the emphasis on covariate balancing. The questions I want to explore are: (a) does the need to achieve statistical balance in the covariates bias the sample itself? and (b) how does one arrive at the function to predict the probability of treatment assignment, especially when the function itself is not known?

Let me explain why questions (a) and (b) are inter-related, even though my original enquiry concerned only the selection of covariates. Theoretically, the Propensity Score Matching (PSM) technique enables the construction of treated and control groups by estimating the chance that an individual participates in the programme given the observed covariates (which can be any socio-economic characteristics). The underlying assumption is that the chosen covariates fully capture the probability of programme participation, leaving no significant variation attributable to excluded variables, whether observable or unobservable. The selection of covariates, and their scale, therefore plays a crucial role in ensuring that the estimated function fully determines treatment assignment in the quasi-experimental approach. Rosenbaum and Rubin (1985) showed that balancing on the propensity score is sufficient, provided that adjusting for the observed covariates is enough to account for treatment assignment and that the overlap and stable unit treatment value assumptions hold. In other words, it is sufficient to match individuals on the basis of their probability of exposure given the information in the observed covariates. It is therefore possible to identify two groups, one exposed to the programme and one not exposed to it, on the basis of the probability of exposure rather than on individual variables.
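To illustrate what "estimating the chance of participation given the observed covariates" can look like in practice, here is a minimal sketch using a logit model fitted by plain gradient ascent. Everything in it is an assumption made for illustration: the data are simulated, the two covariates merely stand in for village-level measures, and the function names (`fit_logit`, `propensity_scores`) are hypothetical. It is not the specification used in the Mahila Samakhya study.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logit(X, t, lr=0.1, epochs=2000):
    """Fit a logit of treatment t on covariates X by gradient ascent
    on the average log-likelihood. Returns [intercept, coef_1, ...]."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi)))
            err = ti - p  # derivative of the log-likelihood w.r.t. the linear index
            grad[0] += err
            for j, x in enumerate(xi):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    """Predicted probability of treatment for each row of X."""
    return [sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], xi))) for xi in X]

# Simulated (hypothetical) data: two covariates scaled to [0, 1].
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
# Assumed "true" assignment rule, which is unknown in real observational data.
t = [1 if random.random() < sigmoid(-0.5 + 2.0 * x[0] - 1.5 * x[1]) else 0 for x in X]

beta = fit_logit(X, t)
scores = propensity_scores(X, beta)
```

The point to notice is that the researcher chooses which columns go into `X`, and whether to add squares or interactions. The data do not dictate the functional form.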

Under PSM, we no longer match on each covariate (variables such as age, gender and others) but on the estimated propensity score conditional upon a set of covariates. This resolves the curse of dimensionality, as it reduces several dimensions to a single construct on which matching can be done. Consider an example where the treatment group consists of women who vary in education, caste, marital status and number of household members. Say there is a woman who is 40 years old, has completed a Bachelor’s degree, belongs to a Scheduled Caste, is married, and lives in a household of 5 members. It will be extremely difficult to find another individual, not exposed to the treatment, who is also 40 years old, has completed a Bachelor’s degree, belongs to a Scheduled Caste, is married and lives in a household of 5 members. This problem has been the major impediment to matching, as data limitations restrict the availability of exact matches on all the covariates. Instead, Rosenbaum (1985) showed that it is sufficient to match on a single construct that contains all the information provided by the multiple covariates, thereby alleviating the curse of dimensionality. In addition, it provides the opportunity to work with a continuous construct rather than fixed points of covariates. In other words, you are working with a probability density function rather than a probability mass function (based on a set of discrete variables). While this helps overcome the operational difficulties of matching on the covariates themselves, it introduces a new set of problems, such as researchers’ discretion in variable selection, model dependence and so on.
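Once every unit has a single score, matching becomes a one-dimensional search. The sketch below shows one simple variant, greedy one-to-one nearest-neighbour matching without replacement and with a caliper. The unit labels, the scores and the caliper of 0.05 are all hypothetical values chosen for illustration, and other matching rules (with replacement, optimal matching and so on) are equally possible.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on a score, without replacement.

    treated, controls: dicts mapping unit id -> propensity score.
    Returns {treated_id: control_id}. A treated unit whose closest
    remaining control is farther away than the caliper stays unmatched.
    """
    available = dict(controls)  # copy, so the caller's dict is not modified
    pairs = {}
    # Process treated units from the highest score down, since high-score
    # treated units typically have the fewest plausible controls.
    for tid, ts in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        cid = min(available, key=lambda c: abs(available[c] - ts))
        if abs(available[cid] - ts) <= caliper:
            pairs[tid] = cid
            del available[cid]
    return pairs

# Hypothetical propensity scores for three participants and four non-participants.
treated = {"T1": 0.62, "T2": 0.35, "T3": 0.90}
controls = {"C1": 0.60, "C2": 0.33, "C3": 0.10, "C4": 0.70}

pairs = match_nearest(treated, controls)
```

In this toy example the participant with a score of 0.90 has no non-participant within the caliper and is dropped. This is the overlap condition at work: matching can only be done where both groups have comparable scores.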

Take our case, for example, where the Mahila Samakhya (MS) programme uses broad measures, such as a low female literacy rate and a high percentage of Scheduled Caste and Scheduled Tribe population, to select the villages in which to begin its work. Once a village was chosen, MS officials approached everyone in the village belonging to socially and economically disadvantaged sections, either directly or indirectly, and started the enrolment process. Even with such broad criteria, I have observed that not all women of similar background have a nonzero, positive probability of taking up the programme. What, then, made the participants enrol in the programme? The answer varies if you ask the participants, and, to make matters worse, the responses depend on each individual’s ability to recollect her reason for enrolling. In essence, we are faced with a situation where the treatment-assignment function is not known. This case is not unique: most researchers face this problem when adopting a quasi-experimental approach, such as propensity score matching, to generate their control (or comparison / non-participant) group from observational datasets.

This is where researchers’ discretion seeps in, through the selection of the covariates and of the models used to predict the propensity scores. The basic question is whether the purpose of covariate selection is to discern the treatment-assignment process or to design and implement one. The only guidance from the literature is to include all the covariates, or variables, that can determine either the treatment or the outcome of the programme. Given this broad scope, the researcher has the liberty to include all the variables that he/she thinks are relevant to remove any chance of confounding or unobserved variables biasing the estimation of average treatment effects.

If the idea is to discern the treatment-assignment process and negate the influence of confounding variables, then one would include only the variables that meet this necessary condition, which may not be sufficient for the estimator to perform well. In other words, a researcher would include the variables considered mandatory for determining the treatment-assignment process and negating the influence of confounding variables, which may result in an inefficient but unbiased estimator. Instead, researchers attempt to build a robust and efficient model to predict the propensity scores. See the figure below, which shows an attempt to identify a robust model for predicting the propensity scores in one of the programme districts.

In the figure above, blue dots indicate programme villages and red dots indicate non-programme villages. Scatterplot (A) depicts the programme and (matched) non-programme villages using a logit model with the percentage of SC/ST population and the percentage of female literacy (as used by the programme officials). Scatterplot (B) depicts the programme and (matched) non-programme villages using a logit model with the percentage of SC/ST population, the percentage of female literacy and the square of the percentage of SC/ST population. Look at the distance between the blue and red dots in (A) and (B): the deviation between programme and (matched) non-programme villages increases when the square of the percentage of SC/ST population is introduced. I could equally come up with an estimation function that reduces the deviation between programme and (matched) non-programme villages. This is purely an iterative, trial-and-error process, and it widens the scope for researchers’ discretion.

Such attempts increase the chances of the estimator being biased. As King and Nielsen (2016) rightly point out, “a human making an unconstrained qualitative choice from among a set of different unbiased estimates is a biased estimator”. Further, “there is the actual practice of demonstrating only those specifications which fits the hypothesis” (Ho et al., 2007). Thus, this iterative process increases imbalance, model dependence and bias, which together lead to biased estimation of average treatment effects.

So, what are the solutions to the researchers’ discretion problem? The literature suggests a few: (a) applying propensity score matching only in appropriate cases, where there are high levels of imbalance in the original data; (b) combining propensity score matching with other matching methods such as exact or coarsened matching (King and Nielsen, 2016); (c) describing the circumstances under which adjustment for covariates is sufficient (Joffe and Rosenbaum, 1999); and (d) carefully documenting the various specifications of the propensity score regression and judging the quality of the matches on the basis of the covariates rather than the propensity score (King and Nielsen, 2016). While these are all good suggestions, I don’t think they remove researchers’ discretion [points (a), (b) and (c)] or preserve the relief from the problem of dimensionality that the PSM method provides [point (d)].
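Judging matches on the covariates themselves, as in point (d), is often done with a balance statistic such as the standardized mean difference (SMD). The sketch below is one way to compute it. The village literacy figures are made up for illustration, and the threshold of 0.1 used in the comments is a commonly cited rule of thumb rather than a rule drawn from the papers discussed here.

```python
import math
import statistics

def standardized_mean_difference(treated_values, control_values):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2),
    using sample variances of the two groups."""
    mean_t = statistics.mean(treated_values)
    mean_c = statistics.mean(control_values)
    pooled_sd = math.sqrt(
        (statistics.variance(treated_values) + statistics.variance(control_values)) / 2
    )
    return (mean_t - mean_c) / pooled_sd

# Hypothetical female literacy rates (%) at village level.
programme = [30.0, 35.0, 40.0, 45.0]
all_non_programme = [50.0, 55.0, 60.0, 65.0, 70.0]   # before matching
matched_non_programme = [32.0, 36.0, 41.0, 43.0]     # after matching

smd_before = standardized_mean_difference(programme, all_non_programme)
smd_after = standardized_mean_difference(programme, matched_non_programme)

# Rule-of-thumb check (an assumption, see above): |SMD| below 0.1 on a covariate.
balanced_after = abs(smd_after) < 0.1
```

The check has to be repeated for every covariate, which is exactly where the dimensionality problem returns: with many covariates, some will look balanced and others will not, and the researcher again has to decide which imbalances are tolerable.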

I believe that the central questions for resolving the researchers’ discretion problem are: *(a) Should the researcher concern himself/herself with formulating a function to estimate the probability of treatment, especially when the actual function itself is not known? (b) Is the idea behind selecting covariates and balancing propensity scores to create a comparable sample, or an exact match to the programme participants? What is methodologically better: balancing on covariates or balancing on propensity scores? What is an acceptable quality of matches (not just in terms of distance measures)?* More discussion should take place around these questions, moving towards well-defined empirical strategies that remove the possibility of researchers’ discretion.

**References:**

Ho, Daniel, Kosuke Imai, Gary King and Elizabeth Stuart, 2007, Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference, Political Analysis.

Joffe, M., and Rosenbaum, P. R., 1999, Propensity Scores, American Journal of Epidemiology.

King, Gary, and Richard Nielsen, 2016, Why Propensity Scores Should Not Be Used for Matching, Working Paper.

Shreekanth Mahendiran

Research Advisor, CBPS