Tackling Spurious Associations in Data-Driven Decisions
Decision making based on big data analysis is ubiquitous in today's society. A critical issue in data-driven decision making is that people may, intentionally or unintentionally, obtain a spurious correlation from data and give it a causal interpretation, leading to decisions that are counterproductive for society. A well-known phenomenon is Simpson’s Paradox (SP), whereby the direction of an association at the population level may differ from, or even reverse relative to, the direction within the subgroups that make up that population. For example, suppose an online platform tested a misinformation-debunking application and found that proportionally more of the people who used the tool (treated group) expressed disbelief in falsehoods than of those who did not (control group). It is risky to conclude immediately that the tool is effective: the tool may simply have attracted a large share of educated people, who are naturally more resilient to falsehoods, creating a seemingly optimistic yet distorted association between tool usage and disbelief in falsehoods (i.e., “confounding”). A more worrisome but possible scenario is that distinct subgroups are affected differently. Although the majority of users may benefit from the tool, there might exist a subgroup of vulnerable users whose debunking capability is weakened (i.e., “heterogeneity of treatment effects”). In that case, the debunking tool might widen inequality in the online space.
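The confounding scenario can be made concrete with a small numeric sketch. The counts below are hypothetical, invented purely for illustration, and the names `counts`, `disbelief_rate`, and `association` are assumptions rather than part of any existing tool. They are chosen so that tool users look better off in aggregate while doing slightly worse within each education subgroup.

```python
# Hypothetical counts, invented purely for illustration (not real data):
# (group, used_tool) -> (users, users who expressed disbelief in the falsehood)
counts = {
    ("educated", True): (800, 720),
    ("educated", False): (200, 190),
    ("less_educated", True): (200, 80),
    ("less_educated", False): (800, 400),
}


def disbelief_rate(used_tool, group=None):
    """Share of users expressing disbelief, optionally within one subgroup."""
    rows = [v for (g, t), v in counts.items()
            if t == used_tool and (group is None or g == group)]
    users = sum(n for n, _ in rows)
    disbelieving = sum(d for _, d in rows)
    return disbelieving / users


def association(group=None):
    """Treated-minus-control difference in disbelief rates."""
    return disbelief_rate(True, group) - disbelief_rate(False, group)


overall = association()
by_group = {g: association(g) for g in ("educated", "less_educated")}

print(f"overall: {overall:+.2f}")    # positive: the tool looks helpful in aggregate
for g, diff in by_group.items():
    print(f"{g}: {diff:+.2f}")       # negative within each subgroup
```

In this toy setting the aggregate association is positive only because educated users, who have a high baseline disbelief rate, are over-represented among tool users.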
Many researchers have warned of the biases and pitfalls in data-driven decision making, and policymakers are increasingly motivated to address these concerns about fairness and equity. However, practitioners still lack approaches and tools for addressing SP in their data. My work seeks to discover (discoverability) and understand (interpretability) spurious associations in observational studies, so as to help people make fair and equity-oriented decisions. In particular, I will develop automated approaches that construct binary trees whose nodes represent subpopulations in which the confounding effect is eliminated and the causal effect is homogeneous. I will also construct assessment metrics to characterize a candidate partition. To better facilitate knowledge discovery, I will further develop a visual analytics system that supports the visualization and interpretation of SP in practical data analysis settings. The expected contributions of my work include algorithms, metrics, and systems that help practitioners and policymakers better address the above challenges in data-driven decision making.
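As a rough illustration of what a tree of subpopulations might look like, the sketch below greedily splits the data on the binary covariate whose two subgroups show the most different naive treated-minus-control effects. This is a toy heuristic under stated assumptions, not the algorithm proposed here: the record fields (`t`, `y`, `educated`), the `Node` class, the `build` function, and the `min_gap` threshold are all hypothetical, and the data are invented counts. It does not guarantee that confounding is removed within a leaf.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    rows: list                       # records in this subpopulation
    effect: Optional[float]          # naive treated-minus-control outcome rate
    split_on: Optional[str] = None   # covariate used to split, if any
    left: Optional["Node"] = None    # child with covariate == 0
    right: Optional["Node"] = None   # child with covariate == 1


def effect(rows):
    """Naive effect estimate; None if either treatment arm is empty."""
    treated = [r["y"] for r in rows if r["t"] == 1]
    control = [r["y"] for r in rows if r["t"] == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)


def build(rows, covariates, min_gap=0.05):
    """Greedy toy partition: split where the child effects differ the most."""
    node = Node(rows=rows, effect=effect(rows))
    best = None
    for c in covariates:
        lo = [r for r in rows if r[c] == 0]
        hi = [r for r in rows if r[c] == 1]
        e_lo, e_hi = effect(lo), effect(hi)
        if e_lo is None or e_hi is None:
            continue  # cannot compare effects without both arms in both children
        gap = abs(e_hi - e_lo)
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, c, lo, hi)
    if best is not None:
        _, c, lo, hi = best
        rest = [x for x in covariates if x != c]
        node.split_on = c
        node.left = build(lo, rest, min_gap)
        node.right = build(hi, rest, min_gap)
    return node


def make_rows(educated, t, n, positives):
    """Generate n hypothetical records, the first `positives` with outcome 1."""
    return [{"educated": educated, "t": t, "y": 1 if i < positives else 0}
            for i in range(n)]


# Invented counts for illustration only.
rows = (make_rows(1, 1, 800, 720) + make_rows(1, 0, 200, 190)
        + make_rows(0, 1, 200, 80) + make_rows(0, 0, 800, 400))
tree = build(rows, ["educated"], min_gap=0.02)

print(tree.split_on, tree.effect, tree.left.effect, tree.right.effect)
```

On these invented counts the root shows a positive naive effect, while both children produced by the split on `educated` show negative effects. A partition like this is what a candidate-partition metric or visualization would then need to characterize.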
See part of my proposal slides.