Data Is Dumb (And That's Why Causality Matters)
TL;DR
AI learns patterns, not reasons. Without causality, your model is just an expensive correlation machine.
We live in the "data economy," surrounded by dashboards, models, and petabytes that promise insight. And yet data is dumb. Left to its own devices, raw data can't tell you why anything happens. It can tabulate, correlate, and predict, sometimes with impressive accuracy, but it does not understand cause and effect the way humans do.
The Cult of Correlation
"Data scientist is the sexiest job of the 21st century," we were told. In statistics class, students assert that correlation is not causation without being able to say what causation is.
We've built a data-centric culture where:
- Statistics fetishizes observations. We get very good at summarizing what happened and oddly shy about explaining why it happened.
- Science has drifted toward data-first thinking. If we can collect it, we must be learning, right?
- Big Data is portrayed as the universal solvent. Spoiler: some problems don't dissolve regardless of data volume.
- The "data economy" confuses availability with understanding. More rows in a spreadsheet, still the same blind spots.
We've optimized our ability to describe the world while neglecting our responsibility to understand it.
Prediction Is Not Explanation
In some fields, good predictions need not come with good explanations. Deep learning has dazzled us by demolishing tasks we thought were hard (vision, speech, translation), reminding us that many problems are tractable pattern-matching after all. But let's keep our press releases and our philosophy in separate folders:
- AI is great at pattern completion. That doesn't make it a scientist, a doctor, or a judge of counterfactual worlds.
- Flexibility and commonsense lag. A self-driving car that treats a pedestrian clutching a bottle of whiskey exactly like any other pedestrian isn't being impartial; it's being naive. To behave differently, it must be shown examples, not just told that intoxication matters.
- Raw data drives the process. Which is precisely the problem when the process demands reasoning about unseen contingencies.
The Medicine Example (A.K.A., Why Your A/B Test Isn't a Time Machine)
Data can tell you that patients who took a medicine recovered faster than those who didn't. Encouraging. But why?
Maybe people who can afford the medicine also enjoy better nutrition, safer housing, and responsive healthcare. They might have recovered quickly even without the drug. Your observational dataset can't distinguish the drug's effect from lifestyle, unless you model the causes.
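Here's a minimal sketch of that trap, using entirely made-up numbers and a single hypothetical `wealth` variable standing in for nutrition, housing, and access to care. Wealth drives both who takes the drug and how fast people recover, so the naive takers-vs-non-takers comparison overstates the drug's benefit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical hidden lifestyle factor.
wealth = rng.normal(size=n)

# Wealthier patients are more likely to afford the medicine.
took_drug = rng.random(n) < 1 / (1 + np.exp(-2 * wealth))

# Recovery speed: the drug genuinely helps (+1.0), but wealth helps too (+2.0).
true_drug_effect = 1.0
recovery = true_drug_effect * took_drug + 2.0 * wealth + rng.normal(size=n)

# Naive observational comparison: takers vs. non-takers.
naive = recovery[took_drug].mean() - recovery[~took_drug].mean()

print(f"true causal effect of the drug: {true_drug_effect:.2f}")
print(f"naive observational estimate:   {naive:.2f}")  # well above 1.0
```

The naive number comes out far above the true effect because it bundles the drug's contribution with the wealth gap between the two groups. The data is not lying; it's answering a different question than the one you asked.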
Interventions Need Causal Models
We care about interventions: What happens if we do X instead of Y? Passively collected data—no matter how big your lake or how deep your network—often can't answer that. Sometimes, even infinite data cannot resolve a causal question.
Consider a minimal causal story:
- \(D\): a treatment (e.g., drug dose)
- \(L\): an outcome (e.g., life expectancy)
- \(Z\): disease stage (it influences both dose \(D\) and outcome \(L\))
If \(Z\) affects both \(D\) and \(L\), and we can't measure \(Z\), then we can't separate the true causal impact of \(D\) on \(L\) from the influence of \(Z\). The causal query \(P(L \mid \text{do}(D))\) is not identifiable. Translation: no amount of observational data will reveal the true effect of \(D\) on \(L\), because the confounding path through \(Z\) remains unblocked.
This isn't a small-sample problem; it's a missing-structure problem.
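To see why more rows can't rescue you, here is a hedged sketch using two made-up linear Gaussian worlds. In world A, an unmeasured \(Z\) confounds \(D\) and \(L\); in world B, there is no confounding at all. The two worlds generate statistically indistinguishable observational data on \((D, L)\), yet they give different answers to \(P(L \mid \text{do}(D))\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# World A: unmeasured Z (disease stage) drives both dose D and outcome L.
#   D = Z + noise,  L = 1.0*D + 1.0*Z + noise   -> true effect of do(D) is 1.0
z = rng.normal(size=n)
d_a = z + rng.normal(size=n)
l_a = 1.0 * d_a + 1.0 * z + rng.normal(size=n)

# World B: no confounder at all.
#   D ~ N(0, 2),  L = 1.5*D + noise             -> true effect of do(D) is 1.5
d_b = rng.normal(scale=np.sqrt(2), size=n)
l_b = 1.5 * d_b + rng.normal(scale=np.sqrt(1.5), size=n)

# Observationally the two worlds look identical: same variances, same slope.
for name, d, l in [("A", d_a, l_a), ("B", d_b, l_b)]:
    slope = np.cov(d, l)[0, 1] / np.var(d)
    print(f"world {name}: Var(D)={np.var(d):.2f}  Var(L)={np.var(l):.2f}  slope={slope:.2f}")

# But intervening, do(D = 1), i.e. setting the dose by fiat, answers differently:
print("E[L | do(D=1)] in world A:", 1.0)  # Z no longer moves with D
print("E[L | do(D=1)] in world B:", 1.5)
```

No sample size distinguishes the two worlds from \((D, L)\) alone; only structural knowledge (measuring \(Z\), or randomizing \(D\)) settles which answer is right.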
What We Can't Do with Passive Data
- Answer counterfactuals or design interventions reliably, just by scaling up.
- Generalize decisions to new regimes when the training data never saw those regimes.
- Improvise when context shifts in ways the data never captured.
What to Do Instead
When a causal query is unidentifiable, collecting more of the same data won't help. You don't have a measurement problem; you have a model problem. Do this:
- Refine the causal model.
  - Add scientific knowledge. Find a way to measure \(Z\) (disease stage). New variables, new instruments, new designs.
  - Change the study design. Randomize \(D\) where ethical and feasible; use natural experiments, instrumental variables, or front-door/back-door criteria when appropriate (see the instrumental-variable sketch after this list).
- Make explicit assumptions (carefully).
  - Perhaps assume \(Z\)'s effect on \(D\) is negligible, or that an instrument exists that affects \(D\) but not \(L\) except through \(D\).
  - These assumptions can render \(P(L \mid \text{do}(D))\) identifiable, and also falsifiable. Write them down. Defend them. Test what you can.
- Separate goals: predict vs. decide.
  - If you only need ranking or forecasting, a high-capacity predictor may suffice.
  - If you need to choose actions (dosages, prices, policies), you need causes, not just correlations.
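As a concrete instance of the "assume an instrument exists" strategy, here is a minimal instrumental-variable sketch under made-up assumptions: a hypothetical `encouragement` variable (say, a randomized reminder) shifts the dose \(D\) but touches the outcome \(L\) only through \(D\), while an unmeasured `severity` plays the role of \(Z\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Unmeasured confounder (disease stage): raises the dose, lowers the outcome.
severity = rng.normal(size=n)

# Hypothetical instrument: randomized encouragement that shifts the dose only.
encouragement = rng.normal(size=n)

true_effect = 0.5  # causal effect of dose D on outcome L
dose = 1.0 * severity + 1.0 * encouragement + rng.normal(size=n)
outcome = true_effect * dose - 2.0 * severity + rng.normal(size=n)

# Naive regression of L on D is biased by the unmeasured severity.
naive = np.cov(dose, outcome)[0, 1] / np.var(dose)

# Instrumental-variable (Wald) estimate: Cov(instrument, outcome) / Cov(instrument, dose)
# isolates the path that runs through D.
iv = np.cov(encouragement, outcome)[0, 1] / np.cov(encouragement, dose)[0, 1]

print(f"true effect: {true_effect:.2f}")
print(f"naive slope: {naive:.2f}")   # dragged down by severity
print(f"IV estimate: {iv:.2f}")      # close to 0.5
```

The exclusion restriction (encouragement affects \(L\) only through \(D\)) is exactly the kind of assumption to write down and defend; the estimate is only as good as that assumption.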
The Data Economy, Upgraded
The data economy isn't doomed; it's just missing the instruments to move to the next level. We need assumption ledgers alongside our dashboards, and we need to replace raw accuracy metrics with decision value. Data alone is dumb; data + causal structure is how we do science, medicine, and policy without fooling ourselves.
If you remember only one ironic twist, make it this: the way to make data smarter isn't to make it bigger—it's to teach it why.