
https://www.quantamagazine.org/when-data...-20241002/
EXCERPTS: Data is almost always incomplete. Patients drop out of clinical trials and survey respondents skip questions; schools fail to report scores, and governments ignore elements of their economies. When data goes missing, standard statistical tools, like taking averages, are no longer useful.
“We cannot calculate with missing data, just as we can’t divide by zero,” said Stef van Buuren, the professor of statistical analysis of incomplete data at the University of Utrecht.
Suppose you are testing a new drug to reduce blood pressure. You measure the blood pressure of your study participants every week, but a few get impatient: Their blood pressure hasn’t improved much, so they stop showing up.
You could leave those patients out, keeping only the data of those who completed the study, a method known as complete case analysis. That may seem intuitive, even obvious. It’s also cheating. If you leave out the people who didn’t complete the study, you’re excluding the cases where your drug did the worst, making the treatment look better than it actually is. You’ve biased your results.
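To make that bias concrete, here is a minimal Python sketch, not from the article, using made-up numbers: patients whose blood pressure improves least are the most likely to drop out, so averaging only the completers overstates the drug's effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical trial: the true average drop in blood pressure is 5 mmHg,
# with patient-to-patient variation.
true_effect = 5.0
drop = rng.normal(loc=true_effect, scale=10.0, size=n)

# Patients who improved the least are the most likely to get impatient
# and stop showing up (smaller drop -> higher dropout probability).
p_dropout = 1 / (1 + np.exp(0.3 * drop))
completed = rng.random(n) > p_dropout

# Complete case analysis: average only over the patients who stayed.
print(f"true average drop:          {drop.mean():.2f} mmHg")
print(f"complete-case average drop: {drop[completed].mean():.2f} mmHg")  # biased upward
```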
Avoiding this bias, and doing it well, is surprisingly hard. For a long time, researchers relied on ad hoc tricks, each with their own major shortcomings. But in the 1970s, a statistician named Donald Rubin proposed a general technique, albeit one that strained the computing power of the day. His idea was essentially to make a bunch of guesses about what the missing data could be, and then to use those guesses. This method met with resistance at first, but over the past few decades, it has become the most common way to deal with missing data in everything from population studies to drug trials. Recent advances in machine learning might make it even more widespread.
In the 1970s, Donald Rubin invented and evangelized a new statistical method for dealing with missing data. Though controversial at first, today it’s ubiquitous across many scientific fields.
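The excerpt only hints at what "use those guesses" means, but the idea can be sketched with synthetic data: fill in each missing value several times with a plausible draw rather than a single best prediction, analyze each completed data set, and then pool the results. The pooling rules below follow Rubin's; this is a simplified sketch, and a full implementation would also draw the regression coefficients themselves from their posterior rather than refitting the same line each time.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic data: baseline blood pressure predicts the follow-up value,
# and patients with high baseline values are more likely to be missing.
baseline = rng.normal(150, 15, size=n)
followup = 0.8 * baseline - 20 + rng.normal(0, 8, size=n)
missing = rng.random(n) < 1 / (1 + np.exp(-(baseline - 150) / 10))
y = np.where(missing, np.nan, followup)
obs = ~np.isnan(y)

M = 20                          # number of imputed data sets
estimates, variances = [], []
for _ in range(M):
    # Regress follow-up on baseline using the observed cases ...
    slope, intercept = np.polyfit(baseline[obs], y[obs], 1)
    resid_sd = np.std(y[obs] - (intercept + slope * baseline[obs]), ddof=2)
    # ... and fill each missing value with a draw from that model,
    # not just the single best prediction.
    completed = y.copy()
    completed[~obs] = (intercept + slope * baseline[~obs]
                       + rng.normal(0, resid_sd, size=(~obs).sum()))
    estimates.append(completed.mean())
    variances.append(completed.var(ddof=1) / n)   # squared standard error

# Pool: average the estimates; the total variance combines the average
# within-imputation variance with the spread between imputations.
q_bar = np.mean(estimates)
total_var = np.mean(variances) + (1 + 1 / M) * np.var(estimates, ddof=1)
print(f"pooled mean follow-up: {q_bar:.1f}  (s.e. {np.sqrt(total_var):.2f})")
print(f"true mean follow-up:   {followup.mean():.1f}")
```

The spread between the imputed data sets is exactly the extra uncertainty that a single guess would hide.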
Outside of statistics, to “impute” means to assign responsibility or blame. Statisticians instead assign data. If you forget to fill out your height on a questionnaire, for instance, they might assign you a plausible height, like the average height for your gender.
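A minimal sketch of that kind of single imputation, with made-up questionnaire data (the column names are illustrative):

```python
import pandas as pd

# A questionnaire in which two respondents left their height blank.
df = pd.DataFrame({
    "gender":    ["F", "M", "F", "M", "F", "M"],
    "height_cm": [165.0, 180.0, None, 178.0, 162.0, None],
})

# Single imputation: replace each missing height with the average height
# reported by respondents of the same gender.
df["height_cm"] = df.groupby("gender")["height_cm"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```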
That kind of guess is known as single imputation. A statistical technique that dates back to 1930, single imputation works better than just ignoring missing data. By the 1960s, it was often statisticians’ method of choice. Rubin would change that.
[...] During his doctoral studies, Rubin grew interested in the missing data problem. Though single imputation avoided the bias of complete case analysis, Rubin saw that it had its own flaw: overconfidence. No matter how accurate a guess might seem, statisticians can never be completely sure it’s correct. Techniques involving single imputation often underestimate the uncertainty they introduce. Moreover, while statisticians can find ways to correct for this, Rubin realized that their methods tended to be finicky and specialized, with each situation practically requiring its own master’s thesis. He wanted a method that was both accurate and general, adaptable to almost any situation... (MORE - missing details)
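The overconfidence is easy to see in a small illustration (again not from the article): after mean imputation the data set looks larger and less variable than it really is, so the usual standard error comes out too small.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

full = rng.normal(120, 15, size=n)            # what complete data would look like
y = full.copy()
y[rng.random(n) < 0.4] = np.nan               # 40% of values go missing at random

obs = ~np.isnan(y)
imputed = np.where(obs, y, np.nanmean(y))     # single (mean) imputation

# The filled-in values sit exactly at the mean and are counted as if they
# had been measured, so the spread and the standard error both shrink.
se_full    = full.std(ddof=1) / np.sqrt(n)
se_honest  = y[obs].std(ddof=1) / np.sqrt(obs.sum())   # observed cases only
se_imputed = imputed.std(ddof=1) / np.sqrt(n)

print(f"s.e. with nothing missing:   {se_full:.3f}")
print(f"s.e. from observed cases:    {se_honest:.3f}")
print(f"s.e. after mean imputation:  {se_imputed:.3f}")  # too small: overconfident
```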