Data misuse – or seven ways to fail with statistics
‘Many researchers lack skills in statistics. Major errors and inaccurate conclusions are common’, says Lars Holden at the Norwegian Computing Center.
Large volumes of data are of little use if the researchers lack insight into their own statistical analysis tools. There is no shortage of traps they can fall into.
Statistics are also so complicated that it can be fairly easy to manipulate the results without much risk of being caught.
‘Comprehending data, methods and calculations in a study requires a high level of skill. Few people have the know-how to thoroughly understand statistical analyses performed by someone else’, says Holden, who is managing director of the Norwegian Computing Center (NR).
A crisis
Why are statistics so important in research? Statistical methods enable researchers to examine a sample in order to draw conclusions about a whole. In clinical trials, medicines are tested on groups of research subjects in several rounds before efficacy for the majority can be deduced. A sample population is asked questions about their political preferences so that social scientists can assess current levels of support for the various parties.
The ‘reproducibility crisis’ is the name of the much-discussed phenomenon faced by the research community in recent decades – and it has nothing to do with researchers’ biological reproduction problems. A large proportion of published research findings are impossible to reproduce.
Even with access to the original data and calculation tools, it just cannot be done. The actual data analyses cannot be repeated. Researchers’ misuse and/or lack of understanding of statistics is almost always put forward as a key explanation for the crisis.
‘There is so much that can go wrong. Factors other than those described by the researchers in a study can have an impact’, says Anders Løland, assistant head of research at NR.
Here are seven typical pitfalls:
1. Poor source data
Sample size and selection methods are important choices for all researchers.
‘There is a common misconception that it is always best to have a large sample and large dataset. A sample that is not representative is much worse than a small sample’, says Jan Terje Kvaløy, professor at the University of Stavanger (UiS).
He envisages a hypothetical experiment in which researchers ask 10,000 business leaders about their views on Norway’s affiliation with the EU.
‘It will not exactly provide data that is representative of the population as a whole. A representative sample of 500 people will give a much more accurate picture’, says the professor.
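For readers who want to see the difference in numbers, here is a small simulation sketch in Python. The figures are invented for illustration: suppose 30% of the population as a whole supports closer EU affiliation, but 70% of business leaders do.

```python
import random

random.seed(1)

# Invented figures: true support is 30% in the population as a whole,
# but 70% among business leaders (an unrepresentative subgroup).
TRUE_SUPPORT = 0.30
LEADER_SUPPORT = 0.70

# Large but biased sample: 10,000 business leaders.
biased = [random.random() < LEADER_SUPPORT for _ in range(10_000)]

# Small but representative sample: 500 people drawn from the whole population.
representative = [random.random() < TRUE_SUPPORT for _ in range(500)]

print(f"True support:                      {TRUE_SUPPORT:.0%}")
print(f"Estimate from 10,000 leaders:      {sum(biased) / len(biased):.1%}")
print(f"Estimate from 500 drawn at random: {sum(representative) / len(representative):.1%}")
```

The huge sample lands nowhere near the true figure, while the small random sample comes close.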
Sometimes, the sample bias is less obvious, as in a US study from 2015 that examined 35,000 teenagers who went to A&E after being involved in an ATV (all-terrain vehicle) accident. The study found that those who had not worn a helmet fared better.
‘The problem is that there may have been many ATV accidents where the helmet provided such good protection that those involved did not need to attend A&E’, says Kvaløy.
For the teenagers wearing a helmet, the accident had to be of a certain degree of severity for inclusion in the study.
‘This is a pitfall that’s easy to fall into. It’s called Berkson’s paradox’, says Kvaløy.
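A minimal simulation, with made-up numbers, shows how this selection effect can arise. In the sketch below, helmets reduce the risk of a serious outcome at every accident severity, yet a comparison restricted to A&E patients points the other way, because helmet wearers only end up in A&E after unusually bad accidents.

```python
import random

random.seed(7)

# Made-up model of the ATV example: helmets cut the risk of a serious outcome
# at *every* accident severity, but only substantial head injuries lead to an
# A&E visit, so helmet wearers reach A&E only after unusually bad accidents.
N = 200_000
everyone, ae_patients = [], []

for _ in range(N):
    helmet = random.random() < 0.5
    severity = random.expovariate(1 / 3.0)             # how bad the accident was
    head_injury = severity * (0.3 if helmet else 1.0)  # helmets absorb most of the blow
    goes_to_ae = head_injury > 3.0                     # assumed admission rule
    p_serious = min(1.0, 0.04 * severity) * (0.7 if helmet else 1.0)
    serious = random.random() < p_serious
    everyone.append((helmet, serious))
    if goes_to_ae:
        ae_patients.append((helmet, serious))

def serious_rate(rows, with_helmet):
    outcomes = [s for h, s in rows if h == with_helmet]
    return sum(outcomes) / len(outcomes)

print(f"All riders     - helmet: {serious_rate(everyone, True):.1%}, "
      f"no helmet: {serious_rate(everyone, False):.1%}")
print(f"A&E cases only - helmet: {serious_rate(ae_patients, True):.1%}, "
      f"no helmet: {serious_rate(ae_patients, False):.1%}")
```

In the full data the helmet clearly helps; it is the admission rule, not the helmet, that makes the unhelmeted A&E patients look better off.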
2. Wrong statistical model
Before researchers can start performing calculations on the data they have collected, they have to decide how to group and code the data in a computer and choose a statistical model. This involves selecting suitable statistical tools.
‘There are countless different models out there. There’s obviously a lot that can go wrong here’, says Kvaløy.
Johs Hjellebrekke is professor of sociology at the University of Bergen and director of the Norwegian University Centre in Paris. He emphasises that choices have consequences and must therefore be defensible.
‘It’s not just a case of sitting down at a computer with the dataset, shaking it well and seeing what comes out. The structure we apply to the data through coding, for example, is guaranteed to reappear in the results. So we need to be able to defend our choices for grouping the data analytically’, he says.
Løland at NR also emphasises that all statistical tools rely on a certain number of assumptions.
‘The question is whether these assumptions are valid for the specific dataset’, he says.
Very often there is an assumption that the data are randomly selected.
‘If there are biases in the data material, the researcher must choose another modelling method, one that corrects for this’, he explains.
This may entail using more advanced statistical analyses.
‘The choice of statistical model should be justified’, says Løland.
He emphasises that good choices require knowledge of the phenomenon being investigated, the data you are working with and what relevant statistical models are available.
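One common way of correcting for a known bias in how the data were collected is to reweight the sample. The sketch below illustrates this with invented figures for a survey in which one group is heavily overrepresented; the weighting approach (simple post-stratification) is our illustrative choice, not a method named by the researchers quoted here.

```python
import random

random.seed(3)

# Invented survey: the population is 80% urban and 20% rural, but the sample
# ends up roughly 50/50. Reweighting each respondent by their group's
# population share corrects the estimate; a naive average does not.
POP_SHARE = {"urban": 0.8, "rural": 0.2}   # known from official statistics
SUPPORT = {"urban": 0.30, "rural": 0.60}   # true support per group (unknown to the analyst)

sample = []
for _ in range(5_000):
    group = "urban" if random.random() < 0.5 else "rural"
    sample.append((group, random.random() < SUPPORT[group]))

naive = sum(answer for _, answer in sample) / len(sample)

sample_share = {g: sum(1 for grp, _ in sample if grp == g) / len(sample) for g in POP_SHARE}
weight = {g: POP_SHARE[g] / sample_share[g] for g in POP_SHARE}
weighted = (sum(weight[g] * answer for g, answer in sample)
            / sum(weight[g] for g, _ in sample))

true_value = sum(POP_SHARE[g] * SUPPORT[g] for g in POP_SHARE)
print(f"True support: {true_value:.0%}, naive estimate: {naive:.0%}, weighted estimate: {weighted:.0%}")
```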
3. Fishing for answers
Imagine rolling a die and getting sixes on the first five throws – that would be an improbable coincidence. Yet even over many throws, chance alone can produce an unusually high share of sixes, and a weak data analysis could then conclude that a six is the most likely outcome.
In this case we are fortunate enough to know in advance that, unless someone has tampered with the die, every side is equally likely on each throw. We just need a calculation tool to describe the uncertainty that chance creates.
Researchers have several such statistical tools at their disposal, the most common of which – and currently the most debated – is the p-value. It is a number between 0 and 1, and the lower the p-value, the less likely it is that chance alone would produce a result at least as pronounced as the one observed. In many branches of science, a p-value below 0.05 is typically considered low enough – i.e. a significance level of 5%. This means that, when there is no real effect, an experiment will still give a false positive result about 1 time in 20. A p-value below 0.05 is thus no guarantee that a finding is correct, but it has nevertheless become a magical threshold for getting papers published.
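That 5% false positive rate is easy to see in a small simulation. The sketch below (in Python, with purely invented data) runs 10,000 experiments comparing two groups drawn from exactly the same distribution – no real effect anywhere – and counts how many come out ‘significant’ at the 0.05 threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 10,000 experiments comparing two groups drawn from the *same* distribution,
# i.e. there is no real effect anywhere. Roughly one in twenty still comes out
# "significant" at the usual 0.05 threshold.
n_experiments, n_per_group = 10_000, 50
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(f"Share of 'significant' results despite no real effect: "
      f"{false_positives / n_experiments:.1%}")
```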
A common, serious problem in science is researchers performing many analyses on their data to try and arrive at p-values less than 0.05. They simply go fishing for them, and he who seeks finds.
‘When we run large numbers of tests, there is a high probability that at least one of them will eventually give a p-value of less than 0.05, without there being any real effect’, says Kvaløy at UiS.
This is why it is important to specify in advance which hypotheses are being tested.
‘If you want to perform more tests, you have to compensate for this by requiring a lower p-value for each hypothesis that is tested’, says Kvaløy.
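A sketch of what Kvaløy describes, again with invented data: if 20 unrelated hypotheses are tested in the same study and none of them reflects a real effect, the chance that at least one test dips below 0.05 is large, while a stricter per-test threshold of 0.05 divided by 20 (a Bonferroni-type correction, one common way to compensate) brings the overall false positive rate back to around 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Each simulated study tests 20 unrelated hypotheses on data with no real
# effects. Without correction, most studies produce at least one p < 0.05;
# with a Bonferroni-style threshold of 0.05 / 20, the rate falls back to ~5%.
n_studies, n_tests, n = 2_000, 20, 40
any_raw = any_corrected = 0
for _ in range(n_studies):
    pvals = [stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
             for _ in range(n_tests)]
    any_raw += min(pvals) < 0.05
    any_corrected += min(pvals) < 0.05 / n_tests

print(f"Studies with at least one 'finding' at p < 0.05:  {any_raw / n_studies:.0%}")
print(f"Same, using the corrected threshold of 0.05 / 20:  {any_corrected / n_studies:.0%}")
```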
The well-known eating behaviour researcher Brian Wansink had to leave his job at Cornell University in the United States earlier this year following revelations about errors and problematic use of statistics in a number of widely discussed and cited research articles. Many of them have now been retracted.
Interestingly, it was Wansink’s own comments that triggered the investigation. In a blog post, he encouraged students to test a number of hypotheses on datasets if they did not find what they were looking for in the first round. Even after he lost his job and journals retracted his articles, he maintained that his methods and findings were sound.
‘Much of what goes wrong is not about tricks and bad attitudes. Maybe it’s just a matter of lack of knowledge – that you don’t properly think about what you’re doing. Nevertheless, data fishing represents a negative culture’, says Løland at NR.
Fishing of this type has several names, including p-hacking, significance hunting, selective inference, data dredging, cherry picking and data torture. The search for a low p-value is not typically reported in research articles.
4. Confusing statistical significance with relevance
A low p-value and a significant result do not necessarily mean that a research finding is particularly interesting or relevant; to assess this, we must look at the size of the demonstrated effect.
Researchers’ access to data is increasing, and more and more studies have large samples. In large groups, even small effects can be detected with a low p-value. The researcher may then be fairly confident that the effect is real, yet it may be so small that it means nothing in practice.
‘This is yet another reason to be sceptical of p-values. In medicine, for example, a distinction is often made between statistical significance and clinical significance’, says Kvaløy at UiS.
Let us imagine a newspaper headline about how eating a lot of liquorice doubles the risk of a certain type of cancer. Double the risk! It sounds dramatic, but perhaps it is referring to a type of cancer that almost no one is affected by. And how many people actually eat large amounts of liquorice? The finding may mean very little in practice.
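A small sketch, again with invented numbers, shows how this happens: give two groups a genuine but trivially small difference, make the groups large enough, and the p-value drops far below 0.05 even though no patient would ever notice the effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Invented example: two groups of two million people with a genuine but
# trivial difference in mean blood pressure (0.1 mmHg). The p-value lands far
# below 0.05, yet the effect is far too small to matter to any patient.
n = 2_000_000
control = rng.normal(loc=120.0, scale=15.0, size=n)
treatment = rng.normal(loc=119.9, scale=15.0, size=n)

_, p = stats.ttest_ind(control, treatment)
print(f"Difference in means: {control.mean() - treatment.mean():.2f} mmHg, p-value: {p:.1g}")
```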
5. Simpson’s paradox
‘Researchers must also be vigilant when comparing groups of different sizes’, emphasises Løland at NR.
He gives an example relating to murder, discrimination and the death penalty in the United States. In 1981, the American researcher Michael Radelet found that there was no strong correlation between the defendant’s race and a death sentence. Nevertheless, he showed that the American judicial system discriminates against black people in such cases. How was that possible?
In the first round, he divided the data into two groups: black and white defendants. The results showed that white people received the death penalty in 11% of cases, while the corresponding proportion for black people was 8%.
The picture changed dramatically when Radelet divided the two groups into subgroups according to the ethnicity of the victim: white defendants who had killed a white person (11% received the death penalty), white defendants who had killed a black person (0%), black defendants who had killed a white person (23%), and black defendants who had killed a black person (3%).
Thus, the death penalty was much more common in cases where the victim was white, and especially so if the defendant was black.
‘The danger of Simpson’s paradox is that there may be differences in subgroups that the researchers have not identified. I think people often forget this’, says Løland.
It can therefore be crucial to refine the analysis and spend time understanding what the question posed actually entails – preferably when planning how to collect the source data.
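The arithmetic behind the reversal is easy to reconstruct. The counts in the sketch below are invented to roughly match the percentages quoted above – they are not Radelet’s original figures – but they show how aggregating over the victim’s ethnicity flips the comparison.

```python
# Illustrative counts, invented to roughly reproduce the percentages in the
# article (not Radelet's original figures). Each entry is
# (death sentences, total cases) for one defendant/victim combination.
cases = {
    ("white defendant", "white victim"): (17, 150),   # ~11%
    ("white defendant", "black victim"): (0, 10),     #   0%
    ("black defendant", "white victim"): (9, 39),     # ~23%
    ("black defendant", "black victim"): (4, 120),    #  ~3%
}

# Within each victim group, black defendants are sentenced to death more often.
for (defendant, victim), (deaths, total) in cases.items():
    print(f"{defendant}, {victim}: {deaths / total:.0%}")

# Aggregated over victims, the comparison flips: most defendants had killed
# someone of their own ethnicity, and white-victim cases carry a much higher
# death-penalty rate, which drags the white defendants' overall figure up.
for race in ("white defendant", "black defendant"):
    deaths = sum(d for (dfd, _), (d, _) in cases.items() if dfd == race)
    total = sum(t for (dfd, _), (_, t) in cases.items() if dfd == race)
    print(f"{race}, all cases: {deaths / total:.0%}")
```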
6. Confusing correlation with causation
False causal relationships or so-called spurious correlations are an ongoing problem in science. Although a statistical correlation has been demonstrated between, for example, little sleep and heart disease, it is by no means certain that little sleep causes heart disease.
So-called causal analyses were originally designed for experimental research designs in the natural sciences.
‘In the social sciences, there has been discussion about whether certain types of causal analyses are in any way justifiable from a purely epistemological perspective’, explains sociologist Hjellebrekke.
In the example of little sleep and heart disease, stress may be the underlying factor that affects both sleep and heart disease risk.
‘Despite misinterpretation of statistical correlations as causal relationships being a well-known, classic pitfall, we still constantly see this happening’, says Kvaløy at UiS.
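A small simulation, with invented numbers, shows how a confounder such as stress can manufacture a strong correlation between short sleep and heart disease risk even when sleep itself has no effect at all.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented illustration: stress is the only thing that affects heart disease
# risk here, but because stress also reduces sleep, short sleep and heart
# disease end up strongly correlated without sleep causing anything at all.
n = 50_000
stress = rng.normal(size=n)
sleep_hours = 7.5 - 0.8 * stress + rng.normal(scale=0.5, size=n)
heart_risk = 0.10 + 0.05 * stress + rng.normal(scale=0.02, size=n)

corr_all = np.corrcoef(sleep_hours, heart_risk)[0, 1]
print(f"Correlation between sleep and heart risk: {corr_all:.2f}")

# A crude way of 'holding the confounder fixed': look only at people with
# roughly the same stress level. The apparent sleep effect largely vanishes.
band = np.abs(stress) < 0.1
corr_band = np.corrcoef(sleep_hours[band], heart_risk[band])[0, 1]
print(f"Same correlation at a fixed stress level: {corr_band:.2f}")
```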
7. Incomplete reporting
The above points show the multitude of choices faced by a researcher when analysing a dataset. The choices affect the result, but researchers do not necessarily report on all their choices when writing research articles.
On top of all the biases this can cause, there is another major, serious problem in science today – publication bias. It is much easier to get articles with clear findings published, and much harder to publish articles with null results. The latter therefore often end up in a drawer somewhere.
Why is that a problem? If we think back to the point about data fishing, we will remember that all research has a certain risk of false positive findings. If it is only ever the positive findings that are published, a very inaccurate picture will gradually emerge.
‘This is not a type of error made by individual researchers, but rather a fault of the entire scientific system’, says Løland.
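A simple simulation, once more with invented numbers, shows how the distortion arises: if only the studies that reach p < 0.05 make it into print, the published record ends up exaggerating a small true effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Invented illustration of publication bias: a small true effect is studied
# 5,000 times, but only studies reaching p < 0.05 are 'published'. The
# published record then systematically exaggerates the effect.
true_effect, n, n_studies = 0.1, 50, 5_000
all_estimates, published = [], []
for _ in range(n_studies):
    treatment = rng.normal(loc=true_effect, size=n)
    control = rng.normal(loc=0.0, size=n)
    estimate = treatment.mean() - control.mean()
    _, p = stats.ttest_ind(treatment, control)
    all_estimates.append(estimate)
    if p < 0.05:                                   # only 'clear findings' get published
        published.append(estimate)

print(f"True effect:                       {true_effect:.2f}")
print(f"Average estimate, all studies:     {np.mean(all_estimates):.2f}")
print(f"Average estimate, published only:  {np.mean(published):.2f}")
```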
‘Phone a statistician’
Part of the solution to the reproducibility crisis may be to promote a better and less mechanical understanding of statistics. For example, the p-value should not be described as a means for obtaining the evidence needed for publication. After all, statistics are not a gatekeeper of truth.
Statistics are a language that scientists use to talk about probability and uncertainty, but that language can quickly become rather complicated.
‘It’s easy to make mistakes, but many still feel that their skills are good enough’, says Holden from NR.
Rather than sending all researchers to a range of statistics courses, he believes it is better to develop a culture for bringing in the real experts.
‘Phone a statistician, don’t delay’, says Holden.
‘Do it before you start the data collection. Post-mortem statistics are not always much fun’, says Kvaløy.
Translated from Norwegian by Carole Hognestad, Akasie språktjenester AS.