*P* values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.

[Illustration: Dale Edwin Murray]

For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white.

The results were “plain as day”, recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. “The hypothesis was sexy,” he says, “and the data provided clear support.” The *P* value, a common index for the strength of evidence, was 0.01 — usually interpreted as 'very significant'. Publication in a high-impact journal seemed within Motyl's grasp.

But then reality intervened. Sensitive to controversies over reproducibility, Motyl and his adviser, Brian Nosek, decided to replicate the study. With extra data, the *P* value came out as 0.59 — not even close to the conventional level of significance, 0.05. The effect had disappeared, and with it, Motyl's dreams of youthful fame^{1}.

It turned out that the problem was not in the data or in Motyl's analyses. It lay in the surprisingly slippery nature of the *P* value, which is neither as reliable nor as objective as most scientists assume. “*P* values are not doing their job, because they can't,” says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.

For many scientists, this is especially worrying in light of the reproducibility concerns. In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false^{2}; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.

At the same time, statisticians are looking for better ways of thinking about data, to help scientists to avoid missing important information or acting on false alarms. “Change your statistical philosophy and all of a sudden different things become important,” says Steven Goodman, a physician and statistician at Stanford. “Then 'laws' handed down from God are no longer handed down from God. They're actually handed down to us by ourselves, through the methodology we adopt.”

**Out of context**

*P* values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor's new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny^{3}. One researcher suggested rechristening the methodology “statistical hypothesis inference testing”^{3}, presumably for the acronym it would yield.

The irony is that when UK statistician Ronald Fisher introduced the *P* value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a 'null hypothesis' that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil's advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the *P* value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.
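Fisher's recipe can be sketched as a small permutation test: assume the null hypothesis, then count how often chance alone produces a result at least as extreme as the one observed. The data below are invented purely for illustration.

```python
import random

random.seed(42)

# Invented scores for two small groups.
group_a = [4.1, 3.8, 4.5, 4.0, 4.3, 3.9]
group_b = [3.6, 3.9, 3.5, 3.7, 4.0, 3.4]

observed = abs(sum(group_a) / 6 - sum(group_b) / 6)

# Null hypothesis: no difference between the groups, so the labels are
# arbitrary.  Shuffle them many times and count how often chance alone
# produces a difference at least as extreme as the one observed.
pooled = group_a + group_b
n_perm = 10_000
n_extreme = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6)
    if diff >= observed:
        n_extreme += 1

p_value = n_extreme / n_perm   # the P value, Fisher-style
print(f"observed difference = {observed:.3f}, P = {p_value:.4f}")
```

The smaller the resulting *P*, the less comfortably the observed difference sits with the "no difference" hypothesis; in Fisher's usage, that was a cue for a second look, not a verdict.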

[Graphic 'Probable cause': R. Nuzzo; source: T. Sellke et al. *Am. Stat.* **55**, 62–71 (2001)]

For all the *P* value's apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded in the late 1920s by Fisher's bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts now familiar from introductory statistics classes. They pointedly left out the *P* value.

But while the rivals feuded — Neyman called some of Fisher's work mathematically “worse than useless”; Fisher called Neyman's approach “childish” and “horrifying [for] intellectual freedom in the west” — other researchers lost patience and began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher's easy-to-calculate *P* value into Neyman and Pearson's reassuringly rigorous rule-based system. This is when a *P* value of 0.05 became enshrined as 'statistically significant', for example. “The *P* value was never meant to be used the way it's used today,” says Goodman.

**What does it all mean?**

One result is an abundance of confusion about what the *P* value means^{4}. Consider Motyl's study about political extremists. Most scientists would look at his original *P* value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The *P* value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the *P* value is.

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see 'Probable cause'). According to one widely used calculation^{5}, a *P* value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a *P* value of 0.05 raises that chance to at least 29%. So Motyl's finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another 'very significant' result^{6,7}. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.
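The "widely used calculation" cited here is the Sellke et al. bound, under which the minimum Bayes factor in favour of the null hypothesis is -e·p·ln(p). A short sketch, assuming 50:50 prior odds that a real effect exists, reproduces the 11% and 29% figures:

```python
import math

def min_false_alarm_prob(p, prior_prob_effect=0.5):
    # Sellke et al. (2001) lower bound: the minimum Bayes factor in favour
    # of the null hypothesis is -e * p * ln(p), valid for p < 1/e.
    bf = -math.e * p * math.log(p)
    prior_odds_null = (1 - prior_prob_effect) / prior_prob_effect
    posterior_odds_null = bf * prior_odds_null
    # Smallest possible probability that the 'significant' result is a
    # false alarm, given these prior odds.
    return posterior_odds_null / (1 + posterior_odds_null)

print(f"P = 0.01 -> false-alarm probability >= {min_false_alarm_prob(0.01):.0%}")
print(f"P = 0.05 -> false-alarm probability >= {min_false_alarm_prob(0.05):.0%}")
```

Because this is a lower bound, the true false-alarm probability is at least this large; for less plausible hypotheses (smaller prior odds of a real effect), it is larger still.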

Critics also bemoan the way that *P* values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed^{8} that those who meet their spouses online are less likely to divorce (*p* < 0.002) and more likely to have high marital satisfaction (*p* < 0.001) than those who meet offline (see *Nature* http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale. To pounce on tiny *P* values and ignore the larger question is to fall prey to the “seductive certainty of significance”, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: “We should be asking, 'How much of an effect is there?', not 'Is there an effect?'”
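To see how a large sample can turn a small effect "significant", here is a sketch of a two-proportion z-test with counts invented to mimic the reported divorce rates; the group sizes are assumptions, not the study's actual design:

```python
import math

# Counts invented to mimic the reported divorce rates (5.96% vs 7.67%)
# with roughly 9,500 people per group; the real study's design differed.
n1, x1 = 9500, round(0.0596 * 9500)   # met online
n2, x2 = 9500, round(0.0767 * 9500)   # met offline

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal approximation

print(f"absolute difference = {p2 - p1:.2%}, P = {p_value:.2g}")
# Highly 'significant', yet the divorce rate moves by under two
# percentage points: significance is not practical importance.
```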

Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term *P*-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “*P*-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: “That finding seems to have been obtained through *p*-hacking, the authors dropped one of the conditions so that the overall *p*-value would be less than .05”, and “She is a *p*-hacker, she always monitors data while it is being collected.”

Such practices have the effect of turning discoveries from exploratory studies — which should be treated with scepticism — into what look like sound confirmations but vanish on replication. Simonsohn's simulations have shown^{9} that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%. *P*-hacking is especially likely, he says, in today's environment of studies that chase small effects hidden in noisy data. It is tough to pin down how widespread the problem is, but Simonsohn has the sense that it is serious. In an analysis^{10}, he found evidence that many published psychology papers report *P* values that cluster suspiciously around 0.05, just as would be expected if researchers fished for significant *P* values until they found one.
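A stripped-down simulation (not Simonsohn's actual code) shows the mechanism: when both groups come from the same distribution, testing several outcomes and keeping only the best one inflates the false-positive rate well beyond the nominal 5%.

```python
import math
import random

random.seed(0)

def two_sided_p(mean_a, mean_b, n, sd=1.0):
    # Z-test comparing two group means with known sd (simplified for the demo).
    z = (mean_a - mean_b) / (sd * math.sqrt(2 / n))
    return math.erfc(abs(z) / math.sqrt(2))

def run_study(n=20, n_outcomes=4):
    # Both groups are drawn from the SAME distribution, so any 'significant'
    # result is a false alarm.  The p-hacker measures several outcomes and
    # reports only the smallest P value.
    p_values = []
    for _ in range(n_outcomes):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        p_values.append(two_sided_p(sum(a) / n, sum(b) / n, n))
    return min(p_values)

n_sims = 2000
rate = sum(run_study() < 0.05 for _ in range(n_sims)) / n_sims
print(f"false-positive rate when keeping the best of 4 outcomes: {rate:.1%}")
# Honest testing of one pre-chosen outcome would give ~5%; cherry-picking
# inflates it to roughly 1 - 0.95**4, about 19%.
```

Simmons et al. reached rates near 60% by stacking several such flexible choices at once (optional stopping, extra covariates, dropped conditions), not just multiple outcomes.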

**Numbers game**

Despite the criticisms, reform has been slow. “The basic framework of statistics has been virtually unchanged since Fisher, Neyman and Pearson introduced it,” says Goodman. John Campbell, a psychologist now at the University of Minnesota in Minneapolis, bemoaned the issue in 1982, when he was editor of the *Journal of Applied Psychology*: “It is almost impossible to drag authors away from their *p*-values, and the more zeroes after the decimal point, the harder people cling to them”^{11}. In 1989, when Kenneth Rothman of Boston University in Massachusetts started the journal *Epidemiology*, he did his best to discourage *P* values in its pages. But he left the journal in 2001, and *P* values have since made a resurgence.

Ioannidis is currently mining the PubMed database for insights into how authors across many fields are using *P* values and other statistical evidence. “A cursory look at a sample of recently published papers,” he says, “is convincing that *P* values are still very, very popular.”

Any reform would need to sweep through an entrenched culture. It would have to change how statistics is taught, how data analysis is done and how results are reported and interpreted. But at least researchers are admitting that they have a problem, says Goodman. “The wake-up call is that so many of our published findings are not true.” Work by researchers such as Ioannidis shows the link between theoretical statistical complaints and actual difficulties, says Goodman. “The problems that statisticians have predicted are exactly what we're now seeing. We just don't yet have all the fixes.”

Statisticians have pointed to a number of measures that might help. To avoid the trap of thinking about results as significant or not significant, for example, Cumming thinks that researchers should always report effect sizes and confidence intervals. These convey what a *P* value does not: the magnitude and relative importance of an effect.
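As a minimal sketch of Cumming's advice, with an effect size and standard error assumed purely for illustration, a 95% confidence interval under the normal approximation is:

```python
# Assumed numbers for illustration: a mean difference between two groups
# and its standard error.
effect = 0.16   # e.g. points on a 7-point satisfaction scale
se = 0.05

# 95% confidence interval under the normal approximation.
ci_low = effect - 1.96 * se
ci_high = effect + 1.96 * se
print(f"effect = {effect}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
# The interval conveys magnitude and precision: 'how much of an effect',
# not merely 'is there an effect'.
```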

Many statisticians also advocate replacing the *P* value with methods that take advantage of Bayes' rule: an eighteenth-century theorem that describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity — something that the statistical pioneers were trying to avoid. But the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.
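A minimal sketch of such a Bayesian update, with invented numbers for a screening test, shows how the conclusion is driven by the prior plausibility as well as by the new evidence:

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    # Bayes' rule: updated plausibility of a hypothesis after seeing evidence.
    numer = prior * p_evidence_if_true
    return numer / (numer + (1 - prior) * p_evidence_if_false)

# Invented screening example: 1% prevalence, 90% sensitivity,
# 5% false-positive rate.
p = posterior(0.01, 0.90, 0.05)
print(f"after one positive test: {p:.1%}")

# Evidence accumulates: the posterior becomes the prior for the next test.
p2 = posterior(p, 0.90, 0.05)
print(f"after a second positive test: {p2:.1%}")
```

Despite a "positive" result, the low prior keeps the first posterior modest; only accumulating evidence makes the hypothesis probable, which is exactly the headache-versus-brain-tumour intuition in quantitative form.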

Others argue for a more ecumenical approach, encouraging researchers to try multiple methods on the same data set. Stephen Senn, a statistician at the Centre for Public Health Research in Luxembourg City, likens this to using a floor-cleaning robot that cannot find its own way out of a corner: any data-analysis method will eventually hit a wall, and some common sense will be needed to get the process moving again. If the various methods come up with different answers, he says, “that's a suggestion to be more creative and try to find out why”, which should lead to a better understanding of the underlying reality.

Simonsohn argues that one of the strongest protections for scientists is to admit everything. He encourages authors to brand their papers '*P*-certified, not *P*-hacked' by including the words: “We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study.” This disclosure will, he hopes, discourage *P*-hacking, or at least alert readers to any shenanigans and allow them to judge accordingly.

A related idea that is garnering attention is two-stage analysis, or 'preregistered replication', says political scientist and statistician Andrew Gelman of Columbia University in New York City. In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings, and would publicly preregister their intentions in a database such as the Open Science Framework (https://osf.io). They would then conduct the replication studies and publish the results alongside those of the exploratory studies. This approach allows for freedom and flexibility in analyses, says Gelman, while providing enough rigour to reduce the number of false alarms being published.

More broadly, researchers need to realize the limits of conventional statistics, Goodman says. They should instead bring into their analysis elements of scientific judgement about the plausibility of a hypothesis and study limitations that are normally banished to the discussion section: results of identical or similar experiments, proposed mechanisms, clinical knowledge and so on. Statistician Richard Royall of Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, said that there are three questions a scientist might want to ask after a study: 'What is the evidence?' 'What should I believe?' and 'What should I do?' One method cannot answer all these questions, Goodman says: “The numbers are where the scientific discussion should start, not end.”

## References

1. Nosek, B. A., Spies, J. R. & Motyl, M. *Perspect. Psychol. Sci.* **7**, 615–631 (2012).
2. Ioannidis, J. P. A. *PLoS Med.* **2**, e124 (2005).
3. Lambdin, C. *Theory Psychol.* **22**, 67–90 (2012).
4. Goodman, S. N. *Ann. Internal Med.* **130**, 995–1004 (1999).
5. Goodman, S. N. *Epidemiology* **12**, 295–297 (2001).
6. Goodman, S. N. *Stat. Med.* **11**, 875–879 (1992).
7. Gorroochurn, P., Hodge, S. E., Heiman, G. A., Durner, M. & Greenberg, D. A. *Genet. Med.* **9**, 325–331 (2007).
8. Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L. & VanderWeele, T. J. *Proc. Natl Acad. Sci. USA* **110**, 10135–10140 (2013).
9. Simmons, J. P., Nelson, L. D. & Simonsohn, U. *Psychol. Sci.* **22**, 1359–1366 (2011).
10. Simonsohn, U., Nelson, L. D. & Simmons, J. P. *J. Exp. Psychol.* http://dx.doi.org/10.1037/a0033242 (2013).
11. Campbell, J. P. *J. Appl. Psych.* **67**, 691–700 (1982).

## FAQs

### What is statistical error in research methodology? ›

Error (statistical error) **describes the difference between a value obtained from a data collection process and the 'true' value for the population**. The greater the error, the less representative the data are of the population. Data can be affected by two types of error: sampling error and non-sampling error.

### How can statistical data be misinterpreted? ›

The data can be misleading **due to the sampling method used to obtain them**. For instance, the size and type of sample used in any statistic play a significant role: many polls and questionnaires target certain audiences that provide specific answers, resulting in small, biased samples.

### How do scientists manage error in their research? ›

Four ways to reduce scientific errors are by **tests of equipment and programs, examination of results, peer review, and replication**.

### How do you avoid statistical errors? ›

**Increase the sample size**

The sample size primarily determines the amount of sampling error, which translates into the ability to detect the differences in a hypothesis test. A larger sample size increases the chances to capture the differences in the statistical tests, as well as raises the power of a test.

### What are the types of statistical errors? ›

Two potential types of statistical error are Type I error (α, or level of significance), when one falsely rejects a null hypothesis that is true, and Type II error (β), when one fails to reject a null hypothesis that is false.
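A quick simulation makes the two error types concrete; it uses a known-variance z-test and invented parameters for simplicity:

```python
import math
import random

random.seed(1)

def p_value(sample, mu0=0.0, sd=1.0):
    # Z-test of H0: the population mean equals mu0 (sd known, for simplicity).
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

n_sims, n, alpha = 4000, 25, 0.05

# Type I error: the null is TRUE (the true mean really is 0), yet we reject it.
type1 = sum(p_value([random.gauss(0.0, 1) for _ in range(n)]) < alpha
            for _ in range(n_sims)) / n_sims

# Type II error: the null is FALSE (the true mean is 0.4), yet we fail to reject.
type2 = sum(p_value([random.gauss(0.4, 1) for _ in range(n)]) >= alpha
            for _ in range(n_sims)) / n_sims

print(f"Type I rate  ~ {type1:.3f} (should be close to alpha = {alpha})")
print(f"Type II rate ~ {type2:.3f}, so power ~ {1 - type2:.3f}")
```

The Type I rate hovers around the chosen α, while the Type II rate depends on the true effect size and the sample size, which is why power calculations matter.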

### What are the main sources of statistical errors? ›

There are five fundamental sources of statistical error: **sampling, measurement, estimation, hypothesis testing, and reporting**. Bias is error with a consistent tendency in one direction.

### When can statistics be misleading? ›

Misleading statistics refers to the **misuse of numerical data either intentionally or by error**. The results provide deceiving information that creates false narratives around a topic. Misuse of statistics often happens in advertisements, politics, news, media, and others.

### How statistics can be misleading examples? ›

In 2007, **toothpaste company Colgate ran an ad stating that 80% of dentists recommend their product**. Based on the promotion, many shoppers assumed Colgate was the best choice for their dental health. But this wasn't necessarily true. In reality, this is a famous example of misleading statistics.

### Why statistics are not reliable? ›

The studies are often not repeatable and usually not predictive. The reason for this is that **people, and what they say or do, are the basis of the statistics**. It seems axiomatic that people will perversely refuse to say or do the same thing twice running, or to let anyone predict what they will do.

### Which statistical error is avoidable? ›

**Type I error** is equivalent to a false positive. It occurs when we believe we have found a significant difference when there isn't one. Typically, researchers set alpha (α) to .05, limiting the chance of making a Type I error to 5%.

### What is the basic error in statistical work? ›

Common errors encountered during statistical application include, but are not limited to: **choosing the wrong test for particular data**, choosing the wrong test for the proposed hypothesis, and a falsely elevated Type I error rate during post-hoc significance analysis.

### How can errors in hypothesis testing be reduced? ›

One of the most common approaches to minimizing the probability of getting a false positive error is to **minimize the significance level of a hypothesis test**. Since the significance level is chosen by a researcher, the level can be changed. For example, the significance level can be minimized to 1% (0.01).

### What are the strategies to eliminate errors? ›

**Five ways to reduce errors based on reliability science**

- Standardize your approach. ...
- Use decision aids and reminders. ...
- Take advantage of pre-existing habits and patterns. ...
- Make the desired action the default, rather than the exception. ...
- Create redundancy.

### How do you interpret errors in statistics? ›

**The higher the error value, the less representative the data are of the population.** In simple words, a statistical error is the difference between a measured value and the true value of the collected data. The larger the error, the less reliable the data.

### What is standard error in statistics? ›

What is standard error? The standard error of the mean, or simply standard error, **indicates how different the population mean is likely to be from a sample mean**. It tells you how much the sample mean would vary if you were to repeat a study using new samples from within a single population.
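A sketch of the computation, with invented data:

```python
import math

sample = [12.1, 9.8, 11.4, 10.6, 12.9, 10.2, 11.7, 9.5]  # invented data
n = len(sample)
mean = sum(sample) / n

# Sample standard deviation (n - 1 in the denominator: Bessel's correction).
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Standard error of the mean: how much the sample mean would vary across
# repeated samples of this size from the same population.
se = sd / math.sqrt(n)
print(f"mean = {mean:.3f}, sd = {sd:.3f}, standard error = {se:.3f}")
```

Because the standard error shrinks with the square root of the sample size, quadrupling the sample roughly halves it.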

### What is a Type 2 error in statistics example? ›

A type II error produces a false negative, also known as an error of omission. For example, **a test for a disease may report a negative result when the patient is infected**. This is a type II error because we accept the conclusion of the test as negative, even though it is incorrect.

### What is statistical error example? ›

**What are the common statistical mistakes?**

- Absence of an adequate control condition/group.
- Interpreting comparisons between two effects without directly comparing them.
- Spurious correlations.
- Inflating the units of analysis.
- Correlation and causation.
- Use of small samples.
- Circular analysis.
- Flexibility of analysis.

### What are some statistical concerns? ›

**Here are five common problems when using statistics.**

- Problem 1. Extracting meaning out of little difference. ...
- Problem 2. Using small sample sizes. ...
- Problem 3. Showing meaningless percentages on graphs. ...
- Problem 4. Poor survey design. ...
- Problem 5. Scaling and axis manipulation.

### What is statistical error how it differ from mistake? ›

Statistical error: the term 'error' is used in statistics in a technical sense. It is **the difference between the estimated or approximated value and the true value**. Mistake: a mistake arises from miscalculation, use of the wrong method of calculation, or wrong interpretation of the result.

### Are statistics always right? ›

Are statistics always right? **No.** The results of statistics depend on the mode of data collection, and there are many opportunities to manipulate data, and thus results.

### What are three ways in which people can lie with statistics? ›

**3 Ways to Lie with Statistics**

Based on the concepts presented above, there are three ways a researcher might misrepresent things when presenting statistics:

- Amplifying the Importance of Statistical Significance. ...
- Capitalizing on Type-I Error. ...
- Failing to Report on Effect Size Information.

### How do you know if statistics are accurate? ›

**7 Clues For Identifying Reliable Statistics**

- Statistics Benefit the Group Who Collected the Information. ...
- Small Sample Size. ...
- Error Margins Are Too Large. ...
- The Sample Representation Is Inaccurate or Biased. ...
- Incentives are Inappropriate for the Sample. ...
- The Context Is Not Reported. ...
- The Statistic Flies in the Face of Precedent.

### Can statistics be persuasive and misleading? ›

**Statistics are persuasive**. So much so that people, organizations, and whole countries base some of their most important decisions on organized data. But any set of statistics might have something lurking inside it that can turn the results completely upside down.

### Can statistics be manipulated? ›

**Manipulating statistics isn't as difficult as one may think**. Although statistics are hard numbers and lying about them isn't legal, that doesn't mean they can't be skewed or framed in a way that makes the presenter look better.

### What are the limitations of statistics? ›

**Statistics cannot be applied to heterogeneous data.** If adequate care is not taken in gathering, analysing and interpreting the data, statistical findings can be misleading. Statistical data can be handled effectively only by a person with professional knowledge of statistics.

### Should you always trust statistics? ›

Depending on the quality and quantity of the data, the outcome of the same study can be completely different. Hence, **the randomness of a sample is the reason why statistical data cannot be trusted blindly**.

### Is statistical error standard deviation? ›

**The standard error (SE) of a statistic is the approximate standard deviation of a statistical sample population**. The standard error is a statistical term that measures the accuracy with which a sample distribution represents a population by using standard deviation.

### What is sampling error in statistics? ›

A sampling error is **a statistical error that occurs when an analyst does not select a sample that represents the entire population of data**. As a result, the results found in the sample do not represent the results that would be obtained from the entire population.

### What is statistical error class 11? ›

Statistical errors are **those that occur during the collection of data; they depend on the size of the sample selected for study**. They are of two types: sampling error and non-sampling error.

### What can go wrong in statistics? ›

Types of mistakes

Many mistakes in using statistics fall into one of the following categories: **Expecting too much certainty**. Misunderstandings about probability. Mistakes in thinking about causation.

### How do you know if two samples are statistically different? ›

**A t-test** is an inferential statistic used to determine if there is a statistically significant difference between the means of two variables. The t-test is a test used for hypothesis testing in statistics.
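For unequal variances, Welch's version of the t statistic is a common choice; here is a sketch with invented data (in practice a library such as scipy.stats would also supply the p-value):

```python
import math

def welch_t(a, b):
    # Welch's two-sample t statistic and degrees of freedom, which do not
    # assume equal variances in the two groups.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

a = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]   # invented data
b = [4.4, 4.7, 4.2, 4.8, 4.5, 4.3]
t, df = welch_t(a, b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The t statistic is then compared against the t distribution with the computed degrees of freedom to obtain a p-value.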

### What is the difference between standard error and estimated standard error? ›

**Standard error gives the accuracy of a sample mean by measuring the sample-to-sample variability of the sample means**. The SEM describes how precise the mean of the sample is as an estimate of the true mean of the population.

### Should I use standard deviation or standard error? ›

So, **if we want to say how widely scattered some measurements are, we use the standard deviation**. If we want to indicate the uncertainty around the estimate of the mean measurement, we quote the standard error of the mean. The standard error is most useful as a means of calculating a confidence interval.

### What is a Type 2 error in statistics? ›

A Type II error means **not rejecting the null hypothesis when it's actually false**. This is not quite the same as “accepting” the null hypothesis, because hypothesis testing can only tell you whether to reject the null hypothesis.

### What are the four types of errors? ›

Measurement errors are commonly classified as **gross errors, systematic errors, random errors and absolute errors**.

### What are two types of sampling errors? ›

**What are the most common sampling errors in market research?**

- Population specification error: A population specification error occurs when researchers don't know precisely who to survey. ...
- Sample frame error: Sampling frame errors arise when researchers target the wrong sub-population when selecting the sample.

### What type of error is population error? ›

Sampling errors occur when numerical parameters of an entire population are derived from samples of the entire population. The difference between the values derived from the sample of a population and the true values of the population parameters is considered a sampling error.

### What are 5 types of errors? ›

The errors that may occur in the measurement of a physical quantity can be classified into six types: **constant error, systematic error, random error, absolute error, relative error and percentage error**.

### What are the three types of errors? ›

**Types of Errors**

- (1) Systematic errors. With this type of error, the measured value is biased due to a specific cause. ...
- (2) Random errors. This type of error is caused by random circumstances during the measurement process.
- (3) Negligent errors.

### What causes mistakes or errors in measurement? ›

Random error occurs due to chance. There is always some variability when a measurement is made. Random error may be caused by **slight fluctuations in an instrument, the environment, or the way a measurement is read**, that do not cause the same error every time.