P-Value Interpretation: Common Mistakes and Correct Reading
A focused walkthrough of how to correctly interpret p-values: the technical definition, the four most common misinterpretations, why p-values are not the probability that H0 is true, the relationship to confidence intervals, and worked examples showing correct versus incorrect interpretation.
What You'll Learn
- State the formal definition of a p-value
- Identify the four most common p-value misinterpretations
- Explain why p is not the probability that H0 is true
- Connect p-values to confidence intervals
- Apply correct interpretation to worked examples
1. The Formal Definition
A p-value is the probability of observing data as extreme as (or more extreme than) the observed data, assuming the null hypothesis is true. Symbolically: p = P(Data ≥ observed | H0 true). The definition has three critical parts. (1) It is a conditional probability, conditional on H0 being true. (2) "As extreme or more extreme" includes the observed value and everything further from the null. (3) It is computed for data, not for hypotheses. The p-value is a property of the data given the model, not a property of the hypothesis given the data. This definition has direct consequences for interpretation. A p-value of 0.03 means: "If H0 were true, there would be a 3% chance of seeing data this extreme or more extreme." It does NOT mean "There is a 3% chance that H0 is true." It does NOT mean "There is a 97% chance that H1 is true." Inverting the conditional probability requires Bayes' theorem and prior probabilities, which p-values do not contain. The American Statistical Association published a position statement in 2016 clarifying these points after decades of misuse. The takeaway: p-values measure the compatibility of the data with the null hypothesis, not the truth of the null hypothesis.
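To make the definition concrete, here is a minimal simulation sketch. The setting (a one-sample z-test of H0: mu = 0 with known sigma = 1) and all numbers in it are illustrative assumptions, not values from the text: it estimates p = P(Data ≥ observed | H0 true) by generating the sampling distribution of the test statistic under H0 and compares the result with the analytic two-sided p-value.

```python
# A minimal sketch of the definition p = P(Data >= observed | H0 true).
# Setting: one-sample z-test of H0: mu = 0 with known sigma = 1.
# Sample size, true effect, and simulation count are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 30

# One observed sample (drawn here, for illustration, with a true mean of 0.5).
observed = rng.normal(loc=0.5, scale=1.0, size=n)
z_obs = observed.mean() / (1.0 / np.sqrt(n))   # z statistic under H0: mu = 0

# Sampling distribution of the statistic computed *assuming H0 is true*.
n_sims = 100_000
null_means = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n)).mean(axis=1)
null_z = null_means / (1.0 / np.sqrt(n))

# Two-sided p-value: fraction of null statistics at least as extreme as observed.
p_sim = np.mean(np.abs(null_z) >= abs(z_obs))
p_analytic = 2 * stats.norm.sf(abs(z_obs))
print(f"simulated p = {p_sim:.4f}, analytic p = {p_analytic:.4f}")
```

The simulated fraction and the analytic p-value agree up to Monte Carlo error, which is exactly what the definition says: the p-value is a statement about the data under H0, nothing more.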
Key Points
- p = P(Data ≥ observed | H0 true)
- Conditional on H0 being true
- Property of the data given the model, not the hypothesis given the data
- Does NOT mean the probability that H0 is true
- Does NOT mean the probability that H1 is true
2. The Four Most Common Misinterpretations
Mistake 1: "p = 0.03 means there is a 3% chance the null is true." Wrong. The p-value is conditional on H0 being true; inverting it requires Bayes' theorem. The probability that H0 is true depends on its prior probability, which the p-value does not include. Mistake 2: "p = 0.03 means there is a 97% chance the alternative is true." Wrong, for the same reason. The complement of the p-value is not the probability of the alternative; both probabilities require a prior. Mistake 3: "p < 0.05 proves the effect is real." Wrong. A small p-value provides evidence against H0 but does not prove the alternative. Replication, effect size, and study design all matter, and a statistically significant result with a tiny effect size may have no practical importance. Mistake 4: "p > 0.05 proves there is no effect." Wrong. Failing to reject H0 is not the same as accepting H0; the study may simply be underpowered to detect a real effect. "Absence of evidence is not evidence of absence" applies directly here. Bonus mistake 5: comparing p-values across studies. p = 0.04 and p = 0.06 are essentially the same strength of evidence. The 0.05 threshold is arbitrary, and treating studies as categorically different based on whether p crosses 0.05 misrepresents continuous evidence. Recent recommendations move toward reporting exact p-values, effect sizes, and confidence intervals jointly.
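Mistake 4 is easy to demonstrate by simulation. The sketch below runs many two-sample t-tests with a real but modest effect; the effect size, per-group sample sizes, and repetition count are illustrative assumptions. With a small sample, p > 0.05 is the usual outcome even though the effect exists.

```python
# Power simulation illustrating Mistake 4: failing to reject H0 does not prove
# there is no effect. Effect size, sample sizes, and n_sims are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3          # a real standardized effect: H0 is false by construction
n_sims = 5_000

for n in (20, 200):
    pvals = []
    for _ in range(n_sims):
        treatment = rng.normal(true_effect, 1.0, size=n)
        control = rng.normal(0.0, 1.0, size=n)
        pvals.append(stats.ttest_ind(treatment, control).pvalue)
    power = np.mean(np.array(pvals) < 0.05)
    print(f"n = {n:3d} per group: rejected H0 in {power:.0%} of simulations")
```

In the small-sample condition most simulated studies report p > 0.05 despite the real effect, which is why "p > 0.05" can never be read as "no effect".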
Key Points
- "3% chance H0 is true": wrong (needs a prior)
- "97% chance H1 is true": wrong (needs a prior)
- "p < 0.05 proves the effect is real": wrong (it is evidence, not proof)
- "p > 0.05 proves no effect": wrong (the study may be underpowered)
- Comparing p-values across studies is often misleading
3. Why Inverting Requires Bayes' Theorem
P(H0 true | data) = P(data | H0) × P(H0) / P(data). This is Bayes' theorem. The p-value provides P(data | H0). To get P(H0 | data), you need P(H0), the prior probability that H0 is true before seeing the data. Without the prior, the inversion cannot be done. A p-value of 0.05 paired with a high prior (say P(H0) = 0.99) yields a high posterior probability that H0 is still true. The same p-value paired with a low prior yields a low posterior. The same data can support opposite conclusions depending on the prior. This is why frequentist hypothesis testing does not produce statements about the probability that H0 is true. It produces statements about the data given the hypothesis. Bayesian analysis explicitly incorporates priors and produces direct posterior probability statements, but this is a different framework with different inputs and assumptions.
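One common way to make this inversion concrete is to condition on the event "the test rejected at alpha = 0.05" rather than on the exact p-value; doing so requires assuming a power for the test when H1 is true. The sketch below uses an assumed alpha of 0.05 and an assumed power of 0.80 (neither value is given in the text) and shows how the same significance threshold yields very different posterior probabilities for H0 under different priors.

```python
# Bayes' theorem applied to the event "significant result at alpha = 0.05".
# alpha and power are assumptions; the priors are illustrative.
def prob_h0_given_significant(prior_h0: float, alpha: float = 0.05,
                              power: float = 0.80) -> float:
    """P(H0 | reject) = P(reject | H0) P(H0) / P(reject)."""
    prior_h1 = 1.0 - prior_h0
    p_reject = alpha * prior_h0 + power * prior_h1
    return alpha * prior_h0 / p_reject

for prior in (0.99, 0.50, 0.10):
    print(f"P(H0) = {prior:.2f}  ->  P(H0 | p < 0.05) = "
          f"{prob_h0_given_significant(prior):.3f}")
```

With a prior of 0.99 the posterior probability of H0 stays high (around 0.86 under these assumptions); with a prior of 0.10 it drops below 0.01. Same threshold, opposite conclusions, exactly as the paragraph describes.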
Key Points
- P(H0 | data) requires Bayes' theorem
- Inversion needs the prior probability P(H0)
- Same p with different priors → different posteriors
- Frequentist tests do not produce probability statements about H0
- Bayesian analysis is a different framework with different inputs
4. Connection to Confidence Intervals
A 95% confidence interval and p < 0.05 are mathematically related. For a two-sided test, the test rejects H0 at alpha = 0.05 if and only if the 95% confidence interval does not include the null value (typically zero or some null-hypothesized mean). A worked example: suppose we test H0: mean = 0 with sample data. If the 95% CI is [0.4, 2.3], the interval does not include zero, so the test rejects at p < 0.05. If the 95% CI is [-0.2, 1.8], the interval includes zero, so the test fails to reject. The CI and the p-value give equivalent information. Confidence intervals are often preferred for reporting for three reasons. (1) They directly show effect size and precision. (2) They make non-significant results interpretable: a CI of [-0.05, 0.10] tells the reader the effect is small, while p = 0.30 alone reveals nothing about effect size. (3) They are less prone to the "significance vs effect size" confusion. The recommendation from contemporary statistics: report the confidence interval along with (or instead of) the p-value.
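The correspondence can be checked directly. This sketch runs a one-sample t-test of H0: mean = 0 on simulated data (the sample itself is an illustrative assumption) and confirms that "p < 0.05" and "the 95% CI excludes zero" are the same decision.

```python
# Checking the CI / p-value correspondence for a one-sample t-test of H0: mean = 0.
# The simulated sample below is an illustrative assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=0.8, scale=2.0, size=25)

p_value = stats.ttest_1samp(sample, popmean=0.0).pvalue
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=sample.mean(), scale=stats.sem(sample))

print(f"p = {p_value:.4f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
print("reject at alpha = 0.05:", p_value < 0.05)
print("CI excludes zero:      ", not (ci_low <= 0.0 <= ci_high))
# The last two lines always agree: rejecting at 0.05 <=> the 95% CI excludes the null value.
```

Rerunning with a smaller true mean (so the test fails to reject) shows the CI straddling zero, matching the second worked example above.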
Key Points
- A 95% CI and p < 0.05 (two-sided) are mathematically linked
- The test rejects H0 if the CI does not include the null value
- CIs show effect size and precision directly
- CIs make non-significant results interpretable
- Modern recommendation: report the CI alongside the p-value
5. How StatsIQ Helps With P-Value Interpretation
Snap a photo of any test result and StatsIQ produces the correct interpretation, the confidence interval, and the effect size, and flags common misinterpretation patterns. For exam prep, the app generates practice problems with multiple-choice interpretation questions and identifies which interpretations are correct versus which are common misunderstandings. StatsIQ also handles Bayes' theorem inversions for students who want to understand posterior probability under specified priors. This content is for educational purposes only.
Key Points
- Produces correct interpretations of test results
- Provides the confidence interval and effect size
- Flags common misinterpretation patterns
- Multiple-choice practice for interpretation
- Handles Bayesian posterior calculation under specified priors
Key Takeaways
- p = P(Data ≥ observed | H0 true)
- p is conditional on H0 being true
- p does NOT equal the probability that H0 is true
- p does NOT equal the probability that H1 is true
- Inverting requires Bayes' theorem with a prior probability
- p < 0.05 = evidence against H0, NOT proof of H1
- p > 0.05 = insufficient evidence, NOT proof of H0
- A 95% CI and a two-sided test at alpha = 0.05 are mathematically equivalent
- Report the CI alongside the p-value for full information
- Effect size matters even when p < 0.05
- The ASA's 2016 statement formalized correct interpretation
- Significant results from underpowered studies inflate effect estimates (the "winner's curse")
Practice Questions
1. A study reports p = 0.02. Which interpretation is correct?
2. A study reports p = 0.06 and fails to reject H0. Does this prove there is no effect?
3. A 95% confidence interval for an effect is [-0.1, 0.4]. What is the test result at alpha = 0.05?
4. Researcher A reports p = 0.04. Researcher B reports p = 0.06. Are these results meaningfully different?
5. In an A/B test with very high traffic, you find p = 0.001 but the effect size is 0.1% conversion lift. Should you ship the variant?
FAQs
Common questions about this topic
Why did the ASA publish its 2016 statement on p-values?
Because decades of misuse and misinterpretation had distorted scientific practice. The 2016 ASA statement formalized six principles: (1) p-values can indicate how incompatible data are with a specified statistical model; (2) p-values do not measure the probability that the studied hypothesis is true; (3) scientific conclusions should not be based only on whether p crosses 0.05; (4) proper inference requires full reporting and transparency; (5) a p-value alone does not measure effect size or importance; (6) a p-value does not provide a good measure of evidence regarding a model or hypothesis. These principles directly address the most common misinterpretations.
Should confidence intervals replace p-values?
No, but they should be reported together. Confidence intervals provide effect size and precision; p-values provide a single summary of compatibility with the null. Each has its uses. The contemporary recommendation is to report both, along with effect size estimates, sample sizes, and analytical choices. Some journals now require this. Pure p-value reporting without context risks the misinterpretations the ASA flagged.
What is the difference between one-sided and two-sided p-values?
A two-sided p-value asks "what is the probability of seeing data this extreme in either direction?" A one-sided p-value asks "what is the probability of seeing data this extreme in the specified direction?" A one-sided p-value is typically half the two-sided p-value when the observed effect lies in the tested direction. One-sided tests are only appropriate when there is a strong a priori reason to test only one direction (e.g., a new drug expected to improve, not worsen, outcomes). Most researchers use two-sided tests by default to avoid bias.
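A quick illustration of the halving relationship, assuming a one-sample t-test on made-up data whose mean lies in the tested direction:

```python
# One- vs two-sided p-values for the same data, using a one-sample t-test of
# H0: mean = 0. The simulated sample is an illustrative assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=0.4, scale=1.0, size=40)

two_sided = stats.ttest_1samp(sample, popmean=0.0).pvalue
one_sided = stats.ttest_1samp(sample, popmean=0.0, alternative='greater').pvalue
print(f"two-sided p = {two_sided:.4f}, one-sided p = {one_sided:.4f}")
# When the sample mean lies in the tested direction, the one-sided p-value
# is half of the two-sided one.
```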
What does it mean if software reports a p-value of zero?
In practice, exact zeros do not occur for continuous test statistics. Software often reports "p < 0.001" or "p < 2e-16" rather than an exact zero. The interpretation is the same: the observed data are extremely unlikely under H0. The threshold below which software reports "very small" varies by package and test, but the substantive conclusion is unchanged: strong evidence against H0.
What are Bayes factors, and how do they relate to p-values?
Bayes factors are the Bayesian alternative to p-values. A Bayes factor of 10 means the data are 10 times more likely under H1 than under H0. This is often more directly interpretable than a p-value because it does not require a prior on the hypotheses themselves: it directly compares how well the data are predicted under the competing models. The challenge: Bayes factors require specifying the alternative hypothesis distribution, which p-values do not need. Both approaches have legitimate uses; the choice often reflects research community conventions rather than fundamental statistical principle.
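As a deliberately simple illustration, here is a Bayes factor for a coin-flip experiment comparing H0: theta = 0.5 against an H1 that places a uniform prior on theta; both the model choice and the counts are assumptions made for this sketch.

```python
# Toy Bayes factor for k heads in n coin flips.
# H0: theta = 0.5 (fair coin); H1: theta ~ Uniform(0, 1).
# The counts below are made up for illustration.
from math import comb

def bayes_factor_10(k: int, n: int) -> float:
    """BF_10 = P(data | H1) / P(data | H0)."""
    p_data_h1 = 1.0 / (n + 1)           # integral of C(n,k) theta^k (1-theta)^(n-k) over Uniform(0,1)
    p_data_h0 = comb(n, k) * 0.5 ** n   # binomial likelihood at theta = 0.5
    return p_data_h1 / p_data_h0

print(f"15 heads in 20 flips: BF10 = {bayes_factor_10(15, 20):.2f}")  # > 1, data favor H1
print(f"10 heads in 20 flips: BF10 = {bayes_factor_10(10, 20):.2f}")  # < 1, data favor H0
```

Note how the uniform prior on theta is exactly the "alternative hypothesis distribution" the answer mentions: a different choice of prior under H1 would change the Bayes factor.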
How does StatsIQ help with p-value interpretation?
Snap a photo of any test result and StatsIQ produces the correct interpretation, confidence interval, and effect size, and flags common misinterpretation patterns. For exam prep, StatsIQ generates multiple-choice interpretation questions and identifies which interpretations are correct versus common misunderstandings. The app also handles Bayes' theorem inversions under specified priors for students who want to understand posterior probability. This content is for educational purposes only.