# The Illusion of Statistical Significance: Rethinking p-Values

## Chapter 1: Understanding Scientific Discoveries

What truly constitutes a scientific breakthrough? When does an intuition transform into a widely accepted truth? The intricacies of our world make it challenging to glean novel insights that withstand rigorous scrutiny and replication. This challenge becomes even more pronounced within the social and biomedical sciences, where researchers lack the controlled environments available to physicists, who can conduct countless measurements (such as in the Large Hadron Collider) with minimal uncertainty regarding their findings.

In recent times, numerous doubts have emerged surrounding results once thought to be reliable. The replication crisis in social sciences has brought these issues to light for the public, and this crisis remains ongoing. Take psychology as a case in point: concepts like social priming, previously considered solid, are now facing skepticism. A Nature article mentions that many scholars view social priming not as a method to influence subconscious behavior but as a cautionary tale of how flawed statistical techniques misled scientists into publishing results that could not be replicated. This situation is not an anomaly but rather a common occurrence: a 2018 analysis of 200 meta-analyses revealed that "psychological research is, on average, afflicted with low statistical power."

This reality may not come as a shock; studies involving humans or complex biological systems are inherently difficult to regulate, and the task of increasing sample sizes and identifying all potential sources of error can be overwhelming. Thus, it becomes easy for mistakes to occur.

### Significance Testing

Consequently, researchers seek robust criteria to determine whether their findings warrant publication. The concept of significance testing aims to establish objective metrics to differentiate valid scientific outcomes from spurious ones.

Arguably the most recognized criterion for statistical significance is the p-value. However, as we will explore, this seemingly reliable metric can lead to a host of new complications. It can lend a veneer of credibility to dubious results, obscure meaningful findings, and conceal poor research behind a misleading sense of objectivity. The p-value may, in fact, contribute to the replication crisis that afflicts various scientific fields. Instead of enhancing scientific rigor, significance testing can create the illusion that it does, which is where its potential dangers lie.

#### The p-value Explained

The p-value is a relatively simple concept. If you detect an effect in a dataset and possess a theoretical framework for what may have caused it (for instance, in a double-blind study where one group receives medication and another a placebo), how likely is it that you would observe an effect at least this large if the intervention had no actual impact (the null hypothesis)? In more abstract terms, do the variables you are analyzing have any correlation?

Consider this example: suppose you have 100 patients receiving the actual medication and another 100 receiving a placebo. After conducting the trial, 60 individuals in the treatment group recover, while only 48 in the placebo group do. How probable is it that this difference is merely a random occurrence, suggesting that the medication has no effect at all?

This probability is the p-value. The conventional significance threshold is p < 0.05: the probability of observing an effect at least as large as the one measured, assuming the null hypothesis holds true, must be less than 5% for the result to be considered significant.
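
The example above can be checked with a short simulation. The sketch below (a minimal illustration, not part of the original study; the function name, pooled recovery rate, trial count, and seed are all my choices) estimates the p-value by repeatedly replaying the trial under the null hypothesis that both groups share one pooled recovery rate:

```python
import random

def simulated_p_value(n_treat=60, n_placebo=48, n_per_group=100,
                      n_trials=10_000, seed=42):
    """Monte Carlo estimate of the two-sided p-value: simulate the null
    hypothesis (both groups share the pooled recovery rate, so any
    difference is chance) and count how often the simulated difference
    is at least as large as the observed one."""
    pooled_rate = (n_treat + n_placebo) / (2 * n_per_group)  # 108/200 = 0.54
    observed_diff = abs(n_treat - n_placebo)                 # 12 patients
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_trials):
        treat = sum(rng.random() < pooled_rate for _ in range(n_per_group))
        placebo = sum(rng.random() < pooled_rate for _ in range(n_per_group))
        if abs(treat - placebo) >= observed_diff:
            extreme += 1
    return extreme / n_trials

print(simulated_p_value())
```

With these numbers the estimate typically lands around 0.1: even a 60-versus-48 split is quite compatible with pure chance, which would fail the conventional 0.05 threshold.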

The general procedure is as follows:

- Formulate the research hypothesis.
- Establish the null hypothesis.
- Choose a threshold for the probability of error (the p-value).
- Conduct a statistical significance test.
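
The four steps above can be sketched analytically for the two-group example. This is an illustrative implementation of a standard two-proportion z-test (the function name and the normal-approximation choice are mine, not from the article):

```python
import math

def two_proportion_z_test(successes_a, successes_b, n, alpha=0.05):
    """The four steps, for two equal-sized groups:
    1. research hypothesis: the groups' success rates differ;
    2. null hypothesis: both groups share one success rate;
    3. error threshold: alpha (conventionally 0.05);
    4. test: two-sided z-test on the difference in proportions."""
    p_a, p_b = successes_a / n, successes_b / n
    pooled = (successes_a + successes_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)  # std. error under the null
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_value, p_value < alpha
```

For the article's example, `two_proportion_z_test(60, 48, 100)` gives a p-value of roughly 0.089: above 0.05, so by the conventional criterion the difference would not be declared significant.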

### Challenges with the p-value

In my previous article discussing the pitfalls of statistical analysis, I highlighted how numbers can create a false sense of certainty. Reducing the success of an experiment and the validity of its hypothesis to a single, objective p-value is problematic for several reasons.

The p-value is closely tied to the null hypothesis. However, what exactly is the origin of the null hypothesis, and what does it signify? Ideally, the null hypothesis represents a precise estimate of zero effect, but this only holds true in perfectly controlled experiments, where every possible effect has been eliminated from consideration.

Real-world studies investigating genuine effects are typically noisy and systematically so. Every genuine null hypothesis is an imperfect abstraction that overlooks the existence of systematic errors or noise, even in highly controlled physics experiments, let alone in medicine or social sciences.

Consequently, McShane et al. argue in their 2019 paper "Abandon Statistical Significance" that both the adequacy of the statistical model used to derive the p-value and the various types of systematic error prevalent in different fields (such as measurement errors, reliability and validity issues, biased samples, and confounding) are crucial considerations. The combination of these challenges in the biomedical and social sciences, along with the rigid zero-effect null hypothesis, creates significant issues.

It is possible to obtain significant p-values from pure random noise, as well-known cases demonstrate (e.g., Carney, 2010). Combined with an idealized null hypothesis, this means that noisier studies are more likely (and more easily manipulated) to yield significant yet false findings than tightly controlled studies, which, by p-value criteria, are then less likely to be published.
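
The noise-to-significance pipeline is easy to demonstrate. The sketch below (an illustrative simulation; the experiment count, group size, and seed are my choices) runs many "experiments" in which both groups are pure noise, so the null hypothesis is true by construction, and counts how often a z-test still declares significance:

```python
import math
import random

def false_positive_rate(n_experiments=2_000, n=100, rate=0.5,
                        alpha=0.05, seed=1):
    """Fraction of pure-noise experiments (no real effect exists)
    that a two-sided two-proportion z-test nevertheless flags as
    significant at the given alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        a = sum(rng.random() < rate for _ in range(n))  # "treatment" group
        b = sum(rng.random() < rate for _ in range(n))  # "placebo" group
        pooled = (a + b) / (2 * n)
        if pooled in (0.0, 1.0):  # degenerate draw: no variance, skip
            continue
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        z = ((a - b) / n) / se
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < alpha:
            hits += 1
    return hits / n_experiments

print(false_positive_rate())
```

By design, roughly one in twenty such noise-only experiments clears p < 0.05. Run enough underpowered studies (or enough analyses of one dataset) and "significant" findings appear from nothing.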

Moreover, the p-value threshold of 0.05 is arbitrary. There is no scientific justification for dividing evidence into statistically significant and insignificant categories, and it is naive to assume that a single threshold should apply uniformly across all experiments and disciplines. Imposing such an arbitrary threshold undermines scientific reasoning, which should accommodate a variety of potentially undiscovered explanations.

These undiscovered explanations are often excluded from the null hypothesis. As Andrew Gelman points out, the null hypothesis itself is not a strong hypothesis; it merely serves as a straw man for alternative hypotheses. Comparing your hypothesis to a weak alternative does not negate other, potentially superior hypotheses that may explain the data.

Lastly, the p-value is frequently misinterpreted, which compounds the problem. A meta-analysis of 791 studies revealed that an alarming 49% misused the p-value, categorizing statistically non-significant results as evidence of no effect.
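
The converse error can also be simulated. The sketch below (illustrative only; the effect sizes, group size, and seed are my assumptions) posits a real effect (recovery rates of 0.60 versus 0.48) studied with small groups, and counts how often the test comes out non-significant. Reading each of those results as "no effect" is exactly the misinterpretation described above:

```python
import math
import random

def miss_rate(true_a=0.60, true_b=0.48, n=40, n_experiments=2_000,
              alpha=0.05, seed=7):
    """Fraction of underpowered studies of a REAL effect that a
    two-sided two-proportion z-test fails to flag as significant."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(n_experiments):
        a = sum(rng.random() < true_a for _ in range(n))
        b = sum(rng.random() < true_b for _ in range(n))
        pooled = (a + b) / (2 * n)
        se = math.sqrt(2 * pooled * (1 - pooled) / n)
        z = ((a - b) / n) / se
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p >= alpha:
            misses += 1
    return misses / n_experiments

print(miss_rate())
```

With 40 patients per group, the majority of these simulated studies miss the genuine effect, so "not significant" here mostly means "not enough data", not "no effect".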

### Abandoning Significance Testing

Statistics can be a complex field. As the replication crisis persists, the need for solutions to enhance scientific inquiry is urgent. The debate surrounding the p-value and its misuse has become particularly vigorous.

So, what is the way forward? McShane et al. advocate for the scientific community to abandon the p-value, or at least adopt a more nuanced perspective that moves away from fixed thresholds and considers it alongside other metrics for evaluating experimental outcomes. Researchers should limit the influence of p-values in the publication process, refraining from exclusively publishing papers that achieve a p-value of 0.05. Many scientists have begun to push back against the reliance on statistical significance (as illustrated in this Nature article), signaling that a paradigm shift may be necessary. However, this transition will not occur unless statistical significance is critically examined in introductory statistics courses.

Ultimately, it is vital to adopt a comprehensive view of the evidence, tailored to the specific needs of the experiment and the field. It is perilous to conclude experiments with binary declarations of "effect" or "no effect" solely based on whether the p-value crosses a predetermined threshold.

However, achieving this understanding requires time, and the dynamics of academia have exacerbated the issue. For example, the pressure to publish as a lead author discourages data pooling: it is more advantageous to "discover" and publish two unreliable findings than to collaborate with competitors to enhance data quality and share the dubious recognition of uncovering inconsequential results.

The challenge remains: a sound statistical mindset is essential for conducting rigorous science, but statistics can be daunting, and producing quality research is difficult, especially in unpredictable environments. The allure of the p-value provides an easy way out. Our cognitive biases favor transforming uncertainties into certainties (as discussed in my article on the Bayesian Brain Hypothesis), but this tendency risks introducing biases into our reasoning and should not justify "uncertainty laundering," as Gelman aptly describes.

Holding onto statistical significance and classifying evidence based on it has compromised the integrity of numerous studies, ultimately eroding public trust in science and its perceived value. Therefore, it is crucial to let go of this flawed practice.
