What You Think You Know About Psychology is Wrong:
The limitations of null hypothesis significance testing
By: Zach Basehore
Are college students psychic?!
Let's say someone claims that people who go to college are more psychic than people who do not attend college. So I decide to test this claim!
How would I do that? Well, a simple test would be to examine people's ability to correctly predict whether a coin will land on heads or tails when I flip it. There are 10,000 college students and 10,000 non-college students; each person predicts the results of 100,000 coin flips, one flip at a time.
The results:
Each participant had a proportion of correct predictions. The mean proportion of correct predictions among college students was .50006 (that is, 50.006% correct), and the non-college students had a mean proportion of correct predictions equal to .49999. The SDs are .00160 and .00155, respectively.
When you run an independent-samples t test, this difference is statistically significant at an alpha level of .01! The 95% CI for the difference is also quite narrow (indicating that we've pinned down the size of the difference quite precisely).
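If you want to check that claim without recruiting 20,000 volunteers, here's a minimal sketch of the same kind of independent-samples t test, run straight from the summary statistics above. (Python and scipy are my own choice of tools here, not part of the original thought experiment.)

```python
# Sketch: independent-samples t test computed from summary statistics alone.
# The means, SDs, and group sizes are the ones reported above.
from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(
    mean1=.50006, std1=.00160, nobs1=10_000,  # college students
    mean2=.49999, std2=.00155, nobs2=10_000,  # non-college students
    equal_var=True,                           # classic pooled-variance t test
)
print(result)  # t comes out around 3.1, p around .002: "significant" at alpha = .01
```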
So the statistical test gives us very strong evidence that college students really are more prescient than non-college students! We've made a new discovery that revolutionizes our understanding of the human mind, and opens up a whole new field of inquiry! Why are college students more psychic? Is it because they're smarter? More sensitive? Do they pay closer attention to the world around them?
The problem:
In this example, I've found evidence of psychic abilities! Specifically, I've shown that college students predict the outcome of coin flips more accurately than non-college students, and there's less than a 1% probability of finding a difference this large by chance alone if the null hypothesis is true at the population level! How exciting! I can establish a huge name for myself among scientific psychologists, and have my pick of schools at which to continue my groundbreaking research. I could continue this research at Oxford… nah, let's find a better climate, like Miami or USC. I could get multi-million-dollar grants to fund an elaborate lab with fancy equipment! I can give TED talks, write books, and go on lucrative speaking tours... my research will grab headlines the world over! I'll be a household name!
The gut-check:
But wait a second... what was the actual difference again? On average, college students are right on 7 more trials (out of 100,000) than non-college students? After all, the difference in proportions is only .50006 - .49999 = .00007, which works out to 7 extra correct predictions per 100,000 flips...
Any time you gather real-world data, you'd expect there to be some small difference between groups, even if it's really not due to any systematic effect. In the research described above, everything happened in just the right way to give me a spurious result:
1. low variance within each group [thanks in part to the huge number of coin flips each person predicted; see the law of large numbers];
2. a small but statistically significant difference that can easily be explained by a seemingly reasonable mechanism; and
3. a very large sample.
These factors explain how I found a statistically significant difference between college students and non-college students despite the tiny difference in means.
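To see how readily chance alone hands you a "significant" difference, here is a scaled-down simulation sketch. The group sizes, flip counts, and number of repetitions are my own illustrative choices (much smaller than the 10,000-person, 100,000-flip setup above), but the logic is the point: both groups guess fair coins at exactly chance level, and the t test still flags a "significant" difference at roughly the alpha rate.

```python
# Sketch: how often does a t test call two identical groups "significantly"
# different? Everyone in both groups guesses fair coin flips at chance level.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2024)      # fixed seed so the sketch is reproducible
n_people, n_flips = 500, 10_000        # scaled-down, illustrative sizes
n_sims, alpha = 1_000, .01

false_alarms = 0
for _ in range(n_sims):
    # Each person's proportion of correct guesses; both groups are drawn
    # from the very same chance-level (p = .5) population.
    college = rng.binomial(n_flips, .5, n_people) / n_flips
    no_college = rng.binomial(n_flips, .5, n_people) / n_flips
    if ttest_ind(college, no_college).pvalue < alpha:
        false_alarms += 1

# Expect roughly alpha * n_sims (about 10) spurious "discoveries."
print(f"{false_alarms} 'significant' differences in {n_sims} simulated studies")
```

Run enough studies like this and some of them will clear whatever significance threshold you pick, psychic powers or not.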
Excited by the significant result and the potential to trumpet my exciting new ‘discovery’ [thereby launching a career, positioning myself as an expert who can charge ridiculously high consulting or speaking fees], I've failed to critically evaluate the implications of my results. And therefore, I've failed as a scientist. :(
How can we avoid falling into that trap?
One solution:
A standardized measure of effect size, like Cohen's d, will reveal what SHOULD be obvious from a look at the raw data: this difference between groups is tiny and practically insignificant, and it shouldn't convince anyone that college students are actually psychic!
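You don't have to take my word for it, either; here is a minimal sketch of that effect-size calculation, assuming the usual pooled-SD form of Cohen's d and using the summary statistics reported above.

```python
# Sketch: Cohen's d from the reported means and SDs (pooled-SD version;
# with equal group sizes this is just the root mean square of the two SDs).
mean_college, sd_college = .50006, .00160
mean_no_college, sd_no_college = .49999, .00155

pooled_sd = ((sd_college**2 + sd_no_college**2) / 2) ** 0.5
cohens_d = (mean_college - mean_no_college) / pooled_sd
print(f"Cohen's d = {cohens_d:.3f}")  # roughly 0.04, a negligible effect
```

For reference, Cohen's own rule of thumb puts a "small" effect at d = 0.2; an effect around 0.04 is, for any practical purpose, no effect at all.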
In the spirit of scientific inquiry, you can test this for yourself! At GraphPad QuickCalcs, enter a mean of .50006 for Group 1 and .49999 for Group 2. Next, enter the SD of .00160 for Group 1 and .00155 for Group 2. The N for each group is 10000. Hit "Calculate now" and see what you get.
Now, enter the same means and SDs, but change the N to 100 for both groups, and observe the results.
Then, go to the Cohen's d calculator here and enter the same information (it doesn't ask for sample size). So what does all of this information mean?…
I've already done the easy part for you:
[Calculator results for the sample of 20,000, the sample of 200, and Cohen's d]
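If you'd rather skip the web calculators entirely, here is a sketch (again in Python, my choice of tool) that runs both t tests from the same summary statistics; together with the Cohen's d sketch above, it covers everything the calculators report.

```python
# Sketch: the same means and SDs, analyzed with 10,000 people per group and
# then with only 100 per group. Nothing changes except the sample size.
from scipy.stats import ttest_ind_from_stats

for n_per_group in (10_000, 100):
    result = ttest_ind_from_stats(
        mean1=.50006, std1=.00160, nobs1=n_per_group,  # college students
        mean2=.49999, std2=.00155, nobs2=n_per_group,  # non-college students
        equal_var=True,
    )
    print(f"N = {n_per_group} per group: t = {result.statistic:.2f}, "
          f"p = {result.pvalue:.3f}")

# With 10,000 per group, p comes out around .002 ("significant!"); with 100
# per group, the identical means and SDs give p around .75: nothing at all.
```

Same tiny difference, same spreads; only the sample size changed.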
***
- Or that people are actually younger after listening to "When I'm Sixty-Four" by the Beatles than after listening to an instrumental control song!
- The linked article actually protests current publication practices and researcher degrees of freedom; that is, the different ex post facto decisions that researchers often make in order to obtain statistically significant differences.
This is exactly why I pound the figurative table so hard about using effect sizes and well-designed, targeted experimental research. Don't just run NHST procedures on autopilot, or collect a huge dataset and mine for significance, or draw conclusions based solely on the arbitrary p ≤ .05 standard.
“But that's not how math works! How is the .05 standard arbitrary? And where did it come from?” Well, Gigerenzer (2004) identifies the source of this practice as a 1935 textbook by the influential statistician Sir R.A. Fisher, and Gigerenzer also notes that Fisher himself wrote in 1956 that the practice of always relying on the .05 standard is “absurdly academic” and is not useful for scientific inquiry!
So, one of the early thinkers on whose work current psychological statistics is based would likely recoil in horror at what statistical practice in our field has become today! [Note, however, that Cowles and Davis (1982) identified similar, though less absolute, rules about an older statistical measure called ‘probable error.’]
Remember the greatest scientific discoveries: gravity, the laws of thermodynamics, Darwin's description of natural selection, Pavlov's discovery of classical conditioning. Not one of them relied on anything like p-values.
Not.
One.
There is truly no substitute—none whatsoever—for thinking critically about the quality of your research design, the strengths and limitations of your procedure, and the size and replicability of your effect. Attempts to automate interpretation based on the .05 standard (or any such universal line-in-the-sand!) result in most researchers pumping out mounds of garbage and hoping to find a diamond in the rubbish heap, rather than setting out specifically to find a genuine diamond...
Conclusion? The validity of most psychological research is questionable (at best)! We're taught to base research around statistical procedures that are of dubious help in understanding a phenomenon—and our work is almost always published solely on that basis! This pervasive problem will not be easy to fix: we need the entire field to stop doing analyses on autopilot, and to start thinking deeply and critically!
The most powerful evidence is, and will always be, to show that an effect occurs over, and over, and over again.
***
If you need further explanations, here are a couple helpful links:
Some interesting links on the investigation of people who claim to have paranormal powers: