Sunday, January 8, 2017

What You Think You Know About Psychology is Wrong

The limitations of null hypothesis significance testing

By: Zach Basehore

Are college students psychic?!

Let's say someone claims that people who go to college are more psychic than people who do not attend college. So I decide to test this claim!

How would I do that? Well, a simple test would be to examine people's ability to correctly predict whether a coin will land on heads or tails when I flip it. There are 10,000 college students and 10,000 non-college students; each person predicts the results of 100,000 coin flips, one flip at a time.

The results:
Each participant had a proportion of correct predictions. The mean proportion of correct predictions among college students was .50006 (that is, 50.006% correct), and the non-college-students had a mean proportion of correct predictions equal to .49999. The SDs are .00160 and .00155, respectively.

When you run an independent-samples t test, this difference is statistically significant at an alpha level of .01! The 95% CI for the difference is also quite narrow (indicating that the difference between the group means has been estimated very precisely).
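If you'd rather check that claim than take it on faith, here is a minimal Python sketch that reruns the test from the summary statistics above. The function name is made up for illustration, it assumes a pooled-variance t test, and it uses a normal approximation for both the p-value and the 1.96 multiplier in the CI (reasonable with 19,998 degrees of freedom, but still an approximation).

```python
import math

def t_test_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance independent-samples t test from summary statistics.

    Hypothetical helper for illustration only.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # SE of the difference
    diff = mean1 - mean2
    t = diff / se
    # Assumption: with df this large, the t distribution is close to normal,
    # so a two-tailed normal p-value and a 1.96 multiplier are used.
    p = math.erfc(abs(t) / math.sqrt(2))
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return t, df, p, ci

t, df, p, ci = t_test_from_summary(0.50006, 0.00160, 10_000,
                                   0.49999, 0.00155, 10_000)
print(f"t({df}) = {t:.2f}, approximate p = {p:.4f}")
print(f"approximate 95% CI for the difference: {ci[0]:.6f} to {ci[1]:.6f}")
```

The t statistic comes out a little above 3, the approximate p-value is below .01, and the whole interval sits just above zero.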

So the statistical test gives us very strong evidence that college students really are more prescient than non-college students! We've made a new discovery that revolutionizes our understanding of the human mind, and opens up a whole new field of inquiry! Why are college students more psychic? Is it because they're smarter? More sensitive? Do they pay closer attention to the world around them?

The problem:
In this example, I've found evidence of psychic abilities! Specifically, I've shown that college students predict the outcome of coin flips more accurately than non-college students (and if the null hypothesis is true at the population level, there's less than a 1% probability of finding a difference at least this large by chance alone)! How exciting! I can establish a huge name for myself among scientific psychologists, and have my pick of schools at which to continue my groundbreaking research! I could continue this research at Oxford… nah, let's find a better climate, like Miami or USC. I could get multi-million-dollar grants to fund an elaborate lab with fancy equipment! I can give TED talks, write books, and go on lucrative speaking tours! My research will grab headlines the world over! I’ll be a household name!

The gut-check:
But wait a second...what was the actual difference again? On average, college students are right on 7 more trials (out of 100,000) than non-college students?...

Any time you gather real-world data, you’d expect there to be some small difference between groups, even if it’s really not due to any systematic effect. In the research described above, everything happened in just the right way to give me a spurious result:
  • 1 - low variance within each group [thanks in part to the excessive number of coin flips per participant; see the law of large numbers];
  • 2 - a small but statistically significant difference that can easily be explained by a seemingly reasonable mechanism, and
  • 3 - a very large sample.
These factors explain how I found a statistically significant difference between college students and non-college students despite the tiny difference in means.
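The low within-group variance is easy to sanity-check. Here is a rough Python sketch, assuming every participant is purely guessing (so each prediction has a 50% chance of being right); the variable names are just for illustration.

```python
import math

flips_per_person = 100_000
people_per_group = 10_000
p_correct = 0.5  # assumption: nobody is psychic, everyone is guessing

# SD of one person's proportion correct (binomial proportion)
sd_individual = math.sqrt(p_correct * (1 - p_correct) / flips_per_person)

# Standard error of a group mean built from 10,000 such people
se_group_mean = sd_individual / math.sqrt(people_per_group)

print(f"SD of an individual's proportion correct: {sd_individual:.5f}")
print(f"SE of a group mean: {se_group_mean:.7f}")
```

Pure guessing over 100,000 flips gives an individual SD of about .00158, right in line with the .00160 and .00155 reported above. Averaging over 10,000 people then shrinks the uncertainty in each group mean by another factor of 100, which is why a gap of .00007 is enough to reach significance.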

Excited by the significant result and the potential to trumpet my exciting new ‘discovery’ [thereby launching a career, positioning myself as an expert who can charge ridiculously high consulting or speaking fees], I've failed to critically evaluate the implications of my results. And therefore, I've failed as a scientist. :(

How can we avoid falling into that trap?

One solution:
A standardized measure of effect size, like Cohen's d, will reveal what SHOULD be obvious from a look at the raw data: this difference between groups is tiny and practically insignificant, and it shouldn't convince anyone that college students are actually psychic!

In the spirit of scientific inquiry, you can test this for yourself! At GraphPad QuickCalcs, enter a mean of .50006 for Group 1 and .49999 for Group 2. Next, enter the SD of .00160 for Group 1 and .00155 for Group 2. The N for each group is 10000. Hit "Calculate now" and see what you get.

Now, enter the same means and SDs, but change the N to 100 for both groups, and observe the results.

Then, go to the Cohen's d calculator here and enter the same information (it doesn't ask for sample size). So what does all of this information mean?…

I’ve already done the easy part for you:

[Calculator screenshots: t test with a total sample of 20,000; t test with a total sample of 200; Cohen's d]
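If you don't feel like clicking through the calculators, the comparison can be sketched in a few lines of Python. This is only an illustration: the function names are invented, it assumes equal group sizes and a pooled SD, and it relies on the fact that under those assumptions the t statistic equals d times the square root of n/2.

```python
import math

def cohens_d(mean1, sd1, mean2, sd2):
    """Cohen's d for two equal-sized groups, using the pooled SD."""
    pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)
    return (mean1 - mean2) / pooled_sd

def t_from_d(d, n_per_group):
    """t statistic implied by d when both groups have n_per_group members."""
    return d * math.sqrt(n_per_group / 2)

d = cohens_d(0.50006, 0.00160, 0.49999, 0.00155)
t_large = t_from_d(d, 10_000)  # 20,000 participants in total
t_small = t_from_d(d, 100)     # 200 participants in total

print(f"Cohen's d = {d:.3f}")
print(f"t with 10,000 per group = {t_large:.2f}")
print(f"t with 100 per group = {t_small:.2f}")
```

The effect size stays put at roughly 0.04 (far below the 0.2 that is conventionally called "small"), while t drops from about 3.1 to about 0.3 when the sample shrinks. The only thing that changed is N.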


Statistical significance is a concept that has been called idolatrous, mindless, and an educational failure that proves essentially nothing! But every psychology major and minor has to learn it nonetheless...

The absurd focus on p-values in many social science fields (like psychology, education, economics, and biomedical science) leads to articles like the highly influential John Ioannidis piece Why Most Published Research Findings Are False, which has been cited over 4000 times!

A variety of ridiculous conclusions have been published based on small p-values.

This is exactly why I pound the figurative table so hard about using effect sizes and well-designed, targeted experimental research. Don't just run NHST procedures on autopilot, or collect a huge dataset and mine for significance, or draw conclusions based solely on the arbitrary p < .05 standard.
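To see why mining for significance is dangerous, here is a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: 2,000 imaginary studies, 50 people per group, scores drawn from the same normal distribution for both groups (so there is no real effect at all), and a two-tailed critical t of about 1.984 for 98 degrees of freedom.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Pooled-variance t statistic for two equal-sized samples."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    se = math.sqrt(pooled_var * 2 / n)
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def false_positive_rate(n_studies=2000, n_per_group=50,
                        critical_t=1.984, seed=1):
    """Share of null studies that reach p < .05 anyway (hypothetical setup)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Both groups come from the same population: the null is true.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(t_statistic(a, b)) > critical_t:
            hits += 1
    return hits / n_studies

rate = false_positive_rate()
print(f"'Significant' results with no true effect: {rate:.1%}")
```

Roughly one study in twenty should come out "significant" even though nothing is going on, so anyone who runs enough comparisons on a big dataset is all but guaranteed to find something to publish.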

But that's not how math works! How is the .05 standard arbitrary? And where did it come from? Well, Gigerenzer (2004) identifies the source of this practice as a 1935 textbook by the influential statistician Sir R.A. Fisher, and Gigerenzer also notes that Fisher himself wrote in 1956 that the practice of always relying on the .05 standard is absurdly academic and is not useful for scientific inquiry!

So, one of the early thinkers on whose work current psychological statistical practice is based would likely recoil in horror at what has become of statistical practice in our field today! [Note, however, that Cowles and Davis (1982) identified similar, though less absolute, rules about an older statistical practice called probable error.]

Remember the greatest scientific discoveries, such as gravity, the laws of thermodynamics, Darwin's description of natural selection, and Pavlov's discovery of classical conditioning: not one relied on anything like p-values.



There is truly no substitute, none whatsoever, for thinking critically about the quality of your research design, the strengths and limitations of your procedure, and the size and replicability of your effect. Attempts to automate interpretation based on the .05 standard (or any such universal line-in-the-sand!) result in most researchers pumping out mounds of garbage and hoping to find a diamond in the rubbish heap, rather than setting out specifically to find a genuine diamond...

Conclusion? The validity of most psychological research is questionable (at best)! We're taught to base research around statistical procedures that are of dubious help in understanding a phenomenon, and our work is almost always published solely on that basis! This pervasive problem will not be easy to fix: we need the entire field to stop doing analyses on autopilot, and to start thinking deeply and critically!

The most powerful evidence is, and will always be, to show that an effect occurs over, and over, and over again.

