

Sunday, April 15, 2018

Why I hate teaching the classics

I’m approaching the end of my first semester teaching Intro to Social Psychology. As someone who came of age during the peak of the replication crisis (Bem, Stapel, the Reproducibility Project), who studies publication bias, and who has had a hard time finding statistically significant results, I generally have a dim view of big chunks of the literature. I was worried that we would have very little to talk about given all the uncertainty, but we’ve made a good semester of it by talking about the general ideas, their strengths and weaknesses, and the opportunities for a young scientist to contribute by addressing those uncertainties.

But this semester’s teaching has taught me one thing: I hate teaching the classics.
What makes these studies classics, and why do I hate teaching them? The studies that my textbooks present as classics tend to share a few attributes, some desirable and others undesirable.
The desirable:

  1. They provide a useful summary of some broader theory.
  2. They are catchy or sticky in a way that makes them easy to remember and fun to talk about.
  3. The outcome is provocative and interesting.

The undesirable:

  1. The sample size is tiny.
  2. The p-values are either marginal or bizarrely good. 
  3. The outcome has little evidence of validity.
  4. Data from the classic study tend to predate strong tests of the theory by several decades. The strongest evidence tends to come later (if at all) when people have cleaned up the methods and run more studies (often in response to criticism).
My concern is that these qualities of classics give students the wrong idea about what makes for good psychological science, leading them to embrace the desirable attributes of these classics without considering the undesirable attributes.

Some classics that I’ve struggled with this semester:
Fredrickson et al. (1998): In this classic study on the harms of self-objectification, wearing a swimsuit (vs. a sweater) caused women (but not men) to do worse on a math test, N = 82, p = .051.
Pennebaker & Beall (1986): In this classic study on the benefits of self-expression, students who wrote about a traumatic experience enjoyed better health, N = 46, p = .055 for health center visits, p = .10 for sick days, p = .040 for total health problems.
Rosenthal & Jacobson (1968): In this classic study on how expectations shape outcomes, students labeled as “about to bloom” gained more IQ than other students. Unfortunately, the data are insane: many students scored well outside the range of the test, with pre-post changes on the scale of hundreds of IQ points (see Snow, 1995; hat tip to Bob C-J).
Srull & Wyer (1979): In this classic study of how primes influence perceptions of others, primes influenced perceptions up to days later. Unfortunately, the data show an effect too insanely powerful to be true: meta-analyzing this literature, DeCoster & Claypool (2004) estimate Srull & Wyer’s result at d = 5.7. (For reference, obvious effects like “men are taller than women” run about d = 1.85; Simmons, Nelson, & Simonsohn, 2013. The first sketch after this list translates these numbers into something more intuitive.)
Festinger & Carlsmith (1959): In this classic study of cognitive dissonance, participants given a small bribe to say a boring task was fun changed their opinion of the boring task. Unfortunately, the published results contain a number of GRIM errors (a sketch of the GRIM logic follows below).
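
To put d = 5.7 in perspective, here is a minimal Python sketch (assuming two equal-variance normal distributions, the textbook model behind Cohen’s d) that converts an effect size into the probability that a randomly drawn member of the higher-scoring group outscores a randomly drawn member of the lower-scoring group:

```python
from statistics import NormalDist

def superiority_prob(d: float) -> float:
    """Probability that a random draw from the higher-scoring group exceeds
    a random draw from the lower-scoring group, given two equal-variance
    normal distributions whose means are separated by Cohen's d."""
    return NormalDist().cdf(d / 2 ** 0.5)

print(round(superiority_prob(1.85), 3))  # sex difference in height: ~0.905
print(round(superiority_prob(5.70), 5))  # Srull & Wyer estimate: ~0.99997
```

At d = 1.85, a randomly chosen man is taller than a randomly chosen woman about 90% of the time; at d = 5.7, the primed participant would beat the unprimed one in roughly 99.997% of pairings. A priming manipulation more deterministic than the sex difference in height should raise eyebrows.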
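And for readers unfamiliar with GRIM (Granularity-Related Inconsistency of Means; Brown & Heathers, 2016): the mean of integer data from N participants must be a multiple of 1/N, so some reported means are arithmetically impossible. Here is a minimal sketch of the check, using invented numbers rather than Festinger & Carlsmith’s actual values:

```python
import math

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Could a mean reported to `decimals` places arise from integer-valued
    data with sample size n? The underlying sum must be an integer, so only
    the integers adjacent to reported_mean * n are candidates."""
    target = round(reported_mean, decimals)
    for total in (math.floor(reported_mean * n), math.ceil(reported_mean * n)):
        if round(total / n, decimals) == target:
            return True
    return False

print(grim_consistent(5.15, 20))  # True: a sum of 103 gives exactly 5.15
print(grim_consistent(5.19, 20))  # False: no integer sum over 20 rounds to 5.19
```

A production GRIM check would also have to handle ambiguous rounding conventions; this sketch just shows the core granularity argument.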
This isn’t to say that the classics are bad science, especially for their time. My concern is that their evidence is much weaker than one might expect given their status as classics. At times it feels like I am teaching the history of psychology rather than the science of psychology, as though knowing about the peg-turning experiment were supposed to stand in for some deeper knowledge.
Figure 1. Me and my fellow troublemakers (periphery) complaining about a classic study (center).
What’s the problem?
My concern is that these classics set a bad example for young scientists and do not prepare them to think about science according to modern standards. The lesson of these classics is that one collects a little data on a new, untested method, and so long as the p-value isn’t too far from significance, one can make an argument about how the mind works. If the idea is catchy enough, the citations will roll in forever, and few will talk about the weaknesses of the evidence. As Daryl Bem said in his recent interview with Dan Engber, “I’m all for rigor, but I prefer other people do it. […] If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made.”
This isn’t to say that the theories proposed by these classics are necessarily wrong. It’s just hard to teach these originals while talking about how weak that one study is; discrediting one operationalization may unjustifiably discredit the broader idea. Maybe the whole Festinger & Carlsmith peg-turning, subject-bribing method is bunk, but cognitive dissonance is so much stronger and broader an idea that it seems impossible to discard it. In that light, is it really important how Festinger & Carlsmith did it? Couldn’t we instead cite something that demonstrates the core idea with a little more refinement or rigor?
In the “Creativity and Rigor” episode of The Black Goat, Sanjay, Simine, and Alexa talk about the problem of framing creativity and rigor as enemies. This framing sets science up as a battle between the creative, idea-generating geniuses and the rigorous, pencil-pushing critics. It doesn’t have to be this way, they argue: the goals of rigor and creativity are aligned. Testing interesting ideas in useful ways requires both.
It’s my concern that teaching these cool-idea, weak-evidence studies as the classics may lead students to value creativity without rigor. When we canonize these early studies, we honor them for their interesting ideas or provocative manipulations, but we overlook all their weaknesses in sample size and measurement validity.
Figure 2. A brilliant idea occurs to a psychologist in 1972. The psychologist will demonstrate its truth in a sample of 28 undergraduates with a p-value of .063, an event which will be remembered by textbooks forever.

What should we do?
I would like to see more textbooks credit both the original idea and some of the stronger methods and samples that followed. In this way, we could teach both the origin of a theory and the best science involved in testing it. If newer, stronger data are not available, that should be presented as a weakness of the literature and an opportunity for students to run their own studies.
This is probably not easy to do. The classics have a lot of momentum and citations, which makes them easy to discover; finding newer, more rigorous studies and writing them up for textbooks will take more work. I think it will be worth it. It will communicate to students our values as a scientific discipline, and it will give more credit and more attention to psychology as an empirical science, not just a system for generating cool ideas.