The legacy of social psychology

To anyone teaching psychology.

In this post I express some concerns about the prestige given to ‘classic’ studies, which are widely taught in undergraduate social psychology courses around the world. I argue that rather than just demonstrating a bunch of clever but dodgy experiments, we could teach undergraduates to evaluate studies for themselves. To exemplify this, I quickly demonstrate power, Bayes factors, the p-checker app and the GRIM test.

psychology’s foundations are built not of theory but with the rock of classic experiments

Christian Jarrett

Here is an out-of-context quote from Sanjay Srivastava from a while back:


This got me thinking about why and how we teach classic studies.

Psychologists usually lack the luxury of well-behaved theories. Some have thus proposed that the classic experiments, which have survived in the literature to the present day, serve as the bedrock of our knowledge 1. In the introduction to a book retelling the stories of classic studies in social psychology 2, the authors note that classic studies have “played an important role in setting the research agenda for the field as it has progressed over time” and “serve as common points of reference for researchers, teachers and students alike”. The authors continue by pointing out that many of these classics lacked sophistication, but that this is in fact a feature of their enduring appeal, as laypeople can understand the “points” the studies make. Exposing the classics to modern statistical methods would thus miss their point.

Now, this makes me wonder: if the point of a study is not to assess the existence of a phenomenon, what in the world might it be? One answer would be to serve as historical examples of practices no longer considered scientific, but I doubt this is what’s normally thought. Notwithstanding, I wanted to dip into the “foundations” of our knowledge by demonstrating the use of some more-or-less recently developed tools on a widely known article. According to Google Scholar, the Festinger and Carlsmith cognitive dissonance experiment 3 has been cited over three thousand times, so its influence is hard to downplay.


But first, a necessary digression: statistical power is the probability of obtaining a “significant” result, given that an effect of the postulated size really exists. As explained in Brunner & Schimmack 4, it is an interesting anomaly that the statistical power of studies in psychology is usually small, yet almost all of them end up reporting “significant” results. As to how small, average power doubtfully exceeds 50% 5–7, and for small (conventional?) effect sizes, the mean has been estimated to be as low as 24%. As a recent replication project regarding the ego depletion effect 8 exemplified, a highly “replicable” (as judged by the published record) phenomenon may turn out to be a fluke when null findings are taken into account. This has recently made psychologists consider the uncomfortable possibility that entire research lines consisting of “accumulated scientific evidence” may in fact not contain that much evidence 9,10.
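These power figures are easy to reproduce without specialist software. For a two-sided, two-sample t-test, power follows from the noncentral t distribution; here is a minimal sketch in Python (assuming scipy is available), mirroring the kind of calculation G*Power performs:

```python
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test to detect effect size d."""
    df = 2 * n_per_group - 2
    # Noncentrality parameter for equal group sizes
    ncp = d * (n_per_group / 2) ** 0.5
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability mass of the noncentral t beyond the critical values
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# Cohen's classic benchmark: d = 0.5 needs about 64 per group for ~80% power
print(round(two_sample_power(0.5, 64), 2))  # ~0.8
```

The same function reproduces the figures for any sample size and effect size you care to show students.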

So, what was the statistical power of Festinger and Carlsmith? Using G*Power 11, it turns out that they had an 80% chance of detecting a humongous effect of d = 0.9, and only a coin flip’s chance of finding a (still large) effect of d = 0.64. Now, if an underpowered study finds an effect, under current practices the estimate is likely to be exaggerated, and may even have the wrong sign 12. Here would be a nice opportunity to demonstrate these concepts to students.
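The exaggeration (Type M) and sign (Type S) errors are easy to demonstrate by simulation: generate many underpowered two-group experiments and look only at the ones that reach significance. A sketch, assuming for illustration a true effect of d = 0.3 (not a value from the paper) and 20 participants per group:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n, sims = 0.3, 20, 5000
sig_effects = []

for _ in range(sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        # Both groups have SD 1 by construction, so the mean
        # difference approximates the observed Cohen's d
        sig_effects.append(treatment.mean() - control.mean())

print(f"Power: {len(sig_effects) / sims:.0%}")
print(f"Mean 'significant' effect: {np.mean(sig_effects):.2f}")
```

Conditional on significance, the average estimate comes out far above the true 0.3, which is exactly the Type M problem Gelman & Carlin 12 describe.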

Considering the low power, it may not come as a surprise that the evidence the study provided was weak to begin with. A Bayes factor (BF) is an indicator of the evidence for one hypothesis relative to another. In this case, a BF of ~3 moves an impartial observer from being 50% sure the experiment works to being 75% sure, or a skeptic from being 25% sure to being 43% sure that the effect is small instead of nil.
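The updating arithmetic is just Bayes’ rule in odds form: posterior odds = BF × prior odds. A minimal sketch (using a round BF of 3 for illustration; the exact posteriors depend on the precise BF value):

```python
def posterior_prob(prior_prob, bayes_factor):
    """Update a prior probability by a Bayes factor (odds form of Bayes' rule)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# An impartial observer (50% prior) moves to 75% with a BF of 3
print(posterior_prob(0.50, 3))  # 0.75
```

Letting students plug in their own priors makes the point that the same evidence moves different observers by different amounts.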

It would be relatively simple to introduce Bayes factors with this study. The choice of prior scale does not matter much here for any reasonable value, as exemplified by a plot made in JASP with two clicks:

Figure 1: Bayes factor robustness check for the main finding of the dissonance study. Plotted in JASP, using n = 20 for both groups, a t-value of 2.48 and a Cauchy prior scale of 0.4.

Nowadays it is easy to check whether a paper correctly reports its test statistics and their associated p-values. The p-checker app (this link feeds the relevant statistics to the app) can do this, and it turns out that most of the t-values in the paper are incorrectly rounded down (assuming that “significant at the 0.08 level” means p < 0.08). You can demonstrate this by including the link on your slides, using it to go to p-checker and choosing “p-values correct?”.
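The core of this check is easy to show by hand: recompute the two-tailed p-value from a reported t statistic and its degrees of freedom, and compare it with what the paper claims. For instance, with the t = 2.48 and df = 38 used in the robustness plot above:

```python
from scipy import stats

def p_from_t(t, df):
    """Two-tailed p-value for a reported t statistic."""
    return 2 * stats.t.sf(abs(t), df)

print(round(p_from_t(2.48, 38), 3))
```

Doing this for every test statistic in a paper is tedious by hand, which is precisely what p-checker automates.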

Finally, you can look at the study using the GRIM test 13, which evaluates whether the reported means are mathematically possible. As it turns out, a quarter of the reported means in the table with the main results do not pass the test. One more time: 25% of the reported means are mathematically impossible. The most likely explanation is shoddy reporting of means or accidental misreporting of sample sizes, but I find it telling that, to my knowledge at least, the issue has not come up in fifty years of scientific investigation.

Figure 2: Main results table of the Festinger & Carlsmith study. Circled means are mathematically impossible given the reported sample sizes.
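The GRIM logic itself fits in a few lines and makes a nice classroom exercise: with n integer-valued responses, the mean can only take values k/n, so a reported rounded mean either matches one of those fractions or it does not. A sketch (the example means below are hypothetical, not values from the paper):

```python
def grim_consistent(reported_mean, n, decimals=2):
    """GRIM test: can a mean reported to `decimals` places arise
    from n integer-valued responses?"""
    # The nearest achievable mean is an integer total divided by n
    nearest_total = round(reported_mean * n)
    achievable_mean = nearest_total / n
    return round(achievable_mean, decimals) == round(reported_mean, decimals)

print(grim_consistent(2.35, 20))  # True: 47 / 20 = 2.35
print(grim_consistent(2.33, 20))  # False: no integer total / 20 gives 2.33
```

Note that GRIM is only diagnostic when n is small relative to the reporting precision (e.g. two decimals and n well under 100), which holds for the small samples typical of classic studies.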

Now, even though I have doubts about this study, as well as about the process by which the theory has “evolved” 14, this does not mean that cognitive dissonance effects do not exist. It is just that the research may not have been able to capture the essence of this everyday phenomenon (which, if it exists, can influence behaviour without the help of academics). Under the traditional paradigm of psychological science, fraught with publication bias and unhelpful incentives 10, a Registered Replication Report (RRR)-type effort would be needed, and even that could only test one operationalisation. As an undergraduate, I would have been exhilarated to hear early on about how and why such initiatives work, and why the approach is much more informative than any single experiment.

Returning to the notion that the bedrock of psychology consists of classic experiments, rather than of theories as in the natural sciences 1: perhaps we need a more solid foundation, regardless of whether some flashy findings from decades ago happened to spur a progressive-ish 15,16 line of research.

How would such a foundation come to be? Maybe teaching could play a role?


  1. Jarrett, C. Foundations of sand? The Psychologist 21, 756–759 (2008).
  2. Smith, J. R. & Haslam, S. A. Social psychology: Revisiting the classic studies. (SAGE Publications, 2012).
  3. Festinger, L. & Carlsmith, J. M. Cognitive consequences of forced compliance. The Journal of Abnormal and Social Psychology 58, 203–210 (1959).
  4. Brunner, J. & Schimmack, U. How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies. (2016).
  5. Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14, 365–376 (2013).
  6. Cohen, J. Things I have learned (so far). American psychologist 45, 1304 (1990).
  7. Sedlmeier, P. & Gigerenzer, G. Do studies of statistical power have an effect on the power of studies? Psychological bulletin 105, 309 (1989).
  8. Hagger, M. S. et al. A multi-lab pre-registered replication of the ego-depletion effect. Perspectives on Psychological Science (2016).
  9. Earp, B. D. & Trafimow, D. Replication, falsification, and the crisis of confidence in social psychology. Front. Psychol 6, 621 (2015).
  10. Smaldino, P. E. & McElreath, R. The Natural Selection of Bad Science. arXiv preprint arXiv:1605.09511 (2016).
  11. Faul, F., Erdfelder, E., Lang, A.-G. & Buchner, A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods 39, 175–191 (2007).
  12. Gelman, A. & Carlin, J. Beyond Power Calculations Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science 9, 641–651 (2014).
  13. Brown, N. J. L. & Heathers, J. A. J. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science (2016). doi:10.1177/1948550616673876
  14. Aronson, E. in The science of social influence: Advances and future progress (ed. Pratkanis, A. R.) 17–82 (Psychology Press, 2007).
  15. Lakatos, I. History of science and its rational reconstructions. (Springer, 1971).
  16. Meehl, P. E. Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry 1, 108–141 (1990).


How lack of transparency feeds the beast

This is a presentation I gave for the young researchers’ branch of the Finnish Psychological Society. I show how low power and lack of transparency can lead to weird situations, where the published literature contains little or no knowledge.


Markus Mattsson, Leo Aarnio and I had great fun at the seminar, presenting to a great audience of eager young researchers.

The slides for my talk are here:

If you’re interested in more history and solutions, check out Felix Schönbrodt‘s slides here. Some pictures were made adapting code from a wonderful Coursera MOOC by Daniel Lakens. For Bayes, check out Alexander Etz‘s blog.

Oh, and for the monster analogy; this piece made me think of it.

The myth of the magical “Because”

In this post I try to answer the call for increased transparency in psychological science by presenting my master’s thesis. I ask for feedback on the idea and the methods. I’d also appreciate suggestions about which journal would be a sensible target for the paper I’m now starting to write with co-authors. Check OSF for the documents: the thesis is here (33 pages), analysis code and plots here (I presented the design analysis in a previous post).

In my previous career as a marketing professional, I was often enchanted by news about behavioral science. Such small things could have such large effects! When I moved into social psychology, it turned out that things weren’t quite so simple.

One study that intrigued me was done in the 70s, and has since gained huge publicity (see here and here, for examples). The basic story is that you can use the word because to get people to do things, due to a learned “reason → compliance” link.


Long story short, I was able to experiment in a within-trial setting of a health psychology intervention. Here’s a slideshow adapted from what I presented in the annual conference of the European Health Psychology Society:


Things I’m happy about:

  • Maintaining a Bayes Factor / p-value ratio of about 1:2. It’s not “a B for every p“, but it’s a start…
  • Learning basic R and redoing all analyses at the last minute, so I wouldn’t have to mention SPSS 🙂
  • Figuring out how this pre-registration thing works, and registering before end of data collection.
  • Using the word “significant” only twice and not in the context of results.

Things I’m not happy about:

  • Not having pre-registered before starting data collection.
  • Not knowing what I now know, when the project started. Especially about theory formation and appraisal (Meehl).
  • Not having an in-depth understanding of the mathematics underlying the analyses (although math and logic are priority items on my stuff-to-learn-list).
  • Not having the data public… yet. It will be public by 2017 at the latest, but hopefully already this autumn.

A key factor for fixing psychological science is transparency; making analyses, intentions and data available for all researchers. As a consequence, anyone can point out inconsistencies and use the findings to elaborate on the theory, making accumulation of knowledge possible.

Science is all about predicting, and everyone knows how anyone can say “yeah, I knew that’d happen”. The most impressive predictions are those made well before things start happening. So don’t be like me: pre-register your study before the start of data collection. It’s not as hard as it sounds! For clinical trials, this can be done for free in the WHO-approved German Clinical Trials Register (DRKS). For all trials, the Open Science Framework (OSF) website can be used for pre-registering plans and protocols, as well as for making study data available to researchers everywhere. There’s also an extremely easy-to-use pre-registration site, AsPredicted.

One can also use the OSF website as a cloud server to privately manage one’s workflow (for free). As a consequence, automated version control protects the researcher in the case of accusations of fraud or questionable research practices. Check the site out by browsing my thesis here (33 pages) or analysis code and plots here.

ps. If there’s anything weird in that thesis, it’s probably because I have disregarded some piece of advice from Nelli Hankonen, Keegan Knittle and Ari Haukkala, for whose comments I’m indebted to.

Defeating the crisis of confidence in science: 3 + 3 ideas

[Update 6. March 2016: new figure for the Bayesian RP:P + some minor changes]

The first thing you need to know about practical science is that it is not a miraculous (or often, even awesome) way to learn about the world. As put in the excellent blog Less Wrong, it is just the first set of methods that isn’t totally useless when trying to make sense of the modern world. Although problems are somewhat similar in all sciences, I will focus on psychology here.

One of the most important projects in the history of psychology was published in the journal Science at the end of August. In the “Reproducibility Project: Psychology”, 356 contributors tried to re-do 100 studies published in high-profile psychology journals (all in 2008). (You can download the open data and see further info here.) Care was taken to mimic the original experiments as closely as possible, just with many more participants for increased reliability.

The results? Not too flattering: the effects in the replications were only about half as large as in the original studies. Alexander Etz provides an informative summary in this figure from his blog post about a recent paper:

If the Bayes factor (B) is somewhere between 1/10 and 10 (or, for example, between 1/3 and 3, if you have less strict evidence standards), you can’t draw confident conclusions. Most of the replicated studies (the ones in the lower left corner) never contained much information in the first place!

The number of “successes” in different fields of psychology depends on how you count and what you include (for example, what you think counts as social or cognitive psychology). For social psychology, the success rate of replication was somewhere between 8% (3 out of 38; by Replication Index) and 25% (14 out of 55; the paper in Science). Even conceding that the scientific method is not perfect, as referred to in the beginning, this was not what I expected to see.

My thoughts and beliefs often torment the hell out of me, so I’ve learned to celebrate when they turn out to be false. Thus, I ended up informing people at Helsinki University’s discipline of social psychology by replicating a “Friday-cake-for-no-reason” from a month earlier by one awesome colleague:

Behold; the Replicake!

The messenger cake worked well in Helsinki, but unfortunately the news was too bitter a dish for many. The results drew several confused reactions from psychologists who refused to believe the sorry state of the status quo. I like the term “hand-wringing” as a description, the loudest arguments being (in no specific order):

  1. The replicators didn’t know what they were doing.
  2. The studies replicated weren’t representative of the actual state of art.
  3. This is how science is supposed to work, no cause for alarm!
    1. … because some fields are doing even worse (e.g. replicability of cancer biology may be just 10%-25%, economics ~49%, psychiatry less than 40% etc.)
    2. … because non-replications are a part of the self-correcting nature of science.

Andrew Gelman answers these points eloquently, so I won’t go much deeper into them. Note also, that Daniel “Stumbling on Happiness” Gilbert & co. used these arguments in their much publicised (but unhappily, flawed) critique of the psychology’s replication effort.

Suffice it to say that I value practicality; claiming there is a phenomenon only you can show (ideally, when no-one is looking) doesn’t sound too impressive to me. What worries me is this: science is supposed to embrace change and move forward with cumulative knowledge. Instead, researchers often take their favourite findings to a bunker and start shouting profanities at whoever wants to have a second look.

Researchers take their favourite findings to a bunker and start shouting profanities at whoever wants to have a second look.

I think anyone who’s seen the Internet recognises the issue. Personally, I find it hard to believe that arguing can change things, so I’d rather see people exemplify their values by their actions.

Sometimes I find consolation in Buddhist philosophy, so here are some thoughts maybe worth considering when you need to amp up your cognitive flexibility:

1. “You” are not being attacked.

Things are non-self. Just as wishing doesn’t make winning the lottery more likely, a thought of yours that turns out to be ill-informed doesn’t destroy “you”. Your beloved ideas, whether they concern the right government policy, the right way to deal with refugees or the right statistical methods in research, may turn out to be wrong. It’s okay. You can say you don’t know or don’t expect your view to be the final solution.

When Richard Ryan visited our research group, I asked him when he expects his self-determination theory to die. The answer came fast: “When it gets replaced by a higher-order synthesis”. He had thought about it, and I respect that.

2. The business of living includes stress, but it’s worse if you cling to stuff.

Wanting to hold on to things you like and to resist things you don’t is normal, but it takes up a lot of energy. You might want to try not gripping so hard and see whether it makes an actual difference in how long the pleasure or displeasure lasts. So: if your ideas are under fire, take a moment to think about what life would be like without whatever is being threatened.

One of the big ideas in science is that we need big ideas. And, of course, big ideas are exciting. The problem is that most ideas – big or small – will turn out to be wrong, and if we don’t want to be spectacularly wrong, we might want to take one small step at a time. As Daniel Lakens, one of the authors of the reproducibility project, put it:


3. Nothing will last (and this, too, will pass – but the past will never return).

Although calls for change in research practices have been made for at least half a century, this time the status quo really does seem to be going away fast. It might be a product of the accelerating change we see in all human domains. It’s impossible to predict how things will end up, but change isn’t going away. What you can do is try to create the kind of change that reflects what you think is right.

For an example in research, take statistical methods, where the problems with the whole “p < 0.05 → effect exists” approach have become more and more common knowledge in recent years. Another change is happening in publishing practices: we are no longer bound by the shackles of the printing press, which did serve science well for a long time. This means infinite storage space for supplements, and open data for anyone to confirm or elaborate upon another researcher’s conclusions. Of course, the traditional publishing industry isn’t happy to see its dominance crumble. But in the end, they too must change to avoid the fate of the (music industry) dinosaurs in this NOFX song from 15 years ago.

Change in action: Mentions of “Bayesian” in the English literature since the death of rev. Thomas Bayes in 1761. Click for source in Google Ngram. For an intro to Bayesian ideas, check out this or this.

Good research with real effects does exist!

The reproducibility project described above was actually not the first large-scale replication project in psychology. Projects called “Many Labs”, where effects in psychology are tested with different emphases, are just now beginning to bear fruit:

  • Many Labs 1 (over 6 000 participants; published fall 2014) picked 13 classic and contemporary effects and managed to replicate 10 consistently. Priming studies were found hard to replicate. Interestingly, the fact that most psychology studies are conducted on US citizens didn’t have much of an effect.
  • Many Labs 2 (ca. 15 000 participants; expected in October 2015) studied how effects vary across persons and situations.
  • Many Labs 3 (around 3 500 participants; currently in press) mainly studied so-called “semester effects”. As study participants are usually university students, it has been thought they might behave differently at different points of the semester. Apparently they don’t, which is good news. The not-so-good news is that only three of the original 10 results were replicated.
  • Many Labs 4 (in preparation phase) will study how replicator expertise affects replicability, as well as whether involving the original author makes a difference.

These projects will definitely increase our understanding of psychological science, although they suffer from some limitations themselves (such as the fact that really expensive studies attract fewer replication attempts, for practical reasons).

… It’s just really hard to tell what’s real and what’s not.

At the Cochrane Colloquium 2015, John “God of Meta-analysis” Ioannidis (the guy who published the 3000+ times cited paper Why Most Published Research Findings Are False) ended his presentation with a discouraging slide. He concluded that systematic reviews in biomedicine have become marketing tools with illusory credibility assigned to them.

The field I’m most interested in is health psychology. So when one of the world’s top researchers in the field tweeted that poorly performing meta-analyses are increasingly biasing psychological knowledge, I asked him to elaborate. Here’s his reply:


Susan Michie addressed the reproducibility problem in her talk at the annual conference of the European Health Psychology Society, with an emphasis on behavior change. She mostly addressed reporting, but questionable research practices are undoubtedly important, too.

Susan Michie presenting in EHPS 2015. Click to enlarge.

This became clear at the very same conference, when a PhD student told me how a professor had reacted to his null results: “Ok, send me your data and you’ll have a statistically significant finding in two weeks”. I have hope that young researchers are becoming more savvy with methods and more confident that the game of publishing can be changed. This opens the door for fraudulent authority figures to exit the researcher pool like Diederik Stapel – at the hands of their students, instead of through a failed peer-review process.

“Ok, send me your data and you’ll have a statistically significant finding in two weeks”.

– a professor’s reaction to null results


Based on all the above, here’s what I think makes science worth the taxpayers’ money:

  1. Sharing and collaborating. Not identifying with one’s ideas. Maybe openness to the possibility of being wrong is the first step towards true transparency?
  2. Doing good, cumulative research [1], even if it means doing less of it. Evaluating eligibility for funding by the number of publications (or related twisted metrics) must stop. [2]
  3. Studying how things can be made better, instead of just describing the problems. Driving change instead of clinging to the status quo!

[1] The need for better statistical education has been apparent for decades, but not much changed… until the 2010s.

[2] See here for reasoning. (Any thoughts on this alternative?)