Preprints, short and sweet

Photo courtesy of Nelli Hankonen

These are slides (with added text so they make more sense) from a small presentation I held at the University of Helsinki. Mainly of interest to academic researchers.

TL;DR: To get the most out of scientific publishing, we may need to imitate physics a bit, and bypass the old gatekeepers. If the slideshare below is of crappy quality, check out the slides here.

ps. if you prefer video, this explains things in four minutes 🙂

Deterministic doesn’t mean predictable

In this post, I argue against the intuitively appealing notion that, in a deterministic world, we just need more information and can use it to solve problems in complex systems. This is a problem in fields such as psychology, where more knowledge does not necessarily mean cumulative knowledge, or even improved outcomes.

Recently, I attended a talk where Misha Pavel happened to mention how big data can lead us astray, and how we can’t just look at data but need to know mechanisms of behaviour, too.

Misha Pavel arguing for the need to learn how mechanisms work.

Later, a couple of my psychologist friends happened to present arguments discounting this, saying that the problem will be solved thanks to determinism. Their idea was that the world is a deterministic place—if we knew everything, we could predict everything (an argument also known as Laplace’s Demon)—and that we eventually a) will know, and b) can predict. I’m fine with the first part, or at least agnostic about it. But there are more mundane problems with prediction than “quantum randomness” and other considerations about whether truly random phenomena exist. The thing is that even simple and completely deterministic systems can be utterly unpredictable to us mortals. I will give an example of this below.

Even simple and completely deterministic systems can be utterly unpredictable.

Let’s think of a very simple made-up model of physical activity, just to illustrate a phenomenon:

Say today’s amount of exercise depends only on motivation and on the previous day’s exercise. Let’s say people have a certain maximum amount of time to exercise each day, and that they vary from day to day in what proportion of that time they actually manage to exercise. To keep things simple, let’s say that if a person manages to do more exercise on Monday, they give themselves a break on Tuesday. People also differ in motivation, so let’s add that as a factor, too.

Our completely deterministic, but definitely wrong, model could be written as:

Exercise percentage today = (motivation) * (percentage of max exercise yesterday) * (1 – percentage of max exercise yesterday)

For example, if one had a constant motivation of 3.9 units (whatever the scale), and managed to do 80% of their maximum exercise on Monday, they would use 3.9 times 80% times 20% = 62% of their maximum exercise time on Tuesday. Likewise, on Wednesday they would use 3.9 times 62% times 38% = 92% of the maximum possible exercise time. And so on and so on.

We’re pretending this model is the reality. This is so that we can perfectly calculate the amount of exercise on any day, given that we know a person’s motivation and how much they managed to exercise the previous day.

Imagine we measure a person who obeys this model with a constant motivation of 3.9, and who starts out on day 1 reaching 50% of their maximum exercise amount. But let’s say there is a slight measurement error: instead of 50.000%, we measure 50.001%. In the graph below we can observe how the error (red line) quickly diverges from the actual values (blue line). The predictions we make from our model after around day 40 do not describe our target person’s behaviour at all. The slight measurement deviation has made this deterministic system practically chaotic and random to us.

This simple, fully deterministic system becomes impossible to predict in a short time, due to a measurement error of 0.001 percentage points. The blue line depicts actual values, the red line the measured ones. They diverge around day 35 and are soon completely off. [Link to gif]
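For those who want to play with the dynamics, the model is a one-liner. Below is a minimal Python sketch of the divergence (the code for the actual plot is linked in the resources; the names and the 60-day horizon here are my own choices):

```python
def exercise_model(x, motivation=3.9):
    # Today's exercise fraction from yesterday's (this is the logistic map)
    return motivation * x * (1 - x)

actual, measured = 0.50, 0.50001  # a 0.001 percentage-point measurement error
max_gap = 0.0
for day in range(1, 61):
    actual = exercise_model(actual)
    measured = exercise_model(measured)
    if day >= 30:  # by now the tiny error has blown up
        max_gap = max(max_gap, abs(actual - measured))

print(f"largest prediction error on days 30-60: {max_gap:.2f}")
```

The two trajectories track each other for a few weeks and then bear no resemblance to one another, just as in the figure.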

What are the consequences?

The model is silly, of course, as we would probably never try to predict an individual’s exact behaviour on any single day (averages and/or bigger groups help, because usually no single instance can kill the prediction). But this example does highlight a common feature of complex systems, known as sensitive dependence on initial conditions: even small uncertainties accumulate to create huge errors. It is also worth noting that increasing model complexity doesn’t necessarily help with prediction, due to problems such as overfitting (thinking the future will be like the past; see also why simple heuristics can beat optimisation).

Thus, predicting long-term path-dependent behaviour, even if we knew the exact psycho-socio-biological mechanism governing it, may be impossible in the absence of perfect measurement. Even if the world were completely deterministic, we still could not predict it, as even trivially small things left unaccounted for could throw us off completely.

Predicting long-term path-dependent behaviour, even if we knew the exact psycho-socio-biological mechanism governing it, may be impossible in the absence of perfect measurement.

The same thing happens when trying to predict something as simple as how billiard balls collide on a pool table. The first collision is easy to calculate, but to compute the ninth you already have to take into account the gravitational pull of people standing around the table. By the 56th impact, every elementary particle in the universe has to be included in your assumptions! Other examples include trying to predict the sex of a human fetus, or trying to predict the weather two weeks out (this is the famous idea of the butterfly flapping its wings).

Coming back to Misha Pavel’s points regarding big data, I feel somewhat skeptical about being able to acquire invariant “domain knowledge” in many psychological domains. Also, as shown here, knowing the exact mechanism is still no promise of being able to predict what happens in a system. Perhaps we should be satisfied when we can make predictions such as “intervention x will increase the probability that the system reaches a state where more than 60% of the goal is reached on more than 50% of the days, by more than 20% in more than 60% of the people who belong to a group it was designed to affect”?

But still: for determinism to solve our prediction problems, the amount and accuracy of data needed is beyond the wildest sci-fi fantasies.

I’m happy to be wrong about this, so please share your thoughts! Leave a comment below, or on these relevant threads: Twitter, Facebook.

References and resources:

  • Code for the plot can be found here.
  • The billiard ball example explained in context.
  • A short paper on the history of the butterfly (or seagull) flapping its wings.
  • To learn about dynamic systems and chaos, I highly recommend David Feldman’s course on the topic, next time it comes around at Complexity Explorer.
  • … Meanwhile, the equation I used here is actually known as the “logistic map”. See this post about how it behaves.


Post scriptum:

Recently, I was happy and surprised to see a paper attempting to create a computational model of a major psychological theory. In a conversation, Nick Brown expressed doubt:


Do you agree? What are the alternatives? Do we have to be content with vague statements like “the behaviour will fluctuate” (perhaps as in: fluctuat nec mergitur)? How should we study the dynamics of human behaviour?


Also: do see Nick Brown’s blog, if you don’t mind non-conformist thinking.


The art of expecting p-values

In this post, I try to present the intuition behind the fact that, when studying real effects, one usually should not expect p-values near the 0.05 threshold. If you don’t read quantitative research, you may want to skip this one. If you think I’m wrong about something, please leave a comment and set the record straight!

Recently, I attended a presentation by a visiting senior scholar. He spoke about how their group had discovered a surprising but welcome correlation between two measures, and subsequently managed to replicate the result. What struck me was his choice of words:

“We found this association, which was barely significant. So we replicated it with the same sample size of ~250, and found that the correlation was almost the same as before and, as expected, of similar statistical significance (p < 0.05)“.

This highlights a threefold, often implicit (but WRONG), mental model:

[EDIT: due to Markus’ comments, I realised the original, off-the-top-of-my-head examples were numerically impossible and changed them a bit. Also, added stuff in brackets that the post hopefully clarifies as you read on.]

  1. “Replications with a sample size similar to the original, should produce p-values similar to the original.”
    • Example: in subsequent studies with n = 100 each, a correlation (p = 0.04) should replicate as the same correlation (p ≈ 0.04) [this happens about 0.02% of the time when population r is 0.3; in these cases you actually observe an r≈0.19]
  2. “P-values are linearly related with sample size, i.e. bigger sample gives you proportionately more small p-values.”
    • Example: a correlation (n = 100, p = 0.04), should replicate as a correlation of about the same, when n = 400, with e.g. a p ≈ 0.02. [in the above-mentioned case, the replication gives observed r±0.05 about 2% of the time, but the p-value is smaller than 0.0001 for the replication]
  3. “We study real effects.” [we should think a lot more about how our observations could have come by in the absence of a real effect!]

It is obvious that the third point is contentious, and I won’t consider it here much. But the first two points are less clear, although the confusion is understandable if one has learned and always applied Jurassic (pre-Bem) statistics.

[Note: “statistical power” or simply “power” is the probability of finding an effect, if it really exists. The more obvious an effect is, and the bigger your sample size, the better your chances of detecting real effects – i.e. the greater your power. You want to be pretty sure your study detects what it’s designed to detect, so you may want to have a power of 90%, for example.]

Figure 1. A lottery machine. Source: Wikipedia

To get a handle on how p-values behave, we must understand their nature as random variables 1. They are much like the balls in a lottery machine, with values between zero and one marked on them. The lottery machine of real effects has disproportionately more low (e.g. < 0.01) values on the balls, while the lottery machine of null effects contains a “fair” distribution of numbers on balls (where each number is as likely as any other). If this doesn’t make sense yet, read on.

Let us exemplify this with a simulation. Figure 2 shows the expected distribution of p-values when we run 10 000 studies with one t-test each, reporting the p of each test. You can think of this as 9999 replications of the original study with the same sample size.

Figure 2: p-value distribution for 10 000 simulated studies, under 50% power when the alternative hypothesis is true. (When power increases, the curve gets pushed even farther to the left, leaving next to no p-values over 0.01)

Now, if we did just six studies with the parameters laid out above, we could see a set of p-values like {0.002, 0.009, 0.024, 0.057, 0.329, 0.479}, half of them being “significant” (the first three). If we had 80% power to detect the difference we are looking for, about 80% of the p-values would be “significant”. As an additional note, with 50% power, 4% of the 10 000 studies give a p between 0.04 and 0.05. With 80% power, this number goes down to 3%. For 97.5% power, only 0.5% of studies (yes, five in every thousand) are expected to give such a “barely significant” p-value.
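The figures here come from simulation code linked at the end of the post, but the idea is easy to reproduce. Below is a minimal Python sketch; the effect size and sample size are my own illustrative choices, picked to give roughly 50% power:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n = 0.4, 50  # true effect (Cohen's d) and per-group sample size: ~50% power

# 10 000 studies, each a single two-sample t-test
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(d, 1, n)).pvalue
    for _ in range(10_000)
])

print(f"significant (p < .05):        {np.mean(pvals < .05):.0%}")
print(f"barely significant (.04-.05): {np.mean((pvals > .04) & (pvals < .05)):.1%}")
```

About half the studies come out “significant”, and only a few percent land in the 0.04–0.05 region, matching the counts above.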

The senior scholar mentioned in the beginning was studying correlations. They work the same way. The animation below shows how p-values are distributed for different sample sizes, when we do 10 000 studies at every sample size (i.e. every frame is 10 000 studies with that sample size). The samples come from a population where the real correlation is 0.3. The red dotted line is p = 0.05.

Figure 3. P-value distributions for different sample sizes, when studying a real correlation of 0.3. Each frame is 10 000 replications with a given sample size. If pic doesn’t show, click here for the gif (and/or try another browser).
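The same exercise can be sketched for correlations; the population correlation is 0.3 as in the animation, while the particular sample sizes and replication count here are my own illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
rho = 0.3  # true population correlation
cov = [[1, rho], [rho, 1]]

sig = {}
for n in (50, 100, 250):
    pvals = []
    for _ in range(2_000):
        # Draw a bivariate-normal sample and test the observed correlation
        x, y = rng.multivariate_normal([0, 0], cov, n).T
        pvals.append(stats.pearsonr(x, y)[1])  # element [1] is the p-value
    sig[n] = np.mean(np.array(pvals) < 0.05)
    print(f"n = {n:3d}: {sig[n]:.0%} of p-values under .05")
```

As the sample size grows, the p-value distribution piles up near zero and “barely significant” values become rare.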

The next animation zooms in on “significant” p-values in the same way as in Figure 2 (though the largest bar goes off the roof quickly here). As you can see, it is almost impossible to get a p-value close to 0.05 with large power. Thus, there is no way we should “expect” a p-value over 0.01 when we replicate a real effect with large power. Very low p-values are always more probable than “barely significant” ones.

Figure 4. Zooming in on the “significant” p-values. It is more probable to get a very low p than a barely significant one, even with small samples. If pic doesn’t show, click here for the gif.

But what if there is no effect? In this case, every p-value is equally likely (see Figure 5). This means that, in the long run, getting a p = 0.01 is just as likely as getting a p = 0.97, and by implication, 5% of all p-values fall under 0.05. Therefore, the proportion of studies that generate a p between 0.04 and 0.05 is 1%. Remember how this percentage was 0.5% (five in a thousand) when the alternative hypothesis was true under 97.5% power? Indeed, when power is high, these “barely significant” p-values may actually speak for the null, not the alternative hypothesis! The same goes for e.g. p = 0.024 when power is 99% [see here].

Figure 5. p-value distribution when the null hypothesis is true. Every p is just as likely as any other.
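This, too, is easy to verify by simulation. A sketch, with both groups drawn from the same population so that the null is true (the sample size is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Null is true: both samples come from the same N(0, 1) population
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
    for _ in range(10_000)
])

print(f"p < .05:       {np.mean(pvals < .05):.1%}")                    # about 5%
print(f".04 < p < .05: {np.mean((pvals > .04) & (pvals < .05)):.1%}")  # about 1%
```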

Consider the lottery machine analogy again. Does it make better sense now?

The lottery machine of real effects has disproportionately more low (e.g. < 0.01) values on the balls, while the lottery machine of null effects contains a “fair” distribution of numbers on balls (each number is as likely as any other).

Let’s look at one more visualisation of the same thing:

Figure 6. The percentages of “statistically significant” p-values evolving as sample size increases. If the gif doesn’t show, you’ll find it here.

Aside: when the effect one studies is enormous, sample size naturally matters less. I calculated Cohen’s d for the Asch 2 line-segment study, and a whopping d = 1.59 emerged. This is surely a very unusual effect size in psychological experiments, and it leads to high statistical power even with small samples. In such a case, by the logic presented above, one should be extremely cautious of p-values closer to 0.05 than to zero.

Understanding all this is vital in interpreting past research. We never know what the data generating system has been (i.e. are p-values extracted from a distribution under the null, or under the alternative), but the data gives us hints about what is more likely. Let us take an example from a social psychology classic, Moscovici’s “Towards a theory of conversion behaviour” 3. The article reviews results, which are then used to support a nuanced theory of minority influence. Low p-values are taken as evidence for an effect.

Based on what we learned earlier about the distribution of p-values under the null vs. the alternative, we can now see under which hypothesis the observed p-values are more likely to occur. The tool to use here is called the p-curve 4, and it is presented in Figure 7.

Figure 7. A quick-and-dirty p-curve of Moscovici (1980). See this link for the data you can paste onto p-checker or p-curve.

You can see directly how a big portion of the p-values sits in the 0.05 region, whereas you would expect them to cluster near 0.01. The p-curve analysis (from the p-curve website) shows that evidential value, if there is any, is inadequate (Z = -2.04, p = .0208). Power is estimated to be 5%, consistent with the null hypothesis being true.

The null being true may or may not have been the case here. But looking at the curve might have helped the researchers who spent some forty years trying, unsuccessfully, to replicate the pattern of Moscovici’s afterimage study results 5.

In a recent talk, I joked about a bunch of researchers who tour around holiday resorts every summer, making people fill in IQ tests. Each summer they keep the results which show p < 0.05 and scrap the others, eventually ending up in the headlines with a nice meta-analysis of the results.

Don’t be those guys.


Disclaimer: the results discussed here may not generalise to some more complex models, where the p-value is not uniformly distributed under the null. I don’t know much about those cases, so please feel free to educate me!

Code for the animated plots is here. It was inspired by code from Daniel Lakens, whose blog post inspired this piece. Check out his MOOC here. Additional thanks to Jim Grange for advice on gif making and Alexander Etz for constructive comments.


  1. Murdoch, D. J., Tsai, Y.-L. & Adcock, J. P-Values are Random Variables. The American Statistician 62, 242–245 (2008).
  2. Asch, S. E. Studies of independence and conformity: I. A minority of one against a unanimous majority. Psychological monographs: General and applied 70, 1 (1956).
  3. Moscovici, S. in Advances in Experimental Social Psychology 13, 209–239 (Elsevier, 1980).
  4. Simonsohn, U., Simmons, J. P. & Nelson, L. D. Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to Ulrich and Miller (2015). J Exp Psychol Gen 144, 1146–1152 (2015).
  5. Smith, J. R. & Haslam, S. A. Social psychology: Revisiting the classic studies. (SAGE Publications, 2012).

The legacy of social psychology

To anyone teaching psychology.

In this post I express some concerns about the prestige given to ‘classic’ studies, which are widely taught in undergraduate social psychology courses around the world. I argue that rather than just demonstrating a bunch of clever but dodgy experiments, we could teach undergraduates to evaluate studies for themselves. To exemplify this, I quickly demonstrate power, Bayes factors, the p-checker app and the GRIM test.

psychology’s foundations are built not of theory but with the rock of classic experiments

Christian Jarrett

Here is an out-of-context quote from Sanjay Srivastava from a while back:


This got me thinking about why and how we teach classic studies.

Psychologists usually lack the luxury of well-behaving theories. Some have thus proposed that the classic experiments, which have survived in the literature until the present, serve as the bedrock of our knowledge 1. In the introduction to a book retelling the stories of classic studies in social psychology 2, the authors note that classic studies have “played an important role in setting the research agenda for the field as it has progressed over time” and “serve as common points of reference for researchers, teachers and students alike”. The authors continue by pointing out that many of these classics lacked sophistication, but that this in fact is a feature of their enduring appeal, as laypeople can understand the “points” the studies make. Exposing the classics to modern statistical methods would thus miss their point.

Now, this makes me wonder: if the point of a study is not to assess the existence of a phenomenon, what in the world might it be? One answer would be to serve as historical examples of practices no longer considered scientific, but I doubt this is what’s normally meant. Notwithstanding, I wanted to dip into the “foundations” of our knowledge by demonstrating the use of some more-or-less recently developed tools on a widely known article. According to Google Scholar, the Festinger and Carlsmith cognitive dissonance experiment 3 has been cited over three thousand times, so its influence is hard to downplay.


But first, a necessary digression: statistical power is the probability of detecting a “significant” effect of the postulated size, if the null hypothesis is false. As explained in Brunner & Schimmack 4, it is an interesting anomaly that the statistical power of studies in psychology is usually small, yet almost all of them end up finding “significant” results. As to how small: power doubtfully exceeds 50% 5–7, and for small (conventional?) effect sizes, the mean has been shown to be as low as 24%. As a recent replication project on the ego depletion effect 8 exemplified, a highly “replicable” (as judged by the published record) phenomenon may turn out to be a fluke when null findings are taken into account. This has recently made psychologists consider the uncomfortable possibility that entire research lines consisting of “accumulated scientific evidence” may in fact not contain that much evidence 9,10.

So, what is the statistical power of Festinger and Carlsmith? Using G*Power 11, it turns out that they had an 80% chance to discover a humongous effect of d = 0.9, and only a coin flip’s probability of finding a (still large) effect of d = 0.64. Now, if an underpowered study finds an effect, with current practices it is likely to be exaggerated and/or even of the wrong sign 12. Here would be a nice opportunity to demonstrate these concepts to students.
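These numbers come from G*Power’s GUI, but the calculation itself can be sketched with the noncentral t distribution (assuming a two-sided test with n = 20 per group, as in the original study):

```python
import numpy as np
from scipy import stats

def ttest_power(d, n_per_group, alpha=0.05):
    # Power of a two-sided, two-sample t-test via the noncentral t distribution
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)   # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

powers = {d: ttest_power(d, 20) for d in (0.9, 0.64)}
for d, p in powers.items():
    print(f"d = {d}: power = {p:.2f}")
```

With 20 participants per group, d = 0.9 gives roughly 80% power, and d = 0.64 roughly 50%.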

Considering the low power, it may not come as a surprise that the evidence the study provided was weak to begin with. A Bayes factor (BF) is an indicator of evidence for one hypothesis relative to another. In this case, a BF of ~3 moves an impartial observer from being 50% sure the experiment works to being 75% sure, or a skeptic from being 25% sure to being 43% sure, that the effect is small instead of nil.
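The belief-updating arithmetic here is just prior odds times the Bayes factor. A sketch, using a BF of exactly 3 (since the actual BF is only roughly 3, the figures come out close to, but not exactly, those above):

```python
def update(prior_prob, bayes_factor):
    # Posterior probability of a hypothesis after evidence with this Bayes factor
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

print(update(0.50, 3))  # impartial observer: 0.75
print(update(0.25, 3))  # skeptic
```

With the skeptic’s 25% prior, a BF of exactly 3 lands at 50%; the 43% figure in the text corresponds to a BF a bit under 3.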

It would be relatively simple to introduce Bayes Factors with this study. The effect of a prior scale in this case does not matter much for reasonable choices, as exemplified with a plot made in JASP with two clicks:

Figure 1: Bayes factor robustness check for the main finding of the dissonance study. Plotted by JASP, using n=20 for both groups, a t-value of 2.48 and a cauchy prior scale of 0.4.

Nowadays it is easy to check whether a paper correctly reports test statistics and their associated p-values. The p-checker app (this link feeds the relevant statistics to the app) can do this, and it turns out that most of the t-values in the paper are incorrectly rounded down (assuming that “significant at the 0.08 level” means p < 0.08). You can demonstrate this by including the link on your slides, using it to go to p-checker and choosing “p-values correct?”.

Finally, you can look at the study using the GRIM test 13, which evaluates whether the reported means are mathematically possible. As it turns out, a quarter of the reported means in the table with the main results do not pass the test. One more time: 25% of the reported means are mathematically impossible. The most likely explanation for this is shoddy reporting of means or accidental misreporting of sample sizes, but I find it telling that—to my knowledge, at least—the issue has not come up in fifty years of scientific investigation.

Figure 2: Main results table of the Festinger & Carlsmith study. Circled means are mathematically impossible given the reported sample sizes.
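The test itself fits in a few lines. Below is a simplified sketch of Brown and Heathers’ procedure (it checks a single mean and ignores rounding tie-breaks; the example values are hypothetical, not taken from the paper):

```python
def grim_consistent(reported_mean, n, decimals=2):
    # Can `reported_mean`, reported to `decimals` places, arise as the
    # mean of n integer-valued responses?
    nearest_possible = round(reported_mean * n) / n
    return round(nearest_possible, decimals) == round(reported_mean, decimals)

print(grim_consistent(2.20, 20))  # True: 44 / 20 = 2.20 is attainable
print(grim_consistent(2.27, 20))  # False: no integer total / 20 rounds to 2.27
```

The idea: with n integer responses, the mean must be an integer divided by n, so many two-decimal means are simply impossible for small samples.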

Now, even though I have doubts about this study, as well as about the process by which the theory has “evolved” 14, it does not mean that cognitive dissonance effects do not exist. It is just that the research may not have been able to capture the essence of this everyday phenomenon (which, if it exists, can influence behaviour without the help of academics). Under the traditional paradigm of psychological science, fraught with publication bias and unhelpful incentives 10, a Registered Replication Report (RRR)-type effort would be needed, and even that could only test one operationalisation. As an undergraduate, I would have been exhilarated to hear early on about how and why such initiatives work, and why the approach is much more informative than any singular experiment.

Returning to the notion of the bedrock of psychology consisting of classic experiments instead of theories, as in the natural sciences 1: perhaps we need a more solid foundation, regardless of whether some flashy findings from decades ago happened to spur a progressive-ish 15,16 line of research.

How would such a foundation come to be? Maybe teaching could play a role?


  1. Jarrett, C. Foundations of sand? The Psychologist 21, 756–759 (2008).
  2. Smith, J. R. & Haslam, S. A. Social psychology: Revisiting the classic studies. (SAGE Publications, 2012).
  3. Festinger, L. & Carlsmith, J. M. Cognitive consequences of forced compliance. The Journal of Abnormal and Social Psychology 58, 203–210 (1959).
  4. Brunner, J. & Schimmack, U. How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies. (2016).
  5. Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 14, 365–376 (2013).
  6. Cohen, J. Things I have learned (so far). American psychologist 45, 1304 (1990).
  7. Sedlmeier, P. & Gigerenzer, G. Do studies of statistical power have an effect on the power of studies? Psychological bulletin 105, 309 (1989).
  8. Hagger, M. S. et al. A multi-lab pre-registered replication of the ego-depletion effect. Perspectives on Psychological Science (2016).
  9. Earp, B. D. & Trafimow, D. Replication, falsification, and the crisis of confidence in social psychology. Front. Psychol 6, 621 (2015).
  10. Smaldino, P. E. & McElreath, R. The Natural Selection of Bad Science. arXiv preprint arXiv:1605.09511 (2016).
  11. Faul, F., Erdfelder, E., Lang, A.-G. & Buchner, A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods 39, 175–191 (2007).
  12. Gelman, A. & Carlin, J. Beyond Power Calculations Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science 9, 641–651 (2014).
  13. Brown, N. J. L. & Heathers, J. A. J. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science (2016). doi:10.1177/1948550616673876
  14. Aronson, E. in The science of social influence: Advances and future progress (ed. Pratkanis, A. R.) 17–82 (Psychology Press, 2007).
  15. Lakatos, I. History of science and its rational reconstructions. (Springer, 1971).
  16. Meehl, P. E. Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry 1, 108–141 (1990).


How lack of transparency feeds the beast

This is a presentation I held for the young researchers branch of the Finnish Psychological Society. I show how low power and lack of transparency can lead to weird situations, where the published literature contains little or no knowledge.


We had big fun with Markus Mattsson and Leo Aarnio in a seminar, presenting to a great audience of eager young researchers.

The slides for my talk are here:

If you’re interested in more history and solutions, check out Felix Schönbrodt‘s slides here. Some pictures were made adapting code from a wonderful Coursera MOOC by Daniel Lakens. For Bayes, check out Alexander Etz‘s blog.

Oh, and for the monster analogy; this piece made me think of it.

Getting Started With Bayes

This post presents a Bayesian roundtable I convened for the EHPS/DHP 2016 health psychology conference. Slides for the three talks are included.

bayes healthpsych cover

So, we kicked off the session with Susan Michie and acknowledged Jamie Brown who was key in making it happen, but could not attend.


Robert West was the first to present; you’ll find his slides, “Bayesian analysis: a brief introduction”, here. This presentation gave a brief introduction to Bayes and to how belief updating with Bayes factors works.

I was the second speaker, building on Robert’s presentation. Here are slides for my talk, where I introduced some practical resources to get started with Bayes. The slides are also embedded below (some slides got corrupted by Slideshare, so the ones in the .ppt link are a bit nicer).

The third and final presentation was by Niall Bolger. In his talk, he gave a great example of how using Bayes in a multilevel model enabled him to incorporate more realistic assumptions and—consequently—evaporate a finding he had considered somewhat solid. His slides, “Bayesian Estimation: Implications for Modeling Intensive Longitudinal Data“, are here.

Let me know if you don’t agree with something (especially in my presentation) or have ideas regarding how to improve the methods in (especially health) psychology research!