These are slides from my lecture on significance testing, which took place in a course on research methods for social scientists. Some thoughts:
I tried to emphasise that this stuff is difficult, that people shouldn’t be afraid to say they don’t know, and that academics should try doing that more, too.
I tried to instill a deep memory that many uncertainties are involved in this endeavour, and that mistakes are ok as long as you report the choices you made transparently.
Added a small group discussion exercise at about 2/3 of the lecture: What was the most difficult part to understand so far? I think this worked quite well, although “Is this what an existential crisis feels like?” was not an uncommon response.
I really think statistics is mostly impossible to teach, and people learn when they get interested and start finding things out on their own. Not sure how successful this attempt was in doing that. Anyway, slides are available here.
TLDR: If you’re a seasoned researcher, see this. If you’re an aspiring one, start here or here, and read this.
In this post, I demonstrate how one could use Gelman & Carlin’s (2014) method to analyse a research design for Type S (wrong sign) and Type M (exaggeration ratio) errors, when studying an unknown real effect. Please let me know if you find problems in the code presented here.
Statistical power is the probability that you detect an effect when it's really there. Conventionally disregarded completely, but when people do consider it, it is often set at 80% (more is better, though).
Alpha is the probability that you'll say there's something when there's really nothing, in the long run (as put by Daniel Lakens). Conventionally set at 5%.
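To make the "in the long run" part concrete, here is a minimal simulation sketch. It assumes each study boils down to a single z statistic with unit standard error, which is a simplification, and the function name `long_run_rate` and the value `true_z=2.8` are made up for illustration.

```python
import random
from statistics import NormalDist

def long_run_rate(true_z, alpha=0.05, n_sims=20000, seed=1):
    """Share of simulated studies reaching p < alpha (two-sided).

    Assumption: each study yields a z statistic drawn from N(true_z, 1).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = sum(abs(rng.gauss(true_z, 1.0)) > z_crit for _ in range(n_sims))
    return hits / n_sims

# Nothing is there: the share of "significant" studies approximates alpha.
false_positive_rate = long_run_rate(true_z=0.0)

# Something is there: the share of "significant" studies approximates power.
detection_rate = long_run_rate(true_z=2.8)
```

With no true effect, the share of significant results should hover around alpha. With a true effect, the same share is the power of the design.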
Why do we need to worry about research design?
If you have been at all exposed to the recent turbulence in the psychological sciences, you may have bumped into discussions about the importance of bigger-than-conventional sample sizes. The reason is, in a nutshell, that if we find a “statistically significant” effect with an underpowered study, the estimate is likely to be grossly exaggerated and perhaps fatally wrong.
Traditionally, if people have considered their design at all, they have done it in relation to Type 1 and Type 2 errors. Gelman and Carlin, in a cool paper, bring another perspective to this thinking. They propose considering two things:
Say you have discovered a “statistically significant” effect (p < alpha)…
How probable is it that you have in your hands a result that’s of the wrong sign? Call this a Type S (sign) error.
How exaggerated is this finding likely to be? Call this a Type M (magnitude) error.
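The two questions above can be put into numbers. Gelman and Carlin's paper includes an R function for this, which simulates the exaggeration ratio. What follows is not that function and not the code behind this post, just a Python sketch under stated assumptions: the estimate is treated as normally distributed around the true effect (no t distribution), and the exaggeration ratio is computed in closed form instead of by simulation. The name `retrodesign_normal` and the example inputs are hypothetical.

```python
from statistics import NormalDist

def retrodesign_normal(true_effect, se, alpha=0.05):
    """Power, Type S error rate and exaggeration ratio (Type M).

    Assumptions: the estimate is distributed N(true_effect, se),
    true_effect > 0, and the test is two-sided at level alpha.
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    ratio = true_effect / se
    p_hi = 1 - std.cdf(z - ratio)   # significant, correct sign
    p_lo = std.cdf(-z - ratio)      # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # Expected absolute estimate among significant results (truncated normal means).
    mean_abs_significant = (
        true_effect * (p_hi - p_lo)
        + se * (std.pdf(z - ratio) + std.pdf(z + ratio))
    ) / power
    exaggeration = mean_abs_significant / true_effect
    return power, type_s, exaggeration

# Hypothetical inputs: a true standardized effect of 0.1 with standard error 0.124.
power, type_s, exaggeration = retrodesign_normal(0.1, 0.124)
```

The lower the power, the more the significant estimates have to overshoot the true effect in order to clear the significance threshold, which is what the exaggeration ratio captures.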
Let me exemplify this with a research project we’re writing up at the moment. We had two groups with around 130 participants each, and exposed one of them to a message with the word “because” followed by a reason. The other received a succinct message, and we observed their subsequent behavior. Note that you can’t use the observed effect size to figure out your power (see this paper by Dienes). That’s why I figured out a minimally interesting effect size of around d=.40 [defined by calculating the mean difference considered meaningful, and dividing the result by the standard deviation we got in another study].
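The bracketed recipe for the minimally interesting effect size is a one-line calculation. The numbers in this sketch are placeholders, not the values from our study, and `standardized_effect` is just an illustrative name.

```python
def standardized_effect(mean_difference, sd):
    """Smallest mean difference considered meaningful, expressed in SD units (a d-type effect size)."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return mean_difference / sd

# Placeholder values: a 2-point difference judged meaningful, and an SD of 5
# taken from an earlier study.
d_min = standardized_effect(mean_difference=2.0, sd=5.0)
```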
First, see how we had OK power to detect a wide array of decent effects:
So, unless the (unknown) effect is smaller than what we care about, we should be able to detect it.
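For a rough idea of where such a power curve comes from, here is a sketch that assumes a two-sided z test and approximates the standard error of d as sqrt(2/n) for two equal groups of n. Both are simplifying assumptions of mine, and `power_two_groups` is an illustrative name, not a function from any package.

```python
from math import sqrt
from statistics import NormalDist

def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardized mean difference d.

    Assumptions: two equal groups, normal approximation,
    standard error of d taken as sqrt(2 / n_per_group).
    """
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    ncp = d / sqrt(2.0 / n_per_group)  # expected z statistic under the true effect
    return (1 - std.cdf(z_crit - ncp)) + std.cdf(-z_crit - ncp)

# Power across a range of true effects with 130 participants per group.
power_by_d = {d: power_two_groups(d, 130) for d in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Under these assumptions, power rises steeply with the true effect size, and at d = 0 it equals alpha.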
Next, above we see that the probability we would observe an effect of the wrong sign would be minuscule for any effect over d=.2. A wrong-sign result would mean it’d look like the succinct message worked better than the reason message, when it really was the other way around.
Finally, and a little surprisingly, we can see that even relatively large true effects would actually be exaggerated by a factor of two!
But what can you do, those were all the participants we could muster with our resources. An interesting additional point comes from looking at the “v-statistic”. This is a measure of how your model compares to random guessing: 0.5 represents coin-flipping accuracy (see here for full explanation and the original code I used).
The figure above shows how we start exceeding random guessing at R^2 around 0.25 (d=.32 according to this). The purple line is in there to show how an additional 90 people help a little but do not do wonders. I’ll write about the results of this study in a later post.