These are slides from my lecture on significance testing, which took place in a course on research methods for social scientists. Some thoughts:

I tried to emphasise that this stuff is difficult, that people shouldn’t be afraid to say they don’t know, and that academics should try doing that more, too.

I tried to instill a deep memory that many uncertainties are involved in this endeavour, and that mistakes are ok as long as you report the choices you made transparently.

Added a small group discussion exercise at about 2/3 of the lecture: What was the most difficult part to understand so far? I think this worked quite well, although “Is this what an existential crisis feels like?” was not an uncommon response.

I really think statistics is mostly impossible to teach, and people learn when they get interested and start finding things out on their own. Not sure how successful this attempt was in doing that. Anyway, slides are available here.

TLDR: If you’re a seasoned researcher, see this. If you’re an aspiring one, start here or here, and read this.

In this post, I demonstrate how one could use Gelman & Carlin’s (2014) method to analyse a research design for Type S (wrong sign) and Type M (exaggeration ratio) errors, when studying an unknown real effect. Please let me know if you find problems in the code presented here.

[Concept recap:]

Statistical power is the probability that you detect an effect when it’s really there. Conventionally it is disregarded completely; when it is considered, it is often set at 80% (more is better, though).

Alpha is the probability you’ll say there’s something when there’s really nothing, in the long run (as put by Daniel Lakens). Conventionally set at 5%.

Why do we need to worry about research design?

If you have been at all exposed to the recent turbulence in the psychological sciences, you may have bumped into discussions about the importance of bigger-than-conventional sample sizes. The reason is, in a nutshell, that if we find a “statistically significant” effect with an underpowered study, the estimated effect is likely to be grossly overestimated and perhaps fatally wrong.

Traditionally, if people have considered their design at all, they have done it in relation to Type 1 and Type 2 errors. Gelman and Carlin, in a cool paper, bring another perspective to this thinking. They propose considering two things:

Say you have discovered a “statistically significant” effect (p < alpha)…

How probable is it that you have in your hands a result that’s of the wrong sign? Call this a Type S (sign) error.

How exaggerated is this finding likely to be? Call this a Type M (magnitude) error.

Let me exemplify this with a research project we’re writing up at the moment. We had two groups with around 130 participants each, and exposed one of them to a message with the word “because” followed by a reason. The other received a succinct message, and we observed their subsequent behavior. Note that you can’t use the observed effect size to figure out your power (see this paper by Dienes). That’s why I settled on a minimally interesting effect size of around d=.40 [defined by calculating the mean difference considered meaningful, and dividing the result by the standard deviation we got in another study].
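For readers who want to poke at the numbers themselves, here is a minimal Python sketch of a Gelman & Carlin style design analysis. It is not the code behind the figures in this post. It assumes a normal approximation (no t-distribution), uses a closed-form expression where the original approach simulates, and approximates the standard error of d as sqrt(2/n) for two equal groups of n. The function name retrodesign mirrors the paper's terminology; the argument names are my own. Because of these assumptions, the output need not match the plots discussed here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def retrodesign(true_effect, se, alpha=0.05):
    """Return (power, Type S rate, Type M ratio) for an assumed true effect.

    Assumption: the estimate is normally distributed around true_effect
    with standard deviation se, and the test is two-sided.
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    # Probability of a significant result with the correct / wrong sign.
    p_hi = 1 - STD_NORMAL.cdf(z_crit - shift)
    p_lo = STD_NORMAL.cdf(-z_crit - shift)
    power = p_hi + p_lo
    type_s = p_lo / power
    # Expected absolute estimate among significant results (closed form),
    # divided by the true effect, gives the exaggeration ratio.
    expected_abs = true_effect * (p_hi - p_lo) + se * (
        STD_NORMAL.pdf(z_crit - shift) + STD_NORMAL.pdf(z_crit + shift)
    )
    type_m = expected_abs / (true_effect * power)
    return power, type_s, type_m


n_per_group = 130                # roughly the group size described above
se_d = sqrt(2 / n_per_group)     # assumed approximation of the SE of d
for d in (0.1, 0.2, 0.4):
    power, type_s, type_m = retrodesign(d, se_d)
    print(f"d={d:.1f}  power={power:.2f}  typeS={type_s:.4f}  typeM={type_m:.2f}")
```

Looping over a grid of plausible effect sizes rather than plugging in a single value is the point of the exercise: the true effect is unknown, so the design is judged across the range you would care about.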

First, see how we had an ok power to detect a wide array of decent effects:

So, unless the (unknown) effect is smaller than what we care about, we should be able to detect it.

Next, above we see that the probability we would observe an effect of the wrong sign would be minuscule for any effect over d=.2. A wrong-sign result would mean it’d look like the succinct message worked better than the reason message, when it really was the other way around.

Finally, and a little surprisingly, we can see that even relatively large true effects would actually be exaggerated by a factor of two!

Dang.

But what can you do, those were all the participants we could muster up with our resources. An interesting additional point is brought by looking at the “v-statistic”. This is the measure of how your model compares to random guessing. 0.5 represents coin flipping accuracy (see here for full explanation and the original code I used).

Figure above shows how we start exceeding random guessing at R^2 around 0.25 (d=.32 according to this). The purple line is in there to show how an additional 90 people help a little but do not do wonders. I’ll write about the results of this study in a later post.

Oh, and the heading? I believe it’s better to do as much of this sort of thinking, before someone looking to have your job (or, perhaps, reviewer 2) does it for you.

In the right light, study becomes insight
But the system that dissed us
Teaches us to read and write

– Rage Against The Machine, “TAKE THE POWER BACK”

[DISCLAIMER 1: THIS POST MAY CAUSE DEATH BY BOREDOM IF YOU’RE NOT INVOLVED WITH INTERVENTION RESEARCH DESIGN]

Statistical power is the probability of finding an effect of a specified size, if it exists. It is of critical importance to interpreting research, but it’s amazing how little attention it has received in undergraduate statistics courses. Of course, it can be argued that up until recent years, statistics in psychology was taught by those who were effectively a part of the problem. I can’t recommend enough this wonderful summary of Cohen’s classic article about how psychology failed to take it seriously in the 20th century.

This doesn’t mean the problem is eradicated. In social psychology research, power still seems to be less than 50%. It gets worse in neuroscience: the median power is estimated to be around 20%. So, if an effect is real, you have a 1-in-5 probability of finding it with your test. If you still happen to find the effect, it most probably is grossly overestimated, because when an effect happens to look big just by chance, it crosses the p<0.05 threshold more easily. (See paper by John Ioannidis; “Why Most Discovered True Associations Are Inflated”.)

Once more: if your study is underpowered, you not only fail to detect possible effects, but also get unrealistic estimates when you do.

Recently, I’ve had the interesting experience of having to figure out how to do sample size calculations in a cluster-randomised setting. This essentially means that you’re violating the assumption of independent observations, because your participants come clustered in e.g. classrooms, and people in one classroom tend to be more like each other than people in another classroom.

It also pretty much churns your dreams [of simple sample size specification] to dust.
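To give a feel for why clustering hurts, here is a minimal Python sketch of the textbook design effect for a 2-level design with equal cluster sizes, 1 + (m - 1) * ICC. It is not taken from the Excel sheets, the function names are my own, and the example numbers are made up for illustration.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def clustered_sample_size(n_individual, cluster_size, icc):
    """Inflate an individually randomised sample size for clustering.

    n_individual is the total N you would need with independent
    observations. Returns (total participants, number of clusters).
    """
    inflated = n_individual * design_effect(cluster_size, icc)
    n_total = ceil(round(inflated, 9))  # round first to avoid float fuzz
    n_clusters = ceil(n_total / cluster_size)
    return n_total, n_clusters


# Hypothetical example: 200 participants needed without clustering,
# classrooms of 21 pupils, intra-class correlation of 0.1.
print(clustered_sample_size(200, 21, 0.1))
```

Even a modest ICC multiplies the required sample once clusters get large, which is why the ICC estimate deserves as much care as the effect size.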

So, to make my life easier, I built a couple of Excel sheets that can be used by a simpleton like me. You can download the file from the end of this post. (Note: the sheets contain “array formulas” that only work in Excel, so sadly no OpenOffice version.)

I want to make it perfectly clear that I still know very little about power analysis (or anything else, actually) and made these as tools to help me out because my go-to statistician was too busy to give me the support I needed. Sources and justifications are provided, but it’s not impossible these calculations are totally wrong.

I’m guessing your friendly neighbourhood statistician, too, would rather help “see if your calculations are correct” instead of doing your calculations for you. So I’m hoping you can use this tool to estimate the sample size, then talk to a statistician and let me know if he says you have corrections to make 🙂

[DISCLAIMER 2: ALWAYS CONSULT A STATISTICIAN BEFORE MOVING FORWARD WITH CALCULATED SAMPLE SIZES]

What’s in the sheets

Here’s what’s in the file:

2-level cluster randomization: sample size aide

Use this sheet to calculate sample size for 2-level cluster randomization when you know power and a bunch of other stuff. Some links and guidance are included. Also includes two toys (the rightmost and the bottom yellow blocks) that give you optimistic estimates of whether your “discovery” is false. These are based on this paper. I highly recommend it if you want to make sense of p-values.
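As a rough idea of what such a “false discovery” toy can look like, here is a minimal Python sketch of the standard calculation: the share of significant results that are false positives, given alpha, power, and an assumed prior probability that the effect is real. I am assuming this is the kind of calculation meant; the sheet may do it differently, and the function and argument names are my own.

```python
def false_discovery_risk(prior, power, alpha=0.05):
    """Share of 'significant' findings that are false positives.

    prior: assumed probability that a tested effect is real (a guess).
    power: probability of detecting the effect when it is real.
    alpha: probability of a significant result when there is no effect.
    """
    for name, value in (("prior", prior), ("power", power), ("alpha", alpha)):
        if not 0 < value <= 1:
            raise ValueError(f"{name} must be in (0, 1]")
    false_positives = alpha * (1 - prior)
    true_positives = power * prior
    return false_positives / (false_positives + true_positives)


# Hypothetical example: 1 in 10 tested effects is real, power is 80%.
print(round(false_discovery_risk(prior=0.1, power=0.8), 2))
```

The estimate is optimistic in the sense that it takes the stated power and alpha at face value; lower real-world power pushes the risk up.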

Find the ICC (intra-class correlation) in SPSS and R

One of the big bogeymen, to me at least, of the whole enterprise was the intra-class correlation (apparently often used synonymously with “intra-cluster” correlation). I jotted down the instructions that I wish I’d had when I began meddling with this stuff.
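For intuition about what the ICC measures, here is a minimal Python sketch of the one-way ANOVA estimator for equally sized clusters: (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the between- and within-cluster mean squares and k is the cluster size. This is not the SPSS or R procedure from the sheet, the function name is my own, and the data are made up.

```python
from statistics import mean


def icc_oneway(clusters):
    """One-way ANOVA estimate of the ICC for equally sized clusters.

    clusters: list of lists, one inner list of scores per cluster.
    """
    g = len(clusters)
    k = len(clusters[0])
    if g < 2 or k < 2 or any(len(c) != k for c in clusters):
        raise ValueError("need at least 2 clusters of equal size >= 2")
    grand_mean = mean([x for c in clusters for x in c])
    cluster_means = [mean(c) for c in clusters]
    # Between-cluster mean square: spread of the cluster means.
    ms_between = k * sum((m - grand_mean) ** 2 for m in cluster_means) / (g - 1)
    # Within-cluster mean square: spread of scores around their own cluster mean.
    ms_within = sum(
        (x - m) ** 2 for c, m in zip(clusters, cluster_means) for x in c
    ) / (g * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Made-up scores from three classrooms of three pupils each.
classrooms = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(round(icc_oneway(classrooms), 3))
```

A value near 1 means pupils within a classroom are nearly interchangeable, and a value near 0 means the classroom tells you little about the individual. Real data will usually have unequal cluster sizes, which this sketch does not handle.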

Power calculator for a 3-level cluster randomized design

Here’s the dream crusher. In the “Justifications…”-sheet you’ll find mathematical formulas and the logic behind the machine, but it’s not super obvious for us mortals. I managed to make it work in Excel by combining pieces of code from all over the internet; I’m hoping you don’t need to do the same.