Correlation pitfalls – Happier times with mutual information?

[UPDATE: added distance correlation due to Vithor’s suggestion; see comments]

I’ve become increasingly anxious about properties of correlation I never knew existed. In this post I collect resources and other bits on the topic, so that everything is in one place. Some resources for beginners are at the end of the post.

Correlation isn’t causation, and causation doesn’t require correlation. Ok. But have you heard that correlation is not dependence? In other words, things can be dependent without being correlated, and independent though correlated. Ain’t that fun. As Shay Allen Hill describes visually in his excellent, short blog (HIGHLY RECOMMENDED):

[C]ovariance doesn’t actually measure “Does y increase when x increases?” it only measures “Is y above average when x is above average (and by how much)?” And when covariance is broken [i.e. mean doesn’t coincide with median], our correlation function is broken.
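The quote’s point is easy to demonstrate in a few lines of base R (my own toy example, not from Hill’s post): a variable can be almost perfectly predictable from another while their correlation stays near zero, and a crude binned estimate of mutual information still picks the dependence up:

```r
set.seed(1)
x <- runif(2000, -1, 1)
y <- x^2 + rnorm(2000, sd = 0.05)  # y is almost a deterministic function of x

cor(x, y)  # Pearson r is close to zero, because the relation isn't monotonic

# Crude plug-in mutual information estimate via binning (in nats)
mi_binned <- function(x, y, bins = 10) {
  p  <- table(cut(x, bins), cut(y, bins)) / length(x)
  px <- rowSums(p); py <- colSums(p)
  nz <- p > 0
  sum(p[nz] * log(p[nz] / outer(px, py)[nz]))
}
mi_binned(x, y)  # clearly above zero: MI sees the dependence
```

This is only the naive plug-in estimator; the BCMI and MIC estimators mentioned below are more refined takes on the same idea.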

So there may well be situations where only 20% of people in the sample show dependence between two variables, yet this shows up as a correlation of at least 37%. Or where r=0.5 carries ~4.5 times, and r=0.75 ~12.8 times, more information than r=0.25. How can we interpret a result without in-depth knowledge of both the field and the data in question? A partial remedy, apparently, is using mutual information instead. I knew nothing about it, so as always, I just started playing around with things I don’t understand. Here’s what came out:


The first four panels are Anscombe’s Quartet. The fifth illustrates Taleb’s point about intelligence. Data for the last two panels are from this project. The first four and the last two panels have the same means and standard deviations. Code for creating the pic is here.
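Base R happens to ship Anscombe’s Quartet as the `anscombe` data frame, so the “same summary statistics” claim is easy to verify yourself:

```r
# Compare means, SDs and correlations across the four x-y pairs of the Quartet
summary_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), sd_x = sd(x), mean_y = mean(y), sd_y = sd(y), r = cor(x, y))
})
round(summary_stats, 2)  # all four columns are (near-)identical
```

Plot the four pairs, though, and the stories they tell could hardly differ more.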

MIC and BCMI were new to me, but they were easy enough to implement, which of course doesn’t mean they make sense. But see how they catch the dinosaur?

  • MIC is the Maximal Information Coefficient, from maximal information-based nonparametric exploration (documentation)
  • BCMI stands for Jackknife Bias Corrected MI estimates (documentation)
  • DCOR is distance correlation (see comments)
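Of the three, distance correlation is simple enough to sketch in base R (my own naive implementation for illustration, not the code behind the figure; the energy package provides a proper one). Unlike Pearson’s r, it is zero only under independence:

```r
# Naive O(n^2) sample distance correlation (Székely et al.); fine for small n
dcor <- function(x, y) {
  dc <- function(D) D - outer(rowMeans(D), colMeans(D), "+") + mean(D)  # double-centering
  A <- dc(as.matrix(dist(x)))
  B <- dc(as.matrix(dist(y)))
  sqrt(mean(A * B) / sqrt(mean(A * A) * mean(B * B)))
}

set.seed(2)
x <- runif(500, -1, 1)
cor(x, x^2)   # ~0: Pearson misses the dependence
dcor(x, x^2)  # clearly positive: DCOR catches it
```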

I’d be happy to hear thoughts and caveats regarding the use of entropy-based dependency measures in general, and these in particular, from people who actually know these methods. Here’s a related Twitter thread, or just email me!

ps. If this is your first brush with the uncertainties related to correlations, and/or you have little or no statistics background, you may not know how spectacularly correlation can vary in small samples. Taleb’s stuff (mini-moocs [1, 2]) can be difficult to grasp without a math background, so perhaps get started with this visualisation, or these Excel sheets. A while ago I animated some elementary simulations of p-value distributions for the statistical significance of correlations; selective reporting makes things a lot worse than what’s depicted there. If you’re a psychology student, also be sure to check out the p-hacker app. If you haven’t thought about distributions much lately, check out this fun read by a math student.

⊂This post has been a formal sacrifice to Rexthor.⊃

Modern tools to enhance reproducibility and comprehension of research findings (VIDEO WALKTHROUGH 14min)

[First slide of the presentation]

These are the slides of my presentation at the annual conference of the European Health Psychology Society. It’s about presenting data visually, and taking publishing culture from the journals into our own hands. I hint at a utopia where the journal publication is a side product of a comprehensively reported data set.

Please find a 14min video walkthrough of the slides (which can be found here) below. The site presented in the slides is here, and the tutorial by the most awesome Lisa DeBruine is here!


After the talk, I saw what was probably the best tweet about a presentation of mine ever. For a fleeting moment, I was happy to exist:

[Screenshot of the tweet]

Big thanks to everyone involved, especially Gjalt-Jorn Peters for helpful suggestions on code and the plots. For the diamond plots, check out

Authors of the conference abstract:

Matti Heino; Reijo Sund; Ari Haukkala; Keegan Knittle; Katja Borodulin; Antti Uutela; Vera Araújo-Soares; Falko Sniehotta; Tommi Vasankari; Nelli Hankonen


Background: Comprehensive reporting of results has traditionally been constrained by limited reporting space. In spite of calls for increased transparency, researchers have had to choose carefully what to report, and what to leave out; choices made based on subjective evaluations of importance. Open data remedies the situation, but privacy concerns and tradition hinder rapid progress. We present novel possibilities for comprehensive representation of data, making use of recent software developments.

Methods: We illustrate the opportunities using the Let’s Move It trial baseline data (n=1084). Descriptive statistics and group comparison results on psychosocial correlates of physical activity (PA) and accelerometry-assessed PA were reported in an easily accessible html-supplement, directly created from a combination of analysis code and data using existing tools within R.

Findings: Visualisations (e.g. network graphs, combined ridge and diamond plots) enabled presenting large amounts of information in an intelligible format. This bypasses the need to create narrative explanations for all data, or compress nuanced information into simple summary statistics. Providing all analysis code in a readily accessible format further contributed to transparency.

Discussion: We demonstrate how researchers can make their extensive analyses and descriptions openly available as website supplements, preferably with abundant visualisation to avoid overwhelming the reader with e.g. large numeric tables. Uptake of such practice could lead to a parallel form of literature, where highly technical and traditionally narrated documents coexist. While we may have to wait for fully open and documented data, comprehensive reporting of results is available to us now.



Visualising ordinal data with Flaming Pillars of Hell

I recently had a great experience with a StackOverflow question, when I was thinking about how to visualise ordinal data. This post shows one option for doing that. Code for the plots is at the end of the post.

Update: here’s an FB discussion, which mentions e.g. the good idea of making stacked % graphs (though I like to see the individuals, so they won’t sneak up behind me), and using the TraMineR package to visualise and analyse change.

Update 2: Although they have other names too, I’m going to call these things flamethrower plots, to reflect the fact that even though you have the opportunity to use them, it may not always be the best idea.

Say you have scores on some Likert-type questionnaire items, like motivation, at two time points, and would like to visualise them. You’re especially interested in whether you can see detrimental effects, e.g. due to an intervention. One option is a plot like the one below: each line is one person, lighter lines indicate bigger increases in motivation scores, and darker lines indicate iatrogenic development. The data are simulated so that the highest increases take place in the item in the leftmost panel, the middle one is pure randomness, and the right one shows iatrogenics.

I have two questions:

  1. Do these plots have a name, and if not, what should we call them?
  2. How would you go about superimposing model-implied-changes, i.e. lines showing that when someone starts off at, for example, a score of four, where are they likely to end up in T2?


[Figure: flamethrower plots of the three simulated motivation items]

The code below first simulates 500 participants at two time points, then draws the plot. If you want to use it on your own data, name the variables in the form scaleName_itemNumber_timePoint (e.g. “motivation_02_T1”).

# Simulate data:
library(dplyr)
library(tidyr)
library(ggplot2)

data <- data.frame(id = 1:500,
                   Intrinsic_01_T1 = sample(1:5, 500, replace = TRUE),
                   Intrinsic_02_T1 = sample(1:5, 500, replace = TRUE),
                   Intrinsic_03_T1 = sample(1:5, 500, replace = TRUE),
                   Intrinsic_01_T2 = sample(1:5, 500, replace = TRUE, prob = c(0.1, 0.1, 0.2, 0.3, 0.3)),
                   Intrinsic_02_T2 = sample(1:5, 500, replace = TRUE),
                   Intrinsic_03_T2 = sample(1:5, 500, replace = TRUE, prob = c(0.3, 0.3, 0.2, 0.1, 0.1)))

pd <- position_dodge(0.4) # x-axis jitter to make points more readable

# Draw plot:
data %>%
  tidyr::gather(variable, value, -id) %>%
  tidyr::separate(variable, c("item", "time"), sep = "_T") %>%
  dplyr::mutate(value = jitter(value, amount = 0.1)) %>% # y-axis jitter to make points more readable
  group_by(id, item) %>%
  mutate(slope = value[time == "2"] - value[time == "1"]) %>%
  ggplot(aes(x = time, y = value, group = id)) +
  geom_point(size = 1, alpha = .2, position = pd) +
  geom_line(aes(color = slope), alpha = .2, position = pd, size = 1.5) +
  scale_color_viridis_c(option = "inferno") +
  facet_wrap(~item) + # one panel per item
  ggtitle('Changes in indicators of motivation scores') +
  ylab('Intrinsic motivation scores') +
  xlab('Time points')

Open technical reports for the technically-minded

When Roger Giner-Sorolla lamented to me three years ago how annoying it can be to dig out interesting methods/results information from a manuscript with a carefully crafted narrative, I wholeheartedly agreed. When I saw the 100%CI post on reproducible websites a year ago, I thought it was cool but way too tech-y for me.

Well, it turned out that once you learn a tiny bit of elementary R Markdown, you can follow idiot-proof instructions on how to make cool websites out of your analysis code. I was also working on the manuscript version of my Master’s thesis, and realised several commenters thought much of the methods stuff I considered interesting was just unnecessary and/or boring.

So I made this thing of what I thought was the beef of the paper (also, to motivate me to finally submit that damned piece):


It got me thinking: perhaps we could create a parallel form of literature, where (open) highly technical and (closed) traditionally narrated documents coexist. The R Markdown research notes could be read with only a preregistration or a blog post to guide the reader, while the journals could just continue with business as usual. The great thing is that, as Ruben Arslan pointed out in the 100%CI post, you can present a lot of results and analyses, which is nice if you’d do them anyway and data sharing is a no-no in your field. In general, if there’s just too much conservative inertia in your field, this could be a way around it: let the to-be-extinct journals build paywalls around your articles, but put the important things openly available. The people who get pissed off by that sort of thing rarely look at technical supplements anyway 🙂

I’d love to hear your thoughts on the feasibility of the approach, as well as on how to improve such supplements!


After some insightful comments by Gjalt-Jorn Peters, I started thinking how this could be abused. We’ve already seen how e.g. preregistration can be used as a signal of illusory quality (1, 2), and supplements like this could do the same thing. Someone could just bluff by cramming the thing full of difficult-to-interpret analyses, and claim “hey, it’s all there!”. One helpful thing is to expect heavy use of visualisations, which are less morbid to look at than numeric tables and raw R output. Another option would be creating a wonderful shiny app, like Emorie Beck did.

Actually, let’s take a moment to marvel at how super awesomesauce that thing is.


So, to continue: I don’t know how difficult it really is to make such a thing. I’m sure a lot of tech-savvy people readily say it’s the simplest thing in the world, and I’m sure a lot of people will see the supplements I presented here as a shitton of learning to do. I don’t have a solution. But if you’re a PI, you can do both yourself and your doctoral students a favour by nudging them towards learning R; maybe they’ll make a shiny app (or whatever’s in season then) for you one day!

ps. If I’d do the R Markdown all over again, I’d do more and better plots, as well as put more emphasis on readability, including better annotation of my code and decisions. Some of that code is from when I first learned R, and it’s a bit … rough. (In the last moment before submitting my Master’s thesis I decided, in a small state of frustrated fury, to re-do all analyses in R so that I needn’t mention SPSS or Excel in the thesis…)

pps. In the manuscript, I link to the page via a GitHub Pages URL shortener, but provide a permalink (the web page stored with the Wayback Machine) in the references. We’ll see what the journal thinks of that.

ppps. There are probably errors lurking around, so please notify me when you see them 🙂

Analyse your research design, before someone else does

In this post, I demonstrate how one could use Gelman & Carlin’s (2014) method to analyse a research design for Type S (wrong sign) and Type M (exaggeration ratio) errors, when studying an unknown real effect. Please let me know if you find problems in the code presented here.

[Concept recap:]

Statistical power is the probability that you detect an effect when it’s really there. Conventionally disregarded completely, but often set at 80% (more is better, though).

Alpha is the probability you’ll say there’s something when there’s really nothing, in the long run (as put by Daniel Lakens). Conventionally set at 5%.

Two classic types of errors. Mnemonic: with type 1, there’s one person, and with type 2, there are two people. Not making a type 2 error is called ‘power’ (feel free to make your own mnemonic for that one).

Why do we need to worry about research design?

If you have been at all exposed to the recent turbulence in the psychological sciences, you may have bumped into discussions about the importance of bigger-than-conventional sample sizes. The reason is, in a nutshell, that if we find a “statistically significant” effect with an underpowered study, the results are likely to be grossly overestimated and perhaps fatally wrong.

Traditionally, if people have considered their design at all, they have done it in relation to Type 1 and Type 2 errors. Gelman and Carlin, in a cool paper, bring another perspective to this thinking. They propose considering two things:

Say you have discovered a “statistically significant” effect (p < alpha)…

  1. How probable is it that you have in your hands a result of the wrong sign? Call this a Type S (sign) error.
  2. How exaggerated is this finding likely to be? Call this a Type M (magnitude) error.
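Gelman & Carlin’s design analysis is easy to sketch in R. The function below is my approximation of their retrodesign(), assuming a normally distributed test statistic; `effect` is the assumed true effect and `se` the standard error of your design:

```r
# Sketch of Gelman & Carlin's (2014) Type S / Type M design analysis
retro_design <- function(effect, se, alpha = 0.05, n_sims = 100000) {
  z <- qnorm(1 - alpha / 2)
  pow   <- pnorm(effect / se - z) + pnorm(-effect / se - z)  # power
  typeS <- pnorm(-effect / se - z) / pow                     # P(wrong sign | significant)
  est   <- effect + se * rnorm(n_sims)                       # hypothetical replications
  sig   <- abs(est) > z * se                                 # which ones reach significance
  typeM <- mean(abs(est[sig])) / effect                      # exaggeration ratio
  c(power = pow, typeS = typeS, typeM = typeM)
}

set.seed(3)
retro_design(effect = 0.1, se = 0.1)  # underpowered: large Type M error
```

With a true effect only as big as the standard error, power sits below 20% and significant estimates exaggerate the effect by more than a factor of two, even though sign errors remain rare.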

Let me exemplify this with a research project we’re writing up at the moment. We had two groups of around 130 participants each, and exposed one of them to a message with the word “because” followed by a reason; the other received a succinct message. We then observed their subsequent behaviour. Note that you can’t use the observed effect size to figure out your power (see this paper by Dienes). That’s why I figured out a minimally interesting effect size of around d=.40 [defined by calculating the mean difference considered meaningful, and dividing it by the standard deviation we got in another study].

First, see that we had OK power to detect a wide array of decent effects:


So, unless the (unknown) effect is smaller than what we care about, we should be able to detect it.


Next, above we see that the probability of observing an effect of the wrong sign would be minuscule for any effect over d=.2. A sign error would mean it looked like the succinct message worked better than the reason message, when it was really the other way around.

Finally, and a little surprisingly, we can see that even relatively large true effects would be exaggerated by a factor of two!


But what can you do; those were all the participants we could muster with our resources. An interesting additional perspective comes from the “v-statistic”: a measure of how your model compares to random guessing, where 0.5 represents coin-flipping accuracy (see here for a full explanation and the original code I used).


The figure above shows how we start exceeding random guessing at an R^2 of around 0.25 (d=.32 according to this). The purple line shows how an additional 90 people help a little, but do not do wonders. I’ll write about the results of this study in a later post.

Until then, please let me know if you spot errors or find this remotely helpful. In case of the latter, you might be interested in how to calculate power in cluster randomised designs.

Oh, and the heading? I believe it’s better to do as much of this sort of thinking, before someone looking to have your job (or, perhaps, reviewer 2) does it for you.

Taking back the power (in cluster randomization)

In the right light, study becomes insight
But the system that dissed us
Teaches us to read and write

– Rage Against The Machine, “TAKE THE POWER BACK”


Statistical power is the probability of finding an effect of a specified size, if it exists. It is of critical importance to interpreting research, but it’s amazing how little attention it gets in undergraduate statistics courses. Of course it can be argued that, up until recent years, statistics in psychology was taught by those who were effectively a part of the problem. I can’t recommend enough this wonderful summary of Cohen’s classic article about how psychology failed to take power seriously in the 20th century.

This doesn’t mean the problem is eradicated. In social psychology research, power still seems to be less than 50%. It gets worse in neuroscience, where the median power is estimated to be around 20%. So, if an effect is real, you have a 1-in-5 probability of finding it with your test. And if you do happen to find the effect, it is most probably grossly overestimated, because when an effect happens to look big just by chance, it crosses the p<0.05 threshold more easily. (See the paper by John Ioannidis, “Why Most Discovered True Associations Are Inflated”.)

Once more: if your study is underpowered, you not only fail to detect possible effects, but also get unrealistic estimates when you do.
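A quick simulation (mine, not from the post) makes this concrete: in an underpowered two-group design, the estimates that survive the p < .05 filter are inflated on average, even though the individual estimates are unbiased:

```r
set.seed(42)
d <- 0.3; n <- 20  # true effect d = 0.3, only 20 participants per group
sims <- replicate(5000, {
  x <- rnorm(n); y <- rnorm(n, mean = d)
  c(est = mean(y) - mean(x), p = t.test(y, x)$p.value)
})

mean(sims["p", ] < .05)          # power: well under 50%
sig_est <- sims["est", sims["p", ] < .05]
mean(abs(sig_est)) / d           # exaggeration among "significant" results
```

Only the unusually large (chance-inflated) estimates clear the significance bar, so the published subset systematically overstates the effect.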

Recently, I’ve had the interesting experience of having to figure out how to do sample size calculations in a cluster-randomised setting. This essentially means that you’re violating the assumption of independent observations, because your participants come clustered in e.g. classrooms, and people in one classroom tend to be more like each other than people in another classroom.

It also pretty much churns your dreams [of simple sample size specification] to dust.

That’s probably the case, Research Wahlberg. But HOW MUCH more?! This intra-class correlation is killing me!
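As a first-pass answer to “how much more”: the standard design effect inflates the flat-design sample size by 1 + (m - 1) * ICC, where m is the average cluster size. A base-R sketch, with made-up numbers for illustration:

```r
# Per-group n for a flat (non-clustered) two-arm design: d = 0.5, 80% power
n_flat <- ceiling(power.t.test(delta = 0.5, sig.level = 0.05, power = 0.80)$n)

# Design effect for clusters of ~25 pupils with an assumed ICC of 0.05
deff <- 1 + (25 - 1) * 0.05

ceiling(n_flat * deff)  # per-group n after accounting for clustering
```

Even a modest ICC of 0.05 with classroom-sized clusters more than doubles the required sample; the Excel sheets below handle the fuller multi-level formulas.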

So, to make my life easier, I built a couple of Excel sheets that can be used by a simpleton like me. You can download the file from the end of this post. (Note: the sheets contain “array formulas” that only work in Excel, so sadly no OpenOffice version.)

I want to make it perfectly clear that I still know very little about power analysis (or anything else, actually) and made these as tools to help me out because my go-to statistician was too busy to give me the support I needed. Sources and justifications are provided, but it’s not impossible these calculations are totally wrong.

I’m guessing your friendly neighbourhood statistician, too, would rather help “see if your calculations are correct” instead of doing your calculations for you. So I’m hoping you can use this tool to estimate the sample size, then talk to a statistician, and let me know if they say you have corrections to make 🙂


What’s in the sheets

Here’s what’s in the file:

2-level cluster randomization: sample size aide


Use this sheet to calculate the sample size for 2-level cluster randomization when you know power and a bunch of other things. Some links and guidance are included. It also has two toys (the rightmost and the bottom yellow blocks) that give you optimistic estimates of whether your “discovery” is false. These are based on this paper, which I highly recommend if you want to make sense of p-values.
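The core of such a toy can be sketched with the standard false-positive-risk formula (my sketch, not the sheet’s exact formulas): given alpha, power, and the prior probability that the tested effect is real, what fraction of “significant” findings are false?

```r
# Optimistic lower bound on the false positive risk of a "significant" result
false_positive_risk <- function(prior, power, alpha = 0.05) {
  alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)
}

false_positive_risk(prior = 0.1, power = 0.80)  # ~0.36
```

Even with 80% power, if only 1 in 10 tested hypotheses is true, over a third of p < .05 “discoveries” are false positives, and real studies rarely have 80% power.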

Find the ICC (intra-class correlation) in SPSS and R


One of the big boogeymen, to me at least, of the whole enterprise was the intra-class correlation (apparently often used synonymously with “intra-cluster” correlation). I jotted down the instructions I wish I’d had when I began meddling with this stuff.
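For a flavour of the R route, a one-way ANOVA estimate of the ICC fits in a few lines (a simplified estimator assuming roughly balanced clusters; lme4 or similar gives model-based estimates):

```r
# ANOVA estimator of the intra-class correlation
icc_anova <- function(y, cluster) {
  ms <- summary(aov(y ~ factor(cluster)))[[1]][["Mean Sq"]]
  m  <- mean(table(cluster))  # average cluster size
  (ms[1] - ms[2]) / (ms[1] + (m - 1) * ms[2])
}

# Simulated classrooms: 40 clusters of 25, true ICC = 0.5^2 / (0.5^2 + 1^2) = 0.2
set.seed(4)
cluster <- rep(1:40, each = 25)
y <- rnorm(40, sd = 0.5)[cluster] + rnorm(1000)
icc_anova(y, cluster)  # should land near 0.2
```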

Power calculator for a 3-level cluster randomized design


Here’s the dream crusher. In the “Justifications…” sheet you’ll find the mathematical formulas and the logic behind the machine, but it’s not super obvious to us mortals. I managed to make it work in Excel by combining pieces of code from all over the internet; I’m hoping you don’t need to do the same.

Download the Excel-file HERE.

Have fun and let me know if you find errors! All other comments are of course welcome, too.