# Idiography illustrated: Things you miss when averaging people

This post contains slides I made to illustrate some points about phenomena that will remain forever out of reach if we continue the common practice of always averaging individual data. For another post on perils of averaging, check this out, and for an overview of idiographic research with resources, see here.

(Almost the same presentation with some narration is included in this thread, in case you want more explanation.)

Here’s one more illustration of why you need the right sampling frequency for whatever it is you study – and the less you know, the denser sampling you need initially. From a paper I’m drafting:

The figure illustrates a hypothetical percentage of a person’s maximum motivation (y-axis) measured on different days (x-axis). Panels:

• A) Measurement at three time points (representing conventional evaluation of baseline, post-intervention and a longer-term follow-up) shows a decreasing trend.
• B) Measurement on slightly different days shows an opposite trend.
• C) Measuring 40 time points instead of three would have accommodated both phenomena.
• D) New linear regression line (dashed) as well as the LOESS regression line (solid), with potentially important processes taking place during the circled data points.
• E) Having measured 400 time points instead would have revealed a process of “deterministic chaos”. Without knowing the equation and the starting points, it would be impossible to predict accurately, but this doesn’t mean regression is helpful.
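The panels can be mimicked with a few lines of R. This is a minimal sketch and not the code behind the figure: I assume the logistic map (with r = 4) as a stand-in for the “deterministic chaos” of panel E, and the helper names `logistic_map` and `trend_slope` are made up for illustration.

```r
# Toy "motivation" series from the logistic map (an assumed stand-in
# for the chaotic process in panel E; not the data behind the figure).
logistic_map <- function(n, x0 = 0.2, r = 4) {
  x <- numeric(n)
  x[1] <- x0
  for (t in 2:n) x[t] <- r * x[t - 1] * (1 - x[t - 1])
  x
}

motivation <- 100 * logistic_map(400)   # percent of maximum, days 1..400

# Slope of an ordinary least-squares line through the sampled days
trend_slope <- function(days) {
  unname(coef(lm(motivation[days] ~ days))[2])
}

# Slide a three-point design (baseline, post, follow-up) one day at a time
starts <- 1:100
slopes <- sapply(starts, function(s) trend_slope(s + c(0, 150, 299)))

table(sign(slopes))   # how often the "trend" comes out negative vs positive
```

Shifting the same three-point design by a day or two is enough to flip the sign of the fitted trend, which is the point of panels A and B.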

During the presentation, a question came up: How much do we need to know? Do we really care about the “real” dynamics? Personally, I mostly just want information to be useful, so I’d be happy just tinkering with trial and error. Thing is, tinkering may benefit from knowing what has already failed, and where fruitful avenues may lie. My curiosity ends when we can help people change their behaviour in ways that fulfill the spirit of R.A. Fisher’s criterion for an empirically demonstrable phenomenon:

In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher 1935b/1947, p. 14; see Mayo 2018)

So, if I were a physiology researcher studying the effects of exercise, I would have changed fields (to e.g. PA promotion) when the negative effects of low activity became evident, whereas other people want to learn the exact metabolic pathways by which the thing happens. And I will quit intervention research when we figure out how to create interventions that fail to work <5% of the time.

Some people say we’re dealing with human phenomena so unpredictable and turbulent that we cannot expect to do much better than we currently do. I disagree with this view, as all the methods I’ve seen used in our field so far are designed for ergodic, stable, linear systems. But there are other kinds of methods, which physicists started using when they left behind the ones that stuck with us, around maybe the 19th century. I’m very excited about learning more at the Complexity Methods for Behavioural Science summer school (here are some slides on what I presume will be among the topics).

I don’t have examples of e.g. physical activity, because nobody’s done that yet, and the lack of good longitudinal within-individual data is a severe historical hindrance. But some research groups are gathering longitudinal continuous data, and one that I know of has very long time series of machine vision data on school yard physical activity (those are systems, too, just like individuals). Plenty has already been done in the public health sphere.

Hell do I know, this might turn out to be a dead-end, like most new developments tend to be.

But I’d be happy to be convinced that it is an inferior path to our current one 😉

# Correlation pitfalls – Happier times with mutual information?

I’ve become increasingly anxious about properties of correlation I never knew existed. I collect resources and stuff on the topic in this post, so that I have everything in one place. Some resources for beginners are at the end of the post.

Correlation isn’t causation, and causation doesn’t require correlation. Ok. But have you heard that correlation is not correlation? In other words, things can be dependent without being correlated, and independent though correlated. Ain’t that fun. As Shay Allen Hill describes visually in his excellent, short blog (HIGHLY RECOMMENDED):

[C]ovariance doesn’t actually measure “Does y increase when x increases?” it only measures “Is y above average when x is above average (and by how much)?” And when covariance is broken [i.e. mean doesn’t coincide with median], our correlation function is broken.

So there may well be situations where only 20% of people in the sample show dependence between two variables, and this shows up as a correlation of at least 0.37. Likewise, a correlation of 0.5 carries ~4.5 times (and a correlation of 0.75 carries ~12.8 times) more information than a correlation of 0.25. As you may know, in psychology, it’s quite rare to see a correlation of 0.5. But even a correlation of 0.5 only gives 13% more information than random. This prompted the following conversation:
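The post doesn’t spell out where these figures come from. They are reproduced if one assumes the two variables are jointly Gaussian, in which case mutual information is −½·log(1 − ρ²). The Gaussian assumption, and my reading of the “13%” as 1 − √(1 − ρ²), are mine; the function name `gaussian_mi` is made up for this sketch.

```r
# Mutual information (in nats) between two jointly Gaussian variables
# with correlation rho. The Gaussian assumption is mine, for illustration.
gaussian_mi <- function(rho) -0.5 * log(1 - rho^2)

rho <- c(0.25, 0.5, 0.75)
ratio <- gaussian_mi(rho) / gaussian_mi(0.25)
round(ratio, 1)   # 1.0  4.5 12.8

# Proportional reduction in the standard deviation of prediction error
# when predicting y from x with a linear model: 1 - sqrt(1 - rho^2)
round(1 - sqrt(1 - 0.5^2), 3)   # 0.134
```

So under these assumptions, the information carried by a correlation grows much faster than the coefficient itself.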

How can we interpret a result without in-depth knowledge of the field as well as the data in question? A partial remedy apparently is using mutual information instead (see this paper draft for more information). I know nothing about it, so like always, I just started playing around with things I don’t understand. Here’s what came out:

The first four panels are Anscombe’s Quartet. The fifth illustrates Taleb’s point about intelligence. Data for the last two panels are from this project. The first four and the last two panels have the same mean and standard deviation. Code for creating the pic is here.

MIC and BCMI were new to me, but I thought they were easy to implement, which doesn’t of course mean they make sense. But see how they catch the dinosaur?

• MIC is the Maximal Information Coefficient, from maximal information-based nonparametric exploration (documentation)
• BCMI stands for Jackknife Bias Corrected MI estimates (documentation)
• DCOR is distance correlation (see comments)
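To get a feel for what such measures do, here is a minimal base-R sketch of distance correlation, written from the usual double-centring definition. It is not the code behind the figure, and `dcor_sketch` is a name invented for this example; for real work you would use a tested package implementation.

```r
# Distance correlation from scratch (base R only), following the usual
# double-centring definition; `dcor_sketch` is a made-up name.
dcor_sketch <- function(x, y) {
  centre <- function(v) {
    d <- as.matrix(dist(v))   # pairwise distances |v_i - v_j|
    d - rowMeans(d)[row(d)] - colMeans(d)[col(d)] + mean(d)
  }
  A <- centre(x)
  B <- centre(y)
  dcov2 <- mean(A * B)        # squared distance covariance
  sqrt(dcov2 / sqrt(mean(A * A) * mean(B * B)))
}

x <- seq(-1, 1, length.out = 201)
y <- x^2                      # fully dependent on x, but not linearly

cor(x, y)                     # Pearson: essentially zero
dcor_sketch(x, y)             # distance correlation: clearly above zero
dcor_sketch(x, 2 * x + 1)     # equals 1 for an exact linear relation
```

The parabola is the textbook case of “dependent but uncorrelated”: Pearson’s r misses it entirely, while the distance-based measure does not.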

I’d be happy to hear thoughts and caveats regarding the use of entropy-based dependency measures in general, and these in particular, from people who actually know these methods. Here’s a related Twitter thread, or just email me!

ps. If this is your first brush with uncertainties related to correlations, and/or you have little or no statistics background, you may not know how correlation can vary spectacularly in small samples. Taleb’s stuff (mini-moocs [1, 2]) can sometimes be difficult to grasp without math background, so perhaps get started with this visualisation, or these Excel sheets. A while ago I animated some elementary simulations of p-value distributions for statistical significance of correlations; selective reporting makes things a lot worse than what’s depicted there. If you’re a psychology student, also be sure to check out the p-hacker app. If you haven’t thought about distributions much lately, check this out for a fun read by a math student.

⊂This post has been a formal sacrifice to Rexthor.⊃

# Statistical tests for social science

These are slides from my lecture on significance testing, which took place in a course on research methods for social scientists. Some thoughts:

• I tried to emphasise that this stuff is difficult, that people shouldn’t be afraid to say they don’t know, and that academics should try doing that more, too.
• I tried to instill a deep memory that many uncertainties are involved in this endeavour, and that mistakes are ok as long as you report the choices you made transparently.
• Added a small group discussion exercise at about 2/3 of the lecture: What was the most difficult part to understand so far? I think this worked quite well, although “Is this what an existential crisis feels like?” was not an uncommon response.

I really think statistics is mostly impossible to teach, and people learn when they get interested and start finding things out on their own. Not sure how successful this attempt was in doing that. Anyway, slides are available here.

TLDR: If you’re a seasoned researcher, see this. If you’re an aspiring one, start here or here, and read this.

# Complexity considerations for intervention (process) evaluation

For some years, I’ve been partly involved in the Let’s Move It intervention project, which targeted dysfunctional physical activity and sedentary behaviour patterns of older adolescents, by affecting their school environment as well as social and psychological factors.

I held a talk at the closing seminar; it was live streamed and is available here (on stage starting from about 1:57:00 in the recording). But if you were there, or are otherwise interested in the slides I promised, they are now here.

For a demonstration of non-stationary processes (which I didn’t talk about but which are mentioned in these slides), check out this video and an experimental mini-MOOC I made. Another blog post touching on some of the issues is found here.

# Misleading simplifications and where to find them (Slides & Mini-MOOC 11min)

The gist: to avoid getting fooled by them, we need to name our simplifying assumptions when modeling social scientific data. I’m experimenting with this visual approach to delivering information to those who think modeling is boring; feedback and improvement suggestions very welcome! [Similar presentation with between-individual longitudinal physical activity networks, presented at the Finnish Health Psychology conference: here]

I’m not as smooth as those talking heads on the interweb, so you may want just the slides. Download by clicking on the image below or watch at SlideShare.

SLIDE DECK:

Mini-MOOC:

Note: Jan Vanhove thinks we shouldn’t become paranoid about model assumptions; check his related blog post here!

# Modern tools to enhance reproducibility and comprehension of research findings (VIDEO WALKTHROUGH 14min)

These are the slides of my presentation at the annual conference of the European Health Psychology Society. It’s about presenting data visually, and taking publishing culture from the journals into our own hands. I hint at a utopia where the journal publication is a side product of a comprehensively reported data set.

Please find a 14min video walkthrough of the slides (which can be found here) below. The site presented in the slides is here, and the tutorial by the most awesome Lisa DeBruine is here!

After the talk, I saw what was probably the best tweet about a presentation of mine ever. For a fleeting moment, I was happy to exist:

Big thanks to everyone involved, especially Gjalt-Jorn Peters for helpful suggestions on code and the plots. For the diamond plots, check out diamondplots.com.

Authors of the conference abstract:

Matti Heino; Reijo Sund; Ari Haukkala; Keegan Knittle; Katja Borodulin; Antti Uutela; Vera Araújo-Soares; Falko Sniehotta; Tommi Vasankari; Nelli Hankonen

# Abstract

Background: Comprehensive reporting of results has traditionally been constrained by limited reporting space. In spite of calls for increased transparency, researchers have had to choose carefully what to report, and what to leave out; choices made based on subjective evaluations of importance. Open data remedies the situation, but privacy concerns and tradition hinder rapid progress. We present novel possibilities for comprehensive representation of data, making use of recent software developments.

Methods: We illustrate the opportunities using the Let’s Move It trial baseline data (n=1084). Descriptive statistics and group comparison results on psychosocial correlates of physical activity (PA) and accelerometry-assessed PA were reported in an easily accessible html-supplement, directly created from a combination of analysis code and data using existing tools within R.

Findings: Visualisations (e.g. network graphs, combined ridge and diamond plots) enabled presenting large amounts of information in an intelligible format. This bypasses the need to create narrative explanations for all data, or compress nuanced information into simple summary statistics. Providing all analysis code in a readily accessible format further contributed to transparency.

Discussion: We demonstrate how researchers can make their extensive analyses and descriptions openly available as website supplements, preferably with abundant visualisation to avoid overwhelming the reader with e.g. large numeric tables. Uptake of such practice could lead to a parallel form of literature, where highly technical and traditionally narrated documents coexist. While we may have to wait for fully open and documented data, comprehensive reporting of results is available to us now.

# Their mean doesn’t work for you

In this post, I present a property of averages I found surprising. Undoubtedly this is self-evident to statisticians and people who can think multi-variately, but personally I needed to see it to get a grasp of it. If you’re a researcher, make sure you do the single-item quiz before reading, to see how well your intuitions compare to those of others!

UPDATE: The finding regarding average intervention participants’ prevalence is published in this paper, in case you want a citable reference for it.

Ooo-oh! Don’t believe what they say is true
Ooo-oh! Their system doesn’t work for you
Ooo-oh! You can be what you want to be
Ooo-oh! You don’t have to join their f*king army

– Anti-Flag: Their System Doesn’t Work For You

In his book “The End of Average”, Todd Rose relates a curious story. In the late 1940s, the US Air Force saw a lot of planes crashing, and those crashes couldn’t be attributed to pilot error or equipment malfunction. On one particularly bad day, 17 pilots crashed without an obvious reason. As everything from cockpits to helmets had been built to conform to the average pilot of 1926, they brought in Lt. Gilbert Daniels to see if pilots had gotten bigger since then. Daniels measured 4063 pilots (who were preselected to not deviate from the average too much) on ten dimensions: height, chest circumference, arm length, thigh circumference, and so forth.

Before Daniels began, the general assumption was that these pilots were mostly if not exclusively average, and Daniels’ task was to find the most accurate point estimate. But he had a more fundamental idea in mind. He defined “average” generously as a person who falls within the 30% band around the middle, i.e. the median ±15%, and looked at whether each individual fulfilled that criterion on all ten bodily dimensions.

So, how big a proportion of pilots were found to be average by this metric?

Zero.

This may be surprising, until you realise that each additional dimension brings with it a new “objective”, making it less likely that someone achieves all of them. But actually, only a fourth were average on a single dimension, and already less than ten percent were average on two dimensions.

As you saw in the quiz, I wanted to figure out how big a proportion of our intervention participants could be described as “average” by Daniels’ definition, on four outcome measures. The answer?

A lousy 1.5 percent.

I’m a bit slow, so I had to do a bit of simulation to get a better grasp of the phenomenon (code here). First, I simulated 700 intervention participants, who were hypothetically measured on four random, uncorrelated, normally distributed variables. What I found was that 0.86 % of this sample were “average” by the same definition as before. But what if we changed the definition?

Here’s what happens:

As you can see, you’ll describe more than half of the sample only when you extend the definition of “average” to about the middle 85% (i.e. median ±42.5%).
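The simulated 0.86 % and the roughly 85% threshold both follow from simple arithmetic when the variables are independent: the chance of being in the middle band on every one of k variables is the band width raised to the power k. A small sketch (the function name `share_average` is made up here):

```r
# If k variables are independent, the chance of landing in the middle
# `width` share of the distribution on all of them is width^k.
share_average <- function(width, k) width^k

round(100 * share_average(0.30, c(1, 2, 4, 10)), 4)
# 30.0000  9.0000  0.8100  0.0006   (percent)

# Narrowest band (in 5-point steps) that describes more than half of
# people on four independent variables
widths <- seq(0.05, 0.95, by = 0.05)
min(widths[share_average(widths, 4) > 0.5])   # 0.85
```

With ten independent dimensions and a 30% band, the expected share is about six in a million, so finding zero “average” people among a few thousand is exactly what this arithmetic predicts.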

But what if the variables were highly correlated? I also simulated 700 independent participants with four variables, which were correlated almost perfectly (within-individual r = 0.99) with each other. Still, only 22.9 % of participants were described by defining average as the middle 30% around the median. For other definitions, see the plot below.
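If you want to try this yourself, here is a base-R sketch of the same kind of simulation. It is not the linked code: I generate the correlated variables from one shared factor (my construction), use a much larger sample to reduce noise, and the function names are invented for the example.

```r
set.seed(1)

# Simulate n people on k standard-normal variables with pairwise
# correlation r, using one shared factor (my construction; the original
# simulation code may differ).
simulate_people <- function(n, k = 4, r = 0) {
  shared <- rnorm(n)
  noise  <- matrix(rnorm(n * k), nrow = n)
  sqrt(r) * shared + sqrt(1 - r) * noise
}

# Share of people inside the middle `width` band on every variable
share_average_all <- function(x, width = 0.30) {
  inside <- apply(x, 2, function(v) {
    band <- quantile(v, c(0.5 - width / 2, 0.5 + width / 2))
    v >= band[1] & v <= band[2]
  })
  mean(rowSums(inside) == ncol(x))
}

share_average_all(simulate_people(1e5, r = 0))      # close to 0.3^4 = 0.0081
share_average_all(simulate_people(1e5, r = 0.99))   # far higher, yet well under half
```

Changing `width` lets you trace out the curves for other definitions of “average”.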

What have we learned? First of all: When you see averages, do not go assuming that they describe individuals. If you’re designing an intervention, you don’t just want to see which determinants correlate highly with the target behaviour on average, or seem changeable in the sense that the mean on those variables is not very high to begin with in your target group (see the CIBER approach, if you’re starting from scratch and want to get a preliminary handle on the data). This is because a single individual is unlikely to have the average standing on more than, say, two of the determinants, and individuals are who you’re generally looking to target. One thing you could do is a cluster analysis, where you’d look for the determinant profile that is best associated with e.g. hospital visits (or attitude/intention), and try to target the changeable determinants within that.

As a corollary: If you, your child, or your relationship doesn’t seem to conform to the dimensions of an average person in your city, or a particular age group, or whatever, this is completely normal! Whenever you see yourself falling behind the average, remember that there are plenty of dimensions where you land above it.

But wait, what happened to USAF’s problem of planes crashing? Well, the air force told the plane manufacturers to fix the problem of cockpits which don’t fit any individuals. The manufacturers said it was impossible and extremely costly. But when the air force didn’t listen to excuses, cheap and easy solutions appeared quickly. Adjustable seats (now standard equipment in cars) are an example of the new design philosophy of individual fit, where we don’t try to fit the individual to the system, but the system to the individual.

Let us conclude with Daniels’ introduction section:

Note 1: Here’s a very nice Google Talks presentation of this and extended topics!

Note 2: There’s a curious tendency to think that deviations from the average represent “error” regardless of domain, whereas it’s self-evident that individuals can survive both if they’re e.g. big and bulky, or small and fast. With psychological measurement, is it not madness to think all participants have an attitude score, which comes from a normal distribution with a common mean for all participants? To inject reality into the situation, each participant may have their own mean, which changes over time. But that’s a story for another post.

Note 3: I’m taking it for granted that we already know that the average is a useless statistic to begin with, unless you know the variation around the average, so I won’t pound on that further. But remember that, unlike in the above simulations, variables generally aren’t perfectly normally distributed; my guess is that the situation would be even worse in those cases. Here’s a blog post you may want to check out: On Average, You’re Using the Wrong Average.

Note 4: Did I already say that you generally shouldn’t make individual-level conclusions based on between-individual data, unless ergodicity holds (which, in psychology, would be quite weird)? See short video here!

# Assumptions, Schmassumptions; Did my intervention work or not?!

I organised a mini-seminar in Cambridge (ad below), relating to compatibility of theories and models of change. Please find my slides below, or download them here.

# Visualising ordinal data with Flaming Pillars of Hell

I recently had a great experience with a StackOverflow question, when I was thinking about how to visualise ordinal data. This post shows an option for how to do that. Code for the plots is at the end of this post.

Update: here’s an FB discussion, which mentions e.g. a good idea of making stacked % graphs (though I like to see the individuals, so they won’t sneak up behind me) and using the package TraMineR to visualise and analyse change.

Update 2: Although they have other names too, I’m going to call these things flamethrower plots. The name reflects the fact that even though you have the opportunity to use them, it may not always be the best idea.

Say you have scores on some Likert-type scale questionnaire items, like motivation, at two time points, and would like to visualise them. You’re especially interested in whether you can see detrimental effects, e.g. due to an intervention. One option would be to make a plot like this: each line in the plot below is one person, and the lighter lines indicate bigger increases in motivation scores, whereas the darker lines indicate iatrogenic development. The data is simulated so that the highest increases take place in the item in the leftmost plot, the middle one is randomness, and the right one shows iatrogenics.

I have two questions:

1. Do these plots have a name, and if not, what should we call them?
2. How would you go about superimposing model-implied-changes, i.e. lines showing that when someone starts off at, for example, a score of four, where are they likely to end up in T2?

The code below first simulates 500 participants at two time points, then draws the plot. If you want to use it on your own data, rename the variables to the form scaleName_itemNumber_timePoint (e.g. “motivation_02_T1”).

```r
library(dplyr)    # provides %>%, group_by() and mutate()
library(ggplot2)  # plotting; tidyr must be installed too (called via tidyr::)

set.seed(1)       # make the simulated data reproducible

# Simulate data: three items at two time points. Item 01 tends to
# increase at T2, item 02 is pure noise, item 03 tends to decrease.
data <- data.frame(
  id = 1:500,
  Intrinsic_01_T1 = sample(1:5, 500, replace = TRUE),
  Intrinsic_02_T1 = sample(1:5, 500, replace = TRUE),
  Intrinsic_03_T1 = sample(1:5, 500, replace = TRUE),
  Intrinsic_01_T2 = sample(1:5, 500, replace = TRUE, prob = c(0.1, 0.1, 0.2, 0.3, 0.3)),
  Intrinsic_02_T2 = sample(1:5, 500, replace = TRUE),
  Intrinsic_03_T2 = sample(1:5, 500, replace = TRUE, prob = c(0.3, 0.3, 0.2, 0.1, 0.1)))

pd <- position_dodge(0.4) # Spread points along the x-axis to make them more readable

# Draw plot:
data %>%
  tidyr::gather(variable, value, -id) %>%
  tidyr::separate(variable, c("item", "time"), sep = "_T") %>% # time becomes "1" or "2"
  mutate(value = jitter(value, amount = 0.1)) %>% # Y-axis jitter to make points more readable
  group_by(id, item) %>%
  mutate(slope = value[time == "2"] - value[time == "1"]) %>% # change from T1 to T2
  ungroup() %>%
  ggplot(aes(x = time, y = value, group = id)) +
  geom_point(size = 1, alpha = .2, position = pd) +
  # linewidth needs ggplot2 >= 3.4.0; use size = 1.5 on older versions
  geom_line(alpha = .2, position = pd, aes(color = slope), linewidth = 1.5) +
  scale_color_viridis_c(option = "inferno") +
  ggtitle("Changes in indicators of motivation scores") +
  ylab("Intrinsic motivation scores") +
  xlab("Time points") +
  facet_wrap(~ item)
```
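As a crude, model-free starting point for question 2, one could tabulate where people who started at each score ended up. This is only an empirical transition table on freshly simulated data, not a fitted model; the variable names `t1`, `t2` and `transitions` are made up for this sketch.

```r
set.seed(42)

# One simulated item at two time points (same recipe as the item that
# tends to increase in the plot code)
t1 <- sample(1:5, 500, replace = TRUE)
t2 <- sample(1:5, 500, replace = TRUE, prob = c(0.1, 0.1, 0.2, 0.3, 0.3))

# Rows: score at T1; columns: share of those people at each T2 score
transitions <- prop.table(
  table(T1 = factor(t1, levels = 1:5), T2 = factor(t2, levels = 1:5)),
  margin = 1
)
round(transitions, 2)

# Most common T2 score for each T1 starting point
apply(transitions, 1, which.max)
```

Lines based on such a table (or on predictions from an ordinal regression model) could then be drawn on top of the individual trajectories.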

# The secret life of (complex dynamical) habits

It was recently brought to my attention that there exist such things as time and context, the flow of which affects human affairs considerably. Then there was this Twitter conversation about what habits actually are. In this post, I try to make sense of how to view health behavioural habits from the perspective of dynamical systems / complexity theory. I mostly draw from this article.

Habits are integral to human behaviour, and arguably necessary to account for in intervention research [1–3]. Gardner [1] proposes a definition of habit as not a behaviour but “a process by which a stimulus generates an impulse to act as a result of a learned stimulus-response association”. Processes being seldom stable for all eternity, a complex dynamical systems perspective would propose some consequences of this definition.

What does it mean when a process, such as habit, is stable? One way of conceiving this is considering the period of stability as a particular state a system can be in, while being subject to change. Barrett [4] proposes four features of dynamic system stability, in which a system’s states depend on the interactions among its components, as well as the system’s interactions with its environment.

First of all, stability always has a time frame, and stabilities at different time frames (such as stability over a month and a year) are interdependent. We ought to consider how these time scales interact. For example, some factors which determine one’s motivation to go to the gym, such as mood, fluctuate on the scale from minutes to hours. Others may fluctuate on the daily level, and can be influenced by how much one slept the previous night or how stressful one’s workday was, whereas others fluctuate weekly. Then again, some, which increasingly resemble dispositions or personality factors, may be quite stable across decades. When inspecting a health behaviour, we ought to be looking at minimum at the processes which take place on a time scale one level faster, and one level slower, than the one we are purportedly interested in [4]. For example, how do daily levels of physical activity relate to weekly ones, and how do monthly fluctuations affect the weekly fluctuations? Health psychologists could also classify each determinant of a health behaviour, based on the time scale it is thought to operate on. For example, if autonomous forms of motivation [5] seem to predict physical activity quite well cross-sectionally, we could attempt to measure it for a hundred days and investigate what the relevant time-scales of fluctuations are, in relation to those of the target behaviour. Such an exercise could also be helpful for deciding on the sampling frequency of experience sampling studies.
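As a rough illustration of looking for the relevant time scale, here is a base-R sketch with a hundred days of made-up motivation data. The weekly rhythm is built in by assumption, so the example only shows what the tools would look like, not what real data would reveal.

```r
set.seed(7)
days <- 1:100

# Hypothetical daily motivation: a weekly rhythm plus day-to-day noise
motivation <- sin(2 * pi * days / 7) + rnorm(length(days), sd = 0.5)

# Autocorrelation at lags 0..21 days; peaks point to the dominant time scale
ac <- acf(motivation, lag.max = 21, plot = FALSE)$acf[, 1, 1]
names(ac) <- 0:21
round(ac[c("3", "7", "14")], 2)

# Weekly means put the same series on the next slower time scale
weekly <- tapply(motivation[1:98], rep(1:14, each = 7), mean)
c(sd_daily = sd(motivation), sd_weekly = sd(weekly))
```

In this toy series the autocorrelation is positive at lags of 7 and 14 days and negative at 3 days, and almost all of the variation vanishes once the data are averaged by week: a process that is lively on one time scale can look flat on the next slower one.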

Second, processes in systems such as people have their characteristic attractor landscapes, and these landscapes can possibly be spelled out, along with the criteria associated with them. By attractors I mean here behaviours a person is drawn to, and an attractor landscape is the conglomerate of these behaviours. The cue-structure of the behaviours can be quite elaborate. For example, a person may smoke only when they have drunk alcohol (1) in a loud environment (2), among a relatively large group (3) of relatively unfamiliar people (4), one or two of whom are smokers (5); a situation where it is easier to have a private conversation if one joins another to go out for a cigarette. This highlights how the process of this person’s smoking habit can be very stable (mapping to the traditional conception of “habitual”), while also possibly being highly infrequent.

Note: Each of the aforementioned conditions is insufficient by itself for this person to smoke, although all are needed to trigger smoking in this context. As a whole, they are sufficient to cause the person to smoke, but not strictly necessary, because the person may smoke in some more-or-less limited other conditions, too. These conditions can also be called INUS conditions (referring to Insufficient but Necessary criteria of an Unnecessary but Sufficient context for the behaviour) [6]. Let that sink in a bit. As a corollary, if a criterion really is necessary, it may be an attractive target for intervention.
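The INUS logic is easier to digest as a truth function. The sketch below encodes the five cues from the example as logical arguments; `other_context` is a hypothetical second route to smoking that I added to make the first context unnecessary.

```r
# The five cues from the example, plus a hypothetical second route to
# smoking (`other_context`) that makes the first route unnecessary.
smokes <- function(alcohol, loud, large_group, unfamiliar, smokers_present,
                   other_context = FALSE) {
  (alcohol & loud & large_group & unfamiliar & smokers_present) | other_context
}

smokes(TRUE, TRUE, TRUE, TRUE, TRUE)      # TRUE: the full context is sufficient
smokes(TRUE, FALSE, FALSE, FALSE, FALSE)  # FALSE: alcohol alone is insufficient
smokes(FALSE, TRUE, TRUE, TRUE, TRUE)     # FALSE: alcohol is necessary within this context
smokes(FALSE, FALSE, FALSE, FALSE, FALSE, other_context = TRUE)
# TRUE: the context as a whole is unnecessary, because another route exists
```

Removing any single cue from the first context switches the behaviour off in that context, which is why a necessary cue can be an attractive intervention target.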

Third, the path through which change happens matters, a lot. Even when all determinants of behaviour are at the same value, the outcome may be very different depending on previous values of the outcome. This phenomenon is known as hysteresis, and it has been observed in various fields from physics (e.g. the form of a magnetic field depends on its past) to psychology (e.g. once a person becomes depressed due to excess stress, the stress level must be much lower to switch back to the normal state than was needed for the shift to depression [7]). As a health behaviour example, just imagine how much easier it is to switch from a consistent training regime to doing no exercise at all, compared to doing it the other way around. Another way to think about it is to consider that systems are “influenced by the residual stability of an antecedent regime” [4]. As a consequence of stability being “just” a particular type of a path-dependent dynamic process [4,8], we need to consider the history leading up to the period where a habit is active. This forces investigators to consider attractor patterns and sensitivity to initial conditions: When did this stable (or attractor) state come about? If interactions in a system create the state of the system, which bio-psycho-social interactions are contributing to the stable state in question?
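Hysteresis can be demonstrated with a toy bistable system. The equation below is a standard textbook example that I am assuming for illustration; it is not the model from reference 7, and the names `settle` and `sweep_stress` are made up.

```r
# Toy bistable system: d(state)/dt = stress + state - state^3.
# The equation is an assumed textbook example, not the model in reference 7.
settle <- function(state, stress, dt = 0.05, steps = 400) {
  for (i in seq_len(steps)) {
    state <- state + dt * (stress + state - state^3)  # simple Euler step
  }
  state
}

# Change stress slowly along a path, carrying the state along
sweep_stress <- function(stress_path, start_state) {
  states <- numeric(length(stress_path))
  state <- start_state
  for (i in seq_along(stress_path)) {
    state <- settle(state, stress_path[i])
    states[i] <- state
  }
  states
}

stress_up   <- seq(-1, 1, by = 0.05)
stress_down <- rev(stress_up)
up   <- sweep_stress(stress_up, start_state = -1)
down <- sweep_stress(stress_down, start_state = 1)

# Same stress level (zero), different outcome depending on history
mid <- which.min(abs(stress_up))
c(rising = up[mid], falling = down[length(down) - mid + 1])
# roughly -1 when stress is rising, roughly +1 when it is falling
```

Plotting `up` against `stress_up` and `down` against `stress_down` gives the familiar hysteresis loop: the jump to the high state happens at a clearly higher stress level than the jump back.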

Fourth, learning processes such as those happening due to interventions usually affect a cluster of variables’ stabilities, not just one of them. To change habits, we naturally need to consider which changeable processes should be targeted, but it is probably impossible to manipulate these processes in isolation. This has been dubbed the “fat finger problem” (Borsboom 2018, personal communication); trying to change a specific variable, like attempting to press a specific key on the keyboard with gloves on, almost invariably ends up affecting neighbouring variables. Our target is dynamic and interconnected, often calling for coevolution of the intervention and the intervened.

It is obvious that people can relapse to their old habitual (attractor) behaviour after an intervention, and likely that extinction, unlearning and overwriting of cue-response patterns can help in breaking habits, whatever the definition. But the complex dynamics perspective puts a special emphasis on understanding the time scale and history of the intervenable processes, as well as highlighting the difficulty of changing one process while holding others constant, as the classical experimental setup would propose.

I would be curious to hear thoughts about these clearly unfinished ideas.

1. Gardner, B. A review and analysis of the use of ‘habit’ in understanding, predicting and influencing health-related behaviour. Health Psychol. Rev. 9, 277–295 (2015).
2. Wood, W. Habit in Personality and Social Psychology. Personal. Soc. Psychol. Rev. 21, 389–403 (2017).
3. Wood, W. & Rünger, D. Psychology of Habit. Annu. Rev. Psychol. 67, 289–314 (2016).
4. Barrett, N. F. A dynamic systems view of habits. Front. Hum. Neurosci. 8, (2014).
5. Ryan, R. M. & Deci, E. L. Self-determination theory: Basic psychological needs in motivation, development, and wellness. (Guilford Publications, 2017).
6. Mackie, J. L. Causes and Conditions. Am. Philos. Q. 2, 245–264 (1965).
7. Cramer, A. O. J. et al. Major Depression as a Complex Dynamic System. PLoS ONE 11, (2016).
8. Roe, R. A. Test validity from a temporal perspective: Incorporating time in validation research. Eur. J. Work Organ. Psychol. 23, 754–768 (2014).