Their mean doesn’t work for you

In this post, I present a property of averages I found surprising. Undoubtedly this is self-evident to statisticians and people who can think multi-variately, but personally I needed to see it to get a grasp of it. If you’re a researcher, make sure you do the single-item quiz before reading, to see how well your intuitions compare to those of others!

Ooo-oh! Don’t believe what they say is true
Ooo-oh! Their system doesn’t work for you
Ooo-oh! You can be what you want to be
Ooo-oh! You don’t have to join their f*king army

– Anti-Flag: Their System Doesn’t Work For You

In his book “The End of Average”, Todd Rose relates a curious story. In the late 1940s, the US Air Force saw a lot of planes crashing, and those crashes couldn’t be attributed to either pilot error or equipment malfunction. On one particularly bad day, 17 pilots crashed without an obvious reason. As everything from cockpits to helmets had been built to conform to the average pilot of 1926, they brought in Lt. Gilbert Daniels to see if pilots had gotten bigger since then. Daniels measured 4063 pilots (who were preselected to not deviate from the average too much) on ten dimensions: height, chest circumference, arm length, thigh circumference, and so forth.

Before Daniels began, the general assumption was that these pilots were mostly if not exclusively average, and that Daniels’ task was to find the most accurate point estimate. But he had a more fundamental idea in mind. He defined “average” generously as a person who falls within the 30% band around the middle, i.e. the median ±15%, and looked at whether each individual fulfilled that criterion on all ten bodily dimensions.

So, how big a proportion of pilots were found to be average by this metric?

Zero.

averageman clip1
Daniels, Gilbert S. “The ‘Average Man’?” Air Force Aerospace Medical Research Lab, Wright-Patterson AFB, OH, 1952.

This may be surprising, until you realise that each additional dimension brings with it a new “objective”, making it less likely that someone achieves all of them. But actually, only a fourth were average on a single dimension, and already less than ten percent were average on two dimensions.

As you saw in the quiz, I wanted to figure out how big a proportion of our intervention participants could be described as “average” by Daniels’ definition, on four outcome measures. The answer?

A lousy 1.98 percent.

I’m a bit slow, so I had to do a bit of simulation to get a better grasp of the phenomenon (code here). First, I simulated 700 intervention participants, who were hypothetically measured on four random, uncorrelated, normally distributed variables. What I found was that 0.86% of this sample were “average” by the same definition as before. But what if we changed the definition?

Here’s what happens:

averageman uncorrelated

As you can see, you’ll describe more than half of the sample only when you extend the definition of “average” to about the middle 85% (i.e. median ±42.5%).
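For readers who want to poke at this themselves, here is a minimal sketch of the uncorrelated case. It is not the code linked above; the function names are made up for illustration, and it assumes standard normal variables with the band defined by population (not sample) percentiles. With independent variables, the expected share is simply the band width raised to the power of the number of variables.

```python
import random
from statistics import NormalDist


def expected_share(n_vars, band):
    """Expected share inside the central band on every one of n_vars independent variables."""
    return band ** n_vars


def share_average(n_people=700, n_vars=4, band=0.30, seed=1):
    """Simulated share of people inside the central band on all variables."""
    rng = random.Random(seed)
    # z-score cutoff: |z| below this value means "inside the middle `band`"
    cut = NormalDist().inv_cdf(0.5 + band / 2)
    hits = sum(
        all(abs(rng.gauss(0, 1)) < cut for _ in range(n_vars))
        for _ in range(n_people)
    )
    return hits / n_people


if __name__ == "__main__":
    print(expected_share(4, 0.30))   # middle 30%, four variables
    print(expected_share(4, 0.85))   # middle 85%, four variables
    print(share_average())           # one simulated sample of 700
```

Under these assumptions, the middle 30% on four variables gives 0.3^4, a bit under one percent, and the expected share only passes one half once the band is somewhere between 80% and 85% wide.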

But what if the variables were highly correlated? I also simulated 700 independent participants with four variables, which were correlated almost perfectly (within-individual r = 0.99) with each other. Still, only 22.9% of participants were described when defining average as the middle 30% around the median. For other definitions, see the plot below.

averageman correlated
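The correlated case can be sketched in the same spirit. This is again an illustration rather than the original simulation code: it assumes each variable is a weighted sum of one shared standard normal factor and its own independent noise, which gives every pair of variables the correlation r while keeping unit variance. The function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist


def share_average_correlated(r, n_people=700, n_vars=4, band=0.30, seed=1):
    """Share of people inside the central band on all variables,
    when every pair of variables correlates r (0 <= r <= 1)."""
    rng = random.Random(seed)
    cut = NormalDist().inv_cdf(0.5 + band / 2)
    shared_w, unique_w = r ** 0.5, (1 - r) ** 0.5
    hits = 0
    for _ in range(n_people):
        shared = rng.gauss(0, 1)  # the part all variables have in common
        scores = [shared_w * shared + unique_w * rng.gauss(0, 1)
                  for _ in range(n_vars)]
        hits += all(abs(s) < cut for s in scores)
    return hits / n_people


if __name__ == "__main__":
    for r in (0.0, 0.5, 0.9, 0.99):
        print(r, share_average_correlated(r, n_people=20000))
```

Even at r = 0.99 the share stays clearly below the 30% you would get with a single variable, because the small unique part of each variable is enough to push people near the edge of the band outside it on at least one measure.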

What have we learned? First of all: when you see averages, do not go assuming that they describe individuals. If you’re designing an intervention, you don’t just want to see which determinants correlate highly with the target behaviour on average, or seem changeable in the sense that the mean on those variables is not very high to begin with in your target group (see the CIBER approach, if you’re starting from scratch and want to get a preliminary handle on the data). This is because a single individual is unlikely to have the average standing on more than, say, two of the determinants, and individuals are who you’re generally looking to target. One thing you could do is a cluster analysis, where you’d look for the determinant profile that is best associated with e.g. hospital visits (or attitude/intention), and try to target the changeable determinants within that.

As a corollary: If you, your child, or your relationship doesn’t seem to conform to the dimensions of an average person in your city, or a particular age group, or whatever, this is completely normal! Whenever you see yourself falling behind the average, remember that there are plenty of dimensions where you land above it.

But wait, what happened to the USAF’s problem of planes crashing? Well, the air force told the plane manufacturers to fix the problem of cockpits that didn’t fit any individual. The manufacturers said it was impossible and extremely costly. But when the air force said it wouldn’t listen to excuses, cheap and easy solutions appeared quickly. Adjustable seats, now standard equipment in cars, are an example of the new design philosophy of individual fit, where we don’t try to fit the individual to the system, but the system to the individual.

Let us conclude with Daniels’ introduction section:

averageman clip2

Three additional notes about the average:

Note 1: I’m taking it for granted that we already know the average is a useless statistic to begin with, unless you know the variation around it, so I won’t pound on that further. But remember that variables generally aren’t perfectly normally distributed, as they were in the above simulations; my guess is that the situation would be even worse in those cases. Here’s a blog post you may want to check out: On Average, You’re Using the Wrong Average.

Note 2: There’s a curious tendency to think that deviations from the average represent “error” regardless of domain, whereas it’s self-evident that individuals can survive both if they’re e.g. big and bulky, or small and fast. With psychological measurement, is it not madness to think all participants have an attitude score that comes from a normal distribution with a common mean for all participants? To inject some reality into the situation, each participant may have their own mean, which changes over time. But that’s a story for another post.

Note 3: Did I already say that you generally shouldn’t make individual-level conclusions based on between-individual data, unless ergodicity holds (which, in psychology, would be quite weird)?

 

Is it possible to unveil intervention mechanisms under complexity?

In this post, I wonder what complex systems, as well as the nuts and bolts of mediation analysis, imply for studying processes of health psychological interventions.

Say we make a risky prediction and find an intervention effect that replicates well (never mind for now that replicability is practically never tested in health psychology). We could then go on to investigate boundary conditions and intricacies of the effect. What’s sometimes done is a study of “mechanisms of action”, also endorsed by the MRC guidelines for process evaluation (1), as well as the Workgroup for Intervention Development and Evaluation Research (WIDER) (2). In such a study, we investigate whether the intervention worked as we thought it should have worked (in other words, we test the program theory; see previous post). It would be spectacularly useful to decision makers if we could disentangle the mechanisms of the intervention: “by increasing autonomy support, autonomous motivation goes up and physical activity ensues”. But attempting to evaluate this opens a spectacular can of worms.

Complex interventions include multiple interacting components, targeting several facets of a behaviour on different levels of the environment the individual operates in (1). This environment itself can be described as a complex system (3). In complex, adaptive systems such as society or a human being, causality is a thorny issue (4): feedback loops, manifold interactions between variables over time, path-dependence and sensitivity to initial conditions make it challenging at best to state “a causes b” (5). But what does it even mean to say something causes something else?

Bollen (6) presents three conditions for causal inference: isolation, association and direction. Isolation means that no other variable can reasonably cause the outcome. This is usually impossible to achieve strictly, which is why researchers usually aim to control for covariates and thus reach a condition of pseudo-isolation. A common but not often acknowledged problem is overfitting: adding covariates to a model leads to also fitting the measurement error they carry with them. Association means there should be a connection between the cause and the effect – in real life, usually a probabilistic one. In social sciences, a problem arises as everything is more or less correlated with everything else, and high-dimensional datasets suffer from the “curse of dimensionality”. Direction, self-evidently, means that the effect should flow from one direction to the other, not the other way around. This is highly problematic in complex systems. For an example in health psychology, it seems obvious that depression symptoms (e.g. anxiety and insomnia) feed each other, resulting in self-reinforcing feedback loops (7).

When we consider the act of making efficient inferences, we want to be able to falsify our theories of the world (9); something that’s only recently really starting to be understood among psychologists (10). An easy-ish way to go about this is to define the smallest effect size of interest (SESOI) a priori, ensure one has proper statistical power, and attempt to reject the hypotheses that the effect is larger than the upper bound of the SESOI or smaller than the lower bound. This procedure, also known as equivalence testing (11), allows for falsifying statistical hypotheses in situations where a SESOI can be determined. But when testing program theories of complex interventions, there may be no such luxury.
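To make the procedure concrete, here is a rough sketch of the “two one-sided tests” logic behind equivalence testing. It assumes you already have an effect estimate and its standard error, and it uses a normal approximation instead of the t-distribution a real analysis would more likely use; the function name and the example numbers are invented for illustration.

```python
from statistics import NormalDist


def tost_z(estimate, std_error, lower, upper, alpha=0.05):
    """Two one-sided z-tests against the SESOI bounds (lower, upper).

    Returns the larger of the two one-sided p-values and whether the effect
    can be declared statistically smaller than the SESOI at level alpha.
    """
    z = NormalDist()
    # H0: true effect <= lower bound
    p_lower = 1 - z.cdf((estimate - lower) / std_error)
    # H0: true effect >= upper bound
    p_upper = z.cdf((estimate - upper) / std_error)
    p = max(p_lower, p_upper)
    return p, p < alpha


if __name__ == "__main__":
    # Hypothetical numbers: observed effect 0.0, SESOI of +/- 0.2
    print(tost_z(0.0, 0.05, -0.2, 0.2))  # precise estimate
    print(tost_z(0.0, 0.15, -0.2, 0.2))  # imprecise estimate
```

The same point estimate can thus count as evidence for “no effect worth caring about” or as uninformative, depending on the precision of the estimate and on where the SESOI bounds are set.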

The notion of non-linear interactions with feedback loops makes causality in a complex system an elusive concept. If we’re dealing with complexity, even minuscule effects can be meaningful when they interact with other effects: small effects can have huge influences down the line (“the butterfly effect” in nonlinear dynamics; 8). It is hence difficult to determine the SESOI for intermediate links in the chain from intervention to outcome. And if we only say we expect an effect to be “any positive number”, the postulated processes, as described in intervention program theories, become unfalsifiable in practice: if a correlation of 0.001 between intervention participation and a continuous variable would corroborate a theory, one would need more than six million participants to detect it (at 80% power and an alpha of 5%; see also 12, p. 30). If researchers are unable to reject the null hypothesis of no effect, they cannot determine whether there is evidence for a null effect, or whether a larger sample was needed (e.g. 13).
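The sample size figure can be checked with the standard Fisher z approximation for the power of a correlation test. This is a sketch under that approximation; whether the “six million” refers to a one-sided or a two-sided test is not stated in the text, so the function below (a hypothetical helper) computes both.

```python
from math import atanh, ceil
from statistics import NormalDist


def n_for_correlation(r, power=0.80, alpha=0.05, two_sided=False):
    """Approximate n needed to detect a true correlation r (Fisher z method)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


if __name__ == "__main__":
    print(n_for_correlation(0.001))                  # one-sided test
    print(n_for_correlation(0.001, two_sided=True))  # two-sided test
    print(n_for_correlation(0.1))                    # a more typical small effect
```

Under this approximation the one-sided case comes out a little above six million and the two-sided case closer to eight million, so the order of magnitude holds either way.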

Side note: One could use Bayes factors to compare whether a point null data generator (effect size being zero) would predict the data better than, for example, an alternative model where most effects are near zero but half of them over d = 0.2. But still, the smaller the effects you consider potentially important, the less the data can distinguish between alternative and null models. A better option could be to estimate how probable it is that the effect has a positive sign (as demonstrated here).

In sum, researchers are faced with an uncomfortable trade-off: either they specify a SESOI (and thus a hypothesis) which does not reflect the theory under test, or they accept unfalsifiability.

A common way to study mechanisms is to conduct a mediation analysis, where one variable’s (X) impact on another (Y) is modelled to pass through a third variable (M). In its classical form, one expects the path X-Y to go near zero, when M is added to the model.

The good news is that nowadays we can do power analyses for both simple and complex mediation models (14). The bad news is that when X is randomised but M is not, interpreting the observed M-Y relation causally entails strong assumptions, which are usually ignored (15). Researchers should, for example, justify why there exist no other mediating variables than the ones in the model; leaving variables out is effectively the same as assuming their effect to be zero. Also, the investigator should demonstrate why no omitted variables affect both M and Y – if there are such variables, the estimated causal effect may be distorted at best and misleading at worst.
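A toy simulation may help to see both the classical mediation pattern and the omitted-variable problem. Everything here is an assumption made for illustration: a randomised binary X, a fully mediated effect with paths a = 0.5 and b = 0.6, and an optional unobserved variable U that affects both M and Y. The helper names are made up, and the regression is done by hand to keep the sketch dependency-free.

```python
import random
from statistics import mean


def slopes(y, x1, x2):
    """OLS slopes of y on x1 and x2 (with intercept), via centred normal equations."""
    yc = [v - mean(y) for v in y]
    a = [v - mean(x1) for v in x1]
    b = [v - mean(x2) for v in x2]
    s11 = sum(u * u for u in a)
    s22 = sum(v * v for v in b)
    s12 = sum(u * v for u, v in zip(a, b))
    s1y = sum(u * w for u, w in zip(a, yc))
    s2y = sum(v * w for v, w in zip(b, yc))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def mediation_paths(confounding=0.0, n=5000, a=0.5, b=0.6, seed=1):
    """Return (total effect of X, direct effect of X given M, estimated b path)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]   # randomised intervention
    u = [rng.gauss(0, 1) for _ in range(n)]     # unobserved cause of M and Y
    m = [a * xi + confounding * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
    y = [b * mi + confounding * ui + rng.gauss(0, 1) for mi, ui in zip(m, u)]
    # Total effect: difference in mean Y between the two arms
    total = (mean(yi for xi, yi in zip(x, y) if xi == 1)
             - mean(yi for xi, yi in zip(x, y) if xi == 0))
    direct, b_hat = slopes(y, x, m)
    return total, direct, b_hat


if __name__ == "__main__":
    print(mediation_paths(confounding=0.0))  # X-Y path shrinks towards zero
    print(mediation_paths(confounding=0.8))  # b path inflated, X-Y path distorted
```

Without the confounder, the X-Y path drops to roughly zero once M is in the model, as the classical approach expects. With it, the total effect of the randomised X is still estimated fine, but the M-Y path is overestimated and a spurious negative “direct effect” appears, even though the data-generating model contains none.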

Now that we know it’s bad to omit variables, how do we avoid overfitting the model (i.e. being fooled by reading too much into what the data says)? It is very common for seemingly supported theories to fail to generalise to slightly different situations or other samples (16), and subgroup claims regularly fail to pan out in new data (17). Some solutions include ridge regression in the frequentist framework and regularising priors in the Bayesian one, but the simplest (though not the easiest) solution would be cross-validation. In cross-validation, you basically divide your sample into two (or even up to n) parts, use the first one to explore and the second one to “replicate” the analysis. Unfortunately, you need a large enough sample to be able to break it down into parts.
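The two-part version (a split-half or hold-out check, the simplest form of cross-validation) can be sketched as follows. The setup is deliberately artificial and entirely assumed: the outcome and all candidate predictors are pure noise, so any “finding” in the exploration half is overfitting by construction.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_check(n=100, n_predictors=30, seed=1):
    """Pick the best-looking predictor in one half, re-test it in the other."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n)]
    predictors = [[rng.gauss(0, 1) for _ in range(n)]
                  for _ in range(n_predictors)]
    half = n // 2  # rows are already in random order, so a simple cut will do
    explore = [pearson(p[:half], y[:half]) for p in predictors]
    best = max(range(n_predictors), key=lambda j: abs(explore[j]))
    confirm = pearson(predictors[best][half:], y[half:])
    return explore[best], confirm


if __name__ == "__main__":
    for seed in range(5):
        r_explore, r_confirm = split_half_check(seed=seed)
        print(f"explore r = {r_explore:+.2f}, confirm r = {r_confirm:+.2f}")
```

The correlation that looked most promising in the first half will typically shrink towards zero in the second half, which is exactly the warning a hold-out sample is supposed to give.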

What does all this tell us? Mainly, that investigators would do well to heed Kenny’s (18) admonition: “mediation is not a thoughtless routine exercise that can be reduced down to a series of steps. Rather, it requires a detailed knowledge of the process under investigation and a careful and thoughtful analysis of data”. I would conjecture that researchers often lack such process knowledge. It may also be that, under complexity, the exact processes become both unknown and unknowable (19). Tools like structural equation modelling are wonderful, but I’m curious if they are up to the task of advising us about how to live in interconnected systems, where trends and cascades are bound to happen, and everything causes everything else.

These are just relatively disorganised thoughts, and I’m curious to hear if someone can shed some hope on the situation. Specifically, hearing of interventions that work consistently and robustly would definitely make my day.

ps. If you’re interested in replication matters in health psychology, there’s an upcoming symposium on the topic at EHPS17 featuring Martin Hagger, Gjalt-Jorn Peters, Rik Crutzen, Marie Johnston and me. My presentation is titled “Disentangling replicable mechanisms of complex interventions: What to expect and how to avoid fooling ourselves?”

pps. A recent piece in The Lancet (20) called for a complex systems model of evidence for public health. Here’s a small conversation with the main author, regarding the UK Medical Research Council’s take on the subject. As you see, the science seems to be in some sort of a limbo/purgatory-type of place currently, but smart people are working on it so I have hope 🙂

complexity rutter twitter.PNG

https://twitter.com/harryrutter/status/876219437430517761

 

Bibliography

 

  1. Moore GF, Audrey S, Barker M, Bond L, Bonell C, Hardeman W, et al. Process evaluation of complex interventions: Medical Research Council guidance. BMJ. 2015 Mar 19;350:h1258.
  2. Abraham C, Johnson BT, de Bruin M, Luszczynska A. Enhancing reporting of behavior change intervention evaluations. JAIDS J Acquir Immune Defic Syndr. 2014;66:S293–S299.
  3. Shiell A, Hawe P, Gold L. Complex interventions or complex systems? Implications for health economic evaluation. BMJ. 2008 Jun 5;336(7656):1281–3.
  4. Sterman JD. Learning from Evidence in a Complex World. Am J Public Health. 2006 Mar 1;96(3):505–14.
  5. Resnicow K, Page SE. Embracing Chaos and Complexity: A Quantum Change for Public Health. Am J Public Health. 2008 Aug 1;98(8):1382–9.
  6. Bollen KA. Structural equations with latent variables. New York: John Wiley; 1989.
  7. Borsboom D. A network theory of mental disorders. World Psychiatry. 2017 Feb;16(1):5–13.
  8. Hilborn RC. Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. Am J Phys. 2004 Apr;72(4):425–7.
  9. LeBel EP, Berger D, Campbell L, Loving TJ. Falsifiability Is Not Optional. Accepted pending minor revisions at Journal of Personality and Social Psychology. [Internet]. 2017 [cited 2017 Apr 21]. Available from: https://osf.io/preprints/psyarxiv/dv94b/
  10. Morey RD, Lakens D. Why most of psychology is statistically unfalsifiable. GitHub [Internet]. in prep. [cited 2016 Oct 23]; Available from: https://github.com/richarddmorey/psychology_resolution
  11. Lakens D. Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses [Internet]. 2016 [cited 2017 Feb 24]. Available from: https://osf.io/preprints/psyarxiv/97gpc/
  12. Dienes Z. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan; 2008. 185 p.
  13. Dienes Z. Using Bayes to get the most out of non-significant results. Quant Psychol Meas. 2014;5:781.
  14. Schoemann AM, Boulton AJ, Short SD. Determining Power and Sample Size for Simple and Complex Mediation Models. Soc Psychol Personal Sci. 2017 Jun 15;194855061771506.
  15. Bullock JG, Green DP, Ha SE. Yes, but what’s the mechanism? (don’t expect an easy answer). J Pers Soc Psychol. 2010;98(4):550–8.
  16. Yarkoni T, Westfall J. Choosing prediction over explanation in psychology: Lessons from machine learning. figshare. 2016. Available from: https://dx.doi.org/10.6084/m9.figshare.2441878.v1
  17. Wallach JD, Sullivan PG, Trepanowski JF, Sainani KL, Steyerberg EW, Ioannidis JPA. Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials. JAMA Intern Med. 2017 Apr 1;177(4):554–60.
  18. Kenny DA. Reflections on mediation. Organ Res Methods. 2008;11(2):353–358.
  19. Bar-Yam Y. The limits of phenomenology: From behaviorism to drug testing and engineering design. Complexity. 2016 Sep 1;21(S1):181–9.
  20. Rutter H, Savona N, Glonti K, Bibby J, Cummins S, Finegood DT, et al. The need for a complex systems model of evidence for public health. The Lancet [Internet]. 2017 Jun 13 [cited 2017 Jun 17];0(0). Available from: http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)31267-9/abstract

 

The scientific foundation of intervention evaluation

In the post-replication-crisis world, people are increasingly arguing that even applied people should actually know what they’re doing when they do what they call science. In this post I expand upon some points I made in these slides about the philosophy of science behind hypothesis testing in interventions.

How does knowledge grow when we do intervention research? Evaluating whether an intervention worked can be phrased in relatively straightforward terms: “there was a predicted change in the pre-specified outcome”. This is, of course, a simplification. But try and contrast it with the attempt to phrase what you mean when you want to claim how the intervention worked, or why it did not. To do this, you need to spell out the program theory* of the intervention, which explicates the logic and causal assumptions behind intervention development.

* Also referred to as programme logic, intervention logic, theory-based (or driven) evaluation, theory of change, theory of action, impact pathway analysis, or programme theory-driven evaluation science… (Rogers, 2008). These terms are equivalent for the purposes of this piece.

The way I see it (for a more systematic approach, see intervention mapping), we have background theories (Theory of Planned Behaviour, Self-Determination Theory, etc.) and knowledge from earlier studies, which we synthesise into a program theory. This knowledge informs us about how we believe an intervention in our context would achieve its goals, regarding the factors (“determinants”) that determine the target behaviour. From (or during the creation of) this mesh of substantive theory and accompanying assumptions, we deduce a boxes-and-arrows diagram, which describes the causal mechanisms at play. These assumed causal mechanisms then help us derive a substantive hypothesis (e.g. “intervention increases physical activity”), which informs a statistical hypothesis (e.g. “accelerometer-measured metabolic equivalent units will be statistically significantly higher in the intervention group than the control group”). The statistical hypothesis then dictates what sort of observations we should be expecting. I call this the causal stream; each one of the entities follows from what came before it.

program_2streams.PNG

The inferential stream runs in the other direction. Hopefully, the observations are informative enough so that we can make judgements regarding the statistical hypothesis. The statistical hypothesis’ fate then informs the substantive hypothesis, and whether our theory upstream gets corroborated (supported). Right?

Not so fast. What we derived the substantive and statistical hypotheses from, was not only the program theory (T) we wanted to test. We also had all the other theories the program theory was drawn from (i.e. auxiliary theories, At), as well as an assumption that the accelerometers measure physical activity as they are supposed to, and other assumptions about instruments (Ai). Not only this, we assume that the intervention was delivered as planned and all other presumed experimental conditions (Cn) hold, and that there are no other systematic, unmeasured contextual effects that mess with the results (“all other things being equal”; a ceteris paribus condition, Cp).

Program_link tells.png

We now come to a logical implication (“observational conditional”) for testing theories (Meehl, 1990b, p. 119, 1990a, p. 109). Oi is the observation of an intervention having taken place, and Op is an observation of increased physical activity:

(T and At and Ai and Cn and Cp) → (Oi → Op)

[Technically, the first arrow should be logical entailment, but that’s not too important here.] The first bracket can be thought of as “all our assumptions hold”, the second bracket as “if we observe the intervention, then we should observe increased physical activity”. The whole thing thus roughly means “if our assumptions (T, A, C) hold, we should observe a thing (i.e. Oi → Op)”.

Now here comes falsifiability: if we observe an intervention but no increase in physical activity, the logical truth value of the second bracket comes out false, which also destroys the conjunction in the first bracket. By elementary logic, we must conclude that one or more of the elements in the first bracket is false – the big problem is that we don’t know which element(s) was or were false! And what if the experiment pans out? It’s not just our theory that’s been corroborated, but the bundle of assumptions as a whole. This is known as the Duhem-Quine problem, and it has brought misery to countless induction-loving people for decades.
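The logic can be checked by brute force. The sketch below simply enumerates every true/false combination of the five assumption bundles and keeps those combinations under which the conditional (T and At and Ai and Cn and Cp) → (Oi → Op) remains true, given what was observed. The function names are made up for illustration.

```python
from itertools import product

ASSUMPTIONS = ["T", "At", "Ai", "Cn", "Cp"]


def implies(p, q):
    """Material implication: p -> q."""
    return (not p) or q


def consistent_worlds(oi, op):
    """Truth assignments to the assumptions that are compatible with
    observing Oi = oi and Op = op."""
    worlds = []
    for values in product([True, False], repeat=len(ASSUMPTIONS)):
        if implies(all(values), implies(oi, op)):
            worlds.append(dict(zip(ASSUMPTIONS, values)))
    return worlds


if __name__ == "__main__":
    failed = consistent_worlds(oi=True, op=False)
    passed = consistent_worlds(oi=True, op=True)
    print(len(failed))                    # 31
    print(sum(w["T"] for w in failed))    # 15
    print(len(passed))                    # 32
```

A failed prediction rules out exactly one of the 32 combinations, the one where everything holds, and the program theory T is still true in 15 of the 31 that remain. A successful prediction rules out nothing at all on purely logical grounds, which is why the riskiness of the prediction has to carry the weight.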

EDIT: As Tal Yarkoni pointed out, this corroboration can be negligible unless one is making a risky prediction. See the damn strange coincidence condition below.

Program_link fails.png

EDIT: There was a great comment by Peter Holtz. Knowledge grows when we identify the weakest links in the mix of theoretical and auxiliary assumptions, and see if we can falsify them. And things do get awkward if we abandon falsification.

If wearing an accelerometer increases physical activity in itself (say people who receive an intervention are more conscious about their activity monitoring, and thus exhibit more pronounced measurement effects when told to wear an accelerometer), you obviously don’t conclude the increase is due to the program theory’s effectiveness. Also, you would not be very impressed by setups where you’d likely get the same result, whether the program theory was right or wrong. In other words, you want a situation where, if the program theory was false, you would doubt a priori that among those who increased their physical activity, many would have undergone the intervention. This is called the theoretical risk: the prior probability p(Op|Oi), i.e. the probability of observing an increase in physical activity given that the person underwent the intervention, should be low absent the theory (Meehl, 1990a, p. 199, mistyped in Meehl, 1990b, p. 110), and the lower the probability, the more impressive the prediction. In other words, spontaneous improvement absent the program theory should be a damn strange coincidence.
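One way to get a feel for why a low p(Op|Oi) absent the theory matters is a plain Bayes’ rule calculation. Note that this Bayesian reading is my own illustrative gloss, not Meehl’s formulation, and all the numbers (a 50% prior, a 90% chance of success if the theory is true) are hypothetical.

```python
def posterior_after_success(prior, p_success_if_true, p_success_if_false):
    """Probability that the theory is true after the predicted outcome occurred."""
    numerator = prior * p_success_if_true
    return numerator / (numerator + (1 - prior) * p_success_if_false)


if __name__ == "__main__":
    # Safe prediction: activity would probably go up anyway (0.8 absent the theory)
    print(posterior_after_success(0.5, 0.9, 0.8))
    # Risky prediction: an increase absent the theory would be a strange coincidence (0.1)
    print(posterior_after_success(0.5, 0.9, 0.1))
```

With the safe prediction, a success barely moves belief from the 50% starting point; with the risky one, the same success moves it to 90%.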

Note that solutions for handling the Duhem-Quine mess have been proposed both in the frequentist (e.g. error statistical piecewise testing, Mayo, 1996), and Bayesian (Howson & Urbach, 2006) frameworks.

What is a theory, anyway?

A lot of the above discussion hangs upon what we mean by a “theory” – and consequently, whether we should apply the process of theory testing to intervention program theories. [Some previous discussion here.] One could argue that saying “if I push this button, my PC will start” is not a scientific theory, and that interventions use theory but logic models do not capture them. It has been said that if the theoretical assumptions underpinning an intervention don’t hold, the intervention will fail, but that doesn’t make an intervention evaluation a test of the theory. This view has been defended by arguing that behaviour change theories underlying an intervention may work, but e.g. the intervention targets the wrong cognitive processes.

To me it seems like these are all part of the intervention program theory, which we’re looking to make inferences from. If you’re testing statistical hypotheses, you should have substantive hypotheses you believe are informed by the statistical ones, and those come from a theory – it doesn’t matter if it’s a general theory-of-everything or one that applies in a very specific context such as the situation of your target population.

Now, here’s a question for you:

If the process described above doesn’t look familiar and you do hypothesis testing, how do you reckon your approach produces knowledge?

Note: I’m not saying it doesn’t (though that’s an option), just curious about alternative approaches. I know that e.g. Mayo’s error statistical perspective is superior to what’s presented here, but I have yet to find an exposition of it I could thoroughly understand.

Please share your thoughts and let me know where you think this goes wrong!

With thanks to Rik Crutzen for comments on a draft of this post.

pps. Paul Meehl’s wonderful seminar Philosophical Psychology can be found in video and audio formats here.

Bibliography:

Abraham, C., Johnson, B. T., de Bruin, M., & Luszczynska, A. (2014). Enhancing reporting of behavior change intervention evaluations. JAIDS Journal of Acquired Immune Deficiency Syndromes, 66, S293–S299.

Dienes, Z. (2008). Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan.

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Quantitative Psychology and Measurement, 5, 781. https://doi.org/10.3389/fpsyg.2014.00781

Hilborn, R. C. (2004). Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. American Journal of Physics, 72(4), 425–427. https://doi.org/10.1119/1.1636492

Howson, C., & Urbach, P. (2006). Scientific reasoning: the Bayesian approach. Open Court Publishing.

Lakatos, I. (1971). History of science and its rational reconstructions. Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-94-010-3142-4_7

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Meehl, P. E. (1990a). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Meehl, P. E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244. https://doi.org/10.2466/pr0.1990.66.1.195

Moore, G. F., Audrey, S., Barker, M., Bond, L., Bonell, C., Hardeman, W., … Baird, J. (2015). Process evaluation of complex interventions: Medical Research Council guidance. BMJ, 350, h1258. https://doi.org/10.1136/bmj.h1258

Rogers, P. J. (2008). Using Programme Theory to Evaluate Complicated and Complex Aspects of Interventions. Evaluation, 14(1), 29–48. https://doi.org/10.1177/1356389007084674

Shiell, A., Hawe, P., & Gold, L. (2008). Complex interventions or complex systems? Implications for health economic evaluation. BMJ, 336(7656), 1281–1283. https://doi.org/10.1136/bmj.39569.510521.AD

 

Evaluating intervention program theories – as theories

How do we figure out whether our ideas worked out? To me, it seems that in psychology we seldom rigorously think about this question, despite having been criticised for dubious inferential practices for at least half a century. You can download a pdf of my talk at the Finnish National Institute for Health and Welfare (THL) here, or see the slide show at the end of this post. Please solve the three problems in the summary slide! 🙂

TLDR: is there a reason why evaluating intervention program theories shouldn’t follow the process of scientific inference?

summary