Is it possible to unveil intervention mechanisms under complexity?

In this post, I wonder what complex systems, as well as the nuts and bolts of mediation analysis, imply for studying processes of health psychological interventions.

Say we make a risky prediction and find an intervention effect that replicates well (never mind for now that replicability is practically never tested in health psychology). We could then go on to investigate the boundary conditions and intricacies of the effect. What’s sometimes done is a study of “mechanisms of action”, endorsed both by the MRC guidelines for process evaluation (1) and by the Workgroup for Intervention Development and Evaluation Research (WIDER) (2). In such a study, we investigate whether the intervention worked as we thought it should have worked (in other words, we test the program theory; see previous post). It would be spectacularly useful to decision makers if we could disentangle the mechanisms of the intervention: “by increasing autonomy support, autonomous motivation goes up and physical activity ensues”. But attempting to evaluate this opens a spectacular can of worms.

Complex interventions include multiple interacting components, targeting several facets of a behaviour on different levels of the environment the individual operates in (1). This environment itself can be described as a complex system (3). In complex, adaptive systems such as a society or a human being, causality is a thorny issue (4): feedback loops, manifold interactions between variables over time, path dependence and sensitivity to initial conditions make it challenging at best to state that “a causes b” (5). But what does it even mean to say that something causes something else?

Bollen (6) presents three conditions for causal inference: isolation, association and direction. Isolation means that no other variable can reasonably cause the outcome. This is usually impossible to achieve strictly, which is why researchers usually aim to control for covariates and thus reach a condition of pseudo-isolation. A common but seldom acknowledged problem here is overfitting: adding covariates to a model means also fitting the measurement error they carry with them. Association means there should be a connection between the cause and the effect – in real life, usually a probabilistic one. In the social sciences, a problem arises because everything is more or less correlated with everything else, and high-dimensional datasets suffer from the “curse of dimensionality”. Direction means that causation should flow from the cause to the effect, not the other way around. This is highly problematic in complex systems. For an example in health psychology, it seems obvious that depression symptoms (e.g. anxiety and insomnia) feed each other, resulting in self-reinforcing feedback loops (7).

When we consider the act of making efficient inferences, we want to be able to falsify our theories of the world (9) – something that is only recently really starting to be understood among psychologists (10). An easy-ish way to go about this is to define the smallest effect size of interest (SESOI) a priori, ensure one has proper statistical power, and attempt to reject the hypotheses that the effect is larger than the upper bound of the SESOI and smaller than the lower bound. This procedure, also known as equivalence testing (11), makes statistical hypotheses falsifiable in situations where a SESOI can be determined. But when testing program theories of complex interventions, there may be no such luxury.
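For concreteness, here is a minimal sketch of the two one-sided tests (TOST) procedure for a two-group comparison; the SESOI of d = 0.3, the group sizes and the simulated outcome scores are all made up for illustration.

```python
# Minimal two-one-sided-tests (TOST) sketch for a two-group design.
# Assumes a smallest effect size of interest (SESOI) of d = +/- 0.3;
# all data below are simulated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
intervention = rng.normal(0.1, 1, 150)   # hypothetical outcome scores
control = rng.normal(0.0, 1, 150)

sesoi_d = 0.3
sd_pooled = np.sqrt((intervention.var(ddof=1) + control.var(ddof=1)) / 2)
bound = sesoi_d * sd_pooled              # SESOI on the raw scale
diff = intervention.mean() - control.mean()
se = sd_pooled * np.sqrt(1 / len(intervention) + 1 / len(control))
df = len(intervention) + len(control) - 2

# One-sided tests against the lower and upper equivalence bounds
p_lower = stats.t.sf((diff - (-bound)) / se, df)   # H0: diff <= -bound
p_upper = stats.t.cdf((diff - bound) / se, df)     # H0: diff >= +bound
print(f"TOST p = {max(p_lower, p_upper):.3f}  (conclude equivalence if < .05)")
```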

Non-linear interactions with feedback loops make causality in a complex system an evasive concept. If we’re dealing with complexity, even minuscule effects can be meaningful when they interact with other effects: small effects can have huge influences down the line (the “butterfly effect” of nonlinear dynamics; 8). It is hence difficult to determine the SESOI for the intermediate links in the chain from intervention to outcome. And if we only say that we expect an effect to be “any positive number”, the postulated processes, as described in intervention program theories, become unfalsifiable: if a correlation of 0.001 between intervention participation and a continuous variable would corroborate a theory, one would need more than six million participants to detect it (at 80% power and an alpha of 5%; see also 12, p. 30). If researchers are unable to reject the null hypothesis of no effect, they cannot determine whether there is evidence for a null effect, or whether a larger sample was needed (e.g. 13).
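As a rough check of that figure, the required sample size can be approximated via the Fisher z transformation; the function below is a sketch, and whether you land at roughly six or eight million depends on whether the test is one- or two-sided.

```python
# Approximate n needed to detect a correlation r with a given power,
# using the Fisher z approximation: n ≈ ((z_alpha + z_beta) / atanh(r))^2 + 3
import numpy as np
from scipy import stats

def n_for_correlation(r, alpha=0.05, power=0.80, two_sided=True):
    z_alpha = stats.norm.ppf(1 - alpha / 2) if two_sided else stats.norm.ppf(1 - alpha)
    z_beta = stats.norm.ppf(power)
    return ((z_alpha + z_beta) / np.arctanh(r)) ** 2 + 3

print(f"r = 0.001, one-sided: n ≈ {n_for_correlation(0.001, two_sided=False):,.0f}")
print(f"r = 0.001, two-sided: n ≈ {n_for_correlation(0.001, two_sided=True):,.0f}")
# roughly 6.2 and 7.8 million participants, respectively
```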

Side note: One could use Bayes factors to compare whether a point null data generator (effect size being zero) predicts the data better than, for example, an alternative model where most effects are near zero but half of them over d = 0.2. But still, the smaller the effects you consider potentially important, the less the data can distinguish between the alternative and the null model. A better option could be to estimate how probable it is that the effect has a positive sign (as demonstrated here).
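A minimal sketch of that last idea, using a normal approximation on the Fisher z scale with a flat prior; the observed correlation and sample size are invented for illustration.

```python
# Posterior probability that a correlation is positive, using the Fisher z
# approximation (flat prior on z); observed r and n are hypothetical.
import numpy as np
from scipy import stats

r_obs, n = 0.05, 400                      # made-up study result
z_obs = np.arctanh(r_obs)                 # Fisher z of the observed r
se = 1 / np.sqrt(n - 3)                   # approximate standard error of z
p_positive = 1 - stats.norm.cdf(0, loc=z_obs, scale=se)
print(f"P(rho > 0 | data) ≈ {p_positive:.2f}")   # ≈ 0.84 for these numbers
```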

In sum, researchers face an uncomfortable trade-off: either they specify a SESOI (and thus a hypothesis) that does not reflect the theory under test, or they accept unfalsifiability.

A common way to study mechanisms is to conduct a mediation analysis, where one variable’s (X) impact on another (Y) is modelled to pass through a third variable (M). In its classical form, one expects the path from X to Y to shrink towards zero when M is added to the model.

The good news is that nowadays we can do power analyses for both simple and complex mediation models (14). The bad news is that when X is randomised but M is not, a causal reading of the observed M–Y relation entails strong assumptions which are usually ignored (15). Researchers should, for example, justify why no mediating variables other than those in the model exist; leaving variables out is effectively the same as assuming their effect to be zero. The investigator should also demonstrate why no omitted variables affect both M and Y – if such variables exist, the estimated causal effect may be distorted at best and misleading at worst.
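To see why omitted variables affecting both M and Y matter, here is a small simulation sketch (all coefficients invented): X is randomised and the true effect of M on Y is zero, yet a naive mediation analysis recovers a clearly nonzero “indirect effect”, because an unmeasured U drives both M and Y.

```python
# Simulation sketch: a confounder U of the mediator-outcome relation makes a
# naive mediation analysis find an "indirect effect" that is not there.
# X is randomised; M and Y share the unmeasured cause U. All numbers made up.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x = rng.binomial(1, 0.5, n)                              # randomised intervention
u = rng.normal(0, 1, n)                                  # unmeasured confounder
m = 0.4 * x + 0.8 * u + rng.normal(0, 1, n)              # mediator: affected by X and U
y = 0.3 * x + 0.0 * m + 0.8 * u + rng.normal(0, 1, n)    # true M -> Y effect is ZERO

# a-path: regress M on X; b-path: regress Y on X and M (naive model omits U)
a = np.polyfit(x, m, 1)[0]
X_mat = np.column_stack([np.ones(n), x, m])
b = np.linalg.lstsq(X_mat, y, rcond=None)[0][2]
print(f"naive indirect effect a*b ≈ {a * b:.2f} (true value: 0)")
```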

Now that we know it’s bad to omit variables, how do we avoid overfitting the model (i.e. being fooled by reading too much into what the data says)? It is very common for seemingly supported theories to fail to generalise to slightly different situations or other samples (16), and subgroup claims regularly fail to pan out in new data (17). Some solutions include ridge regression in the frequentist framework and regularising priors in the Bayesian one, but the simplest (though not the easiest) solution would be cross-validation. In cross-validation, you divide your sample into two (or even up to n) parts, use the first one to explore and the second one to “replicate” the analysis. Unfortunately, you need a sample large enough that you can afford to split it.
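A minimal split-half sketch of that idea, on simulated data with hypothetical predictors: fit whatever you like in one half, then check whether the chosen model still predicts anything in the held-out half.

```python
# Split-half "cross-validation" sketch: explore in one half of the data,
# evaluate the chosen model in the other half. Data simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))                    # five hypothetical predictors
y = 0.4 * X[:, 0] + rng.normal(size=n)         # only the first one matters

idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]

# "Exploration" half: fit a full linear model by least squares
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# "Replication" half: how well do those coefficients predict new data?
pred = X[test] @ beta
r_oos = np.corrcoef(pred, y[test])[0, 1]
print(f"out-of-sample correlation between predicted and observed: {r_oos:.2f}")
```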

What does all this tell us? Mainly that investigators would do well to heed Kenny’s (18) admonition: “mediation is not a thoughtless routine exercise that can be reduced down to a series of steps. Rather, it requires a detailed knowledge of the process under investigation and a careful and thoughtful analysis of data”. I would conjecture that researchers often lack such process knowledge. It may also be that, under complexity, the exact processes become both unknown and unknowable (19). Tools like structural equation modelling are wonderful, but I’m curious whether they are up to the task of advising us how to live in interconnected systems, where trends and cascades are bound to happen and everything causes everything else.

These are just relatively disorganised thoughts, and I’m curious to hear whether someone can offer some hope for the situation. Specifically, hearing of interventions that work consistently and robustly would definitely make my day.

ps. If you’re interested in replication matters in health psychology, there’s an upcoming symposium on the topic at EHPS17 featuring Martin Hagger, Gjalt-Jorn Peters, Rik Crutzen, Marie Johnston and me. My presentation is titled “Disentangling replicable mechanisms of complex interventions: What to expect and how to avoid fooling ourselves?”

pps. A recent piece in the Lancet (20) called for a complex systems model of evidence for public health. Here’s a small conversation with the lead author, regarding the UK Medical Research Council’s take on the subject. As you can see, the science seems to be in some sort of limbo/purgatory-type place currently, but smart people are working on it, so I have hope 🙂

[Screenshot of the Twitter conversation with Harry Rutter:]

https://twitter.com/harryrutter/status/876219437430517761

 

Bibliography

 

  1. Moore GF, Audrey S, Barker M, Bond L, Bonell C, Hardeman W, et al. Process evaluation of complex interventions: Medical Research Council guidance. BMJ. 2015 Mar 19;350:h1258.
  2. Abraham C, Johnson BT, de Bruin M, Luszczynska A. Enhancing reporting of behavior change intervention evaluations. JAIDS J Acquir Immune Defic Syndr. 2014;66:S293–S299.
  3. Shiell A, Hawe P, Gold L. Complex interventions or complex systems? Implications for health economic evaluation. BMJ. 2008 Jun 5;336(7656):1281–3.
  4. Sterman JD. Learning from Evidence in a Complex World. Am J Public Health. 2006 Mar 1;96(3):505–14.
  5. Resnicow K, Page SE. Embracing Chaos and Complexity: A Quantum Change for Public Health. Am J Public Health. 2008 Aug 1;98(8):1382–9.
  6. Bollen KA. Structural equations with latent variables. New York: John Wiley; 1989.
  7. Borsboom D. A network theory of mental disorders. World Psychiatry. 2017 Feb;16(1):5–13.
  8. Hilborn RC. Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. Am J Phys. 2004 Apr;72(4):425–7.
  9. LeBel EP, Berger D, Campbell L, Loving TJ. Falsifiability Is Not Optional. Accepted pending minor revisions at Journal of Personality and Social Psychology. [Internet]. 2017 [cited 2017 Apr 21]. Available from: https://osf.io/preprints/psyarxiv/dv94b/
  10. Morey RD, Lakens D. Why most of psychology is statistically unfalsifiable [Internet]. in prep. [cited 2016 Oct 23]. Available from: https://github.com/richarddmorey/psychology_resolution
  11. Lakens D. Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses [Internet]. 2016 [cited 2017 Feb 24]. Available from: https://osf.io/preprints/psyarxiv/97gpc/
  12. Dienes Z. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan; 2008. 185 p.
  13. Dienes Z. Using Bayes to get the most out of non-significant results. Quant Psychol Meas. 2014;5:781.
  14. Schoemann AM, Boulton AJ, Short SD. Determining Power and Sample Size for Simple and Complex Mediation Models. Soc Psychol Personal Sci. 2017 Jun 15;194855061771506.
  15. Bullock JG, Green DP, Ha SE. Yes, but what’s the mechanism? (don’t expect an easy answer). J Pers Soc Psychol. 2010;98(4):550–8.
  16. Yarkoni T, Westfall J. Choosing prediction over explanation in psychology: Lessons from machine learning. FigShare. 2016. Available from: https://dx.doi.org/10.6084/m9.figshare.2441878.v1
  17. Wallach JD, Sullivan PG, Trepanowski JF, Sainani KL, Steyerberg EW, Ioannidis JPA. Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials. JAMA Intern Med. 2017 Apr 1;177(4):554–60.
  18. Kenny DA. Reflections on mediation. Organ Res Methods. 2008;11(2):353–358.
  19. Bar-Yam Y. The limits of phenomenology: From behaviorism to drug testing and engineering design. Complexity. 2016 Sep 1;21(S1):181–9.
  20. Rutter H, Savona N, Glonti K, Bibby J, Cummins S, Finegood DT, et al. The need for a complex systems model of evidence for public health. The Lancet [Internet]. 2017 Jun 13 [cited 2017 Jun 17];0(0). Available from: http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)31267-9/abstract

 

The scientific foundation of intervention evaluation

In the post-replication-crisis world, people are increasingly arguing that even applied researchers should actually know what they’re doing when they do what they call science. In this post I expand upon some points I made in these slides about the philosophy of science behind hypothesis testing in interventions.

How does knowledge grow when we do intervention research? Evaluating whether an intervention worked can be phrased in relatively straightforward terms: “there was a predicted change in the pre-specified outcome“. This is, of course, a simplification. But contrast it with the attempt to phrase what you mean when you want to claim how the intervention worked, or why it did not. To do this, you need to spell out the program theory* of the intervention, which explicates the logic and causal assumptions behind intervention development.

* Also referred to as programme logic, intervention logic, theory-based (or driven) evaluation, theory of change, theory of action, impact pathway analysis, or programme theory-driven evaluation science… (Rogers, 2008). These terms are equivalent for the purposes of this piece.

The way I see it (for a more systematic approach, see intervention mapping), we have background theories (Theory of Planned Behaviour, Self-Determination Theory, etc.) and knowledge from earlier studies, which we synthesise into a program theory. This knowledge informs us about how we believe an intervention in our context would achieve its goals, regarding the factors (“determinants”) that determine the target behaviour. From (or during the creation of) this mesh of substantive theory and accompanying assumptions, we deduce a boxes-and-arrows diagram, which describes the causal mechanisms at play. These assumed causal mechanisms then help us derive a substantive hypothesis (e.g. “intervention increases physical activity”), which informs a statistical hypothesis (e.g. “accelerometer-measured metabolic equivalent units will be statistically significantly higher in the intervention group than the control group”). The statistical hypothesis then dictates what sort of observations we should be expecting. I call this the causal stream; each one of the entities follows from what came before it.

[Diagram: the causal and inferential streams of the program theory]

The inferential stream runs in the other direction. Hopefully, the observations are informative enough that we can make judgements about the statistical hypothesis. The statistical hypothesis’ fate then informs the substantive hypothesis, and whether our theory upstream gets corroborated (supported). Right?

Not so fast. What we derived the substantive and statistical hypotheses from was not only the program theory (T) we wanted to test. We also had all the other theories the program theory was drawn from (i.e. auxiliary theories, At), as well as the assumption that the accelerometers measure physical activity as they are supposed to, and other assumptions about instruments (Ai). Not only this, we assume that the intervention was delivered as planned and that all other presumed experimental conditions (Cn) hold, and that there are no other systematic, unmeasured contextual effects that mess with the results (“all other things being equal”; a ceteris paribus condition, Cp).

[Figure: the program theory link and its auxiliary assumptions]

We now come to a logical implication (“observational conditional”) for testing theories (Meehl, 1990b, p. 119, 1990a, p. 109). Oi is the observation of an intervention having taken place, and Op is an observation of increased physical activity:

(T and At and Ai and Cn and Cp) → (Oi → Op)

[Technically, the first arrow should be logical entailment, but that’s not too important here.] The first bracket can be thought of as “all our assumptions hold”, the second bracket as “if we observe the intervention, then we should observe increased physical activity”. The whole thing thus roughly means “if our assumptions (T, A, C) hold, we should observe a thing (i.e. Oi → Op)”.

Now here comes falsifiability: if we observe an intervention but no increase in physical activity, the second bracket comes out false, and by modus tollens the conjunction in the first bracket must be false too. By elementary logic, we must conclude that one or more of the elements in the first bracket is false – the big problem is that we don’t know which element(s) it was! And what if the experiment pans out? It’s not just our theory that has been corroborated, but the whole bundle of assumptions. This is known as the Duhem-Quine problem, and it has brought misery to countless induction-loving people for decades.
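The asymmetry can be made concrete by brute force: given the observation “Oi but not Op”, the conditional only rules out the case where every assumption is true, leaving plenty of candidate culprits. A small sketch:

```python
# Duhem-Quine sketch: enumerate truth values of the assumption bundle
# (T, At, Ai, Cn, Cp) that are consistent with observing the intervention (Oi)
# but NOT the predicted increase in physical activity (Op is false).
from itertools import product

consistent = []
for T, At, Ai, Cn, Cp in product([True, False], repeat=5):
    bundle = T and At and Ai and Cn and Cp
    # (bundle -> (Oi -> Op)) with Oi true and Op false forces bundle to be false
    if not bundle:
        consistent.append((T, At, Ai, Cn, Cp))

print(f"{len(consistent)} of 32 assumption combinations survive the failed prediction")
print("e.g. the theory T may be true while an auxiliary fails:",
      (True, False, True, True, True) in consistent)
```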

EDIT: As Tal Yarkoni pointed out, this corroboration can be negligible unless one is making a risky prediction. See the damn strange coincidence condition below.

[Figure: the program theory link when the prediction fails]

EDIT: There was a great comment by Peter Holtz. Knowledge grows when we identify the weakest links in the mix of theoretical and auxiliary assumptions, and see if we can falsify them. And things do get awkward if we abandon falsification.

If wearing an accelerometer increases physical activity in itself (say people who receive an intervention are more conscious about their activity monitoring, and thus exhibit more pronounced measurement effects when told to wear an accelerometer), you obviously don’t conclude that the increase is due to the program theory’s effectiveness. Nor would you be very impressed by setups where you would likely get the same result whether the program theory was right or wrong. In other words, you want a situation where, if the program theory were false, you would doubt a priori that many of those who increased their physical activity had undergone the intervention. This is called the theoretical risk; the prior probability p(Op|Oi)—i.e. the probability of observing an increase in physical activity, given that the person underwent the intervention—should be low absent the theory (Meehl, 1990a, p. 199, mistyped in Meehl, 1990b, p. 110), and the lower the probability, the more impressive the prediction. In other words, spontaneous improvement absent the program theory should be a damn strange coincidence.

Note that solutions for handling the Duhem-Quine mess have been proposed both in the frequentist (e.g. error statistical piecewise testing, Mayo, 1996), and Bayesian (Howson & Urbach, 2006) frameworks.

What is a theory, anyway?

A lot of the above discussion hangs upon what we mean by a “theory” – and consequently, whether we should apply the process of theory testing to intervention program theories. [Some previous discussion here.] One could argue that saying “if I push this button, my PC will start” is not a scientific theory, and that interventions use theory but logic models do not capture it. It has been said that if the theoretical assumptions underpinning an intervention don’t hold, the intervention will fail, but that this doesn’t make an intervention evaluation a test of the theory. This view has been defended by arguing that the behaviour change theories underlying an intervention may work, but that e.g. the intervention targets the wrong cognitive processes.

To me it seems that these are all part of the intervention program theory, which we’re looking to make inferences about. If you’re testing statistical hypotheses, you should have substantive hypotheses that inform the statistical ones, and those come from a theory – it doesn’t matter whether it’s a general theory-of-everything or one that applies only in a very specific context, such as the situation of your target population.

Now, here’s a question for you:

If the process described above doesn’t look familiar and you do hypothesis testing, how do you reckon your approach produces knowledge?

Note: I’m not saying it doesn’t (though that’s an option), I’m just curious about alternative approaches. I know that e.g. Mayo’s error-statistical perspective is superior to what’s presented here, but I have yet to find an exposition of it I could thoroughly understand.

Please share your thoughts and let me know where you think this goes wrong!

With thanks to Rik Crutzen for comments on a draft of this post.

ps. If you’re interested in replication matters in health psychology, there’s an upcoming symposium on the topic in EHPS17 featuring Martin Hagger, Gjalt-Jorn Peters, Rik Crutzen, Marie Johnston and me. My presentation is titled “Disentangling replicable mechanisms of complex interventions: What to expect and how to avoid fooling ourselves?“

pps. Paul Meehl’s wonderful seminar Philosophical Psychology can be found in video and audio formats here.

Bibliography:

Abraham, C., Johnson, B. T., de Bruin, M., & Luszczynska, A. (2014). Enhancing reporting of behavior change intervention evaluations. JAIDS Journal of Acquired Immune Deficiency Syndromes, 66, S293–S299.

Dienes, Z. (2008). Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan.

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Quantitative Psychology and Measurement, 5, 781. https://doi.org/10.3389/fpsyg.2014.00781

Hilborn, R. C. (2004). Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. American Journal of Physics, 72(4), 425–427. https://doi.org/10.1119/1.1636492

Howson, C., & Urbach, P. (2006). Scientific reasoning: the Bayesian approach. Open Court Publishing.

Lakatos, I. (1971). History of science and its rational reconstructions. Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-94-010-3142-4_7

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Meehl, P. E. (1990a). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Meehl, P. E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244. https://doi.org/10.2466/pr0.1990.66.1.195

Moore, G. F., Audrey, S., Barker, M., Bond, L., Bonell, C., Hardeman, W., … Baird, J. (2015). Process evaluation of complex interventions: Medical Research Council guidance. BMJ, 350, h1258. https://doi.org/10.1136/bmj.h1258

Rogers, P. J. (2008). Using Programme Theory to Evaluate Complicated and Complex Aspects of Interventions. Evaluation, 14(1), 29–48. https://doi.org/10.1177/1356389007084674

Shiell, A., Hawe, P., & Gold, L. (2008). Complex interventions or complex systems? Implications for health economic evaluation. BMJ, 336(7656), 1281–1283. https://doi.org/10.1136/bmj.39569.510521.AD

 

Missing data, the inferential assassin

Last week, I attended the Methods festival 2017 in Jyväskylä. Slides and program for the first day are here, and for the second day, here (some are in Finnish, some in English).

One interesting presentation was on missing data by Juha Karvanen [twitter profile] (slides for the talk). It involved toilet paper and Hans Rosling, so I figured I’d post my recording of the display. Thing is, missing data lurks in the shadows, and if you don’t do your utmost to get full information, it may be lethal.

[Image from Juha Karvanen’s presentation]

  1. Intro and missing completely at random (MCAR): Video. The probability of missingness is the same for all cases. Rare in real life?
  2. Missing at random (MAR): Video. The probability of missingness depends on something we know. For example, men leave more questions unanswered than women, but within men and within women the missingness is MCAR.
  3. Missing not at random (MNAR): Video. The probability of missingness depends on unobserved values. Your analysis becomes misleading and you may not know it; misinformation reigns and angels cry. (A small simulation sketch of the three mechanisms follows below.)
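Here is a minimal simulation sketch of the three mechanisms; the exercise scores, the “motivation” covariate and the missingness probabilities are all invented for illustration.

```python
# Sketch: how the three missingness mechanisms distort a simple mean estimate.
# We "measure" daily exercise minutes and delete observations in three ways.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
exercise = rng.gamma(shape=2, scale=15, size=n)        # true mean = 30 minutes
motivation = exercise / 30 + rng.normal(0, 0.5, n)     # observed covariate

mcar = rng.random(n) < 0.3                             # missing completely at random
mar = rng.random(n) < 1 / (1 + np.exp(motivation - 1)) # depends on observed motivation
mnar = rng.random(n) < 1 / (1 + np.exp(exercise / 15 - 1))  # depends on the value itself

print(f"true mean:            {exercise.mean():.1f}")
print(f"complete cases, MCAR: {exercise[~mcar].mean():.1f}  (fine)")
print(f"complete cases, MAR:  {exercise[~mar].mean():.1f}  (biased, but fixable using motivation)")
print(f"complete cases, MNAR: {exercise[~mnar].mean():.1f}  (biased, and the data alone can't tell you)")
```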

There was an exciting question on a slide. I’ll post the answer in this thread later.

[Slide: the random sampling vs. web data question]

By the way, one of Richard McElreath’s Statistical Rethinking lectures has a nice description of how to do Bayesian imputation when one assumes MCAR. He also discusses how irrational complete case analysis (throwing away the cases that don’t have full data) is, when you really think about it. Also, never substitute a missing value with the mean of the other values!

p.s. I would love it if someone dropped a comment saying “this problem is actually not too dire, because…”

Replication is impossible, falsification unnecessary and truth lies in published articles (?)

Is psychology headed towards being a science conducted by “zealots”, or to a post-car (or train) crash metaphysics, where anything goes because nothing is even supposed to replicate?

I recently peer reviewed a partly shocking piece called “Reproducibility in Psychological Science: When Do Psychological Phenomena Exist?“ (Iso-Ahola, 2017). In the article, the author makes some very good points, which unfortunately get drowned under very strange statements and positions. Eiko Fried, Etienne LeBel and I addressed those briefly in a commentary (preprint; UPDATE: published piece). Below, I’d like to expand upon some additional thoughts I had about the piece, to answer Martin Hagger’s question.

On complexity

When all parts do the same thing on a certain scale (planets on Newtonian orbits), their behaviour is relatively easy to predict for many purposes. The same goes when all molecules act independently in a random fashion: the risk that most or all beer molecules in a pint move upward at the same time is ridiculously low, and thus we don’t have to worry about the yellow (or black, if you’re into that) gold escaping the glass. Both are easy-ish systems to describe, as opposed to complex systems, where interactions, sensitivity to initial conditions and the like can produce a huge variety of behaviours and states. Complexity science is the study of these phenomena, which science has increasingly had to grapple with since the 1900s (Weaver, 1948).

Iso-Ahola (2017) quotes (though somewhat unfaithfully) the complexity scientist Bar-Yam (2016b): “for complex systems (humans), all empirical inferences are false… by their assumptions of replicability of conditions, independence of different causal factors, and transfer to different conditions of prior observations”. He takes this to mean that “phenomena’s existence should not be defined by any index of reproducibility of findings” and that “falsifiability and replication are of secondary importance to advancement of scientific fields”. But this is a highly misleading representation of the complexity science perspective.

In Bar-Yam’s article, he used an information theoretic approach to analyse the limits of what we can say about complex systems. The position is that while full description of systems via empirical observation is impossible, we should aim to identify the factors which are meaningful in terms of replicability of findings, or the utility of the acquired knowledge. As he elaborates elsewhere: “There is no utility to information that is only true in a particular instance. Thus, all of scientific inquiry should be understood as an inquiry into universality—the determination of the degree to which information is general or specific” (Bar-Yam, 2016a, p. 19).

This is fully in line with the Fisher quote presented in Mayo’s slides:

[Image: Fisher quote from Mayo’s slides]

The same goes for replications; no single one-lab study can disprove a finding:

“’Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low-level empirical hypothesis which describes such an effect is proposed and  corroborated’ (Popper, 1959, p. 66)” (see Holtz & Monnerjahn, 2017)

So, if the high-quality non-replication replicates, one must consider that something may be off with the original finding. This leads us to the question of what researchers should study in the first place.

On research programmes

Lakatos (1971) posits a difference between progressive and degenerating research lines. In a progressive research line, investigators explain a negative result by modifying the theory in a way which leads to new predictions that subsequently pan out. On the other hand, coming up with explanations that do not make further contributions, but rather just explain away the negative finding, leads to a degenerative research line. Iso-Ahola quotes Lakatos to argue that, although theories may have a “poor public record” that should not be denied, falsification should not lead to abandonment of theories. Here’s Lakatos:

“One may rationally stick to a degenerating [research] programme until it is overtaken by a rival and even after. What one must not do is to deny its poor public record. […] It is perfectly rational to play a risky game: what is irrational is to deceive oneself about the risk” (Lakatos, 1971, p. 104)

As Meehl (1990, p. 115) points out, the quote continues as follows:

“This does not mean as much licence as might appear for those who stick to a degenerating programme. For they can do this mostly only in private. Editors of scientific journals should refuse to publish their papers which will, in general, contain either solemn reassertions of their position or absorption of counterevidence (or even of rival programmes) by ad hoc, linguistic adjustments. Research foundations, too, should refuse money.” (Lakatos, 1971, p. 105)

Perhaps researchers should pay more attention to which programme they are following?

As an ending note, here’s one more interesting quote: “Zealotry of reproducibility has unfortunately reached the point where some researchers take a radical position that the original results mean nothing if not replicated in the new data.” (Iso-Ahola, 2017)

For explorative research, I largely agree with these zealots. Exploration is all well and good, but the results do mean next to nothing unless replicated in new data (de Groot, 2014). One cannot hypothesise and confirm with the same data.

Perhaps I focus too much on the things that were said in the paper, not what the author actually meant, and we do apologise if we have failed to abide with the principle of charity in the commentary or this blog post. I do believe the paper will best serve as a pedagogical example to aspiring researchers, on how strangely arguments could be constructed in the olden times.

ps. Bar-Yam later commented on this blog post, confirming the mis(present/interpr)etation of his research by the author of the reproducibility paper:

[Screenshot of Bar-Yam’s comment]

pps. Here’s Fred Hasselman‘s comment on the article, from the Frontiers website (when you scroll down all the way to the bottom, there’s a comment option):

1. Whether or not a posited entity (e.g. a theoretical object of measurement) exists or not, is a matter of ontology.

2. Whether or not one can, in principle, generate scientific knowledge about a posited entity, is a matter of epistemology.

3. Whether or not the existence claim of a posited entity (or law) is scientifically plausible depends on the ability of a theory or nomological network to produce testable predictions (predictive power) and the accuracy of those predictions relative to measurement outcomes (empirical accuracy).

4. The comparison of the truth status of psychological theoretical constructs to the Higgs Boson is a false equivalence: One is formally defined and deduced from a highly corroborated model and predicts the measurement context in which its existence can be verified or falsified (the LHC), the other is a common language description of a behavioural phenomenon “predicted” by a theory constructed from other phenomena published in the scientific record of which the reliability is… unknown.

5. It is the posited entity itself -by means of its definition in a formalism or theory that predicts its existence- that decides how it can be evidenced empirically. If it cannot be evidenced using population statistics, don’t use it! If the analytic tools to evidence it do not exist, develop them! Quantum physics had to develop a new theory of probability, new mathematics to be able to make sense of measurement outcomes of different experiments. Study non-ergodic physics, complexity science, emergence and self-organization in physics, decide if it is sufficient, if not, develop a new formalism. That is how science advances and scientific knowledge is generated. Not by claiming all is futile.

To summarise: The article continuously confuses ontological and epistemic claims, it does not provide a future direction even though many exist or are being proposed by scholars studying phenomena of the mind, moreover the article makes no distinction between sufficiency and necessity in existence claims, and this is always problematic.

Contrary to the claim here, a theory (and the ontology and epistemology that spawned it) can enjoy high perceived scientific credibility even if some things cannot be known in principle, or if there’s always uncertainty in measurements. It can do so by being explicit about what it is that can and cannot be known about posited entities.

E.g. Quantum physics is a holistic physical theory, also in the epistemic sense: It is in principle not possible to know anything about a quantum system at the level of the whole, based on knowledge about its constituent parts. Even so, quantum physical theories have the highest predictive power and empirical accuracy of all scientific theories ever produced by human minds!

As evidenced by the history of succession of theories in physics, successful scientific theorising about the complex structure of reality seems to be a highly reproducible phenomenon of the mind. Let’s apply it to the mind itself!

Bibliography:

Bar-Yam, Y. (2016a). From big data to important information. Complexity, 21(S2), 73–98.

Bar-Yam, Y. (2016b). The limits of phenomenology: From behaviorism to drug testing and engineering design. Complexity, 21(S1), 181–189. https://doi.org/10.1002/cplx.21730

de Groot, A. D. (2014). The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001

Holtz, P., & Monnerjahn, P. (2017). Falsificationism is not just ‘potential’ falsifiability, but requires ‘actual’ falsification: Social psychology, critical rationalism, and progress in science. Journal for the Theory of Social Behaviour. https://doi.org/10.1111/jtsb.12134

Iso-Ahola, S. E. (2017). Reproducibility in Psychological Science: When Do Psychological Phenomena Exist? Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.00879

Lakatos, I. (1971). History of science and its rational reconstructions. Springer. Retrieved from http://link.springer.com/chapter/10.1007/978-94-010-3142-4_7

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Weaver, W. (1948). Science and complexity. American Scientist, 36(4), 536–544.

 

Evaluating intervention program theories – as theories

How do we figure out whether our ideas worked? To me, it seems that in psychology we seldom think rigorously about this question, despite having been criticised for dubious inferential practices for at least half a century. You can download a pdf of my talk at the Finnish National Institute for Health and Welfare (THL) here, or see the slide show at the end of this post. Please solve the three problems in the summary slide! 🙂

TLDR: is there a reason why evaluating intervention program theories shouldn’t follow the process of scientific inference?

[Summary slide]

Preprints, short and sweet

Photo courtesy of Nelli Hankonen

These are slides (with added text content to make more sense) from a small presentation I held at the University of Helsinki. Mainly of interest to academic researchers.

TL;DR: To get the most out of scientific publishing, we may need to imitate physics a bit and bypass the old gatekeepers. If the slideshare below is of crappy quality, check out the slides here.

UPDATE: There’s a new (September 2019) paper out on peer review effectiveness. Doesn’t look superfab:

[Image: Tim van der Zee’s take on the paper]

ps. if you prefer video, this explains things in four minutes 🙂

Deterministic doesn’t mean predictable

In this post, I argue against the intuitively appealing notion that, in a deterministic world, we just need more information and can use it to solve problems in complex systems. This presents a problem in e.g. psychology, where more knowledge does not necessarily mean cumulative knowledge or even improved outcomes.

Recently, I attended a talk where Misha Pavel happened to mention how big data can lead us astray, and how we can’t just look at data but need to know mechanisms of behaviour, too.

Misha Pavel arguing for the need to learn how mechanisms work.

Later, a couple of my psychologist friends happened to present arguments discounting this, saying that the problem will be solved thanks to determinism. Their idea was that the world is a deterministic place—if we knew everything, we could predict everything (an argument also known as Laplace’s Demon)—and that we eventually a) will know, and b) can predict. I’m fine with the first part, or at least agnostic about it. But there are more mundane problems for prediction than “quantum randomness” and other considerations about whether truly random phenomena exist. The thing is that even simple and completely deterministic systems can be utterly unpredictable to us mortals. I will give an example of this below.

Even simple and completely deterministic systems can be utterly unpredictable.

Let’s think of a very simple made-up model of physical activity, just to illustrate a phenomenon:

Say today’s amount of exercise depends only on motivation and on the exercise of the previous day. Let’s say people have a certain maximum amount of time to exercise each day, and that they vary from day to day in what proportion of that time they actually manage to exercise. To keep things simple, let’s say that if a person manages to do more exercise on Monday, they give themselves a break on Tuesday. People also differ in motivation, so let’s add that as a factor, too.

Our completely deterministic, but definitely wrong, model could be written as:

Exercise percentage today = (motivation) * (percentage of max exercise yesterday) * (1 – percentage of max exercise yesterday)

For example, if one had a constant motivation of 3.9 units (whatever the scale), and managed to do 80% of their maximum exercise on Monday, they would use 3.9 times 80% times 20% = 62% of their maximum exercise time on Tuesday. Likewise, on Wednesday they would use 3.9 times 62% times 38% = 92% of the maximum possible exercise time. And so on and so on.

We’re pretending this model is the reality. This is so that we can perfectly calculate the amount of exercise on any day, given that we know a person’s motivation and how much they managed to exercise the previous day.

Imagine we measure a person who obeys this model with a constant motivation of 3.9, and who starts out on day 1 reaching 50% of their maximum exercise amount. But let’s say there is a slight measurement error: instead of 50.000%, we measure 50.001%. In the graph below we can observe how the prediction based on the mismeasured value (red line) quickly diverges from the actual trajectory (blue line). The predictions we make from our model after around day 40 do not describe our target person’s behaviour at all. The slight deviation has made the deterministic system practically chaotic and random to us.

[Animated plot]
This simple, fully deterministic system becomes impossible to predict within weeks due to a measurement error of 0.001 percentage points. The blue line depicts the actual values, the red line the predictions based on the measured value. They diverge around day 35 and are soon completely off. [Link to gif]
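Here is a minimal sketch of the same made-up model (motivation 3.9, true starting value 50%, measured 50.001%), showing the divergence without any plotting:

```python
# Sketch of the example above: the same deterministic "exercise model"
# (a logistic map) run from 0.50 and from 0.50001 diverges within weeks.
motivation = 3.9
actual, measured = 0.50, 0.50001   # true start vs. start with a tiny measurement error

for day in range(1, 61):
    actual = motivation * actual * (1 - actual)
    measured = motivation * measured * (1 - measured)
    if day % 10 == 0:
        print(f"day {day:2d}: actual {actual:.3f}  predicted {measured:.3f}  "
              f"error {abs(actual - measured):.3f}")
```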

What are the consequences?

The model is silly, of course, as we would probably never try to predict an individual’s exact behaviour on any single day (averages and/or bigger groups help, because usually no single instance can kill the prediction). But this example does highlight a common feature of complex systems, known as sensitive dependence on initial conditions: even small uncertainties accumulate to create huge errors. It is also worth noting that increasing model complexity doesn’t necessarily help us with prediction, due to problems such as overfitting (thinking the future will be like the past; see also why simple heuristics can beat optimisation).

Thus, predicting long-term path-dependent behaviour, even if we knew the exact psycho-socio-biological mechanism governing it, may be impossible in the absence of perfect measurement. Even if the world was completely deterministic, we still could not predict it, as even trivially small things left unaccounted for could throw us off completely.

Predicting long-term path-dependent behaviour, even if we knew the exact psycho-socio-biological mechanism governing it, may be impossible in the absence of perfect measurement.

The same thing happens when trying to predict as simple a thing as how billiard balls impact each other on the pool table. The first collision is easy to calculate, but to compute the ninth you already have to take into account the gravitational pull of people standing around the table. By the 56th impact, every elementary particle in the universe has to be included in your assumptions! Other examples include trying to predict the sex of a human fetus, or trying to predict the weather 2 weeks out (this is the famous idea about the butterfly flapping its wings).

Coming back to Misha Pavel’s points regarding big data, I feel somewhat skeptical about being able to acquire invariant “domain knowledge” in many psychological domains. Also, as shown here, knowing the exact mechanism is still no promise of being able to predict what happens in a system. Perhaps we should be satisfied when we can make predictions such as “intervention x will increase the probability that the system reaches a state where more than 60% of the goal is reached on more than 50% of the days, by more than 20% in more than 60% of the people who belong in a group it was designed to affect”?

But still: for determinism to solve our prediction problems, the amount and accuracy of data needed is beyond the wildest sci-fi fantasies.

I’m happy to be wrong about this, so please share your thoughts! Leave a comment below, or on these relevant threads: Twitter, Facebook.

References and resources:

  • Code for the plot can be found here.
  • The billiard ball example explained in context.
  • A short paper on the history about the butterfly (or seagull) flapping its wings-thing.
  • To learn about dynamic systems and chaos, I highly recommend David Feldman’s course on the topic, next time it comes around at Complexity Explorer.
  • … Meanwhile, the equation I used here is actually known as the “logistic map”. See this post about how it behaves.

 

Post scriptum:

Recently, I was happy and surprised to see a paper attempting to create a computational model of a major psychological theory. In a conversation, Nick Brown expressed doubt:

[Screenshot of Nick Brown’s comment]

Do you agree? What are the alternatives? Do we have to content ourselves with vague statements like “the behaviour will fluctuate” (perhaps as in: fluctuat nec mergitur)? How should we study the dynamics of human behaviour?

 

Also: do see Nick Brown’s blog, if you don’t mind non-conformist thinking.

 

The art of expecting p-values

In this post, I try to present the intuition behind the fact that, when studying real effects, one usually should not expect p-values near the 0.05 threshold. If you don’t read quantitative research, you may want to skip this one. If you think I’m wrong about something, please leave a comment and set the record straight!

Recently, I attended a presentation by a visiting senior scholar. He spoke about how their group had discovered a surprising but welcome correlation between two measures, and subsequently managed to replicate the result. What struck me was his choice of words:

“We found this association, which was barely significant. So we replicated it with the same sample size of ~250, and found that the correlation was almost the same as before and, as expected, of similar statistical significance (p < 0.05)“.

This highlights a threefold, often implicit (but WRONG), mental model:

[EDIT: due to Markus’ comments, I realised the original, off-the-top-of-my-head examples were numerically impossible and changed them a bit. Also, added stuff in brackets that the post hopefully clarifies as you read on.]

  1. “Replications with a sample size similar to the original, should produce p-values similar to the original.”
    • Example: in subsequent studies with n = 100 each, a correlation (p = 0.04) should replicate as the same correlation (p ≈ 0.04) [this happens about 0.02% of the time when population r is 0.3; in these cases you actually observe an r≈0.19]
  2. “P-values are linearly related with sample size, i.e. bigger sample gives you proportionately more small p-values.”
    • Example: a correlation (n = 100, p = 0.04), should replicate as a correlation of about the same, when n = 400, with e.g. a p ≈ 0.02. [in the above-mentioned case, the replication gives observed r±0.05 about 2% of the time, but the p-value is smaller than 0.0001 for the replication]
  3. “We study real effects.” [we should think a lot more about how our observations could have come by in the absence of a real effect!]

It is obvious that the third point is contentious, and I won’t consider it here much. But the first two points are less clear, although the confusion is understandable if one has learned and always applied Jurassic (pre-Bem) statistics.

[Note: “statistical power” or simply “power” is the probability of finding an effect, if it really exists. The more obvious an effect is, and the bigger your sample size, the better are your chances of detecting these real effects – i.e. you have bigger power. You want to be pretty sure your study detects what it’s designed to detect, so you may want to have a power of 90%, for example.]

Figure 1. A lottery machine. Source: Wikipedia

To get a handle on how the p-value behaves, we must understand the nature of p-values as random variables [1]. They are much like the balls in a lottery machine, with values between zero and one marked on them. The lottery machine of real effects has disproportionately more low (e.g. < 0.01) values on the balls, while the lottery machine of null effects contains a “fair” distribution of numbers on balls (where each number is as likely as any other). If this doesn’t make sense yet, read on.

Let us exemplify this with a simulation. Figure 2 shows the expected distribution of p-values when we do 10 000 studies with one t-test each, reporting the p of each test. You can think of this as 9999 replications with the same sample size as the original.

Figure 2: p-value distribution for 10 000 simulated studies, under 50% power when the alternative hypothesis is true. (When power increases, the curve gets pushed even farther to the left, leaving next to no p-values over 0.01)

Now, if we did just six studies with the parameters laid out above, we could see a set of p-values like {0.002, 0.009, 0.024, 0.057, 0.329, 0.479}, half of them being “significant” (in bold). If we had 80% power to detect the difference we are looking for, about 80% of the p-values would be “significant”. As an additional note, with 50% power, 4% of the 10 000 studies give a p between 0.04 and 0.05. With 80% power, this number goes down to 3%. For 97.5% power, only 0.5% of studies (yes, five for every thousand studies) are expected to give such a “barely significant” p-value.
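Here is a small simulation sketch along those lines; the group size of 64 and the effect sizes of d = 0.35 and d = 0.5 are my own choices, picked to give roughly 50% and 80% power, and the exact percentages will wobble a bit from run to run.

```python
# Sketch: simulate many two-group t-tests and look at where the p-values land,
# under ~50% power, ~80% power, and under the null (no true effect).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_studies, n_per_group = 10_000, 64      # n = 64/group gives ~80% power for d = 0.5

def p_values(true_d):
    a = rng.normal(true_d, 1, (n_studies, n_per_group))
    b = rng.normal(0, 1, (n_studies, n_per_group))
    return stats.ttest_ind(a, b, axis=1).pvalue

for label, d in [("null effect", 0.0), ("~50% power (d = 0.35)", 0.35), ("~80% power (d = 0.5)", 0.5)]:
    p = p_values(d)
    print(f"{label:24s} p < .05: {np.mean(p < .05):.1%}   .04 < p < .05: {np.mean((p > .04) & (p < .05)):.1%}")
```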

The senior scholar mentioned in the beginning was studying correlations. They work the same way. The animation below shows how p-values are distributed for different sample sizes, if we do 10 000 studies at every sample size (i.e. every frame is 10 000 studies with that sample size). The samples come from a population where the real correlation is 0.3. The red dotted line is p = 0.05.

Figure 3. P-value distributions for different sample sizes, when studying a real correlation of 0.3. Each frame is 10 000 replications with a given sample size. If pic doesn’t show, click here for the gif (and/or try another browser).

The next animation zooms in on “significant” p-values in the same way as in Figure 2 (though the largest bar quickly goes through the roof here). As you can see, it is almost impossible to get a p-value close to 0.05 with large power. Thus, there is no way we should “expect” a p-value over 0.01 when we replicate a real effect with large power. Very low p-values are always more probable than “barely significant” ones.

Figure 4. Zooming in on the “significant” p-values. It is more probable to get a very low p than a barely significant one, even with small samples. If pic doesn’t show, click here for the gif.

But what if there is no effect? In this case, every p-value is equally likely (see Figure 5). This means that, in the long run, getting a p = 0.01 is just as likely as getting a p = 0.97, and by implication, 5% of all p-values fall under 0.05. Therefore, the proportion of studies that generate a p between 0.04 and 0.05 is 1%. Remember how this percentage was 0.5% (five in a thousand) when the alternative hypothesis was true under 97.5% power? Indeed, when power is high, these “barely significant” p-values may actually speak for the null, not the alternative hypothesis! The same goes for e.g. p = 0.024 when power is 99% [see here].

Figure 5. p-value distribution when the null hypothesis is true. Every p is just as likely as any other.

Consider the lottery machine analogy again. Does it make better sense now?

The lottery machine of real effects has disproportionately more low (e.g. < 0.01) values on the balls, while the lottery machine of null effects contains a “fair” distribution of numbers on balls (each number is as likely as any other).

Let’s look at one more visualisation of the same thing:

Figure 6. The percentages of “statistically significant” p-values evolving as sample size increases. If the gif doesn’t show, you’ll find it here.

Aside: when the effect one studies is enormous, sample size naturally matters less. I calculated Cohen’s d for the Asch [2] line segment study, and a whopping d = 1.59 emerged. This is surely a very unusual effect size in psychological experiments, and it leads to high statistical power even with low sample sizes. In such a case, by the logic presented above, one should be extremely cautious of p-values closer to 0.05 than to zero.
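To give a feel for the numbers, a two-line sketch (assuming statsmodels is available; the d = 1.59 is the estimate mentioned above):

```python
# How few participants per group would suffice for 90% power at d = 1.59?
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=1.59, alpha=0.05, power=0.90)
print(f"~{n_per_group:.0f} participants per group")   # roughly 9-10
```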

Understanding all this is vital in interpreting past research. We never know what the data-generating system has been (i.e. are the p-values drawn from a distribution under the null, or under the alternative), but the data give us hints about which is more likely. Let us take an example from a social psychology classic, Moscovici’s “Towards a theory of conversion behaviour” [3]. The article reviews results, which are then used to support a nuanced theory of minority influence. Low p-values are taken as evidence for an effect.

Based on what we learned earlier about the distribution of p-values under the null vs. the alternative, we can now see under which hypothesis the reported p-values are more likely to occur. The tool to use here is called the p-curve [4], and it is presented in Figure 7.

Figure 7. A quick-and-dirty p-curve of Moscovici (1980). See this link for the data you can paste onto p-checker or p-curve.

You can directly see how a big portion of the p-values sits in the 0.05 region, whereas you would expect them to cluster near 0.01. The p-curve analysis (from the p-curve website) shows that evidential value, if there is any, is inadequate (Z = -2.04, p = .0208). Power is estimated to be 5%, consistent with the null hypothesis being true.

The null being true may or may not have been the case here. But looking at the curve might have helped the researchers who spent some forty years trying, unsuccessfully, to replicate the pattern of Moscovici’s afterimage study results [5].

In a recent talk, I joked about a bunch of researchers who tour around holiday resorts every summer, making people fill in IQ tests. Each summer they keep the results which show p < 0.05 and scrap the others, eventually ending up in the headlines with a nice meta-analysis of the results.

Don’t be those guys.


Disclaimer: the results discussed here may not generalise to some more complex models, where the p-value is not uniformly distributed under the null. I don’t know much about those cases, so please feel free to educate me!

Code for the animated plots is here. It was inspired by code from Daniel Lakens, whose blog post inspired this piece. Check out his MOOC here. Additional thanks to Jim Grange for advice on gif making and Alexander Etz for constructive comments.

Bibliography:

  1. Murdoch, D. J., Tsai, Y.-L. & Adcock, J. P-Values are Random Variables. The American Statistician 62, 242–245 (2008).
  2. Asch, S. E. Studies of independence and conformity: I. A minority of one against a unanimous majority. Psychological monographs: General and applied 70, 1 (1956).
  3. Moscovici, S. in Advances in Experimental Social Psychology 13, 209–239 (Elsevier, 1980).
  4. Simonsohn, U., Simmons, J. P. & Nelson, L. D. Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a Reply to Ulrich and Miller (2015). J Exp Psychol Gen 144, 1146–1152 (2015).
  5. Smith, J. R. & Haslam, S. A. Social psychology: Revisiting the classic studies. (SAGE Publications, 2012).