Introduction to data management best practices


With the realisation that even linked data may not be enough for scientists (1), and as the European Union has decided to embrace open access and best practices in data management (2–4), many psychologists find themselves treading unfamiliar terrain. Given that an estimated ~85% of health research is wasted, this is nothing short of a pressing issue in related fields.

Here, I comment on the FAIR Guiding Principles for scientific data management and stewardship (5) for the benefit of myself and perhaps others, who have not been involved with data management best practices.

[Note: all this does NOT mean that you are forced to share sensitive data. But if your work cannot be checked or reused (even after anonymisation), calling it scientific might be a stretch.]

What goes in a data management plan?

A necessary document to accompany any research plan is the data management plan. This plan should first of all specify the purpose of the data collection and how it relates to the objectives of one’s research project. It should state which types of data are collected – for example, in the context of an intervention to promote physical activity, one might collect survey data as well as accelerometer and body composition measures. The steps taken to assure the quality of the data can be described, too.

Next, the file formats for the data should be specified, along with which parts of the data will be made openly available, if not all of it can be. When and where will the data be made available, and what software is needed to read it? Will there be restrictions to access? Will there be an embargo, and if so, why?

The data management plan should also state whether existing data is being re-used. The researcher should clarify the origin of the data, whether existing or new, comment on its size (if known), and outline for whom the data will be useful (4).

Bad practices leading to unusable data are still common, so adopting proper data management practices can incur costs. The data management plan should explicate these costs, how they are covered, and who is responsible for the data management process.

The importance of collecting original data in psychology cannot be overstated. Data are a conditio sine qua non for any empirical science. Anyone who generates data and shares them publicly should be adequately recognized. (6)

Note: metadata means any information about the data. For example, descriptive metadata aids discovery and identification, and includes elements such as keywords, title, abstract and author. Administrative metadata informs the management of the data: creation dates, file types, version numbers.
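To make the distinction concrete, here is a sketch of what descriptive and administrative metadata for a hypothetical dataset could look like as JSON (all field names and values are illustrative, not drawn from a formal standard such as DDI):

```python
import json

# Hypothetical metadata record; field names are made up for illustration.
metadata = {
    "descriptive": {  # aids discovery and identification
        "title": "Physical activity intervention survey",
        "keywords": ["physical activity", "intervention", "accelerometer"],
        "abstract": "Survey and accelerometer data from a hypothetical trial.",
        "author": "Jane Doe",
    },
    "administrative": {  # informs the management of the data
        "created": "2017-03-29",
        "file_type": "text/csv",
        "version": "1.2.0",
    },
}

print(json.dumps(metadata, indent=2))
```

A data archive would typically generate and validate such a record for you against its own metadata schema.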

The FAIR principles for data management

The FAIR principles have been composed to help both machines and humans (such as meta-analysts) find and use existing data. The principles consist of four requirements: Findability, Accessibility, Interoperability and Reusability. Note that adherence to these principles is not a yes-or-no question, but a gradient along which data stewards should strive for increased uptake.

Below, the exact formulation of the (sub-)principles is in italics, my comments in bullet points.


F1. data are assigned a globally unique and eternally persistent identifier.

  • This is mostly handled in psychological research by making sure the research document is supplied with a DOI (Digital Object Identifier (7)). In addition to journals (for published research), most repositories where one can deposit materials (such as FigShare or Zenodo) or preprints (such as PsyArXiv) assign the work a DOI automatically.

F2. data are described with rich metadata.

  • This relates to R1 below. There should be data about the data, telling you what the data is. Also: what is your approach to making versioning clear? In the Open Science Framework (OSF), you can upload new versions of your document, and it automatically saves the previous version behind the new one, given that the new file has the same name as the old one.
  • Your data archiver helps you with metadata. E.g. the Finnish Social Science Data Archive (FSD) uses the DDI 2.1 metadata standard.

F3. data are registered or indexed in a searchable resource.

  • The researcher should deposit the data in a searchable repository. Your own website, or the website of your research group, is unfortunately not enough.

F4. metadata specify the data identifier.

  • Make sure your data actually shows its DOI somewhere, and include a link to the dataset in the metadata. As far as I know, repositories such as the OSF do this for you.

Non-transparent, inaccessible data. [Photo by Maarten van den Heuvel on Unsplash.]

A1. data are retrievable by their identifier using a standardized communications protocol.

A1.1 the protocol is open, free, and universally implementable.

A1.2 the protocol allows for an authentication and authorization procedure, where necessary.

A2. metadata are accessible, even when the data are no longer available.

  • From what I understand, these are not too relevant to individual researchers. Basically, if your work can be accessed via “http://”, you are complying with this. You should also be mindful of storing your data in one repository only, and avoid having multiple DOIs. Regarding A2: if your data is sensitive and you cannot share it openly, the description of the data should still be accessible to researchers. I am not certain how repositories deal with accessibility after the data has been taken offline.
  • Behind these items (and the FAIR principles in general) is the idea that machines could read the data and mine it for e.g. meta-analyses. I am blissfully unaware of the intricacies of that endeavour, so I comment only from the perspective of a common researcher here.

I1. data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

  • It is better to prefer simple formats (e.g. spreadsheets saved as comma-separated values, “file.csv”) that can be opened without special software, over proprietary ones (e.g. SPSS, “file.sav”).
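As an illustration of why such formats travel well, a CSV file can be written and read back with nothing but a language's standard library (a Python sketch; the column names and values are made up):

```python
import csv
import io

# Write a small dataset as CSV text (an in-memory buffer stands in for "file.csv").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["participant_id", "steps", "bmi"])  # header row
writer.writerow([1, 8500, 23.4])
writer.writerow([2, 4200, 27.1])

# Reading it back requires no special software, only a CSV parser.
buffer.seek(0)
rows = list(csv.reader(buffer))
```

A `.sav` file, by contrast, needs SPSS or a third-party reader, which may not exist decades from now.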

I2. data use vocabularies that follow FAIR principles.

  • This principle may seem somewhat vague and hard to grasp for anyone other than computer scientists. It relates to the index terms or glossaries used. In psychology, one possibility would be the APA thesaurus used by PsycINFO.

I3. data include qualified references to other (meta)data.

  • This should be a given, and the citation culture of psychology seems well-equipped to follow. But it is still important to cite the original source of questionnaires, accelerometer algorithms etc.

Accessible, transparent and FAIR data. [Photo by Pahala Basuki on Unsplash.]

R1. data have a plurality of accurate and relevant attributes.

  • This means that the research should be accompanied by e.g. tags or a description providing enough information for information seekers to determine the value of reuse.

R1.1. data are released with a clear and accessible data usage license.

  • You should state which licence the work is under. It is commonly recommended to use “CC0”, which allows all reuse, even without attribution. The second-best alternative, “CC-BY” (which requires attribution), can lead to interpretation problems of attribution stacking, when licences pile on each other (see chapter 10.4 in reference 8). Citing others’ work is already an accepted practice in psychology, so CC0 seems a reasonable option, though I sympathise with the (almost invariably unfounded) fear of being scooped.

R1.2. data are associated with their provenance.

  • This means that the source of the data is clear, so that the data can be cited.

R1.3. data meet domain-relevant community standards.

  • In psychology, there are not many well-known community standards, but e.g. the DFG guidelines (6) are showing the way.


The FAIR principles can be hard to comply with exhaustively, as they are sometimes difficult to interpret (even by people who work in data archives) and take a lot of effort to implement. Hence, everyone should consider whether their data is FAIR enough. As with open data in general, one should be able to describe why best practices could not be followed, when that is the case. But—for the sake of ethics if nothing else—we should aim to do the best we can.

Additional information on the FAIR principles can be found here, and some difficulties in assessing adherence to them in (9). A 20-minute webinar in Finnish is available here.



  1. Bechhofer S, Buchan I, De Roure D, Missier P, Ainsworth J, Bhagat J, et al. Why linked data is not enough for scientists. Future Gener Comput Syst. 2013;29(2):599–611.
  2. Khomami N. All scientific papers to be free by 2020 under EU proposals. The Guardian [Internet]. 2016 May 28 [cited 2017 Mar 29]; Available from:
  3. European Commission. Open access – H2020 Online Manual [Internet]. [cited 2017 Mar 29]. Available from:
  4. European Commission. Guidelines on data management in Horizon 2020 [Internet]. 2016 [cited 2017 Mar 29]. Available from:
  5. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15;3:160018.
  6. Schönbrodt F, Gollwitzer M, Abele-Brehm A. Data Management in Psychological Science: Specification of the DFG Guidelines [Internet]. 2017 [cited 2017 Mar 29]. Available from:
  7. International DOI Foundation. Digital Object Identifier System FAQs [Internet]. [cited 2017 Mar 29]. Available from:
  8. Briney K. Data Management for Researchers: Organize, maintain and share your data for research success [Internet]. Pelagic Publishing Ltd; 2015 [cited 2017 Mar 29]. Preview available from:
  9. Dunning A. FAIR Principles – Connecting the Dots for the IDCC 2017 [Internet]. Open Working. 2017 [cited 2017 Mar 29]. Available from:


Is it possible to unveil intervention mechanisms under complexity?

In this post, I wonder what complex systems, as well as the nuts and bolts of mediation analysis, imply for studying processes of health psychological interventions.

Say we make a risky prediction and find an intervention effect that replicates well (never mind for now that replicability is practically never tested in health psychology). We could then go on to investigate boundary conditions and intricacies of the effect. What’s sometimes done is a study of “mechanisms of action”, also endorsed by the MRC guidelines for process evaluation (1), as well as the Workgroup for Intervention Development and Evaluation Research (WIDER) (2). In such a study, we investigate whether the intervention worked as we thought it should have (in other words, we test the program theory; see previous post). It would be spectacularly useful to decision makers if we could disentangle the mechanisms of the intervention: “by increasing autonomy support, autonomous motivation goes up and physical activity ensues”. But attempting to evaluate this opens a spectacular can of worms.

Complex interventions include multiple interacting components, targeting several facets of a behaviour on different levels of the environment the individual operates in (1). This environment itself can be described as a complex system (3). In complex, adaptive systems such as a society or a human being, causality is a thorny issue (4): feedback loops, manifold interactions between variables over time, path-dependence and sensitivity to initial conditions make it challenging at best to state that “a causes b” (5). But what does it even mean to say something causes something else?

Bollen (6) presents three conditions for causal inference: isolation, association and direction. Isolation means that no other variable can reasonably cause the outcome. This is usually impossible to achieve strictly, which is why researchers usually aim to control for covariates and thus reach a condition of pseudo-isolation. A common but rarely acknowledged problem is overfitting: adding covariates to a model means also fitting the measurement error they carry with them. Association means there should be a connection between the cause and the effect – in real life, usually a probabilistic one. In the social sciences, a problem arises because everything is more or less correlated with everything else, and high-dimensional datasets suffer from the “curse of dimensionality”. Direction, self-evidently, means that the effect should flow from cause to effect, not the other way around. This is highly problematic in complex systems. For an example in health psychology, it seems obvious that depression symptoms (e.g. anxiety and insomnia) feed each other, resulting in self-reinforcing feedback loops (7).

When we consider the act of making efficient inferences, we want to be able to falsify our theories of the world (9) – something that has only recently started to sink in among psychologists (10). An easy-ish way to go about this is to define the smallest effect size of interest (SESOI) a priori, ensure one has proper statistical power, and attempt to reject the hypotheses that effects are larger than the upper bound of the SESOI and smaller than the lower bound. This procedure, also known as equivalence testing (11), allows for the falsification of statistical hypotheses in situations where a SESOI can be determined. But when testing program theories of complex interventions, there may be no such luxury.
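To make the idea concrete, here is a minimal sketch of the TOST (two one-sided tests) logic on a mean difference, using a normal approximation and simulated data (illustrative only; for real analyses, use the procedures described in reference 11):

```python
import math
import random
from statistics import NormalDist, mean, stdev

def tost_equivalence(diffs, sesoi, alpha=0.05):
    """Two one-sided tests: can we reject effects outside (-sesoi, +sesoi)?
    Normal approximation; returns True if equivalence can be claimed."""
    n = len(diffs)
    m, se = mean(diffs), stdev(diffs) / math.sqrt(n)
    p_lower = 1 - NormalDist().cdf((m + sesoi) / se)  # H0: effect <= -sesoi
    p_upper = NormalDist().cdf((m - sesoi) / se)      # H0: effect >= +sesoi
    return max(p_lower, p_upper) < alpha

# 200 near-zero simulated differences, tested against a SESOI of 0.5:
random.seed(0)
diffs = [random.gauss(0.05, 1.0) for _ in range(200)]
equivalent = tost_equivalence(diffs, sesoi=0.5)
```

With a wide SESOI the effect can be declared negligible; shrink the SESOI towards zero and the same data can no longer support equivalence, which is exactly the problem raised below.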

The notion of non-linear interactions with feedback loops makes causality in a complex system an evasive concept. Under complexity, even minuscule effects can be meaningful when they interact with other effects: small causes can have huge influences down the line (“the butterfly effect” in nonlinear dynamics; 8). It is hence difficult to determine the SESOI for intermediate links in the chain from intervention to outcome. And if we only say we expect an effect to be “any positive number”, the postulated processes, as described in intervention program theories, become unfalsifiable: if a correlation of 0.001 between intervention participation and a continuous variable would corroborate a theory, one would need more than six million participants to detect it (at 80% power and an alpha of 5%; see also 12, p. 30). If researchers are unable to reject the null hypothesis of no effect, they cannot determine whether there is evidence for a null effect, or whether a larger sample was needed (e.g. 13).
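The sample-size claim is easy to check with a back-of-the-envelope normal approximation via Fisher's z transform (a sketch, not a substitute for a proper power analysis):

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80, one_sided=True):
    # Normal-approximation sample size for detecting a correlation r,
    # via Fisher's z transform: n ~ ((z_alpha + z_beta) / atanh(r))^2 + 3.
    z_alpha = NormalDist().inv_cdf(1 - (alpha if one_sided else alpha / 2))
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2) + 3

n_needed = n_for_correlation(0.001)  # well over six million participants
```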

Side note: One could use Bayes factors to compare whether a point null data generator (effect size being zero) would predict the data better than, for example, an alternative model where most effects are near zero but half of them over d = 0.2. But still, the smaller the effects you consider potentially important, the less the data can distinguish between alternative and null models. A better option could be to estimate how probable it is that the effect has a positive sign (as demonstrated here).

In sum, researchers are faced with an uncomfortable trade-off: either they specify a SESOI (and thus a hypothesis) which does not reflect the theory under test, or they accept unfalsifiability.

A common way to study mechanisms is to conduct a mediation analysis, where one variable’s (X) impact on another (Y) is modelled to pass through a third variable (M). In its classical form, the direct path from X to Y is expected to shrink towards zero when M is added to the model.

The good news is that nowadays we can do power analyses for both simple and complex mediation models (14). The bad news is that when X is randomised but M is not, the observed M–Y relation rests on strong assumptions which are usually ignored (15). Researchers should e.g. justify why there exist no other mediating variables than the ones in the model; leaving variables out is effectively the same as assuming their effect to be zero. Also, the investigator should demonstrate why no omitted variables affect both M and Y – if there are such variables, the estimated causal effect may be distorted at best and misleading at worst.

Now that we know it’s bad to omit variables, how do we avoid overfitting the model (i.e. being fooled by reading too much into what the data says)? It is very common for seemingly supported theories to fail to generalise to slightly different situations or other samples (16), and subgroup claims regularly fail to pan out in new data (17). Some solutions include ridge regression in the frequentist framework and regularising priors in the Bayesian one, but the simplest (though not the easiest) solution would be cross-validation. In cross-validation, you basically divide your sample into two (or even up to n) parts, use the first to explore, and the second to “replicate” the analysis. Unfortunately, you need a large enough sample to be able to break it into parts.

What does all this tell us? Mainly, that investigators would do well to heed Kenny’s (18) admonition: “mediation is not a thoughtless routine exercise that can be reduced down to a series of steps. Rather, it requires a detailed knowledge of the process under investigation and a careful and thoughtful analysis of data”. I would conjecture that researchers often lack such process knowledge. It may also be, that under complexity, the exact processes become both unknown and unknowable (19). Tools like structural equation modelling are wonderful, but I’m curious if they are up to the task of advising us about how to live in interconnected systems, where trends and cascades are bound to happen, and everything causes everything else.

These are just relatively disorganised thoughts, and I’m curious to hear if someone can shed hope to the situation. Specifically, hearing of interventions that work consistently and robustly, would definitely make my day.

ps. If you’re interested in replication matters in health psychology, there’s an upcoming symposium on the topic in EHPS17 featuring Martin Hagger, Gjalt-Jorn Peters, Rik Crutzen, Marie Johnston and me. My presentation is titled “Disentangling replicable mechanisms of complex interventions: What to expect and how to avoid fooling ourselves?”

pps. A recent piece in Lancet (20) called for a complex systems model of evidence for public health. Here’s a small conversation with the main author, regarding the UK Medical Research Council’s take on the subject. As you see, the science seems to be in some sort of a limbo/purgatory-type of place currently, but smart people are working on it so I have hope 🙂

[Image: Twitter conversation screenshot – complexity rutter twitter.PNG]




  1. Moore GF, Audrey S, Barker M, Bond L, Bonell C, Hardeman W, et al. Process evaluation of complex interventions: Medical Research Council guidance. BMJ. 2015 Mar 19;350:h1258.
  2. Abraham C, Johnson BT, de Bruin M, Luszczynska A. Enhancing reporting of behavior change intervention evaluations. JAIDS J Acquir Immune Defic Syndr. 2014;66:S293–S299.
  3. Shiell A, Hawe P, Gold L. Complex interventions or complex systems? Implications for health economic evaluation. BMJ. 2008 Jun 5;336(7656):1281–3.
  4. Sterman JD. Learning from Evidence in a Complex World. Am J Public Health. 2006 Mar 1;96(3):505–14.
  5. Resnicow K, Page SE. Embracing Chaos and Complexity: A Quantum Change for Public Health. Am J Public Health. 2008 Aug 1;98(8):1382–9.
  6. Bollen KA. Structural equations with latent variables. New York: John Wiley. 1989;
  7. Borsboom D. A network theory of mental disorders. World Psychiatry. 2017 Feb;16(1):5–13.
  8. Hilborn RC. Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. Am J Phys. 2004 Apr;72(4):425–7.
  9. LeBel EP, Berger D, Campbell L, Loving TJ. Falsifiability Is Not Optional. Accepted pending minor revisions at Journal of Personality and Social Psychology. [Internet]. 2017 [cited 2017 Apr 21]. Available from:
  10. Morey R D, Lakens D. Why most of psychology is statistically unfalsifiable. GitHub [Internet]. in prep. [cited 2016 Oct 23]; Available from:
  11. Lakens D. Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses [Internet]. 2016 [cited 2017 Feb 24]. Available from:
  12. Dienes Z. Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan; 2008. 185 p.
  13. Dienes Z. Using Bayes to get the most out of non-significant results. Quant Psychol Meas. 2014;5:781.
  14. Schoemann AM, Boulton AJ, Short SD. Determining Power and Sample Size for Simple and Complex Mediation Models. Soc Psychol Personal Sci. 2017 Jun 15;194855061771506.
  15. Bullock JG, Green DP, Ha SE. Yes, but what’s the mechanism? (don’t expect an easy answer). J Pers Soc Psychol. 2010;98(4):550–8.
  16. Yarkoni T, Westfall J. Choosing prediction over explanation in psychology: Lessons from machine learning. FigShare [Internet]. 2016. doi:10.6084/m9.figshare.2441878.v1
  17. Wallach JD, Sullivan PG, Trepanowski JF, Sainani KL, Steyerberg EW, Ioannidis JPA. Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials. JAMA Intern Med. 2017 Apr 1;177(4):554–60.
  18. Kenny DA. Reflections on mediation. Organ Res Methods. 2008;11(2):353–358.
  19. Bar-Yam Y. The limits of phenomenology: From behaviorism to drug testing and engineering design. Complexity. 2016 Sep 1;21(S1):181–9.
  20. Rutter H, Savona N, Glonti K, Bibby J, Cummins S, Finegood DT, et al. The need for a complex systems model of evidence for public health. The Lancet [Internet]. 2017 Jun 13 [cited 2017 Jun 17];0(0). Available from:


The scientific foundation of intervention evaluation

In the post-replication-crisis world, people are increasingly arguing that even applied researchers should actually know what they’re doing when they do what they call science. In this post I expand upon some points I made in these slides about the philosophy of science behind hypothesis testing in interventions.

How does knowledge grow when we do intervention research? Evaluating whether an intervention worked can be phrased in relatively straightforward terms: “there was a predicted change in the pre-specified outcome”. This is, of course, a simplification. But try and contrast it with the attempt to phrase what you mean when you want to claim how the intervention worked, or why it did not. To do this, you need to spell out the program theory* of the intervention, which explicates the logic and causal assumptions behind intervention development.

* Also referred to as programme logic, intervention logic, theory-based (or driven) evaluation, theory of change, theory of action, impact pathway analysis, or programme theory-driven evaluation science… (Rogers, 2008). These terms are equivalent for the purposes of this piece.

The way I see it (for a more systematic approach, see intervention mapping), we have background theories (Theory of Planned Behaviour, Self-Determination Theory, etc.) and knowledge from earlier studies, which we synthesise into a program theory. This knowledge informs us about how we believe an intervention in our context would achieve its goals, regarding the factors (“determinants”) that determine the target behaviour. From (or during the creation of) this mesh of substantive theory and accompanying assumptions, we deduce a boxes-and-arrows diagram, which describes the causal mechanisms at play. These assumed causal mechanisms then help us derive a substantive hypothesis (e.g. “intervention increases physical activity”), which informs a statistical hypothesis (e.g. “accelerometer-measured metabolic equivalent units will be statistically significantly higher in the intervention group than the control group”). The statistical hypothesis then dictates what sort of observations we should be expecting. I call this the causal stream; each one of the entities follows from what came before it.


The inferential stream runs in the other direction. Hopefully, the observations are informative enough that we can make judgements regarding the statistical hypothesis. The statistical hypothesis’ fate then informs the substantive hypothesis, and whether our theory upstream gets corroborated (supported). Right?

Not so fast. What we derived the substantive and statistical hypotheses from, was not only the program theory (T) we wanted to test. We also had all the other theories the program theory was drawn from (i.e. auxiliary theories, At), as well as an assumption that the accelerometers measure physical activity as they are supposed to, and other assumptions about instruments (Ai). Not only this, we assume that the intervention was delivered as planned and all other presumed experimental conditions (Cn) hold, and that there are no other systematic, unmeasured contextual effects that mess with the results (“all other things being equal”; a ceteris paribus condition, Cp).

[Figure: Program_link tells.png]

We now come to a logical implication (“observational conditional”) for testing theories (Meehl, 1990b, p. 119, 1990a, p. 109). Oi is the observation of an intervention having taken place, and Op is an observation of increased physical activity:

(T and At and Ai and Cn and Cp) → (Oi → Op)

[Technically, the first arrow should be logical entailment, but that’s not too important here.] The first bracket can be thought of as “all our assumptions hold”, the second bracket as “if we observe the intervention, then we should observe increased physical activity”. The whole thing thus roughly means “if our assumptions (T, A, C) hold, we should observe a thing (i.e. Oi → Op)”.

Now here comes falsifiability: if we observe an intervention but no increase in physical activity, the logical truth value of the second bracket comes out false, which also destroys the conjunction in the first bracket. By elementary logic, we must conclude that one or more of the elements in the first bracket is false – the big problem is that we don’t know which element(s) was or were false! And what if the experiment pans out? It’s not just our theory that’s been corroborated, but the bundle of assumptions as a whole. This is known as the Duhem-Quine problem, and it has brought misery to countless induction-loving people for decades.
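The modus tollens step can be made concrete by brute force: enumerate truth values for the five assumptions and check which assignments are consistent with the observational conditional, given that the second bracket came out false (a toy illustration of the logic only):

```python
from itertools import product

def implies(p, q):
    # Material implication: p -> q is false only when p is true and q is false.
    return (not p) or q

# Suppose we observe the intervention (Oi) but no increased activity (Op):
Oi, Op = True, False
observed = implies(Oi, Op)  # the second bracket comes out false

# Which truth-value assignments of the five assumptions (T, At, Ai, Cn, Cp)
# are consistent with (T and At and Ai and Cn and Cp) -> (Oi -> Op)?
consistent = [v for v in product([True, False], repeat=5)
              if implies(all(v), observed)]
# Every consistent assignment has at least one false assumption (modus tollens),
# but nothing tells us which one failed -- the Duhem-Quine problem.
```

Of the 32 possible assignments, only the all-true one is ruled out; the remaining 31 ways the assumptions could have failed are all logically on the table.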

EDIT: As Tal Yarkoni pointed out, this corroboration can be negligible unless one is making a risky prediction. See the damn strange coincidence condition below.

[Figure: Program_link fails.png]

EDIT: There was a great comment by Peter Holtz. Knowledge grows when we identify the weakest links in the mix of theoretical and auxiliary assumptions, and see if we can falsify them. And things do get awkward if we abandon falsification.

If wearing an accelerometer increases physical activity in itself (say people who receive an intervention are more conscious about their activity monitoring, and thus exhibit more pronounced measurement effects when told to wear an accelerometer), you obviously don’t conclude the increase is due to the program theory’s effectiveness. Also, you would not be very impressed by setups where you’d likely get the same result, whether the program theory was right or wrong. In other words, you want a situation where, if the program theory were false, you would doubt a priori that among those who increased their physical activity, many would have undergone the intervention. This is called the theoretical risk; the prior probability p(Op|Oi)—i.e. the probability of observing increased physical activity, given that the person underwent the intervention—should be low absent the theory (Meehl, 1990b, p. 199, mistyped in Meehl, 1990a, p. 110), and the lower the probability, the more impressive the prediction. In other words, spontaneous improvement absent the program theory should be a damn strange coincidence.

Note that solutions for handling the Duhem-Quine mess have been proposed both in the frequentist (e.g. error statistical piecewise testing, Mayo, 1996), and Bayesian (Howson & Urbach, 2006) frameworks.

What is a theory, anyway?

A lot of the above discussion hangs upon what we mean by a “theory” – and consequently, should we apply the process of theory testing to intervention program theories. [Some previous discussion here.] One could argue that saying “if I push this button, my PC will start” is not a scientific theory, and that interventions use theory but logic models do not capture them. It has been said that if the theoretical assumptions underpinning an intervention don’t hold, the intervention will fail, but that doesn’t make an intervention evaluation a test of the theory. This view has been defended by arguing that behaviour change theories underlying an intervention may work, but e.g. the intervention targets the wrong cognitive processes.

To me it seems like these are all part of the intervention program theory, which we’re looking to make inferences about. If you’re testing statistical hypotheses, you should have substantive hypotheses that inform the statistical ones, and those come from a theory – it doesn’t matter if it’s a general theory-of-everything or one that applies only in a very specific context, such as the situation of your target population.

Now, here’s a question for you:

If the process described above doesn’t look familiar and you do hypothesis testing, how do you reckon your approach produces knowledge?

Note: I’m not saying it doesn’t (though that’s an option), just curious about alternative approaches. I know that e.g. Mayo’s error statistical perspective is superior to what’s presented here, but I’ve yet to find an exposition of it I could thoroughly understand.

Please share your thoughts and let me know where you think this goes wrong!

With thanks to Rik Crutzen for comments on a draft of this post.

ps. If you’re interested in replication matters in health psychology, there’s an upcoming symposium on the topic in EHPS17 featuring Martin Hagger, Gjalt-Jorn Peters, Rik Crutzen, Marie Johnston and me. My presentation is titled “Disentangling replicable mechanisms of complex interventions: What to expect and how to avoid fooling ourselves?“

pps. Paul Meehl’s wonderful seminar Philosophical Psychology can be found in video and audio formats here.


Abraham, C., Johnson, B. T., de Bruin, M., & Luszczynska, A. (2014). Enhancing reporting of behavior change intervention evaluations. JAIDS Journal of Acquired Immune Deficiency Syndromes, 66, S293–S299.

Dienes, Z. (2008). Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference. Palgrave Macmillan.

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Quantitative Psychology and Measurement, 5, 781.

Hilborn, R. C. (2004). Sea gulls, butterflies, and grasshoppers: A brief history of the butterfly effect in nonlinear dynamics. American Journal of Physics, 72(4), 425–427.

Howson, C., & Urbach, P. (2006). Scientific reasoning: the Bayesian approach. Open Court Publishing.

Lakatos, I. (1971). History of science and its rational reconstructions. Springer. Retrieved from

Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.

Meehl, P. E. (1990a). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Meehl, P. E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195–244.

Moore, G. F., Audrey, S., Barker, M., Bond, L., Bonell, C., Hardeman, W., … Baird, J. (2015). Process evaluation of complex interventions: Medical Research Council guidance. BMJ, 350, h1258.

Rogers, P. J. (2008). Using Programme Theory to Evaluate Complicated and Complex Aspects of Interventions. Evaluation, 14(1), 29–48.

Shiell, A., Hawe, P., & Gold, L. (2008). Complex interventions or complex systems? Implications for health economic evaluation. BMJ, 336(7656), 1281–1283.


Missing data, the inferential assassin

Last week, I attended the Methods festival 2017 in Jyväskylä. Slides and program for the first day are here, and for the second day, here (some are in Finnish, some in English).

One interesting presentation was on missing data by Juha Karvanen [twitter profile] (slides for the talk). It involved toilet paper and Hans Rosling, so I figured I’d post my recording of the demonstration. Thing is, missing data lurks in the shadows, and if you don’t do your utmost to get full information, it may be lethal.


  1. Intro and missing completely at random (MCAR): Video. Probability of missingness for all cases is the same. Rare in real life?
  2. Missing at random (MAR): Video. Probability of missingness depends on something we know. For example, if men leave more questions unanswered than women, but among men and women, the missingness is MCAR.
  3. Missing not at random (MNAR): Video. Probability of missingness depends on unobserved values. Your analysis becomes misleading and you may not know it; misinformation reigns and angels cry.
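The three mechanisms can be made concrete with a small simulation. This is a sketch with made-up numbers (a “score” whose mean differs by gender, and three missingness rules of my own invention, not from the talk): under MCAR the complete-case mean is fine; under MAR it is biased but fixable, since missingness depends only on the observed gender; under MNAR nothing in the observed data warns us about the bias.

```python
import random

random.seed(1)

# Hypothetical toy population: a "score" whose mean differs by gender.
n = 100_000
data = []
for _ in range(n):
    male = random.random() < 0.5
    score = random.gauss(5.0 if male else 6.0, 1.0)
    data.append((male, score))

true_mean = sum(s for _, s in data) / n

def complete_case_mean(keep):
    """Mean of the cases left observed by the missingness rule `keep`."""
    kept = [s for m, s in data if keep(m, s)]
    return sum(kept) / len(kept)

# MCAR: every case has the same 50% chance of being observed.
mcar = complete_case_mean(lambda m, s: random.random() < 0.5)

# MAR: men respond less often (30%) than women (80%); the resulting bias
# is fixable, because missingness depends only on the observed gender.
mar = complete_case_mean(lambda m, s: random.random() < (0.3 if m else 0.8))

# MNAR: low scorers drop out; missingness depends on the unobserved value
# itself, so the observed data alone cannot reveal the bias.
mnar = complete_case_mean(lambda m, s: random.random() < (0.9 if s > true_mean else 0.3))

print(f"true {true_mean:.2f} | MCAR {mcar:.2f} | MAR {mar:.2f} | MNAR {mnar:.2f}")
```

The MAR case is the instructive one: the naive complete-case mean is off, but because we know gender, a within-gender analysis (or weighting) recovers the truth; under MNAR no such repair is available from the data at hand.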

There was an exciting question on a slide. I’ll post the answer in this thread later.

By the way, one of Richard McElreath’s Statistical Rethinking lectures has a nice description of how to do Bayesian imputation when one assumes MCAR. He also discusses how irrational complete-case analysis (throwing away the cases that don’t have full data) is, when you really think about it. Also, never substitute a missing value with the mean of other values!
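Why mean substitution is so bad deserves a quick demonstration (a minimal sketch with simulated data, assuming MCAR dropout): plugging in the observed mean manufactures fake certainty, collapsing the spread of the variable, which in turn makes standard errors and correlations overconfident.

```python
import random
import statistics

random.seed(2)

full = [random.gauss(0, 1) for _ in range(10_000)]

# Drop ~40% of values completely at random, then "repair" the gaps by
# substituting the mean of the observed values.
observed = [x for x in full if random.random() < 0.6]
fill = statistics.mean(observed)
imputed = observed + [fill] * (len(full) - len(observed))

# The imputed constants carry no variability, so the spread shrinks.
print(f"true sd:                 {statistics.stdev(full):.2f}")
print(f"sd after mean imputation {statistics.stdev(imputed):.2f}")
```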

p.s. I would love it if someone dropped a comment saying “this problem is actually not too dire, because…”

Replication is impossible, falsification unnecessary and truth lies in published articles (?)

Writing this piece crammed in the backseat of a car, because I’m a zealot (also, because I wanted to have a picture here).

I recently peer reviewed a partly shocking piece called “Reproducibility in Psychological Science: When Do Psychological Phenomena Exist?” (Iso-Ahola, 2017). In the article, the author makes some very good points, which unfortunately get drowned under very strange statements and positions. Eiko Fried, Etienne LeBel and I addressed those briefly in a commentary (preprint; UPDATE: published piece). Below, I’d like to expand upon some additional thoughts I had about the piece, to answer Martin Hagger’s question.

On complexity

When all parts do the same thing on a certain scale (planets on Newtonian orbits), their behaviour is relatively easy to predict for many purposes. Same thing, when all molecules act independently in a random fashion: the risk that most or all beer molecules in a pint move upward at the same time is ridiculously low, and thus we don’t have to worry about the yellow (or black, if you’re into that) gold escaping the glass. Both situations are easy-ish systems to describe, as opposed to complex systems where the interactions, sensitivity to initial conditions etc. can produce a huge variety of behaviour and states. Complexity science is the study of these phenomena, which have become increasingly common since the 1900s (Weaver, 1948).
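The sensitivity to initial conditions mentioned above is easy to see in the logistic map, a textbook chaotic system (my illustration, not one used by Iso-Ahola or Bar-Yam): two trajectories that start one part in a billion apart soon disagree completely, so long-run point prediction is hopeless even though the rule is fully deterministic.

```python
# Logistic map x -> r*x*(1-x); at r = 4 it is chaotic on the interval [0, 1].
r = 4.0
x, y = 0.2, 0.2 + 1e-9  # initial conditions differ by one part in a billion
diffs = []
for _ in range(60):
    x = r * x * (1 - x)
    y = r * y * (1 - y)
    diffs.append(abs(x - y))

# The tiny initial gap grows roughly exponentially until it saturates
# at the size of the whole interval.
print(f"gap after 1 step:            {diffs[0]:.1e}")
print(f"largest gap within 60 steps: {max(diffs):.2f}")
```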

Iso-Ahola (2017) quotes (though somewhat unfaithfully) the complexity scientist Bar-Yam (2016b): “for complex systems (humans), all empirical inferences are false… by their assumptions of replicability of conditions, independence of different causal factors, and transfer to different conditions of prior observations”. He takes this to mean that “phenomena’s existence should not be defined by any index of reproducibility of findings” and that “falsifiability and replication are of secondary importance to advancement of scientific fields”. But this is a highly misleading representation of the complexity science perspective.

In Bar-Yam’s article, he used an information theoretic approach to analyse the limits of what we can say about complex systems. The position is that while full description of systems via empirical observation is impossible, we should aim to identify the factors which are meaningful in terms of replicability of findings, or the utility of the acquired knowledge. As he elaborates elsewhere: “There is no utility to information that is only true in a particular instance. Thus, all of scientific inquiry should be understood as an inquiry into universality—the determination of the degree to which information is general or specific” (Bar-Yam, 2016a, p. 19).

This is fully in line with the Fisher quote presented in Mayo’s slides: a phenomenon is only experimentally demonstrable when we know how to conduct an experiment that will rarely fail to give a statistically significant result.

The same goes for replications; no single study from a single lab can disprove a finding:

“’Thus a few stray basic statements contradicting a theory will hardly induce us to reject it as falsified. We shall take it as falsified only if we discover a reproducible effect which refutes the theory. In other words, we only accept the falsification if a low-level empirical hypothesis which describes such an effect is proposed and corroborated’ (Popper, 1959, p. 66)” (see Holtz & Monnerjahn, 2017)

So, if a high-quality non-replication itself replicates, one must consider that something may be off with the original finding. This leads us to the question of what researchers should study in the first place.

On research programmes

Lakatos (1971) posits a difference between progressive and degenerating research lines. In a progressive research line, investigators explain a negative result by modifying the theory in a way which leads to new predictions that subsequently pan out. On the other hand, coming up with explanations that do not make further contributions, but rather just explain away the negative finding, leads to a degenerative research line. Iso-Ahola quotes Lakatos to argue that, although theories may have a “poor public record” that should not be denied, falsification should not lead to abandonment of theories. Here’s Lakatos:

“One may rationally stick to a degenerating [research] programme until it is overtaken by a rival and even after. What one must not do is to deny its poor public record. […] It is perfectly rational to play a risky game: what is irrational is to deceive oneself about the risk” (Lakatos, 1971, p. 104)

As Meehl (1990, p. 115) points out, the quote continues as follows:

“This does not mean as much licence as might appear for those who stick to a degenerating programme. For they can do this mostly only in private. Editors of scientific journals should refuse to publish their papers which will, in general, contain either solemn reassertions of their position or absorption of counterevidence (or even of rival programmes) by ad hoc, linguistic adjustments. Research foundations, too, should refuse money.” (Lakatos, 1971, p. 105)

Perhaps researchers should pay more attention to which programme they are following?

As an ending note, here’s one more interesting quote: “Zealotry of reproducibility has unfortunately reached the point where some researchers take a radical position that the original results mean nothing if not replicated in the new data.” (Iso-Ahola, 2017)

For explorative research, I largely agree with these zealots. I believe exploration is fine and well, but the results do mean nearly nothing unless replicated in new data (de Groot, 2014). One cannot hypothesise and confirm with the same data.
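Why one cannot hypothesise and confirm with the same data can be shown in a toy simulation of my own: generate many purely noise-driven “effects”, cherry-pick the largest one in an exploratory sample, and watch it regress toward zero in fresh data. The impressive exploratory estimate was mostly selection, not signal.

```python
import random
import statistics

random.seed(3)

n_vars, n = 100, 50  # 100 candidate "effects", 50 observations each

def observed_effects():
    """Sample means for n_vars effects whose true value is exactly zero."""
    return [statistics.mean(random.gauss(0, 1) for _ in range(n))
            for _ in range(n_vars)]

exploration = observed_effects()
winner = max(range(n_vars), key=lambda i: exploration[i])  # cherry-picked

replication = observed_effects()  # fresh data from the same null world

print(f"winner in exploratory sample: {exploration[winner]:.2f}")
print(f"same 'effect' in new data:    {replication[winner]:.2f}")
```

Selecting the maximum of a hundred noisy estimates practically guarantees a sizable exploratory “finding”; only the replication sample gives it an honest test.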

Perhaps I focus too much on what was said in the paper rather than what the author actually meant, and we do apologise if we have failed to abide by the principle of charity in the commentary or this blog post. In a later post, I will attempt to show how the ten criteria Iso-Ahola proposed could be used to evaluate research.



Bar-Yam, Y. (2016a). From big data to important information. Complexity, 21(S2), 73–98.

Bar-Yam, Y. (2016b). The limits of phenomenology: From behaviorism to drug testing and engineering design. Complexity, 21(S1), 181–189.

de Groot, A. D. (2014). The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. Acta Psychologica, 148, 188–194.

Holtz, P., & Monnerjahn, P. (2017). Falsificationism is not just ‘potential’ falsifiability, but requires ‘actual’ falsification: Social psychology, critical rationalism, and progress in science. Journal for the Theory of Social Behaviour.

Iso-Ahola, S. E. (2017). Reproducibility in Psychological Science: When Do Psychological Phenomena Exist? Frontiers in Psychology, 8.

Lakatos, I. (1971). History of science and its rational reconstructions. Springer.

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1(2), 108–141.

Weaver, W. (1948). Science and complexity. American Scientist, 36(4), 536–544.


Evaluating intervention program theories – as theories

How do we figure out whether our ideas worked? To me, it seems that in psychology we seldom rigorously think about this question, despite having been criticised for dubious inferential practices for at least half a century. You can download a pdf of my talk at the Finnish National Institute for Health and Welfare (THL) here, or see the slide show at the end of this post. Please solve the three problems in the summary slide! 🙂

TLDR: is there a reason why evaluating intervention program theories shouldn’t follow the process of scientific inference?