Why you should share Data Nudes instead of just Shitty Tables

This post summarises what I wanted to say with a recent paper published in Health Psychology and Behavioral Medicine, which includes an RMarkdown website supplement with code. A related slideshow and a video walkthrough are available here. Note: If it’s not obvious, these are my opinions as the first author, and may or may not be shared with collaborators who are nice people and surely wouldn’t use such foul language in public.

Some Problems in Summarising and Presenting Data

Many research reports include lots of variables, presented in tables comparing two or more groups, say an intervention and a control, or males and females. Readers often look at the means and standard deviations, looking for statistically significant differences between the two. What’s the problem?

1. It’s often not clear what significance even means, or whether some correction for multiple testing has been applied.

First of all, following the logic of Neyman-Pearson hypothesis testing, keeping the error rate under the alpha level requires correcting for multiple testing, and it is unclear how many tests one should correct for when hypotheses are not pre-specified. If this is ignored – especially where it is unclear how to heed the recommendation to justify one’s alpha level – error rates can become surprisingly high, much higher than the conventionally assumed 5%.
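To get a feel for how fast this escalates, here is a minimal sketch in Python (not taken from the paper's R supplement). It assumes the tests are independent and that every null hypothesis is true, which real tables rarely satisfy exactly; the function names are my own.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests independent
    tests, assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test alpha that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Print the familywise error rate for tables of increasing size.
    for m in (1, 5, 20, 50):
        rate = familywise_error_rate(m)
        print(f"{m:>3} tests: familywise error rate = {rate:.3f}, "
              f"Bonferroni per-test alpha = {bonferroni_alpha(m):.4f}")
```

The Bonferroni correction is only one (conservative) option, and it still requires knowing how many tests to count, which is exactly the problem with non-prespecified comparisons.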

2. In the absence of randomisation, increased sample size leads to detecting more and more tiny differences.

When there has not been randomisation (as in the case of genders or baseline cohort descriptions), the null hypothesis of zero difference is never true, and its rejection only depends on statistical power. We are pretty much never interested in whether the populations differ by any arbitrarily small amount on any of the presented variables. What usually matters is whether this difference is large enough to make a difference, that is, how big the effect size is. Two caveats follow: Firstly, in behavioural field trials, your participants are rarely independent from each other, but come clustered in e.g. classrooms (students), hospitals (patients) or offices (9-to-5 mental patients). Secondly, you almost always need to randomise clusters instead of individuals (here’s why), which gives statistical power a huge ass-whooping.
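Both points can be illustrated with a rough sketch in Python. It uses a normal approximation for a two-group comparison and the textbook design effect 1 + (m − 1) × ICC for equal cluster sizes; the effect size, cluster size and intraclass correlation below are made-up illustration values, not estimates from our trial.

```python
from statistics import NormalDist


def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test to detect a standardised
    mean difference d with n_per_group participants in each group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the test statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation due to clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def effective_n(n: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # A tiny difference (d = 0.05) becomes "detectable" with enough people.
    for n in (100, 1_000, 10_000, 100_000):
        print(f"n per group = {n:>7}: power = {power_two_sample(0.05, n):.3f}")
    # Hypothetical: 1000 students in classes of 20 with ICC = 0.05.
    print(f"effective n = {effective_n(1000, 20, 0.05):.0f}")
```

With these invented numbers, a thousand clustered participants carry roughly the information of half as many independent ones, which is the ass-whooping in question.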

Not accounting for the multilevel structure of the data when calculating effect sizes can bias the estimates and understate their standard errors, possibly even making zero effects appear as medium-sized ones. But it is not a trivial task to derive trustworthy effect sizes for nested data (Lai & Kwok 2016). Although some solutions exist, they have not yet been empirically validated for finite populations in the second or third levels, nor is there currently a straightforward software implementation available – to my knowledge, that is. Therefore, a sensible option may be to present the means with their corresponding confidence intervals, encouraging the readers to refrain from merely considering non-overlapping intervals between groups as dichotomous hypothesis tests. In Shitty Table 1 you can see how this is done. Doesn’t that seem clear to you? Don’t worry, there are alternatives!

shitty table 1
Shitty Table 1. Means and confidence intervals for lots of things. Click to enlarge. Source.

3. The shape of the distribution may matter much, much more than the simple arithmetic mean.

The difference between two means is fun and neat, but only informative for approximately normal or symmetric distributions, which are not the norm in social and life sciences. See the reading list at the end. But hey, surely everyone reports things like skewness and kurtosis? [Of course they don’t, and even if they did, a minority of social scientists could actually interpret the numbers.] Look at Shitty Table 2 to see for yourself whether you consider this a good way to convey information.

shitty table 2
Shitty Table 2. Means, standard deviations and some distributional properties of a single variable in different educational tracks the participants were nested in. Nur = Practical nurse, HRC = Hotel, restaurant and catering studies, BA = Business and administration, IT = Business information technology. Click to enlarge. Source.
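To see why a mean alone can mislead for skewed variables, here is a small simulated sketch in Python. The data are random draws, not our trial data, and the skewness formula is the simple population (moment) version rather than whatever a given statistics package reports.

```python
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
skewed = [rng.lognormvariate(0, 1) for _ in range(10_000)]
symmetric = [rng.gauss(0, 1) for _ in range(10_000)]

if __name__ == "__main__":
    for name, data in (("lognormal", skewed), ("normal", symmetric)):
        print(f"{name:>9}: mean = {statistics.fmean(data):.2f}, "
              f"median = {statistics.median(data):.2f}, "
              f"skewness = {sample_skewness(data):.2f}")
```

In the skewed sample the mean sits well above the median, so "the average participant" describes a value most participants fall below. The skewness number flags this, but a plot shows it at a glance.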

An aside as regards the means: Few individual participants are described by the group-level summary statistics. In fact, using Daniels’ definition of an ‘approximately average individual’ as falling in the middle 30% of the range of values, only 1.50% of participants can be considered ‘average’ on all of the primary outcome measures (see supplementary website, section https://git.io/fpOy1). Also see this and this blog post, as well as the papers listed at the end.
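The calculation can be sketched as follows, in Python. This is my simplified reading of the definition, where "middle 30% of the range" means between 35% and 65% of the way from the observed minimum to the observed maximum; the supplement linked above is the authoritative version, and the toy data and function names here are invented.

```python
def is_approximately_average(value, observed):
    """True if value lies in the middle 30% of the observed range."""
    low, high = min(observed), max(observed)
    span = high - low
    return low + 0.35 * span <= value <= low + 0.65 * span


def share_average_on_all(rows):
    """Proportion of participants who are 'average' on every variable.

    rows: list of dicts mapping variable name -> value, one per participant.
    """
    variables = list(rows[0].keys())
    columns = {v: [row[v] for row in rows] for v in variables}
    n_average = sum(
        all(is_approximately_average(row[v], columns[v]) for v in variables)
        for row in rows
    )
    return n_average / len(rows)


# Invented toy data: four participants, two outcome variables.
toy = [
    {"a": 0, "b": 10},
    {"a": 5, "b": 5},
    {"a": 10, "b": 0},
    {"a": 5, "b": 0},
]

if __name__ == "__main__":
    print(share_average_on_all(toy))
```

Each added variable can only shrink the share, which is why so few people end up 'average' across a whole set of outcomes.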

Data Wants to be Seen Naked


In our paper, we present some ways behaviour change researchers could visualise their data, discuss some limitations and provide links to R code. Many, many other dedicated sources do this better, so feel free to check out this or this, for example. A principle I particularly like is to, whenever possible, include the raw data in the visualisation. This is because in abstractions, I personally have a hard time keeping in mind that I’m dealing with individuals operating in the world (complex dynamic systems in complex dynamic systems), and the raw data tends to ground me to some reality.

pretty picture 1
Pretty Picture 1. Visualising the information in Shitty Table 1 with raw data. Click to enlarge.

Data-visualisation and data exploration techniques (e.g. network analysis) can help reveal the dynamics involved in complex multi-causal systems – a challenging task with Shitty Tables. Data visualisations are crucial supplements to large numerical tables of descriptive statistics. With visualisations, researchers can communicate large amounts of information – including the associated uncertainty – in an accessible format, without requiring extensive mathematical expertise from the reader. This is important for researchers who intend to build on previous results, and in the paper we argue that such practices may also reduce problems that have led to the recent loss of confidence in the reproducibility and replicability of research findings in social and life sciences. Fully open data sharing would be ideal, but this is not always possible due to privacy concerns and, at the time of writing, remains a lamentably rare practice. In addition, open data does not necessarily accommodate stakeholders with low technical expertise in data analysis and visualisation, such as clinicians, patients and policy makers.

The benefits of presenting complex data visually should encourage researchers to publish extensive analyses and descriptions as website supplements, which would increase the speed and quality of scientific communication, as well as help to address the crisis of reduced confidence in research findings.

pretty picture 2
Pretty Picture 2. Visualising the information in Shitty Table 2. Shows hours of accelerometer-measured moderate-to-vigorous physical activity for different educational tracks. Midpoints of diamonds indicate means, endpoints 95% credible intervals. Individual observations are presented under the density curves, with random scatter on the y-axis to ease inspection. Nur = Practical nurse, HRC = Hotel, restaurant and catering, BA = Business and administration, IT = Information and communications technology.

In Pretty Picture 2, looking closely you can observe that boys did more moderate-to-vigorous physical activity (x-axis is average daily hours) in every educational track (shown as rows in the figure). In spite of this, girls appeared more active when combining the educational tracks, because there are many more people in the practical nurse track, and those people are mostly girls. This is known as Simpson’s paradox, and is best investigated by visualising data.
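The arithmetic behind the paradox is easy to reproduce with a toy example in Python. The group sizes and means below are invented for illustration and are not the trial's numbers; only two tracks are used to keep it short.

```python
# Invented numbers: (group size, mean daily hours of activity).
groups = {
    "Nur": {"girls": (90, 1.2), "boys": (10, 1.3)},
    "IT": {"girls": (10, 0.7), "boys": (90, 0.8)},
}


def pooled_mean(sex):
    """Size-weighted mean across tracks for one sex."""
    total_n = sum(track[sex][0] for track in groups.values())
    total_hours = sum(track[sex][0] * track[sex][1] for track in groups.values())
    return total_hours / total_n


if __name__ == "__main__":
    for name, track in groups.items():
        print(f"{name}: girls = {track['girls'][1]}, boys = {track['boys'][1]}")
    print(f"pooled: girls = {pooled_mean('girls'):.2f}, "
          f"boys = {pooled_mean('boys'):.2f}")
```

Boys have the higher mean within each track, yet girls have the higher pooled mean, because most girls are in the more active track. The reversal comes from the unequal group sizes alone.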

pretty picture 3.PNG
Pretty Picture 3. See paper for elaboration.

Conventional approaches would have e.g. left the reader with an impression that the means of the multimodal or skewed variables (see Pretty Picture 1) are interpretable as central tendencies, and that the sample is homogeneous (see Pretty Picture 2). Transparent and accessible sharing of data characteristics, analyses and analytical choices is imperative for increasing confidence in research findings; if nothing else, the elaborate supplements can act as a platform to present robustness tests and assumption explorations in.

pretty picture 4
Pretty Picture 4. See paper for elaboration.

Reading list

The paper described in this post:

  • Heino, M. T. J., Knittle, K., Fried, E., Sund, R., Haukkala, A., Borodulin, K., … Hankonen, N. (2019). Visualisation and network analysis of physical activity and its determinants: Demonstrating opportunities in analysing baseline associations in the Let’s Move It trial. Health Psychology and Behavioral Medicine, 7(1), 269–289. https://doi.org/10.1080/21642850.2019.1646136
  • Supplementary website: Link

On data visualisation:

  • Tay, L., Parrigon, S., Huang, Q., & LeBreton, J. M. (2016). Graphical descriptives: A way to improve data transparency and methodological rigor in psychology. Perspectives on Psychological Science, 11(5), 692–701.

On hypothesis testing for non-prespecified comparisons:

  • de Groot AD. The meaning of “significance” for different types of research [translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas]. Acta Psychologica. 2014;148:188–94.
  • Nosek BA, Ebersole CR, DeHaven AC, Mellor DT. The preregistration revolution. Proceedings of the National Academy of Sciences. 2018;201708274.

On effect sizes for cluster randomised situations:

  • Lai MHC, Kwok O-m. Estimating Standardized Effect Sizes for Two- and Three-Level Partially Nested Data. Multivariate Behavioral Research. 2016;51:740–56.
  • Lai MHC, Kwok O-m, Hsiao Y-Y, Cao Q. Finite population correction for two-level hierarchical linear models. Psychological Methods. 2018;23:94.

On distributional shapes:

  • Choi, S. W. (2016). Life is lognormal! What to do when your data does not follow a normal distribution. Anaesthesia, 71(11), 1363–1366.
  • Saxon, E. (2015). Beyond bar charts. BMC Biology, 13(1), 60. doi: 10.1186/s12915-015-0169-6
  • Taleb, N. N. (2007). Black swans and the domains of statistics. The American Statistician, 61(3), 198–200.
  • van Rooij, M. M., Nash, B., Rajaraman, S., & Holden, J. G. (2013). A fractal approach to dynamic inference and distribution analysis. Frontiers in Physiology, 4, 1.
  • Weissgerber, T. L., Garovic, V. D., Savic, M., Winham, S. J., & Milic, N. M. (2016). From static to interactive: Transforming data visualization to improve transparency. PLOS Biology, 14(6), e1002484. doi: 10.1371/journal.pbio.1002484
  • Weissgerber, T. L., Milic, N. M., Winham, S. J., & Garovic, V. D. (2015). Beyond bar and line graphs: Time for a new data presentation paradigm. PLOS Biology, 13(4), e1002128. doi: 10.1371/journal.pbio.1002128

On averages:

  • Daniels, G. S. (1952). The “average man”? Wright-Patterson Air Force Base, OH: Air Force Aerospace Medical Research Lab.
  • Rose, T. (2016). The end of average: How to succeed in a world that values sameness. Penguin UK.
  • Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46(2), 1738–1748. doi: 10.1111/ejn.13610
  • Trafimow, D., Wang, T., & Wang, C. (2018). Means and standard deviations, or locations and scales? That is the question! New Ideas in Psychology, 50, 34–37. doi: 10.1016/j.newideapsych.2018.03.001

Modern tools to enhance reproducibility and comprehension of research findings (VIDEO WALKTHROUGH 14min)

presen eka slide

These are the slides of my presentation at the annual conference of the European Health Psychology Society. It’s about presenting data visually, and taking publishing culture from the journals into our own hands. I hint at a utopia, where the journal publication is a side product of a comprehensively reported data set.

Please find a 14min video walkthrough of the slides (which can be found here) below. The site presented in the slides is here, and the tutorial by the most awesome Lisa DeBruine is here!

 

After the talk, I saw what was probably the best tweet about a presentation of mine ever. For a fleeting moment, I was happy to exist:

ehps cap

Big thanks to everyone involved, especially Gjalt-Jorn Peters for helpful suggestions on code and the plots. For the diamond plots, check out diamondplots.com.

Authors of the conference abstract:

Matti Heino; Reijo Sund; Ari Haukkala; Keegan Knittle; Katja Borodulin; Antti Uutela; Vera Araújo-Soares; Falko Sniehotta; Tommi Vasankari; Nelli Hankonen

Abstract

Background: Comprehensive reporting of results has traditionally been constrained by limited reporting space. In spite of calls for increased transparency, researchers have had to choose carefully what to report, and what to leave out; choices made based on subjective evaluations of importance. Open data remedies the situation, but privacy concerns and tradition hinder rapid progress. We present novel possibilities for comprehensive representation of data, making use of recent software developments.

Methods: We illustrate the opportunities using the Let’s Move It trial baseline data (n=1084). Descriptive statistics and group comparison results on psychosocial correlates of physical activity (PA) and accelerometry-assessed PA were reported in an easily accessible html-supplement, directly created from a combination of analysis code and data using existing tools within R.

Findings: Visualisations (e.g. network graphs, combined ridge and diamond plots) enabled presenting large amounts of information in an intelligible format. This bypasses the need to create narrative explanations for all data, or compress nuanced information into simple summary statistics. Providing all analysis code in a readily accessible format further contributed to transparency.

Discussion: We demonstrate how researchers can make their extensive analyses and descriptions openly available as website supplements, preferably with abundant visualisation to avoid overwhelming the reader with e.g. large numeric tables. Uptake of such practice could lead to a parallel form of literature, where highly technical and traditionally narrated documents coexist. While we may have to wait for fully open and documented data, comprehensive reporting of results is available to us now.

 

 

Open technical reports for the technically-minded

When Roger Giner-Sorolla lamented to me three years ago how annoying it can be to dig out interesting methods/results information from a manuscript with a carefully crafted narrative, I wholeheartedly agreed. When I saw the 100%CI post on reproducible websites a year ago, I thought it was cool but way too tech-y for me.

Well, it turned out that when you learn a tiny bit of elementary R Markdown, you can follow idiot-proof instructions on how to make cool websites out of your analysis code. I was also working on the manuscript version of my Master’s thesis, and realised several commenters thought much of the methods stuff I considered interesting was just unnecessary and/or boring.

So I made this thing of what I thought was the beef of the paper (also, to motivate me to finally submit that damned piece):

sms-supplement

It got me thinking: Perhaps we could create a parallel form of literature, where (open) highly technical and (closed) traditionally narrated documents coexist. The R Markdown research notes could be read with only a preregistration or a blog post to guide the reader, while the journals could just continue with business as usual. The great thing is that, as Ruben Arslan pointed out in the 100%CI post, you can present a lot of results and analyses, which is nice if you’d do them anyway and data sharing is a no-no in your field. In general, if there’s just too much conservative inertia in your field, this could be a way around it: Let the to-be-extinct journals build paywalls around your articles, but make the important things openly available. The people who get pissed off by that sort of stuff rarely look at technical supplements anyway 🙂

I’d love to hear your thoughts on the feasibility of the approach, as well as how to improve such supplements!

Afterthought

After some insightful comments by Gjalt-Jorn Peters, I started thinking how this could be abused. We’ve already seen how e.g. preregistration can be used as a signal of illusory quality (1, 2), and supplements like this could do the same thing. Someone could just bluff by cramming the thing full of difficult-to-interpret analyses, and claim “hey, it’s all there!”. One helpful thing is to expect heavy use of visualisations, which are less morbid to look at than numeric tables and raw R output. Another option would be creating a wonderful shiny app, like Emorie Beck did.

Actually, let’s take a moment to marvel at how super awesomesauce that thing is.

Thanks.

So, to continue: I don’t know how difficult it really is to make such a thing. I’m sure a lot of tech-savvy people readily say it’s the simplest thing in the world, and I’m sure a lot of people will see the supplements I presented here as a shitton of learning to do. I don’t have a solution. But if you’re a PI, you can do both yourself and your doctoral students a favour by nudging them towards learning R; maybe they’ll make a shiny app (or whatever’s in season then) for you one day!

ps. If I did the R Markdown all over again, I’d make more and better plots, as well as put more emphasis on readability, including better annotation of my code and decisions. Some of that code is from when I first learned R, and it’s a bit … rough. (In the last moment before submitting my Master’s thesis I decided, in a small state of frustrated fury, to re-do all analyses in R so that I needn’t mention SPSS or Excel in the thesis…)

pps. In the manuscript, I link to the page via a GitHub Pages URL shortener, but provide a permalink (web page stored with the Wayback Machine) in the references. We’ll see what the journal thinks of that.

ppps. There are probably errors lurking around, so please notify me when you see them 🙂

Introduction to data management best practices

data2.png

With the realisation that even linked data may not be enough for scientists (1), and as the European Union decided to embrace open access and best practices in data management (2–4), many psychologists find themselves treading unfamiliar terrain. Given the estimate that ~85% of health research is wasted, this is nothing short of a pressing issue in related fields.

Here, I comment on the FAIR Guiding Principles for scientific data management and stewardship (5) for the benefit of myself and perhaps others who have not been involved with data management best practices.

[Note: all this does NOT mean that you are forced to share sensitive data. But if your work cannot be checked or reused (even after anonymisation), calling it scientific might be a stretch.]

What goes in a data management plan?

A necessary document to accompany any research plan is the data management plan. This plan should first of all specify the purpose of the data collection, and how it relates to the objectives of one’s research project. It should state which types of data are collected – for example, in the context of an intervention to promote physical activity, one might collect survey data, as well as accelerometer and body composition measures. The steps to assure the quality of the data can be described, too.

Next, the file formats for this data should be specified, along with which parts of the data will be made openly available, if the whole data is not made so. When and where will the data be made available, and what software is needed to read it? Will there be restrictions to access? Will there be an embargo, and if so, why?

The data management plan should also state whether existing data is being re-used. The researcher should clarify the origin of data, whether existing or new, comment on its size (if known), and outline to whom the data will be useful (4).

Bad practices leading to unusable data are still common, and adopting proper data management practices can incur costs. The data management plan should explicate these costs, how they are covered and who is responsible for the data management process.

The importance of collecting original data in psychology cannot be overstated. Data are a conditio sine qua non for any empirical science. Anyone who generates data and shares them publicly should be adequately recognized. (6)

Note: metadata means any information about the data. For example, descriptive metadata aids discovery and identification, and includes elements such as keywords, title, abstract and author. Administrative metadata informs the management of the data, and includes creation dates, file types and version numbers.
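As a concrete (and entirely hypothetical) illustration, a metadata record covering the elements named above could be sketched like this in Python. The field names, values and the completeness check are my own inventions and do not follow any particular standard. A data archive will typically have its own schema for you to fill in.

```python
import json

# Fields named in the note above; the grouping and key names are assumptions.
REQUIRED = {
    "descriptive": ["title", "author", "abstract", "keywords"],
    "administrative": ["created", "file_type", "version"],
}

# Hypothetical placeholder values, for illustration only.
record = {
    "descriptive": {
        "title": "Example baseline survey data",
        "author": "Example Researcher",
        "abstract": "Placeholder description of what the dataset contains.",
        "keywords": ["physical activity", "survey", "baseline"],
    },
    "administrative": {
        "created": "2017-03-29",
        "file_type": "text/csv",
        "version": "1.0.0",
    },
}


def missing_fields(record, required=REQUIRED):
    """List required metadata fields that are absent or empty."""
    missing = []
    for section, fields in required.items():
        for field in fields:
            if not record.get(section, {}).get(field):
                missing.append(f"{section}.{field}")
    return missing


if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print("missing:", missing_fields(record))
```

Keeping the record in a plain, structured text format means both humans and machines can read it alongside the data file.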

The FAIR principles for data management

The FAIR principles have been composed to help both machines and humans (such as meta-analysts) to find and use existing data. The principles consist of four requirements: Findability, Accessibility, Interoperability and Reusability. Note that adherence to these principles is not just a yes-no question, but a gradient where data stewards should aspire to increased uptake.

Below, the exact formulation of the (sub-)principles is in italics, my comments in bullet points.

Findability:

F1. data are assigned a globally unique and eternally persistent identifier.

  • This is mostly handled in psychological research by making sure the research document is supplied with a DOI (Digital Object Identifier (7)). In addition to journals (for published research), most repositories where one can deposit any material (such as FigShare or Zenodo), or preprints (such as PsyArXiv), assign the work a DOI automatically.

F2. data are described with rich metadata.

  • This relates to R1 below. There should be data about the data telling you what the data is. Also: What is your approach to making versioning clear? In the Open Science Framework (OSF), you can upload new versions of your document and it automatically saves the previous version behind the new one, given that the new file has the same name as the old one.
  • Your data archiver helps you with metadata. E.g. the Finnish Social Science Data Archive (FSD) uses the DDI 2.1. metadata standard.

F3. data are registered or indexed in a searchable resource.

  • The researcher should deposit the data in a searchable repository. Your own website, or the website of your research group, is unfortunately not enough.

F4. metadata specify the data identifier.

  • Make sure your data actually shows its DOI somewhere, and include a link to the dataset in the metadata. As far as I know, repositories such as the OSF do this for you.

maarten-van-den-heuvel-63284.jpg
Non-transparent, inaccessible data. [Photo by Maarten van den Heuvel on Unsplash.]
Accessibility:

  • From what I understand, these are not too relevant to individual researchers. Basically, if your work can be accessed via “http://”, you are complying with this. You should also be mindful of storing your data in one repository only, and avoid having multiple DOIs. Regarding A2: if your data is sensitive and you cannot share it openly, the description of the data should still be accessible to researchers. I am not certain about how repositories deal with accessibility after the data has been taken offline.

A1. data are retrievable by their identifier using a standardized communications protocol.

A1.1 the protocol is open, free, and universally implementable.

A1.2 the protocol allows for an authentication and authorization procedure, where necessary.

A2. metadata are accessible, even when the data are no longer available.

Interoperability:

  • Behind these items (and the FAIR principles in general) is the idea that machines could read the data and mine it for e.g. meta-analyses. I am blissfully unaware of the intricacies related to that endeavour, so I just comment from the perspective of a common researcher here.

I1. data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

  • It is better to prefer simple formats (e.g. spreadsheets with comma-separated values, “file.csv”) that can be opened without special software, over formats that require it (e.g. SPSS, “file.sav”).
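As a small sketch of why plain CSV travels well, here is a round trip using only Python's standard library. The column names and values are invented, and the text is kept in memory only so that the example is self-contained. In practice you would write to a file such as "file.csv".

```python
import csv
import io

# Invented example rows; the column names are hypothetical.
rows = [
    {"id": 1, "track": "Nur", "mvpa_hours": 1.2},
    {"id": 2, "track": "IT", "mvpa_hours": 0.8},
]


def to_csv_text(rows):
    """Serialise a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv_text(text):
    """Read CSV text back into a list of dicts (all values become strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    text = to_csv_text(rows)
    print(text)
    print(from_csv_text(text))
```

Note that everything comes back as text, so variable types and units still need to be documented separately, which is one more job for the metadata.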

I2. data use vocabularies that follow FAIR principles.

  • This principle may seem somewhat vague and hard for anyone other than computer scientists to grasp. It relates to the index terms or glossaries used. In psychology, one possibility would be the APA thesaurus used by PsycINFO.

I3. data include qualified references to other (meta)data.

  • This should be a given, and the citation culture of psychology seems well-equipped to follow. But it is still important to cite the original source of questionnaires, accelerometer algorithms etc.

pahala-basuki-4829.jpg
Accessible, transparent and FAIR data. [Photo by Pahala Basuki on Unsplash.]
Re-usability:

R1. data have a plurality of accurate and relevant attributes.

  • This means that the research should be accompanied by e.g. tags or a description that gives information seekers enough information to judge the value of reuse.

R1.1. data are released with a clear and accessible data usage license.

  • You should state which licence the work is under. It is commonly recommended to use “CC0”, which allows all reuse, even without attribution. The second-best alternative, “CC-BY” (which requires attribution), can lead to problems with attribution stacking, when licences pile on each other (see chapter 10.4 in reference 8). Citing others’ work is a commonly accepted practice in psychology, so CC0 seems a reasonable option, though I sympathise with the (almost invariably unfounded) fear of being scooped.

R1.2. data are associated with their provenance.

  • This means that the source of the data is clear, so that the data can be cited.

R1.3. data meet domain-relevant community standards.

  • In psychology, there are not many well-known community standards, but e.g. the DFG guidelines (6) are showing the way.

Conclusion

The FAIR principles can be hard to comply with exhaustively, as they are sometimes difficult to interpret (even by people who work in data archives) and take a lot of effort to implement. Hence, everyone should consider whether their data is FAIR enough. As with open data in general, one should be able to describe why best practices could not be followed, when that is the case. But, for the sake of ethics if nothing else, we should aim to do the best we can.

Additional information on the FAIR principles can be found here, and some difficulties in assessing the adherence to them in (9). A 20min webinar in Finnish is available here.

 

Bibliography

  1. Bechhofer S, Buchan I, De Roure D, Missier P, Ainsworth J, Bhagat J, et al. Why linked data is not enough for scientists. Future Gener Comput Syst. 2013;29(2):599–611.
  2. Khomami N. All scientific papers to be free by 2020 under EU proposals. The Guardian [Internet]. 2016 May 28 [cited 2017 Mar 29]; Available from: https://web.archive.org/web/20170329092259/https://www.theguardian.com/science/2016/may/28/eu-ministers-2020-target-free-access-scientific-papers
  3. European Commission. Open access – H2020 Online Manual [Internet]. [cited 2017 Mar 29]. Available from: https://web.archive.org/web/20170329092016/https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-data-management/open-access_en.htm
  4. European Commission. Guidelines on data management in Horizon 2020 [Internet]. 2016 [cited 2017 Mar 29]. Available from: https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
  5. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016 Mar 15;3:160018.
  6. Schönbrodt F, Gollwitzer M, Abele-Brehm A. Data Management in Psychological Science: Specification of the DFG Guidelines [Internet]. 2017 [cited 2017 Mar 29]. Available from: https://osf.io/preprints/psyarxiv/vhx89
  7. International DOI Foundation. Digital Object Identifier System FAQs [Internet]. [cited 2017 Mar 29]. Available from: https://www.doi.org/faq.html
  8. Briney K. Data Management for Researchers: Organize, maintain and share your data for research success [Internet]. Pelagic Publishing Ltd; 2015 [cited 2017 Mar 29]. Preview available from: https://books.google.fi/books?id=gw1iCgAAQBAJ&lpg=PT7&dq=Data%20management%20for%20researchers%3A%20organize%2C%20maintain%20and%20share%20your%20data%20for%20research%20success&lr&hl=fi&pg=PT6#v=onepage&q&f=false
  9. Dunning A. FAIR Principles – Connecting the Dots for the IDCC 2017 [Internet]. Open Working. 2017 [cited 2017 Mar 29]. Available from: https://openworking.wordpress.com/2017/02/10/fair-principles-connecting-the-dots-for-the-idcc-2017/