Reflections on replication: Psychology's current crisis


Sacha Epskamp, University of Amsterdam

Reproducibility and Replicability in a Fast-paced Methodological World

Methodological developments and software implementations are progressing at an ever faster pace. The introduction and increasingly widespread acceptance of pre-print archived reports and open-source software make state-of-the-art methods readily accessible to researchers. At the same time, researchers increasingly emphasize that their results should be reproducible (obtaining the same results from the same data), which is a basic requirement for assessing the replicability (obtaining similar results in new data) of results; if results are not reproducible, we cannot expect them to be replicable. While the age of fast-paced methodology greatly facilitates reproducibility, it also undermines it in ways not often realized by researchers. The goal of this paper is to make researchers aware of these caveats. I discuss sources of unreproducibility and unreplicability in both the development of software routines and the development of methodology itself. In software development, unreproducibility may arise because software develops and changes over time, a problem that is greatly magnified by large dependency trees between software packages. In methodology, novel methods come with many researcher degrees of freedom, and new understanding brings changing standards over time. The paper concludes with a list of recommendations for both developers and users of new methodologies to improve reproducibility of results.
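
As an illustration of the software side of this problem, the sketch below records which package versions an analysis ran against, so that a later run can be compared with the original environment. It is not taken from the paper, whose recommendations are not listed in this abstract; it shows one common practice, using Python's standard library. The function name `environment_snapshot` is hypothetical.

```python
import importlib.metadata
import json
import platform


def environment_snapshot():
    """Return the interpreter version and the versions of all installed packages."""
    packages = {}
    for dist in importlib.metadata.distributions():
        meta = dist.metadata
        # Skip installations whose metadata is missing or incomplete.
        name = meta.get("Name") if meta is not None else None
        if name:
            packages[name] = meta.get("Version", "unknown")
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
# Storing this text next to the results documents the full dependency tree in use.
snapshot_text = json.dumps(snapshot, indent=2)
print(snapshot_text[:200])
```

Saving such a snapshot does not by itself make results reproducible, but it makes a change in any dependency visible when results differ between runs.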

Uljana Feest, Leibniz Universität Hannover

Thinking about Concepts in (Conceptual) Replication

One of the recurring themes in debates about replication turns on the question of what exactly constitutes a replication of an empirical study. Many agree that exact replication is impossible, in part because – if nothing else – the time-variable will be different, leaving the question of when an empirical study is similar enough to a previous one to count as a replication. As pointed out by Shavit & Ellison (2017), responses to this question will vary between disciplines and fields of study. Moreover, they will involve the (implicit or explicit) conceptual judgments as to whether the experimental design is likely to yield data relevant to the subject matter under investigation. This suggests that whereas much of the debate about reproducibility in psychology focuses on formal issues of reliability and statistical inference, there is also an irreducibly material aspect to this question: Empirical studies are typically designed with the aim of producing specific types of effect, and thus conceptual questions are in play both in the design and the implementation of a study.

In recent debates, the notion of “conceptual replication” has highlighted a similar point (e.g., Zwaan 2013), suggesting that in order to establish the validity (as opposed to the mere reliability) of a finding, different operational definitions of a concept should be used. My paper will explore this notion, focusing on issues such as whether there can be such a thing as non-conceptual replication, what kinds of concepts are in play, what (kinds of) roles they play in the research process, and – most importantly – what insights can be gleaned from this perspective about the current “replication crisis” in psychology.

Klaus Fiedler and Johannes Prager, Ruprecht-Karls-Universität Heidelberg

The Regression Trap and other Pitfalls of Replication Science – Illustrated by the Report of the Open Science Collaboration

Reviews of replication success (such as the Open Science Collaboration’s 2015 report) suggest that replication effect sizes in psychology are modest. However, closer inspection reveals serious problems when replication effect sizes are plotted against the corresponding original effect sizes. The regression trap must be taken into account: Expecting replication effects to be as strong as original effects is logically unwarranted; they are inevitably subject to regressive shrinkage. To control for regression, it is necessary to estimate the reliability of original and replication studies. Other problems arise from missing manipulation checks and sampling biases. The neglect of these problems in the current replication debate highlights the need to develop a distinct methodology for replication science, which must meet the same standards of scientific scrutiny as demanded of other research.
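
The regressive shrinkage described above can be illustrated with a small simulation. This is a minimal sketch under assumed values, not the authors' analysis: each study has a true effect, and the original and the replication each observe it with independent noise of equal size. All parameter values and function names are hypothetical. With equal true-effect and noise variance the reliability is 0.5, so the slope of replication on original effects should be close to 0.5 rather than 1, even though no study is flawed.

```python
import random


def simulate_pairs(n_studies=20000, mean_effect=0.3, true_sd=0.2, noise_sd=0.2, seed=1):
    """Simulate (original, replication) effect sizes that share a true effect."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_studies):
        true_effect = rng.gauss(mean_effect, true_sd)
        original = true_effect + rng.gauss(0.0, noise_sd)     # independent sampling error
        replication = true_effect + rng.gauss(0.0, noise_sd)  # independent sampling error
        pairs.append((original, replication))
    return pairs


def regression_slope(pairs):
    """Least-squares slope of replication effects on original effects."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


pairs = simulate_pairs()
slope = regression_slope(pairs)

# Look only at the quarter of studies with the largest original effects.
top = sorted(pairs, reverse=True)[: len(pairs) // 4]
mean_top_original = sum(x for x, _ in top) / len(top)
mean_top_replication = sum(y for _, y in top) / len(top)

print(f"slope of replication on original: {slope:.2f}")
print(f"top quartile, original mean:      {mean_top_original:.2f}")
print(f"top quartile, replication mean:   {mean_top_replication:.2f}")
```

In this sketch the studies with the largest original effects show clearly smaller replication effects on average, purely because of unreliability in both measurements.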

Christopher D. Green, York University

How Perverse Career Incentives Undermine Efforts to Fix Psychology’s Replication Problem

Much has been made of how apparently pervasive p-hacking by psychological researchers and the preference of journals to publish only novel and positive results (publication bias) have combined to undermine the credibility of a great deal of psychological research. Because of p-hacking, many psychological phenomena have proven to be non-replicable (i.e., not real), and due to editorial bias against publishing replications, we were unable to discover this until it had already consumed a distressingly large portion of the field. Research psychologists did not fall into this quagmire by themselves, however. Incentive structures that were created to satisfy administrative and political demands have played a strong role in shaping the behavior of researchers and, worse still, in selecting for career “survival” those who were already inclined to prioritize personal career success over scientific quality.

Since the rise of the modern research university in the mid-19th century, institutions of this sort have preferred professors who are active and successful in research to professors who are content merely to teach the influential works of others. Traditionally, the assessment of research activity and success was left to experts in the field – “peers” – who could most accurately evaluate the quality of the research. Increasingly, over the last quarter of the 20th century and into the 21st, metrics were developed and used that putatively enabled people with no expertise in the field – namely, administrators – to assess how active and successful a particular candidate for hiring or promotion has been at research.

Originally, the sheer number of publications was used as such a metric. Once it became obvious, however, that many publications were rarely read and had little influence, the metrics used began to focus not on the number of articles, per se, but on the number of times the articles were cited in other publications. Soon, citation metrics began to focus exclusively on recent publications. For instance, “Impact Factor,” which considers only publications of the past two years, started to be widely (mis-)used to measure the influence of individual researchers. “Impact Factor” was eventually followed by “improved” variants such as “h-index” and “i-10 index.”
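
For readers unfamiliar with these metrics, the sketch below computes two of them from a list of per-article citation counts, using their standard definitions: the h-index is the largest h such that h articles have at least h citations each, and the i10-index is the number of articles with at least 10 citations. The citation counts in the example are invented for illustration.

```python
def h_index(citations):
    """Largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of articles cited at least 10 times."""
    return sum(1 for cites in citations if cites >= 10)


# Hypothetical citation counts for one researcher's six articles.
example = [25, 12, 10, 4, 3, 1]
print(h_index(example))    # 4
print(i10_index(example))  # 3
```

Both numbers depend only on citation counts, which is why they can be computed, and inflated, without any reference to the content of the work.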

The problem with all of these metrics is that they increasingly drew attention away from the quality of the content and toward the numbers that were supposed to serve as proxies for it. In combination with the rapid decline in availability of career positions, the focus of many researchers shifted away from producing the highest quality science and toward maximizing the numbers that would decide their academic futures. Thousands of fake journals rapidly appeared on the scene, willing to publish nearly anything for a price. Some particularly ingenious, if unscrupulous, researchers went so far as to publish, under pseudonyms in fake journals, nonsense articles that cited their legitimate work profusely in order to create the illusion of greater impact.

In such a ruthless academic environment, it is little wonder that psychological researchers have been increasingly cutting statistical corners in a desperate scramble to get their articles into journals more quickly and in greater numbers than ever before. But that is precisely the research that is least likely to later replicate (and be true). It is these sorts of career pressures, along with more common explanations such as insufficient statistical and methodological education, that are causing the replication “crisis,” and it is these perverse career incentives that must be addressed if the “crisis” is to be resolved.

Daniël Lakens, TU Eindhoven

The Replication Problem is a Collaboration Problem

High quality research requires expertise. The time researchers have to develop expertise is limited, and the domains one needs expertise in continue to increase as science becomes more complex. Therefore, I believe it is no longer possible for small groups of researchers to cover all bases needed to perform high-quality research. Nevertheless, several fields continue to practice science in relatively small research units. This reduces the quality of theory formation, experimental design, research methods, statistics, and open science. High quality research also requires an environment that rewards the pursuit of cumulative science. Within a competitive system, small independently operating research units give rise to social dilemmas that lead to low quality science, inefficiency, and imbalance in the type of research questions that are addressed within scientific disciplines. I argue that the replication crisis should be understood as a relatively salient consequence of this larger collaboration problem, and argue that high quality research requires a more collaborative science.

Jill Morawski, Wesleyan University

Replication as a Psychological Problem

In a 2016 pre-published editorial, Susan Fiske denounced “destructico-critics” of psychology’s methodological practices for “ignoring ethical rules of conduct.” These “bullies” and “online vigilantes,” according to her, attack researchers personally and damage careers through their “public shaming and blaming.” Readers’ reactions to this vehement exposition resulted in publication of a modestly toned-down yet still emotionally charged version. Fiske’s editorial is just one instance of psychological reasoning about replication, showing that while statements on the so-called replication “crisis” are replete with mandates on data collection, statistics, publication standards, and the like, they also give ample attention to scientists’ attitudes and behaviors; they offer diagnostic scrutiny of their psychology. Such commentaries on the psychology of psychological scientists indicate that just as replication is claimed to be the “cornerstone” of science, so objectivity is a foundation. More importantly, they show that objectivity requires a certain kind of scientific actor. This paper takes up what Ted Port described as the “tight relationship between modes of objectivity and conceptions of the scientific self” to explore the ways in which psychologists involved in the replication conversations identify and explain how psychological processes of researchers impede or prevent objective research, thereby stymying the reproducibility of scientific studies. Analysis of the replication literature reveals two broad domains of psychologizing the scientist: emphases on cognitions/social cognitions and on a folk psychology of moral personhood.
While claims of scientists’ faulty cognitions draw upon such concepts as “confirmation bias” and “hindsight bias”, claims of scientists’ moral psychology engage popular cultural beliefs about human fallibility in the face of temptation and similarly common economic notions of the moral hazards that economic actors routinely confront.  When examined for content and frequency, the psychologies provide insight into researchers’ affective self-reflexive thinking and its relation to their causal reasoning about scientific practice.  The psychological diagnoses of the scientist’s self also illuminate contemporary understandings of objectivity, enabling us to consider how epistemic ideals are informed by the public status of science and scientific expertise, economics, and psychological science itself.

Annette Mülberger, Universitat Autònoma de Barcelona

Why replicate? Replications in psychology’s past

My talk starts with a general definition of the term “replication” and remarks on the historiography of experimentation, recalling the discussion of this feature of scientific practice in the works of Shapin & Schaffer, Collins, Galison, etc. Then I’ll deal with the history of experimentation and how scientists became concerned with repetition and replication of experiments. Schickore (2011), for example, stated that re-doing other investigators’ experiments became an issue around 1670. Moreover, she examined the case of an Italian microscopist and physiologist (Felice Fontana), who in the late eighteenth century stressed the importance of repeating his own experiments, in which he proved the origin and effects of viper’s venom. Some historians of science go even further back in time and find in the writings of an Arabic-Islamic astronomer of the tenth century a first interest in precise measurements and testing of empirical observations. Nevertheless, it would take more time until repeating experiments became an object of critical reflection and controversy. Grower (1997) deals with the controversy around Newton’s experiments in the seventeenth century. His contemporaries found it very difficult to replicate his experimental work with prisms. Reviewing such historical cases helps to clarify what these scientists understood by “repetition”, “re-doing an experiment” or “replication” and why they thought that this should be part of the process of knowledge construction.

When did psychologists become concerned about replication? Kant, who examined the possibility of psychology becoming a real science, mentioned as one obstacle the non-replicability of psychological introspection (Sturm, 2005). Therefore, for Wundt and his disciples, demonstrating that psychological experiments could be replicated was crucial. When Calkins and her student Nevers replicated Jastrow’s association experiments in the 1890s, they found slightly different results. Jastrow reacted furiously, arguing that their experiments were no real replications of his. The discussions about replication that took place in psychology from the end of the nineteenth century through the first half of the twentieth show that, within the experimental school in psychology, there was a concern to achieve universal and replicable results from its very beginnings. Moreover, the analysis of experimental reports of that time shows how the technical aspects of, and epistemological approaches to, replication changed over time.

Jelte Wicherts, Tilburg University

Why replications fail

Several recent large-scale collaborative projects have attempted to replicate – as closely as possible – previous results from highly cited psychology papers or papers published in top psychology journals. Despite considerations of power and rigorous pre-registration of analysis plans, the majority of these replications failed when judged on the basis of significance or when the size of the effect was compared to that in the original studies. Here I discuss substantive, statistical, and methodological reasons for these (apparently) discrepant results. I discuss the so-called hidden moderator account of failures to replicate, which states that replications differ from original studies in a priori unknown yet crucial ways. I also discuss the widespread failure to publish non-significant results and the considerable potential bias caused by researchers’ tendency to exploit the flexibility of designs and analyses of data in original studies. I conclude that although it is wise to always consider moderators even in fairly close replications, publication bias and common exploitation of researcher degrees of freedom in designing, running, analysing, and reporting psychological studies together likely explain the vast majority of failed replications. I discuss some preferred solutions.
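
How publication bias alone can produce failed replications can be sketched with a short simulation. This is an illustration under simplifying assumptions, not an analysis from the talk: every study estimates the same small true effect, observed effects are drawn from a normal approximation with a known standard error, and only originals that reach positive significance are "published" and then replicated at the same sample size. All parameter values and the function name are hypothetical.

```python
import math
import random


def simulate_publication_bias(n_studies=50000, true_d=0.2, n_per_group=20, seed=7):
    """Return the mean published original effect and the replication success rate."""
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # approximate standard error of a standardized mean difference
    critical = 1.96 * se               # smallest observed effect that counts as significant
    published = []
    replicated = 0
    for _ in range(n_studies):
        original = rng.gauss(true_d, se)
        if original > critical:        # publication filter: only significant originals survive
            published.append(original)
            replication = rng.gauss(true_d, se)  # same design, no filter applied
            if replication > critical:
                replicated += 1
    mean_published = sum(published) / len(published)
    return mean_published, replicated / len(published)


mean_published, replication_rate = simulate_publication_bias()
print(f"true effect:                 0.20")
print(f"mean published effect:       {mean_published:.2f}")
print(f"share of significant replications: {replication_rate:.2f}")
```

Under these assumptions the published originals overestimate the true effect several times over, and most replications come out non-significant, without any hidden moderator or any fault in the replication studies themselves.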