Crisis? What Crisis?

 


Even in 2016 I assumed someone must have used this title for a post about the replication crisis. 





I don’t think I had come across this quote in 2016, though:


"Science is built of facts the way a house is built of bricks: but an accumulation of facts is no more science than a pile of bricks is a house” (Henri Poincaré)


If I had ever wanted to put a quotation at the end of my emails, this would have been the one. Especially during my dalliance at SRCD supervising Developmental Science...


I thought of this essay when I read the Experimental History Substack: https://experimentalhistory.substack.com/p/psychology-might-be-a-big-stinkin


Its castle vs. scattered rocks imagery reminded me of this quote, and if the author was unaware of it, as I had been, maybe he would be interested.


Also a bit in response, as a semi-friendly neighborhood observer of Social Psychology:


There are a lot of scattered rocks around showing priming effects. The field has conclusively established that anything can prime anything under the right conditions. Establishing the specific priming conditions is a tricky matter, however, much more akin to developing a theme in a novel than setting a dial (or even tuning a parameter). It’s a bleak lunar landscape of scattered pebbles.


But think of the pebbles as anchoring a blanket— holding down some patch of intellectual territory. There are a lot of pebbles holding down the canvas of priming effects.



(I haven’t edited this text from 2016, except to add a link to the paper I was reacting to. Not sure why there were no names.)


The recent publication of XX’s ethnography of practices in several infant labs has deepened concerns about the state of psychological science. It seems that researchers employ a range of fairly idiosyncratic, biasing, and undocumented practices to produce their data. Partially as a consequence of these practices, results often fail to replicate: One team of researchers cannot reproduce the effects observed by another. This state of affairs has come to be known as “the replication crisis.” But how serious is this crisis? Sure, failures to replicate and dependence on specialized practices are concerns, but there are lots of things to worry about and try to get right when doing science. In experimental psychology, replication is not our biggest problem.


In evaluating an experimental procedure or result there are generally two things people worry about: validity and reliability. Validity and reliability are very closely related to accuracy and precision. If a procedure is reliable, it gives the same answer each time it is applied. When we aim at a target with a precise/reliable procedure, we always hit the same spot. A valid procedure is one that produces what it is supposed to. Aiming at a target with an accurate/valid procedure directs you to the bullseye. Accuracy does not guarantee precision, nor vice versa. We might be accurately aiming at the bullseye without reliably hitting it. We might be able to hit the same spot over and over, but be missing the target.
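If it helps to make the target analogy concrete, here is a minimal simulation sketch (mine, added for illustration; the procedure names and numbers are made up, not from the 2016 text). One procedure is reliable but invalid: it hits nearly the same wrong spot every time. The other is valid but unreliable: it is centered on the truth but scatters widely.

    import random

    # Illustrative toy example: two measurement "procedures" aimed at a true value of 100.
    TRUE_VALUE = 100

    def reliable_but_invalid():
        # Precise but biased: tight spread around the wrong value.
        return random.gauss(85, 1)

    def valid_but_unreliable():
        # Unbiased but noisy: centered on the true value, wide spread.
        return random.gauss(TRUE_VALUE, 15)

    def summarize(procedure, n=10_000):
        # Run the procedure many times; report its average and its spread.
        scores = [procedure() for _ in range(n)]
        mean = sum(scores) / n
        spread = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
        return mean, spread

    for name, proc in [("reliable/invalid", reliable_but_invalid),
                       ("valid/unreliable", valid_but_unreliable)]:
        mean, spread = summarize(proc)
        print(f"{name}: mean = {mean:.1f} (truth = {TRUE_VALUE}), spread = {spread:.1f}")

Repeating the first procedure tells you nothing new; it replicates beautifully and is wrong every time. That is the contrast the next paragraphs are after.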


Replication is a problem of reliability. We apply the same procedures twice and get different results. That’s a problem. But so is getting a precise but meaningless answer. Mature sciences, like physics or chemistry, can mostly worry about reliability. They know what they are measuring, and how; they just want to do it carefully and precisely. Immature sciences, like psychology, are, or should be, mostly concerned with validity. They need to figure out what to measure and how to measure it. Take the example of IQ tests. There are a number of very reliable IQ tests—you give the test to the same person twice and you get the same answer. You give the test to pretty different groups of people, and you get about the same result (e.g., an average score of 100). The trouble is that most psychologists don’t believe the IQ tests are particularly valid. We don’t know what they are measuring, but it doesn’t seem to be intelligence.


Experimental psychologists are really concerned about not developing more IQ tests, that is, reliable measures of unclear validity. The main thing we care about is that we are actually measuring what we think we are. In designing, reporting, and evaluating (via peer review) studies, the central concern is validity. This concern is often phrased in terms of a study being “well controlled”: Does it rule out alternative interpretations? For example, Karen Wynn’s infant addition studies (referenced in reports of XX’s ethnography) had to make the case that they were really measuring addition. If infants’ behavior could be explained some other way (e.g., a preference for 2 objects rather than 1), then there is a threat to validity. Validity is also central to the significance of the work. If it is not clear what the study is measuring, then it is not clear why it is important to know how the results turned out. Debates about a study, and the peer-review process, will generally focus on these validity issues.

The replication literature has drawn attention to a number of research practices that threaten the reliability of results (e.g., selective reporting of measures, unclear “stopping” rules for data collection). Statistical tests of experimental effects are measures of reliability: If the assumptions of these tests are violated, then the reliability is unclear. Such failures of reliability are a concern. But I would venture to state, on behalf of my colleagues in the discipline, that of far greater concern would be practices that systematically threaten the validity of research reports. Ideally, the science we produce would be both valid and reliable. But there is a sense in which reliability kind of takes care of itself. If a study makes a breakthrough in providing a valid way of assessing some interesting phenomenon, then other people will attempt to build on the work. It is frustrating when such attempts reveal failures to replicate, but a) the problem gets recognized and b) the effort can reveal important issues of validity (we thought the original study was measuring X, but now it looks like it was measuring Y). The problem is not (just) replicating results, but knowing which results are worth replicating.


In some sense this denial of a crisis in Psychology may actually make things seem worse than they already appeared: Psychologists don’t even know what to measure or why they are measuring it. That is certainly not the case in the physical sciences. But Psychology should not pretend to that status. The important advances and debates in the field concern just this problem of validity. Of course it is better to have reliable results. But concerns about reliability do not mean that anything goes or that (even unreliable) studies cannot be significant. There is a tendency to think of science as the accumulation of (reliable) results. But in Psychology we have plenty of facts, we just don’t know what to do with them. The real crisis in Psychology is not that many effects fail to replicate, but how little difference it would make to the state of the field if the effects turned out to be false.



  1. Poincaré would have been an awesome ending. Still, in 2022 the Supertramp album seems a particularly good fit. We clearly are in crisis. Always. It’s just not clear which crisis we are in— under whatever small protective umbrella we have been able to interpose between us and the elements. Anchor the towel a little more firmly and spread the shade incrementally?
  2. Despite my modest efforts, I think this is actually the best statement on the replication crisis: https://simplystatistics.org/posts/2016-08-24-replication-crisis/
