[Thanks to some people on Rationalist Tumblr, especially prophecyformula, for help and suggestions.]
There’s an old philosophers’ saying – trust those who seek the truth, distrust those who say they’ve found it. The psychiatry version of this goes “Trust those who seek biological underpinnings for mental illness, distrust those who say they’ve found them.”
Niculescu et al (2015) say they’ve found them. Their paper describes a process by which they hunted for biomarkers – in this case changes in gene expression – that predict suicide risk among psychiatric patients. They test various groups of psychiatric patients (including post-mortem tissue from suicide victims) to find some plausible genes. Then they use those genes to predict suicidality in two cohorts of about 100 patients each, including people with depression, schizophrenia, schizoaffective disorder, and bipolar disorder. They arrive at an impressive 92% AUC – that being the area under the curve graphing sensitivity against the false positive rate (one minus specificity), a common measure of the accuracy with which they can distinguish people who will vs. won’t be suicidal in the future.
The science press, showing the skepticism and restraint for which they are famous, jump on board immediately. A New Blood Test Can Predict Whether A Patient Will Have Suicidal Thoughts With More Than 90% Accuracy, says Popular Science. New Blood Test Predicts Future Suicide Attempts, says PBS.
There is a procedure for this sort of thing. The procedure is that the rest of us sit back and quietly wait for James Coyne, author of How To Critique Claims For A Blood Test For Depression, to tell us exactly why it is wrong. But it’s been over a week now and this hasn’t happened and I’m starting to worry he’s asleep on the job. So even though this is somewhat outside my area of expertise, let me discuss a couple of factors that concern me about this study.
The 92% accuracy claim is for the authors’ model, called UP-SUICIDE, which combines 11 biomarkers and two clinical prediction instruments. A clinical prediction instrument is a test which asks questions like “How depressed are you feeling right now?” or “How many times have you attempted suicide before?”. By combining the predictive power of the eleven genes and two instruments, they managed to reach the 92% number advertised in the abstract.
It might occur to you to ask “Wait, a test in which you can just ask people if they’re depressed and hate their life sounds a lot easier than this biomarker thing. Are we sure that they’re not just getting all of their predictive power from there?”
The answer is: no, we’re not sure at all, and as far as I can tell the study goes to great pains to make it hard to tell how much of the predictive power comes from the questionnaires.
Conventional wisdom says that clinical instruments for predicting suicidality can attain AUCs of 0.74 to 0.88. This is most of the way to the 0.92 shown in the current study, but not quite as high. But the current study combines two different clinical prediction instruments. In Combining Scales To Assess Suicide Risk, a Spanish team combines a few different clinical prediction instruments to get an AUC of…0.92.
If you look really closely at Niculescu et al’s big results table, you find that each of the individual prediction instruments they use does almost as well as – and in some cases better than – their UP-SUICIDE model as a whole. For example, when predicting suicidal ideation in all patients, the CFI-S instrument has an AUC of 0.89, compared to the entire model’s 0.92. When predicting suicide-related hospitalizations in depressed patients, the CFI-S has an AUC of 0.78, compared to the entire model’s 0.7. Here the biomarkers are just adding noise!
Are the cases where the entire model outperforms the CFI-S cases where the biomarkers genuinely help? We have no way of knowing. There are two clinical prediction instruments, the CFI-S and the SASS. Combined, they should outperform either one alone. So, for example, on suicidal ideation among all patients, the SASS has an AUC of 0.85, the CFI-S has an AUC of 0.89, and the model as a whole (both instruments combined + 11 biomarkers) has an AUC of 0.92. If we just combined the CFI-S and SASS, and threw out the biomarkers, would we do better or worse than 0.92? I don’t know and they don’t tell us. When all we’re doing is looking at the overall model, the biomarkers may be helping, hurting, or totally irrelevant.
So what if we throw out the clinical prediction instruments and just look at the biomarkers?
The authors use their panel of biomarkers for four different conditions: depression, bipolar, schizophrenia, and schizoaffective. And they have two different outcomes: suicidal ideation according to a test of such, and actual hospitalization for suicide. That’s a total of 4 x 2 = 8 tests that they’re conducting.
Of these eight different tests, the panel of biomarkers taken together comes back insignificant on seven of them.
And there’s such a thing as “trending towards significance”, but this isn’t it. Here, I’ll give p-values:
Depression/ideation: p = 0.26
Depression/hospitalization: p = 0.48
Schizoaffective/ideation: p = 0.46
Schizoaffective/hospitalization: p = 0.94
Schizophrenia/ideation: p = 0.16
Schizophrenia/hospitalization: p = 0.72
Bipolar/hospitalization: p = 0.24
The only test of the eight that comes out significant is bipolar/ideation, where p = 0.007. This is fine (well, it’s fine if it’s supposed to be post-Bonferroni correction, which I can’t be sure of from the paper). But I notice three things. Number one, there were only 29 people in this group. Number two, some of the most impressive looking genes for the ideation condition were worthless for the hospitalization condition. CLIP4, which got p = 0.005 for the ideation condition, got p = 0.91 for the hospitalization condition and actually had negative predictive value. Number three, some of the genes that best predicted bipolar in the validation data had no predictive value for bipolar at all in the training data, and were included only because they predicted major depressive disorder alone. Given that the effects jump across diagnoses and fail to carry over into even a slightly different method of assessing suicidality, this looks a lot less like a real finding and a lot more like a statistical blip.
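To make the Bonferroni worry concrete, here is a minimal sketch of the check. The p-values are the eight listed above; the significance level of 0.05 is my assumption (the conventional choice), and I am treating the reported numbers as uncorrected, which I can't confirm from the paper.

```python
# p-values for the biomarker panel, copied from the list above.
p_values = {
    "depression/ideation": 0.26,
    "depression/hospitalization": 0.48,
    "schizoaffective/ideation": 0.46,
    "schizoaffective/hospitalization": 0.94,
    "schizophrenia/ideation": 0.16,
    "schizophrenia/hospitalization": 0.72,
    "bipolar/hospitalization": 0.24,
    "bipolar/ideation": 0.007,
}

ALPHA = 0.05  # assumed conventional threshold, not taken from the paper


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, len(p_values))

# Tests that pass without any correction for multiple comparisons.
uncorrected = [name for name, p in p_values.items() if p < ALPHA]

# Tests that still pass once the threshold is divided by the number of tests.
survivors = [name for name, p in p_values.items() if p < threshold]

print(f"Bonferroni threshold for {len(p_values)} tests: {threshold:.5f}")
print("significant uncorrected:", uncorrected)
print("significant after Bonferroni:", survivors)
```

Under those assumptions the per-test threshold is 0.05 / 8 = 0.00625, so an uncorrected p = 0.007 would narrowly miss it. If the paper's number is already corrected, this check does not apply.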
Finally, note that even in bipolar ideation, their one apparent success, the biomarkers only got an AUC of 0.75, lower than either clinical predictive instrument. The only reason their model did better was because it added on the clinical predictive instruments themselves.
So here it looks like seven out of their eight tests failed miserably, one of them succeeded in a very suspicious way, and they covered over this by combining the data with the clinical predictive instruments which always worked very well. Then everyone interpreted this as the sexy and exciting result “biomarkers work!” rather than the boring result “biomarkers fail, but if you use other stuff instead you’ll still be okay.”
The absolute strongest conclusion you can draw from this study is “biomarkers may predict risk of suicidal ideation in bipolar disorder with an AUC of 0.75”. Instead, everyone thinks biomarkers predict suicidality and hospitalization in a set of four different disorders with AUC of 0.92, which is way beyond what the evidence can support.
II.
So much for that. Now let me explain why it wouldn’t matter much even if they were right.
AUC is a combination of two statistics called sensitivity and specificity. It’s a little complicated, but if we assume it means sensitivity and specificity are both 92% we won’t be far off.
Sensitivity is the probability that a randomly chosen positive case in fact tests positive. In this case, it means the probability that, if someone is actually going to be suicidal, the model flags them as high suicide risk.
Specificity is the probability that a randomly chosen negative case in fact tests negative. In this case, it means the probability that, if someone is not going to be suicidal, the model flags them as low suicide risk.
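For anyone who wants these definitions spelled out, here is a small sketch. The risk scores and the 0.5 cutoff are entirely made up for illustration and have nothing to do with the paper's data. It computes sensitivity and specificity at one cutoff, and AUC as the chance that a randomly chosen suicidal patient gets a higher score than a randomly chosen non-suicidal one.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that the test flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that the test clears."""
    return true_neg / (true_neg + false_pos)


def auc(pos_scores, neg_scores):
    """Probability a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores, invented for this example.
suicidal = [0.9, 0.8, 0.4]
not_suicidal = [0.7, 0.3, 0.2, 0.1]

CUTOFF = 0.5  # hypothetical: scores at or above this count as "high risk"

true_pos = sum(s >= CUTOFF for s in suicidal)
false_neg = len(suicidal) - true_pos
false_pos = sum(s >= CUTOFF for s in not_suicidal)
true_neg = len(not_suicidal) - false_pos

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
area = auc(suicidal, not_suicidal)

print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={area:.2f}")
```

Sensitivity and specificity depend on where you put the cutoff, while AUC summarizes performance across all possible cutoffs, which is why treating a 0.92 AUC as "92% sensitivity and specificity" is a simplification.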
In this study population, about 7.5% of their patients are hospitalized for suicidality each year. So suppose you got a million depressed people similar to these. 75,000 would be hospitalized for suicidality that year, and 925,000 wouldn’t.
Now, suppose you gave your million depressed people this test with a 92% sensitivity and specificity.
Of the 925,000 non-suicidal people, 92% – 851,000 – will be correctly evaluated as non-suicidal. 8% – 74,000 – will be mistakenly evaluated as suicidal.
Of the 75,000 suicidal people, 92% – 69,000 – will be correctly evaluated as suicidal. 8% – 6,000 – will be mistakenly evaluated as non-suicidal.
But this means that of the 143,000 people the test says are suicidal, only 69,000 – less than half – actually will be!
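The arithmetic above can be reproduced in a few lines. This is just a sketch of the same calculation, using the 7.5% base rate from the study population and the simplifying assumption that sensitivity and specificity are both 92%.

```python
POPULATION = 1_000_000
BASE_RATE = 0.075    # yearly rate in the study population, as quoted above
SENSITIVITY = 0.92   # simplifying assumption: read the AUC as sensitivity
SPECIFICITY = 0.92   # simplifying assumption: and as specificity

suicidal = round(POPULATION * BASE_RATE)
non_suicidal = POPULATION - suicidal

# People who really are suicidal, split by what the test says.
true_pos = round(suicidal * SENSITIVITY)
false_neg = suicidal - true_pos

# People who are not suicidal, split by what the test says.
true_neg = round(non_suicidal * SPECIFICITY)
false_pos = non_suicidal - true_neg

# Everyone the test flags, and the share of them who are actually suicidal.
flagged = true_pos + false_pos
share_correct = true_pos / flagged

print(f"flagged as suicidal: {flagged:,}")
print(f"actually suicidal among flagged: {true_pos:,} ({share_correct:.0%})")
```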
So when people say “We have a blood test to diagnose suicidality with 92% accuracy!”, even if it’s true, what they mean is that they have a blood test which, if it comes back positive, there’s still less than 50-50 odds the person involved is suicidal. Okay. Say you’re a psychiatrist. There’s a 48% chance your patient is going to be suicidal in the next year. What are you going to do? Commit her to the hospital? I sure hope not. Ask her some questions, make sure she’s doing okay, watch her kind of closely? You’re a psychiatrist and she’s your depressed patient, you would have been doing that anyway. This blood test is not really actionable.
And then remember that this isn’t the blood test we have. We have some clinical prediction instruments that do this, and we have a blood test which maybe, if you are very trusting, diagnoses suicidality in bipolar disorder with 75% accuracy. At 75% sensitivity and specificity, only twenty percent of the people who test positive will be suicidal. So what?
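The twenty percent figure comes from Bayes' rule. Here is a sketch that treats "accuracy" as equal sensitivity and specificity (the same simplification used throughout this section) and keeps the 7.5% base rate; the function name is mine, not anything from the paper.

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Chance that a person who tests positive is actually a positive case."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)


BASE_RATE = 0.075  # yearly rate in the study population, as quoted above

# Compare the biomarker-only figure (0.75) with the headline figure (0.92).
results = {}
for accuracy in (0.75, 0.92):
    results[accuracy] = positive_predictive_value(accuracy, accuracy, BASE_RATE)
    print(f"accuracy {accuracy:.0%}: {results[accuracy]:.1%} of positives are real")
```

With sensitivity and specificity tied together like this, a positive result only becomes better than a coin flip once accuracy passes one minus the base rate, which here is 92.5%.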
There will never be a blood test for suicide that works 100%, because suicide isn’t 100% in the blood. I am the most biodeterminist person you know (unless you know JayMan), I am happy to agree with Martin and Tesser that the heritability of learning Latin is 26% and the heritability of jazz is 45% and so on, but suicide is not just biological. Maybe people need some kind of biological predisposition to consider suicide. But whether they go ahead with it or not depends on whether they have a good or bad day, whether their partner breaks up with them, whether a friend hands them a beer and they get really drunk, et cetera. Taking all of this into account, it’s really unlikely that a blood test will ever get sensitive and specific enough to overcome these hurdles.
We should continue research on the biological underpinnings of depression and suicide, both for the sake of knowledge and because it might lead to better treatments. But having “a blood test for suicide” won’t be very useful, even if it works.