In nearly half the false negatives from both the mammogram and DWI datasets, the cancer was categorized as occult by two breast radiologists, meaning the cancer was invisible to a trained eye. The AI model's non-occult false negative rate on the mammography data is 19.3%.
For that 19.3% figure, see Table 2: 68 non-occult in AI-missed cancer, 285 non-occult in AI-detected cancer.
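As a quick check of that arithmetic (the 68 and 285 non-occult counts are the Table 2 figures quoted above):

    # Non-occult false negative rate, from the Table 2 counts quoted above.
    ai_missed_non_occult = 68
    ai_detected_non_occult = 285
    fn_rate = ai_missed_non_occult / (ai_missed_non_occult + ai_detected_non_occult)
    print(f"{fn_rate:.1%}")  # -> 19.3%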
This study did not compare the AI to a radiologist on a mixed set of healthy and cancer images.
Their conclusion of 39% is supported by the evidence. Furthermore, there is a persistent erroneous belief that mammographic sensitivity is 90-95%.
On the medical side, we need statistically significant tests that physicians can know and rely on - this paper was likely obsolete when it was published, depending on what "AI CAD" means in practice.
I think this impedance mismatch between disciplines is pretty interesting; any thoughts from someone who understands the med side better?
"The images were analyzed using a commercially available AI-CAD system (Lunit INSIGHT MMG, version 1.1.7.0; Lunit Inc.), developed with deep convolutional neural networks and validated in multinational studies [1, 4]."
It's presumably a proprietary model, so you're not going to get a lot more information about it, but it's also one that's currently deployed in clinics, so...it's arguably a better comparison than a SOTA model some lab dumped on GitHub. I'd add that the post headline is also missing the point of the article: many of the missed cases can be detected with a different form of imaging. It's not really meant to be a model shoot-out style paper.
* Kim, J. Y., Kim, J. J., Lee, H. J., Hwangbo, L., Song, Y. S., Lee, J. W., Lee, N. K., Hong, S. B., & Kim, S. (2025). Added value of diffusion-weighted imaging in detecting breast cancer missed by artificial intelligence-based mammography. La Radiologia Medica. Advance online publication. https://doi.org/10.1007/s11547-025-02161-1
Subsequent trials generally compare against the best known current treatment as the control instead.
This study has no such concerns. It's ethical to include images of non-cancerous breast tissue. The things are not comparable.
Vaccine studies today almost always use a previously approved vaccine as the "control" group. That isn't a true control and if you walk back the chain of approvals you'd be hard pressed to find a starting point that did use proper control groups.
Anyway, my point here wasn't to directly debate vaccines themselves, only to point out that, as someone without a career in health, I find it interesting to see essentially the same argument used in two different scenarios with drastically different common responses.
1) a double blind RCT with a placebo control is a very good way to understand the effectiveness of a treatment.
2) it's not always ethical to do that, because if you have an effective treatment, you must use it.
Even without a placebo control you can still estimate both FNs and FPs through careful study design; it's just harder and has more potential sources of error. A retrospective study is the usual approach. Here, the problem is that they only included true positives in the retrospective study, so they missed the opportunity to measure false positives.
And the problem with -that- is that it's very easy to have zero false negatives if you always say "it's positive". Almost every diagnostic instrument has what we call a receiver operating characteristic (ROC) curve that trades off false positives for false negatives by changing the threshold at which you decide something is a positive. By omitting the false positives, they present a very incomplete picture of the diagnostic capabilities.
(In medicine you will often see the terms "sensitivity" and "specificity" for how many TPs you detect and how many TNs you correctly call negative. It's all part of the same type of characterization.)
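To make that threshold tradeoff concrete, here's a minimal sketch with made-up scores and labels (nothing below comes from the study): lowering the threshold buys sensitivity at the cost of specificity, and vice versa.

    # Toy score-based detector; every number here is invented for illustration.
    scores = [0.05, 0.20, 0.35, 0.50, 0.60, 0.75, 0.90, 0.95]
    labels = [0,    0,    1,    0,    1,    1,    0,    1]  # 1 = cancer, 0 = healthy

    for threshold in (0.3, 0.5, 0.7):
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
        sensitivity = tp / (tp + fn)   # fraction of cancers flagged
        specificity = tn / (tn + fp)   # fraction of healthy cases cleared
        print(f"threshold={threshold}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")

A cancer-only case series can measure the sensitivity column but tells you nothing about the specificity column.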
We may broadly agree that submitting a control group to a placebo treatment for a particular disease is immoral, but that doesn't mean such a study isn't necessary to prove out the efficacy or safety of the treatment. As for modelling, for example trying to estimate FN and FP, it can only ever indicate correlation at best and will never indicate likely causation.
If you have a new vaccine for a disease for which there is no existing vaccine you do a standard placebo controlled RCT which gives you a direct, high quality measurement of efficacy and side effects.
Not just vaccines: in every study on the effectiveness of a drug, especially when dealing with potentially life-threatening conditions, the same question is posed. From [0]:
> . . . ethical guidance permit the use of placebo controls in randomized trials when scientifically indicated in four cases: (1) when there is no proven effective treatment for the condition under study; (2) when withholding treatment poses negligible risks to participants; (3) when there are compelling methodological reasons for using placebo, and withholding treatment does not pose a risk of serious harm to participants; and, more controversially, (4) when there are compelling methodological reasons for using placebo, and the research is intended to develop interventions that can be implemented in the population from which trial participants are drawn, and the trial does not require participants to forgo treatment they would otherwise receive.
It was retrospective-only, i.e. a case series on women who were known to have breast cancer, so there were zero false negatives and zero true negatives, because all patients in the study truly had cancer.
The AI system was a ConvNet that was in commercial use circa 2021, which is when the data for this case series were collected.
Well yes, that's the denominator for determining sensitivity, which is what the headline claim is about.
Also, they need to set up their next paper:
> However, the retrospective, cancer-only design limits generalizability, highlighting the need for prospective multicenter screening trials for validation.
Does this mean that newer AI systems would perform significantly differently?
The main known way to improve performance on tasks like this is getting more data.
Wouldn't this mean that the AI identified them all as having cancer?
If we're saying there was a discrepancy and we're saying that all of the patients had cancer, then it would seem that there must have been some that were identified as not having cancer by AI.
Edit: I have a problem with the way the title uses "AI" as a singular unchanging entity. It should really be "An AI system misses nearly...". There is no single AI and models are constantly improving - sometimes exponentially better.
If it said "AI something", I'd be fine with it. It's a statement about that something, not about AI in general. Use it as an adjective (short for "AI-using" I guess?), not a noun.
I trust the meaning of this article is just that hospitals need to rethink any decision to replace all of their doctors today.
AI doesn't have that option yet.
The assumption when gathering these statistics is that, more or less, you can average these out, but with AI you might have a model with a literal 100% error rate or a model with a much lower error rate, and that changes a lot depending on the AI method it's using.
But it is. It's LLMs. There is no other "AI".
Haven't you read HN in the past 1-2 years?
If I pluck a guy off the street, get him to analyze a load of MRI scans, and he doesn't correctly identify cancer from them, I'm not going to publish an article saying "Humans miss X% of breast cancers", am I?
In the end it is on the model marketer to prove that what they sell can do what it claims. And counterexamples are a fully valid thing to then release.
Why does this matter? Because procurement in the medical world is a pain in the ass. And no medical center wants to be dealing with 32 different startups each selling their own specific cancer detection tool.
If the TechBros fail us here, we may then assume they may fail us everywhere else as well.
I don't think we can reach any conclusive verdict about the promise of ML for radiography right now; given the life-critical nature of the application, it's in the unusable middle. It might get better in a few years or it might not. Time will tell.
1. They only tested 2 radiologists, and they compared them to one model. Thus the results don't say anything about how radiologists in general perform against AI in general. The most generous thing the study can say is that 2 radiologists outperformed a particular model.
2. The radiologists were only given one type of image, and then only for those patients that were missed by the AI. The summaries don't say whether the test was blind. The study has 3 authors, all of whom appear to be radiologists, and it mentions that 2 radiologists looked at the AI-missed scans. This raises questions about whether the test was blind or not.
Giving humans data they know are true positives and saying “find the evidence the AI missed” is very different from giving an AI model also trained to reduce false positives a classification task.
Humans are very capable at finding patterns (even if they don’t exist) when they want to find a pattern.
Even if the study was blind initially, trained humans doctors would likely quickly notice that the data they are analyzing is skewed.
Even if they didn’t notice, humans are highly susceptible to anchoring bias.
Anchoring bias is a cognitive bias where individuals rely too heavily on the first piece of information they receive (the "anchor") when making subsequent judgments or decisions.
The skewed nature of the data has a high potential to amplify any anchoring bias.
If the experiment had controls, any measurement error resulting from human estimation errors could potentially cancel out (a large random sample of either images or doctors should be expected to have the same estimation errors in each group). But there were no controls at all in the experiment, and the sample size was very small. So the influence of estimation biases on the result could be huge.
From what I can read in the summary, these results don’t seem reliable.
Am I missing something?
The utility of the study is to evaluate potential AI sensitivity if it were used for mass, fully automated screening of mammography data. But it says NOTHING about the CRUCIAL false positive rate (no healthy controls) and NOTHING about AI vs. human performance.
See my main comment elsewhere in this thread.
Can you clarify?
I also hinted at the fact that I only had access to the posted summary and the original linked article, and not the study. So if there is data I am missing… please enlighten me.
> Am I missing something?
Yes. The article is not about AI performance vs human performance.
> Humans are very capable at finding patterns (even if they don’t exist) when they want to find a pattern
Ironic
It also has the following quotes:
1. "The results were striking: 127 cancers, 30.7% of all cases, were missed by the AI system"
2. "However, the researchers also tested a potential solution. Two radiologists reviewed only the diffusion-weighted imaging"
3. "Their findings offered reassurance: DWI alone identified the majority of cancers the AI had overlooked, detecting 83.5% of missed lesions for one radiologist and 79.5% for the other. The readers showed substantial agreement in their interpretations, suggesting the method is both reliable and reproducible."
So, if you are saying that the article is "not about AI performance vs human performance", that's not correct.
The article very clearly makes claims about the performance of AI vs the performance of doctors.
The study doesn't have the ability to state anything about the performance of doctors vs the performance of AI, because of the issues I mentioned. That was my point.
But the study can't state anything about the sensitivity of AI either, because it doesn't compare the sensitivity of AI-based mammography (X-ray) analysis with that of human-reviewed mammography. Instead it compares AI-based mammography vs. human-read DWI, where the humans knew the results were all true positives. It's both a different task ("diagnose" vs. "find a pattern to verify an existing diagnosis") and different data (X-ray vs. MRI).
So, I don't think the claims from the article are valid in any way. And the study seems very flawed.
Also, attempting to measure sensitivity without also measuring specificity seems doubly flawed, because there are very big tradeoffs between the two.
Increasing sensitivity while also decreasing specificity can lead to unnecessary amputations. That's a very high cost. Also, studies have apparently shown that high false positive rates for breast cancer can lead to increased cancer risk because they deter future screening.
Given that I don't have access to the actual study, I have to assume I am missing something. But I don't think it's what you think I'm missing.
There is another comment, quite correctly noting that this result is on 100% positive input. The same AI in "real life" would probably end up scoring much better. But as you point out, if used as a confirmation tool, it is definitely bad.
Either I don't understand your reasoning or you are very much wrong. A "real life" dataset would contain real negatives too, and the result would be equal if the false positive rate were zero and strictly worse if the rate were any higher. One should expect the same AI to score significantly worse in a real-life setting.
What I mean by "score" is having a relatively high accuracy.
Come on, let's do the math: the incidence of BC is 1 in every 12, let's say. Now let's say we have 12,000 patients:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (1000 + 10700) / (1000 + 10700 + 300 + 0) = 11700 / 12000 = 0.975. The test is 97.5% accurate… pretty impressive, huh?
Tell me if I'm wrong. It's a known fact that you have to be careful when doctors speak of % accuracy.
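Here is that back-of-the-envelope as a runnable sketch, using the hypothetical numbers above (1-in-12 incidence, 12,000 patients, zero false negatives, 300 false positives); none of these figures come from the study:

    # Hypothetical screening cohort from the comment above.
    patients = 12_000
    positives = patients // 12            # 1,000 true cancers
    negatives = patients - positives      # 11,000 healthy

    tp, fn = positives, 0                 # assume every cancer is flagged
    fp = 300
    tn = negatives - fp                   # 10,700 healthy correctly cleared

    accuracy = (tp + tn) / patients       # 0.975
    sensitivity = tp / (tp + fn)          # 1.0
    specificity = tn / (tn + fp)          # ~0.973
    print(f"accuracy={accuracy:.1%}, sensitivity={sensitivity:.1%}, specificity={specificity:.1%}")

Accuracy looks great even though 300 healthy people got flagged, which is exactly why a headline accuracy figure needs the sensitivity/specificity breakdown next to it.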
There's a roundup of such findings here, but they're a mixed bag: https://www.uxtigers.com/post/humans-negative-value I suspect you need careful process design to get better outcomes, and it's not one-size-fits-all.
> Their findings offered reassurance: DWI alone identified the majority of cancers the AI had overlooked, detecting 83.5% of missed lesions for one radiologist and 79.5% for the other.
The combination of AI and this DWI methodology seems to identify most of the cancers, but about 20% of that one-third still gets missed. I assume that, as these were confirmed diagnoses, they were caught with another method beyond DWI.
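For a rough sense of the combined pipeline, here is a back-of-the-envelope sketch using only the rates quoted from the article (30.7% AI-missed; 83.5% and 79.5% DWI catch rates), ignoring the occult-cancer caveat raised elsewhere in the thread:

    # Overall fraction of cancers missed by "AI mammography first,
    # radiologist DWI read on the AI-missed cases second".
    ai_miss = 0.307                       # cancers missed by the AI on mammography
    for dwi_catch in (0.835, 0.795):      # per-reader DWI detection of AI-missed lesions
        overall_miss = ai_miss * (1 - dwi_catch)
        print(f"{overall_miss:.1%}")      # ~5.1% and ~6.3%

So roughly 5-6% of the cancers in this cohort would have slipped past both steps.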
I can detect 100% by

    def detect(x):
        return True

From the paper:
> Two cancers had abnormality scores greater than 10 but were not correctly localized and were therefore categorized as AI-missed.
The setup:
1. 400-some confirmed cancer patients.
2. The AI reads mammography ONLY and missed about 1/3.
3. On those AI-missed patients, radiologists do a second read on MRI, which is the gold standard for a differential.
Evidence: the referenced paper at the bottom, "Added value of diffusion-weighted imaging in detecting breast cancer missed by artificial intelligence-based mammography."
So, the whole point it (or its referenced paper) is trying to make is: mammography sucks, MRI is much better, which is a KNOWN FACT.
Now, let me give you some more background missing from the paper:
1. Why does mammography suck? Well, go google/gpt some images: it's essentially an X-ray of the breast, which compresses a 3D volume into a 2D average-pooled plane, and that is information-lossy. So, AI or not, the sensitivity is limited by the modality.
2. How bad/good is mammography AI? I would say 80-85% sensitivity against a very thorough and experienced radiologist without producing an unbearable amount of FPs, which probably translates to about 2/3 sensitivity against a real cancer cohort, so the referenced number is about right.
3. Mammography sucks, so what's the point? It's cheap AND fast; you can probably do a walk-in and get an interpretation back in hours, whereas for MRI you probably need to schedule 2 weeks ahead, if not MORE. For yearly screening, it works for the majority of the population.
And a final pro tip:
1. Breast tumors are more prevalent than you think (maybe 1 in 2 by age 70+).
2. Most guidelines recommend that women 45+ get a yearly checkup.
3. If you have dense breasts (basically small and firm), add ultrasound screening to make sure.
4. Breastfeeding does good for both the mother and child, so do that.
peace & love
I'll supplement by directing others to consider how number needed to screen may be a more useful metric than mammographic sensitivity when making policy decisions. They're related, obviously, but only one of them concerns outcomes.
I hope something good comes out of this, as I have known women whose lives were deeply affected by this.
This is Skynet 2.0 or 3.0. But shit. James Cameron may have to redo The Terminator, to include AI. Then again, who would watch such a movie?
"One AI is not great" is not an interesting finding and certainly not conclusive of "AI can't help or do the job".
It's like saying "some dude can't detect breast cancer" and suggesting all humans are useless.
*Compared to a human.
presumptively [0]
[0] accounting for false positives, screening costs for true negatives, etc. etc.
An increase in the false negative rate significantly reduces the survival rate and increases the cost of treatment. We have a huge multiplication factor here, so decreasing the false negative rate is the net-positive option at relatively low rates.
Based on my very superficial medical understanding, screening is already the cheap part. But every false positive would lead to a doctor follow-up at best and a biopsy at worst, not to mention the significant psychological effects this has on a patient.
So I would counter that the potential increase in false-positive MRI scans could be enough to tip the scale and make screening less useful.
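As a purely illustrative sketch of the tradeoff being argued here, with every cost and rate invented for illustration (nothing below comes from the study or from real health-economics data):

    # Expected cost per screened patient at two hypothetical operating points.
    # All costs and rates are invented placeholders.
    prevalence = 1 / 12
    cost_fn = 100_000   # assumed cost of a missed cancer (later-stage treatment, worse outcome)
    cost_fp = 1_000     # assumed cost of a false positive (follow-up, biopsy, anxiety)

    operating_points = {
        "higher sensitivity": {"sens": 0.95, "spec": 0.85},
        "lower sensitivity":  {"sens": 0.80, "spec": 0.95},
    }

    for name, op in operating_points.items():
        fn_per_patient = prevalence * (1 - op["sens"])
        fp_per_patient = (1 - prevalence) * (1 - op["spec"])
        expected_cost = fn_per_patient * cost_fn + fp_per_patient * cost_fp
        print(f"{name}: ~{expected_cost:.0f} per screened patient")

With a large FN-to-FP cost ratio the higher-sensitivity point wins, which is the multiplication-factor argument above; if the follow-up burden per false positive were much higher, the conclusion would flip, which is the counterpoint.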