This article gave an LLM a bunch of health metrics, asked it to reduce them to a single score, didn't tell us any of the actual metric values, and then compared that score to a doctor's opinion. Why anyone would expect these to align is beyond my understanding.
The most obvious thing that jumps out to me is that I've noticed doctors generally, for better or worse, consider "health" much differently than the fitness community does. It's different toolsets and different goals. If this person's VO2 max estimate was under 30, that's objectively a poor VO2 max by most standards, and an LLM trained on the internet's entire repository of fitness discussion is likely going to give this person a bad score in terms of cardio fitness. But a doctor who sees a person come in who isn't complaining about anything in particular, moves around fine, doesn't have risk factors like age or family history, and has good metrics on a blood test is probably going to say they're in fine cardio health regardless of what their wearable says.
I'd go so far as to say this is probably the case for most people. Your average person is in really poor fitness-shape but just fine health-shape.
It gave you a probabilistic output. There were no guns and nothing to stick to. If you had disrupted the context with enough countervailing opinion it would have "relented" simply because the conversational probabilities changed.
I don't get it... a doctor ordered the blood work, right? And surely they did not have this opinion or you would have been sent to a specialist right away. In this case, the GP who ordered the blood work was the gatekeeper. Shouldn't they have been the person to deal with this inquiry in the first place?
I would be a lot more negative about "the medical establishment" if they had been the ones who put you through the trauma. It sounds like this is a story of putting yourself through trauma by believing "Dr. GPT" instead of consulting a real doctor.
I will take it as a cautionary tale, and remember it next time I feed all of my test results into an LLM.
What model?
Care to share the conversation? Or try again and see how the latest model does?
I would never let an LLM make an amputate or not decision, but it could convince me to go talk with an expert who sees me in person and takes a holistic view.
You can't feed an LLM years of meteorological time-series data and expect it to work like a specialized weather model, and you can't feed it years of medical time series and expect it to work like a model specifically trained and validated on that kind of data.
An LLM generates a stream of tokens. If you feed it a giant set of CSVs and it wasn't RL'd to do something useful with them, it will just try to make whatever sense of them it can and generate something that most probably has no strong numerical relationship to your data. It will simulate an analysis; it won't actually do one.
You may have a giant context window, but attention is sparse: the attention mechanism doesn't see your whole dataset at the same time. It can do some simple comparisons, like figuring out that if I say my current blood pressure is 210/180 I should call an ER immediately. But once I send it a time series of my twice-daily blood-pressure measurements for the last 10 years, it can't make any real sense of it.
Indeed, it would have been better for the author to ask the LLM to generate a Python notebook to do some data analysis on it, and then run the notebook and share the result with the doctor.
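As a rough sketch of what that notebook might do (the file name and column names below are hypothetical placeholders, not anything from the article):

```python
# Minimal sketch of the kind of deterministic analysis the LLM could be asked
# to generate instead of eyeballing raw CSV rows. The file and column names
# (bp_log.csv, date, systolic, diastolic) are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("bp_log.csv", parse_dates=["date"])

# Descriptive statistics over the whole history
print(df[["systolic", "diastolic"]].describe())

# Yearly means, to expose long-term trends a doctor can read at a glance
yearly = df.groupby(df["date"].dt.year)[["systolic", "diastolic"]].mean()
print(yearly)

# Flag readings at or above the commonly cited 180/120 crisis threshold
crisis = df[(df["systolic"] >= 180) | (df["diastolic"] >= 120)]
print(f"{len(crisis)} readings at or above 180/120")
```

Nothing fancy, but at least the numbers the doctor sees are computed deterministically instead of being "simulated" token by token.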
That is not a plausible outcome of the current technology or of any of OpenAI's demonstrated capabilities.
There's plenty of blame to go around for everyone, but at least for some of it (such as the above) I think the blame rests more with Apple for misrepresenting the quality of their product (and TFA seems pretty clearly to be blasting OpenAI for this, not others like Apple).
What would you expect the behavior of the AI to be? Should it always assume bad data or potentially bad data? If so, that seems like it would defeat the point of having data at all, as you could never draw any conclusions from it. Even disregarding statistical outliers, it's not at all clear which part of the data is "good" vs "unreliable", especially when the company that collected that data claims it's good data.
Behind the scenes, it's using a pretty cool algorithm that combines deep learning with physiological ODEs: https://www.empirical.health/blog/how-apple-watch-cardio-fit...
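I don't know Apple's actual model beyond that post, but as a toy illustration of the "physiological ODE" half: a classic first-order model of heart-rate response to workload can be fit to per-workout watch data, and the fitted parameters track cardio fitness. Everything below (the model form, parameter names, data shapes) is an assumption for illustration, not Apple's algorithm:

```python
# Toy illustration only: a first-order ODE of heart-rate response to workload,
#   dHR/dt = (HR_rest + gain * u(t) - HR) / tau,
# fit to per-second workout samples. NOT Apple's actual algorithm; the deep
# learning part (estimating u(t), personalizing parameters) is omitted.
import numpy as np
from scipy.optimize import least_squares

def simulate(params, u, hr0, dt=1.0):
    hr_rest, gain, tau = params
    hr = np.empty(len(u))
    hr[0] = hr0
    for t in range(1, len(u)):
        target = hr_rest + gain * u[t]            # steady-state HR for this workload
        hr[t] = hr[t - 1] + dt * (target - hr[t - 1]) / tau
    return hr

def fit(u, hr_measured, dt=1.0):
    """Fit (hr_rest, gain, tau) to one workout's workload and heart-rate traces."""
    residual = lambda p: simulate(p, u, hr_measured[0], dt) - hr_measured
    return least_squares(residual, x0=[60.0, 30.0, 40.0],
                         bounds=([30, 0, 5], [100, 200, 300])).x
```

A deep-learning component would sit around something like this, e.g. estimating the workload signal from sensors and personalizing the parameters across workouts.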
Then there are confounders like altitude and elevation gain that can sully the numbers.
It can be pretty great, but it needs a bit of control in order to get a proper reading.
Seems like Apple's 95% accuracy estimate for VO2 max holds up.
> Thirty participants wore an Apple Watch for 5-10 days to generate a VO2 max estimate. Subsequently, they underwent a maximal exercise treadmill test in accordance with the modified Åstrand protocol. The agreement between measurements from Apple Watch and indirect calorimetry was assessed using Bland-Altman analysis, mean absolute percentage error (MAPE), and mean absolute error (MAE).
> Overall, Apple Watch underestimated VO2 max, with a mean difference of 6.07 mL/kg/min (95% CI 3.77–8.38). Limits of agreement indicated variability between measurement methods (lower -6.11 mL/kg/min; upper 18.26 mL/kg/min). MAPE was calculated as 13.31% (95% CI 10.01–16.61), and MAE was 6.92 mL/kg/min (95% CI 4.89–8.94).
> These findings indicate that Apple Watch VO2 max estimates require further refinement prior to clinical implementation. However, further consideration of Apple Watch as an alternative to conventional VO2 max prediction from submaximal exercise is warranted, given its practical utility.
https://pmc.ncbi.nlm.nih.gov/articles/PMC12080799/

There was plenty of other concerning stuff in that article. And from a quick read, it wasn't suggested or implied that the VO2 max issue was the deciding factor in the original F score the author received. The article did suggest, many times over, that ChatGPT is really not equipped for the task of health diagnosis.
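For what it's worth, the agreement metrics quoted in that abstract (Bland-Altman mean difference and limits of agreement, MAE, MAPE) are straightforward to compute from paired measurements. A minimal sketch, assuming two same-length arrays of watch and lab values (the sign convention below is my choice):

```python
import numpy as np

def agreement_stats(watch, lab):
    """Bland-Altman mean difference / limits of agreement, MAE, and MAPE
    for paired VO2 max estimates (same units, e.g. mL/kg/min)."""
    watch, lab = np.asarray(watch, float), np.asarray(lab, float)
    diff = lab - watch                       # positive => watch underestimates
    bias = diff.mean()
    loa = (bias - 1.96 * diff.std(ddof=1),   # lower limit of agreement
           bias + 1.96 * diff.std(ddof=1))   # upper limit of agreement
    mae = np.abs(diff).mean()
    mape = 100 * np.mean(np.abs(diff) / lab)
    return bias, loa, mae, mape
```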
> There was another problem I discovered over time: When I tried asking the same heart longevity-grade question again, suddenly my score went up to a C. I asked again and again, watching the score swing between an F and a B.
Yes. You, and every other reasoning system, should always challenge the data and assume it’s biased at a minimum.
This is better described as “critical thinking” in its formal form.
You could also call it skepticism.
That impossibility of drawing conclusions assumes there’s a correct answer and is called the “problem of induction.” I promise you a machine is better at avoiding it than a human.
Many people freeze up or fail with too much data - put someone with no experience in front of 500 ppl to give a speech if you want to watch this live.
Well, I would expect the AI to provide the same response a real doctor would from the same information, which, as the article went over, the doctors were able to do.
I would also expect the AI to provide the same answer every time for the same data, unlike what it did (swinging from F to B over multiple attempts in the article).
OpenAI is entirely to blame here when they put out faulty products (hallucinations, even on accurate data, are their fault).
There is this constant debate about how accurately VO2 max is measured, and it's highly dependent on actually doing exercise to determine your VO2 max using your watch. But yes, if you want a lab/medically precise measure, you need to do a test that measures your actual oxygen uptake.
I would claim that ignoring the "ChatGPT is AI and can make mistakes. Check important info." text, right under the box where they type their query in the client, is clearly more irresponsible.
I think that a disclaimer like that is the most useful and reasonable approach for AI.
"Here's a tool, and it's sometimes wrong." means the public can have access to LLMs and AI. The alternative, that you seem to be suggesting (correct me if I'm wrong), means the public can't have access to an LLM until they are near perfect, which means the public can't ever have access to an LLM, or any AI.
What do you see as a reasonable approach to letting the public access these imperfect models? Training? Popups/agreement after every question "I understand this might be BS"? What's the threshold for quality of information where it's no longer considered "broken"? Is that threshold as good as or better than humans/news orgs/doctors/etc?
I live in a place where getting a blood test requires a referral from a doctor, who is also required to discuss the results with you.
Considering the number of people who take LLM responses as authoritative Truth, that wouldn't be the worst thing in the world.
If you are lucky, maybe it was fine-tuned to see a long comma-delimited sequence of values as a table and then emit a series of tool calls to generate some deterministic code that calculates a set of descriptive statistics, which will then land close in the latent space to some hopefully current medical literature, and it will generate something that makes sense and isn't absurdly wrong.
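To make that concrete, the "lucky" path is the model emitting a tool call that runs deterministic code rather than free-associating over the numbers. A purely hypothetical sketch of what such a call might look like (the run_python tool and its argument schema are invented for illustration, not OpenAI's actual tooling):

```python
# Hypothetical shape of a tool call an RL'd model might emit so that the
# numbers get crunched deterministically instead of being "simulated" in-token.
# The tool name and arguments schema are made up for illustration.
tool_call = {
    "name": "run_python",
    "arguments": {
        "code": (
            "import pandas as pd\n"
            "df = pd.read_csv('metrics.csv')\n"
            "print(df.describe())\n"
        )
    },
}
```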
It is a fucking LLM, it is not 2001's HAL.
The basic idea was to adapt JEPA (Yann LeCun's Joint-Embedding Predictive Architecture) to multivariate time series, in order to learn a latent space of human health from purely unlabeled data. Then we tested the model with supervised fine-tuning and evaluation on a bunch of downstream tasks, such as predicting a diagnosis of hypertension (~87% accuracy). In theory, this model could also be aligned to the latent space of an LLM, similar to how CLIP aligns a vision model to an LLM.
IMO, this shows that accuracy in consumer health will require specialized models alongside standard LLMs.
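For anyone curious, a JEPA-style objective on multivariate time series boils down to: split the series into patches, hide some of them from a context encoder, and predict the latents of the hidden patches against an EMA target encoder. A minimal, illustrative PyTorch sketch (the tiny encoder, the crude masking, and the shapes are my simplifications, not the model described above):

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Toy encoder: embeds (channels x patch_len) patches and runs a Transformer."""
    def __init__(self, channels, patch_len, dim=128):
        super().__init__()
        self.proj = nn.Linear(channels * patch_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches):                  # patches: (B, N, channels * patch_len)
        return self.encoder(self.proj(patches))  # -> (B, N, dim)

def jepa_step(context_enc, target_enc, predictor, patches, mask, opt, ema=0.996):
    """One JEPA-style training step on a batch of patched multivariate series.
    mask: (B, N) bool, True where patches are hidden from the context encoder."""
    with torch.no_grad():                        # target latents from the EMA encoder
        targets = target_enc(patches)
    ctx_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # crude masking
    pred = predictor(context_enc(ctx_in))        # predict latents at every position
    loss = ((pred - targets) ** 2)[mask].mean()  # score only the masked positions
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                        # EMA update of the target encoder
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()
```

Here `predictor` can be as small as `nn.Linear(dim, dim)`, the optimizer should cover only the context encoder and predictor, and the target encoder starts as a copy of the context encoder; the supervised fine-tuning mentioned above would then attach a small head to the learned encoder.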
Paywall-free version at https://archive.ph/k4Rxt