Nature Medicine volume 32, pages 1152–1159 (2026)
Clinical evaluations of large language models (LLMs) have rapidly expanded since 2022, yet their evidence base remains opaque. The overwhelming volume of studies creates challenges for manual curation and review. However, LLMs themselves offer the scalability and capability to evaluate the ever-growing evidence base. This LLM-assisted review identified 4,609 peer-reviewed studies in clinical medicine between January 2022 and September 2025, equating to roughly 3.2 papers per day. Only 1,048 studies used real-world patient data, and of these, only 19 were prospective randomized trials; most addressed simulated scenarios (n = 1,857) or exam-style tasks (n = 1,704). ChatGPT and related OpenAI models constituted 65.7% of evaluated models, with Gemini/Bard a distant second at 13.1%. Patient-facing communication and education comprised 17% of tasks, followed by knowledge retrieval and clinical question answering, and education, assessment and simulation. Across 1,046 head-to-head comparisons, LLMs outperformed humans in 33% of comparisons, with a strong dependency on task realism and the comparator's level of training. At least 25% of studies had sample sizes below 30. Despite the growth of LLMs in medicine, rigorous, patient-centered evidence remains scarce, underscoring the need for larger prospective trials before clinical adoption.
Since the public release of ChatGPT in November 2022, large language models (LLMs) have catalyzed a paradigm shift in medical artificial intelligence (AI). LLMs have demonstrated emergent reasoning—the spontaneous appearance of complex problem-solving and abstract inference abilities that were not explicitly programmed—alongside in-context learning, in which models adapt to new tasks by leveraging examples or information provided within the prompt, and instruction-following, which is the capacity to interpret and execute natural language commands as intended by the user1. These capabilities have sparked widespread interest in LLM applications in clinical medicine. Use cases are broad; LLMs can be used to answer patient medical questions, summarize clinical notes and literature2 or even aid in clinical decision-making in complex cases3,4. Some have started to use LLMs to streamline medical documentation2, aid in medical education5,6 and perform operative healthcare tasks7.
Early studies illustrated that general-purpose LLMs such as GPT-3.5, GPT-4 and PaLM-2 could perform at or near passing levels on standardized medical examinations, including the United States Medical Licensing Examination8,9,10. Subsequent work expanded these tests to a growing set of unique tasks and domains, benchmarking LLMs against clinicians or patient-generated data3,4.
Despite the volume and velocity of new publications, the real-world clinical impact of LLMs remains poorly characterized. Many high-profile studies are simulation-based, retrospective or narrow in scope. Previous systematic reviews have been performed to assess the safety, fairness and transparency of LLM-based systems in clinical medicine; these reviews are limited by scope and are often conducted manually, introducing issues with scalability and reproducibility11,12,13,14.
Broadly, we began by scraping several databases for relevant studies. Next, we instructed a frontier LLM to screen all studies against our inclusion and exclusion criteria and classify each study into a predefined evidence tier. Lastly, we queried the LLM to extract and categorize relevant data fields from the study abstracts for further analysis. We validated and computed statistical bounds on the results of this automated review by comparison to ground truth, unseen human-labeled screening data.
We aimed to (1) estimate the true number and quality of clinical LLM studies, (2) quantify the evidence in these studies across task types and specialties, (3) identify methodological gaps and (4) provide a roadmap for future clinical and patient-centered LLM research. Lastly, we released an open-source browsing tool to allow for easy exploration and transparent review of the LLM-extracted fields with the codebase associated with this paper.
A search of PubMed, Embase and Scopus for articles published between 1 January 2022 and 6 September 2025 for strings with keywords such as ‘large language model,’ ‘GPT,’ ‘ChatGPT’ and other common language model names returned a total of 8,666, 5,633 and 9,315 records from Embase, PubMed and Scopus, respectively, yielding 12,894 unique studies after deduplication. The full query strings for each database are available in the Methods. We performed a human validation study of the filtering performance by randomly sampling 500 of the candidate works and assigning them to two independent groups of human reviewers, five reviewers each. The reviewers were given an option to include the study (‘Yes’), exclude the study (‘No’) or defer (‘Maybe?’). Deferring triggered an automatic tiebreak for that entry and counted as a disagreement for the purposes of inter-rater agreement calculations. The two human groups achieved an inter-rater agreement, as measured by Cohen’s κ, of 0.741 (95% confidence interval (CI) 0.685–0.796). A total of 64 out of 500 human verdicts required a tiebreak, most of which (51.6%) were due to one of the reviewers deferring. GPT-5 (reasoning effort set to high) achieved strong agreement with the tiebroken reviewer decisions, with a Cohen’s κ of 0.820 (95% CI 0.765–0.870) and 41 disagreements in total. A summary of this process is shown in Fig. 1 as a PRISMA15 flow diagram.
First, Embase, PubMed and Scopus were scraped, yielding 23,614 records. After title and digital object identifier deduplication, 12,894 studies remained. These studies were screened programmatically using an LLM, yielding 4,609 included studies, of which 500 were validated by humans. The included studies were then tiered programmatically with an LLM, with 250 validated by humans.
Using these human decisions as ground truth labels, we found high sensitivity (0.911; 95% CI 0.866–0.952) and specificity (0.921; 95% CI 0.892–0.949) for inclusion and exclusion. The bootstrapped estimates predicted the LLM would yield 4,644 included samples (within 1.0% of the observed number of included samples), with 669 (95% CI 431–929) false positives and 386 (95% CI 206–592) false negatives. Thus, we estimated the true number of studies published during this interval that fit the inclusion criteria to be 4,361 (95% CI 3,838–4,906). The actual numbers of studies filtered in and out by the LLM were 4,609 and 8,285, respectively. Between January 2022 and September 2025, approximately 3.2 studies on LLMs in clinical medicine were published per day.
Studies were then categorized into one of four tiers: Tier S represents gold standard prospective, randomized, controlled evaluations of deployed systems in live clinical environments; Tier I involves retrospective or prospective analyses on real clinical data; Tier II consists of simulated or synthetic clinical data and/or scenarios; and Tier III includes evaluations that test knowledge synthesis and recall, typically based on structured assessments of clinical knowledge that is not representative of clinical practice.
From the tiering procedure run on these included studies, we found an inter-rater Cohen’s κ of 0.645 (95% CI 0.560–0.726) between human reviewers, indicating good agreement. GPT-5 (reasoning effort set to high) achieved similar agreement with the tiebroken reviewer decisions, with a Cohen’s κ of 0.695 (95% CI 0.611–0.772) and 48 disagreements in total. The LLM had a macro-averaged sensitivity and specificity of 0.822 (95% CI 0.772–0.869) and 0.896 (95% CI 0.868–0.923), respectively. Although the sensitivity and specificity were lower on the tiering task than on inclusion screening, we found that, of all errors the LLM made, 84.8% were within one tier of the ground truth label. Among these off-by-one errors, there was a modest bias toward assigning a higher tier (53.9% of errors) compared to a lower tier (46.1% of errors). The full confusion matrix can be seen in Extended Data Fig. 1.
We found no Tier S studies in our human screening process, though the LLM found 21 Tier S studies out of the 4,609 total studies tiered by the LLM (less than 0.5% of the studies). We reviewed those studies manually and found 19 of them to be true positives. As a result, we group Tier S studies into Tier I studies for our Bayesian analysis due to an insufficient sample size. The LLM detected 1,094 Tier I studies, 1,767 Tier II studies and 1,727 Tier III studies. After constructing a Bayesian model using our prior human-validated screens, our model predicts that there are 1,048 (95% CI 847–1,252) Tier S/I studies, 1,857 (95% CI 1,427–2,280) Tier II studies and 1,704 (95% CI 1,273–2,134) Tier III studies.
We found a significant difference between the number of Tier I and Tier III studies (P = 6.6 × 10⁻³³) as well as between Tier I and Tier II studies (P = 1.6 × 10⁻³⁶). There were significantly fewer Tier S studies than studies in all other tiers (P < 10⁻³⁰⁷).
A notable increase in studies testing LLMs in a clinical context appeared after the release of ChatGPT (November 2022), as expected (Fig. 2a). We found two Tier I studies, both by Danilov et al., that evaluate the ability of ruGPT-3, a 760-million-parameter model trained on a Russian corpus, to predict length of stay for neurosurgical patients16,17. These studies were published on 14 January 2022 and 29 June 2022, and both found ruGPT-3 to be inferior to neurosurgeons. To the best of our knowledge, these are the first works to evaluate a generative LLM in a clinical context on real patient data. After ChatGPT’s release, the number of studies published per month increased roughly linearly (coefficient of determination R² = 0.84) at a rate of 7.04 studies per month (Fig. 2b). The rate of increase of studies published per month did not significantly differ between Tier I and Tier III studies (P = 0.75), nor between Tier II and Tier III studies (P = 0.22). The rate of increase in Tier II studies was significantly higher than that in Tier I studies (3.03 versus 2.03 studies per month; P = 0.02) (Fig. 2c–e).
a, The cumulative number of studies published for each tier throughout the period of time evaluated in this review. The significant jumps at January of each year are the result of database labels that are nonspecific and thus default to 1 January of the specified year. b–e, The number of studies published in each specified month for each tier: all tiers (b), Tier I (c), Tier II (d) and Tier III (e). For the purpose of analysis, all studies that did not have a month or day extracted from the database were spread evenly throughout the year.
The earliest Tier S study was published on 23 July 2024 and was a randomized controlled trial (RCT) comparing rates of smoking cessation over 42 days between participants who used a custom-designed LLM, QuitBot18, or the National Cancer Institute text-line, SmokefreeTXT, as a control. The trial found QuitBot yielded significantly higher smoking cessation rates (odds ratio 2.58, 95% CI 1.34–4.99; P = 0.005) than the control. To our knowledge, this is not only the first RCT involving LLMs, but it is also the first LLM RCT to outperform a well-established, existing control method. The rate at which the number of Tier S studies published each month increases is challenging to measure statistically due to the sparsity of these studies.
The vast majority of models (65.7%) evaluated are versions of either ChatGPT or OpenAI proprietary models. Beyond OpenAI’s models, the next most frequently studied models are Google’s Gemini and Bard (13.1%). The rarest models studied were Inflection’s ‘Pi’ chatbot (four studies), Google Assistant (four studies) and Amazon’s Alexa (three studies). The latter two language models are greatly understudied given their prevalence in many households and mobile devices (Fig. 3b). For example, it is reasonable to imagine a scenario in which a user spontaneously asks a household device for medical advice (for example, ‘Hey Alexa, I just cut my finger pretty deeply cutting onions, what do I do?’). Closed-source models were studied considerably more often than open-source models, comprising 87.7% of all models evaluated. The ratios of open-source to closed-source models did not differ significantly between evidence tiers. A full breakdown of study counts per model can be found in Supplementary Table 1.
a, The number of studies that mention each particular dataset type. b, The number of studies that evaluate each LLM listed. c, The percentage of studies that evaluate each listed specialty. Note that in all cases the sum of the values may be greater than the total number of studies because a given study may evaluate multiple datasets, models and/or specialties.
Of the abstracts screened, the plurality of studies evaluated two tasks (1,973 studies, 42.8%), with each study evaluating 1.93 tasks on average. The most frequently evaluated tasks were ‘patient-facing communication and education’ (1,545 studies, 17.4%) and ‘knowledge retrieval and clinical Q/A’ (1,128 studies, 12.7%), which comprised multiple-choice questions, vignettes, patient question answering and more. The next most frequently evaluated tasks were ‘education, assessment and simulation’ (1,064 studies, 12.0%), ‘diagnostic reasoning and disease detection’ (979 studies, 11.0%) and ‘clinical management and workflow guidance’ (833 studies, 9.36%). A full breakdown of the tasks studied can be found in Extended Data Table 1.
Information regarding the datasets evaluated (if applicable) was not detected in about one quarter of the abstracts (1,091 studies, 23.7%). Of the abstracts that did mention datasets, ‘clinician board and self-assessment questions’ was the most studied dataset type (1,047 studies, 22.7%). The next most common dataset types evaluated were ‘patient-facing Q&A and FAQs’ (691 studies, 15.0%), followed by ‘clinical vignettes and case reports’ (652 studies, 14.2%), ‘clinical guidelines and consensus statements’ (579 studies, 12.6%), ‘imaging data and reports’ (472 studies, 10.2%) and ‘real-world electronic health records’ (423 studies, 9.18%) (Fig. 3a). Of the datasets whose availability was ascertainable by the model (2,732 datasets, 47.0%), only 1,163 (42.6%) were open access, limiting reproducibility for the majority of datasets studied. A full breakdown of the datasets studied can be found in Extended Data Table 2.
On average, each work pertained to 1.63 specialties or subspecialties, with the most commonly studied specialties being ‘internal medicine’ (1,500 studies, 32.5%), ‘interventional and diagnostic radiology’ (743 studies, 16.1%), ‘preventative medicine’ (657 studies, 14.2%), ‘orthopedic surgery’ (447 studies, 9.69%) and ‘family medicine’ (422 studies, 9.16%) (Fig. 3c). Of the surgical studies, the most common subcategories were orthopedic surgery (448 studies, 33.0%), otolaryngology (322 studies, 23.7%), urology (299 studies, 22.0%), general surgery (275 studies, 20.3%) and plastic surgery (268 studies, 19.7%). The most common internal medicine subspecialties studied were oncology (335 studies, 28.3%), cardiology (180 studies, 15.2%), sleep medicine (153 studies, 12.9%), gastroenterology (144 studies, 12.2%) and infectious disease (127 studies, 10.7%). Study counts by specialty and subspecialty, and by tier, can be found in Supplementary Table 2.
Of the studies where the type of human comparators or evaluators was detected in the abstract (2,249 studies, 48.8%), 2,983 comparisons were found. There was a total of 1,914 comparisons (64.2%) compared against a medical doctor, with 1,493 (78.0%) of these medical doctors being attendings, 263 (13.7%) residents, 120 (6.3%) unspecified (or the level of training was undetected) and 38 (2.0%) fellows (Fig. 4d). A total of 257 comparisons (8.6%) had a patient or member of the general public evaluate model outputs.
Of the studies where the outcome of a comparison to human experts was detected in the abstract (1,046 studies; 22.7%), the LLM studied outperformed the human comparator in 345 studies (33.0%), underperformed the human comparator in 675 studies (64.5%) and showed mixed results in 26 studies (2.49%). Mixed results were excluded from the binary analysis of outperformance. Humans were outperformed by LLMs significantly more often in Tier III studies than in Tier I studies (38.4% versus 25.9%; P < 0.001) (Fig. 4a). The rates at which LLMs outperformed humans appeared to increase year over year in Tier I and II studies, whereas rates appeared to decrease over time in Tier III studies; however, none of these trends was statistically significant (P = 0.15, 0.17 and 0.15 for Tiers I, II and III, respectively) (Fig. 4b). The greater rate of outperformance in Tier II and III studies compared to Tier I studies across all years aligns well with the intuition that LLMs are considerably stronger at knowledge retrieval and synthesis tasks (for example, multiple-choice questions, ‘what’s the diagnosis,’ and so on) than at less objective, real-world clinical scenarios. Additionally, this discrepancy demonstrates that performance on contrived knowledge-based exams does not correlate well with real-world clinical practice. Broadly, there was no statistically significant difference in performance between open-source and closed-source models by tier or overall, though these comparisons are low-powered owing to the paucity of studies on open-source models.
a, The percentage of studies where the LLM being evaluated outperformed the human experts it was compared to. b, The percentage of studies where the LLM outperformed the human expert stratified by year and tier. c, The outperformance rate of LLMs against levels of experience. d, The overall composition of human experts in the studies (*P = 0.004606, **P = 0.021349, ***P = 0.031457; P values represent one-sided proportional z-test false discovery rate controlled by the Benjamini–Hochberg procedure). Md, medical doctor.
We find that the LLM outperforms attending physicians less often than unspecified medical doctors (P < 0.004) and medical students (P < 0.03). Additionally, nonphysician clinicians were outperformed by the LLM less often than unspecified medical doctors (P < 0.022). Residents were outperformed by LLMs 30% more often than attendings, though this difference was not significant after correction for multiple comparisons (P = 0.053) (Fig. 4c). Broadly, the outcome of human versus LLM comparisons depends strongly on the level of experience of the human in the comparison. A full list of the rates of outperformance by model and training level can be found in Supplementary Data Table 1, outperformance by specialty and tier can be found in Supplementary Data Table 2, and outperformance by model and task type can be found in Supplementary Data Table 3.
The LLM screen detected 3,289 studies (71.4%) that listed a sample size in the abstract. Of these studies, we find that a quarter have a sample size less than 20. The cumulative distribution of sample sizes, overall and for each tier, can be found in Fig. 5. Because sample sizes appeared in only 71.4% of abstracts, and 35% of these (1,151 studies) had a sample size less than 30, we can conservatively conclude as a lower bound (0.714 × 0.35 ≈ 0.25) that at least 25% of all studies included in this work have a sample size below 30.
Sample sizes above 1,000 are omitted for clarity.
The number of works investigating LLMs in a clinical context has exploded since the release of ChatGPT in November 2022, with around 3.2 papers published per day. The rate of publication has been increasing linearly since then, at approximately 7.04 additional studies per month. Despite this rapid increase in studies, only 1,048 studies on real data were detected, with only 19 of those studies being true RCTs. RCTs began appearing in 2024, though the rate of publication of RCTs has not risen significantly since then. The vast majority of studies evaluated (3,561 of 4,609; 77.3%) analyze data that are not considered real clinical data. Of all analyzed studies, 22.7% evaluate clinical board exams and self-assessment tests, and 14.2% evaluate clinical vignettes and case reports. When analyzing sample sizes, we find that at least 25% of all studies detected used a sample size below 30. Thus, all conclusions surrounding model performance should be interpreted cautiously.
Amid this surge in research, OpenAI’s models—particularly GPT-4 and GPT-4o—have been by far the most evaluated models. However, despite this extensive evaluation, the comparative performance of these models against human experts varies significantly based on context. Specifically, we find that, of the studies where a human comparator was detected, LLMs outperformed humans 33.0% of the time, with the rate decreasing with greater level of experience of the human expert and with the realism of the task. LLMs most frequently outperformed humans on knowledge-based evaluations on synthetic data (Tier III studies) compared to tests on real clinical data (Tier I studies), demonstrating that performance on contrived knowledge-based exams does not correlate well with performance in real-world clinical practice.
The distribution of studies investigating LLMs in a clinical context across specialties was highly skewed, with orthopedic surgery studied in one-third of all surgical studies and oncology, cardiology and gastrointestinal medicine comprising over 50% of medicine studies. These are some of the most competitive surgical and medical specialties, which may lead to greater pressure to publish works of all kinds, including clinical LLM studies. Broadly, there is no evidence that the distribution of evidence quality varies between more competitive specialties. A significant literature gap exists in exploring the capabilities and potential use cases of generative AI in many medical and surgical specialties; these fields are ripe for exploratory research.
The framework presented in this review provides a research roadmap for bringing generative AI into the clinical context from start to finish. First, if no available Tier III research exists for a clinical application, these studies can be conducted to assess the raw knowledge and capability of a given generative model. Once competency is established, Tier II research can be conducted to investigate the behavior of the generative model in a safe and less resource-intensive simulated context. Given success in these tiers, models can be evaluated on real-world data, both retrospectively and prospectively in Tier I studies at a sufficiently large sample size, encompassing the full, desired scope of future deployment. Finally, the models should ultimately be deployed in the clinical setting to measure outcomes in RCTs.
In any stage where human comparators are involved, care should be taken to compare the generative model against experts in the clinical task being measured, rather than comparing to qualified clinicians in other fields. For example, a generative model designed to interpret skin biopsies should be compared against a board-certified dermatopathologist, rather than a pathologist, dermatologist, resident or fellow. In a similar vein, studies that primarily compare against trainees should be discouraged unless there is a strong motivating factor or the study is low-tier or proof-of-concept.
Furthermore, significant gaps exist in the evaluation of open-source models and the development of open-access datasets. Open-source models are particularly critical for research as their results remain reproducible indefinitely, as opposed to experiments on closed-source models that may eventually be discontinued. Open-access datasets, while not always feasible given the sensitive nature of clinical data, are critical for rapid iteration and community scrutiny. In general, datasets should be created and evaluated with the intent to yield tangible insight; for example, it is challenging to draw meaningful conclusions from a ten-question multiple-choice exam curated from a mock board exam, even as a proof-of-concept.
A critical limitation of our work is that the automated analysis was performed on the title and abstract of a given work, rather than the full text. Ethically obtaining full texts of peer-reviewed scientific articles in a parsable format is a known challenge and a point of disconnect between scientific publishing and modern AI6. It is possible that clinically important nuances could be missed due to the lack of a comprehensive full-text review for each work. Fine-grained details surrounding methodology, data quality and much more may not be ascertainable from the abstract alone in many studies. We attempt to mitigate this with robust statistical analysis and human validation, and we quantify the uncertainty of our predictions and establish confidence intervals or lower bounds. Regardless, there remains uncertainty in the summary statistics gleaned from automated data extraction and classification. Notably, it was not possible to evaluate the performance of domain-specific LLMs trained specifically for clinical applications, as there are few such works, owing to the scarcity of data and the prohibitive cost of training and fine-tuning models. Many specialist models in the pulled studies use a generalist model with additional prompting or retrieval-augmented generation, which we did not analyze separately. Additionally, summative metrics that describe a broad range of fields, disciplines and tasks simultaneously should be interpreted with caution, given that many of the use cases and fields are fundamentally different from each other. The homogenization of these fields limits the conclusions that can be drawn from these aggregate statistics. With the rapid pace of advancement in LLMs, rapidly increasing context lengths and continually falling prices of application programming interface services, we anticipate that a full-text automated analysis of thousands of works will be possible in future work. While our work adhered to standard reporting guidelines for systematic reviews, it becomes increasingly critical for studies to follow structured guidelines, especially in biomedicine, as LLMs continue to integrate into the research process19.
Overall, we find a strong imbalance toward studies evaluating LLMs on data that are not representative of actual clinical practice and a strong overrepresentation of studies evaluating LLMs on question answering and knowledge assessment. Given the demonstrated strong performance of LLMs on clinical knowledge and synthetic clinical data, we hope this work motivates efforts to apply LLMs to real clinical data representative of real-world clinical scenarios. Additionally, underrepresented medical and surgical specialties deserve considerably more study to ensure that LLMs can benefit all of medical practice. Care must be taken to evaluate LLMs against the appropriate human experts, as our results show that comparisons to human performance depend significantly on the level of training of the expert. A strong bias toward studying closed-source models is present in the current literature. This is expected, as closed-source proprietary models are often high performing and are the most easily accessible and widespread models. However, it is critical that open-source models also receive significant attention in the future to enable widespread access to clinical LLMs. Lastly, given the early successes of the existing RCTs, we recommend that, after robust prospective validation on real data (Tier I studies), LLMs can and should be moved into RCTs to demonstrate their utility and enter clinical practice.
In conclusion, this systematic review demonstrates that, while research on LLMs in clinical medicine has grown at an extraordinary pace, the depth and clinical relevance of the evidence remain limited. Despite thousands of publications since late 2022, only a small fraction use real clinical data and just 19 randomized trials exist. A handful of medical and surgical specialties account for the majority of studies, while other fields remain understudied. Key gaps include the scarcity of studies focusing on open-source model development and evaluation, the lack of open-access datasets and small sample sizes. Our review offers a roadmap and recommendations for transforming this rapidly expanding but uneven evidence base into clinically meaningful progress.
This review was conducted and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta‑Analyses (PRISMA) 2020 guidelines. This review was not prospectively registered. A protocol was not prepared for this work.
We created a system for levels of evidence for LLM-based medical studies. We then introduced a scalable, LLM-assisted framework for evidence-tiered systematic review, and used this framework to assist in a systematic review of published studies evaluating LLMs in clinical medicine.
We searched PubMed, Embase and Scopus for studies published between 1 January 2022 and 6 September 2025. These databases were chosen to ensure comprehensive coverage of biomedical and clinical research. Search terms combined general descriptors of LLMs with specific model names (for example, GPT, ChatGPT, LLaMA, Claude, Gemini and Bard). To maximize specificity, we limited results to original research (articles, conference papers, preprints and letters) in health-related subject areas, and excluded reviews, meta-analyses, surveys and commentaries.
The data query string that was used for PubMed was as follows:
(“large language model”[Title/Abstract] OR “LLM”[Title/Abstract] OR “GPT”[Title/Abstract] OR “ChatGPT”[Title/Abstract] OR “LLaMA”[Title/Abstract] OR “Claude”[Title/Abstract] OR “Gemini”[Title/Abstract] OR “Bard”[Title/Abstract]) AND (humans[MeSH Terms]) NOT (review[Publication Type] OR meta-analysis[Publication Type] OR survey[Title])
The data query string that was used for Scopus was as follows:
TITLE-ABS (“large language model” OR llm OR gpt OR chatgpt OR llama OR claude OR gemini OR bard) AND PUBYEAR > 2021 AND PUBYEAR < 2026 AND (LIMIT-TO (SUBJAREA, “MEDI”) OR LIMIT-TO (SUBJAREA, “HEAL”) OR LIMIT-TO (SUBJAREA, “NURS”)) AND (LIMIT-TO (DOCTYPE, “ar”) OR LIMIT-TO (DOCTYPE, “le”) OR LIMIT-TO (DOCTYPE, “cp”))
The data query string that was used for Embase was as follows:
('large language model':ab,ti OR 'llm':ab,ti OR 'gpt':ab,ti OR 'chatgpt':ab,ti OR 'llama':ab,ti OR 'claude':ab,ti OR 'gemini':ab,ti OR 'bard':ab,ti) AND [humans]/lim NOT ('review'/it OR 'meta analysis'/it OR survey:ti) AND (2022:py OR 2023:py OR 2024:py OR 2025:py) AND ('Article'/it OR 'Article in Press'/it OR 'Conference Paper'/it OR 'Letter'/it OR 'Preprint'/it)
All retrieved records were deduplicated by digital object identifier and exact title. Titles and abstracts were then screened against predefined eligibility criteria by GPT-5 (methodology described below): studies were included if they reported original evaluations of LLMs on clinical tasks, and excluded if they used non-LLM models or applied LLMs solely to nonclinical contexts (for example, literature summarization or abstract screening). We additionally conducted a blinded manual human review of 500 randomly chosen studies from the initial pull (before LLM screening) to validate and characterize LLM screening performance rigorously. The precise inclusion and exclusion criteria (given to both humans and the LLM) can be found in the ‘screening_intructions.txt’ file under the Prompts folder on the GitHub repository.
Due to the immense number of unique studies found in the initial study pull, it was infeasible to screen for inclusion and/or exclusion entirely by hand. To address this challenge, we implemented an LLM-based screening pipeline using GPT-5 (reasoning mode) via OpenAI’s application programming interface and validated this approach by comparing it to human screening performance on a smaller subset of the data. We set GPT-5’s reasoning mode to ‘high’ for tasks that required complex decision-making, specifically inclusion and/or exclusion screening, evidence tier assignment and structured data extraction. We set reasoning mode to ‘minimal’ for tasks that were standard and well validated for language models, specifically natural language classification tasks. Many frontier models are available for use in this screening process. Ideally, we would have performed benchmarks and compared the results of full analysis across several frontier models, but this proved prohibitively expensive at scale. As a result, we chose the most recently released frontier model, which was GPT-5 at the time of analysis.
Each title and abstract pair was submitted with a standardized prompt instructing the model to classify the record as ‘include’ if it evaluated one or more LLMs on a clinical task and to exclude if the study used a non-LLM model (for example, convolutional networks, recurrent neural networks or vision transformers) or if the study used an LLM in a healthcare-related context but for a nonclinical task (for example, abstract screening, paper-writing, electronic health record data extraction). The full screening prompt is available in the GitHub repository. In total, 4,609 studies were included out of the pulled studies.
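To make the shape of this screening step concrete, the following is a minimal sketch of a single screening call using the OpenAI Python SDK; the model string, prompt path and verdict parsing are illustrative placeholders rather than the exact pipeline used in this review.

# Minimal sketch of an LLM-based inclusion/exclusion screen (illustrative only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCREENING_PROMPT = open("prompts/screening_instructions.txt").read()  # hypothetical path

def screen_record(title: str, abstract: str) -> str:
    """Return 'include' or 'exclude' for a single title/abstract pair."""
    response = client.chat.completions.create(
        model="gpt-5",                # placeholder model identifier
        reasoning_effort="high",      # high effort for complex screening decisions
        messages=[
            {"role": "system", "content": SCREENING_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
    )
    verdict = response.choices[0].message.content.strip().lower()
    return "include" if verdict.startswith("include") else "exclude"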
We additionally implemented a secondary ‘tiering’ phase, in which we instructed GPT-5, with reasoning set to ‘high’ and using a custom prompt (available in the GitHub repository), to assign each included study to one of the following tiers:
Tier S: real-world, prospective evaluations of a deployed system in a live clinical environment. These studies evaluate LLMs in a randomized, controlled, blinded (if applicable) study on real patient data in a clinically relevant, real-world task. We consider this the most robust tier of evidence, as results directly represent the effect of the LLM intervention on defined outcomes.
Tier I: retrospective or prospective evaluations on real, never-before-seen clinical data. The LLM need not be deployed in a live setting, nor do the evaluators need to be blinded to the methodology. We consider this to be one of the strongest tiers of evidence available, providing preliminary predictions of how a similarly designed Tier S study may perform.
Tier II: simulated clinical situations, open-ended free-response questions and subjective patient ratings. Data are not taken directly from a real clinical setting and are usually post-processed or synthesized, yet they still represent a task or scenario relevant to clinical practice (for example, simulated patient conversations, common online questions and specially designed open-ended vignettes: ‘what are the best next steps?’). We consider this a medium tier of evidence, as these studies assess clinical competency to some degree, but do not make predictions on how a system would perform in a real clinical setting.
Tier III: board exams, multiple-choice exams, and case studies and vignettes with a clear-cut answer (for example, ‘what’s the diagnosis?’). Typically, performance is measured in accuracy. These studies represent the lowest tier of evidence: they offer little insight into real-world clinical performance aside from knowledge retrieval and synthesis, a task at which LLMs have already been shown to perform robustly.
Although frontier LLMs are generally recognized to be highly performant in classification and data extraction tasks, we sought to validate the screening and tiering performance of our frontier LLM and to determine robust statistical bounds on the number of studies in each category.
For inclusion screening validation, we took a random subset of 500 articles (without replacement) from the deduplicated data pull and assigned two independent groups of human screeners (five nonoverlapping screeners each) to assign each study to the inclusion or exclusion group. Additionally, an independent screener provided tiebreaks for studies on which the groups disagreed. Screeners consisted of third- and fourth-year medical students, graduate students and residents. We computed several agreement metrics. First, we compared the agreement (Cohen’s κ) between the human screener groups. Next, we computed the agreement between the LLM decisions and the tiebroken human data, as well as characterizing various statistics of LLM performance compared to ground truth.
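As an illustration of these agreement calculations, the sketch below computes Cohen’s κ between two sets of screening verdicts together with a percentile bootstrap confidence interval; the verdict data here are simulated placeholders, not the study labels.

# Illustrative sketch of Cohen's kappa with a percentile bootstrap CI.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def kappa_with_ci(labels_a, labels_b, n_boot=2_000, alpha=0.05):
    """Cohen's kappa between two raters with a percentile bootstrap CI."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    point = cohen_kappa_score(labels_a, labels_b)
    n = len(labels_a)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample record indices with replacement
        boots.append(cohen_kappa_score(labels_a[idx], labels_b[idx]))
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return point, (lo, hi)

# Simulated include/exclude verdicts from two screener groups (placeholders)
group_1 = rng.choice(["include", "exclude"], size=500, p=[0.4, 0.6])
group_2 = group_1.copy()
flip = rng.random(500) < 0.1  # simulate ~10% disagreement
group_2[flip] = np.where(group_2[flip] == "include", "exclude", "include")
print(kappa_with_ci(group_1, group_2))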
For tiering, we took a random subset of 250 articles (without replacement) from the studies deemed to be included by the LLM to be tiered by humans. We again proceeded with two randomly assigned groups of humans (four nonoverlapping screeners each) to tier the studies. Each group conducted a first pass to remove any studies that were mistakenly marked for inclusion by the model, leaving 234 valid studies to tier. Each human group then assigned each study to a tier. As with the inclusion and exclusion screening, interhuman agreement, human–LLM agreement for each screening group, and the agreement between tiebroken human decisions (serving as the ground truth) and the LLM were computed.
Next, treating the tiebroken human screening decisions as the ‘ground truth’ decision set, we computed the agreement between the LLM and the ground truth set, as well as sensitivity and specificity. We used these values to compute error bounds on the true number of included and excluded studies.
For each included study, we extracted a plethora of metadata from the title and abstract of each study using OpenAI’s GPT-5 (reasoning set to ‘high’). Specifically, we extracted (if applicable) the model(s) evaluated, the specialties/subspecialties related to the study, the type of human evaluator (medical student, resident, fellow, attending and so on), the type of dataset used in the study, whether the LLM performance was found to be superior to humans and the type of data source (for example, clinical notes, vignettes, patient messages and so on). The full prompt used for this data extraction is available in our GitHub repository.
Given the diverse range of responses, the set of responses for each metadata field was manually reviewed and an overarching set of categories for each metadata field was curated. For example, ‘GPT-4o, ChatGPT, Gemini’ would all belong to ‘Closed-source general LLM.’ A full list of categories can be found in the GitHub codebase. Next, the free-form data along with the title and abstract were re-input into GPT-5 (reasoning set to ‘minimal’) with instructions to classify the free-form data via the aforementioned category lists. This allowed for a dramatic reduction in the number of unique terms parsed by the LLM and enabled broad categorical analysis. Given that other performance metrics were strong, that this was purely an extraction task and that there was an overwhelming amount of metadata extracted, we opted not to perform human validation on this task.
To estimate the range of sensitivity, specificity and Cohen’s κ of the LLM in the inclusion and exclusion phase, we first resampled the human–LLM agreement data with replacement 50,000 times and computed the sensitivity and specificity of each sample. We then formed the 95% CI by taking the 1,250th (2.5th percentile) and 48,750th (97.5th percentile) values from these samples. To estimate the range of true positive, true negative, false positive and false negative values, we first modeled the true posterior prevalence distribution based on the tiebroken human inclusion and exclusion decisions (which we considered to be the ground truth) as a beta distribution with a Jeffreys prior, beta(I + 0.5, E + 0.5), where I is the number of included studies and E is the number of excluded studies. Next, we took each of the 50,000 sensitivity and specificity draws, sampled a prevalence, π, from the beta posterior, and propagated sensitivity, specificity and prevalence to each metric of interest. Finally, we took the 2.5th and 97.5th percentiles of each resulting distribution.
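A minimal sketch of this bootstrap-and-propagate procedure is shown below, with simulated audit labels standing in for the human-validated screen; the counts and error rate are placeholders.

# Illustrative sketch of the sensitivity/specificity bootstrap with Jeffreys-prior
# prevalence propagation (placeholder inputs, not the study data).
import numpy as np

rng = np.random.default_rng(0)

# Simulated audited labels: 1 = include, 0 = exclude
truth = rng.choice([0, 1], size=500, p=[0.65, 0.35])
llm = truth.copy()
flip = rng.random(500) < 0.08          # simulate ~8% LLM screening errors
llm[flip] = 1 - llm[flip]

N_TOTAL = 12_894                        # deduplicated records screened by the LLM
I, E = truth.sum(), len(truth) - truth.sum()

draws = []
for _ in range(50_000):
    idx = rng.integers(0, len(truth), len(truth))  # bootstrap resample
    t, l = truth[idx], llm[idx]
    sens = (l[t == 1] == 1).mean()
    spec = (l[t == 0] == 0).mean()
    pi = rng.beta(I + 0.5, E + 0.5)     # prevalence draw from the Jeffreys posterior
    tp = N_TOTAL * pi * sens            # propagate to expected confusion counts
    fn = N_TOTAL * pi * (1 - sens)
    fp = N_TOTAL * (1 - pi) * (1 - spec)
    tn = N_TOTAL * (1 - pi) * spec
    draws.append((sens, spec, tp, fp, fn, tn))

print(np.percentile(np.array(draws), [2.5, 97.5], axis=0))  # 95% intervals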
To estimate the true counts of the Tier S + I, II and III studies (Tier S was grouped with Tier I due to an extremely small sample size), we constructed a Bayesian hierarchical Dirichlet-multinomial model that jointly inferred the population prevalence of each tier and the LLM’s tier-specific misclassification rates. The model placed Dirichlet priors on the prevalence vector and each row of the confusion matrix and then fitted them to both the 234 ground truth human studies and the 4,609 LLM-assigned totals via a multinomial likelihood. We used Markov chain Monte Carlo sampling to obtain a large number of posterior samples from this distribution, using the 2.5th and 97.5th percentiles to construct our confidence interval.
Specifically, let $\varphi = (\varphi_{I}, \varphi_{II}, \varphi_{III})$ denote the population-level prevalences of Tiers S + I, II and III, and let $\Theta \in [0,1]^{3\times 3}$ be the LLM confusion matrix whose $i$-th row gives the probabilities that a true tier-$i$ study is labeled as each tier by the model. We assign independent Dirichlet(1, 1, 1) priors to $\varphi$ and to each row of $\Theta$. For the audit, the 3 × 3 table of human–LLM agreements, $M$, is modeled as row-wise multinomials:

$$M_{i\cdot} \sim \mathrm{Multinomial}(n_{i}, \Theta_{i\cdot}),$$

where $n_{i}$ is the number of audited studies whose human-assigned (true) tier is $i$. For the remaining $N_{\mathrm{tot}} = 4{,}609$ studies, we observe only the LLM totals $\mathbf{c} = (c_{I}, c_{II}, c_{III})$. Marginalizing the unknown true labels gives the likelihood:

$$\mathbf{c} \sim \mathrm{Multinomial}\left(N_{\mathrm{tot}},\ \Theta^{\top}\varphi\right), \qquad (\Theta^{\top}\varphi)_{j} = \sum_{i} \varphi_{i}\,\Theta_{ij}.$$

The posterior can thus be modeled as:

$$p(\varphi, \Theta \mid M, \mathbf{c}) \propto p(\varphi)\,p(\Theta)\,p(M \mid \Theta)\,p(\mathbf{c} \mid \varphi, \Theta).$$

We sample from this posterior via Markov chain Monte Carlo, with four NUTS chains (4,000 draws each, 2,000 warm-up steps, accept ratio of 0.9). Convergence is confirmed by $\hat{R} < 1.01$ and effective sample sizes >400 for all parameters. The true tier counts are recovered deterministically as $\mathbf{N}^{(s)} = \varphi^{(s)} N_{\mathrm{tot}}$ for each posterior draw $s$; the median of these draws is reported as the point estimate, and the 2.5th and 97.5th percentiles provide 95% credible intervals. All modeling was implemented in PyMC v5.
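A compact sketch of this hierarchical model in PyMC v5 is shown below; the audit table and label totals are placeholders, and the exact parameterization of the released code may differ.

# Illustrative PyMC v5 sketch of the Dirichlet-multinomial tier model (placeholder data).
import numpy as np
import pymc as pm

M = np.array([[50, 8, 2],                  # rows: human (true) tier, cols: LLM-assigned tier
              [10, 90, 15],
              [3, 12, 44]])                 # placeholder audit counts
llm_totals = np.array([1115, 1767, 1727])   # placeholder LLM-assigned totals

with pm.Model() as tier_model:
    phi = pm.Dirichlet("phi", a=np.ones(3))            # tier prevalences
    theta = pm.Dirichlet("theta", a=np.ones((3, 3)))   # row-wise confusion matrix

    # Audit likelihood: each true-tier row is a multinomial over LLM labels
    for i in range(3):
        pm.Multinomial(f"audit_row_{i}", n=int(M[i].sum()), p=theta[i], observed=M[i])

    # Marginal likelihood for unaudited studies: P(label j) = sum_i phi_i * theta_ij
    p_label = pm.math.dot(phi, theta)
    pm.Multinomial("llm_labels", n=int(llm_totals.sum()), p=p_label, observed=llm_totals)

    pm.Deterministic("true_counts", phi * llm_totals.sum())  # recovered tier counts

    idata = pm.sample(draws=4000, tune=2000, chains=4, target_accept=0.9)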
Comparison of tier counts was performed via the Bonferroni-corrected two-sided t-test.
Comparison of the rates of increase of the number of studies published per month between each tier was performed via pairwise comparisons of the linear-regression slopes using independent two-sample Welch t‑tests on the slope estimates, with degrees of freedom computed by the Welch–Satterthwaite approximation.
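As an illustration of this slope comparison, a minimal sketch under these assumptions is given below, with simulated monthly counts standing in for the per-tier publication data.

# Illustrative Welch t-test on two regression slopes (simulated monthly counts).
import numpy as np
from scipy import stats

def compare_slopes(x1, y1, x2, y2):
    """Welch t-test on two independent linear-regression slope estimates."""
    r1, r2 = stats.linregress(x1, y1), stats.linregress(x2, y2)
    se1, se2 = r1.stderr, r2.stderr
    t = (r1.slope - r2.slope) / np.sqrt(se1**2 + se2**2)
    # Welch-Satterthwaite degrees of freedom with n - 2 residual df per fit
    df = (se1**2 + se2**2) ** 2 / (se1**4 / (len(x1) - 2) + se2**4 / (len(x2) - 2))
    p = 2 * stats.t.sf(abs(t), df)  # two-sided p value
    return r1.slope, r2.slope, t, df, p

months = np.arange(36)
tier_i = 2.0 * months + np.random.default_rng(1).normal(0, 5, 36)   # placeholder counts
tier_ii = 3.0 * months + np.random.default_rng(2).normal(0, 5, 36)  # placeholder counts
print(compare_slopes(months, tier_i, months, tier_ii))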
Pairwise comparisons between tiers of the proportion of studies where LLMs outperformed humans were conducted with one-sided, two-sample z-tests with Bonferroni correction for multiple comparisons.
The year-over-year rate of change in the proportion of studies where LLMs outperformed humans was analyzed by fitting a binomial (logistic) regression of the binary outperformance indicator on publication year and comparing it to an intercept-only model via a likelihood-ratio test.
Comparison of the proportion of studies where LLMs outperformed humans across levels of experience was performed via pairwise one‑sample z‑tests for proportions, with P values adjusted for multiple comparisons using the Benjamini–Hochberg false discovery rate procedure with a false discovery rate of 5%. We chose false discovery rate over the more conservative Bonferroni correction because, with a large family of pairwise tests, controlling the expected proportion of false discoveries (rather than the probability of any false positive) preserves statistical power and is more appropriate for this exploratory analysis context.
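For concreteness, the sketch below runs pairwise proportion z-tests with Benjamini–Hochberg correction using statsmodels; the outperformance counts per experience level are placeholders, not the study data.

# Illustrative pairwise proportion z-tests with Benjamini-Hochberg FDR correction.
from itertools import combinations
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

# Placeholder (outperformed, total) counts per comparator experience level
groups = {
    "attending": (300, 1200),
    "resident": (90, 260),
    "medical_student": (60, 150),
}

pairs, pvals = [], []
for (name_a, (x_a, n_a)), (name_b, (x_b, n_b)) in combinations(groups.items(), 2):
    # one-sided test: is the LLM's outperformance rate lower against group A than group B?
    _, p = proportions_ztest([x_a, x_b], [n_a, n_b], alternative="smaller")
    pairs.append((name_a, name_b))
    pvals.append(p)

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for pair, p_raw, p_corr, sig in zip(pairs, pvals, p_adj, reject):
    print(pair, round(p_raw, 4), round(p_corr, 4), sig)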
Because this review was designed as a descriptive, bibliometric and methodological mapping of the LLM-in-clinical-medicine literature—rather than an evaluation of the magnitude or direction of treatment effects—we did not conduct a conventional study-level risk of bias assessment. We did not pool effect estimates, quantitatively compare interventions or base recommendations on individual study results; consequently, risk of bias judgments would neither have been comparable across the highly heterogeneous set of included designs (ranging from randomized trials and observational studies to simulation and exam-style evaluations) nor altered our analyses or conclusions. Instead, we captured the aspects of methodological rigor that were relevant to our objectives through a prespecified evidence tier framework (Tiers S/I/II/III, reflecting data realism and study design) and by explicitly quantifying misclassification error in our LLM-assisted screening and tiering pipeline against human-validated labels. For these reasons, a formal study-level risk of bias assessment was considered not applicable and was therefore not performed.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
All data pertaining to this systematic review necessary to reproduce all results are available via GitHub at https://github.com/nyuolab/llms-in-clinical-medicine-systematic-review and via Zenodo at https://doi.org/10.5281/zenodo.17393576 (ref. 20).
All analyses were run in Python 3.12.7 inside JupyterLab. The notebooks rely on a small set of open-source packages: numpy 1.26.4 for array math, pandas 2.2.3 for data management, matplotlib 3.10.1 for visualization, scipy 1.14.1 and statsmodels 0.14.4 for various statistical tests, PyMC 5.23.0 with pytensor and ArviZ 0.22.0 for Bayesian inference, and the OpenAI Python SDK 1.61.0 for querying GPT models. All code is available via GitHub at https://github.com/nyuolab/llms-in-clinical-medicine-systematic-review and via Zenodo at https://doi.org/10.5281/zenodo.17393576 (ref. 20).
Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at arXiv https://doi.org/10.48550/arXiv.2108.07258 (2021).
Tang, L. et al. Evaluating large language models on medical evidence summarization. NPJ Digit. Med. 6, 158 (2023).
McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature https://doi.org/10.1038/s41586-025-08869-4 (2025).
Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature https://doi.org/10.1038/s41586-025-08866-7 (2025).
Kıyak, Y. S. & Emekli, E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad. Med. J. 100, 858–865 (2024).
Alyakin, A. et al. Repurposing the scientific literature with vision-language models. Preprint at arXiv https://arxiv.org/abs/2502.19546 (2025).
Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).
Kung, T. H. et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit. Health 2, e0000198 (2023).
Gilson, A. et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med. Educ. 9, e45312 (2023).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Wang, Y., Song, Y., Ma, Z. & Han, X. Multidisciplinary considerations of fairness in medical AI: a scoping review. Int. J. Med. Inform. 178, 105175 (2023).
Mennella, C., Maniscalco, U., De Pietro, G. & Esposito, M. Ethical and regulatory challenges of AI technologies in healthcare: a narrative review. Heliyon 10, e26297 (2024).
Patil, A. et al. Large language models in neurosurgery: a systematic review and meta-analysis. Acta Neurochir. (Wien) 166, 475 (2024).
Bedi, S. et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 333, 319–328 (2025).
Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372, n71 (2021).
Danilov, G. et al. Length of stay prediction in neurosurgery with Russian GPT-3 language model compared to human expectations. Stud. Health Technol. Inform. 289, 156–159 (2022).
Danilov, G. et al. Predicting the length of stay in neurosurgery with RuGPT-3 language model. Stud. Health Technol. Inform. 295, 555–558 (2022).
Bricker, J. B., Sullivan, B., Mull, K., Santiago-Torres, M. & Lavista Ferres, J. M. Conversational chatbot for cigarette smoking cessation: results from the 11-step user-centered design development process and randomized controlled trial. JMIR MHealth UHealth 12, e57318 (2024).
Gallifant, J. et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat. Med. 31, 60–69 (2025).
Chen, S. nyuolab/llms-in-clinical-medicine-systematic-review: Initial release. Zenodo https://doi.org/10.5281/zenodo.17393576 (2025).
E.K.O. is supported by the National Cancer Institute Early Surgeon Scientist Program (3P30CA016087-41S1) and the W.M. Keck Foundation. This work was supported by an Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. RS-2019-II190075, Artificial Intelligence Graduate School Program (KAIST); No. RS-2024-00509279, Global AI Frontier Lab). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We appreciate the informal input from mentors, colleagues and lab members of OLAB.
Duke University School of Medicine, Durham, NC, USA
Sully F. Chen, Andreas Seas & Rochelle T. Bitolas
Washington University School of Medicine, Saint Louis, MO, USA
Anton Alyakin & Jin Vivian Lee
Department of Neurosurgery, NYU Langone Health, New York, NY, USA
Anton Alyakin, Eunice Yang, Joanne J. Choi, Jin Vivian Lee, Robert J. Steele & Eric K. Oermann
Global AI Frontier Lab, New York University, New York, NY, USA
Anton Alyakin, Jin Vivian Lee & Eric K. Oermann
Columbia University Vagelos College of Physicians and Surgeons, New York, NY, USA
Eunice Yang
Augusta University/University of Georgia Medical Partnership, Athens, GA, USA
Joanne J. Choi
UQ Ochsner, New Orleans, LA, USA
Amelia L. Chen
Department of Neurosurgery, Johns Hopkins University, Baltimore, MD, USA
Pranav I. Warman
Department of General Surgery, NYU Langone Health, New York, NY, USA
Robert J. Steele
New York University Grossman School of Medicine, New York, NY, USA
Daniel A. Alber & Eric K. Oermann
Department of Radiology, NYU Langone Health, New York, NY, USA
Eric K. Oermann
Courant Institute School of Mathematics, Computing, and Data Science, New York University, New York, NY, USA
Eric K. Oermann
S.F.C. and E.K.O. conceptualized and supervised the project. S.F.C. collected journal publications and abstracts. S.F.C. developed LLM prompting of the initial screening and tiering process. S.F.C., J.V.L., A.A., A.S., A.L.C., P.I.W., R.T.B., R.J.S. and D.A.A. performed the human validation of the initial screening. E.Y. performed the tiebreaking for the initial screening. S.F.C., E.Y., J.V.L., A.A., A.S., A.L.C., J.J.C., R.T.B. and R.J.S. performed the human validation of the tiering. A.A. and R.T.B. performed the tiebreaking for the human tiering. S.F.C. extracted the publication metadata using LLMs. S.F.C. performed statistical analyses of the data. S.F.C. designed the manuscript figures. S.F.C., A.A., A.S., E.Y. and J.J.C. drafted the manuscript, with all authors contributing to revisions and final edits.
Correspondence to Sully F. Chen or Eric K. Oermann.
E.K.O. reports consulting with Sofinnova Partners and Google, income from Merck & Co. and Mirati Therapeutics, and equity in Artisight. S.F.C. reports equity in OpenAI. The other authors declare no competing interests.
Nature Medicine thanks Stephen Gilbert, Ethan Goh, Guangyu Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ming Yang, in collaboration with the Nature Medicine team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The confusion matrix of the LLM-tiered studies versus the human consensus.
Supplementary Tables 1–3.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Chen, S.F., Alyakin, A., Seas, A. et al. LLM-assisted systematic review of large language models in clinical medicine. Nat Med 32, 1152–1159 (2026). https://doi.org/10.1038/s41591-026-04229-5