A comparison of the performance of Chinese large language models and ChatGPT throughout the entire clinical workflow

Scientific Reports volume 15, Article number: 36322 (2025)
ChatGPT has demonstrated strong performance in the complex, full clinical workflow. In recent years, several large language models (LLMs) from China have been introduced; however, their performance on such intricate tasks has yet to be thoroughly assessed, and it remains unclear whether it diverges from that of ChatGPT. This study evaluates the capacity of Chinese LLMs to provide continuous clinical decision support by assessing their performance on simulated patient cases.
We selected 29 standard cases from the Merck Manual as simulated patients and provided their information to the LLMs. Each simulated case is accompanied by a series of sequential questions designed to simulate the process of differential diagnosis, diagnostic workup, diagnosis, and management. The responses were recorded and scored. We then compared the performance of two Chinese large language models with that of ChatGPT-4 across the entire clinical workflow of the simulated patients, selecting the best-performing model for comparison with 18 human emergency medicine fellows. Additionally, we compared the performance of different versions of the LLMs.
There were no significant differences between ChatGPT-4 and Doubao in any of the four aspects (P > 0.05). However, ERNIE Bot 3.5 was inferior to ChatGPT-4 and Doubao in differential diagnosis, diagnostic questions, and management (P < 0.05). In diagnosis questions, however, the average correct proportion for all three models was above 97%, with no significant differences observed (P > 0.05). There was no significant difference between the LLMs and the emergency fellow physicians in diagnosis and differential diagnosis (P > 0.05), but in diagnostic questions and management, the LLMs were superior to the emergency fellow physicians (P < 0.05). ChatGPT-4 scored higher than ChatGPT-3.5 in all four aspects (P < 0.05).
The Chinese large language model Doubao demonstrates performance similar to ChatGPT across the full clinical workflow. LLMs outperform human emergency fellow physicians and are developing rapidly, offering significant practical application potential in healthcare.
The concept of artificial intelligence (AI) was first proposed in 1956 [1], with the aim of building computer systems that can perform thinking and reasoning tasks similar to the human brain. However, due to limitations in computer hardware, development of the technology was slow for a long period. In recent years, with breakthroughs in hardware capabilities, AI has grown rapidly. A milestone in the history of AI occurred in November 2022 with the release of ChatGPT-3.5, a large language model (LLM) capable of generating text that closely resembles natural human language based on the provided context [2]. Thanks to extensive training on vast amounts of textual data and outstanding natural language processing abilities, LLMs have found widespread applications in the medical field [3,4,5,6]. Early studies have shown that ChatGPT-3.5 performs well across the entire clinical workflow, including differential diagnosis, test recommendations, diagnosis, and treatment management, positioning it as the AI application closest to a human doctor [7].
AI development has been rapid. In the past year, ChatGPT has been updated to version 4.0, and many other high-quality LLMs have been released. Among them, Chinese models such as Doubao and ERNIE Bot 3.5 have gained significant popularity and widespread local use. However, no research has yet evaluated how these Chinese LLMs perform across the entire clinical workflow. This study therefore compares the performance of two Chinese large language models with that of ChatGPT-4 across the entire clinical workflow of simulated patients, selecting the best-performing model for comparison with human emergency doctors. The aim is to explore the current application value of LLMs in clinical work.
The two Chinese language models evaluated here are Doubao (https://www.doubao.com/chat) and ERNIE Bot 3.5 (https://yiyan.baidu.com). Like ChatGPT, both models are built on the Transformer architecture. Doubao is developed by ByteDance, while ERNIE Bot 3.5 is developed by Baidu. The training datasets of both models remain undisclosed. Notably, both models can access and retrieve real-time information. In contrast, ChatGPT-4 cannot conduct real-time searches; the information it provides relies solely on the data and knowledge assimilated during training, up to October 2023.
We selected 29 standard cases from the Merck Manual as simulated patients [8]. These cases represent typical clinical scenarios commonly encountered in healthcare settings and include components similar to clinical encounter documentation, such as the history of present illness (HPI), review of systems (ROS), physical exam (PE), and laboratory test results. Each simulated case is accompanied by a series of sequential questions designed to simulate the process of differential diagnosis, diagnostic workup, and clinical management decisions. The cases are written by independent experts in the field and undergo peer review before publication. It is important to note that these standard cases may be accessible via the real-time search functions of some large language models.
We input the case data from the Merck Manual into each LLM and sequentially presented the associated questions, recording the model's responses. The questions were categorized into four aspects: differential diagnosis, diagnostic questions, diagnosis, and management. The classification criteria are as follows: differential diagnosis questions (diff) ask which conditions cannot be excluded from an initial differential diagnosis; diagnostic questions (diag) require determining appropriate diagnostic steps based on the current hypotheses and available information; diagnosis questions (dx) ask for a final diagnosis; and management questions (mang) ask for appropriate clinical interventions. All questions involving image analysis were excluded from this study, as all three LLMs are text-based and lack the ability to interpret visual information. The distribution of questions across the four categories for each case is detailed in Table S5.
Each case presents multiple-choice questions, with each question containing a varying number of options. Each LLM completed the assessment by selecting answers. Both selecting a correct option and omitting an incorrect option are counted as correct decisions for that option. We define the correct proportion as the total number of correct decisions divided by the total number of options. To account for potential variability in the LLMs' responses, two researchers independently performed the assessment and recording process. We used the intraclass correlation coefficient (ICC) to evaluate inter-rater reliability (Table S6). The final score for each question was the average correct proportion across the researchers' evaluations (Fig. 1).
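To make the scoring rule concrete, the minimal R sketch below computes the correct proportion for a single question and averages it across the two raters. All counts and the score matrix are hypothetical, and the ICC call assumes the irr package; the paper does not publish its scoring code.

```r
# Minimal sketch of the scoring rule described above (hypothetical counts).
# A "correct decision" is selecting a correct option or omitting an
# incorrect one; the correct proportion is correct decisions / options.
correct_proportion <- function(correct_decisions, total_options) {
  correct_decisions / total_options
}

# Two researchers scored each question independently; the final score per
# question is the mean of their correct proportions.
rater1 <- correct_proportion(4, 5)   # rater 1: 4 of 5 option decisions correct
rater2 <- correct_proportion(5, 5)   # rater 2: all 5 option decisions correct
final_score <- mean(c(rater1, rater2))

# Inter-rater reliability over all questions (hypothetical score matrix),
# assuming the 'irr' package for the intraclass correlation coefficient:
# install.packages("irr")
library(irr)
scores <- cbind(rater1 = c(0.80, 0.60, 1.00),
                rater2 = c(1.00, 0.60, 0.90))
icc(scores, model = "twoway", type = "agreement", unit = "single")
```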
We chose emergency physicians as study subjects because the simulated cases span a wide range of clinical specialties, closely resembling the work environment of emergency physicians. We enrolled 18 emergency medicine fellows from a single center who had completed residency and were in their 1st–3rd year of subspecialty training. The same procedures applied to the large language models (LLMs) were also carried out for these doctors. The average score for each question was recorded across all 18 physicians, and these scores were then compared with those of the best-performing large language model (Fig. 1).
We evaluated the development of the large language models by re-analyzing the original data provided by earlier researchers [3] and comparing it with the data generated by the current model versions.
All experimental protocols were approved by Sir Run Run Shaw Hospital Ethics Committee (20240720-12). All methods were carried out in accordance with relevant guidelines and regulations. Written informed consent for participation was obtained from all participants.
Fig. 1: Study design.
A paired t-test is used when comparing two groups where the difference in the observed variable follows a normal (or approximately normal) distribution. If the difference in the observed variable between the groups does not follow a normal (or approximately normal) distribution, the Wilcoxon signed-rank test, a non-parametric test for paired samples, is employed. All statistical analyses were performed using R software (v 4.4.3).
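As an illustration of this decision rule, the R sketch below (the paper reports using R v4.4.3 but publishes no code) checks the normality of the paired differences with a Shapiro–Wilk test and then applies either the paired t-test or the Wilcoxon signed-rank test. The per-case score vectors are hypothetical, not the study's data.

```r
# Minimal sketch of the test-selection rule described above; the per-case
# correct proportions below are hypothetical.
model_a <- c(0.74, 0.81, 0.69, 0.88, 0.72, 0.65, 0.79)
model_b <- c(0.71, 0.78, 0.72, 0.85, 0.70, 0.61, 0.80)
diffs <- model_a - model_b

# Shapiro-Wilk test on the paired differences decides which test to use
if (shapiro.test(diffs)$p.value > 0.05) {
  result <- t.test(model_a, model_b, paired = TRUE)       # paired t-test
} else {
  result <- wilcox.test(model_a, model_b, paired = TRUE)  # Wilcoxon signed-rank
}
print(result)
```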
We compared the performance of the different large language models across the four dimensions. The average correct proportion of Doubao was 0.71 ± 0.12 in differential diagnosis, 0.79 ± 0.14 in diagnostic questions, 0.98 ± 0.09 in diagnosis, and 0.81 ± 0.11 in management. For ERNIE Bot 3.5, the corresponding values were 0.63 ± 0.14, 0.72 ± 0.16, 0.97 ± 0.07, and 0.76 ± 0.14; for ChatGPT-4, they were 0.74 ± 0.12, 0.78 ± 0.13, 0.99 ± 0.03, and 0.84 ± 0.10. A paired t-test revealed no significant differences between ChatGPT-4 and Doubao in differential diagnosis, diagnostic questions, and management (P > 0.05). However, ERNIE Bot 3.5 was inferior to ChatGPT-4 and Doubao in differential diagnosis, diagnostic questions, and management (P < 0.05). For diagnosis questions, the Wilcoxon signed-rank test showed similar performance across the three models (P > 0.05) (Table S1 and Fig. 2). ERNIE Bot 3.5 did not perform worse than the other two models in every case: in the differential diagnosis questions of simulated cases 20 (final diagnosis of benign prostatic hyperplasia), 23 (torsion of the testicular appendage), 26 (cardiac tamponade), and 27 (Cushing's disease), ERNIE Bot 3.5 outperformed the other two models (Figures S1–S4 and Table S4).
Fig. 2: A comparison of the performance of two Chinese LLMs and ChatGPT-4. diff, differential diagnosis; diag, diagnostic questions; dx, diagnosis questions; mang, management questions. NS, not significant (P > 0.05); * or **, statistically significant (P < 0.05).
Comparison between humans and LLMs.
The performance of Doubao and ChatGPT-4 was similar across all aspects, but ChatGPT-4 is more accessible to the international community, so we chose ChatGPT-4 as the reference model for further comparison. We compared the performance of the human fellow doctors and ChatGPT-4 across the four dimensions. The average correct proportion of the human emergency fellows was 0.66 ± 0.19 in differential diagnosis, 0.70 ± 0.15 in diagnostic questions, 0.98 ± 0.08 in diagnosis, and 0.72 ± 0.18 in management. There was no significant difference between the physicians and the LLM in diagnosis and differential diagnosis (P > 0.05), but in diagnostic questions and management, the LLM was superior to the human fellow physicians (P < 0.05) (Table S2 and Fig. 3).
Fig. 3: A comparison of the performance of ChatGPT-4 and emergency fellow physicians. NS, not significant (P > 0.05); * or **, statistically significant (P < 0.05).
We compared the performance of different versions of ChatGPT across the four dimensions. The average correct proportion of ChatGPT-3.5 was 0.63 ± 0.15 in differential diagnosis, 0.71 ± 0.15 in diagnostic questions, 0.84 ± 0.18 in diagnosis, and 0.67 ± 0.12 in management. ChatGPT-4 scored higher than ChatGPT-3.5 in all four aspects (P < 0.05) (Table S3 and Fig. 4).
Fig. 4: A comparison of the performance of ChatGPT-4 and ChatGPT-3.5. NS, not significant (P > 0.05); * or **, statistically significant (P < 0.05).
The entire clinical workflow is a complex process involving many factors, such as clinical thinking, clinical reasoning, individual judgment, and the patient's condition [9]. This study is the first to evaluate the performance of two large language models (LLMs) from China on this complex task and to compare them with ChatGPT-4. The results indicate that, although the models differ in language, the clinical performance of Doubao and ChatGPT-4 is quite similar, suggesting that the difference in language does not significantly affect LLM performance.
The performance of the LLMs varies across different stages of clinical work, with the poorest performance observed in differential diagnosis and the best performance in diagnosis questions (dx). The average correct proportion for diagnosis questions exceeded 97%, indicating that when sufficient diagnostic information is provided, LLMs can offer highly accurate diagnoses.
In this study, ChatGPT-4 outperformed human emergency physicians, a finding consistent with the research by Goh et al. [10]. This is a very promising result, suggesting that LLMs are approaching practical clinical application. However, it does not imply that human physicians can be replaced. On one hand, the simulated cases in this study involved issues outside the specialties of many emergency physicians, so they could not provide perfect answers, even though they might encounter similar situations in real clinical practice. On the other hand, healthcare institutions are composed of multidisciplinary teams, and through consultation mechanisms and teamwork they can achieve greater efficiency than any single emergency physician. Furthermore, current LLMs lack the ability to actively solicit information and perform practical operations during history taking. A significant drawback of LLMs is that they may provide explanations that appear reasonable but are in fact incorrect. For example, in Case 2, during the differential diagnosis of a patient with chest pain, the LLM dismissed acute coronary syndrome (ACS) too easily, reasoning that the chest pain was not severe enough and the patient did not have difficulty breathing (attachments 2 and 3). However, ACS should always be considered in the differential diagnosis of chest pain; in the real world, such errors could be fatal. Therefore, the most appropriate role for LLMs at present may be as a medical decision support tool for emergency physicians, supplementing gaps in their knowledge [10,11].
Compared with ChatGPT-3.5, ChatGPT-4 demonstrates superior performance across all clinical tasks, particularly in diagnosis, where its average correct proportion reached 99%. The two versions were released less than two years apart, underscoring the exceptionally fast pace of LLM development. The healthcare sector may undergo significant changes as a result, and our study captures this shift.
Our study has certain limitations. Some of the LLMs might already have "learned" or had access to these specific cases during their training; this is like giving the LLM an "open-book test" while the human doctors relied on memory. LLMs learn patterns and relationships from trillions of data points, and in this massive amount of data the influence of any single training sample is highly diluted and averaged out: the model learns statistical regularities and high-level abstract features rather than rote memorization of individual sentences [12]. However, we are unable to quantify this phenomenon. Due to the massive number of parameters and complex internal structure of LLMs, precisely tracing the impact of a single training data point on the output is extremely difficult [13]; the calculation of such "influence functions" often performs poorly on LLMs, further indicating that the impact of individual data points is diffuse and hard to isolate. Another limitation is that, because of the small number of standardized cases and the participation of emergency physicians from only a single center, we cannot evaluate whether the AI would perform the same way with different types of patient cases or with doctors from other hospitals or specialties; this requires further research. Nevertheless, as an exploratory study, we believe the conclusion is valuable: it implies that AI application across the entire emergency diagnosis and treatment workflow is a realistic prospect rather than a fantasy.
Our findings indicate that the Chinese LLM, Doubao, performs comparably to ChatGPT across the full spectrum of clinical workflows. This, combined with LLMs’ superior performance over human emergency medicine fellows and their rapid development, underscores their significant practical application potential in healthcare.
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Haleem, A., Javaid, M. & Khan, I. H. Current status and applications of Artificial Intelligence (AI) in medical field: an overview. 9(6), 231–237 (2019).
Introducing ChatGPT. OpenAI. 30 November 2022. https://openai.com/blog/chatgpt (accessed 20 June 2023).
Bates, D. W. et al. The potential of artificial intelligence to improve patient safety: a scoping review. NPJ Digit. Med. 4(1), 54. https://doi.org/10.1038/s41746-021-00423-6 (2021).
Rao, A. et al. Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot. J. Am. Coll. Radiol. 20(10), 990–997. https://doi.org/10.1016/j.jacr.2023.05.003 (2023).
Levine, D. M. et al. The diagnostic and triage accuracy of the GPT-3 artificial intelligence model: an observational study. Lancet Digit. Health 6(8), e555–e561. https://doi.org/10.1016/S2589-7500(24)00097-9 (2024).
Gan, W. et al. Integrating ChatGPT in orthopedic education for medical undergraduates: randomized controlled trial. J. Med. Internet Res. 26, e57037. https://doi.org/10.2196/57037 (2024).
Rao, A. et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J. Med. Internet Res. 25, e48659. https://doi.org/10.2196/48659 (2023).
Case studies. Merck Manual, Professional Version. https://www.merckmanuals.com/professional/pages-with-widgets/case-studies?mode=list (accessed 1 November 2024).
Liu, J., Wang, C. & Liu, S. Utility of ChatGPT in clinical practice. J. Med. Internet Res. 25, e48568. https://doi.org/10.2196/48568 (2023).
Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7(10), e2440969. https://doi.org/10.1001/jamanetworkopen.2024.40969 (2024).
McDuff, D., Schaekermann, M., Tu, T. et al. Towards accurate differential diagnosis with large language models. Preprint at https://doi.org/10.48550/arXiv.2312.00164 (2023).
Xiong, A., Zhao, X., Pappu, A. et al. The landscape of memorization in LLMs: mechanisms, measurement, and mitigation. Preprint at https://arxiv.org/abs/2507.05578 (2025).
Chai, Y., Liu, Q., Wang, S. et al. On training data influence of GPT models. Preprint at https://arxiv.org/abs/2404.07840 (2024).
We would like to thank the researchers and study participants for their contributions.
He Y received funding from the Medical Health Science and Technology Project of Zhejiang Provincial Health Commission (2025KY909). Liu N received funding from the Natural Science Foundation of Zhejiang Province (Q24H160079).
Department of Emergency Medicine, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, 3#, East Qingchun Road, Hangzhou, 310016, Zhejiang, China
Yang He, Lingling Lai, Ning Liu, Yucai Hong, PengPeng Chen & Zhongheng Zhang
School of Medicine, Shaoxing University, Shaoxing, 312000, Zhejiang, China
Yang Wang
The research design was completed by Yang He, Yucai Hong and Zhongheng Zhang. The responses from the LLMs were recorded by Yang He and Yang Wang. The scoring of the records was done by Pengpeng Chen. The statistical analysis was performed by Ning Liu and Lingling Lai. All authors contributed to the article and approved the submitted version.
Correspondence to Zhongheng Zhang.
The authors declare no competing interests.
All experimental protocols were approved by Sir Run Run Shaw Hospital Ethics Committee (20240720-12). All methods were carried out in accordance with relevant guidelines and regulations. Written informed consent for participation was obtained from all participants.
All authors consent for publication.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
He, Y., Wang, Y., Lai, L. et al. A comparison of the performance of Chinese large language models and ChatGPT throughout the entire clinical workflow. Sci Rep 15, 36322 (2025). https://doi.org/10.1038/s41598-025-20210-7