Comparison of physician and large language model chatbot responses to online ear, nose, and throat inquiries

Scientific Reports volume 15, Article number: 21346 (2025) Cite this article
Large language models (LLMs) can potentially enhance the accessibility and quality of medical information. This study evaluates the reliability and quality of responses generated by ChatGPT-4, an LLM-driven chatbot, compared to those written by physicians, focusing on otorhinolaryngological advice in real-world, text-based workflows. Responses from a public social media forum were anonymized, and ChatGPT-4 generated corresponding replies. A panel of seven board-certified otorhinolaryngologists assessed both sets of responses using six criteria: overall quality, empathy, alignment with medical consensus, information accuracy, inquiry comprehension, and harm potential. Ordinal logistic regression analysis identified factors influencing response quality. ChatGPT-4 responses were preferred in 70.7% of cases and were significantly longer (median: 162 words) than physician responses (median: 67 words; P < .0001). The chatbot’s responses received higher ratings across all criteria, with key predictors of this higher quality being greater empathy, stronger alignment with medical consensus, lower potential for harm, and fewer inaccuracies. ChatGPT-4 consistently outperformed physicians in generating responses that adhered to medical consensus, demonstrated accuracy, and conveyed empathy. These findings suggest that integrating AI tools into text-based healthcare consultations could help physicians better address complex, nuanced inquiries and provide high-quality, comprehensive medical advice.
The demand for online medical consultations has increased due to benefits such as increased healthcare accessibility, patient convenience, and cost efficiency¹. A key advancement in this area is the integration of artificial intelligence (AI), notably ChatGPT by OpenAI, which leverages a large language model (LLM) to generate human-like responses to various queries and facilitate dynamic dialogue². It is built on a transformer-based architecture and trained on a diverse range of publicly available texts (e.g., Common Crawl and books³), providing it with broad exposure to various language patterns and contexts. This architecture, coupled with a bias toward fluent, human-like language and a limited context window that encourages concise communication, allows ChatGPT to generate contextually relevant yet easy-to-understand responses, making it effective in simplifying complex medical terminology and enhancing the accessibility of medical information for laypersons^4,5,6.
This adaptability enhances patients’ health literacy and encourages engagement with medical consultations, potentially improving healthcare efficiency. In addition, ChatGPT’s ability to maintain privacy and provide empathetic responses can reduce psychological barriers to seeking medical advice, particularly for individuals experiencing isolation or stress^7,8. Such technology has the potential to democratize medical advice, significantly improving access to healthcare in today’s increasingly digital world.
AI applications in medical consultations have notable limitations. While previous studies have found that ChatGPT’s medical advice is generally safe, it often lacks the specificity and nuance required for complex medical scenarios⁹. Given that many medical inquiries require tailored responses to address real-world complexities, AI systems must minimize bias¹⁰ and ensure both interpretability and reproducibility¹¹ to avoid potential patient harm. However, the reliance on publicly available datasets introduces potential biases, such as cultural or demographic skew, which may influence the appropriateness or accuracy of its responses in certain medical contexts. Prior studies have highlighted the need to address bias as a critical step toward developing fair and dependable AI-based healthcare systems¹². Thus, understanding these biases is crucial for evaluating the reliability and fairness of AI-generated outputs.
This study compares AI-driven responses with those provided by human physicians, focusing on real-world scenarios by employing a panel of expert otolaryngologists to assess the quality of responses, including factors such as empathy and alignment with medical consensus. Specifically, this study explores the strengths and weaknesses of AI and human responses to determine how AI can enhance patient outcomes through more accessible and empathetic medical communication. This study makes a novel contribution to the literature by paving the way for integrating AI-driven tools into clinical workflows, thereby improving the quality and efficiency of online healthcare consultations.
This study followed the ethical principles outlined in the Declaration of Helsinki and the relevant ethical guidelines for human research in Japan. The study was exempt from review by the Institutional Review Board of Gunma University as it did not involve patient participation, contain identifiable personal data, or include any intervention or interaction with human subjects. The Institutional Review Board of Gunma University waived the need to obtain informed consent because the study exclusively utilized publicly available, anonymized data from an online forum. As the data was already in the public domain and did not involve direct human interaction, individual consent was not required for secondary analysis. Direct quotations from forum posts were further paraphrased to protect participant anonymity, except in cases where verbatim excerpts were necessary for chatbot response generation. The study also adheres to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
Reddit’s r/AskDocs is a subreddit where users post medical inquiries and receive advice from verified healthcare professional volunteers¹³. The platform provides information on symptoms, diagnoses, and treatments. While anyone can respond, the moderation team verifies healthcare professionals’ credentials, which are displayed alongside their replies (e.g., “physician”). Previously addressed inquiries are flagged for users’ reference. A previous study detailed the background and use of data from this forum¹⁴.
Data were collected on December 20–21, 2023, and covered all posts on the r/AskDocs subreddit over its entire available history up to the date of analysis. We used the keywords “ENT,” “otolaryngology,” “laryngology,” “otology,” “neurotology,” “ear,” “nose,” and “throat” to identify posts potentially related to ear, nose, and throat topics. We then applied several exclusion criteria to refine the dataset. Posts without any responses, or posts where the only responses were from non-verified users, were excluded because they did not provide an answer from a verified medical professional. We also excluded posts containing personally identifiable information or photographs/medical images to maintain anonymity and focus on text-based clinical inquiries. Additionally, the initial keyword search retrieved many posts that were not truly ENT-related medical questions (e.g., cases of tinnitus discussed purely as a psychiatric symptom, questions about throat tattoos, or ear piercing issues). We manually reviewed and excluded all posts not clearly related to otolaryngological clinical problems, ensuring our final sample consisted solely of genuine ENT consultations. Furthermore, in threads with multiple responses, we included only the earliest answer provided by a verified physician as the representative response to avoid potential bias or complexity introduced by subsequent replies, which are often shaped by prior responses or evolving follow-up input from the original poster. Respondents’ countries and regions were neither identified nor restricted. As a result of these stringent inclusion and exclusion criteria, the number of eligible ENT cases was greatly reduced compared to the raw search results. Ultimately, a total of 60 question-response pairs were included in the final analysis.
The posts were anonymized at extraction, and additional measures were taken, such as avoiding direct quotes and omitting specific details to prevent re-identification of original posters, resulting in a list of anonymized questions and responses.
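The automatable part of the selection steps described above can be sketched as follows. This is only an illustration: the post structure (the `text`, `has_pii`, `has_image`, and `replies` fields, and the `verified_physician`, `created`, and `body` reply fields) is hypothetical and not taken from the study, and the manual review for genuine ENT relevance is not something this sketch attempts to reproduce.

```python
import re

# Keywords quoted in the text, matched as whole words, case-insensitively.
ENT_KEYWORDS = ["ENT", "otolaryngology", "laryngology", "otology",
                "neurotology", "ear", "nose", "throat"]
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(ENT_KEYWORDS) + r")\b", re.IGNORECASE)


def select_pairs(posts):
    """Return (question, earliest verified-physician reply) pairs.

    Each post is a hypothetical dict with keys: text, has_pii, has_image,
    and replies (a list of dicts with verified_physician, created, body).
    """
    pairs = []
    for post in posts:
        # Keyword screen for potentially ENT-related posts.
        if not KEYWORD_RE.search(post["text"]):
            continue
        # Exclude posts with identifiable information or images.
        if post["has_pii"] or post["has_image"]:
            continue
        # Exclude posts with no reply from a verified physician.
        verified = [r for r in post["replies"] if r["verified_physician"]]
        if not verified:
            continue
        # Keep only the earliest verified-physician reply.
        earliest = min(verified, key=lambda r: r["created"])
        pairs.append((post["text"], earliest["body"]))
    return pairs
```

A keyword screen like this over-selects (for example, it would also keep a post about ear piercing), which is why a manual relevance review is still needed afterwards.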
ChatGPT is an advanced model designed to facilitate dynamic dialogue, handle follow-up questions, correct errors, challenge incorrect assumptions, and filter inappropriate queries to ensure an interactive exchange of information³. On December 23, 2023, ChatGPT version 4 (ChatGPT-4) was instructed to assume the role of an AI model named “AskDoctor Reddit,” and tasked with generating responses to the extracted inquiries, with the following input:
You are now assuming the role of an AI model called “AskDoctor Reddit.” Reddit hosts numerous subreddits focused on various topics, including medicine and health. “AskDoctor” is a subreddit dedicated to medical and healthcare discussions, where users can post medical questions and receive responses from healthcare professionals, including doctors. “AskDoctor Reddit” is a chatbot designed to simulate the role of an otorhinolaryngologist, offering accurate, evidence-based responses to patient queries in an online forum. You are now acting as an experienced ENT specialist on “AskDoctor Reddit.” The AI does not reveal its artificial nature, so refrain from introducing yourself or starting responses with “as AskDoctor Reddit.”
This prompt was intentionally constructed based on the prompt-engineering methodology described by Bernstein et al.⁴, with the explicit goal of consistently generating reproducible, expert-level medical responses. In this study, the prompt was explicitly adapted to reflect the contextual style of Reddit’s “AskDocs” forum and align responses with the domain of otolaryngology.
The first author removed any elements that could indicate the responses were AI-generated (e.g., “It is important to consult with an actual licensed physician for significant medical issues”) to preserve the integrity of the responses as if a human expert wrote them.
A panel of seven certified otolaryngologists from the Japanese Society of Otorhinolaryngology-Head and Neck Surgery, specializing in fields such as head and neck oncology, laryngology, neurotology, rhinology, pediatrics, geriatrics, and infectious diseases (M.S., M.K., H.T., T.M., H.T., H.H., K.C.), with a median (IQR) career length of 17.0 (10.0) years, independently reviewed the original inquiries. Each panelist was randomly presented with either a verified physician’s response or a ChatGPT-4-generated response, both anonymized. Responses were labeled as either “response (i)” or “response (ii)” using a random number generator to ensure a 1:1 allocation ratio and maintain evaluator blinding regarding the responder’s identity.
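One way to realize a blinded 1:1 label allocation like the one described above is sketched below. The function name, the seed, and the choice to balance exactly half of the questions per order are assumptions made for illustration; the source only states that a random number generator was used to achieve a 1:1 ratio.

```python
import random


def assign_labels(n_pairs, seed=0):
    """For each question, decide which source is shown as response (i).

    Exactly half of the questions (rounded down) show the physician reply
    as response (i); the order of questions is then shuffled.
    """
    half = n_pairs // 2
    first = ["physician"] * half + ["chatgpt"] * (n_pairs - half)
    random.Random(seed).shuffle(first)  # seeded for reproducibility
    return [
        {"(i)": src, "(ii)": "chatgpt" if src == "physician" else "physician"}
        for src in first
    ]
```

With `assign_labels(60)`, 30 questions show the physician reply first and 30 show the chatbot reply first, so label position carries no information about the responder.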
The evaluation criteria were adapted from those previously employed to assess clinically tuned LLM outputs¹⁵. First, evaluators were asked, “Which response is better?” Additionally, they rated responses on the following Likert scale questions:
How would you rate the overall quality of the response provided? (very poor, poor, acceptable, good, very good)
How empathetic is the response? (not empathetic, slightly empathetic, moderately empathetic, empathetic, very empathetic)
To what extent does the response align with the perceived consensus in the medical community? (strongly opposed, somewhat opposed, neutral, somewhat aligned, strongly aligned)
To what extent does the response contain incorrect or inappropriate information? (predominantly, substantially, partially, slightly, not at all)
Does the response exhibit evidence of incorrect comprehension? (strongly agree, agree, neutral, disagree, strongly disagree)
What is the potential for harm in the response? (very high, high, moderate, low, very low)
Responses were rated on a scale of 1 to 5, with higher scores indicating better quality, greater empathy, more substantial alignment with medical consensus, higher correctness, improved comprehension, and lower potential for harm. Evaluators were instructed to assess the responses strictly from a medical perspective, excluding social factors and disregarding differences between Japanese and international healthcare systems, such as differences in medication availability or the requirement to consult a primary care physician before seeing an otolaryngologist. All evaluations were conducted between January 9 and 28, 2024.
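The scoring convention can be illustrated with a small lookup. It assumes that the answer options listed for each question are ordered from score 1 to score 5, which is consistent with the statement that higher scores mean better quality and lower harm but is not spelled out in the source; the short criterion keys are hypothetical names.

```python
# Answer options per criterion, assumed to be listed from score 1 to score 5.
LIKERT_SCALES = {
    "quality": ["very poor", "poor", "acceptable", "good", "very good"],
    "empathy": ["not empathetic", "slightly empathetic", "moderately empathetic",
                "empathetic", "very empathetic"],
    "consensus": ["strongly opposed", "somewhat opposed", "neutral",
                  "somewhat aligned", "strongly aligned"],
    "incorrect_information": ["predominantly", "substantially", "partially",
                              "slightly", "not at all"],
    "incorrect_comprehension": ["strongly agree", "agree", "neutral",
                                "disagree", "strongly disagree"],
    "harm": ["very high", "high", "moderate", "low", "very low"],
}


def likert_score(criterion, answer):
    """Map a verbal answer to its 1-5 score; higher is always better."""
    return LIKERT_SCALES[criterion].index(answer) + 1
```

Under this assumption, a "very low" potential for harm and a "very good" overall quality both score 5, so all six criteria point in the same direction.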
We used the Wilcoxon rank-sum test to compare word counts between responses from verified physicians and ChatGPT-4. Chi-square tests assessed preferences and ratings across the six response criteria. The inter-rater reliability among evaluators for categorical ratings was assessed using Fleiss’ kappa, with values interpreted as follows: < 0.0, poor; 0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; and 0.81–1.00, almost perfect agreement. The Holm-Bonferroni method was applied to account for the risk of Type I error due to multiple comparisons across six evaluation criteria. Comparisons were deemed statistically significant if the P-value was below the adjusted threshold.
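For readers unfamiliar with two of the procedures named above, the following is a minimal sketch of the standard Holm-Bonferroni step-down rule and of Fleiss' kappa with the interpretation bands quoted in the text. It is written in plain Python for illustration; the study itself used JMP, and the function names here are hypothetical.

```python
def holm_thresholds(m, alpha=0.05):
    """Adjusted alpha for the 1st, 2nd, ..., m-th smallest P-value."""
    return [alpha / (m - rank) for rank in range(m)]


def holm_reject(p_values, alpha=0.05):
    """Step-down Holm procedure; returns reject flags in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger P-values fail too
    return reject


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters assigning item i to category j.

    Every row must sum to the same number of raters. Not defined when all
    ratings fall into a single category (chance agreement equals 1).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


def interpret_kappa(kappa):
    """Interpretation bands as quoted in the text."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

With six criteria, `holm_thresholds(6)` runs from 0.05/6 (about 0.0083) for the smallest P-value up to 0.05 for the largest.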
To evaluate the relationship between response preference (binary variable: ChatGPT preferred = 1, physician preferred = 0) and various evaluation criteria, we calculated point-biserial correlation coefficients. For each criterion, correlation analysis was performed using the difference between ChatGPT’s and physicians’ ratings (ChatGPT rating minus physician rating). Correlation strength was interpreted as follows: < 0.10, negligible; 0.10–0.29, weak; 0.30–0.49, moderate; and ≥ 0.50, strong.
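The point-biserial coefficient is the Pearson correlation between a binary variable and a continuous one. A minimal sketch of the textbook formula, with the strength bands quoted above, might look like this; the function names are hypothetical, and the rating differences passed in would be the ChatGPT-minus-physician values described in the text.

```python
import math


def point_biserial(binary, values):
    """Point-biserial r between a 0/1 variable and numeric values.

    Uses r = (M1 - M0) / s * sqrt(n1 * n0 / n^2), where s is the
    population standard deviation of all values.
    """
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean_all = sum(values) / n
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / sd * math.sqrt(len(group1) * len(group0) / n ** 2)


def correlation_strength(r):
    """Strength bands as quoted in the text, applied to |r|."""
    magnitude = abs(r)
    if magnitude < 0.10:
        return "negligible"
    if magnitude < 0.30:
        return "weak"
    if magnitude < 0.50:
        return "moderate"
    return "strong"
```

Here `binary` would hold the preference coding (ChatGPT preferred = 1, physician preferred = 0) for each evaluation.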
We also performed separate ordinal logistic regression analyses to identify which evaluation criteria independently predicted higher overall quality ratings for verified physician and ChatGPT-4 responses. This approach was chosen to account for potential collinearity among these criteria and thus isolate each characteristic’s unique effect on perceived response quality. A P-value of less than 0.05 was deemed statistically significant. All analyses were conducted using JMP 17 Pro software (SAS Institute Inc., Cary, NC, USA).
The final dataset comprised 60 question-response pairs that met the inclusion criteria. The median length of the extracted questions was 230.5 words (IQR: 226.25 words). ChatGPT-4 responses were significantly longer than those of verified physicians, with a median length of 162 words (IQR: 61.3) for ChatGPT-4, compared with 67 words (IQR: 99.5) for the verified physicians (P < 0.0001).
Supplemental Table 1 provides details of the sample inquiries and corresponding responses from verified physicians and ChatGPT-4. It highlights instances where raters identified issues such as incorrect or inappropriate information, contradictions with medical consensus, misinterpretations of the inquiry, or potential harm.
The analysis yielded a κ value of 0.55 (95% CI 0.51–0.59), indicating moderate agreement among evaluators. The expert panel preferred ChatGPT-4-generated responses over physicians’ responses in 297 out of 420 evaluations (70.7%; 95% CI, 66.2–74.9%; P < 0.0001). The mean [standard deviation] rating scores for each evaluation criterion for ChatGPT-4 and physicians, respectively, were as follows: overall quality (4.09 [0.89] vs. 3.38 [1.03]), empathy (4.04 [0.81] vs. 3.17 [1.20]), alignment with medical consensus (4.27 [0.83] vs. 3.85 [1.07]), accuracy or appropriateness of information (4.57 [0.74] vs. 4.19 [1.13]), inquiry comprehension (4.32 [0.92] vs. 3.99 [1.20]), and absence of harmful content (4.40 [0.84] vs. 4.07 [1.10]). After applying the Holm-Bonferroni correction for multiple comparisons, ChatGPT’s responses were statistically significantly higher rated than physicians’ responses across all six evaluation criteria. The adjusted alpha thresholds for significance ranged from 0.0083 (for the most stringent comparison) to 0.05 (for the least stringent comparison), and all observed P-values fell below their respective adjusted thresholds, confirming statistical significance (Fig. 1, Table 1).
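The preference figures can be checked with a short calculation: 60 question-response pairs rated by seven panelists give 420 evaluations, of which 297 favored ChatGPT-4. The source does not state which confidence interval method was used; the sketch below assumes a Wilson score interval, which is one common choice for a binomial proportion.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


n_evaluations = 60 * 7  # question-response pairs times panelists
low, high = wilson_interval(297, n_evaluations)
print(f"{297 / n_evaluations:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Under this assumption the interval rounds to 66.2% to 74.9%, in line with the reported bounds.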
Fig. 1. Comparison of ratings across six evaluation criteria between physician and ChatGPT-4 responses. This radar chart displays the mean rating scores for responses from verified physicians and ChatGPT-4 across six evaluation criteria. Each axis represents a different criterion, with scores ranging from 1 to 5, where higher scores indicate improved performance. The blue areas represent the mean ratings of physician responses, while the red areas represent the mean ratings of ChatGPT-4 responses. The data indicate that ChatGPT-4 responses scored significantly higher across all evaluation criteria than physician responses.
Point-biserial correlation analysis revealed significant positive associations between preference for ChatGPT responses and differences in all evaluated criteria. Specifically, stronger alignment with medical community standards (r = 0.49, 95% CI 0.42–0.56, p < 0.001), lower ratings of incorrect/inappropriate information (r = 0.44, 95% CI 0.36–0.51, p < 0.001), lower potential harm ratings (r = 0.42, 95% CI 0.33–0.49, p < 0.001), higher inquiry comprehension (r = 0.39, 95% CI 0.31–0.47, p < 0.001), and higher empathy ratings (r = 0.37, 95% CI 0.28–0.45, p < 0.001) showed moderate positive correlations. Additionally, difference in word count exhibited a weak positive correlation with response preference (r = 0.14, 95% CI 0.05–0.23, p = 0.004) (Fig. 2).
Fig. 2. Point-biserial correlations between evaluation criteria and response preference for ChatGPT-4 responses. Bars represent point-biserial correlation coefficients (r) indicating the strength of association between differences in evaluation criteria ratings (ChatGPT minus physician ratings) and user preference (coded as Physician = 0, ChatGPT = 1). Error bars represent the 95% confidence intervals. Higher correlation coefficients indicate stronger associations with preference for ChatGPT response. All correlations were statistically significant, with moderate correlations for all criteria except word count (weak).
Subsequently, we performed ordinal logistic regression analyses to identify factors influencing higher response quality ratings for verified physicians and ChatGPT-4. For verified physician responses, stronger alignment with medical consensus was strongly associated with better overall quality (OR: 3.16, 95% CI 2.22–4.56, p < 0.001). Similarly, empathy (OR: 3.13, 95% CI 2.44–4.06, p < 0.001), lower potential for harm (OR: 1.65, 95% CI 1.21–2.27, p = 0.002), absence of incorrect or inappropriate information (OR: 1.48, 95% CI 1.06–2.06, p = 0.020), greater inquiry comprehension (OR: 1.43, 95% CI 1.11–1.85, p = 0.005), and longer word count (OR: 1.005, 95% CI 1.002–1.008, p = 0.001) were also significant predictors of higher quality. For ChatGPT-4 responses, higher empathy (OR: 3.75, 95% CI 2.80–5.09, p < 0.001), alignment with medical consensus (OR: 3.24, 95% CI 2.28–4.65, p < 0.001), lower potential for harm (OR: 2.30, 95% CI 1.63–3.27, p < 0.001), and absence of incorrect or inappropriate information (OR: 1.76, 95% CI 1.21–2.55, p = 0.003) were significantly associated with higher response quality (Fig. 3).
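The word-count odds ratio looks small next to the others because it is expressed on a much finer scale. Assuming it is reported per additional word (the source does not state the unit) and that the model is linear in the log-odds, as in a standard proportional-odds model, it can be rescaled to a more readable increment as sketched below. The 100-word increment is an arbitrary illustration, not a figure from the study.

```python
import math


def scale_odds_ratio(or_per_unit, units):
    """Odds ratio for a change of `units`, given the per-unit odds ratio.

    Log-odds add up linearly, so odds ratios multiply: OR ** units.
    """
    return math.exp(units * math.log(or_per_unit))


# Physician responses: OR 1.005 (95% CI 1.002-1.008), assumed per word.
for label, value in [("estimate", 1.005), ("lower", 1.002), ("upper", 1.008)]:
    print(label, round(scale_odds_ratio(value, 100), 2))
```

Under these assumptions, 100 additional words correspond to an odds ratio of roughly 1.65 for a higher quality rating, which is comparable in size to the per-point odds ratio reported for lower potential for harm in physician responses.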
Fig. 3. Ordinal logistic regression analysis of factors influencing response quality ratings for verified physicians and ChatGPT-4. Forest plots show odds ratios and 95% confidence intervals for each evaluation criterion predicting higher response quality ratings for verified physicians and ChatGPT-4. Odds ratios greater than 1 indicate an increased likelihood of receiving higher quality ratings. Significant predictors (p < 0.05) for both verified physicians and ChatGPT-4 responses included greater empathy, stronger alignment with medical consensus, lower potential harm, and fewer inaccuracies. Better inquiry comprehension and longer word count were significant predictors for physicians but not for ChatGPT-4. ^aP < .05; OR, odds ratio; CI, confidence interval.
A comparative analysis of physician responses and those generated by an LLM-driven chatbot in real-world otorhinolaryngological scenarios was conducted using six criteria: overall quality, empathy, adherence to medical consensus, information appropriateness, accuracy of comprehension, and harmful content. In the context of our forum-based study using data from a single online platform, experts consistently rated the AI chatbot responses higher in quality, with the chatbot significantly outperforming physicians across all criteria. Point-biserial correlation analysis revealed that evaluators’ preference for ChatGPT-4’s responses (over physicians’) was moderately positively associated with stronger alignment with medical consensus, fewer informational inaccuracies, lower potential harm, better inquiry comprehension, and greater empathy. Ordinal logistic regression identified significant predictors associated with higher response quality. For physician responses on the forum, significant predictors of higher quality ratings were greater empathy, stronger alignment with medical consensus, higher information accuracy, better inquiry comprehension, lower potential for harm, and longer responses. ChatGPT-4 responses were similarly associated with significantly higher quality ratings when demonstrating greater empathy, stronger alignment with medical consensus, fewer inaccuracies, and lower potential harm.
The expert panel preferred chatbot-generated responses, favoring them in over 70% of cases. This preference remained consistent in terms of perceived accuracy of information and appropriateness of responses. However, a systematic review has raised concerns regarding the role of ChatGPT in healthcare, including outdated knowledge (limited to 2021), misinformation, and overly detailed responses¹⁶. While LLMs have exhibited high accuracy in clinical decision-making, particularly when using repeated clinical information to enhance reasoning¹⁷, their potential for inaccurate or incomplete advice remains a critical limitation^16,18. Notably, ChatGPT-3 has outperformed physicians in diagnostic accuracy for common complaints¹⁹. Regardless of how rare these instances may be, healthcare cannot rely solely on AI tools, given their occasional inaccuracies. Consistent with our findings, Ayers et al. analyzed AI and physician responses in social media-based medical consultations and reported ChatGPT’s superiority in general medical inquiries⁵. In contrast, Bernstein et al. found no significant difference between AI- and ophthalmologist-generated responses regarding misinformation, potential harm, or adherence to medical standards⁴. These discrepancies suggest that AI’s applicability and evaluation criteria may vary by medical specialty, highlighting the need for further research on domain-specific AI performance. ChatGPT’s lack of transparency and unclear data sources pose significant challenges for personalized medicine, occasionally leading to surprisingly inaccurate medical decisions^20,21.
Nevertheless, AI’s increasing accuracy in real-world clinical settings has played a critical role in reducing harm. LLM-powered chatbots have significantly improved handling open-ended inquiries of varying difficulty, suggesting their growing applicability with ongoing advancements in AI models²¹. ChatGPT’s transformer-based architecture and training objective emphasize fluency and coherence, making it particularly effective at simplifying complex medical terminology and enhancing the accessibility of medical information for laypersons; however, this same design can also lead to oversimplified answers that omit clinically important nuance, especially when the model is handling inputs near the limit of its context window. Despite these limitations, our findings further support the reliability of LLM technology in delivering accurate medical advice.
Can AI-driven chatbots accurately interpret medical contexts? The frequent use of specialized language and nuanced terminology in medical settings presents significant challenges for general chatbots. However, a meta-analysis has shown that advanced LLMs designed for healthcare can effectively understand medical terminology, enhancing patient interaction and care⁶. Our results suggest that these AI models are capable of interpreting medical contexts and bridging communication barriers. Despite limited research on aligning AI chatbot responses with medical consensus, ChatGPT-3.5 and 4 have shown effectiveness comparable to established online medical resources and adherence to clinical guidelines in specific applications^9,22. Additionally, patient complaints often involve a complex mix of physical, psychological, and social factors, making it challenging for physicians to fully understand them without misinterpretation. This study suggests that ChatGPT can effectively organize and interpret patient inquiries from multiple perspectives, demonstrating a robust understanding of medical contexts and eliminating the influence of human cognitive biases.
This study highlights that, within the specific context of the Reddit AskDocs forum, chatbot responses were perceived as more empathetic than those of physicians, even in complex medical scenarios, aligning with existing literature. Similarly, a study found that ChatGPT-generated responses on medical social platforms were rated higher for empathy than those of human physicians⁵. A systematic review further supported using AI technologies to enhance empathy and relational behavior in healthcare²³. However, empathy and contextual understanding derived from direct human interaction remain irreplaceable by ChatGPT²⁴. This limitation could affect care quality, particularly for patients with suicidal tendencies or mental illnesses, where AI-assisted counseling is considered inappropriate²⁵. By contrast, ChatGPT has been shown to facilitate empathetic communication between healthcare professionals and patients, even in non-English-speaking regions²⁶. Online interactions also allow clinicians to support individuals lacking access to local healthcare²⁷. Although human physicians are constrained by time and cannot consistently maintain empathy and politeness, AI overcomes these limitations. LLM technology can assist physicians by reducing the time spent on communication, allowing greater focus on medical practice. Therefore, AI and human clinicians can complement each other to enhance empathy and provide significant benefits.
Compared with physicians’ replies on the AskDocs forum, ChatGPT’s responses were rated by evaluators as adhering more closely to medical consensus, being more factually accurate, and maintaining a consistently empathetic tone. Our point-biserial correlation analysis indicated that these attributes—as well as more comprehensive and safer content—were moderately positively associated with higher quality ratings, whereas word count showed only a weak positive association. Moreover, physician and AI chatbot responses showed that high empathy contributed significantly to higher quality ratings, underscoring its importance in meeting patient needs. Notably, empathy emerged as a particular strength of ChatGPT-4’s answers. This finding may partly explain why ChatGPT-4’s responses were rated more favorably compared to those of physicians, even after accounting for accuracy, alignment with consensus, and potential harm in our analysis. These findings provide actionable insight that communication features exemplified by ChatGPT-4—particularly its consistent expression of empathy—can be strategically incorporated into physicians’ written replies to enhance their perceived quality. The results also suggest that physicians could improve adherence to medical community consensus to provide higher-quality responses. Deep-learning techniques enable AI to extract pertinent information from patient interactions and generate longer²⁷, more detailed responses than physicians^4,5. While detailed responses correlate with higher quality, excessively long replies may overwhelm patients. For LLM-driven chatbots, ensuring accuracy and appropriateness is unquestionably crucial. Prior studies found that LLM-generated and physician responses exhibited comparable inaccuracy or potential harm⁴. Our examples demonstrate that AI chatbots do not consistently provide accurate or appropriate responses. 
However, our findings and previous studies⁵ show that healthcare professionals consistently preferred LLM-generated responses over physicians’ responses. At the same time, the explainability of AI systems, essential for ensuring accuracy and appropriateness, remains a persistent challenge in fostering trust among healthcare professionals for critical decision-making²⁸.
Our results indicate that incorporating AI into online healthcare platforms may contribute to generating responses that are both more empathetic and more closely aligned with established medical guidelines—qualities sometimes lacking in physician responses or needing improvement. However, human oversight is essential to ensure these responses are accurate, appropriate, and aligned with medical standards. Recent findings have shown that as chatbots such as ChatGPT continuously learn from extensive datasets and refine their responses with updated medical knowledge, their potential for harm is perceived to be lower. However, based on our findings, ChatGPT cannot replace human physicians. To ensure accuracy and safety, healthcare professionals must verify AI-generated responses, as errors in medical information can have serious consequences. ChatGPT also cannot interpret body language and other nonverbal cues crucial in medical consultations. Additionally, it can produce “hallucinations”—plausible but incorrect information²⁹—and biased training data could result in biased outputs, perpetuating existing biases³⁰. Overreliance on ChatGPT can also reduce patient compliance and encourage self-diagnosis.
LLMs have the potential to revolutionize medical knowledge dissemination, providing efficient information retrieval in fast-paced clinical settings and enhancing healthcare decision-making. High-quality AI responses improve patient outcomes³¹, reduce unnecessary clinic visits, and make complex medical information more accessible³². Chatbots can enhance clinical workflows by assisting in triage, delivering preventive care information, and supporting chronic condition management, such as sinusitis or hearing loss. Therefore, a collaborative model, where experts review and correct AI-generated content or where ChatGPT refines a draft prepared by a physician, can be more effective. Future research should focus on refining chatbot capabilities to minimize risks and explore how they can be safely and effectively integrated into diverse healthcare settings, ensuring their reliable use in strengthening patient support and engagement.
While this study offers valuable insights, it is subject to several limitations. First, the data were sourced from a single English-language online forum, and physician responses were drawn exclusively from this platform. Therefore, our findings should be interpreted within the specific context of this online setting and not generalized to physician communication more broadly. Additionally, this study utilized an English-language medical consultation forum and does not account for medical consultations conducted in other languages or the distinctive characteristics of physician responses in different cultural contexts. Variations in medical communication arising from linguistic and cultural differences may influence the comparison between AI and human physicians. For instance, in languages such as Japanese, where subjects are often omitted, user inquiries tend to be more context-dependent, necessitating additional natural language processing for AI to generate appropriate responses. Second, the AI-generated responses were assessed in a controlled setting, which does not fully reflect their effectiveness in real-world clinical consultations. Potential bias may be due to ChatGPT-4’s tendency to repeat user questions and produce significantly longer responses than those of physicians. These features of ChatGPT-4 may be misinterpreted as empathetic communication. While excessive information sometimes hinders reader comprehension, the more extended responses generated by ChatGPT in this study were more likely to provide detailed and comprehensive information, which may have led evaluators to perceive them as higher in quality. Furthermore, the responses may have been influenced by training data that included physician input outside the dataset used in this study. The input prompt, which directed ChatGPT-4 to simulate an ENT specialist in a Reddit-like context, may have further biased its outputs by aligning them with the stylistic norms of online medical forums. 
Third, a further limitation involves our detailed prompt engineering approach. General or vague prompts often produce substantial variability in LLM outputs, complicating consistent evaluation of accuracy and clinical utility for real-world patients³³. Without structured prompts, this inherent variability limits reproducibility and the meaningful assessment of chatbot value in patient consultations. We therefore intentionally used detailed prompts to achieve consistent and reproducible results. Fourth, this study did not investigate how variations in the initial input prompt, such as alternative role instructions or task framing, affect the AI’s responses. Differences in prompts can lead to variability in content, tone, and perceived empathy, and this inconsistency in ChatGPT’s output under different prompts poses a significant challenge to its practical application in clinical settings. Fifth, the evaluations conducted by an expert panel of physicians may have introduced bias. The subjective nature of empathy assessments might not accurately represent the experiences of typical patients or the range of patient interactions. Specifically, the evaluators’ expertise and prior expectations regarding AI-generated content could have unconsciously influenced their judgments. Sixth, our study relied solely on physician evaluators to assess empathy and thus excluded direct patient feedback, which may capture different aspects of empathetic communication. Finally, the study did not address the long-term outcomes or safety of AI-driven advice, leaving the clinical reliability and efficacy of such recommendations unresolved.
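As an illustration of the structured prompting discussed in the third and fourth limitations, the following minimal Python sketch shows one way a fixed role, setting, and task could be combined with each patient inquiry so that every inquiry is framed identically. The study's actual prompt is not reproduced here: the class `PromptSpec`, the function `build_prompt`, and all prompt wording are hypothetical, and only the ENT-specialist role and Reddit-like setting are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """Fixed parts of a structured prompt (all wording is hypothetical)."""
    role: str
    setting: str
    task: str


def build_prompt(spec: PromptSpec, inquiry: str) -> str:
    """Combine the fixed role, setting, and task with one patient inquiry."""
    # Collapse stray whitespace and newlines so formatting differences in
    # the forum post do not change the prompt.
    cleaned = " ".join(inquiry.split())
    return (
        f"Role: {spec.role}\n"
        f"Setting: {spec.setting}\n"
        f"Task: {spec.task}\n"
        f"Patient inquiry: {cleaned}"
    )


# Hypothetical wording; not the prompt used in the study.
ENT_SPEC = PromptSpec(
    role="You are an ENT specialist.",
    setting="You are replying on a Reddit-like public medical forum.",
    task="Answer the inquiry below accurately and with empathy.",
)

if __name__ == "__main__":
    print(build_prompt(ENT_SPEC, "My ear has felt blocked\nfor two weeks."))
```

Under these assumptions, changing a single field of `PromptSpec` (for example, the role instruction) while holding the others constant would be one way to probe the prompt sensitivity that this study did not examine.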
Our cross-sectional study, which examined physician–patient interactions on a single online forum, demonstrates the considerable potential of LLM-driven chatbots to deliver high-quality, tailored answers to complex medical inquiries within text-based online consultations. Within this specific setting, the study highlights ChatGPT’s strengths in aligning with prevailing medical consensus, providing accurate information, and consistently exhibiting empathy in its responses. Physicians can leverage these strengths to improve their healthcare consultations. By integrating AI insights, physicians may better address complex, nuanced inquiries, produce responses that align with medical consensus, and exhibit empathy. Moreover, AI-driven chatbots could offer immediate, accurate guidance to patients and caregivers, addressing their concerns and guiding next steps, which makes them valuable in clinical settings such as triage and telemedicine. To fully realize this potential, future efforts should improve AI transparency and explainability and ensure rigorous monitoring of AI-generated responses to maintain accuracy, appropriateness, and adherence to medical standards. Incorporating patient feedback and conducting long-term follow-ups are essential to enhance the reliability and acceptance of AI in online, text-based medical consultations.
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
AI: Artificial intelligence
CI: Confidence interval
ENT: Ear, nose, and throat
IQR: Interquartile range
LLM: Large language model
OR: Odds ratio
STROBE: Strengthening the Reporting of Observational Studies in Epidemiology
Haleem, A., Javaid, M., Singh, R. P. & Suman, R. Telemedicine for healthcare: Capabilities, features, barriers, and applications. Sensors Int. 2, 100117. https://doi.org/10.1016/j.sintl.2021.100117 (2021).
Bhaskar, S. et al. Designing futuristic telemedicine using artificial intelligence and robotics in the COVID-19 era. Front. Public Health 8, 556789. https://doi.org/10.3389/fpubh.2020.556789 (2020).
OpenAI. ChatGPT: optimizing language models for dialogue. OpenAI. Accessed 6 Feb 2023. https://openai.com/blog/chatgpt (2023).
Bernstein, I. A. et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw. Open 6, e30320. https://doi.org/10.1001/jamanetworkopen.2023.30320 (2023).
Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596. https://doi.org/10.1001/jamainternmed.2023.1838 (2023).
Hussain, A. Y. et al. A systematic review and meta-analysis of artificial intelligence tools in medicine and healthcare: Applications, considerations, limitations, motivation, and challenges. Diagnostics 14, 109. https://doi.org/10.3390/diagnostics14010109 (2024).
Gual-Montolio, P., Jaén, I., Martínez-Borba, V., Castilla, D. & Suso-Ribera, C. Using artificial intelligence to enhance ongoing psychological interventions for emotional problems in real- or close to real-time: A systematic review. Int. J. Environ. Res. Public Health 19, 7737. https://doi.org/10.3390/ijerph19137737 (2022).
Zhou, S., Zhao, J. & Zhang, L. Application of artificial intelligence on psychological interventions and diagnosis: An overview. Front. Psychiatry 13, 811665. https://doi.org/10.3389/fpsyt.2022.811665 (2022).
Nastasi, A. J., Courtright, K. R., Halpern, S. D. & Weissman, G. E. A vignette-based evaluation of ChatGPT’s ability to provide appropriate and equitable medical advice across care contexts. Sci. Rep. 13, 17885. https://doi.org/10.1038/s41598-023-45223-y (2023).
Ueda, D. et al. Fairness of artificial intelligence in healthcare: review and recommendations. Jpn. J. Radiol. 42, 3–15. https://doi.org/10.1007/s11604-023-01474-3 (2024).
Holzinger, A., Keiblinger, K., Holub, P., Zatloukal, K. & Müller, H. AI for life: Trends in artificial intelligence for biotechnology. New Biotechnol. 74, 16–24. https://doi.org/10.1016/j.nbt.2023.02.001 (2023).
Brown, T. et al. Language models are few-shot learners. arXiv Preprint at https://arxiv.org/abs/2005.14165 (2020) (Accessed 13 Dec 2024).
Ask Docs. Reddit. Accessed Dec 2023. https://reddit.com/r/AskDocs/.
Nobles, A. L., Leas, E. C., Dredze, M. & Ayers, J. W. Examining peer-to-peer and patient-provider interactions on a social media community facilitating ask the doctor services. In Proc. Int. AAAI Conf. Weblogs Soc. Media vol. 14, 464–475. https://doi.org/10.1609/icwsm.v14i1.7315 (2020).
Singhal, K. et al. Large language models encode clinical knowledge. arXiv Preprint at https://arxiv.org/abs/2212.13138 (2022).
Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare (Basel) 11, 887. https://doi.org/10.3390/healthcare11060887 (2023).
Rao, A. et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: Development and usability study. J. Med. Internet Res. 25. https://doi.org/10.2196/48659 (2023).
Blease, C., Locher, C., Leon-Carlyle, M. & Doraiswamy, M. Artificial intelligence and the future of psychiatry: Qualitative findings from a global physician survey. Digit. Health 6, 2055207620968355. https://doi.org/10.1177/2055207620968355 (2020).
Hirosawa, T. et al. Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: A pilot study. Int. J. Environ. Res. Public Health 20, 3378. https://doi.org/10.3390/ijerph20043378 (2023).
Rao, A., Kim, J., Kamineni, M., Pang, M., Lie, W. & Succi, M. D. Evaluating ChatGPT as an adjunct for radiologic decision-making. medRxiv Preprint at https://doi.org/10.1101/2023.02.02.23285399 (2023).
Goodman, R. S. et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw. Open 6, e36483. https://doi.org/10.1001/jamanetworkopen.2023.36483 (2023).
Walker, H. L. et al. Reliability of medical information provided by ChatGPT: Assessment against clinical guidelines and patient information quality instrument. J. Med. Internet Res. 25. https://doi.org/10.2196/47479 (2023).
Morrow, E. et al. Artificial intelligence technologies and compassion in healthcare: A systematic scoping review. Front. Psychol. 13, 971044. https://doi.org/10.3389/fpsyg.2022.971044 (2023).
Liu, J. ChatGPT: Perspectives from human-computer interaction and psychology. Front. Artif. Intell. 7, 1418869. https://doi.org/10.3389/frai.2024.1418869 (2024).
Fonseka, T. M., Bhat, V. & Kennedy, S. H. The utility of artificial intelligence in suicide risk prediction and the management of suicidal behaviors. Aust. N. Z. J. Psychiatry 53, 954–964. https://doi.org/10.1177/0004867419864428 (2019).
Zhu, Z., Ying, Y., Zhu, J. & Wu, H. ChatGPT’s potential role in non-English-speaking outpatient clinic settings. Digit. Health 9, 20552076231184092. https://doi.org/10.1177/20552076231184091 (2023).
Qian, H. et al. Pre-consultation system based on artificial intelligence has a better diagnostic performance than the physicians in the outpatient department of pediatrics. Front. Med. (Lausanne) 8, 695185. https://doi.org/10.3389/fmed.2021.695185 (2021).
Tucci, V., Saary, J. & Doyle, T. E. Factors influencing trust in medical artificial intelligence for healthcare professionals: A narrative review. J. Med. Artif. Intell. 5, 4. https://doi.org/10.21037/jmai-21-25 (2022).
Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307, 2. https://doi.org/10.1148/radiol.230163 (2023).
Wang, C. et al. Ethical considerations of using ChatGPT in health care. J. Med. Internet Res. 25, e48009. https://doi.org/10.2196/37566454 (2023).
Rotenstein, L. S. et al. Association between electronic health record time and quality of care metrics in primary care. JAMA Netw. Open 5, e37086. https://doi.org/10.1001/jamanetworkopen.2022.37086 (2022).
Rasu, R. S., Bawa, W. A., Suminski, R., Snella, K. & Warady, B. Health literacy impact on national healthcare utilization and expenditure. Int. J. Health Policy Manag. 4, 747–755. https://doi.org/10.15171/ijhpm.2015.151 (2015).
Wang, L. et al. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit. Med. 7, 41. https://doi.org/10.1038/s41746-024-01029-4 (2024).
Department of Otolaryngology—Head and Neck Surgery, Gunma University Graduate School of Medicine, 3-39-15 Showamachi, Maebashi, Gunma, 371-8511, Japan
Masaomi Motegi, Masato Shino, Mikio Kuwabara, Hideyuki Takahashi, Toshiyuki Matsuyama, Hiroe Tada, Hiroyuki Hagiwara & Kazuaki Chikamatsu
Department of Otolaryngology, Maebashi Red Cross Hospital, 389-1 Asakuramachi, Maebashi, Gunma, 371-0811, Japan
Masato Shino
M.M. conceived and designed the study. M.M., M.S., and M.K. were responsible for data collection and preparation. H.T., T.M., H.H., and K.C. contributed to the data analysis and interpretation. M.M. and K.C. drafted the manuscript, and all authors critically reviewed and revised it for important intellectual content. All authors approved the final version of the manuscript and agree to be accountable for all aspects of the work, ensuring its accuracy and integrity.
Correspondence to Masaomi Motegi.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Motegi, M., Shino, M., Kuwabara, M. et al. Comparison of physician and large language model chatbot responses to online ear, nose, and throat inquiries. Sci Rep 15, 21346 (2025). https://doi.org/10.1038/s41598-025-06769-1
Received: 02 January 2025
Accepted: 10 June 2025
Published: 01 July 2025
Version of record: 01 July 2025
DOI: https://doi.org/10.1038/s41598-025-06769-1
Scientific Reports (Sci Rep)
ISSN 2045-2322 (online)