Eoh, Kwon, Lee, Lee, Lee, Kim, and Nam: Efficacy of large language models and their potential in Obstetrics and Gynecology education

Abstract

Objective

The performance of large language models (LLMs) and their potential utility in obstetric and gynecological education are topics of ongoing debate. This study aimed to contribute to this discussion by examining the recent advancements in LLM technology and their transformative potential in artificial intelligence.

Methods

This study assessed the performance of generative pre-trained transformer (GPT)-3.5 and −4 in understanding clinical information, as well as their potential implications for obstetric and gynecological education. Obstetrics and gynecology residents at three hospitals took annual promotional examinations; of the 170 questions administered over 4 years (2020-2023), 116 were analyzed after excluding 54 questions containing images. The scores achieved by GPT-3.5, −4, and the 100 residents were compared.

Results

The average scores across all 4 years for GPT-3.5 and −4 were 38.79 (standard deviation [SD], 5.65) and 79.31 (SD, 3.67), respectively. For first-, second-, and third-year residents, the cumulative annual average scores were 79.12 (SD, 9.00), 80.95 (SD, 5.86), and 83.60 (SD, 6.82), respectively. No statistically significant differences were observed between the scores of GPT-4 and those of the residents. For questions specific to obstetrics, the average scores of GPT-3.5 and −4 were 33.44 (SD, 10.18) and 90.22 (SD, 7.68), respectively.

Conclusion

GPT-4 demonstrated exceptional performance in obstetrics, different types of data interpretation, and problem solving, showcasing the potential utility of LLMs in these areas. However, acknowledging the constraints of LLMs is crucial and their utilization should augment human expertise and discernment.

Introduction

Recent breakthroughs in large language model (LLM) technology have significantly transformed the landscape of artificial intelligence (AI) [1]. Among current LLMs, OpenAI’s Chat generative pre-trained transformer (ChatGPT), which launched in November 2022, has emerged as a noteworthy innovation [1]. ChatGPT has effectively analyzed and applied knowledge in specialized areas such as medicine, law, and business management, which are typically reserved for subject matter experts. Remarkably, this system has attained substantial accuracy, passing challenging assessments such as the United States Medical Licensing Examination, the bar examination, the Wharton Master of Business Administration final, and other medical examinations, accomplishing this feat with its pre-existing training alone, without any additional fine-tuning [2-7].
Becoming proficient in obstetrics and gynecology (Ob/Gyn) is a long journey, which integrates theoretical study, experiential learning, and closely monitored clinical practice [8]. Throughout their training, Ob/Gyn residents collaborate with senior practitioners to acquire hands-on experience in patient management, surgical procedures, and clinical judgment. They also participate in a wide range of didactic sessions and seminars that cover the core concepts of Ob/Gyn, as well as medical and surgical skills and methods. This extensive training means that mastering this field often takes at least 10 years [8]. Therefore, it is crucial to evaluate the potential role of emerging technologies such as AI and LLMs in enhancing the educational process [9,10].
This study aimed to utilize ChatGPT to analyze Korean Ob/Gyn examinations and determine whether LLMs exhibit expert-level understanding. We also compared the capabilities of GPT-3.5 with those of GPT-4.

Materials and methods

1. Ob/Gyn examination for residents

The goal of an Ob/Gyn residency is to cultivate an individual’s abilities to thoroughly assess pathological conditions related to Ob/Gyn illnesses and master surgical techniques for treating obstetric, neoplastic, and infectious diseases. In South Korea, residents must pass a board examination that objectively measures their knowledge and skills to be qualified as a certified Ob/Gyn physician. As part of their preparation for this examination and to gauge their proficiency, residents at three hospitals (Severance Hospital, Gangnam Severance Hospital, and Yongin Severance Hospital) affiliated with Yonsei University College of Medicine are required to participate in annual advancement tests. These evaluations cover all aspects of Ob/Gyn practice and feature various types of questions, including fact recall (R-type), data interpretation (I-type), and problem solving (P-type).

2. Dataset for model testing

The annual advancement examination questions were curated by professors from the Department of Obstetrics and Gynecology at the three hospitals affiliated with Yonsei University College of Medicine. However, given the limitations of LLMs in processing visual data such as clinical photographs, diagnostic imaging, and graphs, questions containing visual elements were omitted from the dataset. Each question was entered manually in Korean. Our final dataset included 116 questions from the initial phase of the board examinations administered from 2020 to 2023 (Fig. 1).
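For illustration, the sketch below shows how such a dataset could be filtered to exclude image-based questions. This is a minimal sketch under stated assumptions: the file name, column names, and CSV structure are hypothetical and are not taken from the study.

```python
import pandas as pd

# Hypothetical file: one row per examination question (2020-2023),
# with a boolean column marking questions that contain images or graphs.
questions = pd.read_csv("promotion_exam_questions_2020_2023.csv")

# Exclude the image-based questions, keeping only text-only items.
text_only = questions[~questions["contains_image"]].copy()

print(f"Total questions: {len(questions)}")      # 170 in this study
print(f"Text-only questions: {len(text_only)}")  # 116 in this study
```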

3. LLM and performance evaluation

This study focused on determining the efficacy of OpenAI’s ChatGPT language models in answering a set of questions. The GPT-3.5 and −4 models were evaluated on July 3, 2023 and July 15, 2023, respectively. To assess the efficacy of each model, we manually entered the questions into the ChatGPT website and juxtaposed each model’s responses with those of the Ob/Gyn residents (Fig. 1).
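As noted above, the authors entered each question manually on the ChatGPT website. For readers interested in scripting a comparable evaluation, the sketch below illustrates one possible approach using the OpenAI Python API; it is an assumption-laden illustration (the model identifier, prompt wording, and grading step are hypothetical), not the procedure used in this study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_model(question_text: str, model: str = "gpt-4") -> str:
    """Send one Korean examination question to the model and return its answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are taking an obstetrics and gynecology examination. "
                           "Answer with the single best option.",
            },
            {"role": "user", "content": question_text},
        ],
        temperature=0,  # favor deterministic answers for grading
    )
    return response.choices[0].message.content

# Example usage (hypothetical question text; grading against the answer key
# would be performed separately):
# answer = ask_model("다음 중 가장 적절한 처치는?")
```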

4. Statistical analysis

The performance of GPT-3.5 and −4 was compared using Student’s t-test and the chi-squared test. A P-value <0.05 was considered to indicate a statistically significant difference in performance.
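As an illustration of the chi-squared comparison described above, the sketch below applies it to the correct/incorrect counts reported in Table 1. The use of scipy is an assumption; the exact statistical software used by the authors is not stated.

```python
from scipy.stats import chi2_contingency

# Correct/incorrect counts for GPT-3.5 and GPT-4 on the 116 questions (Table 1).
table = [
    [45, 71],  # GPT-3.5: correct, incorrect
    [92, 24],  # GPT-4:   correct, incorrect
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # P < 0.05 indicates a significant difference
```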

5. Ethical approval

This study did not involve human subjects; therefore, no institutional review board approval was required.

Results

The comparative analysis of GPT-3.5 and −4, as shown in Table 1, suggested a significant difference in their performance. GPT-4 demonstrated a markedly higher accuracy rate (79.3%), correctly answering 92 of the 116 questions. In contrast, GPT-3.5 exhibited lower accuracy (38.8%), correctly answering only 45 of the 116 questions.
Table 2 is a 2×2 contingency table summarizing the comparative performance of GPT-3.5 and −4. Both models answered the same 42 questions (36.2%) correctly, whereas GPT-3.5 correctly answered only three questions (2.6%) that GPT-4 answered incorrectly. In 50 instances (43.1%), GPT-3.5 answered incorrectly while GPT-4 answered correctly, and 21 questions (18.1%) were answered incorrectly by both models. Table 2 thus illustrates the superior performance of GPT-4, which achieved a higher overall rate of correct answers (79.3%).
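A contingency table of this kind can be tabulated directly from per-question correctness. The sketch below is illustrative only: the boolean lists are hypothetical placeholders standing in for the graded results of the 116 questions.

```python
import pandas as pd

# Hypothetical per-question results: True if the model's answer matched the key.
# In practice, one entry per question (116 in total) would come from manual grading.
gpt35_correct = [True, False, True, False]  # ... truncated placeholder
gpt4_correct  = [True, True,  True, False]  # ... truncated placeholder

contingency = pd.crosstab(
    pd.Series(gpt35_correct, name="GPT-3.5 correct"),
    pd.Series(gpt4_correct, name="GPT-4 correct"),
    margins=True,
)
print(contingency)
```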
As shown in Fig. 2, the performance of GPT-4 was comparable to that of the first-, second-, and third-year residents, with no statistically significant differences. Furthermore, when the questions were categorized as obstetrics or gynecology, GPT-4 performed particularly well in obstetrics, where its performance was significantly better (P=0.015) (Fig. 3).
In the comparative analysis of question types, both GPT-3.5 and −4 performed better on I- and P-type questions than on R-type questions. Specifically, GPT-4 achieved a 100.0% success rate on I-type questions and 79.2% on P-type questions, whereas its performance on R-type questions was lower, at 33.3% (Table 3).

Discussion

The main goal of this study was to quantitatively evaluate the capability of ChatGPT to understand intricate clinical data and examine the possible impacts of LLM technology on Ob/Gyn education and training. We assessed the efficacy of ChatGPT using questions from the Ob/Gyn examination and found that the GPT-4 model achieved an accuracy rate of 79.3%. Of note, this level of accuracy was reached without any specialized fine-tuning of the model and solely using prompts in Korean, underscoring the importance of our results.
The analysis showed the significantly superior performance of GPT-4 across various subspecialties and question types compared with GPT-3.5, with accuracy rates ranging from 76.7% to 85.7%. However, there were three questions (2.6%) that GPT-3.5 answered correctly but GPT-4 did not (Table 2). Despite GPT-4’s overall higher accuracy, the reasons for its incorrect responses in these cases remain unclear. Identifying the precise cause is difficult, as variations in training data, model architecture, or other factors may have contributed to the disparity between the two models.
GPT-4 demonstrated a more robust ability to interpret data and solve problems than to recall facts. These skills, which are crucial for the efficacy of sophisticated AI systems, were evident in GPT-4’s contextual comprehension and inferential methodology, and are essential for identifying and analyzing intricate data patterns. The capability of the LLM to synthesize and utilize various information for inventive problem solving demonstrated its near-human level of reasoning [11]. However, there are shortcomings to this model, particularly in terms of recall. GPT-4’s knowledge was limited to the data available at the time of its last training update, which was April 2023. Additionally, the model cannot update information in real-time, which affects its performance in instances where access to the most recent data is vital [12].
Furthermore, GPT-4 was more proficient in answering obstetric than gynecological questions, although the reason for this difference is not entirely understood. Because GPT-4’s knowledge was capped at the date of its most recent training update, and obstetrics may have undergone relatively fewer changes than gynecology between that update and the examination, the model’s knowledge base may have contributed to this difference. A more comprehensive analysis with additional questions is needed to better understand this phenomenon.
The authors strongly suggest that the Ob/Gyn community proactively embrace these technological advances to improve patient safety and enhance the quality of care. It is vital to shift Ob/Gyn education from conventional memorization-based learning to a strategy of defining problems in specific clinical scenarios and gathering the data necessary to solve them [13]. As generative AI models, LLMs offer solutions to specific problems, and the quality of their answers depends on the questions posed. For Ob/Gyn physicians, comprehensive history-taking and physical examinations are essential for the accurate identification of patient issues. Supplying LLMs with detailed accounts of a patient’s primary complaint(s), current illness(es), and physical examination findings can aid in decisions regarding diagnostic testing and treatment plans in a particular clinical context. Nonetheless, practitioners must remember that LLMs are not substitutes for the fundamental elements of patient care, such as forging strong patient relationships and listening attentively to patients’ concerns [14].
Ob/Gyn physicians who completed their training more than 10 years ago may find LLMs beneficial for continuing medical education (CME) [15,16]. When a significant amount of time has passed since one’s initial training, it can be a challenge to keep up with new developments, potentially leading to the use of outdated treatment methods. Although many Ob/Gyn societies offer dedicated CME programs, changing established clinical protocols can be complicated. Utilizing a current LLM as an auxiliary tool for decision making may offer these physicians an extra avenue with which to keep abreast of the latest information and pursue evidence-based treatment in their patient care [17].
In healthcare, decision making critically influences patient safety and requires greater precision and a more cautious approach to procedural modifications than in other fields. Although GPT-4 achieved 79.3% accuracy on the Ob/Gyn examination, it is essential to recognize that LLMs are generative models, sometimes labeled “stochastic parrots” [18]. Rather than delivering strictly accurate information, these models provide answers based on the likelihood of the most fitting words from their training data. Therefore, the existing level of accuracy does not meet the standards for direct use in patient care.
ChatGPT is just one LLM; other models released within the past year demonstrate similarly impressive capabilities. Microsoft recently unveiled BioGPT, an LLM specializing in PubMed literature, while Meta AI has released Llama, which features a user-friendly application programming interface for broader innovation and customization [19,20]. These developments suggest that future LLMs will be trained on a broader and more varied range of medical data, offering specialized expertise in healthcare. Moreover, the GPT-4 framework is capable of processing and interpreting visual content such as images and videos, suggesting its potential for future use with datasets that include clinical images and surgical footage. Such advancements would increase the relevance of GPT-4 in Ob/Gyn subspecialties, expanding its function beyond text-based applications, providing more thorough insights into intricate clinical situations, aiding healthcare professionals in decision making, and ultimately enhancing patient care.
This study’s limitations primarily stem from the exclusion of visual data, such as clinical images, radiological scans, and graphical representations, which are essential components of medical education and practice, especially in a field like Ob/Gyn. The LLMs evaluated in this study, including GPT-3.5 and GPT-4, were primarily designed for text-based comprehension and were not equipped to process or interpret visual information. As a result, questions involving visual elements, which are often critical for accurate diagnosis and treatment decisions in clinical practice, were intentionally omitted from our dataset. Consequently, it remains uncertain whether GPT-4, despite its superior performance on text-based questions, would maintain its edge over human residents when faced with cases that require the interpretation of visual data, such as ultrasound images, fetal heart rate tracings, or histopathological slides.
Nevertheless, this study provides significant insights into the evolving role of LLMs in clinical education by demonstrating their ability to process complex clinical information and make relevant clinical decisions based solely on textual data. While the current study did not address the integration of multimodal data, it serves as an important step toward understanding the potential of LLMs in enhancing medical education, particularly in the Ob/Gyn domain. Future research that incorporates visual data will be necessary to fully assess the capabilities and limitations of these models in real-world clinical scenarios where visual interpretation is crucial.
In conclusion, GPT-4 exhibited impressive performance in processing intricate clinical data, achieving 79.3% accuracy on the Ob/Gyn examination. However, it is imperative to acknowledge the constraints of LLMs and that their utilization should augment, not replace, human expertise and discernment.

Notes

Conflicts of interest

The other authors also have no conflicts of interest.

Ethical approval

Not applicable.

Patient consent

Not applicable.

Funding information

None.

Fig. 1
Dataset preparation for model evaluation. GPT, generative pre-trained transformer.
Fig. 2
Comparison of the performance of GPT-3.5, −4, and obstetrics and gynecology residents. GPT, generative pre-trained transformer; R, resident.
Fig. 3
Comparison of the overall accuracy of GPT-4 according to subspecialty. GPT, generative pre-trained transformer; SD, standard deviation.
Table 1
Comparison table for the accuracy of GPT-3.5 and −4
Variable GPT-3.5 GPT-4
Correct answer 45 92
Incorrect answer 71 24
Accuracy 45/116 (38.8) 92/116 (79.3)

Values are presented as number (%).

GPT, generative pre-trained transformer.

Table 2
2×2 contingency table summarizing the performance of GPT-3.5 and −4
Variable GPT-4 correct answer GPT-4 incorrect answer Total
GPT-3.5 correct answer 42 (36.2) 3 (2.6) 45 (38.8)
GPT-3.5 incorrect answer 50 (43.1) 21 (18.1) 71 (61.2)
Total 92 (79.3) 24 (20.7) 116 (100.0)

Values are presented as number (%).

GPT, generative pre-trained transformer.

Table 3
Comparison of the accuracy of GPT-3.5 and −4 according to question type
Variable R type (n=6) I type (n=14) P type (n=96) P-value
GPT-4 2 (33.3) 14 (100.0) 76 (79.2) <0.001
GPT-3.5 1 (16.7) 8 (57.1) 36 (37.5) <0.001

Values are presented as number (%).

GPT, generative pre-trained transformer; R-type, recall of facts; I-type, interpretation of data; P-type, problem solving.

References

1. Stokel-Walker C, Van Noorden R. What ChatGPT and generative AI mean for science. Nature 2023;614:214-6.
2. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198.
3. Mbakwe AB, Lourentzou I, Celi LA, Mechanic OJ, Dagan A. ChatGPT passing USMLE shines a spotlight on the flaws of medical education. PLOS Digit Health 2023;2:e0000205.
4. Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res 2023;104:269-73.
5. Yaneva V, Baldwin P, Jurich DP, Swygert K, Clauser BE. Examining ChatGPT performance on USMLE sample items and implications for assessment. Acad Med 2024;99:192-7.
6. Watari T, Takagi S, Sakaguchi K, Nishizaki Y, Shimizu T, Yamamoto Y, et al. Performance comparison of Chat-GPT-4 and Japanese medical residents in the general medicine in-training examination: comparison study. JMIR Med Educ 2023;9:e52202.
7. Ali R, Tang OY, Connolly ID, Zadnik Sullivan PL, Shin JH, Fridley JS, et al. Performance of ChatGPT and GPT-4 on neurosurgery written board examinations. Neurosurgery 2023;93:1353-65.
8. Chen KT, Baecher-Lind L, Morosky CM, Bhargava R, Fleming A, Royce CS, et al. Current practices and perspectives on clerkship grading in obstetrics and gynecology. Am J Obstet Gynecol 2024;230:97e1-6.
9. Wartman SA, Combs CD. Medical education must move from the information age to the age of artificial intelligence. Acad Med 2018;93:1107-9.
10. Ahn KH, Lee KS. Artificial intelligence in obstetrics. Obstet Gynecol Sci 2022;65:113-24.
11. Ong H, Ong J, Cheng R, Wang C, Lin M, Ong D. GPT technology to help address longstanding barriers to care in free medical clinics. Ann Biomed Eng 2023;51:1906-9.
12. Bhattarai K, Oh IY, Sierra JM, Tang J, Payne PRO, Abrams ZB, et al. Leveraging GPT-4 for identifying cancer phenotypes in electronic health records: a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5 and spaCy’s rule-based & machine learning-based methods. JAMIA Open 2024;7:ooae060.
13. Phung A, Daniels G, Curran M, Robinson S, Maiz A, Jaqua B. Multispecialty trainee perspective: the journey toward competency-based graduate medical education in the United States. J Grad Med Educ 2023;15:617-22.
14. Kapadia MR, Kieran K. Being affable, available, and able is not enough: prioritizing surgeon-patient communication. JAMA Surg 2020;155:277-8.
15. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ 2023;9:e48291.
16. Jamal A, Solaiman M, Alhasan K, Temsah MH, Sayed G. Integrating ChatGPT in medical education: adapting curricula to cultivate competent physicians for the AI Era. Cureus 2023;15:e43036.
17. Han ER, Yeo S, Kim MJ, Lee YH, Park KH, Roh H. Medical education trends for future physicians in the era of advanced technology and artificial intelligence: an integrative review. BMC Med Educ 2019;19:460.
18. Sharma A, Kumar R, Vinjamuri S. Artificial intelligence chatbots: addressing the stochastic parrots in medical science. Nucl Med Commun 2023;44:831-3.
19. Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform 2022;23:bbac409.
20. Zagirova D, Pushkov S, Leung GHD, Liu BHM, Urban A, Sidorenko D, et al. Biomedical generative pre-trained based transformer language model for age-related disease target discovery. Aging (Albany NY) 2023;15:9293-309.

