Efficacy of large language models and their potential in Obstetrics and Gynecology education
Abstract
Objective
The performance of large language models (LLMs) and their potential utility in obstetric and gynecological education are topics of ongoing debate. This study aimed to contribute to this discussion by examining the recent advancements in LLM technology and their transformative potential in artificial intelligence.
Methods
This study assessed the performance of generative pre-trained transformer (GPT)-3.5 and −4 in understanding clinical information, as well as their potential implications for obstetric and gynecological education. Obstetrics and gynecology residents at three hospitals underwent an annual advancement examination, from which 116 of the 170 questions administered over 4 years (2020–2023) were analyzed, excluding 54 questions with images. The scores achieved by GPT-3.5, −4, and the 100 residents were compared.
Results
The average scores across all 4 years for GPT-3.5 and −4 were 38.79 (standard deviation [SD], 5.65) and 79.31 (SD, 3.67), respectively. For the first-, second-, and third-year resident groups, the cumulative annual average scores were 79.12 (SD, 9.00), 80.95 (SD, 5.86), and 83.60 (SD, 6.82), respectively. No statistically significant differences were observed between the scores of GPT-4 and those of the residents. When analyzing questions specific to obstetrics, the average scores for GPT-3.5 and −4 were 33.44 (SD, 10.18) and 90.22 (SD, 7.68), respectively.
Conclusion
GPT-4 demonstrated exceptional performance in obstetrics as well as in data interpretation and problem solving, showcasing the potential utility of LLMs in these areas. However, acknowledging the constraints of LLMs is crucial, and their use should augment, not replace, human expertise and discernment.
Introduction
Recent breakthroughs in large language model (LLM) technology have significantly transformed the landscape of artificial intelligence (AI) [1]. Among current LLMs, OpenAI’s Chat generative pre-trained transformer (GPT), launched in November 2022, has emerged as a noteworthy innovation [1]. ChatGPT has effectively analyzed and applied knowledge in specialized areas such as medicine, law, and business management, which are typically reserved for subject matter experts. Remarkably, this system has attained substantial accuracy, passing challenging assessments such as the United States Medical Licensing Examination, the bar examination, the Wharton Master of Business Administration final examination, and other medical examinations, accomplishing this feat with its pre-existing training alone, without any additional fine-tuning [2–7].
Becoming proficient in obstetrics and gynecology (Ob/Gyn) is a long journey, which integrates theoretical study, experiential learning, and closely monitored clinical practice [8]. Throughout their training, Ob/Gyn residents collaborate with senior practitioners to acquire hands-on experience in patient management, surgical procedures, and clinical judgment. They also participate in a wide range of didactic sessions and seminars that cover the core concepts of Ob/Gyn, as well as medical and surgical skills and methods. This extensive training means that mastering this field often takes at least 10 years [8]. Therefore, it is crucial to evaluate the potential role of emerging technologies such as AI and LLMs in enhancing the educational process [9,10].
This study aimed to utilize ChatGPT to analyze Korean Ob/Gyn examinations and determine whether LLMs exhibit expert-level understanding. We also compared the capabilities of GPT-3.5 with those of GPT-4.
Materials and methods
1. Ob/Gyn examination for residents
The goal of an Ob/Gyn residency is to cultivate an individual’s abilities to thoroughly assess pathological conditions related to Ob/Gyn illnesses and master surgical techniques for treating obstetric, neoplastic, and infectious diseases. In South Korea, residents must pass a board examination that objectively measures their knowledge and skills to be qualified as a certified Ob/Gyn physician. As part of their preparation for this examination and to gauge their proficiency, residents at three hospitals (Severance Hospital, Gangnam Severance Hospital, and Yongin Severance Hospital) affiliated with Yonsei University College of Medicine are required to participate in annual advancement tests. These evaluations cover all aspects of Ob/Gyn practice and feature various types of questions, including fact recall (R-type), data interpretation (I-type), and problem solving (P-type).
2. Dataset for model testing
The annual advancement examination questions were curated by professors from the Department of Obstetrics and Gynecology at the three hospitals affiliated with Yonsei University College of Medicine; however, given the limitations of LLMs in processing visual data such as clinical imagery, diagnostic imaging, and graphs, questions containing visual elements were omitted from the dataset. Each question was manually entered in Korean. Our final dataset included 116 questions from the initial phase of the board examinations administered from 2020 to 2023 (Fig. 1).
3. LLM and performance evaluation
This study focused on determining the efficacy of OpenAI’s ChatGPT language models in answering a set of questions. The GPT-3.5 and −4 models were evaluated on July 3, 2023 and July 15, 2023, respectively. To assess the efficacy of each model, we manually entered the questions into the ChatGPT website and compared each model’s responses with those of the Ob/Gyn residents (Fig. 1).
4. Statistical analysis
This study compared the performance of the ChatGPT models GPT-3.5 and −4 using Student’s t-tests and chi-squared tests. A P-value <0.05 was considered statistically significant.
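For illustration only (the analysis code was not published; the counts below are taken from the Results section, and the test choice simply mirrors the chi-squared comparison described above), such a comparison could be sketched in Python as follows:

```python
from scipy import stats

# Correct/incorrect counts reported in the Results section:
# GPT-3.5 answered 45 of 116 questions correctly, GPT-4 answered 92 of 116.
table = [
    [45, 116 - 45],  # GPT-3.5: correct, incorrect
    [92, 116 - 92],  # GPT-4:   correct, incorrect
]
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, dof = {dof}, P = {p_value:.4f}")

# Score distributions (e.g., GPT-4 vs. residents across examination years)
# could be compared analogously with stats.ttest_ind(scores_a, scores_b).
```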
5. Ethical approval
This study did not involve human subjects; therefore, no institutional review board approval was required.
Results
The comparative analysis of GPT-3.5 and −4, as shown in Table 1, suggested a significant difference in their performance. GPT-4 demonstrated a markedly higher accuracy rate (79.3%), correctly answering 92 of the 116 questions. In contrast, GPT-3.5 exhibited lower accuracy (38.8%), correctly answering only 45 of the 116 questions.
Table 2 is a 2×2 contingency table summarizing the comparative performance of GPT-3.5 and −4. Both models answered the same 42 questions (36.2%) correctly, whereas GPT-3.5 correctly answered only three questions (2.6%) that GPT-4 answered incorrectly. In 50 instances (43.1%), GPT-4 answered correctly while GPT-3.5 did not, and both models answered incorrectly on 21 questions (18.1%). Table 2 thus illustrates the superior performance of GPT-4, which achieved a higher overall rate of correct answers (79.3%).
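As a minimal sketch (not the authors’ analysis code), the Table 2 cell counts can be reconstructed from the figures above; McNemar’s test is shown here only as one common choice for paired correct/incorrect outcomes on the same question set, alongside the marginal accuracies:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired outcomes reconstructed from Table 2 (116 questions in total):
# rows = GPT-4 correct/incorrect, columns = GPT-3.5 correct/incorrect.
table = [
    [42, 50],  # GPT-4 correct:   GPT-3.5 correct / GPT-3.5 incorrect
    [3, 21],   # GPT-4 incorrect: GPT-3.5 correct / GPT-3.5 incorrect
]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant cells (50 vs. 3)
print(f"statistic = {result.statistic}, P = {result.pvalue:.6f}")

# Marginal accuracies recovered from the same table:
gpt4_accuracy = (42 + 50) / 116   # ~0.793 (79.3%)
gpt35_accuracy = (42 + 3) / 116   # ~0.388 (38.8%)
```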
As shown in Fig. 2, the performance of GPT-4 was comparable to that of the first-, second-, and third-year residents, with no statistically significant differences. Furthermore, when the questions were categorized as obstetrics or gynecology, the performance of GPT-4 was particularly notable for obstetrics, in which it performed significantly better (P=0.015) (Fig. 3).
In the comparative analysis of question types, both GPT-3.5 and −4 demonstrated superior performance on I- and P-type questions compared with R-type questions. Specifically, GPT-4 achieved a remarkable 100.0% success rate on I-type questions and 79.2% on P-type questions, whereas its performance on R-type questions was lower, at 33.3% (Table 3).
Discussion
The main goal of this study was to quantitatively evaluate the capability of ChatGPT to understand intricate clinical data and to examine the possible impact of LLM technology on Ob/Gyn education and training. We assessed the efficacy of ChatGPT using questions from the Ob/Gyn examination and found that the GPT-4 model achieved an accuracy rate of 79.3%. Of note, this level of accuracy was reached without any specialized fine-tuning of the model and solely using prompts in Korean, underscoring the significance of our results.
The analysis showed the significantly superior performance of GPT-4 across various subspecialties and question types compared with that of GPT-3.5, with accuracy rates ranging from 76.7% to 85.7%. However, there were three questions (2.6%) that GPT-3.5 answered correctly but GPT-4 did not (Table 2). Despite GPT-4’s overall higher accuracy, the reasons for its incorrect responses in these cases remain unclear; variations in the training data, model architecture, or other factors may have influenced the disparate results between the two models.
GPT-4 demonstrated a more robust ability to interpret data and solve problems than to recall facts. These skills, which are crucial for the efficacy of sophisticated AI systems, were evident in GPT-4’s contextual comprehension and inferential methodology, and are essential for identifying and analyzing intricate data patterns. The capability of the LLM to synthesize and utilize various information for inventive problem solving demonstrated its near-human level of reasoning [11]. However, there are shortcomings to this model, particularly in terms of recall. GPT-4’s knowledge was limited to the data available at the time of its last training update, which was April 2023. Additionally, the model cannot update information in real-time, which affects its performance in instances where access to the most recent data is vital [12].
Furthermore, GPT-4 was more proficient in answering obstetric than gynecological questions, although the reason for this difference is not entirely understood. Because GPT-4’s knowledge was capped at the date of its most recent training update, and the field of obstetrics may have undergone relatively fewer changes than gynecology between that update and the examination, the model’s knowledge base may have contributed to this difference. Thus, a more comprehensive analysis with additional questions is needed to gain a deeper understanding of this phenomenon.
The authors strongly suggest that the Ob/Gyn community proactively embrace these technological advances to improve patient safety and enhance the quality of care. It is vital to shift Ob/Gyn education from conventional memorization-based learning toward a strategy of defining problems in specific clinical scenarios and gathering the data necessary for problem solving [13]. As generative AI models, LLMs offer solutions to specific problems, with the answer quality depending on the questions posed. For Ob/Gyn physicians, comprehensive history-taking and physical examinations are essential for the accurate identification of patient issues. Supplying LLMs with detailed accounts of a patient’s primary complaint(s), current illness(es), and physical examination results can aid in making decisions regarding diagnostic testing and treatment plans in a particular clinical context. Nonetheless, practitioners must remember that LLMs are not substitutes for the fundamental elements of patient care, such as forging strong patient relationships and listening attentively to their concerns [14].
Ob/Gyn physicians who completed their training more than 10 years ago may find LLMs beneficial for continuing medical education (CME) [15,16]. When a significant amount of time has passed since one’s initial training, it can be a challenge to keep up with new developments, potentially leading to the use of outdated treatment methods. Although many Ob/Gyn societies offer dedicated CME programs, changing established clinical protocols can be complicated. Utilizing a current LLM as an auxiliary tool for decision making may offer these physicians an extra avenue with which to keep abreast of the latest information and pursue evidence-based treatment in their patient care [17].
In healthcare, decision making critically influences patient safety and requires greater precision and a more cautious approach to procedural modifications than in other fields. Although GPT-4 achieved 79.3% accuracy on the Ob/Gyn examination, it is essential to recognize that LLMs are generative models, sometimes labeled “stochastic parrots” [18]. Rather than delivering strictly accurate information, these models provide answers based on the likelihood of the most fitting words from their training data. Therefore, the existing level of accuracy does not meet the standards for direct use in patient care.
ChatGPT is just one LLM; other models launched less than a year ago also demonstrate similarly exceptional capabilities. Microsoft recently unveiled BioGPT, an LLM specializing in PubMed literature, while Meta AI has rolled out Llama, which features a user-friendly application programming interface for broader innovation and customization [19,20]. These innovations indicate that future LLMs can be expected to undergo training with a broader and more varied range of medical data, offering specialized expertise in healthcare. Moreover, the GPT-4 framework is capable of processing and interpreting visual content such as images and videos, which suggests its potential future performance on datasets that include clinical images and surgical footage. Such advancements would increase the relevance of GPT-4 in Ob/Gyn subspecialties, expanding its function beyond text-based applications and providing more thorough insights into intricate clinical situations, thereby aiding healthcare professionals in decision making and subsequently enhancing patient care.
This study’s limitations primarily stem from the exclusion of visual data, such as clinical images, radiological scans, and graphical representations, which are essential components of medical education and practice, especially in a field like Ob/Gyn. The LLMs evaluated in this study, including GPT-3.5 and GPT-4, were primarily designed for text-based comprehension and were not equipped to process or interpret visual information. As a result, questions involving visual elements, which are often critical for accurate diagnosis and treatment decisions in clinical practice, were intentionally omitted from our dataset. Consequently, it remains uncertain whether GPT-4, despite its superior performance on text-based questions, would maintain its edge over human residents when faced with cases that require the interpretation of visual data, such as ultrasound images, fetal heart rate tracings, or histopathological slides.
Nevertheless, this study provides significant insights into the evolving role of LLMs in clinical education by demonstrating their ability to process complex clinical information and make relevant clinical decisions based solely on textual data. While the current study did not address the integration of multimodal data, it serves as an important step toward understanding the potential of LLMs in enhancing medical education, particularly in the Ob/Gyn domain. Future research that incorporates visual data will be necessary to fully assess the capabilities and limitations of these models in real-world clinical scenarios where visual interpretation is crucial.
In conclusion, GPT-4 exhibited impressive performance in processing intricate clinical data, achieving 79.3% accuracy on the Ob/Gyn examination. However, it is imperative to acknowledge the constraints of LLMs; their utilization should augment, not replace, human expertise and discernment.
Notes
Conflicts of interest
The other authors also have no conflicts of interest.
Ethical approval
Not applicable.
Patient consent
Not applicable.
Funding information
None.