Introduction
Recent breakthroughs in large language model (LLM) technology have significantly transformed the landscape of artificial intelligence (AI) [1]. Among current LLMs, OpenAI’s Chat Generative Pre-trained Transformer (ChatGPT), launched in November 2022, has emerged as a noteworthy innovation [1]. ChatGPT has effectively analyzed and applied knowledge in specialized areas such as medicine, law, and business management, which are typically reserved for subject matter experts. Remarkably, this system has attained substantial accuracy, passing challenging assessments such as the United States Medical Licensing Examination, the bar examination, the Wharton Master of Business Administration final, and other medical examinations, with its pre-existing training alone and without any additional fine-tuning [2-7].
Becoming proficient in obstetrics and gynecology (Ob/Gyn) is a long journey that integrates theoretical study, experiential learning, and closely monitored clinical practice [8]. Throughout their training, Ob/Gyn residents collaborate with senior practitioners to acquire hands-on experience in patient management, surgical procedures, and clinical judgment. They also participate in a wide range of didactic sessions and seminars that cover the core concepts of Ob/Gyn, as well as medical and surgical skills and methods. This extensive training means that mastering this field often takes at least 10 years [8]. Therefore, it is crucial to evaluate the potential role of emerging technologies such as AI and LLMs in enhancing the educational process [9,10].
This study aimed to utilize ChatGPT to analyze Korean Ob/Gyn examinations and determine whether LLMs exhibit expert-level understanding. We also compared the capabilities of GPT-3.5 with those of GPT-4.
Results
The comparative analysis of GPT-3.5 and GPT-4, as shown in Table 1, revealed a significant difference in their performance. GPT-4 demonstrated a markedly higher accuracy rate (79.3%), correctly answering 92 of the 116 questions, whereas GPT-3.5 exhibited lower accuracy (38.8%), correctly answering only 45 of the 116 questions.
Table 2 is a 2×2 contingency table summarizing the comparative performance of GPT-3.5 and GPT-4. The two models answered the same 42 questions correctly (36.2%), whereas GPT-3.5 correctly answered only three questions (2.6%) that GPT-4 answered incorrectly. In contrast, GPT-4 correctly answered 50 questions (43.1%) that GPT-3.5 missed, and 21 questions (18.1%) were answered incorrectly by both models. Overall, Table 2 underscores the superior performance of GPT-4, which achieved a higher rate of correct answers (79.3%).
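For readers who wish to reproduce the headline comparison from the counts in Table 2, the sketch below (Python) recomputes both accuracy rates and applies an exact McNemar test to the discordant pairs. The use of McNemar's test here is an illustrative assumption rather than a restatement of the study's statistical methods, which are not reproduced in this section.

```python
# A minimal sketch of the paired comparison implied by Table 2,
# assuming a McNemar test on the discordant pairs (the study's
# Methods may specify a different analysis).
from statsmodels.stats.contingency_tables import mcnemar

# Rows: GPT-4 correct / incorrect; columns: GPT-3.5 correct / incorrect.
# Counts taken from Table 2 (n = 116 questions).
table = [[42, 50],
         [3, 21]]

gpt4_correct = table[0][0] + table[0][1]   # 92 questions
gpt35_correct = table[0][0] + table[1][0]  # 45 questions
n = sum(sum(row) for row in table)

print(f"GPT-4 accuracy:   {gpt4_correct / n:.1%}")   # 79.3%
print(f"GPT-3.5 accuracy: {gpt35_correct / n:.1%}")  # 38.8%

# The exact McNemar test uses only the discordant cells (50 vs. 3).
result = mcnemar(table, exact=True)
print(f"McNemar p-value:  {result.pvalue:.2e}")
```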
As shown in Fig. 2, the performance of GPT-4 was comparable to that of first-, second-, and third-year residents, with no statistically significant difference. Furthermore, when the questions were categorized as obstetrics or gynecology, GPT-4 performed particularly well in obstetrics, showing significantly better performance in that field (P=0.015) (Fig. 3).
In the comparative analysis of question types, both GPT-3.5 and GPT-4 performed better on I- and P-type questions than on R-type questions. Specifically, GPT-4 achieved a 100.0% success rate on I-type questions and 79.2% on P-type questions, whereas its performance on R-type questions was lower, at 33.3% (Table 3).
Discussion
The main goal of this study was to quantitatively evaluate the capability of ChatGPT to understand intricate clinical data and to examine the possible impact of LLM technology on Ob/Gyn education and training. We assessed the efficacy of ChatGPT using questions from the Ob/Gyn examination and found that the GPT-4 model achieved an accuracy rate of 79.3%. Of note, this level of accuracy was reached without any specialized fine-tuning of the model and solely using prompts in Korean, underscoring the importance of our results.
The analysis showed the significantly superior performance of GPT-4 across various subspecialties and question types compared to GPT-3.5, with accuracy rates ranging from 76.7% to 85.7%. However, there were three questions (2.6%) that GPT-3.5 answered correctly but GPT-4 did not (Table 2). Despite GPT-4’s overall higher accuracy, the reasons for its incorrect responses in these cases remain unclear. Pinpointing the precise cause is difficult, as variations in the training data, model architecture, or other factors may have contributed to the differing results of the two models.
GPT-4 demonstrated a more robust ability to interpret data and solve problems than to recall facts. These skills, which are crucial for the efficacy of sophisticated AI systems, were evident in GPT-4’s contextual comprehension and inferential methodology, and are essential for identifying and analyzing intricate data patterns. The capability of the LLM to synthesize and apply diverse information for inventive problem solving demonstrated its near-human level of reasoning [11]. However, the model has shortcomings, particularly in terms of recall. GPT-4’s knowledge was limited to the data available at the time of its last training update, which was April 2023. Additionally, the model cannot update information in real time, which affects its performance in instances where access to the most recent data is vital [12].
Furthermore, GPT-4 was more proficient in answering obstetric than gynecological questions, although the reason for this difference is not entirely understood. Because GPT-4’s knowledge was capped at the date of its most recent training update, and obstetrics may have seen relatively fewer developments than gynecology between that update and the examination, the model’s knowledge base may partly account for this difference. A more comprehensive analysis with additional questions is therefore needed to gain a deeper understanding of this phenomenon.
The authors strongly suggest that the Ob/Gyn community proactively embrace these technological advances to improve patient safety and enhance quality of care. It is vital to shift Ob/Gyn education from conventional memorization-based learning to a strategy of defining problems in specific clinical scenarios and gathering the data necessary for problem solving [13]. As generative AI models, LLMs offer solutions to specific problems, with the answer quality depending on the questions posed. For Ob/Gyn physicians, comprehensive history-taking and physical examinations are essential for the accurate identification of patient issues. Supplying LLMs with detailed accounts of a patient’s primary complaint(s), current illness(es), and physical examination results can aid in making decisions regarding diagnostic testing and treatment plans in a particular clinical context. Nonetheless, practitioners must remember that LLMs are not substitutes for the fundamental elements of patient care, such as forging strong patient relationships and listening attentively to patients’ concerns [14].
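As a concrete illustration of this workflow, and not of the protocol used in this study, the sketch below shows how a structured summary of a patient’s complaint, history, and examination findings might be supplied to a general-purpose LLM through the OpenAI Python SDK. The model name, system instructions, and vignette are hypothetical placeholders, and any output would still need to be checked against current guidelines and the clinician’s own judgment.

```python
# Illustrative sketch only: passing a structured history and examination
# summary to an LLM for decision support. The model name, prompt wording,
# and vignette are placeholders, not the study's protocol or a real patient.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = (
    "Chief complaint: 28-year-old G1P0 at 34 weeks with new-onset headache.\n"
    "History: blood pressure 152/98 mmHg on two readings, 2+ proteinuria.\n"
    "Examination: no clonus, fundal height consistent with dates."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are assisting an obstetrician. Suggest a differential "
                    "diagnosis and next diagnostic steps; do not give final "
                    "treatment decisions."},
        {"role": "user", "content": vignette},
    ],
)

print(response.choices[0].message.content)
```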
Ob/Gyn physicians who completed their training more than 10 years ago may find LLMs beneficial for continuing medical education (CME) [15,16]. When a significant amount of time has passed since one’s initial training, it can be a challenge to keep up with new developments, potentially leading to the use of outdated treatment methods. Although many Ob/Gyn societies offer dedicated CME programs, changing established clinical protocols can be complicated. Utilizing a current LLM as an auxiliary tool for decision making may offer these physicians an additional avenue for keeping abreast of the latest information and pursuing evidence-based treatment in their patient care [17].
In healthcare, decision making critically influences patient safety and requires greater precision and a more cautious approach to procedural modifications than in other fields. Although GPT-4 achieved 79.3% accuracy on the Ob/Gyn examination, it is essential to recognize that LLMs are generative models, sometimes labeled “stochastic parrots” [18]. Rather than delivering strictly accurate information, these models provide answers based on the likelihood of the most fitting words from their training data. Therefore, the current level of accuracy does not meet the standards for direct use in patient care.
ChatGPT is just one LLM; other models, launched less than a year ago, also demonstrate the aforementioned exceptional capabilities. Microsoft recently unveiled BioGPT, an LLM specializing in PubMed literature, while Meta AI has rolled out Llama, which features a user-friendly application programming interface for broader innovation and customization [19,20]. These innovations indicate that future LLMs can be expected to undergo training with a broader and more varied range of medical data, offering specialized expertise in healthcare. Moreover, the GPT-4 framework is capable of processing and interpreting visual content such as images and videos, which suggests that it could eventually be applied to datasets that include clinical images and surgical footage. Such advancements would increase the relevance of GPT-4 in Ob/Gyn subspecialties, expanding its function beyond text-based applications, providing more thorough insights into intricate clinical situations, aiding healthcare professionals in decision making, and ultimately enhancing patient care.
This study’s limitations primarily stem from the exclusion of visual data, such as clinical images, radiological scans, and graphical representations, which are essential components of medical education and practice, especially in a field like Ob/Gyn. The LLMs evaluated in this study, including GPT-3.5 and GPT-4, were primarily designed for text-based comprehension and were not equipped to process or interpret visual information. As a result, questions involving visual elements, which are often critical for accurate diagnosis and treatment decisions in clinical practice, were intentionally omitted from our dataset. Consequently, it remains uncertain whether GPT-4, despite its superior performance on text-based questions, would maintain its edge over human residents when faced with cases that require the interpretation of visual data, such as ultrasound images, fetal heart rate tracings, or histopathological slides.
Nevertheless, this study provides significant insights into the evolving role of LLMs in clinical education by demonstrating their ability to process complex clinical information and make relevant clinical decisions based solely on textual data. While the current study did not address the integration of multimodal data, it serves as an important step toward understanding the potential of LLMs in enhancing medical education, particularly in the Ob/Gyn domain. Future research that incorporates visual data will be necessary to fully assess the capabilities and limitations of these models in real-world clinical scenarios where visual interpretation is crucial.
In conclusion, GPT-4 exhibited impressive performance in processing intricate clinical data, achieving 79.3% accuracy on the Ob/Gyn examination. However, it is imperative to acknowledge the constraints of LLMs and to ensure that their use augments, rather than replaces, human expertise and discernment.