Clinical utility assessment framework for machine learning-based fetal health classification in cardiotocography: an observational study
Abstract
Objective
To evaluate the clinical utility and implementation considerations of artificial intelligence (AI)-based fetal health classification systems using the Kaggle Fetal Health Classification dataset, with a focus on obstetric physicians’ perspectives.
Methods
We analyzed the Kaggle Fetal Health Classification dataset (n=2,126), containing 21 cardiotocography parameters. Five machine-learning algorithms were evaluated: logistic regression, random forest, gradient boosting, support vector machine, and decision tree. Class weighting was applied to address the dataset imbalance. The model performance was assessed using standard classification metrics. An expert opinion-based clinical utility assessment framework was developed to assess interpretability, workflow integration, and safety.
Results
With class weighting applied, gradient boosting achieved the highest accuracy (89.67%), followed by random forest (88.50%) and logistic regression (82.16%). The most important predictive features were abnormal short-term variability (16.23% importance) and the percentage of time with abnormal long-term variability (13.21% importance). An analysis of all 21 features revealed that contraction-related parameters, including uterine_contractions, contributed minimally to the classification performance. The 35.3% false negative rate for pathological cases represents a significant safety concern and requires physician oversight.
Conclusion
AI-based fetal health classification systems show potential for future applications when properly validated. However, the significant false negative rate for pathological cases indicates that these systems cannot function independently. External validation using multicenter clinical data and prospective outcome studies is essential before clinical implementation.
Introduction
Among the major technological advances in obstetric practice, cardiotocography (CTG) and ultrasonography are cornerstone methods for fetal health assessment during pregnancy and labor [1]. CTG provides real-time insights into fetal heart rate patterns and uterine contractions, offering healthcare providers valuable information about fetal well-being and enabling timely clinical interventions when necessary [2].
Despite its widespread clinical adoption and proven utility, CTG trace interpretation remains one of the most challenging and subjective aspects of modern obstetric practice. The complexity of fetal heart rate patterns, combined with the dynamic nature of labor and delivery, creates scenarios in which even experienced obstetricians and midwives may disagree on the interpretation of identical CTG recordings [3]. This inherent subjectivity has been extensively documented, with studies consistently showing significant inter- and intra-observer variability among healthcare professionals across different levels of experience and training [4].
The clinical implications of this interpretive variability extend beyond academic interest, directly impacting patient care quality, clinical decision-making, and ultimately maternal and fetal outcomes. When healthcare providers disagree on CTG interpretation, inconsistencies in clinical management can lead to false-positive and false-negative assessments [5]. False-positive interpretations may result in unnecessary interventions, including emergency cesarean deliveries or other invasive procedures that carry inherent risks for both the mother and baby. Conversely, false-negative interpretations may lead to missed opportunities for timely intervention in cases where fetal compromise is present, potentially resulting in adverse outcomes.
Recently, artificial intelligence (AI) and machine learning technologies have opened up new possibilities for addressing these longstanding challenges in fetal health monitoring. Machine learning algorithms have demonstrated capabilities in pattern recognition, data analysis, and predictive modeling across diverse medical domains, ranging from diagnostic imaging and cardiac arrhythmia detection [6,7] to clinical decision support systems [8]. The application of these computational approaches to fetal health classification represents a promising avenue for improving the objectivity, consistency, and accuracy of CTG interpretation, while maintaining the essential role of clinical expertise in patient care [9].
Recent developments in large language models and conversational AI systems have further expanded the potential applications of AI in healthcare, showing promise for diagnostic support, clinical education, and patient communication in obstetric and gynecological practices [10]. These technological advances provide an important context for understanding the broader landscape of AI applications in obstetric care. Ahn and Lee [11] comprehensively reviewed AI applications in obstetrics, including preterm birth prediction, fetal growth assessment, and CTG interpretation, highlighting both the potential and limitations of various machine learning approaches in maternal-fetal medicine. Kim et al. [12] reviewed AI applications in obstetrics, demonstrating how AI technologies were integrated into CTG, ultrasonography, and magnetic resonance imaging diagnostics in clinical settings, particularly emphasizing the potential of CTG automated interpretation systems.
This study aimed to provide a preliminary clinical utility assessment of AI-based fetal health classification from the perspective of an obstetric physician. We examined not only the technical performance of various machine learning algorithms but also their potential clinical utility, practical applicability, and considerations for future integration into real-world obstetric care settings. Our analysis considers the unique requirements and challenges of clinical practice, providing insights that can guide the responsible implementation of AI technologies in fetal health monitoring, while ensuring patient safety remains the primary consideration.
Materials and methods
1. Study design and clinical framework
This study employed an expert opinion-based clinical utility assessment approach to evaluate AI-based fetal health classification systems from an obstetric physician’s perspective. This study was limited to internal validation using a single publicly available dataset and did not constitute external validation. An expert opinion-based clinical utility assessment framework was developed through consultation with experienced obstetricians and maternal-fetal medicine specialists to ensure that the evaluation criteria reflected real-world clinical needs and priorities.
2. Dataset description
The Kaggle Fetal Health Classification dataset served as the foundation for our analysis, providing a standardized benchmark for machine learning approaches to fetal health classification [13]. This dataset contains 2,126 CTG records collected from fetal monitoring sessions conducted in clinical settings, with each record characterized by 21 carefully extracted features representing various aspects of fetal heart rate patterns and uterine activity (Table 1). The dataset classification scheme divided the fetal health status into three clinically relevant categories: normal (n=1,693 [79.6%]), suspect (n=265 [12.5%]), and pathological (n=168 [7.9%]). This classification system aligns closely with established clinical practice in which CTG interpretations are typically categorized into similar risk-based classifications that guide clinical decision-making processes [14].
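As a quick check on the class distribution described above, the following sketch recomputes the proportions from the reported counts and derives the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative, and the majority-class baseline is added here only as context for reading accuracy figures on an imbalanced dataset; it is not part of the reported analysis.

```python
# Class counts as reported in the dataset description.
counts = {"normal": 1693, "suspect": 265, "pathological": 168}
total = sum(counts.values())

# Share of each class in the full dataset.
proportions = {label: n / total for label, n in counts.items()}

# A classifier that always outputs the most frequent class is correct
# exactly as often as that class occurs.
majority_label = max(counts, key=counts.get)
majority_baseline_accuracy = proportions[majority_label]

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"always predicting '{majority_label}': {majority_baseline_accuracy:.1%} accuracy")
```

Any reported accuracy is therefore best read against this baseline rather than against zero.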
3. Machine learning model development
Five machine learning algorithms were selected for evaluation based on their widespread use in medical applications and clinical interpretability requirements: logistic regression, random forest, gradient boosting, support vector machine (SVM), and decision tree classifiers [15–17]. Similar machine-learning classifiers have been applied in other areas, such as gynecologic oncology, demonstrating the versatility of these approaches across the obstetric and gynecologic domains [18]. Our methodological approach prioritized interpretable machine learning models over deep learning approaches to ensure clinical transparency and physician trust, aligning with the study’s core objective of evaluating AI systems from a physician’s perspective. Model training was conducted using a stratified train-test split (80% training, 20% testing) to ensure representative sampling across all three fetal health categories.
To address the significant class imbalance in the dataset (normal: 79.6%; suspect: 12.5%; pathological: 7.9%), class weighting was applied using weights that were inversely proportional to the class frequencies. This approach ensures that minority classes (suspect and pathological) receive appropriate emphasis during model training without requiring synthetic data generation.
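The weighting scheme can be illustrated with a short calculation. The text does not state the exact formula, so this sketch assumes the common "balanced" heuristic (the one scikit-learn applies for class_weight="balanced"), in which each class weight equals n_samples / (n_classes × class count). It uses the full-dataset counts for simplicity; in practice the weights would be derived from the training split only.

```python
# Class counts as reported for the full dataset (assumption: in a real run
# the counts of the training split would be used instead).
counts = {"normal": 1693, "suspect": 265, "pathological": 168}
n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so weights are inversely proportional to class frequency.
class_weights = {
    label: n_samples / (n_classes * n) for label, n in counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: weight {weight:.3f}")

# With these weights every class contributes the same total weight.
weighted_totals = {label: class_weights[label] * counts[label] for label in counts}
```

Under this heuristic a misclassified pathological record is penalized roughly ten times more heavily than a misclassified normal record, which is how the minority classes gain emphasis without synthetic samples.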
All experiments were conducted using Python 3.8 with scikit-learn version 0.24.2 [19]. A random seed of 42 was used for reproducibility. The dataset was divided into training (80%) and testing (20%) sets using stratified sampling. Five-fold cross-validation was performed on the training set for model selection. Specific hyperparameters were as follows: logistic regression (C=1.0; max_iter=1000; solver='lbfgs'; multi_class='multinomial'); random forest (n_estimators=100; max_depth=None; min_samples_split=2); gradient boosting (n_estimators=100; learning_rate=0.1; max_depth=3); SVM (C=1.0; kernel='rbf'; gamma='scale'); decision tree (max_depth=None; min_samples_split=2; criterion='gini').
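The configuration described above might be set up roughly as follows. This is a minimal sketch, not the authors' code: it uses a synthetic stand-in for the CTG table, omits the cross-validation step, and assumes that class weighting was passed through class_weight="balanced" where the estimator supports it and through per-sample weights for gradient boosting, whose scikit-learn implementation has no class_weight parameter. The multi_class argument is left out because recent scikit-learn releases deprecate it; with the lbfgs solver a multinomial model is fitted for multiclass targets by default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight

SEED = 42

# Synthetic stand-in for the CTG table (assumption): 2,126 rows, 21 features,
# three classes with roughly the reported 79.6/12.5/7.9% split.
X, y = make_classification(
    n_samples=2126, n_features=21, n_informative=8, n_classes=3,
    weights=[0.796, 0.125, 0.079], random_state=SEED,
)

# Stratified 80/20 split keeps the class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

models = {
    "logistic_regression": LogisticRegression(
        C=1.0, max_iter=1000, solver="lbfgs",
        class_weight="balanced", random_state=SEED),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=None, min_samples_split=2,
        class_weight="balanced", random_state=SEED),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=SEED),
    "svm": SVC(C=1.0, kernel="rbf", gamma="scale",
               class_weight="balanced", random_state=SEED),
    "decision_tree": DecisionTreeClassifier(
        max_depth=None, min_samples_split=2, criterion="gini",
        class_weight="balanced", random_state=SEED),
}

# Gradient boosting receives the same balancing through per-sample weights.
train_weights = compute_sample_weight("balanced", y_train)

accuracy = {}
for name, model in models.items():
    if name == "gradient_boosting":
        model.fit(X_train, y_train, sample_weight=train_weights)
    else:
        model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, value in accuracy.items():
    print(f"{name}: {value:.2%}")
```

Because the data are synthetic, the printed accuracies are not comparable to the results reported in this study; the sketch only shows how the stated hyperparameters map onto estimator arguments.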
4. Expert opinion-based clinical utility assessment framework
An expert opinion-based clinical utility assessment framework was developed to systematically assess the clinical suitability of AI models beyond the technical performance metrics. This framework evaluates six key criteria: technical performance, interpretability, workflow integration, safety considerations, training requirements, and cost-effectiveness. Each criterion was assessed on a five-point scale based on expert opinions from three maternal-fetal medicine specialists with more than 10 years of CTG interpretation experience. Inter-rater reliability was assessed using Fleiss’ kappa (kappa=0.68, indicating substantial agreement). This assessment represented preliminary expert opinions rather than validated clinical outcomes.
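For readers unfamiliar with the statistic, the sketch below shows how Fleiss' kappa can be computed for three raters scoring six criteria on a five-point scale. The function name and the ratings are hypothetical and chosen only for illustration; they are not the scores given by the specialists in this study and do not reproduce the reported kappa.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`, a list of items where each item is the
    list of category labels assigned by the raters (same rater count per item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # For each item, count how many raters chose each category.
    tables = [Counter(item) for item in ratings]

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in table.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for table in tables
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of each category.
    p_cats = [
        sum(table[cat] for table in tables) / (n_items * n_raters)
        for cat in categories
    ]
    p_e = sum(p * p for p in p_cats)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical scores: six criteria (rows), three raters (columns).
example_scores = [
    [3, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [2, 2, 2],
    [3, 4, 3],
    [3, 3, 2],
]
example_kappa = fleiss_kappa(example_scores, categories=[1, 2, 3, 4, 5])
print(f"Fleiss' kappa (hypothetical ratings): {example_kappa:.2f}")
```

Note that this form of the statistic treats the five scale points as unordered categories, so a disagreement of one point counts the same as a disagreement of four.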
5. Ethical approval
This study used a publicly available, de-identified dataset (Kaggle Fetal Health Classification dataset). Because the dataset did not contain personally identifiable information, institutional review board approval and informed consent were not required.
Results
1. Model performance analysis
The evaluation of the five machine-learning algorithms revealed substantial variations in performance across different metrics. Table 2 presents the results with and without class weighting applied to address dataset imbalance. With class weighting applied, gradient boosting achieved the highest overall accuracy (89.67%), followed by random forest (88.50%) and logistic regression (82.16%). The superior performance of ensemble methods such as random forest and gradient boosting aligns with theoretical expectations because these algorithms combine multiple base learners to create more robust predictive models [16,17].
2. Feature importance analysis
Analysis of feature importance provided insights into which CTG parameters were the most influential in determining fetal health classification (Table 3). All 21 features from the dataset were analyzed and ranked according to their contribution to the predictive performance of the random forest model. The most important predictive feature was abnormal short-term variability, accounting for 16.23% of the predictive power of the model, followed by the percentage of time with abnormal long-term variability (13.21%). This finding aligns well with established clinical knowledge, as reduced or absent short-term variability in fetal heart rate is widely recognized as an indicator of potential fetal compromise that requires clinical attention [20].
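A ranking of this kind can be produced from a fitted random forest as sketched below. The text does not specify which importance measure was used, so the sketch assumes scikit-learn's default impurity-based feature_importances_, which sum to 1 and can be read as percentages. The data and the feature names (feature_0 to feature_20) are synthetic placeholders, not the CTG parameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data with 21 features and three classes (assumption).
X, y = make_classification(
    n_samples=500, n_features=21, n_informative=5, n_classes=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
importances = forest.feature_importances_

# Sort features from most to least important and express each as a percentage.
order = np.argsort(importances)[::-1]
ranking = [
    (rank + 1, feature_names[i], 100 * importances[i])
    for rank, i in enumerate(order)
]

for rank, name, pct in ranking[:5]:
    print(f"{rank:2d}. {name}: {pct:.2f}%")
```

Impurity-based importances can be biased toward features with many distinct values, so permutation importance on held-out data is a common cross-check.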
Notably, several features that clinicians might expect to be important contributed minimally to the classification. The uterine_contractions feature ranked 17th (1.37%), and light_decelerations ranked last among all features (0.19%). These findings have significant clinical implications. In clinical CTG interpretation, the temporal relationship between uterine contractions and fetal heart rate decelerations is critical for distinguishing late from early decelerations and assessing fetal reserve [14,20]. The low importance of contraction-related features in machine learning models suggests that the dataset may not adequately capture the dynamic temporal relationship between contractions and heart rate patterns, or that the extracted features may not represent the sequential and contextual information that clinicians rely on when interpreting these patterns. Similarly, prolongued_decelerations (0.93%) and severe_decelerations (0.63%) showed low importance, likely reflecting their rarity in the dataset rather than their clinical insignificance. A recent scoping review by Francis et al. [21] identified concerns over the practicality and generalizability of machine learning approaches applied to CTG data, particularly when using summary features rather than raw time-series signals. These findings underscore a fundamental limitation of applying machine learning to summary-level CTG data: the static features available in this dataset do not fully represent the temporal waveform patterns that clinicians evaluate in real-time practice.
3. Class-specific performance analysis
A detailed analysis of class-specific performance revealed important insights into the strengths and limitations of AI models for different categories of fetal health status (Table 4). The balanced (class-weighted) model showed improved performance for minority classes compared with the original unweighted model. The analysis revealed excellent performance in identifying normal cases (100% recall), improved performance for pathological cases (64.7% recall with the balanced model vs. 41.2% with the original), and poor performance for suspect cases (34.0% recall). From a clinical safety perspective, the 35.3% false-negative rate for pathological cases (improved from 58.8% in the original model) represents a significant concern that must be addressed through appropriate implementation strategies under physician supervision.
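The relationship between the recall and false-negative figures above is simple arithmetic: the false-negative rate is 1 minus recall. The sketch below makes this explicit. The underlying counts are not given in the text, so the values used here are an assumption: a 20% stratified test split of 168 pathological records yields about 34 cases, and 22 and 14 correct detections out of 34 are consistent with the reported percentages.

```python
def recall_and_fnr(true_positives, false_negatives):
    """Return (recall, false-negative rate) for one class."""
    total = true_positives + false_negatives
    recall = true_positives / total
    return recall, 1.0 - recall


# Assumed counts for the pathological class (about 34 test cases);
# these are inferred for illustration, not taken from the study's tables.
balanced_recall, balanced_fnr = recall_and_fnr(true_positives=22, false_negatives=12)
original_recall, original_fnr = recall_and_fnr(true_positives=14, false_negatives=20)

print(f"balanced model: recall {balanced_recall:.1%}, false-negative rate {balanced_fnr:.1%}")
print(f"original model: recall {original_recall:.1%}, false-negative rate {original_fnr:.1%}")
```

Under these assumed counts, a 35.3% false-negative rate corresponds to about 12 missed pathological cases in a test set of roughly 34, which illustrates how few cases underlie the estimate.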
4. Expert opinion-based clinical utility assessment
The expert opinion-based clinical utility assessment framework was used to evaluate six key criteria (Table 5). Expert assessment revealed moderate technical performance (score 3/5), potentially suitable for decision support but insufficient for independent clinical use. Safety considerations received a poor score (2/5) because of the persistent false-negative rate for pathological cases. Interpretability scored well (4/5), as the feature importance aligned with established clinical knowledge.
Discussion
This preliminary expert opinion-based clinical utility assessment of AI-based fetal health classification systems from an obstetric physician’s perspective provides initial insights into both the potential and limitations of these technologies for clinical implementation. The moderate but realistic performance levels achieved in this study (89.7% accuracy for the best-performing balanced model) suggest that these systems may be more valuable as decision support tools than autonomous diagnostic systems, which has important implications for the integration of such technologies into clinical workflows [22].
A key finding of this study was that the best-performing model achieved 89.7% accuracy. This figure may appear modest compared with some technical studies that reported accuracies of over 99%. However, this raw accuracy must be interpreted within the context of clinical reality, where the primary challenge is not necessarily achieving perfect prediction, but overcoming the significant inter-observer variability inherent in the human interpretation of CTG traces. Multiple studies have shown that even experienced clinicians often disagree on the classification of the same CTG trace, with reported agreement levels (measured by the kappa statistic) often falling into the ‘fair’ to ‘moderate’ range [3–5]. In a study by Uccella et al. [23], a standardized algorithm substantially improved interpretation agreement (kappa=0.85) compared to subjective visual interpretation (kappa=0.24). This strongly suggests that the primary value of an AI system lies in providing a standardized, objective baseline that reduces the well-documented subjectivity and variability of human interpretation.
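The verbal labels attached to kappa values can be made concrete with a small lookup. This sketch assumes the widely used Landis and Koch benchmarks, which the 'fair' and 'moderate' wording above appears to follow; the function name is illustrative, and the cut-offs are conventions rather than statistical thresholds.

```python
def agreement_band(kappa):
    """Map a kappa value to the Landis and Koch descriptive label."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    raise ValueError("kappa cannot exceed 1")


# Kappa values quoted in the paragraph above.
for value in (0.24, 0.85):
    print(f"kappa={value:.2f}: {agreement_band(value)}")
```

On this scale, the shift from 0.24 to 0.85 moves interpretation agreement from the second-lowest band to the highest.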
The prominence of physiologically relevant features, particularly heart rate variability measures, in AI model predictions provides reassurance that these systems focus on clinically meaningful patterns rather than spurious correlations. This alignment between the importance of AI features and established clinical knowledge is important for physicians’ acceptance of and confidence in AI recommendations. The finding that abnormal short-term variability and the percentage of time with abnormal long-term variability are the most important predictive features validates the clinical relevance of the AI approach and supports its potential for meaningful integration with existing clinical practice [20]. However, the low importance of contraction-related features and deceleration parameters remains an important issue. In standard clinical practice, the relationship between uterine contractions and fetal heart rate changes, particularly the timing of deceleration relative to contractions, is fundamental for CTG assessment [14,20]. The observation that these features contributed minimally to the classification suggests that the static feature extraction approach used in this dataset did not capture the temporal dynamics central to clinical interpretation. Hardalaç et al. [24] demonstrated that feature elimination strategies applied to CTG data could improve model accuracy to 97.20% while correctly predicting 100% of pathological cases, suggesting that careful feature engineering rather than simply including all available features may be more effective. Future AI systems should incorporate time-series analysis methods that can model sequential relationships between contractions and heart rate responses.
A critical finding of this study is the persistent challenge of class imbalance even after applying balancing techniques. The false-negative rate of 35.3% for pathological cases, which improved from the original 58.8%, still represents a significant patient safety concern. In clinical practice, missing a pathological case can lead to delayed interventions and adverse fetal outcomes. This finding strongly supports the idea that AI systems in fetal monitoring must function as decision support tools with mandatory physician oversight and not as autonomous diagnostic systems.
The landscape of AI applications in Korean obstetric practice provides important context for understanding the potential implementation of these systems. Park et al. [25] analyzed 22,522 deliveries from 14 Korean hospitals and demonstrated that large-scale, multicenter CTG datasets can successfully support AI model development with external validation, achieving areas under the curve of 0.862–0.895. Their study represents an advancement beyond publicly available datasets by incorporating real clinical data with actual patient outcomes and multi-institutional validation.
The implementation of AI-based fetal health classification systems raises important safety considerations that must be carefully addressed. Modern clinical decision support systems must be designed with robust safeguards and physician oversight capabilities [26]. Based on the findings of this study, we propose a structured implementation framework for AI-based fetal health classification systems that prioritizes patient safety while maximizing the potential benefits of AI technology. The class-specific analysis showed excellent performance for normal cases (100% recall), which suggests particular value for screening applications. Conversely, the low recall for suspect cases (34.0%) suggests that AI systems may have limited utility for these challenging intermediate cases, which often require clinical expertise and nuanced judgment, a challenge widely recognized by clinicians [27].
The performance achieved in this study (89.7% accuracy with balanced models) is lower than that of several recently reported results. Studies reporting greater than 99% accuracy using the same Kaggle dataset often employ extensive hyperparameter optimization and cross-validation strategies, which may not be generalizable to real clinical settings. Several important research directions emerged from this analysis that could enhance the clinical utility of AI-based fetal health classification systems. The integration of additional clinical data sources and explainable AI techniques specifically designed for medical applications can address interpretability challenges that currently limit clinical acceptance [28,29]. Although other studies have reported accuracies exceeding 95% [30], our study intentionally prioritized clinical realism and physician-centered assessment over maximal technical optimization.
Recent international studies have demonstrated promising results using advanced AI approaches. Mushtaq and Veningston [31] demonstrated that explainable deep learning models achieve high accuracy while maintaining their interpretability. Ensemble learning approaches have demonstrated strong performance in fetal health classification [32]. Studies integrating multimodal cardiotocographic and maternal clinical data have achieved 90.8% accuracy with balanced performance across risk categories [33]. Furthermore, reviews of AI applications in electronic fetal monitoring have identified both technological potential and significant gaps in external validation, emphasizing the need for rigorous clinical trials before large-scale adoption [21,34].
This study has several limitations that should be considered when interpreting the results. First, the evaluation was conducted using a single publicly available dataset, which may not fully represent the diversity of clinical scenarios encountered in real-world practice. Second, the clinical utility assessment framework relied on expert opinions rather than validated clinical outcome data. Third, the dataset lacked actual patient outcome data, such as Apgar scores, limiting our ability to assess the true clinical utility. Future studies should include a prospective evaluation of AI systems in clinical settings, with measurements of actual patient outcomes and clinical impact [28,29].
In conclusion, our preliminary assessment from a physician’s perspective suggests that AI-based systems show potential as decision-support tools for fetal health assessment. The realistic performance levels achieved in this study provide a practical foundation for future clinical studies. The strong alignment between the importance of AI features and clinical knowledge fosters confidence in these systems. However, the current limitations, especially the significant false-negative rate in detecting high-risk cases, demonstrate that these systems cannot replace the essential expertise and judgment of clinical physicians. External validation using multicenter clinical data and prospective outcome studies is essential before clinical implementation. The expanding application of AI in obstetrics beyond fetal monitoring further emphasizes the importance of continued assessment and physician oversight [35]. As AI continues to advance in obstetric practice, rigorous clinical validation and ongoing physician involvement will remain essential to ensure that these technologies enhance, rather than replace, clinical judgment, which is central to safe and effective patient care.
Notes
Conflict of interest
No potential conflict of interest relevant to this article was reported.
Patient consent
Not applicable.
Funding information
None.
Acknowledgments
The dataset used in this study is publicly available through Kaggle at: https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification.
