Artificial intelligence in preterm birth prediction: a narrative review of current approaches and clinical applicability
Abstract
Preterm birth remains the leading cause of neonatal morbidity and mortality worldwide, affecting approximately 13.4 million births annually. Despite advances in our understanding of risk factors, current clinical prediction methods have demonstrated limited accuracy in individual risk stratification. This narrative review examines the current landscape of artificial intelligence (AI) applications for preterm birth prediction and evaluates the methodological quality and clinical applicability across different data modalities. PubMed, Embase, and Web of Science were searched for studies that developed or validated machine learning models for predicting spontaneous preterm birth. The AI approaches identified include electronic health record-based models, deep learning for ultrasound image analysis, cervical texture and radiomics feature extraction, elastography-derived parameters, and multi-omics integration using transformer architectures. Area under the receiver operating characteristic curve values range from 0.61 to 0.89 across modalities. However, systematic reviews have identified significant methodological limitations: 79% of the studies had a high risk of bias according to the prediction model risk-of-bias assessment tool (PROBAST) criteria, with a median adherence to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement of only 49%. Common deficiencies include inadequate sample sizes, a lack of external validation, and failure to report calibration metrics. Although AI-based prediction shows promise, substantial improvements in methodological rigor are required before clinical implementation. Priority areas include rigorous external validation, adherence to TRIPOD+AI reporting standards, and prospective evaluation of clinical utility.
Introduction
Preterm birth, defined as delivery before 37 weeks of gestation, remains the leading cause of neonatal and long-term morbidity worldwide [1]. In 2020, an estimated 13.4 million infants were born preterm globally, accounting for 9.9% of all live births. Despite extensive research and clinical efforts over the preceding decade, no measurable reduction in the preterm birth rate has been achieved. The burden is particularly high in low- and middle-income countries, although high-income nations continue to face persistent rates that have proven resistant to intervention. Preterm birth has lifelong consequences beyond immediate neonatal complications, including an increased risk of neurodevelopmental disabilities, chronic respiratory disease, and cardiovascular morbidity [2].
The etiology of spontaneous preterm birth is multifactorial and involves complex interactions among genetic predisposition, infection and inflammation, uterine overdistension, cervical insufficiency, decidual hemorrhage, and psychosocial stressors [3]. This heterogeneity complicates efforts to develop accurate prediction tools because different pathophysiological pathways may predominate in different women. Current clinical approaches to preterm birth prediction rely primarily on cervical length measurement using transvaginal ultrasound and biochemical markers such as fetal fibronectin [4]. The International Society of Ultrasound in Obstetrics and Gynecology recommends universal cervical length screening between 18 weeks and 24 weeks of gestation, when resources permit, with vaginal progesterone administered to women with a short cervix [4,5]. Standardized cervical assessment protocols, including elastographic measurement techniques, have been developed to improve the reproducibility of these assessments [6].
However, these methods demonstrate only moderate predictive performance. A systematic review of fetal fibronectin testing in symptomatic women reported a positive likelihood ratio of 5.42 for delivery within 7–10 days [7]. The positive predictive value remains low in most clinical settings owing to the relatively low prevalence of preterm birth, limiting the clinical utility for individual patient management. Similarly, cervical length measurements show modest discrimination, with considerable overlap between women who deliver preterm and those who deliver at term.
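The dependence of positive predictive value on prevalence can be illustrated with a short calculation. The sketch below converts a pre-test probability into a post-test probability through the odds form of Bayes' theorem, using the positive likelihood ratio of 5.42 reported above [7]. The pre-test probabilities are hypothetical values chosen for illustration only, and the function name is an assumption rather than part of any cited study.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Update a pre-test probability with a likelihood ratio via odds."""
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre_test_prob must lie strictly between 0 and 1")
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


LR_POSITIVE = 5.42  # positive likelihood ratio reported in the cited review [7]

# Hypothetical pre-test probabilities of delivery within 7-10 days
# (illustrative assumptions, not data from any study).
for prevalence in (0.02, 0.05, 0.10):
    ppv = post_test_probability(prevalence, LR_POSITIVE)
    print(f"pre-test {prevalence:.0%} -> post-test after a positive result {ppv:.1%}")
```

Under these assumed prevalences, most women with a positive result would still not deliver within the window, which is the limitation described above.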
The application of artificial intelligence (AI) and machine learning to medical prediction has grown substantially over the past decade with demonstrated success in areas such as radiology, pathology, and cardiovascular risk assessment [8,9]. In obstetrics, systematic reviews have identified machine learning as a tool for predicting pregnancy complications including preterm birth, preeclampsia, and gestational diabetes [10]. Recent reviews have highlighted the expanding role of AI in obstetric practice, including fetal growth assessment, placental pathology analysis, and delivery outcome prediction [11]. Machine learning approaches have also been applied to other areas, such as gynecologic oncology, demonstrating the broad applicability of these methods [12]. The potential use of AI tools, including large language models, in Korean obstetric practice has also been recognized [13].
Machine learning algorithms offer theoretical advantages over traditional statistical methods for predicting preterm birth. They can integrate multiple risk factors simultaneously, identify nonlinear relationships and complex interactions, and extract predictive features from unstructured data such as medical images without the need for manual annotations. Methodological approaches have evolved from conventional algorithms, including logistic regression, random forests, and gradient boosting applied to electronic health record data, to deep learning methods capable of directly analyzing ultrasound images [14]. Recently, transformer-based architectures have enabled the integration of high-dimensional multi-omics data, including cell-free DNA and RNA profiles [15].
However, several challenges remain before AI-based predictions can be translated into routine clinical practice. Systematic reviews have consistently identified methodological limitations, including small sample sizes, lack of external validation, and poor adherence to reporting guidelines such as the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [16]. This narrative review summarizes the current evidence on AI-based preterm birth prediction, evaluates the methodological quality and clinical applicability across different data modalities, and discusses the gaps that must be addressed for successful clinical translation.
Materials and methods
1. Ethics statement
This was a literature-based study; therefore, neither approval by the Institutional Review Board nor informed consent was required.
2. Study design
This is a narrative review based on a comprehensive search of academic databases.
3. Information sources and search strategy
We searched the PubMed, Embase, and Web of Science databases from January 2015 to December 2025. Search terms included combinations of “preterm birth”, “premature birth”, “preterm delivery”, “preterm labor”, “machine learning”, “artificial intelligence”, “deep learning”, “neural network”, “prediction”, and “risk model”. The reference lists of the identified systematic reviews were screened for additional relevant studies. Studies were included if they developed or validated machine learning or AI models to predict spontaneous preterm birth. We prioritized systematic reviews and meta-analyses, followed by original studies with external validation, large sample sizes, and novel methodological approaches.
Results
1. Systematic reviews and methodological quality
Several systematic reviews have synthesized the rapidly growing literature on AI applications for preterm birth prediction; representative studies across different data modalities are summarized in Table 1. Sharifi-Heris et al. [17] identified 13 studies using electronic health record data and reported a wide range of area under the receiver operating characteristic curve (AUC) values. Substantial heterogeneity existed in study populations, feature selection approaches, and validation methods. Yang et al. [18] conducted a comprehensive meta-analysis of 29 prediction model studies and identified methodological limitations (Table 2). According to the prediction model risk-of-bias assessment tool (PROBAST) criteria, 79% of the studies had a high overall risk of bias, with the analysis domain being the most frequently problematic owing to inadequate sample sizes, selection of predictors based on univariable analysis, and lack of calibration evaluation. The median adherence to the TRIPOD reporting guidelines was only 49%, indicating that many studies failed to report the essential information required for replication and clinical implementation.
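Because AUC is the metric most often reported in these studies, a brief illustration of what it measures may be useful. The sketch below computes AUC as the probability that a randomly chosen preterm case receives a higher predicted risk than a randomly chosen term case, with ties counted as one half. The predicted risks are hypothetical numbers invented for illustration and do not come from any cited model.

```python
def auc(scores_positive, scores_negative):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    if not scores_positive or not scores_negative:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical predicted risks (illustrative assumptions, not study data).
preterm_scores = [0.62, 0.35, 0.48, 0.20]
term_scores = [0.10, 0.30, 0.15, 0.40, 0.22, 0.05]

print(f"AUC = {auc(preterm_scores, term_scores):.3f}")
```

AUC depends only on how cases are ranked, not on the predicted probabilities themselves, which is why it cannot substitute for an assessment of calibration.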
Akazawa and Hashimoto [19] conducted a systematic review of 22 studies that used AI for preterm birth prediction and identified electrohysterogram images, biological profiles, metabolic panels from amniotic fluid or maternal blood, and cervical ultrasound images as the primary data types used. They noted that most datasets were insufficient for robust AI model development, with only three studies utilizing databases exceeding 100,000 cases, and that higher predictive accuracy was achieved with metabolic panels and electrohysterogram data. These systematic reviews consistently identified the lack of external validation as a critical gap, with most models being evaluated only on internal test sets from the same institution in which they were developed.
2. Electronic health record-based models
Electronic health records provide readily available data for the development of predictive models without the need for additional testing or specialized equipment. Yu et al. [20] developed a CatBoost model using demographic, obstetric, and laboratory variables from 22,603 singleton pregnancies, achieving an AUC of 0.70 in internal validation. Their model incorporated maternal age, maternal weight and height, parity, first-trimester laboratory values (including hemoglobin and platelet counts), and serial measurements of blood pressure, symphysis fundal height, abdominal circumference, and maternal weight gain in late pregnancy. SHAP-based feature importance analysis identified late-pregnancy diastolic blood pressure, changes in symphysis fundal height and abdominal circumference, maternal weight gain, and aspartate aminotransferase level at registration as the leading predictors.
Zhang et al. [21] compared five machine learning algorithms for preterm birth prediction using clinical data from Chinese hospitals. Their AdaBoost model achieved 100% accuracy for term deliveries but a lower sensitivity for detecting preterm cases, highlighting the challenge of class imbalance when preterm births represent only 11.7% of the dataset. Kong et al. [22] employed automated machine learning frameworks for large-scale prediction using electronic inpatient discharge data, demonstrating the feasibility of automated feature selection and model optimization. Huang et al. [23] developed a longitudinal model incorporating data from multiple prenatal visits, showing that prediction accuracy improved as gestational age advanced and more clinical information became available.
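The effect of class imbalance on headline metrics can be made concrete with a small example. The sketch below builds a hypothetical cohort of 1,000 pregnancies in which 11.7% are preterm, matching the proportion mentioned above, and evaluates a trivial rule that labels every pregnancy as term. The cohort size and the rule itself are illustrative assumptions and are not taken from the cited study.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = preterm, 0 = term)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical cohort: 117 preterm and 883 term deliveries (11.7% preterm).
y_true = [1] * 117 + [0] * 883
always_term = [0] * len(y_true)  # trivial rule: predict "term" for everyone

metrics = confusion_metrics(y_true, always_term)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this constructed case the rule reaches high accuracy while detecting no preterm births, which is why sensitivity and specificity should be reported alongside accuracy for imbalanced outcomes.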
In the Korean context, Lee and Ahn [24] applied artificial neural network analysis to data from 596 obstetric patients at the Korea University Anam Hospital. Comparing six machine learning methods, including neural networks, logistic regression, decision trees, random forest, naïve Bayes, and support vector machines, they found that the artificial neural network achieved an accuracy of 0.91 with an AUC of 0.62. Variable importance analysis revealed that the neural network emphasized hypertension, diabetes mellitus, and prior cone biopsy as major predictors, whereas random forest placed more weight on cervical length, maternal age, and prior preterm birth history. This study provides a foundation for the development of locally validated prediction models using Korean population data.
3. Deep learning for ultrasound image analysis
Deep learning enables automated feature extraction from medical images without manual annotation, thereby potentially capturing visual patterns that are not apparent to human observers. Convolutional neural networks (CNNs) have been successfully applied in various obstetric imaging tasks. Burgos-Artizzu et al. [25] developed a CNN model that analyzed fetal lung ultrasound texture to predict neonatal respiratory morbidity and achieved an accuracy of 91.5%. While not directly predicting preterm birth, this study demonstrates the feasibility of deep learning for extracting clinically relevant features from obstetric ultrasound images.
For cervical assessment specifically, Ohtaka et al. [14] developed a CNN model for predicting preterm delivery in women admitted with threatened preterm labor. By analyzing transvaginal ultrasound images from 59 patients, the best-performing model achieved an accuracy of 71.8% with an AUC of 0.704. Notably, this performance exceeded that of experienced clinicians: two expert physicians achieved accuracies of only 46.5% and 51.7% when visually assessing the same images. This finding suggests that deep learning can extract predictive features from cervical ultrasound that are not readily apparent through conventional visual inspection, potentially capturing the microstructural changes preceding overt cervical shortening. Kloska et al. [26] combined clinical parameters with blood test results and questionnaire data in multimodal machine learning models, demonstrating improved performance compared to single-modality approaches.
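When performance is estimated from small datasets, the uncertainty around a reported accuracy can be wide. The sketch below computes a Wilson score interval for an observed proportion at two hypothetical test-set sizes with the same observed accuracy of 72%. These counts are assumptions chosen for illustration; the test-set sizes of the studies discussed above are not stated here and are not implied by this example.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p_hat = correct / total
    denom = 1.0 + z * z / total
    centre = (p_hat + z * z / (2.0 * total)) / denom
    spread = math.sqrt(p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total))
    half_width = z * spread / denom
    return centre - half_width, centre + half_width


# Hypothetical test sets: same observed accuracy, very different sizes.
for correct, total in ((36, 50), (3600, 5000)):
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total}: accuracy {correct / total:.1%}, interval {low:.1%} to {high:.1%}")
```

With only 50 hypothetical cases the interval spans more than 20 percentage points, so comparisons between a model and clinicians based on small samples should be interpreted cautiously.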
4. Cervical texture analysis and radiomics
In addition to cervical length measurements, quantitative analysis of cervical texture may provide additional predictive information reflecting the microstructural changes that precede measurable shortening. Baños et al. [27] demonstrated that ultrasound-derived textural features, including homogeneity, contrast, and entropy, correlated with gestational age and cervical maturation, suggesting that these features could serve as biomarkers for premature cervical ripening. Burgos-Artizzu et al. [28] demonstrated that combining automated cervical length measurement with texture analysis in the mid-trimester improved prediction compared with length alone.
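Texture features such as contrast, homogeneity, and entropy are commonly derived from a gray-level co-occurrence matrix (GLCM), which counts how often pairs of gray levels occur at a fixed pixel offset. The sketch below is a minimal pure-Python illustration of that general technique. It is an assumption that the cited studies used comparable definitions, and the tiny images are hypothetical arrays rather than ultrasound data.

```python
import math
from collections import Counter


def glcm_features(image, dx=1, dy=0):
    """Contrast, homogeneity, and entropy from a symmetric, normalized GLCM.

    image is a list of rows of integer gray levels; (dx, dy) is the pixel offset.
    """
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                a, b = image[r][c], image[r2][c2]
                counts[(a, b)] += 1
                counts[(b, a)] += 1  # count both directions so the matrix is symmetric
    total = sum(counts.values())
    if total == 0:
        raise ValueError("image too small for the requested offset")
    contrast = homogeneity = entropy = 0.0
    for (i, j), n in counts.items():
        p = n / total
        contrast += p * (i - j) ** 2
        homogeneity += p / (1.0 + (i - j) ** 2)  # inverse difference moment
        entropy -= p * math.log2(p)
    return {"contrast": contrast, "homogeneity": homogeneity, "entropy": entropy}


# Hypothetical 4x4 patches: one uniform, one with alternating gray levels.
uniform_patch = [[3] * 4 for _ in range(4)]
striped_patch = [[0, 2, 0, 2] for _ in range(4)]

print("uniform:", glcm_features(uniform_patch))
print("striped:", glcm_features(striped_patch))
```

The uniform patch yields zero contrast and maximal homogeneity, whereas the striped patch yields higher contrast and entropy, mirroring the intuition that tissue disorganization changes these descriptors.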
Pachtman et al. [29] introduced a cervical heterogeneity index derived from grayscale histogram analysis of transvaginal ultrasound images. They found significantly higher heterogeneity values in women who subsequently delivered preterm than in those who delivered at term, suggesting that cervical tissue disorganization may be detected before overt length changes. These quantitative approaches offer the advantages of objectivity and reproducibility compared with subjective visual assessments.
5. Elastography-based prediction
Cervical elastography measures tissue stiffness, which decreases during cervical ripening and may decline prematurely in women at risk of preterm birth. Both strain elastography and shear wave elastography techniques have been investigated. Angelopoulou et al. [30] conducted a systematic review and meta-analysis of cervical elastography for preterm birth prediction, including 13 studies with 4,087 participants. They reported pooled sensitivity of 0.77 and specificity of 0.73, though substantial heterogeneity existed across studies in elastography techniques, measurement protocols, and outcome definitions.
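To place the pooled estimates on the same scale as the likelihood ratios used for other tests, sensitivity and specificity can be converted into positive and negative likelihood ratios. The sketch below applies the standard definitions to the pooled values of 0.77 and 0.73 reported above. Deriving likelihood ratios from pooled summary estimates is an approximation, and the function name is an assumption made for this illustration.

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-) from sensitivity and specificity."""
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Pooled estimates from the meta-analysis described above [30].
lr_pos, lr_neg = likelihood_ratios(0.77, 0.73)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Under this approximation the positive likelihood ratio is a little below 3, which would shift a pre-test probability only modestly.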
Feng et al. [31] combined first-trimester cervical length with shear wave elastography measurements and achieved improved prediction compared to either parameter alone. This early-pregnancy assessment could enable early identification of high-risk women and timely initiation of preventive interventions. Patberg et al. [32] developed the E-cervix index by integrating multiple elastography-derived parameters measured at 18–22 weeks and demonstrated an incremental predictive value over the standard cervical length assessment. The combination of anatomical (length) and functional (stiffness) cervical assessments represents a promising approach for improved risk stratification.
6. Multi-omics integration and transformer models
Advanced machine learning architectures enable the integration of high-dimensional molecular data that may capture underlying pathophysiological processes. Camunas-Soler et al. [33] analyzed cell-free RNA profiles in maternal blood samples and identified transcriptomic signatures predictive of early and extremely early spontaneous preterm births. Their models achieved AUC of 0.80 for predicting delivery before 35 weeks. The identified genes reflected pathways including placental function, immune regulation, and cervical remodeling, providing biological plausibility for predictive associations.
Zhou et al. [15] developed a transformer-based model that integrated cell-free DNA and RNA sequencing data for preterm birth prediction. In their evaluation using data from 682 pregnancies, the cell-free DNA model alone achieved an AUC of 0.822, and the cell-free RNA model achieved an AUC of 0.851. Notably, integrating both data modalities within the transformer architecture achieved an AUC of 0.890, demonstrating a substantial improvement through multi-omics integration compared with single-modality approaches. Although these results are promising, the requirement for next-generation sequencing and specialized bioinformatics analysis currently restricts their clinical applicability to well-resourced settings.
Conclusion
This review identified a substantial growth in AI applications for preterm birth prediction, with approaches spanning electronic health records, ultrasound imaging, cervical texture analysis, elastography, and multi-omics molecular profiling. The reported discrimination metrics are often promising, with AUC values frequently exceeding 0.75 and exceeding 0.85 for some multi-omics approaches. Deep learning models for cervical ultrasound analysis have demonstrated the ability to extract predictive features that escape expert visual assessments. However, the clinical utility of these models remains uncertain owing to their pervasive methodological limitations.
The finding that 79% of studies had a high risk of bias according to the PROBAST criteria is concerning but consistent with broader patterns in clinical prediction model research [18]. Common methodological deficiencies include inadequate sample sizes leading to overfitting, inadequate handling of class imbalance, failure to account for missing data, and the absence of calibration assessment. A median TRIPOD adherence of only 49% indicates a widespread failure to report the essential methodological details required for replication and clinical implementation.
Compared with established clinical tools, AI-based models show the potential for improved discrimination. While fetal fibronectin testing shows a positive likelihood ratio of 5.42 in symptomatic women [7], several machine learning models have reported AUC values exceeding 0.80. The finding of Ohtaka et al. [14] that their CNN model outperformed experienced clinicians (71.8% vs. 46.5–51.7% accuracy) suggests that AI may capture predictive information from cervical images that humans cannot perceive. However, direct comparisons across studies are limited by differences in populations, outcome definitions (gestational age thresholds varying from 32 weeks to 37 weeks), and clinical contexts (asymptomatic screening vs. symptomatic evaluation).
In the Korean clinical context, current prediction practice relies primarily on cervical length measurement combined with biomarkers. Park et al. [34] demonstrated that cervicovaginal fluid cytokines, particularly interleukin (IL)-6 and IL-17, could serve as predictive markers in symptomatic women with preterm labor, achieving a performance comparable to or exceeding that of fetal fibronectin. The study by Lee and Ahn [24] represents an important step toward locally validated AI prediction models, although the modest AUC of 0.62 and single-center design indicate the need for further development and validation across multiple Korean institutions.
Several barriers must be overcome before clinical translation. First, external validation across diverse populations is essential but is rarely performed. Models developed and validated at single institutions may not be generalizable to different healthcare settings, patient demographics, or clinical practices. Second, most studies reported only discrimination metrics without calibration assessment; calibration, ensuring that predicted probabilities match observed outcomes, is essential for clinical decision-making [35]. Third, the optimal threshold for classifying high-risk patients depends on the intended clinical action and relative costs of false positives and false negatives, which vary across clinical contexts.
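The distinction between discrimination and calibration can be shown with a constructed example. In the sketch below, two hypothetical sets of predicted risks rank the same cases identically, and therefore have the same AUC, but only one set matches the observed event rates. The outcomes and risks are invented for illustration, and the Brier score and grouped comparison shown here are simple, commonly used summaries rather than a complete calibration assessment.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_table(y_true, y_prob, n_bins=5):
    """Rows of (mean predicted risk, observed event rate, count) per equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))
    table = []
    for members in bins:
        if members:
            mean_predicted = sum(p for _, p in members) / len(members)
            observed_rate = sum(y for y, _ in members) / len(members)
            table.append((mean_predicted, observed_rate, len(members)))
    return table


# Hypothetical cohort: 10 lower-risk women (1 preterm) and 4 higher-risk women (2 preterm).
y_true = [1] + [0] * 9 + [1, 1, 0, 0]
calibrated_risk = [0.1] * 10 + [0.5] * 4  # predictions match observed rates
inflated_risk = [0.4] * 10 + [0.9] * 4    # same ranking, overestimated risk

for label, risks in (("calibrated", calibrated_risk), ("inflated", inflated_risk)):
    print(label, "Brier score:", round(brier_score(y_true, risks), 4))
    for mean_predicted, observed_rate, count in calibration_table(y_true, risks):
        print(f"  predicted {mean_predicted:.2f} vs observed {observed_rate:.2f} (n={count})")
```

A model with inflated risks would flag far more women as high risk at any fixed probability threshold, even though its discrimination is unchanged, which is why calibration matters for threshold-based clinical decisions.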
Kelly et al. [36] outlined the key challenges in delivering clinical impact with AI in healthcare, including ensuring that model performance generalizes across diverse populations, integrating predictions into clinical workflows, establishing appropriate regulatory pathways, and building clinician trust through interpretability. The TRIPOD+AI statement [37] provides updated guidance for the transparent reporting of prediction models using machine learning methods. The DECIDE-AI guidelines [38] offer a framework for the early-stage clinical evaluation of AI decision support systems before full-scale clinical trials. Adherence to these reporting and evaluation frameworks is essential to advance the field beyond proof-of-concept studies toward clinically useful tools.
AI and machine learning approaches offer the potential to improve preterm birth prediction beyond current clinical methods by integrating diverse data sources and identifying complex patterns that are not apparent through traditional analysis. Current evidence demonstrates the feasibility of using multiple data modalities, including electronic health records, ultrasound imaging, cervical texture, elastography, and molecular biomarkers, with some multi-omics approaches achieving AUC values exceeding 0.85. However, the clinical utility of these models remains unclear owing to their pervasive methodological limitations. With 79% of the studies showing a high risk of bias according to the PROBAST criteria and a median TRIPOD adherence of only 49%, substantial improvement in methodological rigor is needed before clinical implementation can be recommended. Priority areas for future research include rigorous external validation across diverse populations and healthcare settings, adherence to TRIPOD+AI reporting standards, assessment of calibration and clinical utility beyond discrimination, and prospective evaluation of the impact on clinical decision-making and patient outcomes. Only by addressing these gaps can AI-based prediction tools fulfill their potential to improve outcomes in women and infants at risk of preterm birth.
Notes
Conflict of interest
No conflict of interest relevant to this article was reported.
Ethical approval
Not applicable.
Patient consent
Not applicable.
Funding information
None.
