
Predicting dental caries outcomes in young adults using machine learning approach



To predict the dental caries outcomes in young adults from a set of longitudinally-obtained predictor variables and identify the most important predictors using machine learning techniques.


This study was conducted using the Iowa Fluoride Study dataset. The predictor variables (sex, mother’s education, family income, composite socio-economic status (SES), caries experience at ages 9, 13, and 17, and the cumulative estimates of risk and protective factors, including fluoride, dietary, and behavioral variables from ages 5–9, 9–13, 13–17, and 17–23) were used to predict the age 23 D2+MFS count. Four machine learning models (LASSO regression, generalized boosting machines (GBM), negative binomial regression (NegGLM), and extreme gradient boosting (XGBOOST)) were compared under 5-fold cross-validation with nested resampling.


The mean cavitated-level caries experience at age 23 (D2+MFS count) was 4.75. The predictive analysis found LASSO to be the best performing model (compared to GBM, NegGLM, and XGBOOST), with a root mean square error (RMSE) of 0.70 and a coefficient of determination (R2) of 0.44. After dichotomization of the predicted and observed values of the LASSO regression, the classification results showed accuracy, precision, recall, and ROC AUC of 83.7%, 85.9%, 93.1%, and 68.2%, respectively. Previous caries experience at age 13 and age 17 and sugar-sweetened beverage intakes at age 13 and age 17 were found to be the four most important predictors of cavitated caries count at age 23.


Our machine learning model showed high accuracy and precision in the prediction of caries in young adults from longitudinally obtained predictor variables. Our model could, in the future, after further development and validation with other diverse population data, be used by public health specialists and policy-makers as a screening tool to identify the risk of caries in young adults and apply more targeted interventions. However, data from a more diverse population are needed to improve the quality and generalizability of caries prediction.



Dental caries is a chronic infectious disease that destroys tooth structure and has significant public health implications, including in young adults [1]. The etiology of dental caries is multifactorial, with the most central etiological factors being the cariogenic diet, the action of bacteria, susceptible tooth structure, and time [1].

Few studies have explored the prevalence of cavitated caries in young adults and associated etiological factors. Brown et al.’s study using data from two National Health and Nutrition Examination Surveys (NHANES I and NHANES III) found mean Decayed, Missing, and Filled Surfaces (DMFS) scores of 24.8 and 13.9 among participants aged 18 to 25 years in NHANES I and NHANES III, respectively [2]. Also, Ismail et al.’s study using the 1982–1983 Hispanic Health and Nutrition Examination Survey (HHANES) found a total mean DMFT score of 6.0 among Mexican-Americans aged 18 to 24 [3]. Garcia-Cortes et al. [4] saw a high caries prevalence of 86.3% (mean DMFT of 5.8) among applicants aged 22 to 25 to San Luis Potosi University, Mexico, with females having significantly higher DMFT than males (4.3 ± 4.0 vs. 3.9 ± 3.8; p = 0.04). Drachev et al. [5] found high caries prevalence (96.0%; mean DMFT of 7.6) among Russian students, with higher mean DMFT in high socioeconomic status (SES) students compared to low SES students. A cohort study of Swedish children following clinical and radiographic examinations (age 20 mean DFS = 5.8) showed that previous caries experience at a younger age (ages 3, 6, and 15) was associated with caries experience at age 20 (p < 0.05) [6]. Jamieson et al.’s study on Australian Aboriginal young adults aged 16 to 20 found a mean DMFT of 4.8 and that sex and sweets intake were significantly associated with higher mean DMFT [7].

Given the multifactorial and complex etiology of caries, there is a need for studies that use robust predictive modeling techniques like machine learning (ML) to accurately identify the best predictors of caries from complex datasets. Supervised machine learning is a type of artificial intelligence used to predict the value of an outcome measure based on several input measures. An ideal ML model has a favorable bias-variance trade-off (i.e., no model underfitting or overfitting) [8]. It provides a robust approach for the identification and selection of the most important predictors, while avoiding convergence issues and some aspects of the curse of dimensionality (Hughes phenomenon) [9], common issues in traditional statistical modeling with a large number of variables.

There are substantial gaps in our understanding of the predictive effects of longitudinally-obtained dietary/behavioral and fluoride variables on caries outcomes, especially in young adulthood, which is one of the most active stages of life. Previous ML studies [10, 11] have focused on the prediction of caries outcomes in children and we found no studies on the prediction of dental caries in young adults with a very wide range of comprehensive and cumulative (childhood) exposure variables using a machine learning approach. The objectives of this study were: (1) to predict the dental caries outcome in young adults using machine learning techniques and (2) to identify the most important predictors of the dental caries outcome from a large set of sociodemographic, dietary, fluoride, and behavioral variables.


This was a secondary analysis of data collected from ages 5 to 23 within the Iowa Fluoride Study (IFS), a prospective cohort study that completed data collection in February 2019. The recruitment of IFS participants was done in the post-partum wards of eight Iowa hospitals from March 1992 to February 1995 [12]. The participants had dental exams approximately every four years (except for ages 17 to 23, an interval of 6 years) and received oral health questionnaires every six months. Approval for the IFS was obtained from the University of Iowa Institutional Review Board for all components and procedures before the initiation of the study and for each examination, with annual renewal, as well as review when any modifications were made [13] (Appendix II).

The IFS dental examinations were done by one of three trained and calibrated dentists using portable dental equipment and halogen headlights, with ongoing inter-examiner reliability assessment [13]. After drying the teeth, a DenLite® mirror (Welch-Allyn Medical Products, Inc., Skaneateles Falls, NY) was used to enhance lighting and provide transillumination. The examinations were based primarily on visualization, without radiographs; however, gentle explorer probing was used to confirm scoring when in doubt. They were performed either at the University of Iowa College of Dentistry (Iowa City, IA) or at remote locations (Waterloo, IA, and Des Moines, IA) for those who could not travel to Iowa City. Caries status of each surface was recorded as either sound (S), arrested (D0), non-cavitated (D1), or cavitated (D2+); surfaces with restorations were recorded as filled (F); teeth missing due to caries were recorded as missing (M) surfaces; and dental sealants were recorded separately [13].

The inclusion criteria for these analyses were (1) completion of the dental exams at age 23 and (2) having sufficient data to compute trapezoidal area-under-the-curve (AUC) cumulative exposure estimates (see Appendix I) for at least 35 of the 51 independent variables for the time periods from ages 5 to 9, 9 to 13, 13 to 17, and 17 to 23.
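As an illustration of the trapezoidal AUC idea behind the cumulative exposure estimates, the following is a minimal Python sketch. It is not the study's code (the analyses were run in R, and the actual procedure is described in Appendix I). The function name `trapezoidal_auc`, the example ages and intake values, and the option of dividing the area by the length of the interval to obtain a time-weighted average are all assumptions made for illustration.

```python
def trapezoidal_auc(ages, values, time_weighted=True):
    """Trapezoidal area under an exposure-by-age curve.

    ages   -- ages (in years) at which the exposure was reported, ascending
    values -- exposure reported at each age (for example, cups/day)
    If time_weighted is True, the area is divided by the covered time span,
    giving an average exposure per unit time (an assumption of this sketch).
    """
    if len(ages) != len(values) or len(ages) < 2:
        raise ValueError("need at least two (age, value) pairs of equal length")
    area = 0.0
    for i in range(1, len(ages)):
        width = ages[i] - ages[i - 1]
        if width <= 0:
            raise ValueError("ages must be strictly increasing")
        # Area of one trapezoid: interval width times the mean of its two ends.
        area += width * (values[i] + values[i - 1]) / 2.0
    return area / (ages[-1] - ages[0]) if time_weighted else area


# Hypothetical questionnaire responses between ages 9 and 13 (cups/day).
ages = [9.0, 10.0, 11.0, 12.0, 13.0]
ssb_cups = [1.0, 1.0, 2.0, 2.0, 1.0]
area_9_13 = trapezoidal_auc(ages, ssb_cups, time_weighted=False)
mean_9_13 = trapezoidal_auc(ages, ssb_cups)
```

In this made-up example the raw area is 6.0 cup-years per day and the time-weighted value is 1.5 cups/day over the four-year period.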

The primary outcome variable (age 23 cavitated caries (D2+MFS) count) was defined as the sum of decayed (D2+, cavitated), missing (M), and filled (F) surfaces at age 23. A total of 51 independent variables were considered, including four sociodemographic variables and 47 other predictor variables (cumulative exposure variables and previous caries experience). The sociodemographic variables were sex, family income level, mother’s level of education, and composite SES, with the last three assessed with data from a questionnaire in 2007. The main predictor variables were the cumulative exposure AUC variables for the periods from ages 5 to 9, 9 to 13, 13 to 17, and 17 to 23. They were defined for daily brushing frequency category, daily fluoride intake from combined sources, concentration of fluoride from home water, and the beverage variables (daily sugar-free beverage intake (no sugar added), daily milk intake, daily 100% juice intake, daily sugar-sweetened beverages intake, frequency of sugar-free (water-based) beverages consumption, frequency of milk consumption, frequency of 100% juice consumption, and frequency of sugar-sweetened beverages consumption). Additional variables were dental caries experience at ages 9, 13, and 17. Details of the variable definitions are provided in Appendix II.

Statistical analysis

Exploratory data analysis

Descriptive statistics were determined for the person-level age 23 D2+MFS count and all independent variables. Bivariate analyses were conducted to ascertain the relationships between the dependent variable and each of the 51 independent variables. Mann-Whitney U tests were used to explore the relationships between age 23 D2+MFS count and sex and brushing frequency category; Kruskal-Wallis tests were used to explore the relationships between age 23 D2+MFS count and family income, mother’s level of education, and composite SES. Spearman (Rho) correlation tests were conducted to assess the relationships between the age 23 D2+MFS count and each of the continuous independent variables (home fluoride concentration, total fluoride intake, and beverage variables). All statistical analyses were performed with R software version 4.1.2, with a two-tailed alpha level set at 0.05 for bivariate analyses.
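For readers unfamiliar with the Spearman correlation used in the bivariate analyses, the following is a minimal Python sketch of how the coefficient can be computed: values are converted to ranks (ties receive the mean of their ranks) and a Pearson correlation is taken on the ranks. This is an illustration only; the study used R, the helper names are hypothetical, the example counts are made up, and no p-value is computed here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up caries counts for five hypothetical participants.
caries_age_17 = [0, 2, 2, 5, 9]
caries_age_23 = [1, 3, 4, 8, 12]
rho = spearman_rho(caries_age_17, caries_age_23)
```

Because the coefficient depends only on ranks, it is insensitive to the skewness typical of caries count data.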

Multivariable predictive modeling

Multivariable predictive modeling was performed using four ML models - Least Absolute Shrinkage and Selection Operator (LASSO) regression [8], negative binomial regression, generalized boosting machines (GBM) [14, 15], and extreme gradient boosting (XGBOOST) [16] - using the MachineShop [17] package for R. These models were chosen because of their abilities to (1) perform well with high dimensional data, (2) perform variable selection, and (3) handle different data types and distributions with very few assumptions (see Appendix III for descriptions of LASSO, GBM, and XGBOOST).
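To show how a LASSO penalty can perform variable selection, the following is a minimal Python sketch of coordinate descent with soft-thresholding for the objective (1/2n)·||y − Xβ||² + λ·||β||₁. It is a generic textbook formulation, not the MachineShop implementation used in the study. The function names, the toy data, and the absence of an intercept (the toy predictors and outcome are already centered) are assumptions of this sketch.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for the LASSO (no intercept, centered data)."""
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Fit from all predictors except j, so the residual is partial.
                fitted_without_j = sum(X[i][m] * beta[m] for m in range(p) if m != j)
                rho += X[i][j] * (y[i] - fitted_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Toy data: the outcome depends on the first predictor only (y = 3 * x1);
# the second predictor is unrelated to the outcome.
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
y = [-6.0, -3.0, 0.0, 3.0, 6.0]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

In this toy case the coefficient of the informative predictor is shrunk from 3 to 2.75 and the coefficient of the unrelated predictor is set exactly to zero, which is the mechanism by which LASSO drops variables from a model.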

Data preprocessing

Prior to fitting the ML models, the k-nearest neighbor (KNN) imputation technique was used to handle the remaining missing data for these participants [18]. Additional data preprocessing (scaling and normalization) of the data was performed using the recipes package in R [19].
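The preprocessing steps above can be sketched in Python as follows. This is a simplified illustration, not the recipes package implementation used in the study: the distance measure (root mean squared difference over the columns observed in both rows), the choice of k, the use of the neighbors' mean, and the function names `knn_impute` and `standardize` are all assumptions of this sketch.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over columns observed in
    both rows; candidate neighbors must have the target column observed.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def standardize(column):
    """Center a column to mean 0 and scale it to sample standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]


# Made-up exposure data; the second participant is missing the third variable.
data = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 50.0],
]
imputed = knn_impute(data, k=2)
scaled_first_column = standardize([1.0, 2.0, 3.0])
```

Here the missing value is filled with 4.0, the mean of the two most similar participants' values (3.0 and 5.0), while the dissimilar fourth participant is ignored.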

Model fitting (training and testing)

The training and testing of all models were done using the nested resampling technique with 5-fold cross-validation, which consists of an inner resampling loop and an outer resampling loop for testing the model performance [20]. We chose the nested resampling technique due to its ability to use different portions of the data to iteratively perform training and testing, thereby obtaining an unbiased performance estimate. In the outer resampling loop, we had five training/test sets (each with an 80 to 20 ratio). On each of these outer training sets, we optimized the model by performing parameter tuning and feature selection on the inner resampling loop. The optimized models then were fitted on the outer training sets and their performances were evaluated on the outer test sets. This technique gives a more honest estimation of model performance, although it is computationally expensive [20]. These models then were optimized by tuning them using the TunedModel function in the MachineShop package and the tuning parameters were chosen using the cross-validation technique [17].
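The nested resampling scheme described above can be sketched in Python as follows. This is an illustration of the general technique, not the MachineShop code used in the study: the toy one-predictor penalized model, the penalty grid, the synthetic data, and all function names are assumptions. The key point it shows is that the tuning parameter is chosen using only the outer training data, and the outer test fold is used once, for evaluation.

```python
import math
import random


def k_fold_indices(n, k, seed):
    """Shuffle 0..n-1 and split the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_penalized_slope(x, y, lam):
    """Toy stand-in model: one-predictor penalized least squares, no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cv_rmse(x, y, lam, k, seed):
    """Mean RMSE of the toy model over k folds of (x, y)."""
    scores = []
    for test in k_fold_indices(len(x), k, seed):
        held = set(test)
        train = [i for i in range(len(x)) if i not in held]
        slope = fit_penalized_slope([x[i] for i in train], [y[i] for i in train], lam)
        scores.append(rmse([y[i] for i in test], [slope * x[i] for i in test]))
    return sum(scores) / len(scores)


def nested_cv(x, y, grid, outer_k=5, inner_k=5, seed=0):
    """Outer loop estimates performance; inner loop tunes lam on training data only."""
    outer_scores, chosen = [], []
    for fold_no, test in enumerate(k_fold_indices(len(x), outer_k, seed)):
        held = set(test)
        train = [i for i in range(len(x)) if i not in held]
        x_tr, y_tr = [x[i] for i in train], [y[i] for i in train]
        # Inner resampling: pick the penalty with the lowest cross-validated RMSE.
        best_lam = min(grid, key=lambda lam: cv_rmse(x_tr, y_tr, lam, inner_k, seed + 1 + fold_no))
        # Refit on the whole outer training set, then score on the outer test set.
        slope = fit_penalized_slope(x_tr, y_tr, best_lam)
        outer_scores.append(rmse([y[i] for i in test], [slope * x[i] for i in test]))
        chosen.append(best_lam)
    return sum(outer_scores) / len(outer_scores), chosen


# Synthetic data for illustration only.
rng = random.Random(42)
x = [rng.uniform(0.0, 5.0) for _ in range(50)]
y = [2.0 * a + rng.gauss(0.0, 0.5) for a in x]
grid = [0.0, 1.0, 10.0]
mean_rmse, chosen_lams = nested_cv(x, y, grid)
```

With five outer folds, each outer training set holds 80% of the data and each outer test set 20%, matching the ratio stated above; the reported score is the mean of the five outer test RMSE values.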

Model evaluation

Model performance was assessed using root mean square error (RMSE), mean absolute error (MAE), and the R-squared value (R2, coefficient of determination). Lower RMSE and MAE values and a higher R2 value indicate better model performance. The best-performing model was selected based on the RMSE and R2; MAE was also reported to give a fuller picture of overall model performance. The metrics for model performance were obtained by averaging the scores obtained from nested resampling with 5-fold cross-validation.
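The three regression metrics can be written out explicitly. The sketch below is a plain Python illustration with made-up numbers; the function name is hypothetical, and R-squared is computed as 1 − SS_res/SS_tot, which is one common definition and an assumption here, since the exact formula used by the study's software is not stated in the text.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE, and R-squared for paired observed/predicted values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,  # assumed definition: 1 - SS_res / SS_tot
    }


# Made-up values for one hypothetical test fold.
observed = [0.0, 2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 3.0, 6.0]
fold_metrics = regression_metrics(observed, predicted)
```

For this toy fold the RMSE is about 0.71, the MAE is 0.5, and R-squared is 0.9. Under cross-validation, such per-fold values would be averaged across the test folds.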

For easier interpretability, the observed and predicted values from the selected best model were first discretized and then dichotomized into dental caries as Yes (if values were above zero, indicating cavitated caries) or No (if values were zero, indicating no caries present). The following metrics then were used to show the model performance: accuracy, receiver operating characteristic area under the curve (ROC AUC), positive predictive value (precision), and sensitivity (recall). Details of the code are provided in Appendix X. This study was reported using both the STROBE (Appendix XI) and TRIPOD guidelines (Appendix XII).
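The dichotomization step and three of the classification metrics can be sketched in Python as follows. This is an illustration with made-up values, not the code in Appendix X: rounding to the nearest non-negative integer as the discretization rule and the function names are assumptions. ROC AUC is left out because it depends on how the predictions are scored and ranked, which the text does not specify.

```python
def dichotomize(values):
    """Round to the nearest non-negative count, then flag counts above zero.

    Note: Python's round() rounds exact halves to the nearest even integer,
    so a value of 0.5 would be treated as 0 (no caries) in this sketch.
    """
    return [max(0, round(v)) > 0 for v in values]


def classification_metrics(observed, predicted):
    """Accuracy, precision (PPV), and recall (sensitivity) after dichotomization."""
    obs = dichotomize(observed)
    pred = dichotomize(predicted)
    tp = sum(1 for o, p in zip(obs, pred) if o and p)
    fp = sum(1 for o, p in zip(obs, pred) if not o and p)
    fn = sum(1 for o, p in zip(obs, pred) if o and not p)
    tn = sum(1 for o, p in zip(obs, pred) if not o and not p)
    return {
        "accuracy": (tp + tn) / len(obs),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }


# Made-up observed counts and model predictions for eight hypothetical people.
observed = [0, 0, 1, 3, 5, 2, 0, 4]
predicted = [0.2, 0.8, 0.4, 2.6, 4.1, 1.7, 0.1, 3.3]
binary_metrics = classification_metrics(observed, predicted)
```

In this toy example there are four true positives, one false positive, one false negative, and two true negatives, giving an accuracy of 0.75 and a precision and recall of 0.8 each.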


There were 258 participants who fulfilled the inclusion criteria, with 41 participants (16%) having at least one imputed data point and 3,458 out of 18,126 data points (19%) imputed using the k-nearest neighbor technique. There was favorable tooth-level inter-examiner reliability, with kappa statistics of 0.73, 0.71, 0.77, and 0.82 at ages 9, 13, 17, and 23, respectively.

Table 1 summarizes the frequency distributions of the categorical predictor variables. Fifty-eight percent of participants were female; 13% of participants had family incomes below $40,000, and 48% had family incomes of $80,000 or above. About 14%, 32%, and 54% of participants were in the lower, middle, and higher SES groups, respectively.

Table 1 Descriptive analyses of the categorical independent variables

As shown in Table 2, the prevalence of cavitated caries at age 23 (D2+MFS23 count > 0) was 69.1%, with a mean D2+MFS23 of 4.75 (SD = 4.32). The mean values for the cumulative exposure AUC predictor variables from ages 5 to 9, 9 to 13, 13 to 17, and 17 to 23 were: 0.71, 0.72, 0.85, and 1.04, respectively, for fluoride intake from combined sources (mg F/day); 1.67, 1.50, 1.56, and 1.11, respectively, for milk intake per day (cups/day); 0.61, 1.16, 1.42, and 1.76, respectively, for 100% juice intake (cups/day); and 0.61, 1.16, 1.42 and 1.76, respectively, for intake of sugar-sweetened beverages (cups/day). Also, the mean caries (D2+MFS) experience values at ages 9, 13, and 17 were 0.46, 1.15, and 2.94, respectively (see Appendix IV for more details about the descriptive statistics).

Table 2 Descriptive analyses of the continuous independent variables and dependent variables

As shown in Table 3, D2+MFS23 count was significantly associated with family income, composite SES, and age 9 to 13 cumulative estimates of participants’ brushing frequency (p < 0.05). There were significant correlations between D2+MFS23 count and caries experience at ages 9, 13, and 17 (r = 0.28, 0.56, and 0.73, respectively; all p < 0.001). D2+MFS23 count was negatively associated with cumulative estimates of frequency of milk intake at ages 5 to 9 and 9 to 13 (r = -0.12 and -0.13, respectively; p < 0.05) and positively associated with cumulative estimates of age 13 to 17 total fluoride intake, age 5 to 9 amount and frequency of sugar-sweetened beverages, and amount and frequency of sugar-sweetened beverages at ages 9 to 13 and 13 to 17 (r = 0.14, 0.22, 0.21, 0.28, 0.26, 0.29 and 0.31, respectively; p < 0.05).

Table 3 Bivariate analyses of the relationships between the independent variables and the dependent variable of age 23 cavitated caries count

Multivariable model prediction and performance

As shown in Table 4, the best performing model was the LASSO regression, with an RMSE of 0.70, R2 of 0.44, and MAE of 0.48. The GBM and the negative binomial GLM also performed fairly well, with RMSE scores of 0.74 and 0.76, respectively. The worst performing model was the XGBOOST, with an RMSE score of 0.79. More details on the model performance are provided in Appendix V. A boxplot comparing the performance metrics (RMSE, R2, and MAE) across all four ML models can be found in Appendix VI. The observed values were found to be calibrated well with the predicted values, as shown in the calibration plot (Appendix VII). After dichotomization from the LASSO model, the classification results (Table 4) showed an accuracy of 83.7%, precision (positive predictive value) of 85.9%, recall (sensitivity) of 93.1%, and ROC AUC of 80.6%.

Table 4 Generalization performance of all the predictive models and performance of the LASSO regression model (best performing model) on a binary scale

The assessment of variable importance (Table 5) showed that 4 of the 51 independent variables (age 13 caries count, age 17 caries count, the amount of sugar-sweetened beverages intake from age 9 to 13, and the frequency of sugar-sweetened beverages intake from age 13 to 17) were important in the prediction of the cavitated caries count at age 23, and all were positively associated with it. The age 17 caries count was the most important variable in the prediction of the D2+MFS23 count (see Appendix VIII for variable importance plot).

Table 5 Variable importance and beta-coefficients from the LASSO regression model


Dental caries is a chronic infectious disease with significant public health implications, including in young adults. Our study is one of the first to use machine learning to predict cavitated caries outcomes in young adults from longitudinally obtained behavioral and dietary variables.

Our study found a relatively high prevalence of cavitated caries, similar to the findings from the Garcia-Cortes et al. [4] and Jamieson et al. [7] studies conducted within the same age group. However, other studies had much higher mean DMFT/S and percentage prevalence (D2+MFS > 0) for this age group compared to our study [5, 6]. These variations might have been due to the variation in the studies’ caries assessment methods, geographic differences and time periods, with caries rates now generally lower overall than in the past.

Exploratory data analysis showed that the D2+MFS23 count was significantly associated with family income and composite SES, agreeing with Ismail et al.’s study [3] but contradicting Drachev et al.’s study [5]. Also, the correlations between D2+MFS23 count and previous caries experience at ages 9, 13, and 17 found in our study are consistent with conventional knowledge and the findings of other studies [21,22,23].

Out of all four of the ML models assessed, LASSO regression was the best-performing model, followed by GBM, then GLM (negative binomial), and lastly the XGBOOST model. The LASSO model had the lowest error rate (RMSE and MAE) and highest R-squared compared to the rest of the models. This is contrary to our conventional approach in traditional statistics where count data are usually analyzed using Poisson regression or negative binomial regression models. This clearly demonstrates one of the capabilities of ML to objectively identify models that best fit and explain the variability in the data, rather than relying on statistical assumptions as in regular statistics. Based on the R-squared, only about 44% of the variability in the age 23 caries counts was explained by the variables in the model. A limitation of the use of only R-squared as a performance metric is that it cannot indicate prediction bias in a model (i.e., bias-variance trade-off). It does not tell if the model adequately fits the data or not.

With the discretization and dichotomization of the observed and predicted values of the LASSO model, the model was 84% accurate overall in predicting whether or not a young adult will have caries given their previous caries experience and exposure to dietary, fluoride, and behavioral elements. Our study’s precision (86%) and recall (93%) mean that about 14% of those predicted to have caries experience did not have it, while about 7% of those who had caries experience were wrongly predicted as having had no caries. There are no other similar studies in children, adolescents, young adults, middle-aged, or older adults with which to compare our findings.

We identified four variables (age 13 caries experience, age 17 caries experience, the amount of sugar-sweetened beverages intake from age 9 to 13, and frequency of sugar-sweetened beverages intake from age 13 to 17) as the most important ones in the prediction of age 23 cavitated caries counts. Age 17 caries experience was the most important predictor of caries counts in young adults, followed by the age 13 caries count, then the amount of sugar-sweetened beverages intake at age 13, and finally, the frequency of sugar-sweetened beverages intake at age 17. This agrees with our hypotheses and conventional knowledge that there are positive associations between caries outcomes and consumption of sugar-sweetened beverages and previous caries experiences. Other variables like total fluoride intake, SES, and brushing frequency which were significant in the bivariate analysis were not selected in the final model. Our finding also suggests that it takes about 5 to 10 years for the teeth to show obvious cavitation following exposure to sugar-sweetened beverages. The policy implication of this finding suggests that it will take about 5 to 10 years to truly observe the effects of preventive oral health interventions such as sugar taxes on caries outcomes at a population level.

The limitations of the study include the moderate sample size, inability to include all possible explanatory variables like genetic variables, and non-generalizability of the findings due to the local nature of the data (mostly non-Hispanic white and higher than average SES Iowans). We attempted to address the issue of limited sample size by using the nested resampling technique with cross-validation. The addition of other variables, like genetic factors, oral bacterial profiles, dental visits, malocclusion, and other systemic diseases might help improve the accuracy and precision of the predictive models.

This study is unique and innovative because it is the first study to use machine learning to predict a cavitated caries experience outcome in young adults using longitudinally obtained fluoride, dietary, and behavioral variables. The longitudinal predictor variables and the use of data from prior years to make predictions add some level of temporality to our study, allowing us to attribute some level of causality to our study findings and prediction. The use of nested resampling with cross-validation helped minimize bias in prediction by ensuring multiple portions of the data were prospectively used in the prediction of the caries outcome. Finally, unlike regular statistical modeling, the choice of an ML model like LASSO regression allowed for the capability of performing dimensionality reduction and feature (variable) selection, as well as assessment of variable collinearity and possible interactions among predictor variables.


Our ML approach generated an accurate, sensitive, and precise model for the prediction of caries in young adults using longitudinally obtained exposure variables. Our model suggests that continued exposure to a sugary diet for about 5 to 10 years could result in cavitated caries. Our ML algorithm could, in the future, after further development and validation with other diverse population data, be used by dentists and non-dentists as a screening tool to identify the risk of caries in young adults. This will facilitate the translation of caries research into actionable insights that can help improve the quality of life of young adults.

Data availability

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions. We are currently working to share all original Iowa Fluoride Study/Iowa Bone Development Study data later in 2023 through the dbGaP repository under U01-DE028522.


  1. Featherstone JD, Domejean-Orliaguet S, Jenson L, Wolff M, Young DA. Caries risk assessment in practice for age 6 through adult. J Calif Dent Assoc. 2007;35(10):703–13.

  2. Brown LJ, Wall TP, Lazar V. Trends in caries among adults 18 to 45 years old. J Am Dent Assoc. 2002;133(7):827–34.

  3. Ismail AI, Burt BA, Brunelle JA. Prevalence of total tooth loss, dental caries, and periodontal disease in Mexican-American adults: results from the southwestern HHANES. J Dent Res. 1987;66(6):1183–8.

  4. García-Cortés JO, Medina-Solís CE, Loyola-Rodriguez JP, Mejía-Cruz JA, Medina-Cerda E, Patiño-Marín N, Pontigo-Loyola AP. Dental caries’ experience, prevalence and severity in Mexican adolescents and young adults. Revista De Salud Pública. 2009;11:82–91.

  5. Drachev SN, Brenn T, Trovik TA. Dental caries experience and determinants in young adults of the Northern State Medical University, Arkhangelsk, North-West Russia: a cross-sectional study. BMC Oral Health. 2017;17:1–0.

  6. Isaksson H, Alm A, Koch G, Birkhed D, Wendt LK. Caries prevalence in Swedish 20-year-olds in relation to their previous caries experience. Caries Res. 2013;47(3):234–42.

  7. Jamieson LM, Roberts-Thomson KF, Sayers SM. Dental caries risk indicators among Australian Aboriginal young adults. Commun Dent Oral Epidemiol. 2010;38(3):213–21.

  8. Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B: Stat Methodol. 1996;58(1):267–88.

  9. Hughes G. On the mean accuracy of statistical pattern recognizers. IEEE Trans Inf Theory. 1968;14(1):55–63.

  10. Toledo Reyes L, Knorst JK, Ortiz FR, et al. Early childhood predictors for dental caries: a machine learning approach. J Dent Res. 2023;102(9):999–1006.

  11. Park Y-H, Kim S-H, Choi Y-Y. Prediction models of early childhood caries based on machine learning algorithms. Int J Environ Res Public Health. 2021;18(16):8613.

  12. Levy SM, Hong L, Warren JJ, Broffitt B. Use of the fluorosis risk index in a cohort study: the Iowa fluoride study. J Public Health Dent. 2006;66(2):92–6.

  13. Levy SM, Warren JJ, Davis CS, Kirchner HL, Kanellis MJ, Wefel JS. Patterns of fluoride intake from birth to 36 months. J Public Health Dent. 2001;61(2):70–7.

  14. Greenwell B, Boehmke B, Cunningham J, GBM Developers. gbm: generalized boosted regression models. R Package Version. 2019;2(5):37–40.

  15. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.

  16. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug 13. pp. 785–794.

  17. Smith BJ. MachineShop: machine learning models and tools. R Package Version. 2021;3(0).

  18. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–85.

  19. Kuhn M, Wickham H. Recipes: Preprocessing tools to create design matrices. R Package Version. 2020; (1.8).

  20. Becker M, Binder M, Bischl B, Lang M, Pfisterer F, Reich NG, et al. mlr3 book. 2021. In: Applied machine learning using mlr3 in R. CRC Press; 2021.

  21. Alm A, Wendt LK, Koch G, Birkhed D, Nilsson M. Caries in adolescence–influence from early childhood. Commun Dent Oral Epidemiol. 2012;40(2):125–33.

  22. Haugejorden O, Magne Birkeland J. Ecological time-trend analysis of caries experience at 12 years of age and caries incidence from age 12 to 18 years: Norway 1985–2004. Acta Odontol Scand. 2006;64(6):368–75.

  23. Rise J, Haugejorden O, Birkeland JM. Relationship between caries prevalence and incidence among adolescents. Commun Dent Oral Epidemiol. 1982;10(6):340–4.


Chukwuebuka Ogwo (the corresponding author) led and contributed to the conception, design, data acquisition, and interpretation, performed all statistical analyses, and drafted and critically revised the manuscript. Steven Levy contributed to the conception, design, data acquisition, and interpretation, and drafted and critically revised the manuscript. John Warren and Grant Brown contributed to the conception, design, data analysis, and interpretation, and critically revised the manuscript. Daniel Caplan contributed to the data analysis, interpretation, drafting, and revision of the manuscript. We thank Alex Curtis and Chandler Pendelton for data management and statistical support.


This research was supported in part by NIH grants (R01-DE09551, R01-DE12101, M01-RR00059, UL1-RR024979), the Roy J. Carver Charitable Trust, the Delta Dental of Iowa Foundation, and the analysis of the dissertation was supported by the Wefel award from the University of Iowa College of Dentistry and the Post-Comprehensive Graduate Research award from the University of Iowa Graduate College. The publication was supported by Cary Kleinman Oral Health Sciences Research Fund.

Author information


Chukwuebuka Ogwo, John Warren, Daniel Caplan, and Steven Levy wrote the main manuscript text - specifically the background/introduction, methods, and discussion. Chukwuebuka Ogwo and Grant Brown performed the statistical analysis and wrote the statistical analysis and the results sections. All authors reviewed the manuscript.

Corresponding author

Correspondence to Chukwuebuka Ogwo.

Ethics declarations

Ethics approval and consent to participate

Approval for the Iowa Fluoride Study was obtained from the University of Iowa Institutional Review Board for all components and procedures of the study. Informed consent was obtained from the participants prior to the examinations and questionnaires during age 23 assessments, with assent obtained at ages 13 and 17. Informed consent also was obtained from the participants’ parents at all ages up to the children’s age 17. All the methods included in this study are in accordance with the Declaration of Helsinki.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. A copy of this licence is available from Creative Commons. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


Cite this article

Ogwo, C., Brown, G., Warren, J. et al. Predicting dental caries outcomes in young adults using machine learning approach. BMC Oral Health 24, 529 (2024).
