Recruitment and data collection
The Center for Oral Health Research in Appalachia (COHRA) was created to identify the community-, family-, and individual-level predictors of oral health outcomes in the Appalachian population, a vulnerable subpopulation with poorer oral health compared to the greater US population [27–29]. COHRA participants were recruited by household as previously described [6, 7, 26], whereby eligible households were required to include at least one biological parent-offspring pair with the child being 1 to 18 years of age. All members of eligible households were invited to participate without regard to their oral health status, demography, or biological or legal relationships. Written informed consent was provided by all adult participants. Assent with parent or guardian written consent was provided on behalf of all child participants. The study was approved by the COHRA research committee and the Institutional Review Boards of the University of Pittsburgh and West Virginia University.
In total, 732 households were recruited, which comprised 2,663 individuals from 740 biological kinships of 1 to 20 family members (mean = 4.72 members). Some kinships spanned multiple households, whereas other households contained multiple kinships. Reported familial relationships were validated using panels of ancestry-informative and whole-genome genetic marker data provided by the Center for Inherited Disease Research at Johns Hopkins University and quality checked jointly by study investigators and the Coordinating Center for the NIH Genes and Environment Initiative (GENEVA).
Dental caries was assessed via visual inspection with a dental explorer during intra-oral dental examinations conducted by dentists or research dental hygienists, who were calibrated against a reference dentist at least once per year. Inter- and intra-examiner concordances of caries assessments were high [7, 26]. Each tooth surface was scored as sound, pre-cavitated, decayed, filled, missing due to decay, or missing due to reasons other than decay, in accordance with the World Health Organization DMFS/dfs scale and the NIH/NIDCR-approved protocol for assessing dental caries for research purposes. This method of caries assessment is compatible with that recommended by the PhenX Toolkit (http://www.phenxtoolkit.org; designed to facilitate combining data across studies) and by the National Center for Health Statistics Dental Examiners Procedures Manual (see Section 126.96.36.199). Third molars were excluded from caries assessment. Edentulous individuals were recruited into the study but were excluded from caries assessment and analysis.
The analytic goal of the present study was to explore patterns of dental caries of the permanent dentition in adults. Therefore, we excluded children by restricting our study sample to the 1,068 participants aged 18 to 75 years. For each participant, surface-level caries data on 128 surfaces (i.e., 4 surfaces for each incisor and canine, and 5 surfaces for each premolar and molar) were coded as 0 for sound or missing due to reasons other than decay, or coded as 1 for pre-cavitated, decayed, missing due to decay, or filled/restored. Thus, we generated a matrix of 1,068 participants by 128 indicators of surface-level caries affection status. This matrix was used as input for two related methods of extracting patterns within the data: principal components analysis (PCA) and factor analysis (FA).
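The binary coding described above can be sketched in Python as follows. This is only an illustration: the status labels and the names `code_surface`, `code_participant`, and `N_SURFACES` are hypothetical, and the surface count assumes a full permanent dentition with third molars excluded (8 incisors, 4 canines, 8 premolars, and 8 molars).

```python
# Hypothetical status labels; the study's actual data codes are not given in the text.
AFFECTED = {"pre-cavitated", "decayed", "missing due to decay", "filled"}
UNAFFECTED = {"sound", "missing other"}


def code_surface(status):
    """Return 1 for a caries-affected surface and 0 otherwise."""
    if status in AFFECTED:
        return 1
    if status in UNAFFECTED:
        return 0
    raise ValueError(f"unknown surface status: {status!r}")


# Surfaces per participant, third molars excluded:
# 12 incisors and canines with 4 surfaces each, 16 premolars and molars with 5 each.
N_SURFACES = 12 * 4 + 16 * 5


def code_participant(statuses):
    """Turn one participant's surface statuses into a row of 0/1 indicators."""
    if len(statuses) != N_SURFACES:
        raise ValueError("expected one status per scored surface")
    return [code_surface(s) for s in statuses]
```

Stacking one such row per participant would give the participants-by-surfaces indicator matrix used as input to both methods.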
PCA uses singular value decomposition of the data matrix to extract a set of uncorrelated variables (called principal component scores, PCs), where the first PC (i.e., PC1) explains the greatest possible amount of variability in the data in a single dimension, the second PC (i.e., PC2) explains the greatest possible amount of remaining variability in a single dimension orthogonal to PC1, and so on. The result is a number of orthogonal PCs equal to the number of original variables (in our data, 128), with successive PCs each explaining less of the data variability. Each PC can be defined as a linear combination of the original variables weighted by their loadings. The first several PCs may represent important patterns in the data, essentially capturing underlying signals from a greater number of correlated phenotype measurements. The loadings provide a way of interpreting the PCs in terms of the original variables. In other words, the loadings describe the pattern of carious lesions across the permanent dentition for a given PC, whereas the PC scores indicate the extent or severity of caries following that decay pattern.
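To make the relationship between loadings and scores concrete, the following is a minimal Python sketch on toy data. It is not the analysis code used in the study: to stay dependency-free it finds only the first component, and it does so by power iteration on the sample covariance matrix rather than by singular value decomposition. The data are centered but not scaled. All function names and the toy matrix are hypothetical.

```python
import math


def center_columns(X):
    """Subtract each column mean (centering only, no scaling)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[j] * r[k] for r in Xc) / (n - 1) for k in range(p)]
            for j in range(p)]


def first_loadings(C, iters=200):
    """Leading eigenvector of C by power iteration, scaled to unit length."""
    p = len(C)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(C[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def scores(Xc, v):
    """PC scores: each participant's centered data projected onto the loadings."""
    return [sum(x * l for x, l in zip(row, v)) for row in Xc]


# Toy data: 6 participants by 3 surfaces; surfaces 0 and 1 always decay together.
X = [[1, 1, 0], [1, 1, 0], [0, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
Xc = center_columns(X)
loadings = first_loadings(covariance(Xc))
pc1 = scores(Xc, loadings)
```

In this toy example the first set of loadings weights surfaces 0 and 1 equally and surface 2 almost not at all, which is the "pattern"; the score in `pc1` is higher for participants in whom both of those surfaces are affected, which is the "severity" along that pattern.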
FA is similar to PCA in that it is used to extract latent variables called factor scores (FACs) from an original data matrix. Like PCs, FACs are calculated as linear combinations of the original variables weighted by their loadings, except that the number of FACs used to model the patterns in the data is chosen a priori, and the FACs are not constrained to be orthogonal. In this study, we modeled the caries data matrix using 10 factors. Like PCA, the goal of FA is to generate FACs representing underlying signals in the data matrix that can then be used as phenotypes, in this case, to identify the risk factors for dental caries.
In practice, FA and PCA often perform similarly. However, the two methods take opposite perspectives in extracting patterns from a data matrix: PCA assumes that the observed variables provide the basis for the patterns, whereas FA assumes that latent patterns provide the basis for the observed variables. In this way, PCA is often used for dimension reduction, i.e., summarizing the information from a large number of variables with a few variables, whereas FA may better represent underlying "endophenotypes", i.e., unmeasured phenotypes that manifest as the observed variables. For both PCA and FA, the loadings define the patterns of decay and the PCs and FACs describe the severity of disease for their corresponding patterns.
For comparison to the PCs and FACs, we also generated three a priori caries phenotypes: the DMFS index, pit and fissure surface caries (PFS), and smooth surface caries (SMS). These a priori phenotypes are commonly used in the caries literature. DMFS was calculated as the number of pre-cavitated, decayed, missing due to decay, or filled/restored surfaces. PFS and SMS were calculated in the same way as DMFS except that counts were limited to pit and fissure surfaces and smooth surfaces, respectively. Occlusal surfaces of the premolars and molars, buccal surfaces of the maxillary molars, and lingual surfaces of the mandibular molars were considered pit and fissure surfaces. All other tooth surfaces were considered smooth surfaces.
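The three a priori phenotypes can be computed from the same 0/1 surface indicators. The sketch below follows the surface classification exactly as listed in the paragraph above; the record layout (arch, tooth type, surface name, affected flag) and the function names are hypothetical and chosen only for illustration.

```python
def is_pit_and_fissure(arch, tooth_type, surface):
    """Classify a surface as pit-and-fissure using the rules stated in the text."""
    if surface == "occlusal" and tooth_type in ("premolar", "molar"):
        return True
    if tooth_type == "molar" and arch == "maxillary" and surface == "buccal":
        return True
    if tooth_type == "molar" and arch == "mandibular" and surface == "lingual":
        return True
    return False


def caries_indices(records):
    """Return (DMFS, PFS, SMS) for one participant.

    records: iterable of (arch, tooth_type, surface, affected) tuples,
    where affected is the 0/1 caries indicator for that surface.
    """
    dmfs = pfs = 0
    for arch, tooth_type, surface, affected in records:
        dmfs += affected
        if is_pit_and_fissure(arch, tooth_type, surface):
            pfs += affected
    sms = dmfs - pfs  # every surface that is not pit-and-fissure is smooth
    return dmfs, pfs, sms


# Hypothetical partial record for one participant.
example = [
    ("maxillary", "molar", "occlusal", 1),
    ("maxillary", "molar", "buccal", 1),
    ("mandibular", "molar", "lingual", 0),
    ("mandibular", "molar", "buccal", 1),
    ("maxillary", "incisor", "mesial", 1),
    ("mandibular", "premolar", "occlusal", 0),
]
dmfs, pfs, sms = caries_indices(example)
```

Because every surface is either pit-and-fissure or smooth, PFS and SMS always sum to DMFS under this definition.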
To assess the stability of patterns identified by PCA and FA, we performed a sensitivity analysis by repeating PCA and FA on ten random subsets of the data, each comprising 80% of the full sample. We compared the PCs and FACs obtained from random subsets to those from the full sample using the Pearson correlation coefficient, r. PCs 1-4 were extremely stable (r = 0.98 to 1.00), PCs 5-9 were stable (r = 0.86 to 0.95), and PC 10 was moderately stable (r = 0.77) across random subsets. FACs 1-6 were stable (r = 0.86 to 0.99), and FACs 7-10 were moderately stable (r = 0.69 to 0.82) across random subsets. Likewise, we assessed the effect of relatives on PCA and FA by repeating these methods in the maximal subset of unrelated individuals. PCs 1-10 and FACs 1-8 from the unrelated sample were highly correlated (r > 0.95) with those from the full sample, whereas FAC9 and FAC10 were moderately correlated (r = 0.57 and 0.81, respectively). Altogether, these results suggest that caries patterns were generally stable and robust to the inclusion of relatives among the sample.
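The subset comparison can be outlined in Python as below. Several details are assumptions rather than reported facts: that scores are compared across the participants shared by the two fits, that the absolute value of r is taken because the sign of a component is arbitrary, and all of the names (`pearson_r`, `random_subset`, `stability`). Refitting PCA or FA on each subset is not shown.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_subset(ids, fraction=0.8, seed=0):
    """Draw a random subset of participant IDs without replacement."""
    rng = random.Random(seed)
    k = round(fraction * len(ids))
    return sorted(rng.sample(ids, k))


def stability(full_scores, subset_scores, subset_ids):
    """Correlate full-sample and subset-refit scores over the shared participants.

    full_scores, subset_scores: dicts mapping participant ID to a score.
    The absolute value is an assumption to handle arbitrary component sign.
    """
    x = [full_scores[i] for i in subset_ids]
    y = [subset_scores[i] for i in subset_ids]
    return abs(pearson_r(x, y))


ids = list(range(1068))
subset = random_subset(ids)
```

Repeating this with ten different seeds and summarizing the resulting correlations per component would correspond to the ranges of r reported above.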
Heritability estimates of PCs and FACs were calculated using the variance components approach. This method models phenotype correlations among all types of relatives as a function of the expected degree of genetic sharing (i.e., parents and offspring share 50% of their genome, siblings share 50%, half-siblings share 25%, unrelated individuals share 0%, etc.). Details for this method as applied to our study sample have previously been reported [6, 36]. The heritability estimate is interpreted as the proportion of phenotype variance attributable to the cumulative effect of all genes.
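In its simplest form, the variance components model described above can be written as follows. This is a generic sketch of the standard model, not the exact specification fitted in the study; the symbols are assumptions introduced here, and covariates and any shared-environment terms are omitted.

```latex
\operatorname{Var}(y_i) = \sigma_g^2 + \sigma_e^2, \qquad
\operatorname{Cov}(y_i, y_j) = 2\phi_{ij}\,\sigma_g^2 \quad (i \neq j), \qquad
h^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
```

Here $y_i$ is the phenotype (a PC or FAC) of individual $i$, $\sigma_g^2$ and $\sigma_e^2$ are the additive genetic and residual variances, and $2\phi_{ij}$ is the expected proportion of the genome shared by individuals $i$ and $j$: 0.5 for parent-offspring and full-sibling pairs, 0.25 for half-siblings, and 0 for unrelated individuals.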
All statistical analyses were performed in the R software package (R Foundation for Statistical Computing, Vienna, Austria), except heritability estimates, which were obtained from genetic modeling performed in SOLAR. Principal components analysis was performed using the prcomp function with default parameters. Factor analysis was performed using the factanal function with Thomson's regression-based scores, 10 factors, and other default parameters. Prevalences, correlations, and figures were all generated in R.