^{1}

^{*}

^{2}

^{3}

Conceived and designed the experiments: RVR SMF BSW. Performed the experiments: RVR. Analyzed the data: RVR. Wrote the paper: RVR SMF BSW.

The authors have declared that no competing interests exist.

With the expansion of offender/arrestee DNA profile databases, genetic forensic identification has become commonplace in the United States criminal justice system. Implementation of familial searching has been proposed to extend forensic identification to family members of individuals with profiles in offender/arrestee DNA databases. In familial searching, a partial genetic profile match between a database entrant and a crime scene sample is used to implicate genetic relatives of the database entrant as potential sources of the crime scene sample. In addition to concerns regarding civil liberties, familial searching poses unanswered statistical questions. In this study, we define confidence intervals on estimated likelihood ratios for familial identification. Using these confidence intervals, we consider familial searching in a structured population. We show that relatives and unrelated individuals from population samples with lower gene diversity over the loci considered are less distinguishable. We also consider cases where the most appropriate population sample for individuals considered is unknown. We find that as a less appropriate population sample, and thus allele frequency distribution, is assumed, relatives and unrelated individuals become more difficult to distinguish. In addition, we show that relationship distinguishability increases with the number of markers considered, but decreases for more distant genetic familial relationships. All of these results indicate that caution is warranted in the application of familial searching in structured populations, such as in the United States.

The forensic identification of criminal suspects through DNA profiling is now common in the United States. Indirect identification by familial DNA profiling is increasingly proposed to extend the utility of DNA databases. In familial searching, a DNA profile from a crime scene partially matches a database profile entry, implicating close relatives of the partial match. While the basic principles behind familial searching methods are simple and elegant, statistical confidence that a partially matched profile belongs to a true genetic relative has not been fully explored. Here, we derive relative identification likelihood ratio statistics and consider how the ability of familial searching to distinguish relatives from unrelated individuals varies over population samples and is affected by inaccurately assumed population background. We observe lower relationship distinguishability for population samples with less identifying information in the genetic loci considered. Additionally, we show that relationship distinguishability decreases with discordance between true and assumed population samples. These results indicate that, if an inappropriate genetic population group is assumed, individuals from certain marginalized groups may be disproportionately more often subject to false familial identification. Our results suggest that care is warranted in the use and interpretation of familial searching forensic techniques.

Forensic identification via exact genetic profile matching has become common practice in the United States

Familial searching is now used fairly frequently in the United Kingdom and was instrumental in the identification of suspects of violent crimes for 20 cases lacking other evidence as of 2008

Despite the increasing use of familial searching in the United States, important questions about the method remain on both social and scientific grounds. In order to understand these concerns, we must appreciate that familial searching is most useful as a database mining method in cases with no suspects. In the United States, the Combined DNA Index System (CODIS) is the Federally administered system for National DNA Index System (NDIS), the national offender/arrestee database, which includes entries from State DNA Index Systems

These features of the forensic testing landscape matter because, unlike exact DNA identification, a typical database search for familial matches prospectively identifies candidate suspects who, while closesly genetically related to database entrants, are not in themselves in the database, provoking complex privacy concerns

The question of relative inference has been well-studied in other contexts with varying marker types, relationships, and numbers of individuals

In this study, we aim to examine some of these concerns by exploring how familial searching techniques behave on populations with varying allele frequency distributions and varying accuracy of allele frequency specification. We formulate and calculate confidence intervals for familial identification likelihood ratio (LR) estimates, and investigate how well siblings and unrelated individuals can be distinguished over different population samples with varying allele frequency distributions and under accurately and inaccurately assumed allele frequency distributions. We show that population samples vary in the amount of identifying information encoded in the CODIS loci and, therefore, in relationship distinguishability, even with correctly specified allele frequencies. Since completely accurate allele frequency specification is not guaranteed and the most appropriate population sample may not be known or available, we are also interested in the systematic effects of assuming allele frequencies which are appropriate for one population, but which are not appropriate for the individuals investigated. We show that relationship distinguishability decreases with the accuracy of allele frequency estimates, potentially resulting in high rates of coincidental familial identification for some groups. These results are especially pertinent in the multiple testing context of large database searching. In addition, we explore the relationships between relationship distinguishability, the number and type of markers used for identification, the relationship considered, and the true and assumed coancestry coefficient parameter value.

To determine if a partial genotype match is better explained by a genetic familial relationship or stochasticity, we used the ratio of the likelihood of the observed partial match assuming the individuals share a given genetic familial relationship, to the likelihood of the observed partial match assuming the individuals are unrelated. With the data available, this LR is the most powerful statistic to separate relatives from unrelated individuals

Unrelated individuals were simulated in a randomly mating population by independently drawing alleles from allele frequency distributions, similarly to Bieber

Note that across all of these simulations specific parameter values were chosen and kept constant, specifically, sibling relationships, the assumed coancestry coefficient (probability of two alleles being identical by descent (IBD) between two individuals not recently related) used in calculations of

To understand the degree to which

Each individual plot shows the distribution of

True population sample | |||||

Vietnamese | African American | European American | Latino | Navajo | |

Vietnamese | 6.94 | 6.35 | 6.19 | 6.51 | 5.00 |

African American | 5.91 | 7.38 | 6.41 | 6.59 | 4.68 |

European American | 6.23 | 6.64 | 6.98 | 6.90 | 4.71 |

Latino | 6.31 | 6.80 | 6.83 | 7.07 | 5.21 |

Navajo | 5.01 | 5.94 | 5.74 | 5.94 | 5.87 |

Differences in distinguishability between population samples are rooted in differences in the shapes of allele frequency distributions. Since alleles and individuals are simulated independently, varying distinguishability over populations cannot be due to varying consanguinity and must be attributed to varying allele frequency distributions. In the examined population samples, the shape of allele frequency distributions can vary substantially. As a dramatic, but atypical, example,

Intuitively, it is clear that a monomorphic locus contains no identifying information, while a locus with a unique polymorphism for every individual contains complete identifying information. Along this spectrum, a locus with a low-variance allelic type distribution is less identifying than a locus with a high-variance allele frequency distribution.

This concept of varying identifying information can be quantified as observed gene diversity (or equivalently, average expected heterozygosity)

The calculated

The empirical distinguishability (

Information theory can provide a more direct measure of identifying information through entropy, which we calculate to quantify the number of bits required to encode an equivalent amount of information as a CODIS haplotype for each population group. We find that relationship distinguishability is even more correlated with entropy than observed gene diversity, which follows since entropy quantifies information content which directly affects distinguishability (see

By calculating LCL under different assumed and true population sample allele frequencies, the relationship between allele frequency misspecification and relationship distinguishability can be examined. By looking at plots and values off the diagonal,

In this study, we chose not to define a single decision threshold for determining positive relative identifications since such a threshold depends on a number of factors beyond the scope of this study (e.g., the social, economic, and political cost of false positives and negatives). For a range of decision thresholds,

The empirical power (dashed) and false positive rate (solid) are shown for a range of sibling versus unrelated

It is notable that even when the correct allele frequencies are used, the false positive rate is lower than the

We observed lower distinguishability when the true and assumed allele frequency distributions differ more. The degree of difference between population sample allele frequency distributions at the CODIS alleles is quantified for every population pair using

African American | European American | Latino | Navajo | |

Vietnamese | 0.038 | 0.026 | 0.021 | 0.068 |

African American | 0.017 | 0.015 | 0.067 | |

European American | 0.000 | 0.059 | ||

Latino | 0.041 |

To explore the relationship between distinguishability and the genetic distance between true and assumed population samples, in

The empirical measure of distinguishability (

Intuitively, when allele frequencies are misspecified, the most likely error is assuming that common alleles are more rare simply because truly common alleles are more likely to be observed than truly rare alleles. In the same way, rare alleles are assumed to be common, but by definition, rare alleles are less likely to be observed shared between individuals, so overall the misspecification of common alleles as rare dominates. When misspecifying common alleles as rare, observing the same common alleles in multiple individuals seems surprising, so a genetic relationship model is favored over a model of no relationship. That is, the probability of a partial match assuming a relationship is inflated and the probability of a partial match assuming no relationship is deflated. In this way, allele frequency misspecification results in an increase in false positive relative identifications.

Although the relationship between distinguishability and allele frequency misspecification has not yet been deeply considered in the context of genetic familial identification (but see

We have shown clear differences in average observed gene diversity of the CODIS loci and resulting differences in sibling and unrelated individual distinguishability in the five population samples considered. To ensure that these findings extend beyond the samples examined, we considered a larger dataset with a total of 32 population samples

In the analysis presented thus far, we showed how distinguishability varies over true and assumed population samples with varying allele frequency distributions. To maintain manageable dimensionality, some key parameters likely to vary in forensic analyses were kept constant. Here we explore the relationships between these parameters, particularly different genetic relationships, varying marker data, and varying the true and assumed coancestry coefficients (

Pairs of individuals were simulated taking into account the true coancestry coefficient,

We simulated two types of markers with equi-frequent alleles: 10-allele STRs and 2-allele SNPs. We varied the number of simulated markers over 10, 20, 30, 40, 50, and 60 STRs and 10, 50, 100, 150, 200, and 250 SNPs in independent simulations. Distinguishability between the LCL distributions of true relatives and unrelated individuals were calculated for each of these simulations (

For unrelated individuals,

The genetic similarlity of relatives can be quantified with the kinship coefficient, which is the probability that a pair of alleles from relatives are IBD. The kinship coefficient for parent-offspring, sibling, half-sibling, first cousin, and second cousin relationships are

To explore the relationship between true coancestry coefficient

The analysis presented here confirms and quantifies the intuition from population genetics that for particular loci, groups with comparatively low-variance allele frequency distributions have less identifying information encoded in genotypes. Decreased identifying information results in lower relationship distinguishability, even when the correct allele frequency estimates are used (

With a basic understanding of population genetics, it is clear that socially defined groups, like Navajo, Latino, or European American, have very different underlying population structures reflecting distinct demographic history, degrees of genetic diversity, and admixture. It is hardly surprising that a group which has undergone multiple population size reductions, like the Navajo, has a lower-variance allele frequency distribution than a group with a history of genetic diversity and social inclusion, like African Americans. This is particularly evident at the CODIS loci, which were chosen in part because of their broad allele frequency distributions in a few studied populations, without considering all relevant populations

These population differences in allele frequency distributions are key when considering a potential source of error: inappropriately assumed allele frequency distributions. When the allele frequency distributions for an inaccurately specified population group are assumed, the probabilities of the observed data under a sibling relationship and under no close genetic relationship become less distinct, so relationship distinguishability decreases. We found that distinguishability decreases with increased distance between assumed and true allele frequency distributions, as measured through

The results of this analysis indicate that when a decision threshold is chosen so that the power to identify siblings is reasonably high, population samples with allele frequencies which differ from those assumed would experience disproportionately higher rates of false positive familial identification (

Motivated by the question of forensic familial searching, in this analysis we focus on distinguishing relatives with a specified relationship and unrelated individuals. In other contexts, it may be more appropriate to distinguish different kinds of relatives (e.g. siblings and parent-offspring) or relatives with an unspecified relationship and unrelated individuals. In the former case, the ratio of LRs for the relationships of interest versus unrelated individuals reduces to the LR comparing the two specified relationships. In the later case, models allowing IBD sharing probabilities to vary can be formulated and incorporated into the LR. For example, when comparing a null model with set IBD sharing probabilities for unrelated individuals and an alternative where the likelihood of data is maximized over any IBD sharing probabilities, a LR test can be formulated which follows a

This analysis considers familial identification in a forensic context, but is applicable to tests for relatedness applied in the various contexts especially when considering unlinked genetic markers as in paternity investigation, ecological surveys, and conservation biology. When more extensive genotype or sequence data are available, it is appropriate to use more sophisticated tests for relatedness considering linkage or shared haplotype length

The population genetic model used in forensic identification is remarkably coarse. In direct identification, the CODIS loci provide ample data to determine identity and non-identity, even with the coarse population genetic model of a small number of discrete homogenous genetic groups corresponding to social racial groups. We have shown that under this model, new concerns arise with familial searching. However, the model itself requires some scrutiny. It is clear that human genetic population structure is complex and humans are not easily split into a small number of discrete homogenous genetic groups

Forensic familial searching will most likely be implemented in the context of a large offender/arrestee database, introducing questions of multiple testing over both database entrants, and the number of genetic familial relationships considered. Because forensic methodology practice varies over jurisdictions, it is not clear how these multiple testing issues have been, or will be, addressed. However, it is reasonable to assume that familial searching will result in a list of partial database matches with

Individual and population genotype information is necessary to determine the extent to which inaccurately assumed allele frequencies cause high false positive rate in familial matching in practice. For instance, in this study, we considered unrelated individuals, conforming to exactly one of five allele frequency distributions, in completely randomly mating populations. However the use of familial searching rests on the premise that relative groups are in the database and population structure is undeniably present in most databases

If implemented with the core CODIS loci, familial searching may result in low distinguishability and potentially high false positive rates among certain groups, especially if only African American, European American, Southeastern Latino, and Southwestern Latino allele frequency distributions are in assumed LR calculations, as recommended by SWGDAM

Our analysis makes use of allele frequency data for the 13 CODIS loci over different population samples socially defined by race. Note that alternate schemes to group individuals will also produce genetic differences between groups

The consent and population grouping procedures used in obtaining these data are not clear. In the time since these data were collected, dominant cultural ethics regarding informed consent process have changed considerably, motivated largely by several cases of severe misuse of samples provided by Indigenous communities

LRs are used to compare the probability of observed genotypes for two individuals under two different hypotheses: the individuals are unrelated (

By assuming independence between all CODIS loci,

Relationships between individuals can be described using the identical by descent (IBD) sharing probabilities

Using these IBD sharing probabilities, the LR becomes

The LR described above provides information about whether the observed data are more likely for unrelated or related individuals. However, the true population allele frequencies (

Sampling variation is inherent in allele frequency estimation since a random sample must be chosen for the estimate. By their nature, different random samples vary in their representation of specific alleles, resulting in different allele frequency estimates. Additionally, random genetic sampling exists in the historical differentiation of populations, resulting in population groups with distinct allele frequencies. Since all present-day human population groups descend from a common ancestral population, the alleles present in each present-day population group reflect a sample of the alleles from the common ancestral population.

Under evolutionary equilibrium and a simple model of demographic history, the relationship between population group allele frequencies (

Using the same approach as Beecham and Weir

Using the data provided by Budowle and Moretti

The total lack of population structure or cryptic relatedness (

In the second part of this analysis, when we consider the interplay between various parameters, it is necessary to simulate unrelated individuals from a population with a given non-zero coancestry coefficient (

We are interested in comparing LCL distributions generated with different parameters, particularly LCL distributions for truly unrelated individuals and truly related individuals. If the relationship

To quantify distinguishability, we use an empirical version of the measure proposed by Visscher and Hill

Confidence intervals by population samples. Each plot shows the 100 replicates of

(EPS)

Allele frequency distributions. Each plot shows the D3S1358 allele frequency distribution for each population.

(EPS)

(EPS)

Distinguishability (

(EPS)

(EPS)

(EPS)

Supporting work is presented, specifically genotype probability equations,

(PDF)

We are immensely grateful to the individuals whose DNA samples make up the datasets referenced in this manuscript. This work could not have been performed without that information. In addition, we thank Timothy Thornton and Kirk Lohmueller for their thoughtful discussion during the preparation of this manuscript, Nanibaa' Garrison for her insightful guidance regarding data description and use, Kim TallBear for valuable consultation on appropriate data handling, and three anonymous reviewers for productive commentary and discussion.

_{ST}