The authors have declared that no competing interests exist.

Conceived and designed the experiments: AR RDG. Performed the experiments: AR KW. Analyzed the data: AR CSL SCB KW. Contributed reagents/materials/analysis tools: CSL SCB KW. Wrote the paper: AR SCB RBA EHC RDG.

Many factors affect the risks for neurodevelopmental maladies such as autism spectrum disorders (ASD) and intellectual disability (ID). To compare environmental, phenotypic, socioeconomic and state-policy factors in a unified geospatial framework, we analyzed the spatial incidence patterns of ASD and ID using an insurance claims dataset covering nearly one third of the US population. Following epidemiologic evidence, we used the rate of congenital malformations of the reproductive system as a surrogate for environmental exposure of parents to unmeasured developmental risk factors, including toxins. Adjusted for gender, ethnic, socioeconomic, and geopolitical factors, the ASD incidence rates were strongly linked to population-normalized rates of congenital malformations of the reproductive system in males (an increase in ASD incidence by 283% for every percent increase in incidence of malformations, 95% CI: [91%, 576%], ^{−5}). Such congenital malformations were barely significant for ID (94% increase, 95% CI: [1%, 250%], ^{−5}), and 43% ID rate increase (CI: [23%, 67%], ^{−5}). Furthermore, the state-mandated rigor of diagnosis of ASD by a pediatrician or clinician for consideration in the special education system was predictive of a considerable decrease in ASD and ID incidence rates (98.6%, CI: [28%, 99.99%], ^{−16}).

Disease clusters are defined as geographically compact areas where a particular disease, such as a cancer, shows a significantly increased rate. It is presently unclear how common such clusters are for neurodevelopmental maladies, such as autism spectrum disorders (ASD) and intellectual disability (ID). In this study, examining data for one third of the whole US population, the authors show that (1) ASD and ID display strong clustering across US counties; (2) counties with high ASD rates also appear to have high ID rates; and (3) the spatial variation of both phenotypes appears to be driven by environmental factors and, to a lesser extent, by economic incentives at the state level.

Autism spectrum disorders (ASD) are a collection of chronic, complex neuropsychiatric diseases with well-characterized comorbidities and increasing apparent prevalence

Along these lines, geospatial clustering of ASD has been observed in California

Here, we report a mixed-effects Poisson regression analysis of the spatial incidence patterns of ASD and, for comparison, intellectual disability (ID). The data were derived from a very large insurance claims database containing nearly 100 million patients in the United States, which was augmented with census data to introduce additional county-level covariates that captured socioeconomic, demographic, and environmental effects. We present strong statistical evidence for environmental and legal factors driving the apparent spatial heterogeneity of both phenotypes, while documented socioeconomic factors and population structure have much weaker effects.

We analyzed the strength of disparate factors on the apparent incidence rates of ASD and ID by computationally interrogating insurance claims for approximately one third of the US population, using a bivariate-response, three-level, mixed-effects Poisson regression model with 50 free parameters, 44 of which correspond to the fixed effects of known factors while the remaining 6 account for the variance and covariance among the random effects (see Methods). The bivariate outcomes modeled by these parameters were the incidence counts for the ASD and ID phenotypes, tabulated separately for males and females in 3,111 counties, nested within 50 states (plus the District of Columbia) and adjusted for population size.
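The parameter count quoted above can be checked with a short sketch. It assumes (this breakdown is not stated explicitly in the text) that the 6 variance-covariance parameters come from one unstructured 2×2 covariance matrix for the two phenotypes at each of the two random-effect levels, state and county. The function name is illustrative.

```python
def covariance_parameter_count(n_outcomes: int, n_levels: int) -> int:
    """Free parameters in one unstructured covariance matrix per level.

    Assumption: each random-effect level (state, county) carries a full
    symmetric covariance matrix over the outcomes (ASD, ID).
    """
    per_level = n_outcomes * (n_outcomes + 1) // 2  # variances plus covariances
    return per_level * n_levels


N_FIXED = 44  # number of fixed-effect coefficients reported in the text

n_random = covariance_parameter_count(n_outcomes=2, n_levels=2)
total = N_FIXED + n_random
print(n_random, total)  # 6 50
```

Under this reading, each level contributes two variances and one covariance, which reproduces the 44 + 6 = 50 free parameters reported.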

The results are summarized in ^{−16}). The estimated Poisson rates of incidence produced by model-based inference are shown in

Plots for females look essentially identical, but the absolute rate of incidence is lower (data not shown). The predictor variables shown here are

The asterisks indicate the level of significance of individual regression coefficients; see the figure key and

In the figures we color-coded the Empirical Bayes estimates of the state-level random effects, separately for ASD and ID. County- and state-level random effects model the unknown factors that vary geographically and govern differences in county-specific disease rates

In the figures we color-coded the Empirical Bayes estimates of the state-level random effects, separately for ASD and ID. While county-specific random effects are directly comparable

Parameter | Effect | Estimate | Event Rate | Estimate's CI, Lower | Estimate's CI, Upper | Rate's CI, Lower | Rate's CI, Upper

Accumulating evidence ^{−5}). Similarly, non-reproductive congenital male malformations accounted for a 31.8% ASD rate increase (CI: [12%, 52%], ^{−5}). In contrast, male congenital malformations of the reproductive system were barely significantly predictive for ID (94%, CI: [1%, 250%], ^{−5}). Another variable significantly affecting both ASD and ID was population-adjusted incidence of viral infections in males (^{−16}, Fisher's exact test).

Male congenital malformations of the reproductive system are subdivided in the ICD9 taxonomy ^{−16}), as were hypospadias and epispadias (

Both ASD and ID showed significant gender-specific incidence effects, with males affected more frequently than females; this was more extreme for ASD (^{−5}) and a 2.7% rate increase for ID (CI: [1.8%, 3.7%], ^{−5}). Other important socioeconomic predictors included the percentage of urban population in a county; a one percent increase in urbanization predicted about a 3% increase in ASD and ID incidence (

By analyzing the spatial incidence patterns of autism and intellectual disability drawn from insurance claims for nearly one third of the total US population, we found strong statistical evidence that environmental factors drive the apparent spatial heterogeneity of both phenotypes, while economic incentives and population structure appear to have relatively large albeit weaker effects. The strongest predictors for autism were associated with the environment: congenital malformations of the reproductive system in males (an increase in ASD incidence by 283% for every per cent of increase in the incidence of malformations), non-reproductive congenital malformations (31.8% ASD rate increase), and viral infections in males (19% ASD rate increase). For ID we observed similar but weaker effects: a 93% increase in ID rate for every per cent of increase in congenital malformations of the reproductive system in males, a 43% increase for every per cent of increase in non-reproductive congenital malformations, and a 23% increase for viral infections in males.
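For readers who want to relate the percentage changes quoted here to regression coefficients, the following is a minimal sketch. It assumes the standard log link of Poisson regression, so that a coefficient β corresponds to a rate ratio exp(β) and a percentage change of 100·(exp(β) − 1) per unit of the predictor. The function names are illustrative, and the coefficients printed are back-calculated from the reported percentages rather than taken from the paper's table.

```python
import math


def coef_to_percent_change(beta: float) -> float:
    """Percentage change in the Poisson rate per unit increase of a predictor."""
    return 100.0 * math.expm1(beta)


def percent_change_to_coef(pct: float) -> float:
    """Inverse mapping: log rate ratio implied by a percentage change."""
    return math.log1p(pct / 100.0)


# Percentage increases for ASD quoted in the text (per one per cent of predictor).
reported_asd = {
    "reproductive malformations (males)": 283.0,
    "non-reproductive malformations": 31.8,
    "viral infections (males)": 19.0,
}

for name, pct in reported_asd.items():
    beta = percent_change_to_coef(pct)
    ratio = 1.0 + pct / 100.0
    print(f"{name}: +{pct}% -> rate ratio {ratio:.3f}, implied beta {beta:.3f}")
```

On this scale a 283% increase is a rate ratio of 3.83, which is why confidence intervals that look strongly asymmetric in percentage terms can be roughly symmetric on the coefficient scale.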

We highlight the role of male congenital genitourinary malformations as an approximate measurement of environmental exposure to unmeasured developmental risk factors, including toxins. Some infants are born with congenital malformations of unknown genetic etiology, that is, malformations not explained by known Mendelian mutations or detectable chromosomal aberrations. At least a fraction of such birth defects may be due to parental exposure to environmental insults. The environmental factors implicated so far include pesticides

It is known that some birth malformations are caused by

The hypospadias of the male urethra can arise during early embryonic development, specifically weeks 9–12 (p. 206 in

Following similar logic, in addition to causing birth defects, environmental toxins, such as pesticides

Importantly, the effect of state-level regulations involving ASD appeared relatively large in magnitude (over 98% ASD and ID rate decrease) but with a wide confidence interval and inconsistent effects across states, resulting in only marginal significance. Furthermore, our estimates of random effects at the state and county levels, see

As with other statistical analyses (see

Study (state)

All live births and diagnostic records for children born in CA between 1992 and 2000. | All live births in CA occurring in 1996–2000. | Administrative educational data for prevalence of autism and other special education categories for the academic years 2000–2001 through 2005–2006. | Record-based surveillance for 8 NC counties biennially from the Autism and Developmental Disabilities Monitoring Network. | Record-based surveillance for eight-year-old children born in 1994 and living in Utah in 2002 from the Utah Registry of Autism and Developmental Disabilities Program.

4,906,926 | 2,453,717 | 4,057,712 | 11,034 | 26,108

18,731 | 9,900 | 7,022 (ASD+ID) | 532 (ASD), 1,028 (ID) | 99 (ASD-only), 33 (ASD and ID), 113 (ID-only)

Multilevel logistic regression. | Spatial clustering and bivariate mixed Poisson regression. | Multilevel Poisson regression. | Generalized additive model. | Multiple single-variable logistic regressions.

Our results have implications for the ongoing scientific quest for the etiology of neurodevelopmental disorders. We provide evidence for routinely expanding the scope of inquiry to include environmental, demographic, and socioeconomic factors, as well as governmental policies, at a broad scale in a unified geospatial framework. It appears that environmental factors should be documented in detail and used in genetic analyses of ASD; failure to do so risks omitting important information about possibly strong confounders.

Our analysis involved de-identified patient data and was approved by the University of Chicago Institutional Review Board.

Our multi-level, mixed-effects model predicted the incidence of ASD and ID conditional on several individual-level, county-level, and state-level covariates. For the regression analysis described below, we used county-level variables to predict disease rates. In the analysis of the comorbidity between congenital malformations and ASD or ID, we used patient-level data.

We used the Truven Health Analytics MarketScan Commercial Claims and Encounters Database to provide geocoded diagnosis counts by gender. This database spans the years 2003 to 2010 and consists of approved commercial health insurance claims for between 17.5 and 45.2 million people annually, with linkage across years, yielding a total of approximately 105 million patient records. (Note that, consistent with the low prevalence of both phenotypes, only a small proportion of individuals described in this enormous dataset were diagnosed with either ASD or ID.) This national database contains information contributed by well over 100 insurance carriers and large self-insuring companies. We scanned approximately 4.6 billion inpatient and outpatient service claims and identified almost 6 billion diagnostic codes. After removing duplicates, almost 1.3 billion diagnostic codes were found to be associated with over 99.1 million individuals, yielding approximately 12.89 unique diagnostic codes per individual. Claims were de-identified (all patient-level personal information was redacted) and geocoded at the county level by Truven, so no additional processing was required on the authors' part.
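The de-duplication step described above (repeated diagnostic codes collapsed to unique codes per individual) can be illustrated with a small sketch. The record layout, the function name, and the toy patient and code labels are all hypothetical. The MarketScan schema is not described in the text, and this is not the authors' processing pipeline.

```python
from collections import defaultdict


def unique_codes_per_person(claims):
    """Collapse repeated (patient, code) pairs and summarize the result.

    `claims` is an iterable of (patient_id, diagnostic_code) pairs; the same
    pair may occur many times across claims. Returns a tuple of
    (unique patient-code pairs, number of patients, mean unique codes per patient).
    """
    codes_by_patient = defaultdict(set)
    for patient_id, code in claims:
        codes_by_patient[patient_id].add(code)  # a set discards duplicates
    if not codes_by_patient:
        return 0, 0, 0.0
    n_unique = sum(len(codes) for codes in codes_by_patient.values())
    n_patients = len(codes_by_patient)
    return n_unique, n_patients, n_unique / n_patients


# Toy input with hypothetical identifiers: six raw records, four unique pairs.
toy_claims = [
    ("p1", "code_A"), ("p1", "code_A"), ("p1", "code_B"),
    ("p2", "code_C"), ("p2", "code_C"),
    ("p3", "code_D"),
]

n_unique, n_patients, mean_codes = unique_codes_per_person(toy_claims)
print(n_unique, n_patients, round(mean_codes, 2))
```

Dividing the number of unique patient-code pairs by the number of patients gives the same kind of per-individual average as the figure quoted in the text.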

The MarketScan insurance claims dataset is not a random sample of the USA population. This is because compilation of this dataset required reaching agreements between Truven and numerous individual insurance providers to share data, and the insurance providers inherently had uneven and non-random coverage of geographic areas. It is possible, therefore, that the dataset carries traces of hidden correlations imposed by the data collection method. Furthermore, while the entire USA is well represented in the data, it is possible that coverage across geographic areas is not perfectly proportional to population density.

We framed our analysis as a mixed-effect regression model for Poisson-distributed count data

The model parameters were estimated using a joint statistical inference as follows. Most of the parameters (44 out of 50) were real-valued coefficients representing regression weights of individual factors, such as average income in county, percentage of ethnic groups, per cent of urban and poor population, see

Fixed effects by design have a stochastic but predictable influence on the data, while random effects describe a zero-centered random influence that is not captured by the fixed effects. Below we present a more formal formulation of the model.
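The distinction between fixed and random effects can be made concrete with a minimal sketch of how both enter a Poisson rate. It assumes a log link with the county population as an exposure term. The function names, covariates, and numerical values are illustrative and are not estimates from this study.

```python
import math


def expected_count(population, x, beta, u_state, v_county):
    """Expected case count in one county under a log-linear Poisson model.

    population : exposure (number of insured individuals), an assumed offset
    x, beta    : covariate values and fixed-effect coefficients
    u_state    : zero-centered state-level random effect
    v_county   : zero-centered county-level random effect
    """
    linear = sum(b * xi for b, xi in zip(beta, x)) + u_state + v_county
    return population * math.exp(linear)


def poisson_log_pmf(y, mu):
    """Log-probability of observing count y under a Poisson mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


# Illustrative values only: intercept plus one covariate.
x = [1.0, 0.5]
beta = [-6.0, 0.4]

baseline = expected_count(10_000, x, beta, u_state=0.0, v_county=0.0)
shifted = expected_count(10_000, x, beta, u_state=0.1, v_county=-0.05)
print(f"fixed effects only: {baseline:.2f} expected cases")
print(f"with random effects: {shifted:.2f} expected cases")
print(f"log-probability of 30 cases: {poisson_log_pmf(30, shifted):.3f}")
```

With both random effects set to zero the expected count depends only on the covariates. Nonzero random effects shift a given state or county multiplicatively above or below that covariate-based prediction.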

The Truven database was augmented with US census data

To account for variation in policies for special education eligibility and reimbursement, we used four variables derived by hand-coding the state policies for eligibility in special education programs under the Individuals with Disabilities Education Act

The assumptions of the Poisson regression model _{ijkl}_{ijkl}_{ijkl}_{k}_{k}_{k}_{k}_{ik}_{ijk}_{ijkl}

The third assumption was that data was hierarchical: the zero-centered random effects were independently introduced at ^{2}_{ik}^{2}_{ijk}_{i}_{ij}

Together these assumptions define the following likelihood for the given _{ijkl}
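Because the formulas in this passage did not survive extraction, the block below gives a generic textbook form of a bivariate, three-level, mixed-effects Poisson model that is consistent with the prose and with the surviving indices. It is a sketch under stated assumptions, not a recovery of the authors' equations. The index meanings (i for state, j for county, k for phenotype, l for gender) and the symbols n, x, β, u, v, Σ_u and Σ_v are all assumed here.

```latex
% Assumed notation: i = state, j = county, k = phenotype (ASD, ID), l = gender.
% y_{ijkl}: observed count; n_{ijl}: population at risk (offset);
% x_{ijl}: covariates; \beta_k: phenotype-specific fixed effects;
% u_{ik}, v_{ijk}: state- and county-level random effects.
\begin{align}
  y_{ijkl} \mid u_{ik}, v_{ijk} &\sim \mathrm{Poisson}(\mu_{ijkl}), \\
  \log \mu_{ijkl} &= \log n_{ijl} + x_{ijl}^{\top}\beta_k + u_{ik} + v_{ijk}, \\
  (u_{i1}, u_{i2})^{\top} &\sim \mathcal{N}(0, \Sigma_u), \qquad
  (v_{ij1}, v_{ij2})^{\top} \sim \mathcal{N}(0, \Sigma_v), \\
  L(\beta, \Sigma_u, \Sigma_v) &= \prod_i \int \Biggl[ \prod_j \int
      \prod_{k,l} \frac{\mu_{ijkl}^{\,y_{ijkl}} e^{-\mu_{ijkl}}}{y_{ijkl}!}\,
      \phi(v_{ij}; 0, \Sigma_v)\, dv_{ij} \Biggr]
      \phi(u_i; 0, \Sigma_u)\, du_i .
\end{align}
```

Here φ denotes a bivariate normal density. In this form each of Σ_u and Σ_v holds two variances and one covariance, which matches the six variance-covariance parameters mentioned in the model description.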

While estimates varied slightly across different implementations of the model and estimation approaches, the major trends were identical across all. Here we present the results of the Markov chain Monte Carlo/Empirical Bayes analysis. The estimation methods and starting parameter values varied considerably across the implementations. For example, SuperMix started with finding an analytical solution of the fixed-effect part of the equations and then estimated parameters for the full model involving random effects. The Markov chain Monte Carlo-based GLLAMM started with a random set of parameter guesses and then rather quickly discovered the high-probability area of parameter values.
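The behavior described above, a sampler that starts from a random guess and quickly reaches the high-probability region, can be illustrated with a toy random-walk Metropolis sampler for a single Poisson log-rate. This is a generic illustration only, not the algorithm used by SuperMix or GLLAMM. The counts, step size, and function names are made up for the example.

```python
import math
import random


def log_posterior(theta, counts):
    """Poisson log-likelihood in theta = log(rate), flat prior, up to a constant."""
    return sum(counts) * theta - len(counts) * math.exp(theta)


def metropolis(counts, n_steps=20_000, step=0.3, seed=0):
    """Random-walk Metropolis sampler started from a random initial guess."""
    rng = random.Random(seed)
    theta = rng.uniform(-3.0, 3.0)  # deliberately uninformed starting value
    current = log_posterior(theta, counts)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, counts)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


counts = [3, 5, 4, 6, 2, 4, 5, 3]  # toy data with sample mean 4
samples = metropolis(counts)
kept = samples[2_000:]  # discard burn-in
mean_rate = sum(math.exp(t) for t in kept) / len(kept)
print(f"start: {samples[0]:.2f}, posterior mean rate: {mean_rate:.2f}")
```

With a flat prior on the log-rate, the posterior mean of the rate in this toy problem equals the sample mean of the counts, so the chain should settle near 4 regardless of where it starts.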

We tested several putative confounding variables, such as county-specific median mother's age at childbirth, and the proportion of county population in the childbearing age. While these variables produced statistically significant fixed-effect coefficients, they did not affect the relationship between the outcome variable and the compound environmental predictor variables as would be expected if these added variables were true confounders.

The authors are grateful to David Blair, Christiann Frazier, Ravinesh Kumar, Diane Lauderdale, Rita Rzhetsky, and Kevin White for numerous comments that significantly improved the manuscript.