
The author has the following competing interests: commercial products under development related to the application of machine learning to law-making prediction and analysis. These products use a different approach than the one described in this paper, but accomplish a similar goal of predicting enactment probabilities for pending bills in the U.S. Congress. This does not alter the author’s adherence to PLOS ONE policies on sharing data and materials. No patents or patent applications exist for the work in this paper.

Out of nearly 70,000 bills introduced in the U.S. Congress from 2001 to 2015, only 2,513 were enacted. We developed a machine learning approach to forecasting the probability that any bill will become law. Starting in 2001 with the 107th Congress, we trained models on data from

The U.S. legislative branch creates laws that impact the lives of hundreds of millions of citizens. For example, the Patient Protection and Affordable Care Act (ACA) significantly affected the health care industry and individuals’ health insurance coverage. Bills often consist of hundreds of pages of dense legal language. In fact, the ACA is more than 900 pages long. There are thousands of bills under consideration at any given time and only about 4% will become law. Furthermore, the number of bills introduced is trending upward (see

Due to the complexity of law-making and the aleatory uncertainty in the underlying social systems, we predict enactment probabilistically. It is important to make

Forecasting model performance should be estimated using multiple metrics on large amounts of test data measured

Although previous research found that bill text was useful for predicting whether bills will survive committee [

Analyzing a model that makes successful ex ante predictions can be more informative than ex post interpretations of socio-political events (outside experiment-like settings) due to the overfitting that plagues most modeling of observational data [

Continuous-space vector representations of words can capture subtle semantics across the dimensions of the vector [

Trees are decision rules that divide predictor variable space into regions by choosing variables and their threshold values on which to make binary splits [

Using the predictions from the inversion of the word vector language model (as described in Section 2.1.1.) as features allows the training process to learn interactions between contextual variables and textual probabilities. Additionally, the sensitivity analysis can then estimate the impact of text predictions on enactment probabilities along with the contextual predictors, controlling for the effect of the probability of the bill text when estimating non-textual effects.

Random forests and GBMs combine weak learners to create a strong learner. Stacking combines strong learners to create a stronger learner. A cross-validation stacking process on the training data is used to learn a combination of the three base models to form a meta-predictor [
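The cross-validation stacking described above can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's exact pipeline: the feature matrix, the specific base learners, and the logistic-regression meta-learner are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data mimicking the roughly 4% enactment rate.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.96],
                           random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
        ("glm", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                          # out-of-fold base predictions train the meta-learner
    stack_method="predict_proba",  # stack on predicted probabilities
)
stack.fit(X, y)
probs = stack.predict_proba(X)[:, 1]  # combined predicted probabilities
```

Training the meta-learner on out-of-fold base predictions (the `cv=5` argument) is what prevents the strong base learners from leaking their training-set fit into the combination weights.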

We use the two most frequently applied binary classification probability scoring functions: the log score and the Brier score (see
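Both scoring rules are simple to compute. A minimal sketch (the `eps` clipping constant is an implementation choice, not from the paper):

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def log_loss(p, y, eps=1e-15):
    """Mean negative log-likelihood; clipping keeps a certain-but-wrong
    forecast from producing an infinite score."""
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    y = np.asarray(y, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

Without clipping, a single forecast of exactly 0 or 1 that turns out wrong drives the mean log loss to infinity, which is how infinite log-loss entries can arise in a results table while the Brier score stays bounded.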

We train language models with word2vec for enacted House bills, failed House bills, enacted Senate bills, and failed Senate bills and then investigate the most similar words within each of these four models to word vector combinations representing topics of interest. That is, for each of the four models, return the list of words w in the vocabulary that maximize cos(v_w, (1/N) Σ_{i=1}^{N} v_{w_i}), where w_i is one of the N topic words w_{1:N}, v_{w_i} is the vector representation of w_i, and v_w is the vector representation of w.
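The most-similar query can be sketched with plain cosine similarity. The vocabulary and random vectors below are illustrative stand-ins for a trained word2vec model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["climate", "emissions", "carbon", "tax", "health", "poverty"]
vectors = rng.normal(size=(len(vocab), 8))  # one row per word

def most_similar(query_words, topn=3):
    """Return (word, cosine similarity) pairs closest to the mean of the
    query-word vectors, excluding the query words themselves."""
    idx = [vocab.index(w) for w in query_words]
    q = vectors[idx].mean(axis=0)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ (q / np.linalg.norm(q))
    order = [i for i in np.argsort(-sims) if i not in idx]
    return [(vocab[i], float(sims[i])) for i in order[:topn]]

top = most_similar(["climate", "emissions"])
```

Running the same query against each of the four trained models (House/Senate × Enacted/Failed) yields four ranked word lists that can be compared side by side.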

We conduct a sensitivity analysis on our model of the legislative system by varying inputs to the model and measuring the effect on the output. If input values are varied one at a time, while keeping the others at “default values,” sensitivities are conditional on the chosen default values [

Next, we expand the factor variables so that each level is represented in the design matrix as a binary indicator variable. This allows us to estimate the effect of each level of a factor, e.g. each of the 39 subject categories. We add interaction terms between the Chamber and bill characteristics, e.g. whether the bill originated in the Senate and the number of characters, to estimate interaction effects that the tree models may have learned automatically. Finally, we estimate the relationship between the resulting matrix of input values and the vector of predicted probability outputs with a partial rank correlation coefficient (PRCC) analysis, which estimates the correlation between an input variable and the predicted probability of bill enactment, discounting the effects of the other inputs and allowing for potentially non-linear relationships by rank-transforming the data before model estimation [
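The PRCC computation can be sketched directly: rank-transform every column, regress the other inputs out of both the target input and the output, and correlate the residuals. The toy data and the tie-free rank transform are simplifying assumptions for illustration:

```python
import numpy as np

def _ranks(a):
    # Simple rank transform (no tie handling; fine for continuous toy data).
    return np.argsort(np.argsort(a)).astype(float)

def prcc(X, y):
    """Partial rank correlation of each column of X with y: rank-transform
    everything, regress the other columns out of both the target column and
    y, then correlate the residuals."""
    Xr = np.column_stack([_ranks(c) for c in X.T])
    yr = _ranks(y)
    n, k = Xr.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(Xr, j, axis=1)])
        bx, *_ = np.linalg.lstsq(others, Xr[:, j], rcond=None)
        by, *_ = np.linalg.lstsq(others, yr, rcond=None)
        out[j] = np.corrcoef(Xr[:, j] - others @ bx, yr - others @ by)[0, 1]
    return out

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = X[:, 0] ** 3 + 0.1 * rng.normal(size=300)  # monotone in X[:, 0] only
sens = prcc(X, y)
```

Because the data are rank-transformed first, the strongly non-linear (but monotone) cubic relationship still yields a PRCC near 1 for the first input, while the irrelevant second input scores near 0.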

We include all House and Senate bills and exclude simple, joint, and concurrent resolutions because simple and concurrent resolutions do not have the force of law and joint resolutions are very rare. We downloaded all bill data (from the 103rd Congress through the 113th Congress) other than committee membership from govtrack.us/developers/data, which is created by scraping THOMAS.gov. We downloaded committee membership data from web.mit.edu/17.251/www/data_page.html [

There is often more than one version of the full text for each bill. In order to create a forecasting problem that predicts enactment as soon as possible, the earliest dated full text is used, which is, for more than 99% of the bills in the testing data, the text as it was introduced. To understand how much predictive power newer versions add, we collect the most recent version of each bill, which is, for 87% of the bills in the testing data, the version as introduced. Bills can change dramatically between the time of their introduction and the time of the last action taken on them. H.R. 3590 in the 111th Congress was a short bill on housing tax changes for service members when it was introduced, and shortly before it was enacted, it was the 906-page Affordable Care Act. H.R. 34 in the 114th Congress was originally introduced as the Tsunami Warning, Education, and Research Act and was about 30 pages long. Shortly before it was enacted, H.R. 34 was the 312-page 21st Century Cures Act.

The full text of all introduced bills is only available starting with the 103rd Congress (1993–1995) and therefore this is the first Congress used to train language models. The 104th Congress is the first used to train the base models of the ensemble because they require the language model predictions and the language models need the 103rd for training. The 107th Congress (2001–2003) is the first to serve as a testing Congress because the full model needs multiple Congresses worth of data for training. We used the list of predictor variables from [

The following variables capture characteristics of a bill’s sponsor and committee(s):

The following variables capture political and temporal context of bills:

The following variables capture aspects of bill content and characteristics:

Five models are compared across the two time conditions.

Using only text outperforms using only context on two of three performance measures (AUC and Brier) for the newest data, while using only context outperforms using only text on all three measures for the oldest data (

Dashed lines separate newest and oldest data within each measure. Because

Lower mean Brier score (MeanBrier) and mean log loss (MeanLogLoss) are better; higher AUC is better.

Model | AUC | MeanBrier | MeanLogLoss |
---|---|---|---|
w2vGLM | 0.96 | 0.021 | 0.083 |
w2v | 0.93 | 0.027 | 0.127 |
GLM | 0.87 | 0.028 | 0.118 |
w2vTitle | 0.81 | 0.049 | Inf |
Null | 0.58 | 0.035 | 0.157 |
w2vGLMOld | 0.85 | 0.029 | 0.122 |
w2vOld | 0.76 | 0.035 | 0.154 |
GLMOld | 0.83 | 0.031 | 0.131 |
w2vTitleOld | 0.8 | 0.047 | Inf |

Predicted probabilities of

The boxes are the inter-quartile ranges (IQRs) of the predicted probabilities, the bold line is the median, and the whiskers extend from the ends of the IQR to ±1.5 ×

We conduct an error analysis (see

Probabilities increased between old and new forecasts for the two enacted bills, and the mean of the probabilities for the failed bills decreased.

ShortTitle | ForecastNew | ForecastOld | BaselineForecast |
---|---|---|---|
ACA | 0.6 | 0.23 | 0.05 |
Failed Amend Repeal | 0.02 | 0.03 | 0.05 |
ARRA | 0.55 | 0.52 | 0.05 |

Now that we have a model validated on thousands of predictions, we analyze it to better understand law-making. With our language models, we create “synthetic summaries” of hypothetical bills by providing a set of words that capture any topic of interest. Comparing these synthetic summaries across chamber and across Enacted and Failed categories uncovers textual patterns of how bill content is associated with enactment. The title summaries are derived from investigating similarities within

To demonstrate the power of our approach, we investigated the words that best summarize “climate change emissions”, “health insurance poverty”, and “technology patent” topics for Enacted and Failed bills in both the House and Senate (

Our language model provides sentence-level predictions for an overall bill and thus predicts what sections of a bill may be the most important for increasing or decreasing the probability of enactment.

For each bill, we convert the variable-length vectors of predicted sentence probabilities to
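One common way to collapse a variable-length vector of per-sentence probabilities into a fixed-length feature vector is summary statistics. The particular statistics below (mean, extremes, quartiles) are an illustrative assumption, not necessarily the paper's exact choice:

```python
import numpy as np

def sentence_features(sentence_probs):
    """Collapse per-sentence predicted probabilities into a fixed-length
    feature vector: mean, min, max, and the three quartiles."""
    p = np.asarray(sentence_probs, float)
    return np.array([p.mean(), p.min(), p.max(),
                     *np.percentile(p, [25, 50, 75])])

short_bill = sentence_features([0.1, 0.4, 0.2])        # 3 sentences
long_bill = sentence_features(np.linspace(0.0, 1.0, 500))  # 500 sentences
```

Bills of any length map to the same six features, so downstream models that require fixed-width inputs can consume them directly.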

We conducted a partial rank correlation coefficient sensitivity analysis to estimate the effect of each predictor variable on the predicted probability of enactment. These are not bivariate correlations between variables and the predicted probabilities, rather, they are estimates of correlation

Bars represent 95% confidence intervals.

The two subjects with the largest negative effects are Foreign Trade and International Finance, and Taxation (

If the bill sponsor’s party is the majority party of their chamber, the predicted probability of enactment is much higher, especially with the oldest data, where the model relies on this as a key signal of success. Increasing the number of terms the sponsor has served in Congress also has a positive effect. The predictive model learned interactions as well: the number of co-sponsors has a stronger positive effect in the Senate for the newest data and in the House for the oldest data. If the bill text scored by the language model is from the second session of the Congress, for the newest-data model this can serve as a signal that a bill is being updated, and thus it has a higher chance of enactment. For the oldest data, this means the bill was introduced in the second session, which is not particularly indicative of success or failure.

We compared five models across three performance measures and two data conditions over 14 years. A model using only bill text outperforms a model using only bill context for the newest data, while context-only outperforms text-only for the oldest data. In all conditions, text consistently adds predictive power.

In addition to accurate predictions, we are able to improve our understanding of bill content by using a text model designed to explore differences across chamber and enactment status for important topics. Our textual analysis serves as an exploratory tool for investigating subtle distinctions across categories that were previously impossible to investigate at this scale. The same analysis can be applied to any words in the large legislative vocabulary. The global sensitivity analysis of the full model provides insights into the variables affecting predicted probabilities of enactment. For instance, when predicting bills as they are first introduced, the text of the bill and the proportion of the chamber in the bill sponsor’s party have similarly strong positive effects. The full text of the bill is by far the most important predictor when using the most up-to-date data. The oldest data model relies more on title predictions than the newest data model, which is understandable given that titles rarely change after bill introduction. Comparing effects across time conditions and across models not including text suggests that controlling for accurate estimates of the text probability is important for estimating the effects of non-textual variables.

Although the effect estimates are not causal and estimates on predictors correlated with each other may be biased, they represent our best estimates of predictive relationships within a model with the strongest predictive performance and are thus useful for understanding the process of law-making. This methodology can be applied to analyze any predictive model by treating it as a “black-box” data-generating process, therefore predictive power of a model can be optimized and subsequent analysis can uncover interpretable global relationships between predictors and output. Our work provides guidance on effectively combining text and context for prediction
