tql.modeling.h2o package
Submodules
tql.modeling.h2o.estimator module
- class tql.modeling.h2o.estimator.BootstrapConf(bootstraps: int = 20, seed: str = 'default_seed')
Bases:
object
A simple configuration object for estimating an ensemble of Bayesian bootstrap variants of a model.
- Parameters
bootstraps (int) – The number of bootstraps to train. Default=20
seed (str) – The seed to use when hashing the cluster_var to compute weight deviations. Default=‘default_seed’
- get_bootstraps()
- get_seed()
- class tql.modeling.h2o.estimator.H2OEstimate(model: h2o.model.model_base.ModelBase, train_metrics: dict, test_metrics: dict, train_test_metrics: dict)
Bases:
object
A simple wrapper object for a trained H2O model and a dictionary of model metrics, used to construct an H2OPublishedModel.
- Parameters
model (ModelBase) – A trained H2O model.
metrics (dict) – A dictionary of model metrics computed from the test frame.
- get_coef()
- get_h2o_params()
- get_model()
- get_test_metrics()
- get_train_metrics()
- get_train_test_metrics()
- class tql.modeling.h2o.estimator.H2OEstimator(response: str, resultset: zeenk.tql.resultset.ResultSet, model_cls: str, train_partition_name: str = 'train', test_partition_name: str = 'test')
Bases:
object
- This class trains H2O models. An H2OEstimator is created by specifying:
A response type (logistic or linear)
A TQL ResultSet with at least two partitions, by default assumed to be named ‘train’ and ‘test’
A model class (currently GBM and GLM are supported)
During creation, the H2OEstimator parses the TQL ResultSet into H2OFrames and stores them internally. The H2OEstimator then exposes a single operational method, .train(…), which estimates a model from the training data and produces an instance of H2OPublishedModel. This class supports hyper-parameter tuning and Bayesian bootstrapping through optional configuration passed to .train(…).
Example Usage:
>>> from zeenk.tql import *
>>> from zeenk.tql.modeling.h2o import H2OEstimator
>>> resultset = select(label(...), col(...), col(...), ...).from_events(...)
>>> trainer = H2OEstimator('linear', resultset, 'glm')
>>> model = trainer.train()
>>> model.publish('my_model')
Create an instance of H2OEstimator. The ResultSet is expected to have at least two data partitions, by default assumed to be named ‘train’ and ‘test’, although this behavior can be customized with kwargs. The label column in your data will be cast to the correct H2O type given the response type: logistic=enum and linear=numeric. At this time, two model classes are supported: ‘gbm’, which corresponds to H2O’s H2OGradientBoostingEstimator, and ‘glm’, which corresponds to H2O’s H2OGeneralizedLinearEstimator. When this constructor is invoked, the TQL ResultSet partitions will be converted to H2OFrames and attached to this trainer. Those H2OFrames will be available for inspection or out-of-band reuse via .get_train_frame() and .get_test_frame().
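The label cast described above can be pictured as a small lookup. This is a minimal sketch, not the package's implementation: the helper name `label_h2o_type` is hypothetical, and it assumes that the aliases listed in the class attributes `CONTINUOUS_RESPONSES` and `DISCRETE_RESPONSES` are treated the same way as 'linear' and 'logistic'.

```python
# Tuples copied from the class attributes documented for H2OEstimator.
CONTINUOUS_RESPONSES = ('continuous', 'value', 'linear')
DISCRETE_RESPONSES = ('binary', 'event', 'discrete', 'logistic')


def label_h2o_type(response):
    """Hypothetical helper: map a response type to the H2O label column type."""
    if response in DISCRETE_RESPONSES:
        return 'enum'      # logistic-style responses become categorical labels
    if response in CONTINUOUS_RESPONSES:
        return 'numeric'   # linear-style responses stay numeric
    raise ValueError('unsupported response type: %r' % (response,))
```

Raising on an unknown response keeps a misspelled response type from silently falling back to one of the two casts.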
- Parameters
response (str) – The type of response. Valid options are ‘logistic’ and ‘linear’.
resultset (ResultSet) – A TQL ResultSet to create H2OFrames from for training and testing.
model_cls (str) – The type of model to use during training. Valid options are ‘glm’ and ‘gbm’.
is_causal (bool) – Indicates whether this is a causal model. Affects model metrics and publishing.
train_partition_name (str) – Customize which partition in the TQL ResultSet will be used for training.
test_partition_name (str) – Customize which partition in the TQL ResultSet will be used for model metrics.
- CONTINUOUS_RESPONSES = ('continuous', 'value', 'linear')
- DISCRETE_RESPONSES = ('binary', 'event', 'discrete', 'logistic')
- MODEL_CLASSES = ('glm', 'gbm')
- get_features()
- get_label()
- get_model_class()
- get_response()
- get_resultset()
- get_tag()
- get_test_frame()
- get_train_frame()
- get_weight()
- is_continuous()
- train(tuning_conf: Optional[tql.modeling.h2o.estimator.TuningConf] = None, bootstrap_conf: Optional[tql.modeling.h2o.estimator.BootstrapConf] = None, h2o_params: Optional[dict] = None) tql.modeling.h2o.estimator.H2OPublishedModel
Trains the model from the imported H2OFrames, producing an instance of H2OPublishedModel with the given type, which can be published if desired. Optionally, hyper-parameter tuning and/or bootstrap model estimation can also be configured. A model type must be provided at time of training, which will be attached to the PublishedModel. Training and publishing a model will allow for PREDICT() expressions to be used in subsequent TQL queries.
If tuning_conf is provided, hyperparameter tuning will be run using the bayes_opt package, and an updated set of h2o parameters will be used when training the final model, including bootstraps.
If bootstrap_conf is provided, an ensemble of Bayesian Bootstrap models will also be trained, using perturbed weights. Bootstrap models will be published along with the main model, and are available using the PREDICT_ENSEMBLE(model_type) and CAUSAL_EFFECT_ENSEMBLE(model_type) TQL functions.
If h2o_params are provided, they will override the default values set by the system. Note that invoking Hyperparameter tuning will supersede user-provided h2o params.
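The parameter precedence described above can be pictured as a dictionary merge. This is a minimal sketch under stated assumptions: `resolve_params` is a hypothetical helper, the default values are made up for illustration, and it assumes that tuned values replace user-provided values key by key rather than discarding the user dictionary wholesale.

```python
def resolve_params(defaults, h2o_params=None, tuned_params=None):
    """Hypothetical helper: later sources win over earlier ones."""
    resolved = dict(defaults)            # system defaults (illustrative values)
    resolved.update(h2o_params or {})    # user-provided h2o_params override defaults
    resolved.update(tuned_params or {})  # tuned values supersede user-provided ones
    return resolved


# Illustrative values only; the real defaults are not documented here.
params = resolve_params(
    defaults={'alpha': 0.5, 'lambda_': 0.0},
    h2o_params={'alpha': 0.1},
    tuned_params={'alpha': 0.3},
)
```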
- Parameters
tuning_conf (TuningConf) – Optionally provide a hyper-parameter tuning conf.
bootstrap_conf (BootstrapConf) – Optionally provide a bootstrap conf.
h2o_params (dict) – A dictionary of user provided H2O params
- Returns
An H2OPublishedModel
- Return type
H2OPublishedModel
- class tql.modeling.h2o.estimator.H2OModelSummary(model, coef, metrics, has_bootstraps)
Bases:
object
- METRIC_TYPES = ('train_test', 'train', 'test')
- OVERVIEW_METRICS = ('positives', 'label_sum', 'pred_sum', 'realized', 'incr_rate')
- coefficients()
- get_coef_dict()
- get_model_metrics_dict()
- model()
- model_metrics()
- scoring_overview()
- class tql.modeling.h2o.estimator.H2OPublishedModel(resultset: zeenk.tql.resultset.ResultSet, estimates: list, model_type: Optional[str] = None)
Bases:
zeenk.tql.modeling.publishedmodel.PublishedModel
An H2O-specific subclass of zeenk.tql.modeling.publishedmodel.PublishedModel. Once this model is published, it can be used in TQL queries to make predictions using the PREDICT(model_type) function. Additionally, if the is_causal_model flag is set to true, a CausalEffectConf will be extracted from the ResultSet and attached to the model so that it can be used in CAUSAL_EFFECT(model_type, effect_type) expressions.
- resultset (ResultSet): The TQL ResultSet that was used to train this model. The ResultSet is required to attach the column expressions and causal model metadata to the published model.
- estimates (list): A list of one or more H2OEstimate objects to attach to the PublishedModel. Typically, a model will have multiple estimates if bootstrapping was configured during training, in which case there will exist a model estimate for each bootstrap. To access the additional predictions using TQL, use the PREDICT_ENSEMBLE(type) and CAUSAL_EFFECT_ENSEMBLE(type) functions.
- model_type (str): A string with charset [A-Za-z0-9_] representing a unique name for this model. If a model with the same name exists in this project, it will be overwritten by this one when it is published. This type can also be provided to publish().
Create a new model estimate based on an artifact generated in model training. If this estimate is published and enabled, it can be used in predict() statements (and causal_effect() statements, when configured) during dataset generation, either to score a set of data or to produce features in datasets used to train other models.
- Parameters
project_id – the project_id for the model
columns – the list of Features used to generate the training dataset
artifact – a TrainingArtifact pointing to the output from model estimation. Supported: H2O mojos, VW output
additional_artifacts – Additional TrainingArtifacts produced with the same columns. Used for bootstrap metrics.
model_type – the “type” of model, used as the argument to predict() in DSL. It does not have to be provided here; it is required at publish time and can also be provided in publish().
event_variables – precomputed event variables that were defined in the original query
timeline_variables – precomputed timeline variables that were defined in the original query
global_variables – precomputed global variables that were defined in the original query
Other properties:
.constant_enabled, .enabled_prefixes, .disabled_prefixes: control the use of features and the constant term at prediction time. Default settings: enable all features and constant. These may be manipulated directly, but favor the setter for the _prefixes settings.
.causal_effects_config: access directly to configure these settings
- get_h2o_estimates()
- summarize(conf=0.95)
- class tql.modeling.h2o.estimator.TuningConf(metric: str = 'r2', iterations: int = 20, init_pts: int = 5, random_state: int = 7, bounds: Optional[dict] = None)
Bases:
object
A simple configuration object for searching for the best set of H2O parameters to use when training a model. This configuration is provided to H2OEstimator.train(tuning_conf=TuningConf(…))
- Parameters
metric (str) – The model metric to maximize. Default is r2.
iterations (int) – Passed to the BayesianOptimization maximizer. Default=20
init_pts (int) – Passed to the BayesianOptimization maximizer. Default=5
random_state (int) – Passed to the BayesianOptimization maximizer. Default=7
bounds (dict) – Provide search parameters and bounds. If not given, sensible defaults will be chosen based on the algorithm (gbm, glm, etc).
- get_bounds()
- get_init_pts()
- get_iterations()
- get_metric()
- get_random_state()
tql.modeling.h2o.utilities module
- tql.modeling.h2o.utilities.cluster_multiplier(cluster_var, seed='salt', distribution='exponential', stratified=None)
Bayesian Bootstrap: Near-exact approximation of the nonparametric bootstrap through a computationally efficient sampling strategy. Using a hashable ‘cluster_var’ approximates block-bootstrap sampling. These two bootstrap strategies approximate heteroskedasticity- and cluster-“robust” standard errors, respectively.
Stratified Bayesian Bootstrap (For Small Samples): Enforce asymptotic balance across Bayesian bootstrap sampling weights within each record (or ‘cluster_var’ group).
1. Use the hash of ‘cluster_var’ to permute a list of 1:B integers.
2. For the b^th bootstrap, use the random index c = c(b) to define the quantile stratum(b) = [(c-1)/B, c/B).
3. Transform the hash of ‘cluster_var’ into the stratum interval: p ~ uniform(stratum(b)).
4. Draw the iCDF(p, ‘exponential’).
This ensures that each observation gets a quasi-random, representative, and stratified distribution of Bayesian bootstrap weights across the set of bootstraps, to better approximate the result that would be obtained from a larger sample of bootstraps.
- Parameters
cluster_var (list) – list of strings to be hashed
seed (str) – seed for random hashing
distribution (str) – Exponential or Poisson sampling distribution for records.
stratified (tuple) – (b,B) where B is the number of bootstraps, and b is the current bootstrap index.
- Returns
Array of random exponential draws from (0,infinity) deterministically generated from cluster_var, given the seed.
Examples:
cluster_var = [str(i) for i in range(0,6)]
seed = 'bootstrap'
cluster_multiplier(cluster_var, seed=seed, distribution='exponential')
B = 10
for b in range(1,B+1):
    print(cluster_multiplier(cluster_var, seed=seed + str(b), distribution='exponential', stratified=(b,B,'strat_seed')))
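The hashing and stratification steps described for cluster_multiplier can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: the names `hash_uniform` and `sketch_multiplier`, the use of SHA-256, and the `strat_seed` argument are all hypothetical, and only the exponential distribution is covered.

```python
import hashlib
import math
import random


def hash_uniform(key, seed):
    """Deterministically map (seed, key) to a float strictly inside (0, 1)."""
    digest = hashlib.sha256(f"{seed}|{key}".encode("utf-8")).digest()
    n = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits so n + 0.5 is exact
    return (n + 0.5) / 2 ** 52


def sketch_multiplier(cluster_var, seed="salt", stratified=None, strat_seed="strat"):
    """Hypothetical sketch: one exponential bootstrap weight per cluster key."""
    weights = []
    for key in cluster_var:
        p = hash_uniform(key, seed)
        if stratified is not None:
            b, B = stratified
            # Step 1: permute 1..B using a hash that depends only on the key,
            # so each key visits every stratum exactly once across b = 1..B.
            digest = hashlib.sha256(f"{strat_seed}|{key}".encode("utf-8")).digest()
            perm = list(range(1, B + 1))
            random.Random(int.from_bytes(digest[:8], "big")).shuffle(perm)
            c = perm[b - 1]  # Step 2: stratum index for this bootstrap
            # Step 3: move p into [(c-1)/B, c/B), guarding against p == 1.0.
            p = min((c - 1 + p) / B, 1.0 - 2.0 ** -53)
        # Step 4: inverse CDF of the unit exponential distribution.
        weights.append(-math.log1p(-p))
    return weights
```

Because every value comes from a hash, the same `cluster_var` and `seed` always yield the same weights, which is what lets all records sharing a cluster key receive the same multiplier.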
- tql.modeling.h2o.utilities.compute_column_ratios(df: h2o.frame.H2OFrame, pf: h2o.frame.H2OFrame)
Computes the ratio between the scale of the columns in df and pf
- tql.modeling.h2o.utilities.compute_model_metrics(df, col_types=None, label=None, weight=None, metric=None)
Compute and return the following metrics for a trained weighted regression. This function is typically used to compute metrics on the test sample but can also be used on the training sample to assess goodness-of-fit.
Metrics:
r2: R-squared (R^2) of the model where each error is weighted by its model weight.
r2_hom: (Computed, but not returned) Weighted r-squared of the model when the incrementality model features are zeroed out, except the average treatment effect (ATE) feature(s).
r2_base: (Computed, but not returned) Weighted r-squared of the model when the incrementality model features are zeroed out.
r2_incr: Incremental r-squared is the difference between the weighted r2 and r2_base. Intuitively, this is a measure of the predictive contribution of the incrementality features.
r2_het: Heterogeneous incremental r-squared is the difference between the weighted r2 and r2_hom. Intuitively, this is a measure of the predictive contribution of the heterogeneous incrementality features.
auc: Area-Under-the-Curve (AUC) is a measure of model predictive accuracy. This is a weighted version of the metric that redefines a positive as 'label'!=0.
aucc: Area-Under-the-Curve: Continuous (AUCc) is a generalization of AUC. Positive/Negative is defined by > and < the mean outcome. Each label is weighted by its deviation from the mean. This weight is multiplicative with any other sampling or frequency weighting. This metric can be interpreted as generalizing the True Positive Rate (TPR) and False Positive Rate (FPR) with “the share of positive (negative) labels’ deviation from the mean.” Intuitively, AUCc captures the idea that continuous-valued labels should be rank-ordered correctly by a prediction, even if the predictions are not calibrated (unbiased and appropriately scaled), perhaps due to regularization, nonlinearities, or imperfect numerical convergence of an estimator. AUC is the k=0 (L0-norm) version of AUCc-k, where label > mean(label) is the definition of a positive and weight = w * (label - mean(label))^k, where w is a sampling weight and k modulates the penalty for deviations. Here, we use k=1 for AUCc. AUCc can be interpreted as a summarization of the deviations in a quantile-quantile plot.
auc_base/aucc_base: Weighted AUC (AUCc) metric of the model when the incrementality model features are zeroed out.
auc_incr/auc_het/auc_hom: See r2 definitions for their auc analog.
aucc_incr/aucc_het/aucc_hom: See r2 definitions for their aucc analog.
label_avg: Weighted average value of the outcome/label.
pred_avg: Weighted average value of the prediction of the outcome/label.
avg_error: Average error of the prediction = label - prediction.
pred_sum/label_sum: Weighted sum of the prediction of the outcome/label. Should approximate the sum of the value of the positives.
conv_rate: Conversion rate: instantaneous conversion prediction. Approximates the total number of conversions based on the negative samples. It should be close to pred_sum/label_sum, though not identical, since importance sampling is a form of numerical integration, which introduces a small amount of statistical error. Increase the num_samples in importance sampling to reduce this error.
incr_rate_avg/incr_rate_var: Average and variance of the individual incr_rate predictions. See incr_rate.
incr_rate: Weighted incrementality rate: instantaneous conversion prediction minus prediction with incrementality features zeroed out. Approximates the total number of incremental conversions based on the negative samples, similar to expected and realized causal effects metrics which are computed at the impression (treatment) and conversion (outcome) level. See also conv_rate and Causmos.effects().
realized/realized_raw: Realized causal effect: The sum of the weighted labels’ “incrementality share,” the ratio of the instantaneous calculation of incr_rate / conv_rate evaluated at the outcome’s timestamp. Estimates the total causal impact of the treatments on the observed outcomes during the sample window, analogous to incr_rate and expected - residual causal effect estimates. See also Causmos.effects(). Clipped/truncated to incrementality share between [-1,1]. realized_raw is unclipped.
het_rate: The sum of the demeaned incr_rate predictions. Should be close to zero.
het_rate_abs/het_rate_sq: Sum of the absolute/squared difference or variance of the demeaned difference in predictions between the full heterogeneous (HTE) and baseline predictions. Demeaning removes the average treatment effect (ATE).
samples: Number of records.
weight: Sum of the record weights.
positives: Number of records with nonzero outcome/label.
negatives: Number of records with zero outcome/label.
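To make the r2, r2_incr, and aucc definitions above concrete, here is a minimal pure-Python sketch. It is not the package's implementation: the function names are hypothetical, the pairwise AUC formula is a straightforward O(n^2) version chosen for readability, and taking the absolute value of the deviation in the AUCc weight is an assumption (the text gives the weight as w * (label - mean(label))^k without stating how negative deviations are handled). Labels equal to the mean are grouped with the negatives, where they carry zero weight for k=1.

```python
def weighted_mean(values, weight):
    return sum(w * v for w, v in zip(weight, values)) / sum(weight)


def weighted_r2(label, pred, weight):
    """R-squared where each squared error is weighted by its record weight."""
    mean = weighted_mean(label, weight)
    ss_res = sum(w * (y - p) ** 2 for y, p, w in zip(label, pred, weight))
    ss_tot = sum(w * (y - mean) ** 2 for y, w in zip(label, weight))
    return 1.0 - ss_res / ss_tot


def weighted_aucc(label, pred, weight, k=1):
    """AUCc-k: positives are labels above the weighted mean label."""
    mean = weighted_mean(label, weight)
    pos, neg = [], []
    for y, p, w in zip(label, pred, weight):
        dev_weight = w * abs(y - mean) ** k  # assumption: absolute deviation
        (pos if y > mean else neg).append((p, dev_weight))
    num = 0.0
    for p_pos, w_pos in pos:
        for p_neg, w_neg in neg:
            if p_pos > p_neg:
                num += w_pos * w_neg        # correctly ordered pair
            elif p_pos == p_neg:
                num += 0.5 * w_pos * w_neg  # ties count as half
    den = sum(w for _, w in pos) * sum(w for _, w in neg)
    return num / den


# Illustrative data only.
label = [0.0, 0.0, 1.0, 3.0]
weight = [1.0, 1.0, 1.0, 1.0]
pred = [0.1, 0.2, 0.6, 0.9]        # stands in for the 'prediction' column
pred_base = [1.0, 1.0, 1.0, 1.0]   # stands in for the 'prediction_base' column

r2_incr = weighted_r2(label, pred, weight) - weighted_r2(label, pred_base, weight)
aucc = weighted_aucc(label, pred, weight, k=1)
```

Setting k=0 drops the deviation weighting, which recovers a plain weighted AUC with positives defined relative to the mean.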
- Parameters
df – Data frame output from Causmos.predict_with_h2o() with a label, weight, and prediction columns. Prediction columns must be named ‘prediction’ (full heterogeneous model), ‘prediction_base’ (baseline), and ‘prediction_hom’ (homogeneous).
col_types (dict) – (Optional) Output from Causmos.get_col_type_h2o() can be used to retrieve values for label and weight.
label (str) – Name of ‘LABEL’ column in df.
weight (str) – Name of ‘WEIGHT’ column in df.
metric (str) – Name of metric to selectively compute. The default of None computes all metrics.
- Returns
Dictionary of computed model metrics.
- Return type
dict
- tql.modeling.h2o.utilities.create_mojo(h2o_model)
- tql.modeling.h2o.utilities.get_col_type_h2o(col_type) str
Given a TQL column type, return the corresponding H2O type, or None if no mapping exists
- tql.modeling.h2o.utilities.get_col_types_h2o(columns) dict
Take a ResultSet’s output dataframe column names and try to append an H2O column type to improve importing of sparse data (e.g., nulls/zeros). See help(h2o.import_file) for more details.
- Parameters
columns (list) – A list of Dataset output columns (expanded or not).
- Returns
A dictionary of h2o.import_file compatible types, e.g., {'colname':'numeric'}.
- Return type
dict
- tql.modeling.h2o.utilities.partition_to_h2o(partition: zeenk.tql.resultset.Partition)
Converts a TQL ResultSet Partition to an H2OFrame
- tql.modeling.h2o.utilities.penalty_factor_h2o(df, label, weight, col_types, standardize=True, col_stdev=None, label_stdev=False, penalty_factor=None, default_penalty=1, invert=False)
Apply (or invert) a custom per-feature penalty factor. This is implemented by setting standardize=False when training an H2O GLM and then, rather than dividing the label and each feature column by its standard deviation, dividing by a custom penalty_factor. Typically, this penalty factor will be a multiple of the standard deviation of the feature. This implements a special case of Tikhonov regularization.
Note: Only computes standard deviation and applies transformations to
- TODO: This does not apply to categorical columns. Consider how to use
penalty_factor to accommodate categorical columns by setting the default_penalty in a way that seeks to accommodate categorical feature sparsity.
- Parameters
df – H2O dataframe containing model_cols.
model_cols (dict) – A dictionary denoting the columns of df corresponding to the ‘label’, ‘weight’, and a list of ‘features’ for the model.
col_types (dict) – Output of Causmos.get_col_types_h2o() for the training dataset source.
standardize (boolean) – Multiply the penalty_factor by the standard deviation of each column with nonzero standard deviation. Bypass recomputing this by using col_stdev.
col_stdev (dict) – A dictionary of column names with their corresponding standard deviations. If not provided for any required column, will be computed and returned. Required if invert==True.
label_stdev (boolean) – Compute and transform the label as well, if it is numeric. This may help with numerical stability.
penalty_factor (dict) – A dictionary of column names with the corresponding custom relative penalizations. For example, {'x1':0.01} would weaken the penalization on feature x1 by a factor of 100 while {'x1':10} would strengthen its contribution by a factor of 10, leading it to dominate penalization calculations.
default_penalty (float) – The value of penalty_factor for columns not in penalty_factor.
invert (boolean) – Rather than dividing each column by the penalty_factor and standard deviation, instead multiply (divide by the reciprocal). This should revert the data frame to its original state.
- Returns
df: A copy of the input H2OFrame with its columns transformed according to the inputs.
col_stdev: See the description of col_stdev in the function arguments.
divisor: The divisor applied to each column.
- Return type
(df, col_stdev, divisor)
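The scaling and its inverse can be sketched without H2O. This is a minimal illustration under stated assumptions, not the package's implementation: `sketch_penalty_factor` is a hypothetical name, columns are plain lists in a dict rather than an H2OFrame, standardize=True is assumed, and the population standard deviation is used because the source does not say which estimator is applied.

```python
import statistics


def sketch_penalty_factor(columns, penalty_factor=None, default_penalty=1.0,
                          col_stdev=None, invert=False):
    """Hypothetical sketch: divide (or multiply back) each column by penalty * stdev."""
    if invert and col_stdev is None:
        raise ValueError("col_stdev is required when invert=True")
    penalty_factor = penalty_factor or {}
    col_stdev = dict(col_stdev or {})
    out, divisor = {}, {}
    for name, values in columns.items():
        if name not in col_stdev:
            # Assumption: population standard deviation of the raw column.
            col_stdev[name] = statistics.pstdev(values)
        # Columns with zero standard deviation are only scaled by the penalty.
        sd = col_stdev[name] if col_stdev[name] > 0 else 1.0
        divisor[name] = penalty_factor.get(name, default_penalty) * sd
        if invert:
            out[name] = [v * divisor[name] for v in values]
        else:
            out[name] = [v / divisor[name] for v in values]
    return out, col_stdev, divisor


# Illustrative data only: x1 is penalized 100 times less than the default.
raw = {'x1': [1.0, 2.0, 3.0, 4.0], 'x2': [10.0, 10.0, 10.0, 10.0]}
scaled, stdev, divisor = sketch_penalty_factor(raw, penalty_factor={'x1': 0.01})
restored, _, _ = sketch_penalty_factor(scaled, penalty_factor={'x1': 0.01},
                                       col_stdev=stdev, invert=True)
```

Passing the returned `col_stdev` back in for the inverse call matters: recomputing it from the already scaled columns would give different divisors and the round trip would not recover the original data.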
- tql.modeling.h2o.utilities.predict(model, frame, label, weight, features, cal_df=None, is_causal=False)