tql.modeling.h2o package

Submodules

tql.modeling.h2o.estimator module

class tql.modeling.h2o.estimator.BootstrapConf(bootstraps: int = 20, seed: str = 'default_seed')

Bases: object

A simple configuration object for estimating an ensemble of Bayesian Bootstrap variants of a model.

Parameters

bootstraps (int) – The number of bootstraps to train. Default=20
seed (str) – The seed to use when hashing the cluster_var to compute weight deviations. Default=’default_seed’

get_bootstraps()

get_seed()

class tql.modeling.h2o.estimator.H2OEstimate(model: h2o.model.model_base.ModelBase, train_metrics: dict, test_metrics: dict, train_test_metrics: dict)

Bases: object

A simple wrapper object for a trained H2OModel and a dictionary of model metrics with which to construct an H2OPublishedModel.

Parameters

model (ModelBase) – A trained H2O Model.
metrics (dict) – A dictionary of model metrics computed from the test frame.

get_coef()

get_h2o_params()

get_model()

get_test_metrics()

get_train_metrics()

get_train_test_metrics()

class tql.modeling.h2o.estimator.H2OEstimator(response: str, resultset: zeenk.tql.resultset.ResultSet, model_cls: str, train_partition_name: str = 'train', test_partition_name: str = 'test')

Bases: object

This class trains H2O models. An H2OEstimator is created by specifying:

Response type (logistic or linear)
TQL ResultSet with at least two partitions, by default assumed to be named ‘train’ and ‘test’
Model class (currently GBM and GLM are supported)

During creation, the H2OEstimator parses the TQL ResultSet into H2OFrames and stores them internally. The H2OEstimator then exposes a single operational method, .train(…), which estimates a model from the training data and produces an instance of H2OPublishedModel. This class supports hyper-parameter tuning and Bayesian bootstrapping through optional configuration passed to .train(…).

Example Usage:

>>> from zeenk.tql import *
>>> from zeenk.tql.modeling.h2o import H2OEstimator
>>> resultset = select(label(...), col(...), col(...), ...).from_events(...)
>>> trainer = H2OEstimator('linear', resultset, 'glm')
>>> model = trainer.train()
>>> model.publish('my_model')

Create an instance of H2OEstimator. The ResultSet is expected to have at least two data partitions, by default assumed to be named ‘train’ and ‘test’, although this behavior can be customized with kwargs. The label column in your data will be cast to the correct H2O type for the response type (logistic=enum, linear=numeric). At this time, two model classes are supported: ‘gbm’, which corresponds to H2O’s GradientBoostingEstimator, and ‘glm’, which corresponds to H2O’s GeneralizedLinearModel. When this constructor is invoked, the TQL ResultSet partitions are converted to H2OFrames and attached to this trainer. Those H2OFrames are available for inspection or out-of-band reuse via .get_train_frame() and .get_test_frame().

Parameters

response (str) – The type of response. Valid options are ‘logistic’ and ‘linear’.
resultset (ResultSet) – A TQL ResultSet to create H2OFrames from for training and testing.
model_cls (str) – The type of model to use during training. Valid options are ‘glm’ and ‘gbm’.
is_causal (bool) – Indicates whether this is a causal model. Affects model metrics and publishing.
train_partition_name (str) – Customize which partition in the TQL ResultSet will be used for training.
test_partition_name (str) – Customize which partition in the TQL ResultSet will be used for model metrics.

CONTINUOUS_RESPONSES = ('continuous', 'value', 'linear')

DISCRETE_RESPONSES = ('binary', 'event', 'discrete', 'logistic')

MODEL_CLASSES = ('glm', 'gbm')
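The three tuples above, together with the constructor note that logistic responses are cast to enum and linear responses to numeric, suggest a simple lookup. The following is a minimal sketch only: the helper name label_h2o_type is hypothetical, and it assumes that every alias in DISCRETE_RESPONSES behaves like ‘logistic’ and every alias in CONTINUOUS_RESPONSES behaves like ‘linear’, which this page does not state explicitly.

```python
# Constants copied from the class attributes documented above.
CONTINUOUS_RESPONSES = ('continuous', 'value', 'linear')
DISCRETE_RESPONSES = ('binary', 'event', 'discrete', 'logistic')
MODEL_CLASSES = ('glm', 'gbm')


def label_h2o_type(response: str) -> str:
    """Hypothetical helper: map a response name to the H2O label column type."""
    if response in DISCRETE_RESPONSES:
        return 'enum'      # categorical label (assumed for all discrete aliases)
    if response in CONTINUOUS_RESPONSES:
        return 'numeric'   # real-valued label (assumed for all continuous aliases)
    raise ValueError(f"Unknown response type: {response!r}")


def check_model_class(model_cls: str) -> str:
    """Hypothetical helper: reject model classes outside MODEL_CLASSES."""
    if model_cls not in MODEL_CLASSES:
        raise ValueError(f"Unknown model class: {model_cls!r}")
    return model_cls
```

Under these assumptions, label_h2o_type('logistic') yields 'enum' and label_h2o_type('linear') yields 'numeric'.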

get_features()

get_label()

get_model_class()

get_response()

get_resultset()

get_tag()

get_test_frame()

get_train_frame()

get_weight()

is_continuous()

train(tuning_conf: Optional[tql.modeling.h2o.estimator.TuningConf] = None, bootstrap_conf: Optional[tql.modeling.h2o.estimator.BootstrapConf] = None, h2o_params: Optional[dict] = None) → tql.modeling.h2o.estimator.H2OPublishedModel

Trains the model from the imported H2OFrames, producing an instance of H2OPublishedModel with the given type, which can be published if desired. Optionally, hyper-parameter tuning and/or bootstrap model estimation can also be configured. A model type must be provided at time of training, which will be attached to the PublishedModel. Training and publishing a model allows PREDICT() expressions to be used in subsequent TQL queries.

If tuning_conf is provided, hyperparameter tuning will be run using the bayes_opt package, and an updated set of h2o parameters will be used when training the final model, including bootstraps.

If bootstrap_conf is provided, an ensemble of Bayesian Bootstrap models will also be trained, using perturbed weights. Bootstrap models will be published along with the main model, and are available using the PREDICT_ENSEMBLE(model_type) and CAUSAL_EFFECT_ENSEMBLE(model_type) TQL functions.

If h2o_params are provided, they will override the default values set by the system. Note that invoking hyper-parameter tuning will supersede user-provided h2o_params.

Parameters

tuning_conf (TuningConf) – Optionally provide a hyper-parameter tuning conf.
bootstrap_conf (BootstrapConf) – Optionally provide a bootstrap conf.
h2o_params (dict) – A dictionary of user provided H2O params

Returns

An H2OPublishedModel

Return type

H2OPublishedModel
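The precedence described for train() (system defaults, then user-provided h2o_params, then tuned values) can be pictured as a layered dictionary merge. This is a sketch under stated assumptions, not the library's code: resolve_h2o_params and SYSTEM_DEFAULTS are hypothetical names, the default values are invented for illustration, and it assumes that tuning supersedes user parameters key by key rather than discarding them wholesale.

```python
from typing import Optional

# Hypothetical defaults; the real system defaults are not listed on this page.
SYSTEM_DEFAULTS = {'alpha': 0.5, 'lambda_search': True}


def resolve_h2o_params(h2o_params: Optional[dict] = None,
                       tuned_params: Optional[dict] = None) -> dict:
    """Merge parameter layers so that later layers win on conflicting keys."""
    params = dict(SYSTEM_DEFAULTS)      # lowest precedence: system defaults
    params.update(h2o_params or {})     # user-provided values override defaults
    params.update(tuned_params or {})   # tuned values supersede user values
    return params


user_only = resolve_h2o_params({'alpha': 0.1})
with_tuning = resolve_h2o_params({'alpha': 0.1}, tuned_params={'alpha': 0.9})
```

Here user_only keeps the user's alpha together with the untouched default for lambda_search, while with_tuning ends up with the tuned alpha.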

class tql.modeling.h2o.estimator.H2OModelSummary(model, coef, metrics, has_bootstraps)

Bases: object

METRIC_TYPES = ('train_test', 'train', 'test')

OVERVIEW_METRICS = ('positives', 'label_sum', 'pred_sum', 'realized', 'incr_rate')

coefficients()

get_coef_dict()

get_model_metrics_dict()

model()

model_metrics()

scoring_overview()

class tql.modeling.h2o.estimator.H2OPublishedModel(resultset: zeenk.tql.resultset.ResultSet, estimates: list, model_type: Optional[str] = None)

Bases: zeenk.tql.modeling.publishedmodel.PublishedModel

An H2O-specific subclass of noumena.tql.model.PublishedModel. Once this model is published, it can be used in TQL queries to make predictions using the PREDICT(model_type) function. Additionally, if the is_causal_model flag is set to true, a CausalEffectConf will be extracted from the ResultSet and attached to the model so that it can be used in CAUSAL_EFFECT(model_type, effect_type) expressions.

resultset (ResultSet): The TQL ResultSet that was used to train this model. The ResultSet is required to attach the column expressions and causal model metadata to the published model.
estimates (list): A list of one or more H2OEstimate objects to attach to the PublishedModel. Typically, a model will have multiple estimates if bootstrapping was configured during training, in which case there will be one model estimate for each bootstrap. To access the additional predictions using TQL, use the PREDICT_ENSEMBLE(type) and CAUSAL_EFFECT_ENSEMBLE(type) functions.
model_type (str): A string with charset [A-Za-z0-9_] representing a unique name for this model. If a model with the same name exists in this project, it will be overwritten by this one when it is published. This type can also be provided to publish().

Create a new model estimate based on an artifact generated in model training. If this estimate is published and enabled, it can be used in predict() statements (and in causal_effect() statements, when configured) during dataset generation, either to score a set of data or as features in datasets used to train other models.

Parameters

project_id – the project_id for the model
columns – the list of Features used to generate the training dataset
artifact – a TrainingArtifact pointing to the output from model estimation. Supported: H2O mojos, VW output
additional_artifacts – Additional TrainingArtifacts produced with the same columns. Used for bootstrap metrics.
model_type – the “type” of model, used as the argument to predict() in DSL. It does not have to be provided here, but it is required at publish time and can be provided in publish() instead.
event_variables – precomputed event variables that were defined in the original query
timeline_variables – precomputed timeline variables that were defined in the original query
global_variables – precomputed global variables that were defined in the original query

Other properties:

.constant_enabled, .enabled_prefixes, .disabled_prefixes: control the use of features and the constant term at prediction time. Default settings: enable all features and constant. These may be manipulated directly, but favor the setter for the _prefixes settings.
.causal_effects_config: access directly to configure these settings

get_h2o_estimates()

summarize(conf=0.95)

class tql.modeling.h2o.estimator.TuningConf(metric: str = 'r2', iterations: int = 20, init_pts: int = 5, random_state: int = 7, bounds: Optional[dict] = None)

Bases: object

A simple configuration object for searching for the best set of H2O parameters to use when training a model. This configuration is provided to H2OEstimator.train(tuning_conf=TuningConf(…))

Parameters

metric (str) – The model metric to maximize. Default is r2.
iterations (int) – Passed to the BayesianOptimization maximizer. Default=20
init_pts (int) – Passed to the BayesianOptimization maximizer. Default=5
random_state (int) – Passed to the BayesianOptimization maximizer. Default=7
bounds (dict) – Provide search parameters and bounds. If not given, sensible defaults will be chosen based on the algorithm (gbm, glm, etc).
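To show how metric, iterations, init_pts, random_state, and bounds fit together, the sketch below runs a seeded search over a bounds dictionary and keeps the best-scoring point. It deliberately swaps in plain random search as a stand-in for the Bayesian optimizer, so it is not how the library tunes models. The function random_search, the toy objective, and the bounds format {name: (low, high)} are all assumptions made for illustration.

```python
import random


def random_search(objective, bounds, init_pts=5, iterations=20, random_state=7):
    """Random-search stand-in: evaluate init_pts + iterations seeded random points."""
    rng = random.Random(random_state)            # seeded, so runs are repeatable
    best_params, best_score = None, float('-inf')
    for _ in range(init_pts + iterations):
        # Draw one candidate uniformly inside each parameter's (low, high) bounds.
        candidate = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = objective(**candidate)           # the metric to maximize
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


def toy_metric(alpha):
    """Toy objective standing in for a model metric such as r2; peaks at alpha=0.3."""
    return 1.0 - (alpha - 0.3) ** 2


best, score = random_search(toy_metric, {'alpha': (0.0, 1.0)})
```

A Bayesian optimizer would use the first init_pts evaluations to fit a surrogate and choose the remaining iterations adaptively; the stand-in above only reproduces the evaluation budget and the seeding.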

get_bounds()

get_init_pts()

get_iterations()

get_metric()

get_random_state()

tql.modeling.h2o.utilities module

tql.modeling.h2o.utilities.cluster_multiplier(cluster_var, seed='salt', distribution='exponential', stratified=None)

Bayesian Bootstrap: Near exact approximation of the nonparametric bootstrap through a computationally efficient sampling strategy. By using a hashable ‘cluster_var’, approximate block-bootstrap sampling. These two bootstrap strategies approximate heteroskedasticity- and cluster-“robust” standard errors, respectively.

Stratified Bayesian Bootstrap (For Small Samples): Enforce asymptotic balance across Bayesian bootstrap sampling weights within each record (or ‘cluster_var’ group).

1. Use the hash of ‘cluster_var’ to permute a list of 1:B integers.
2. For the b^th bootstrap, use the random index c = c(b) to define the quantile stratum(b) = [(c-1)/B, c/B).
3. Transform the hash of ‘cluster_var’ into the stratum interval: p ~ uniform(stratum(b)).
4. Draw the iCDF(p, ‘exponential’).

This will ensure that each observation gets a quasi-random, representative, and stratified distribution of Bayesian bootstrap weights across the set of bootstraps, to better approximate the asymptotic approximation that would be obtained from a larger sample of bootstraps.
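The hashing and stratification steps described above can be sketched in plain Python. This is an illustrative approximation, not the implementation of cluster_multiplier: the names hash_uniform and sketch_cluster_multiplier, the use of SHA-256, the strat_seed argument, and the restriction to the exponential distribution are all assumptions.

```python
import hashlib
import math
import random


def hash_uniform(key: str, seed: str) -> float:
    """Deterministically map (seed, key) to a float strictly inside (0, 1)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") >> 12   # keep 52 bits of the hash
    return (value + 0.5) / 2.0 ** 52


def sketch_cluster_multiplier(cluster_var, seed="salt", stratified=None,
                              strat_seed="strat_seed"):
    """Return one exponential weight per key; equal keys always get equal weights."""
    weights = []
    for key in cluster_var:
        p = hash_uniform(key, seed)
        if stratified is not None:
            b, B = stratified
            # Permute 1..B using a generator seeded only by the key, so the
            # permutation is the same for every bootstrap index b.
            order = list(range(1, B + 1))
            random.Random(f"{strat_seed}:{key}").shuffle(order)
            c = order[b - 1]
            # Squeeze p into the quantile stratum [(c-1)/B, c/B).
            p = (c - 1 + p) / B
        # Inverse CDF of the unit exponential distribution: -log(1 - p).
        weights.append(-math.log1p(-p))
    return weights


keys = [str(i) for i in range(6)]
plain = sketch_cluster_multiplier(keys, seed="bootstrap")
first_stratified = sketch_cluster_multiplier(keys, seed="bootstrap1", stratified=(1, 10))
```

Because the permutation depends only on the key, each key visits every one of the B strata exactly once as b runs from 1 to B, which is the balance property the stratified variant is after.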

Parameters

cluster_var (list) – list of strings to be hashed
seed (str) – seed for random hashing
distribution (str) – Exponential or Poisson sampling distribution for records.
stratified (tuple) – (b,B) where B is the number of bootstraps, and b is the current bootstrap index.

Returns

Array of random exponential draws from (0,infinity) deterministically generated from cluster_var, given the seed.

Examples:

cluster_var = [str(i) for i in range(0,6)]
seed = 'bootstrap'

cluster_multiplier(cluster_var, seed=seed, distribution='exponential')

B = 10
for b in range(1,B+1):
  print(cluster_multiplier(cluster_var, seed=seed + str(b), distribution='exponential', stratified=(b,B,'strat_seed')))

tql.modeling.h2o.utilities.compute_column_ratios(df: h2o.frame.H2OFrame, pf: h2o.frame.H2OFrame)

Computes the ratio between the scale of the columns in df and pf.

tql.modeling.h2o.utilities.compute_model_metrics(df, col_types=None, label=None, weight=None, metric=None)

Compute and return the following metrics for a trained weighted regression. This function is typically used to compute metrics on the test sample but can also be used on the training sample to assess goodness-of-fit.

Metrics:

r2: R-squared (R^2) of the model where each error is weighted by its model weight.
r2_hom: (Computed, but not returned) Weighted r-squared of the model when the incrementality model features are zeroed out, except the average treatment effect (ATE) feature(s).
r2_base: (Computed, but not returned) Weighted r-squared of the model when the incrementality model features are zeroed out.
r2_incr: Incremental r-squared is the difference between the weighted r2 and r2_base. Intuitively, this is a measure of the predictive contribution of the incrementality features.
r2_het: Heterogeneous incremental r-squared is the difference between the weighted r2 and r2_hom. Intuitively, this is a measure of the predictive contribution of the heterogeneous incrementality features.
auc: Area-Under-the-Curve (AUC) is a measure of model predictive accuracy. This is a weighted version of the metric that redefines a positive as 'label'!=0.
aucc: Area-Under-the-Curve: Continuous (AUCc) is a generalization of AUC. Positive/Negative is defined by > and < the mean outcome. Each label is weighted by its deviation from the mean. This weight is multiplicative with any other sampling or frequency weighting. This metric can be interpreted as generalizing the True Positive Rate (TPR) and False Positive Rate (FPR) with “the share of positive (negative) labels’ deviation from the mean.” Intuitively, AUCc captures the idea that continuous-valued labels should be rank-ordered correctly by a prediction, even if the predictions are not calibrated (unbiased and appropriately scaled), perhaps due to regularization, nonlinearities, or imperfect numerical convergence of an estimator. AUC is the k=0 (L0-norm) version of AUCc-k where label > mean(label) is the definition of a positive and weight = w * (label - mean(label))^k where w is a sampling weight and k modulates the penalty for deviations. Here, we use k=1 for AUCc. AUCc can be interpreted as a summarization of the deviations in a quantile-quantile plot.
auc_base/aucc_base: Weighted AUC (AUCc) metric of the model when the incrementality model features are zeroed out.
auc_incr/auc_het/auc_hom: See r2 definitions for their auc analog.
aucc_incr/aucc_het/aucc_hom: See r2 definitions for their aucc analog.
label_avg: Weighted average value of the outcome/label.
pred_avg: Weighted average value of the prediction of the outcome/label.
avg_error: Average error of the prediction = label - prediction.
pred_sum/label_sum: Weighted sum of the prediction of the outcome/label. Should approximate the sum of the value of the positives.
conv_rate: Conversion rate: instantaneous conversion prediction. Approximates the total number of conversions based on the negative samples—should be close to pred_sum/label_sum though not identical since importance sampling is a form of numerical integration, which introduces a small amount of statistical error. Increase the num_samples in importance sampling to reduce this error.
incr_rate_avg/incr_rate_var: Average and variance of the individual incr_rate predictions. See incr_rate.
incr_rate: Weighted incrementality rate: instantaneous conversion prediction minus prediction with incrementality features zeroed out. Approximates the total number of incremental conversions based on the negative samples, similar to expected and realized causal effects metrics which are computed at the impression (treatment) and conversion (outcome) level. See also conv_rate and Causmos.effects().
realized/realized_raw: Realized causal effect: The sum of the weighted labels’ “incrementality share,” the ratio of the instantaneous calculation of incr_rate / conv_rate evaluated at the outcome’s timestamp. Estimates the total causal impact of the treatments on the observed outcomes during the sample window, analogous to incr_rate and expected - residual causal effect estimates. See also Causmos.effects(). Clipped/truncated to incrementality share between [-1,1]. realized_raw is unclipped.
het_rate: The sum of the demeaned incr_rate predictions. Should be close to zero.
het_rate_abs/het_rate_sq: Sum of the absolute/squared difference or variance of the demeaned difference in predictions between the full heterogeneous (HTE) and baseline predictions. Demeaning removes the average treatment effect (ATE).
samples: Number of records.
weight: Sum of the record weights.
positives: Number of records with nonzero outcome/label.
negatives: Number of records with zero outcome/label.
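The AUCc-k definition in the list above (positives are labels above the weighted mean, and each record's weight is multiplied by its deviation from the mean raised to the power k) can be written as a weighted pairwise comparison. The sketch below is an illustration under assumptions, not the code behind compute_model_metrics: the function name weighted_auc_k is hypothetical, the absolute deviation is used so that weights stay non-negative, and tied predictions count as half concordant.

```python
def weighted_auc_k(labels, preds, weights=None, k=1):
    """AUCc-k sketch: k=1 weights by deviation from the mean, k=0 ignores it."""
    n = len(labels)
    weights = [1.0] * n if weights is None else list(weights)
    mean = sum(w * y for w, y in zip(weights, labels)) / sum(weights)
    pos, neg = [], []
    for y, p, w in zip(labels, preds, weights):
        dev_weight = w * abs(y - mean) ** k      # assumed: absolute deviation
        (pos if y > mean else neg).append((p, dev_weight))
    total = sum(w for _, w in pos) * sum(w for _, w in neg)
    if total == 0:
        return float('nan')                      # undefined without both classes
    concordant = 0.0
    for p_pos, w_pos in pos:                     # O(n^2) loop, kept for clarity
        for p_neg, w_neg in neg:
            if p_pos > p_neg:
                concordant += w_pos * w_neg      # correctly ordered pair
            elif p_pos == p_neg:
                concordant += 0.5 * w_pos * w_neg
    return concordant / total


labels = [0, 0, 4, 6]
preds = [0.1, 0.5, 0.4, 0.9]
auc_k0 = weighted_auc_k(labels, preds, k=0)
auc_k1 = weighted_auc_k(labels, preds, k=1)
```

In this toy data the only mis-ordered pair involves the label 4, which is the positive closest to the mean, so the k=1 score is higher than the k=0 score: the ordering error is penalized less because the deviation is small.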

Parameters

df – Data frame output from Causmos.predict_with_h2o() with label, weight, and prediction columns. Prediction columns must be named ‘prediction’ (full heterogeneous model), ‘prediction_base’ (baseline), and ‘prediction_hom’ (homogeneous).
col_types (dict) – (Optional) Output from Causmos.get_col_types_h2o() can be used to retrieve values for label and weight.
label (str) – Name of ‘LABEL’ column in df.
weight (str) – Name of ‘WEIGHT’ column in df.
metric (str) – Name of metric to selectively compute. The default of None computes all metrics.

Returns

Dictionary of computed model metrics.

Return type

dict
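As a rough picture of the r2 family in the returned dictionary, the sketch below computes a weighted R-squared for the three prediction columns and derives r2_incr and r2_het as differences. It assumes the usual weighted definition of R-squared (one minus the weighted residual sum of squares over the weighted total sum of squares), which this page does not spell out. The function names and the list-of-dicts input are hypothetical stand-ins for the real data frame.

```python
def weighted_r2(labels, preds, weights):
    """Weighted R^2: each squared error is weighted by its record weight."""
    mean = sum(w * y for w, y in zip(weights, labels)) / sum(weights)
    ss_res = sum(w * (y - p) ** 2 for w, y, p in zip(weights, labels, preds))
    ss_tot = sum(w * (y - mean) ** 2 for w, y in zip(weights, labels))
    return 1.0 - ss_res / ss_tot


def r2_metrics(rows):
    """rows: dicts with label, weight, prediction, prediction_base, prediction_hom."""
    labels = [r['label'] for r in rows]
    weights = [r['weight'] for r in rows]
    r2 = weighted_r2(labels, [r['prediction'] for r in rows], weights)
    r2_base = weighted_r2(labels, [r['prediction_base'] for r in rows], weights)
    r2_hom = weighted_r2(labels, [r['prediction_hom'] for r in rows], weights)
    # r2_base and r2_hom are intermediate values and are not returned.
    return {'r2': r2, 'r2_incr': r2 - r2_base, 'r2_het': r2 - r2_hom}


rows = [
    {'label': 1.0, 'weight': 1.0, 'prediction': 1.0, 'prediction_base': 2.5, 'prediction_hom': 1.5},
    {'label': 2.0, 'weight': 1.0, 'prediction': 2.0, 'prediction_base': 2.5, 'prediction_hom': 2.5},
    {'label': 3.0, 'weight': 1.0, 'prediction': 3.0, 'prediction_base': 2.5, 'prediction_hom': 2.5},
    {'label': 4.0, 'weight': 1.0, 'prediction': 4.0, 'prediction_base': 2.5, 'prediction_hom': 3.5},
]
metrics = r2_metrics(rows)
```

In this toy data the full prediction is exact and the baseline predicts the mean, so all of the explained variance is attributed to the incrementality features.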

tql.modeling.h2o.utilities.create_mojo(h2o_model)

tql.modeling.h2o.utilities.get_col_type_h2o(col_type) → str

Given a TQL column type, return the corresponding H2O type, or None if no mapping exists.

tql.modeling.h2o.utilities.get_col_types_h2o(columns) → dict

Take a ResultSet’s output dataframe column names and try to append an H2O column type to improve importing of sparse data (e.g., nulls/zeros). See help(h2o.import_file) for more details.

Parameters: columns (list) – A list of Dataset output columns (expanded or not).
Returns: A dictionary of h2o.import_file compatible types, e.g., {'colname':'numeric'}.
Return type: dict

tql.modeling.h2o.utilities.partition_to_h2o(partition: zeenk.tql.resultset.Partition)

Converts a TQL ResultSet Partition to an H2OFrame.

tql.modeling.h2o.utilities.penalty_factor_h2o(df, label, weight, col_types, standardize=True, col_stdev=None, label_stdev=False, penalty_factor=None, default_penalty=1, invert=False)

Apply (or invert) a custom per-feature penalty factor. This is implemented by setting standardize=False when training an H2O GLM and then, rather than dividing each label and feature column by its respective variance, dividing by a custom penalty_factor. Typically, this penalty factor will be a multiple of the standard deviation of the feature. This implements a special case of Tikhonov regularization.

Note: Only computes standard deviation and applies transformations to

TODO: This does not apply to categorical columns. Consider how to use penalty_factor to accommodate categorical columns by setting the default_penalty in a way that seeks to accommodate categorical feature sparsity.

Parameters

df – H2O dataframe containing model_cols.
model_cols (dict) – A dictionary denoting the columns of df corresponding to the ‘label’, ‘weight’, and a list of ‘features’ for the model.
col_types (dict) – Output of Causmos.get_col_types_h2o() for the training dataset source.
standardize (boolean) – Multiply the penalty_factor by the standard deviation of each column with nonzero standard deviation. Bypass recomputing this by using col_stdev.
col_stdev (dict) – A dictionary of column names with their corresponding standard deviations. If not provided for any required column, will be computed and returned. Required if invert==True.
label_stdev (boolean) – Compute and transform the label as well, if it is numeric. This may help with numerical stability.
penalty_factor (dict) – A dictionary of column names with the corresponding custom relative penalizations. For example, {'x1':0.01} would weaken the penalization on feature x1 by a factor of 100 while {'x1':10} would strengthen its contribution by a factor of 10, leading it to dominate penalization calculations.
default_penalty (float) – The value of penalty_factor for columns not in penalty_factor.
invert (boolean) – Rather than dividing each column by the penalty_factor and standard deviation, instead multiply (divide by the reciprocal). This should revert the data frame to its original state.

Returns

A copy of the input H2OFrame with its columns transformed according to the inputs; col_stdev (see its description in the function arguments); and the divisor applied to each column.

Return type

(df, col_stdev, divisor)
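The scaling and its inverse can be illustrated on plain Python lists. This is a sketch of the idea only: apply_penalty is a hypothetical name, a dict of lists stands in for the H2OFrame, the population standard deviation is an assumed choice, and label handling (label_stdev) and column types are left out.

```python
import statistics


def apply_penalty(columns, penalty_factor=None, default_penalty=1.0,
                  standardize=True, col_stdev=None, invert=False):
    """columns: dict of column name -> list of floats (stand-in for an H2OFrame)."""
    if invert and standardize and col_stdev is None:
        raise ValueError("col_stdev is required when invert=True")
    penalty_factor = penalty_factor or {}
    col_stdev = dict(col_stdev or {})
    out, divisor = {}, {}
    for name, values in columns.items():
        d = penalty_factor.get(name, default_penalty)
        if standardize:
            if name not in col_stdev:
                col_stdev[name] = statistics.pstdev(values)  # assumed: population stdev
            if col_stdev[name] > 0:                            # skip constant columns
                d *= col_stdev[name]
        divisor[name] = d
        # Forward pass divides by the divisor; the inverse pass multiplies by it.
        out[name] = [v * d for v in values] if invert else [v / d for v in values]
    return out, col_stdev, divisor


cols = {'x1': [1.0, 2.0, 3.0], 'x2': [10.0, 10.0, 10.0]}
scaled, stdevs, divisors = apply_penalty(cols, penalty_factor={'x1': 0.01})
restored, _, _ = apply_penalty(scaled, penalty_factor={'x1': 0.01},
                               col_stdev=stdevs, invert=True)
```

Passing the returned col_stdev back in on the inverse call matters: recomputing the standard deviation from the already-scaled columns would give a different divisor and the round trip would not recover the original values.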

tql.modeling.h2o.utilities.predict(model, frame, label, weight, features, cal_df=None, is_causal=False)