tql.modeling package

class tql.modeling.BootstrapConf(bootstraps: int = 20, seed: str = 'default_seed')

Bases: object

A simple configuration object for estimating an ensemble of Bayesian Bootstrap variants of a model.

Parameters

bootstraps (int) – The number of bootstraps to train. Default=20
seed (str) – The seed to use when hashing the cluster_var to compute weight deviations. Default=‘default_seed’

get_bootstraps()

get_seed()
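The description above says the seed is hashed together with the cluster variable to derive weight deviations. The exact scheme is not documented here, so the following is only a minimal sketch of one way such weights could be derived. The function name `bootstrap_weight`, the choice of SHA-256, and the key layout are all assumptions for illustration, not the library's implementation.

```python
import hashlib
import math


def bootstrap_weight(cluster_value: str, bootstrap_index: int,
                     seed: str = "default_seed") -> float:
    """Return a deterministic, non-negative weight for one cluster in one bootstrap.

    Hypothetical helper: hashes (seed, bootstrap index, cluster value) so that
    every row sharing a cluster value receives the same weight in a given bootstrap.
    """
    key = f"{seed}|{bootstrap_index}|{cluster_value}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a value in (0, 1].
    u = (int.from_bytes(digest[:8], "big") + 0.5) / 2**64
    # Inverse-CDF transform to an Exponential(1) draw, which is the
    # unnormalized form of the Dirichlet weights used by the Bayesian bootstrap.
    return -math.log(u)


# One weight per bootstrap for a single (hypothetical) cluster value.
weights = [bootstrap_weight("user_42", b) for b in range(20)]
```

Because the weights come from a hash rather than a random number generator, rerunning with the same seed reproduces the same ensemble, and changing the seed yields a different set of perturbations.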

class tql.modeling.H2OEstimator(response: str, resultset: zeenk.tql.resultset.ResultSet, model_cls: str, train_partition_name: str = 'train', test_partition_name: str = 'test')

Bases: object

This class trains H2O models. An H2OEstimator is created by specifying:

Response type (logistic or linear)
A TQL ResultSet with at least two partitions, by default assumed to be named ‘train’ and ‘test’
Model class (currently GBM and GLM are supported)

During creation, the H2OEstimator parses the TQL ResultSet into H2OFrames and stores them internally. The H2OEstimator then exposes a single operational method, .train(…), which estimates a model from the training data and produces an instance of H2OPublishedModel. This class supports hyperparameter tuning and Bayesian bootstrapping through optional configuration passed to .train(…).

Example Usage:

>>> from zeenk.tql import *
>>> from zeenk.tql.modeling.h2o import H2OEstimator
>>> resultset = select(label(...), col(...), col(...), ...).from_events(...)
>>> trainer = H2OEstimator('linear', resultset, 'glm')
>>> model = trainer.train()
>>> model.publish('my_model')

Create an instance of H2OEstimator. The ResultSet is expected to have at least two data partitions, by default assumed to be named ‘train’ and ‘test’, although this can be customized with the train_partition_name and test_partition_name arguments. The label column in your data will be cast to the correct H2O type for the response type: enum for logistic, numeric for linear. At this time, two model classes are supported: ‘gbm’, which corresponds to H2O’s H2OGradientBoostingEstimator, and ‘glm’, which corresponds to H2O’s H2OGeneralizedLinearEstimator. When this constructor is invoked, the TQL ResultSet partitions are converted to H2OFrames and attached to this trainer. Those H2OFrames are available for inspection or out-of-band reuse via .get_train_frame() and .get_test_frame().

Parameters

response (str) – The type of response. Valid options are ‘logistic’ and ‘linear’.
resultset (ResultSet) – A TQL ResultSet to create H2OFrames from for training and testing.
model_cls (str) – The type of model to use during training. Valid options are ‘glm’ and ‘gbm’.
is_causal (bool) – Indicate whether this is a causal model. Affects model metrics and publishing.
train_partition_name (str) – Customize which partition in the TQL ResultSet will be used for training.
test_partition_name (str) – Customize which partition in the TQL ResultSet will be used for model metrics.

CONTINUOUS_RESPONSES = ('continuous', 'value', 'linear')

DISCRETE_RESPONSES = ('binary', 'event', 'discrete', 'logistic')

MODEL_CLASSES = ('glm', 'gbm')
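The constructor notes state that the label is cast to enum for logistic responses and to numeric for linear responses. A minimal sketch of that mapping follows. The helper `h2o_label_type` is hypothetical, and treating every alias in the two tuples the same way as ‘linear’ and ‘logistic’ is an assumption based only on the constant names.

```python
# Constants copied from the class attributes listed above.
CONTINUOUS_RESPONSES = ("continuous", "value", "linear")
DISCRETE_RESPONSES = ("binary", "event", "discrete", "logistic")
MODEL_CLASSES = ("glm", "gbm")


def h2o_label_type(response: str) -> str:
    """Hypothetical helper: return the H2O column type the label would be cast to."""
    if response in CONTINUOUS_RESPONSES:
        return "numeric"  # documented for 'linear'; assumed for its aliases
    if response in DISCRETE_RESPONSES:
        return "enum"  # documented for 'logistic'; assumed for its aliases
    raise ValueError(f"unsupported response type: {response!r}")
```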

get_features()

get_label()

get_model_class()

get_response()

get_resultset()

get_tag()

get_test_frame()

get_train_frame()

get_weight()

is_continuous()

train(tuning_conf: Optional[zeenk.tql.modeling.h2o.estimator.TuningConf] = None, bootstrap_conf: Optional[zeenk.tql.modeling.h2o.estimator.BootstrapConf] = None, h2o_params: Optional[dict] = None) → zeenk.tql.modeling.h2o.estimator.H2OPublishedModel

Trains the model from the imported H2OFrames, producing an instance of H2OPublishedModel with the given type, which can be published if desired. Optionally, hyperparameter tuning and/or bootstrap model estimation can also be configured. A model type must be provided at time of training, which will be attached to the PublishedModel. Training and publishing a model allows PREDICT() expressions to be used in subsequent TQL queries.

If tuning_conf is provided, hyperparameter tuning will be run using the bayes_opt package, and an updated set of h2o parameters will be used when training the final model, including bootstraps.

If bootstrap_conf is provided, an ensemble of Bayesian Bootstrap models will also be trained, using perturbed weights. Bootstrap models will be published along with the main model, and are available using the PREDICT_ENSEMBLE(model_type) and CAUSAL_EFFECT_ENSEMBLE(model_type) TQL functions.

If h2o_params are provided, they will override the default values set by the system. Note that invoking hyperparameter tuning will supersede user-provided H2O params.

Parameters

tuning_conf (TuningConf) – Optionally provide a hyper-parameter tuning conf.
bootstrap_conf (BootstrapConf) – Optionally provide a bootstrap conf.
h2o_params (dict) – A dictionary of user provided H2O params

Returns

An H2OPublishedModel

Return type

H2OPublishedModel
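The precedence among parameter sources described for train() (system defaults, then user-provided h2o_params, then values found by tuning) can be sketched as a simple dictionary merge. This is only an illustration: `resolve_h2o_params` is a hypothetical helper, the default values are made up, and per-key overriding is an assumption, since the text does not say whether tuning replaces user parameters wholesale or key by key.

```python
from typing import Optional


def resolve_h2o_params(defaults: dict,
                       user: Optional[dict] = None,
                       tuned: Optional[dict] = None) -> dict:
    """Merge parameter sources; later sources win on key conflicts."""
    resolved = dict(defaults)      # copy so the defaults are not mutated
    resolved.update(user or {})    # user-provided h2o_params override defaults
    resolved.update(tuned or {})   # tuned values supersede user-provided ones
    return resolved


# Illustrative values only; ntrees and max_depth are H2O GBM parameter names.
params = resolve_h2o_params(
    defaults={"ntrees": 50, "max_depth": 5},
    user={"ntrees": 100, "max_depth": 4},
    tuned={"max_depth": 8},
)
```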

tql.modeling.H2OModelTrainer: alias of zeenk.tql.modeling.h2o.estimator.H2OEstimator

class tql.modeling.TuningConf(metric: str = 'r2', iterations: int = 20, init_pts: int = 5, random_state: int = 7, bounds: Optional[dict] = None)

Bases: object

A simple configuration object for searching for the best set of H2O parameters to use when training a model. This configuration is provided to H2OEstimator.train(tuning_conf=TuningConf(…))

Parameters

metric (str) – The model metric to maximize. Default is r2.
iterations (int) – Passed to the BayesianOptimization maximizer. Default=20
init_pts (int) – Passed to the BayesianOptimization maximizer. Default=5
random_state (int) – Passed to the BayesianOptimization maximizer. Default=7
bounds (dict) – Provide search parameters and bounds. If not given, sensible defaults will be chosen based on the algorithm (gbm, glm, etc).

get_bounds()

get_init_pts()

get_iterations()

get_metric()

get_random_state()
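To show how metric, iterations, init_pts, random_state and bounds fit together, here is a small stand-in search loop. It uses plain random search, not Bayesian optimization, and it does not call the bayes_opt package. The bounds format (parameter name mapped to a (low, high) tuple) is an assumption modeled on that package's convention, and `fake_r2` is a made-up objective standing in for "train a model and return its r2".

```python
import random
from typing import Callable, Dict, Tuple

Bounds = Dict[str, Tuple[float, float]]


def search(objective: Callable[..., float], bounds: Bounds,
           iterations: int = 20, init_pts: int = 5, random_state: int = 7) -> dict:
    """Evaluate init_pts + iterations random points and keep the best one.

    A real Bayesian optimizer would use the first init_pts evaluations to seed
    a surrogate model and then choose the remaining points adaptively.
    """
    rng = random.Random(random_state)  # fixed seed makes the search repeatable
    best = {"target": float("-inf"), "params": {}}
    for _ in range(init_pts + iterations):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        target = objective(**params)
        if target > best["target"]:  # the metric is maximized
            best = {"target": target, "params": params}
    return best


def fake_r2(learn_rate: float, max_depth: float) -> float:
    """Hypothetical objective with its maximum (1.0) at learn_rate=0.1, max_depth=6."""
    return 1.0 - (learn_rate - 0.1) ** 2 - ((max_depth - 6.0) / 10.0) ** 2


best = search(fake_r2, {"learn_rate": (0.01, 0.3), "max_depth": (2.0, 10.0)})
```

With the defaults, the objective is evaluated 25 times in total, which is why iterations and init_pts together determine the cost of tuning.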

tql.modeling.load_published_model(project_id, identifier): Loads a specific PublishedModel

tql.modeling.show_published_models(project_id=None, type=None, only_enabled=True, limit=20, date_range=None): Shows the published models

Subpackages

tql.modeling.h2o package

Submodules

tql.modeling.publishedmodel module

class tql.modeling.publishedmodel.CausalEffectsConf(opportunity_filter_expressions=None, outcome_filter_expressions=None)

Bases: object

Required settings to use causal_effect(“type”)

Parameters

opportunity_filter_expressions (list) – a list of Expressions that evaluate to true to identify a treatment opportunity
outcome_filter_expressions (list) – a list of Expressions that evaluate to true to identify a positive outcome

json() → dict

Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns: The current TQL Query as a dict

class tql.modeling.publishedmodel.Coefficient(name, weight)

Bases: object

Universal Model Coefficient

Parameters

name – (string) coefficient name
weight – (int) coefficient weight

class tql.modeling.publishedmodel.ModelEstimate(artifact_path=None, h2o_model=None)

Bases: object

Base class for training artifacts.

add_test_case(test_case)

Test cases are just a dictionary for now. They can be used to test that executing the artifact yields expected results. The names and values in the dictionary should match the expectations of the estimate artifact from the specific estimator engine. For example:

{ 'label': float,
    'truth': nullable float,
    'weight': nullable float,
    'tag': str,
    'features': [
        { 'name': name known to artifact,
          'value': str,
          'numerical_value': float
        }, ...
    ]
}

If these graduate to a class, json and _from_spec will need overrides
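A concrete instance of the schema above may make the expected shape easier to see. The sketch below builds one test case and checks it against the listed keys and types. The validator `is_valid_test_case`, the feature name `w_hour_of_day`, and the assumption that every key must be present are all illustrative, not part of the library.

```python
REQUIRED_KEYS = {"label", "truth", "weight", "tag", "features"}
FEATURE_KEYS = {"name", "value", "numerical_value"}


def is_valid_test_case(test_case: dict) -> bool:
    """Hypothetical check that a dict follows the test-case schema shown above."""
    if set(test_case) != REQUIRED_KEYS:
        return False
    if not isinstance(test_case["label"], float) or not isinstance(test_case["tag"], str):
        return False
    # 'truth' and 'weight' are nullable floats.
    for key in ("truth", "weight"):
        if test_case[key] is not None and not isinstance(test_case[key], float):
            return False
    return all(
        set(feature) == FEATURE_KEYS
        and isinstance(feature["name"], str)
        and isinstance(feature["value"], str)
        and isinstance(feature["numerical_value"], float)
        for feature in test_case["features"]
    )


test_case = {
    "label": 1.0,
    "truth": None,
    "weight": None,
    "tag": "example",
    "features": [
        # The feature name is made up; it must be a name known to the artifact.
        {"name": "w_hour_of_day", "value": "13", "numerical_value": 13.0},
    ],
}
```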

get_artifact_path()

get_engine()

get_test_cases()

json() → dict

Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns: The current TQL Query as a dict

class tql.modeling.publishedmodel.PublishedModel(project_id, columns, artifact, additional_artifacts=None, model_type=None, event_variables=None, timeline_variables=None, global_variables=None)

Bases: object

A PublishedModel wraps an artifact generated via model training and can be published and enabled for a project. It can then be invoked in the DSL via predict() and causal_effect(), either to score a set of data or to use as features in datasets used to train other models.

Create a new model estimate based on an artifact generated in model training. If this estimate is published and enabled, it can be used in predict() statements (and in causal_effect() statements, when configured) during dataset generation, either to score a set of data or for use as features in datasets used to train other models.

Parameters

project_id – the project_id for the model
columns – the list of Features used to generate the training dataset
artifact – a TrainingArtifact pointing to the output from model estimation. Supported: H2O mojos, VW output
additional_artifacts – Additional TrainingArtifacts produced with the same columns. Used for bootstrap metrics.
model_type – the “type” of model, used as the argument to predict() in the DSL. It is optional here but required at publish time, and can instead be provided in publish().
event_variables – precomputed event variables that were defined in the original query
timeline_variables – precomputed timeline variables that were defined in the original query
global_variables – precomputed global variables that were defined in the original query

Other properties:

.constant_enabled, .enabled_prefixes, .disabled_prefixes: control the use of features and the constant term at prediction time. Default settings: enable all features and constant. These may be manipulated directly, but favor the setter for the _prefixes settings.
.causal_effects_config: access directly to configure these settings

causal_effects_conf(oppr_conf, where_clauses): Uses oppr_conf to set up a CausalEffectsConf for this PublishedModel

delete(): Deletes this PublishedModel

describe(): Producing a readable version of Estimate

disable_model(): Disables this PublishedModel

disable_prefixes(prefixes)

Provide a list of prefixes identifying Features that should be suppressed when performing model inference. Clears enabled_prefixes.

Parameters: prefixes – the list of prefixes, eg [‘w’, ‘u’]

enable_prefixes(prefixes)

Provide a list of prefixes identifying Features that should be enabled when performing model inference (all others will be disabled). Clears disabled_prefixes.

Parameters: prefixes – the list of prefixes, eg [‘w’, ‘u’]
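The interaction between enable_prefixes() and disable_prefixes() (each call clears the other list, and by default every feature is enabled) can be sketched with a small stand-in class. `PrefixFilter` and its `is_active` method are hypothetical, and matching a prefix with `str.startswith` on the feature name is an assumption about how prefixes identify features.

```python
class PrefixFilter:
    """Stand-in that mimics the documented enable/disable prefix behavior."""

    def __init__(self):
        # Default: no restrictions, so every feature is active.
        self.enabled_prefixes = []
        self.disabled_prefixes = []

    def enable_prefixes(self, prefixes):
        self.enabled_prefixes = list(prefixes)
        self.disabled_prefixes = []  # enabling clears disabled_prefixes

    def disable_prefixes(self, prefixes):
        self.disabled_prefixes = list(prefixes)
        self.enabled_prefixes = []  # disabling clears enabled_prefixes

    def is_active(self, feature_name: str) -> bool:
        if self.enabled_prefixes:
            # Allow-list mode: only features with an enabled prefix are used.
            return feature_name.startswith(tuple(self.enabled_prefixes))
        # Deny-list mode: everything is used except disabled prefixes.
        return not feature_name.startswith(tuple(self.disabled_prefixes))


prefix_filter = PrefixFilter()
prefix_filter.disable_prefixes(["w", "u"])
```

Since each setter resets the other list, the most recent call determines whether the model runs in allow-list or deny-list mode.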

get_artifacts(): Gets the artifacts of this PublishedModel

get_causal_effects_conf(): Gets the causal effects configuration of this PublishedModel

get_columns(): Gets the columns of this PublishedModel

get_disabled_prefixes(): Gets the disabled prefixes of this PublishedModel

classmethod get_enabled(project_id, type): Return the enabled published model for specified project and type

get_enabled_prefixes(): Gets the enabled prefixes of this PublishedModel

get_id(): Gets the ID of this PublishedModel

get_project_id(): Gets the Project ID of this PublishedModel

get_type(): Gets the type of this PublishedModel

json() → dict

Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns: The current TQL Query as a dict

classmethod list(project_id=None, type=None, only_enabled=True, limit=20, date_range=None)

Pretty print a list of published model estimates and some info about them

Parameters

project_id (int) – only list models for this project (optional)
type – only models of this type
only_enabled (boolean) – only enabled models
limit (int) – list the most recent <limit> models
date_range (tuple) – created in (from, to) datetime.Date, inclusive
filter_out (list) – a list of lambdas to run on the returned rows, dropping any that return true
show (bool) – whether or not to display resulting dataframe
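The filtering semantics in the parameter list above (an inclusive date range, predicates that drop any row for which they return true, and a limit on the most recent rows) can be illustrated without the web service. In this sketch the row shape (a dict with `id`, `type` and `created` keys) and the helper `filter_rows` are assumptions made for the example.

```python
from datetime import date


def filter_rows(rows, date_range=None, filter_out=None, limit=20):
    """Apply list()-style filtering to in-memory rows (hypothetical helper)."""
    kept = []
    for row in rows:
        if date_range is not None:
            start, end = date_range
            if not (start <= row["created"] <= end):  # both ends inclusive
                continue
        # Drop the row if any predicate returns true for it.
        if any(predicate(row) for predicate in (filter_out or [])):
            continue
        kept.append(row)
    # Most recent first, then truncate to the limit.
    kept.sort(key=lambda r: r["created"], reverse=True)
    return kept[:limit]


rows = [
    {"id": 1, "type": "conversion", "created": date(2021, 1, 1)},
    {"id": 2, "type": "churn", "created": date(2021, 1, 15)},
    {"id": 3, "type": "conversion", "created": date(2021, 2, 1)},
]
january = filter_rows(
    rows,
    date_range=(date(2021, 1, 1), date(2021, 1, 31)),
    filter_out=[lambda r: r["type"] == "churn"],
)
```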

classmethod load(id): Load a published estimate by id

publish(model_type=None): Publish this Estimate to the webservice API. If it is enabled, it will be available for use in dataset generation. Returns the model id.

class tql.modeling.publishedmodel.RowVector(label=None, truth=None, weight=None, tag=None, features=None, meta_features=None)

Bases: object

A RowVector is a feature vector for use with the universal model.

Parameters

label – feature label value (numerical)
truth – feature truth value (numerical)
weight – feature weight value (numerical)
tag – feature tag value (numerical)
features – a list of feature specs that the backend accepts (see generate_feature_spec() for an example)
meta_features – a list of feature specs that the backend accepts (see generate_feature_spec() for an example)

add_feature(name, value, numerical_value): Adds a feature to this RowVector

add_meta_feature(name, value, numerical_value): Adds a meta feature to this RowVector

json() → dict

Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns: The current TQL Query as a dict

tql.modeling.publishedmodel.load_published_model(project_id, identifier): Loads a specific PublishedModel

tql.modeling.publishedmodel.show_published_models(project_id=None, type=None, only_enabled=True, limit=20, date_range=None): Shows the published models