tql package

tql.categorical(expr, name: Optional[str] = None, filters: Optional[dict] = None) zeenk.tql.column.FeatureColumn

Creates a categorical feature column from a TQL expression, optionally providing name and filters

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

  • filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters

Returns

A FeatureColumn object to be provided to select(...)

Return type

FeatureColumn
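
A minimal usage sketch (the expression and the filter key are illustrative; see ColumnFilters for the available filter conditions):

# Hypothetical field and filter key; substitute values from your project's schema.
os_feature = tql.categorical('user.device_os', name='device_os',
                             filters={'min_total_count': 10})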

tql.col(c: any, name: Optional[str] = None, type: Optional[str] = None, filters: Optional[dict] = None)

Creates a column from a TQL expression or given input, optionally providing name, type, and filters. The first argument can be a variety of formats, including:

  • classes or subclasses of type Column

  • dictionary containing keys for name, expression, and type

  • raw strings, which will be interpreted as TQL expressions.

The created column is by default unnamed, and will be assigned a name if used in a TQL select(...) statement. The default type of columns is ‘METADATA’.

Parameters
  • c (any) – A TQL expression string, column object, dictionary, list, tuple, or FeatureColumn object

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

  • type (str) – Optionally provide a type for the column. If not provided, the column type will be METADATA

  • filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters

Returns

A Column or FeatureColumn object to be provided to select(...)
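
A short sketch of the accepted input formats (the field name is taken from the external_timelines example later in this document):

# Three equivalent ways to define the same column:
c1 = tql.col('request.bid', name='bid', type='NUMERICAL')   # raw TQL expression
c2 = tql.col({'name': 'bid', 'expression': 'request.bid', 'type': 'NUMERICAL'})
c3 = tql.col(tql.numerical('request.bid', name='bid'))      # existing Column object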

tql.constant()

Used in select(…) statements to select the FeatureColumn “1.0” with name “constant”

Returns

A tuple with the FeatureColumn “1.0” with name “constant”

Return type

tuple

tql.create_project(name: str, or_update: bool = False) zeenk.tql.timelines.Project

Creates a new project. This project will require further configuration before timelines can be created. If the project already exists and you wish to update it, use load_project(name), or give the or_update=True option here.

Parameters
  • name (str) – The name of the Project

  • or_update (bool) – Whether to update the Project

Returns

The Project that has just been created

Return type

Project
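
A minimal sketch of typical usage:

project = tql.create_project('my_project')
# If the project already exists, either load_project('my_project') or:
project = tql.create_project('my_project', or_update=True)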

tql.create_timeseries(name: str) zeenk.tql.timeseries.TimeSeries

Creates a TimeSeries with the given name

Parameters

name (str) – The name of the TimeSeries

Returns

A new TimeSeries with the given name

Return type

TimeSeries

tql.debugger(project, expression: str = '', theme: str = 'light')

Create an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs

Parameters
  • project – The Project ID or name to run the expression against

  • expression (str) – The initial value of the expression, if any

  • theme (str) – The editor theme, ‘light’ or ‘dark’

Returns

A Jupyter Notebook/Lab widget

tql.delete_udf(project_id: int, function_name: str)

Deletes a UDF for a specific Project by function name

Parameters
  • project_id (int) – ID of the Project to delete udf from

  • function_name (str) – Name of the udf to be deleted

tql.describe_project(project_identifier, fail_if_not_found: bool = True) zeenk.tql.timelines.Project

Loads the specified project by name or ID, or throws TQLAnalysisException if not found

Parameters
  • project_identifier – The name (str) or ID (int) of the Project

  • fail_if_not_found (bool) – Whether to throw an exception if the Project is not found

Returns

The Project with the specified name or ID

Return type

Project

tql.drop_project(name_or_id, if_exists: bool = False)

Deletes the project by name or ID

Parameters
  • name_or_id – The name (str) or ID (int) of the Project

  • if_exists (bool) – Only drop the Project if it exists

tql.event_metadata() tuple

To be used in select(…) statements for returning timeline.id, id, datetime, and type

Returns

A tuple of Columns

Return type

tuple

tql.event_time() tuple

To be used in select(…) statements for returning timestamp and duration

Returns

A tuple of Columns

Return type

tuple

tql.generate_events(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False) zeenk.tql.sampling.SamplingConf

Manual sampling is similar to importance sampling but allows the user to manually override parameters such as the list of sample events (e.g., timestamps at which sampling occurs) and other properties of each generated sample event. See generate_importance_events() for an advanced use case.

tql.generate_importance_events(num_samples: float = -1.0, min_ts: Optional[str] = None, max_ts: Optional[str] = None, time_shift_factor: float = 0.0, fraction_uniform_time: float = 0.0, sampling_distribution: str = 'exponential', sampling_events_expression: str = 'FILTER(timeline.events, (x) -> x.type=null)', sampling_kernels: str = '5m,15m,1h,4h,1d,3d,7d') zeenk.tql.sampling.SamplingConf

Importance sampling can be used to retain modeling unbiasedness (avoid introducing selection bias when sampling records) while still increasing the number of records where the modeling is most interesting. For example, when modeling the causal effect of a treatment on an outcome, we would like to ensure that most of our records (whether ‘positive’, an outcome event, or ‘negative’, a non-outcome sampling event) are in the vicinity of a treatment opportunity or an outcome. In doing so, we increase the model’s statistical power to decipher the relationship between the two. In contrast, if most outcomes happen during only 10% of the sample time period, most of our observations will fall during the “boring” portion of the timeline when no events of interest are occurring.

generate_importance_events helps you configure what time periods are “interesting.” You configure how many records to randomly sample for each timeline, which timestamps or events you want to increase your sampling around, the distribution (shape and scale) around each event from which you would like to randomly sample, and the probability you would like to draw from the background uniform distribution (e.g., a random point in the timeline).

In summary, one way to generate negative (non-outcome) records would be to simply draw uniformly between the start and end of the timeline’s observation window. However, we can improve upon that by instructing the extractor to generate these negatives by a configurable time-importance-weighted sampling methodology around times of timeline events.
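
A sketch of a typical configuration (parameter values are illustrative); the resulting SamplingConf can then be passed to Query.from_union():

sampling = tql.generate_importance_events(
    num_samples=100,                       # records to sample per timeline
    sampling_events_expression="FILTER(timeline.events, (x) -> x.type='request')",
    sampling_kernels='15m,4h,3d',          # sample at these time scales around events
    sampling_distribution='exponential',
    fraction_uniform_time=0.1)             # 10% drawn from the uniform background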

tql.get_udf(project_id: int, function_name: str) zeenk.tql.udf.UDF

Retrieve a specific UDF from a Project

Parameters
  • project_id (int) – ID of the Project to retrieve UDF on

  • function_name (str) – Name of the UDF to retrieve

Returns

A UDF or throws TQLAnalysisNotFound exception

Return type

UDF

tql.kernelize(features, name: Optional[str] = None, treatment_filter_event: Optional[str] = None, event_group_where_event: Optional[str] = None, intended_action: Optional[list] = None, actual_action: Optional[list] = None, treatment_model: Optional[str] = None, kernel_types: Optional[list] = None, opportunity_conf: Optional[zeenk.tql.opportunity_conf.OpportunityConf] = None, kernel_parameters: Optional[str] = None, kernel_distribution: Optional[str] = None, filters: Optional[dict] = None, parse_kv: bool = False)

Takes a list of opportunity-level features based on user/opportunity/treatment/proxy-outcome-level fields and creates a transformed feature with SUM_OPPORTUNITIES() accumulating the effects across each opportunity. These “kernelized” features are for use in a Causmos timeline-based causal model where each observation record in a dataset is an outcome or a potential outcome (e.g., a moment in time when an outcome could have occurred). For example, we can power dataset generation using a treatment-propensity model to reduce the potential bias if treatment and the outcome are correlated. Apply this function to each list of features to be kernelized. There are several types of transformations: KD (‘treatment’), BKD (‘baseline’), GKD (‘ghost’), and NKD (‘nonrandom’).

Parameters
  • features (list) – List of col() to be kernelized (KD,BKD,GKD,NKD). Examples: ["1.0", "IF(1.0,'L','R')"], [{"name":"constant", "expression":"1.0", "type":"NUMERICAL"}]. Columns can be “CATEGORICAL” or “NUMERICAL”. Each categorical expression creates an expansion of features; each numerical expression multiplies/reweights the opportunity’s contribution to the SUM_OPPORTUNITIES() sum.

  • name (str) – Base string for the feature sets. Will append type of kernel and prepend “w” to denote non-incrementality features for BKD, GKD, and NKD. If name=’’ (default), just return the list of features with the default prefix: AKD_..., wBKD_....

  • treatment_filter_event (str) – Column that defines whether a treatment opportunity resulted in treatment. For example, was the advertiser’s impression shown after bidding? In a sample dataset, request.impression is the relevant field.

  • event_group_where_event (str) – Second where condition within the passed EventGroup (currently this is set to the output of filtering such as dedupe, but should be made more explicit).

  • intended_action – List of numerical/categorical expressions defining the intended/optimal decisions, e.g., bid_amount or eligible_treatment_groups.

  • actual_action – List of numerical/categorical expressions defining the actual action/decision taken, e.g., IF(ghost_bid, 0, bid_amount) or assigned_treatment_group.

  • treatment_model (str) – String for the deployed/published treatment prediction model (GKD, NKD-only). In practice, this will be the win-rate model based on the leaves.

  • kernel_types (list) – List of kernels types to include in the feature set. Subset of ['KD','BKD','GKD','NKD'].

  • opportunity_conf – An OpportunityConf() object that contains the defaults for opportunity_filter, kernel_parameters, and kernel_distribution.

  • kernel_parameters (str) – Calibration for kernel.days in the kernel parameters. Arrays of kernel features are created with suffixes using values such as seconds (s), minutes (m), hours (h), and days (d) in combination with a number, such as '15m,4h,3d': 'name_feature' becomes 'name_feature-15m', etc.

  • kernel_distribution (str) – Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), halfnormal (h, hnorm), and halflogistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetrictriangular (s, stri), normal (n, norm), and logistic (o, log). Time-independent constant kernel (c, const) is also useful for various use cases such as static models in order to accumulate all opportunities and treatments for each outcome.

  • filters (dict) – column filters to apply to all of the features.

  • parse_kv (boolean) – Try to parse ‘NUMERICAL’ input features as key-value pairs ‘k:v’. This has extra overhead, but allows more complex mixed ‘categorical:numerical’ input features.

Returns

A list of feature sets for each kernel_type, each containing a list of kernelized features.

Return type

list
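
A minimal sketch (feature expressions and kernel settings are illustrative):

features = tql.kernelize(
    [tql.numerical('1.0', name='constant')],   # opportunity-level features
    name='treat',
    kernel_types=['KD', 'BKD'],                # treatment and baseline kernels only
    kernel_parameters='15m,4h,3d',
    kernel_distribution='exponential')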

tql.label(expr, name: str = '_label') zeenk.tql.column.Column

Creates a label column from a TQL expression. The expression provided to label() is expected to return a numeric value for all rows. If a row’s value is non-numeric, NaN, or infinite, it will be replaced with the default label value of 0.0. Label columns will automatically be named “_label”. It is expected that a dataset will have at most one label column.

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column
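
For example, labeling conversion events (the event type name is illustrative):

label_col = tql.label("IF(type='conversion', 1.0, 0.0)")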

tql.list_udfs(project_id: int) zeenk.tql.udf.UDF

Retrieves all UDFs defined for a Project

Parameters

project_id (int) – ID of the Project to retrieve UDFs on

Returns

All existing UDFs for the Project

Return type

UDF

tql.load_project(project_identifier, fail_if_not_found: bool = True) zeenk.tql.timelines.Project

Loads the specified project by name or ID, or throws TQLAnalysisException if not found

Parameters
  • project_identifier – The name (str) or ID (int) of the Project

  • fail_if_not_found (bool) – Whether to throw an exception if the Project is not found

Returns

The Project with the specified name or ID

Return type

Project

tql.load_query(id: int) zeenk.tql.query.Query

Loads the query from the given ID

Parameters

id (int) – The ID of the query

Returns

A new Query instance

Return type

Query

tql.load_resultset(id: int) zeenk.tql.resultset.ResultSet

Loads the ResultSet with the given ID

Parameters

id (int) – The ID of the ResultSet to be loaded

Returns

The ResultSet

Return type

ResultSet

tql.metadata(expr, name: Optional[str] = None) zeenk.tql.column.Column

Creates a metadata column from a TQL expression. Metadata columns will return the expression values “as is”, meaning they will not be post-processed with charset filtering or expansion of numerical columns. Metadata columns are also not subject to column filters. Metadata columns are the default column type and are often the correct choice for arbitrary datasets that are not specifically intended to be consumed by an ML training package.

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column

tql.numerical(expr, name: Optional[str] = None, filters: Optional[dict] = None) zeenk.tql.column.FeatureColumn

Creates a numerical feature column from a TQL expression, optionally providing name and filters

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

  • filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters

Returns

A FeatureColumn object to be provided to select(...)

Return type

FeatureColumn

tql.query(project='lethe4')

A simple example query with event_metadata() and a limit of 10

Parameters

project – The Project or project ID to run the Query against

Returns

An example Query object

Return type

Query

tql.random_partition(partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.8, 0.2], names: list = ['train', 'test'])

Use the arguments to create a random partition TQL expression like col("MD5_PARTITIONS(timeline.id, 'my hashing seed', [0.008, 0.002], ['train','test'])"). This can be used as a column or with .partition_by().

Parameters
  • partition_var (str) – Single-line TQL subexpression

  • seed (str) – String to serve as the seed for the hash partitioning

  • shares (list) – List of the relative shares of each partition. Relative shares do not need to sum to one

  • names (list) – List of the names for each partition

Returns

A metadata column with the random_partition expression for use in .partition_by() or as a column

Return type

Column
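
For example, a standard 80/20 train/test split:

part = tql.random_partition(seed='my hashing seed', shares=[0.8, 0.2],
                            names=['train', 'test'])
# Use as a column in select(...), or partition the output via .partition_by(part)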

tql.select(*cols) zeenk.tql.query.Query

Create a new query object from one or more columns. Columns can be defined as TQL expression strings, or wrapped using one of the provided TQL column functions label(), weight(), tag(), categorical(), numerical(), or metadata(). Typically select(...) will immediately be followed by .from_timelines(...) or .from_events(...) during query construction.

Parameters

cols – One or more TQL column objects or expressions

Returns

A new tql.query.Query instance for further chaining/modification

Return type

Query
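
A sketch of a typical query chain (the project and field names are illustrative):

query = (tql.select(
             tql.tag('CONCAT(timeline.id, id)'),
             tql.label("IF(type='conversion', 1.0, 0.0)"),
             tql.numerical('request.bid', name='bid'))
         .from_events('my_project', 'request'))
df = query.dataframe()   # execute and fetch the results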

tql.show_projects()

Shows the available projects

tql.tag(expr, name: str = '_tag') zeenk.tql.column.Column

Creates a tag column from a TQL expression. The expression provided to tag() is expected to return a non-null value for all rows. Typically this expression will uniquely identify the row, which is useful for debugging and tracing datasets later. Uniqueness is not required for the return value, but highly encouraged. Tag columns will automatically be named “_tag”. It is expected that a dataset will have at most one tag column.

Parameters
  • expr – A TQL expression string that returns a unique identifier for the row

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column

tql.update_udf(project_id: int, *function_src)

Updates UDF(s) for the specified Project

Parameters
  • project_id (int) – ID of the Project to upload UDFs to

  • function_src – One or more functions’ source strings

Throws TQLAnalysisException if the function does not exist in the Project yet

tql.upload_udf(project_id: int, *function_src)

Uploads UDF(s) to the specified Project

Parameters
  • project_id (int) – ID of the Project to upload UDFs to

  • function_src – One or more functions’ source strings

Throws TQLAnalysisException if the function already exists in the Project

tql.validate_udf(project_id: int, *function_src)

Validates UDF(s)

Parameters
  • project_id (int) – ID of the Project to validate UDFs on

  • function_src – One or more functions’ source strings

Throws TQLAnalysisException and prints the error message and location if there is any compilation error in the UDF
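
A sketch of the UDF lifecycle (the project ID is illustrative, and the UDF source language depends on your deployment):

src = '...'                    # a UDF function source string
tql.validate_udf(4, src)       # throws on compilation errors
tql.upload_udf(4, src)         # throws if the function already exists
tql.update_udf(4, src)         # throws if the function does not exist yet
tql.delete_udf(4, 'my_udf')    # delete by function name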

tql.weight(expr, name: str = '_weight') zeenk.tql.column.Column

Creates a weight column from a TQL expression. The expression provided to weight() is expected to return a numeric value for all rows. If a row’s value is non-numeric, NaN, or infinite, it will be replaced with the default weight value of 1.0. Weight columns will automatically be named “_weight”. It is expected that a dataset will have at most one weight column.

Parameters
  • expr – A TQL expression string that returns a numeric value

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column

Subpackages

Submodules

tql.column module

class tql.column.Column(name: str, expression: str, type: str)

Bases: object

A column is composed of a selectable TQL expression, a friendly name, and a data type. Columns can be of type: label, weight, tag, or metadata

copy()

Makes a copy of this Column

Returns

A copy of this Column

Return type

Column

debugger(project, theme: str = 'light')

Creates an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs

Parameters
  • project – The project number or name to run the expression against

  • theme (str) – The editor theme, ‘light’ or ‘dark’

Returns

A Jupyter Notebook/Lab widget

get_expression() str

Gets the expression of this Column

Returns

The expression of this Column

Return type

str

get_name() str

Gets the name of this Column

Returns

The name of this Column

Return type

str

get_type() str

Gets the type of this Column

Returns

The type of this Column

Return type

str

is_feature() bool

Gets whether this Column is a FeatureColumn

Returns

Whether this Column is a feature

Return type

bool

is_label() bool

Gets whether this Column is a label Column

Returns

Whether this Column is a label

Return type

bool

is_weight() bool

Gets whether this Column is a weight Column

Returns

Whether this Column is a weight

Return type

bool

json() dict

Gets a Python dict representation of this Column

Returns

A Python dict representation of this Column

Return type

dict

keys() tuple

Gets the keys in the Python dict that represents a Column object

Returns

A tuple of the keys in the Python dict that represents a Column object

Return type

tuple

name(name: str)

Sets the name of this Column

Parameters

name (str) – The name of this Column

Returns

This Column

Return type

Column

type(type: str)

Sets the type of this Column. Available types: ‘CATEGORICAL’, ‘NUMERICAL’, ‘LABEL’, ‘WEIGHT’, ‘TAG’, ‘METADATA’

Parameters

type (str) – One of the available Column types

Returns

This Column

Return type

Column

validate(project_id: int)

Validates that the Column is syntactically correct before execution

Parameters

project_id (int) – The project ID to validate against

Returns

Throws an error if the column is not valid

Return type

None
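
A short sketch of the chainable setters (the project ID is illustrative):

c = tql.col('request.bid').name('bid').type('NUMERICAL')
c.validate(4)   # throws if the expression is not syntactically correct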

class tql.column.ColumnFilters(**kwargs)

Bases: object

A ColumnFilters is a set of filters that can be applied to a Column

json() dict

Gets a Python dict representation of this ColumnFilters

Returns

A Python dict representation of this ColumnFilters

min_cardinality(val: int)

Filter columns that have a cardinality less than val

Parameters

val (int) – The minimum cardinality

Returns

This ColumnFilters

Return type

ColumnFilters

min_label_sum(val)

Filter label columns that have a sum less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_negative_count(val: int)

Filter negative columns that have a count less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_negative_sum(val: int)

Filter negative columns that have a sum less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_negative_weighted_count(val: int)

Filter negative weighted columns that have a count less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_positive_count(val: int)

Filter positive columns that have a count less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_positive_sum(val: int)

Filter positive columns that have a sum less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_positive_weighted_count(val: int)

Filter positive weighted columns that have a count less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_total_count(val: int)

Filter columns that have a total count less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_total_weighted_count(val: int)

Filter weighted columns that have a total count less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters

min_weighted_label_sum(val: int)

Filter weighted label columns that have a sum less than val

Parameters

val (int) – The minimum value

Returns

This ColumnFilters

Return type

ColumnFilters
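
A minimal sketch of building filters by chaining (assuming the no-argument constructor; the threshold values are illustrative):

f = (tql.column.ColumnFilters()
     .min_cardinality(2)
     .min_total_count(100))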

class tql.column.FeatureColumn(name: str, expression: str, type: str, filters: Optional[dict] = None)

Bases: tql.column.Column

A FeatureColumn is a Column of type numerical or categorical

When a Column is defined as a categorical() or numerical() FeatureColumn, the following additional filters are available: global_min_total_count, apply_charset_filter, drop_empty_rows, expand_numerical_feature, drop_numerical_zero_features, etc.

copy()

Makes a copy of the FeatureColumn

Returns

A copy of this FeatureColumn

Return type

FeatureColumn

filter(filters: Optional[dict] = None, **kwargs)

Adds filter(s) to the FeatureColumn

Parameters

filters (dict) – The filter(s) for the FeatureColumn

Returns

This FeatureColumn

Return type

FeatureColumn

get_filters() dict

Gets a Python dict of the FeatureColumn’s filters

Returns

A Python dict of this FeatureColumn’s filters

Return type

dict

json() dict

Gets a Python dict representation of this FeatureColumn

Returns

A Python dict representation of this FeatureColumn

Return type

dict

keys() tuple

Gets the keys in the Python dict that represents this FeatureColumn object

Returns

A tuple of the keys in the Python dict that represents this FeatureColumn object

Return type

tuple
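
For example, adding a filter to an existing FeatureColumn (the filter key is illustrative):

fc = tql.numerical('request.bid', name='bid').filter({'min_total_count': 100})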

tql.column.categorical(expr, name: Optional[str] = None, filters: Optional[dict] = None) tql.column.FeatureColumn

Creates a categorical feature column from a TQL expression, optionally providing name and filters

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

  • filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters

Returns

A FeatureColumn object to be provided to select(...)

Return type

FeatureColumn

tql.column.col(c: any, name: Optional[str] = None, type: Optional[str] = None, filters: Optional[dict] = None)

Creates a column from a TQL expression or given input, optionally providing name, type, and filters. The first argument can be a variety of formats, including:

  • classes or subclasses of type Column

  • dictionary containing keys for name, expression, and type

  • raw strings, which will be interpreted as TQL expressions.

The created column is by default unnamed, and will be assigned a name if used in a TQL select(...) statement. The default type of columns is ‘METADATA’.

Parameters
  • c (any) – A TQL expression string, column object, dictionary, list, tuple, or FeatureColumn object

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

  • type (str) – Optionally provide a type for the column. If not provided, the column type will be METADATA

  • filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters

Returns

A Column or FeatureColumn object to be provided to select(...)

tql.column.label(expr, name: str = '_label') tql.column.Column

Creates a label column from a TQL expression. The expression provided to label() is expected to return a numeric value for all rows. If a row’s value is non-numeric, NaN, or infinite, it will be replaced with the default label value of 0.0. Label columns will automatically be named “_label”. It is expected that a dataset will have at most one label column.

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column

tql.column.metadata(expr, name: Optional[str] = None) tql.column.Column

Creates a metadata column from a TQL expression. Metadata columns will return the expression values “as is”, meaning they will not be post-processed with charset filtering or expansion of numerical columns. Metadata columns are also not subject to column filters. Metadata columns are the default column type and are often the correct choice for arbitrary datasets that are not specifically intended to be consumed by an ML training package.

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column

tql.column.numerical(expr, name: Optional[str] = None, filters: Optional[dict] = None) tql.column.FeatureColumn

Creates a numerical feature column from a TQL expression, optionally providing name and filters

Parameters
  • expr – A TQL expression string

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

  • filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters

Returns

A FeatureColumn object to be provided to select(...)

Return type

FeatureColumn

tql.column.spaces(string: str, number: int = 2, pad: str = ' ')

Strip leading and trailing newlines and pad newlines with ‘number’ of ‘pad’ characters.

tql.column.tag(expr, name: str = '_tag') tql.column.Column

Creates a tag column from a TQL expression. The expression provided to tag() is expected to return a non-null value for all rows. Typically this expression will uniquely identify the row, which is useful for debugging and tracing datasets later. Uniqueness is not required for the return value, but highly encouraged. Tag columns will automatically be named “_tag”. It is expected that a dataset will have at most one tag column.

Parameters
  • expr – A TQL expression string that returns a unique identifier for the row

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column

tql.column.weight(expr, name: str = '_weight') tql.column.Column

Creates a weight column from a TQL expression. The expression provided to weight() is expected to return a numeric value for all rows. If a row’s value is non-numeric, NaN, or infinite, it will be replaced with the default weight value of 1.0. Weight columns will automatically be named “_weight”. It is expected that a dataset will have at most one weight column.

Parameters
  • expr – A TQL expression string that returns a numeric value

  • name (str) – Optionally provide a name for the column. If not provided a default name will be assigned

Returns

A Column object to be provided to select(...)

Return type

Column

tql.column_utils module

tql.column_utils.parse_feature(s: str) tuple

Parse a feature string into its categorical and numerical components. For example, parse_feature(‘key:1.0’) produces (‘key’, 1.0)

Parameters

s (str) – A feature string

Returns

A tuple of (categorical_value, numerical_value)

Return type

tuple

tql.columnset module

tql.columnset.constant()

Used in select(…) statements to select the FeatureColumn “1.0” with name “constant”

Returns

A tuple with the FeatureColumn “1.0” with name “constant”

Return type

tuple

tql.columnset.event_metadata() tuple

To be used in select(…) statements for returning timeline.id, id, datetime, and type

Returns

A tuple of Columns

Return type

tuple

tql.columnset.event_time() tuple

To be used in select(…) statements for returning timestamp and duration

Returns

A tuple of Columns

Return type

tuple

tql.columnset.history_value(project_id=None, name='', filter_type='request', event_fields='impression', custom_value='', custom_function='', days=7, offset='0', cumulative=True, recent_k=None, oldest_k=None, aggregation='COUNT', custom_agg='', weight='1', rate=False, return_value='COALESCE(value,0)', bins='[0,1,2,3]', feature_type='', filters={}, type='INCREMENTALITY', verbose=False)

Aggregate over a history of events by:
  1. filtering to a specific type and time window of events,

  2. extracting a field value for each event,

  3. summarizing the events’ values using an aggregation function,

  4. (optional) applying daily-rate transformations,

  5. (optional) and/or applying binning transformations.

Parameters
  • name (str) – Name of the feature set. If name=’’ (default), hash the inputs to create a name.

  • filter_type (str) – Filter timeline.events to only events with type = '{filter_type}', given as a Python string, e.g., 'request'.

  • event_fields (list of str) – Create features that extract event_type.field for each field in event_fields (e.g., request.impression extracted using string "impression" or list ["impression", "timestamp"]).

  • custom_value (str) – An optional custom inline/lambda expression used to extract arbitrary functions of values instead of event_type.field. For example, IF(GET_PROPERTY(x,'request.impression'),5,0) will return 5 for all records with truthy values of request.impression. Use x or ${@x} to reference the inline/lambda function variable.

  • custom_function (str) – An optional custom inline/lambda expression to apply to the extracted values from event_type.field or custom_value. For example, COALESCE(x,0) will replace all null values with zeros. Use x or ${@x} to reference the inline/lambda function variable. custom_function may be unnecessary or redundant if custom_value is used.

  • days (list) – A list of at least two histogram knots/bin edges, e.g., [0, 1.5, 3, 7] in units of days. A scalar is also allowed, implying a list with zero prepended: 7 means [0,7].

  • offset (float/str) – Number of days to shift the edges away from the sample time. Typically small quantities such as 60 seconds = 60/(24*3600).

  • cumulative (boolean) – Should the histograms start at days[0] (cumulative=True) or days[d-1] (cumulative=False)?

  • recent_k=None – Input k = 0,1,2,… to truncate the time-filtered event list to just the k most recent events within the days time window

  • oldest_k=None – Input k = 0,1,2,… to truncate the time-filtered event list to just the k oldest events within the days time window

  • aggregation (str) – Aggregation function to apply to the event_fields values: COUNT, AVG, SUM, MIN/AG_MIN, MAX/AG_MAX, MODE, IS_NULL, RECENT, OLD, RECENT_#, OLD_#, CUSTOM. To ignore nulls, these additional aggregations can be used: COUNT_NS, AVG_NS, SUM_NS, MIN_NS/AG_MIN_NS, MAX_NS/AG_MAX_NS, MODE_NS.

  • custom_agg (str) – If aggregation=='CUSTOM', use this custom expression on events_value to define the aggregate value to return. For example, AVG(MAP(events_value, (x) -> EXP(x))) would average the exponential of each event’s value.

  • weight – (Testing) Event-level weight expression. Defaults to 1.

  • rate (boolean) – True transforms the post-aggregation value to a daily rate based on the difference in days bins.

  • return_value (str) – An optional custom final transformation to apply to post-aggregation and post-rate value. For example, if you would like to return a log transformation of the value, LOG(value). This will be applied prior to bins (if enabled).

  • bins (str) – Bin the aggregate value using bins, e.g., [0,1,2,3]. If empty, skip binning and return the value as-is.

  • feature_type (str) – Defaults to empty ''; if binning is enabled this becomes “CATEGORICAL”, otherwise it defaults to “NUMERICAL”. Can explicitly override the defaults by setting this to “NUMERICAL” or “CATEGORICAL”.

  • filters (dict) – Feature filters to apply to all of the features.

  • verbose (boolean) – Print out each feature’s name as it is created by the for-loops over the list-compatible inputs event_fields, aggregation, and days.

Returns

List of history features.

Return type

list
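
A minimal sketch (values are illustrative) counting impressions on request events over the last 1 and 7 days:

feats = tql.columnset.history_value(
    filter_type='request',
    event_fields='impression',
    days=[0, 1, 7],
    aggregation='COUNT',
    bins='')           # skip binning and return numerical values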

tql.columnset.kernelize(features, name: Optional[str] = None, treatment_filter_event: Optional[str] = None, event_group_where_event: Optional[str] = None, intended_action: Optional[list] = None, actual_action: Optional[list] = None, treatment_model: Optional[str] = None, kernel_types: Optional[list] = None, opportunity_conf: Optional[zeenk.tql.opportunity_conf.OpportunityConf] = None, kernel_parameters: Optional[str] = None, kernel_distribution: Optional[str] = None, filters: Optional[dict] = None, parse_kv: bool = False)

Takes a list of opportunity-level features based on user/opportunity/treatment/proxy-outcome-level fields and creates a transformed feature with SUM_OPPORTUNITIES() accumulating the effects across each opportunity. These “kernelized” features are for use in a Causmos timeline-based causal model where each observation record in a dataset is an outcome or a potential outcome (e.g., a moment in time when an outcome could have occurred). For example, we can power dataset generation using a treatment-propensity model to reduce the potential bias if treatment and the outcome are correlated. Apply this function to each list of features to be kernelized. There are several types of transformations: KD (‘treatment’), BKD (‘baseline’), GKD (‘ghost’), and NKD (‘nonrandom’).

Parameters
  • features (list) – List of col() to be kernelized (KD,BKD,GKD,NKD). Examples: ["1.0", "IF(1.0,'L','R')"], [{"name":"constant", "expression":"1.0", "type":"NUMERICAL"}]. Columns can be “CATEGORICAL” or “NUMERICAL”. Each categorical expression creates an expansion of features; each numerical expression multiplies/reweights the opportunity’s contribution to the SUM_OPPORTUNITIES() sum.

  • name (str) – Base string for the feature sets. Will append type of kernel and prepend “w” to denote non-incrementality features for BKD, GKD, and NKD. If name=’’ (default), just return the list of features with the default prefix: AKD_..., wBKD_....

  • treatment_filter_event (str) – Column that defines whether a treatment opportunity resulted in treatment. For example, was the advertiser’s impression shown after bidding? In a sample dataset, request.impression is the relevant field.

  • event_group_where_event (str) – Second where condition within the passed EventGroup (currently this is set to the output of filtering such as dedupe, but should be made more explicit).

  • intended_action – List of numerical/categorical expressions defining the intended/optimal decisions, e.g., bid_amount or eligible_treatment_groups.

  • actual_action – List of numerical/categorical expressions defining the actual action/decision taken, e.g., IF(ghost_bid, 0, bid_amount) or assigned_treatment_group.

  • treatment_model (str) – String for the deployed/published treatment prediction model (GKD, NKD-only). In practice, this will be the win-rate model based on the leaves.

  • kernel_types (list) – List of kernels types to include in the feature set. Subset of ['KD','BKD','GKD','NKD'].

  • opportunity_conf – An OpportunityConf() object that contains the defaults for opportunity_filter, kernel_parameters, and kernel_distribution.

  • kernel_parameters (str) – Calibration for kernel.days in the kernel parameters. Arrays of kernel features are created with suffixes using values such as seconds (s), minutes (m), hours (h), and days (d) in combination with a number, such as '15m,4h,3d': 'name_feature' becomes 'name_feature-15m', etc.

  • kernel_distribution (str) – Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), halfnormal (h, hnorm), and halflogistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetrictriangular (s, stri), normal (n, norm), and logistic (o, log). Time-independent constant kernel (c, const) is also useful for various use cases such as static models in order to accumulate all opportunities and treatments for each outcome.

  • filters (dict) – column filters to apply to all of the features.

  • parse_kv (boolean) – Try to parse ‘NUMERICAL’ input features as key-value pairs ‘k:v’. This has extra overhead, but allows more complex mixed ‘categorical:numerical’ input features.

Returns

A list of feature sets for each kernel_type, each containing a list of kernelized features.

Return type

list

tql.columnset.random_partition(partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.8, 0.2], names: list = ['train', 'test'])

Use the arguments to create a random partition TQL expression like col("MD5_PARTITIONS(timeline.id, 'my hashing seed', [0.008, 0.002], ['train','test'])"). This can be used as a column or with .partition_by().

Parameters
  • partition_var (str) – Single-line TQL subexpression

  • seed (str) – String to serve as the seed for the hash partitioning

  • shares (list) – List of the relative shares of each partition. Relative shares do not need to sum to one

  • names (list) – List of the names for each partition

Returns

A metadata column with the random_partition expression for use in .partition_by() or as a column

Return type

Column

tql.demo_projects module

tql.demo_projects.build_acheron_1(force: bool = False)

Builds Project 1: Synthetic data from the original Acheron simulator. Loads the noumena-public dataset for 2000 users, 5 requests per user. Code for generating Acheron data is not included in the TQL product; Acheron was a purely statistical data generator.

tql.demo_projects.build_acheron_2(force: bool = False)

Builds Project 2: Synthetic data from the original Acheron2 simulator. Same as old projects 2 and 99. Loads the noumena-public dataset for 200 users over 10 days. Acheron2 is the original name for Lethe; this dataset is equivalent to Lethe data built using the “reasonably_rich” config. This dataset has many features but was designed to exploit the configuration options of the simulator and memorialize realistic traffic distributions rather than to set up specific incrementality behaviors.

tql.demo_projects.build_lethe_3(force: bool = False)

Builds Project 3: Synthetic data from the Lethe simulator with treatment-level “ghosting” every 2 hours per user. Runs the Lethe simulation for the config “demo_set”, a config with the following populations:

  1. A set of very active users who rarely convert

  2. A set of users who convert well but are not at all influenced by ads

  3. A set of users who visit due to ads but do not convert due to them

  4. A set of highly incremental users

The population each user belongs to is captured in the “group” feature, but is influenced by the other demographic features. Additionally, there is an ad stock difference in incrementality based on Ad Size.

The idea is that anything short of an incremental conversion model will make inefficient decisions here.

Time-activity patterns are based on reasonably_rich, which drew them from RTB auction logs. The auction is always-win.

tql.demo_projects.build_lethe_4(force: bool = False)

Builds Project 4: Synthetic data from the Lethe simulator with user-level randomization. Runs the Lethe simulation for the config “demo_set”, a config with the following populations:

  1. A set of very active users who rarely convert

  2. A set of users who convert well but are not at all influenced by ads

  3. A set of users who visit due to ads but do not convert due to them

  4. A set of highly incremental users

The population each user belongs to is captured in the “group” feature, but is influenced by the other demographic features. Additionally, there is an ad stock difference in incrementality based on Ad Size.

The idea is that anything short of an incremental conversion model will make inefficient decisions here.

Time-activity patterns are based on reasonably_rich, which drew them from RTB auction logs. The auction is always-win.

tql.demo_projects.create_acheron2_time_series(force: bool = False)

Builds Acheron 2 TimeSeries - request, conversion, user

tql.demo_projects.create_acheron_time_series(force: bool = False)

Creates the Acheron TimeSeries - request, conversion, user

tql.demo_projects.create_lethe_time_series(data_url, force: bool = False)

Builds Lethe TimeSeries - bid, activity, user

tql.demo_projects.get_project1_files_dir()
tql.demo_projects.get_project2_files_dir()
tql.demo_projects.get_project3_files_dir()
tql.demo_projects.get_project4_files_dir()

tql.demo_projects.rebuild_demo_projects(force: bool = False)

Rebuilds all demo projects

Parameters

force (bool) – If the projects already exist, overwrite the data

tql.expression_debugger module

tql.expression_debugger.debugger(project, expression: str = '', theme: str = 'light')

Create an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs

Parameters
  • project – The Project ID or name to run the expression against

  • expression (str) – The initial value of the expression, if any

  • theme (str) – The editor theme, ‘light’ or ‘dark’

Returns

A Jupyter Notebook/Lab widget

tql.expression_utils module

tql.expression_utils.add_lambda(expr, varname='x', project_id=None)

Add leading inline ‘varname.’ to attribute variables in a string expression or list of string expressions.

Note: Expressions like '${type}' should return '${@x.type}' but instead return ${x.type}. For now, do not reference values using the deprecated delimiter '${type}'; just use 'type'.

tql.expression_utils.as_variable(expr, varname) str

Save an expression to a DSL variable, e.g., ‘varname=expression;’

Parameters
  • expr (str) – The expression to transform and save to variable ‘varname’

  • varname (str) – The variable name which the expression will be saved to

Returns

expression with the final line set equal to varname

Return type

str

Example:

as_variable("var1=2000; ${@var1}", 'input_name')
# Returns "var1=2000; input_name=${@var1};"
tql.expression_utils.remove_lambda(expr, varname='x') str

Remove leading inline @?varname. from variables in a string expression or list of string expressions. Regex handles many cases outlined in test_remove_lambda().

Note: Expressions like ${x.var} should be returned unchanged but become ${var} when the inline variable collides with one of the input record types.

tql.expression_utils.snippet(string: str, length: int = 20, postfix: str = '...')

Truncate ‘string’ to ‘length’ and append ‘postfix’.

tql.expression_utils.spaces(string: str, number: int = 2, pad: str = ' ')

Strip leading and trailing newlines and pad newlines with ‘number’ of ‘pad’ characters.

tql.expression_utils.to_list(x)

If x is not a list, make it a list

tql.expression_utils.validate_dynamics(dynamics)

Validates that dynamics is a dict and has valid scale, shape, epsilon, and filters

tql.expression_utils.validate_scale(scale)

Validate the dynamic scale format.

tql.expression_utils.which_distribution(distribution)

Format the distribution name to use the ‘short’ version that SUM_OPPORTUNITIES() accepts.

tql.function_doc module

class tql.function_doc.FunctionDoc(function_name=None, project_id=None)

Bases: object

tql.function_doc.find_function(pattern: str, project_id: Optional[int] = None, descriptions: bool = False) list

Returns DSL function names matching a regexp pattern

Parameters
  • pattern (str) – The pattern to search for

  • project_id (int) – The Project ID

  • descriptions (bool) – Also search description fields

Returns

A list of FunctionDoc

Return type

list

tql.function_doc.function_usage(function_name: str, project_id: Optional[int] = None)

Shows help on a DSL function (or returns an info dictionary)

Parameters
  • function_name (str) – The name of the function to get usage

  • project_id (int) – The Project ID

Returns

The function documentation

tql.opportunity_conf module

class tql.opportunity_conf.OpportunityConf(opportunity_filter_expressions: list, kernels: str = '5m,15m,1h,4h,1d,3d,7d', decay_function: str = 'exp', sum_opportunities_epsilon: float = 1e-09)

Bases: object

An OpportunityConf object is used to configure the dataset extractor for incrementality datasets when building features that use the DSL function SUM_OPPORTUNITIES().

When modeling the relationship between the time series of treatment opportunities and outcomes, assertions about timing and the shapes of those relationships must be made. For example, are opportunities typically followed by an increase in the number (or value) of outcomes due to the causal effect of treatment or merely due to temporal correlations (e.g., “activity bias” or other spurious form of selection bias)? Are opportunities preceded by a “run-up” in outcomes due to outcomes being a requirement for opportunities (e.g., retargeting, upselling, etc.)?

We define opportunities by asserting the time or events that represent an “opportunity to treat.” Then, when building features to model the treatment effects and control for sources of bias, we can accumulate the contributions of each opportunity and treatment across CATEGORICAL and NUMERICAL features in accordance with a hypothesized effect distribution shape. We do not have to know the shape in advance, but we have to establish boundaries on the shape by defining:

  • What is an opportunity?

  • What is the hypothesized time range relevant to the effect of the opportunity?

  • What set of basis shapes or distributions should we use to build up a mixture distribution of the dynamic effects/relationships between opportunities, treatments, and outcomes?

The parameters of this class encode the answers to these questions to facilitate computationally efficient strategies for modeling with SUM_OPPORTUNITIES().

See also: nanml.dsl.usage(‘SUM_OPPORTUNITIES’).

Extraction settings for incrementality datasets. OpportunityConf.SPEC has reasonable defaults which can be overridden:

Parameters
  • opportunity_filter_expressions (list) – [] - A list of opportunity filter expressions that define the opportunities to treat a user.

  • kernels (str) – ex: “5m,15m,1h” - Specify a window for effects stemming from opportunities. In the case of the exponential decay_function, this value controls the rate. Examples of valid inputs: ‘15m’, ‘2h’, ‘7d’ (15 minutes, 2 hours, 7 days). Multiple kernels are specified as a single string ‘15m, 2h, 7d’. Numbers must be integers or decimals. Letters are case insensitive. See also the PARSE_KERNELS() function. Each kernel can optionally set a distribution to override the default decay_function. For example, ‘15m-u’ designates a 15-minute scale parameter on a uniform distribution.

  • decay_function (str) – ex: “exp” - Effects from opportunities may take a particular shape. Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), half-normal (h, hnorm), and half-logistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetric triangular (s, stri), normal (n, norm), and logistic (o, log).

  • sum_opportunities_epsilon (float) – ex: 0.0 - Set a tolerance on the magnitude of the distribution’s effect below which the feature’s value will be rounded to zero to increase sparsity and improve computational performance. Larger values increase computational efficiency at the potential cost of bias from omitting small effects through round-off.
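
A minimal sketch of constructing a configuration (the filter expression is illustrative; assumes OpportunityConf is imported from tql.opportunity_conf):

conf = OpportunityConf(
    opportunity_filter_expressions=["type='request'"],
    kernels='15m,4h,3d',
    decay_function='exp')
# Or, for the common case of matching a single event type:
conf = OpportunityConf.for_type('request')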

describe()

Get a description of the object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the OpportunityConf.

classmethod for_type(event_type: str)

A very common case is to want the treatment filter expression to match a specific event type. This builds and returns an appropriate OpportunityConf object.

get_decay_function()

Gets the decay function from this OpportunityConf

get_kernels()

Gets the kernels from this OpportunityConf

get_opportunity_filter_expressions()

Gets the filters from this OpportunityConf

get_sum_opportunities_epsilon()

Gets the epsilon from this OpportunityConf

json() dict

Gets a dictionary representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns

The current configuration instance as a dict

Return type

dict

tql.query module

class tql.query.Query

Bases: object

The primary objective of TQL is to use the Query API to extract machine-learning-ready ResultSets from Timeline data. Each row in a ResultSet is an event on a timeline, and each column is a piece of data extracted from the event using the TQL Expression Language.

abort()

Request this Query be aborted

static abortById(query_id)

Request Query with the given ID be aborted

Parameters

query_id – The ID of the Query to be aborted

copy()

Returns a copy of this Query object

dataframe(print_payloads=False, limit=None)

An alias for results().dataframe(). If there are multiple partitions, the rows from all partitions will be concatenated together.

Returns

The results as a pandas DataFrame

Return type

DataFrame

debugger(theme: str = 'light')

Creates an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs

Parameters

theme (str) – The editor theme, ‘light’ or ‘dark’

Returns

A Jupyter Notebook/Lab widget

describe()

Gets a description of the query object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the query.

Returns

A description of this Query object

downsample_by(sample_rate: float = 1, pos_sample_rate: float = 1, pos_sampling_seed: str = 'label_downsample', neg_sample_rate: float = 1, neg_sampling_seed: str = 'label_downsample', key_expression: str = 'CONCAT(id, timeline.id)', salt: str = '', reweight: bool = True, max_records: Optional[int] = None, neg_pos_ratio: Optional[float] = None, interactive: Optional[bool] = None)

Downsample data specified by a query

Parameters
  • sample_rate – sample rate for all records

  • pos_sample_rate – sample rate for positive records

  • pos_sampling_seed – seed for positive sampling expression

  • neg_sample_rate – sample rate for negative records

  • neg_sampling_seed – seed for negative sampling expression

  • key_expression – key for generating the downsampling expression

  • salt – seed for generating the downsampling expression

  • max_records – an approximate maximum number of records to return. If max_records is set, the interactive flag must also be provided to get an accurate estimate

  • neg_pos_ratio – Desired negative-to-positive ratio which typically should be between 1 and 10 to maximize statistical performance of model estimation. neg_pos_ratio=1 yields balanced samples, whereas neg_pos_ratio=3 has 2x as much data but only 33% lower variance and neg_pos_ratio=10 has 5.5x as much data but only (roughly) 50% lower variance than a balanced sample.

  • interactive – specify whether the query is interactive; must be provided if max_records is set to get an accurate estimate

  • reweight – indicate whether to adjust positive/negative record weights

Returns

Query with a downsampling where clause

Return type

Query
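
For example, targeting a 3:1 negative-to-positive ratio with reweighting (a sketch):

q = query.downsample_by(neg_pos_ratio=3.0, reweight=True)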

event_var(name: str, expr: str)

Defines a single precomputed event-level variable that can be retrieved with the expression language function EVENT_VAR('name'). These variables will only be computed once per event and can be reused across multiple features as a speed optimization.

Parameters
  • name (str) – An event variable name to use

  • expr (str) – An expression to be pre-computed

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

event_vars(vars: dict)

Defines a set of precomputed event-level variables that can be retrieved with the expression language function EVENT_VAR('name'). These variables will only be computed once per event, and can be reused across multiple features as a speed optimization.

Parameters

vars (dict) – A dictionary of event variables to precompute

Returns

The current TQL Query instance for further chaining/modification

Return type

Query
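
A sketch of precomputing a variable and reusing it across features:

q = query.event_var('bid', 'request.bid')
# Reference the precomputed value in column expressions:
bid_feature = tql.numerical("EVENT_VAR('bid')", name='bid')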

external_timelines(timelines: list)

Instead of selecting from the pre-built timelines for the project specified in the .from() clause, process the query ONLY against the supplied example timelines. This is useful for unit testing or developing simple examples.

Example timelines:

timelines = [{
    'id': 'timeline_id1',
    'events': [{
        'id': 1,
        'timestamp': 1587772800000,
        'type': 'request',
        'request': {
            'bid': 2.0,
            'impression': 1
        }
    }]
}]
Parameters

timelines (list) – A list of dictionaries representing timeline objects to run the query against

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

format(format)

Specify a format for the output of this query. This operator is only applicable for non-interactive queries where the results will be written to disk.

Parameters

format – One of parquet, csv, or json

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

from_events(identifier, *types)

Selects from the given timelines, emitting one row per event. Identifier can either be a project id or project name. Optionally a list of event types can be provided. For example, if your timelines are composed of bid, activity, and user events, then .from_events(id, ‘user’) will select only user events from your timelines. This from clause is useful if you wish to build a dataset from your timeline events, and is the most common ‘from’ clause used when constructing machine learning datasets.

Parameters
  • identifier – Project ID or name

  • *types – One or more event types to select from

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

from_timelines(identifier)

Selects from the given timelines, emitting one row per timeline. Identifier can either be a project id or project name. Functionally, this is accomplished by injecting one sampled (fake) event per timeline with timestamp at epoch time 0, filtering out all other events. This from clause is useful to compute a table of summary statistics per timeline, such as “how many click events has each user had in the last 28 days”.

Parameters

identifier – Project ID or name

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

from_union(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)

Adds a sampling configuration object to the from_events() clause of this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.

Parameters
  • sampling (SamplingConf) – A SamplingConf object to append

  • append (bool) – Append the sampling configuration to the query instead of replacing it

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

get_columns() list

Returns the Columns from the TQL query’s select() statement

Returns

A list with a copy of the Column objects currently stored in this Query

Return type

list

get_event_vars() dict

Gets the event_vars of this Query if any exist

Returns

The event_vars of this Query

Return type

dict

get_external_timelines() list

Returns the external timelines configured for this query, or None

Returns

A list of timeline dictionaries, or None

Return type

list

get_format() str

Gets the output format (one of parquet, csv, or json) for this Query. This operator is only applicable for non-interactive queries where the results will be written to disk.

Returns

The format of the current query

Return type

str

get_global_vars() dict

Gets the global variables from this Query if any exist

get_id() int

Gets the ID of this Query

Returns

The ID of this Query

Return type

int

get_interactive() bool

Gets whether this Query is interactive

Returns

Whether this Query is interactive

Return type

bool

get_label_column()

Gets the label Column if it exists

Returns

The first label Column

Return type

Column

get_limit() tuple

Gets the row limit and the timeline limit from this Query

Returns

The row_limit and the timeline limit

Return type

tuple

get_opportunities()

Returns the OpportunityConf object in the current query

Returns

OpportunityConf object in the current query

Return type

OpportunityConf

get_options() dict

Gets the options for this Query

Returns

The options for this Query

Return type

dict

get_partition_by() str

Gets the partition key expression for this Query

Returns

The expression that this Query was partitioned by

Return type

str

get_project()

Gets the Project with which this Query is associated

Returns

The Project

Return type

Project

get_project_id() int

Gets the Project’s ID with which this Query is associated

Returns

The Project’s ID

Return type

int

get_row_limit() int

Gets the row limit of this Query

Returns

The row limit

Return type

int

get_sampling() list

Returns the SamplingConf object(s) in the current query

Returns

The SamplingConf object(s) in the current query

Return type

list

get_timeline_limit() int

Gets the timeline limit from this Query

Returns

The timeline limit

Return type

int

get_timeline_sample_rate()

Gets the timeline sample rate

Returns

The timeline sample rate

get_timeline_vars() dict

Gets the timeline vars of this Query if any exist

Returns

The timeline vars

Return type

dict

get_vars() dict

Gets a dictionary of global_vars, timeline_vars, and event_vars

Returns

A dictionary of global_vars, timeline_vars, and event_vars

Return type

dict

get_weight_column()

Gets the weight Column if it exists

Returns

The first weight Column

Return type

Column

get_where()

Gets the where filters of this Query

Returns

The where filters of this Query

global_var(name: str, expr: str)

Defines a single precomputed global variable that can be retrieved with the expression language function GLOBAL_VAR('name'). These variables will only be computed once per query and can be reused across multiple rows and multiple features per row as a speed optimization. The global variable ‘timeline_stats’ is pre-defined for all queries as a dictionary/map object with the following keys: [min_timestamp, max_timestamp, timeline_count, event_count, event_min_timestamp, event_max_timestamp, event_count_min, event_count_max, event_counts_by_type, time_series_min_max_timestamps]

Parameters
  • name (str) – A variable name to use

  • expr (str) – An expression to be pre-computed

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

global_vars(vars)

Defines a set of precomputed global variables that can be retrieved with the expression language function GLOBAL_VAR('name'). These variables will only be computed once per query, and can be reused across multiple rows and multiple features per row as a speed optimization. The global variable ‘timeline_stats’ is pre-defined for all queries as a dictionary/map object with the following keys: [min_timestamp, max_timestamp, timeline_count, event_count, event_min_timestamp, event_max_timestamp, event_count_min, event_count_max, event_counts_by_type, time_series_min_max_timestamps]

Parameters

vars (dict) – A dict of global variables to precompute

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

json() dict

Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation

Returns

The current TQL Query as a dict

limit(rows=None, timelines=None)

Imposes a limit on the number of timelines iterated over, or rows returned. If timelines=<N> is specified, then at most N timelines will be evaluated in the query. If rows=<N> is specified, then at most N rows will be returned in the results. .limit(5) is a typical usage of this operator, and is the closest analog to the traditional SQL operator. For interactive queries, imposing a timeline limit is not required, as specifying limit(rows=N) is sufficient to shortcut the evaluation. However, for asynchronous queries executed in a distributed environment such as Spark, it is often useful to specify both timeline and row limits, as adding a timeline limit can result in less data being read from disk, and hence faster execution.

Parameters
  • rows (int) – The maximum number of rows to return

  • timelines (int) – The maximum number of timelines to evaluate.

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

classmethod load(query_id)

Loads a Query object from the given ID

Parameters

query_id (int) – The ID of the Query to load

Returns

The Query from the given ID

Return type

Query

opportunities(filters, distribution=None, scale=None, epsilon=None)

Defines events that constitute “opportunity to treat” for this dataset. For example, in the online advertising space, opportunities to treat would constitute all bid request events, i.e. “we had an opportunity to buy an ad on the user.” In a clinical drug trial, opportunity would constitute all persons who apply for a drug trial study. Filters will be evaluated as booleans (see docs on the .where() operator). Optionally a distribution function can also be supplied, which defines the hypothesized shape of the causal effect of the opportunities over time. Short, medium, or long string representations of the distribution functions are accepted:

'c', 'const', 'constant'
'e', 'exp', 'exponential'
'l', 'lap', 'laplace'
'u', 'unif', 'uniform'
'r', 'rect', 'rectangular'
't', 'tri', 'triangular'
's', 'stri', 'symmetrictriangular'
'h', 'hnorm', 'halfnormal'
'n', 'norm', 'normal'
'l', 'hlog', 'halflogistic'
'o', 'log', 'logistic'

Also optionally, a string list of time scales can be supplied, defining the points in time at which the distribution should be evaluated. Any numerical value of time is allowed, at the scale of seconds (s), minutes (m), hours (h), and days (d). For example, ‘.5m,4h,1d,3d’ is a valid scale string, corresponding to 30 seconds, 4 hours, 1 day, and 3 days.

Also optionally, an epsilon can be provided, which defines a minimal precision below which a feature should be rounded down to zero, e.g., 1e-8 would lead a feature value of 5e-9 to be returned as zero.

Parameters
  • filters – One or more TQL filter expressions

  • distribution – The decay function shape (default exponential)

  • scale – The time scale over which to evaluate the causal effect of opportunities (default 5m,1h,4h,1d,3d,7d)

  • epsilon – The numerical epsilon to use (default 1E-6)

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

options(max_columns=None, global_min_total_count=None, apply_feature_filters=None, apply_charset_filter=None, drop_empty_rows=None, expand_numerical_features=None, drop_numerical_zero_features=None, throw_expression_errors=None, debug_expressions=None, fill_na=None, numerical_feature_precision=None, numerical_feature_epsilon=None, drop_constant_feature_columns=None, fix_column_names=None, allow_invalid_column_expressions=None)

Specify options for this query.

Parameters
  • max_columns – the maximum number of columns to return; the top N columns are computed from the count of non-null values.

  • global_min_total_count – require that at least this many rows contain a non-null value, or drop the column.

  • apply_feature_filters – flag on/off applying feature filtering (default true)

  • apply_charset_filter – flag on/off cleaning the values of numerical and categorical feature columns (default true)

  • drop_empty_rows – if on, remove rows that have no non-null values. (default false)

  • expand_numerical_features – expand numerical feature arrays into multiple columns. (default false)

  • drop_numerical_zero_features – drop numerical feature columns that contain all zeros. (default false)

  • drop_constant_feature_columns – drop numerical feature columns that are constant (default false)

  • throw_expression_errors – use “fail fast” behavior with invalid expressions (default false)

  • debug_expressions – return extended debugging information about TQL expression evaluation with the result set.

  • fill_na – replace non-numeric values in numerical features with 0.0.

  • numerical_feature_precision – how many decimal places to return.

  • numerical_feature_epsilon – abs(val) < eps will be rounded down to zero.

  • fix_column_names – specify if backend should rename duplicate column names. (default true)

  • allow_invalid_column_expressions – return errors about invalid column expressions

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

partition_by(expr=None, partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.008, 0.002], names: list = ['train', 'test'])

Specify a partition-key TQL expression for this dataset. Partition-key TQL expressions MUST return a non-null value for every row, and the value should have low cardinality, since a separate block of data (i.e. folder of files for asynchronous queries) will be created for each distinct partition value. One common use of partitioning is to generate reproducible train/test splits of your data for machine learning training. For example, .partition_by(“IF(MD5_MOD(timestamp, 10) > 8, ‘train’, ‘test’)”) creates a reproducible 80/20 split of your data, which can be read separately by your training and testing routines. It is reproducible because the split is computed based on an attribute of the event, such as timestamp. Another common use of partition_by() is to split your dataset into logical groupings such as by day: .partition_by(‘date(timestamp)’) to be read with reporting/BI software.

If the ‘expr’ argument is not set, the other arguments (partition_var, seed, shares, names) are used to create an expression like Expression("MD5_PARTITIONS(timeline.id, 'the dead sea', [0.008, 0.002], ['train','test'])"), which is then set as the ‘partition_key_expression’.

Parameters
  • expr – the partition key expression to use.

  • partition_var – Single-line TQL subexpression.

  • seed – String to serve as the seed for the hash partitioning.

  • shares – List of the relative shares of each partition.

  • names – List of the names for each partition.

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

sampling(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)

Adds a sampling configuration object to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.

Parameters
  • sampling – A SamplingConf object to append

  • append (bool) – Append the sampling configuration to the query instead of replacing it.

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

select(*cols, append: bool = False)

Defines the columns on this TQL query object. Unless append=True, this replaces any pre-existing columns that were set with previous calls to select().

Parameters
  • cols – One or more TQL column objects or expressions

  • append (bool) – Append cols to the query’s columns instead of replacing the query’s columns

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

show()

Shows the ResultSet dataframe as a pretty printed table on stdout.

state()

Gets the current state of this Query

submit(interactive=True, wait=True, print_payloads=False, analyze=False, spark=None)

Executes this query in Icarus (via HTTP request) and fetches the results. If the interactive flag is true, this query will be executed immediately and the returned ResultSet will have .rows(). This is suitable for interactive development of TQL expressions, as the results will be immediate. If the interactive flag is false, the query will be executed asynchronously, possibly in a distributed computing environment such as a Spark cluster. In this case, the ResultSet will not have .rows(), but will instead have .data_path(), a directory to which the query results will be written upon successful execution.

Parameters
  • interactive (bool) – Whether this query should be executed synchronously or asynchronously

  • wait (bool) – Wait for the dataset compile to complete and show progress/status. Otherwise just make the webservice call and return.

  • print_payloads (bool) – Prints the request and response payloads from Icarus

  • analyze (bool) – Display extended ResultSetMetrics

  • spark (bool) – Whether to submit the query to a Spark Cluster

Returns

A ResultSet object

Return type

ResultSet

timeline_sample_rate(sample_rate)

Sets the sample rate on the Timelines

Parameters

sample_rate – A floating point value in the interval (0, 1]

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

timeline_var(name: str, expr: str)

Defines a single precomputed timeline-level variable that can be retrieved with the expression language function TIMELINE_VAR('name'). These variables will only be computed once per timeline and can be reused across multiple rows and multiple features per row as a speed optimization.

Parameters
  • name (str) – a variable name to use

  • expr (str) – an expression to be pre-computed

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

timeline_vars(vars: dict)

Defines a set of precomputed timeline-level variables that can be retrieved with the expression language function TIMELINE_VAR('name'). These variables will only be computed once per timeline, and can be reused across multiple rows and multiple features per row as a speed optimization.

Parameters

vars (dict) – A dictionary of timeline variables to precompute

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

treatments(filters, scale=None, distribution=None, epsilon=None)

An alias for opportunities(…).

udf(*udfs, append: bool = False)

Attaches one or more UDFs (user-defined functions) to a Query

Parameters
  • udfs – One or more UDFs; each may be either a string or a TQL Column object

  • append (bool) – Append the UDFs to the query’s existing UDFs instead of replacing them

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

union(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)

Adds a sampling configuration object to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.

Parameters
  • sampling (SamplingConf) – A SamplingConf object to append

  • append (bool) – Append the sampling configuration to the query instead of replacing it.

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

validate()

Validates that the Query is runnable before execution

Returns

Nothing; throws an error if the query is not valid

Return type

None

vars(global_vars: Optional[dict] = None, timeline_vars: Optional[dict] = None, event_vars: Optional[dict] = None)

Convenience method for setting all of a query’s precomputed variables in one call, via the optional arguments ‘global_vars’, ‘timeline_vars’, and/or ‘event_vars’.

Can be used in conjunction with .get_vars():

>>> q = select("1").from_events(1).limit(1).vars(global_vars={'foo':'bar'})
>>> q.get_vars()
{'global_vars': {'foo': 'bar'}, 'timeline_vars': {}, 'event_vars': {}}
Parameters
  • global_vars (dict) – A dictionary of key-values with valid inputs to .global_vars()

  • timeline_vars (dict) – A dictionary of key-values with valid inputs to .timeline_vars()

  • event_vars (dict) – A dictionary of key-values with valid inputs to .event_vars()

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)

Visualize a timeline dataframe from query execution.

Parameters
  • rows – the maximum number of rows to return

  • timelines – the maximum number of timelines to evaluate.

  • down_sampling – down sampling expression

  • size (str) – Name of column to use to encode each record’s marker’s size

  • shape (str) – Name of column to use to encode each record’s marker’s shape

  • opacity (str) – Name of column to use to encode each record’s marker’s opacity

  • jitter (float) – Amount of ‘jitter’ (random vertical offset to improve visibility)

  • size_rescale (float) – Multiplier to adjust the marker size. Defaults to 7

  • shape_dict (dict) –

    Dictionary to transform a column into shapes, e.g.:

    {'outcome':'star', 'opportunity':'square', 'treatment':'triangle',
     'default':'circle'}
    

  • opacity_rescale (float) – Multiplier to adjust the marker opacity. None defaults to rescaling by the column’s min/max to 0 and 1

  • include_style_columns (bool) – Should the event data display include style columns?

  • ignore_nulls (bool) – Do not show key-value pairs that have null as the value

Returns

A timeline viewer widget

where(*filters, append=False)

Add one or more where clauses to the query, filtering the timeline events to only those that satisfy the given conditions. The given TQL expression(s) are evaluated as booleans. Non-boolean return values are evaluated in the following manner:

  • numerical values greater than 0 evaluate to true, otherwise false

  • the strings ‘true’ and ‘false’ (case-insensitive) are evaluated to their boolean equivalents

  • other non-numerical, non-boolean return values are considered true if the value is not null

In the case of multiple where clauses, only the last one is honored (unless append=True), i.e. the following constructs are logically equivalent: .where('cond1').where('cond2') == .where('cond2'). Multiple filters in a single where clause are combined with AND, i.e. the following constructs are logically equivalent: .where('cond1', 'cond2') == .where('cond1 AND cond2'). Use the following TQL syntax to define an OR clause: .where('cond1 OR cond2')

Parameters
  • filters – One or more TQL columns or expressions that return a boolean

  • append (bool) – Append the where conditions to the query instead of replacing

Returns

The current TQL Query instance for further chaining/modification

Return type

Query

tql.query.load_query(id: int) tql.query.Query

Loads the query from the given ID

Parameters

id (int) – The ID of the query

Returns

A new Query instance

Return type

Query

tql.query.select(*cols) tql.query.Query

Create a new query object from one or more columns. Columns can be defined as TQL expression strings, or wrapped using one of the provided TQL column functions label(), weight(), tag(), categorical(), numerical(), or metadata(). Typically select(...) will immediately be followed by .from_timelines(...) or .from_events(...) during query construction.

Parameters

cols – One or more TQL column objects or expressions

Returns

A new tql.query.Query instance for further chaining/modification

Return type

Query

tql.query_templates module

tql.query_templates.query(project='lethe4')

A simple example query with event_metadata() and a limit of 10

Parameters

project – The Project or project ID to run the Query against

Returns

An example Query object

Return type

Query

tql.resultset module

class tql.resultset.Partition(json, columns, data_format, interactive=True)

Bases: object

A Partition is a subset of a ResultSet. If the ResultSet was partitioned, there will be a list of Partitions in the ResultSet. If the ResultSet was not partitioned, there will only be the default Partition in the ResultSet.

columns() list

Gets the Columns from the Query

Returns

A list of Columns in the Query

Return type

list

data_path() str

Gets the data path of this Partition. This is only applicable for non-interactive Queries.

Returns

The path to the data

Return type

str

dataframe(limit=None)

Gets this Partition’s Dataframe

Parameters

limit (int) – The maximum number of rows to return

Returns

A list of rows from a Spark Dataframe

Return type

list

name() str

Gets the name of this Partition

Returns

The name of this Partition

Return type

str

pandas_dataframe()

Loads this Partition’s data into a pandas dataframe and returns it

Returns

A Pandas Dataframe

positive_row_count() int

Gets the number of positive rows in this Partition

Returns

The number of positive rows in this Partition

Return type

int

row_count() int

Gets the number of rows in the Partition

Returns

The number of rows in this Partition

Return type

int

spark_dataframe(infer: bool = True)

Loads this Partition’s data into a Spark dataframe and returns it

Parameters

infer (bool) – Use inferSchema with CSV output

Returns

A Spark Dataframe

class tql.resultset.ResultSet(json, query=None, execution_time=None, analyze=False, interactive=True)

Bases: object

A ResultSet is the result of a Query. It holds data as columns and rows.

column_names() list

Gets the column names from this ResultSet as a list of strings

Returns

A list of column names as strings

Return type

list

columns()

Gets the columns from this ResultSet as a list of Columns

Returns

A list of Columns

dataframe(limit=None)

Gets a dataframe from the ResultSet

Parameters

limit (int) – The upper limit of rows to return

Returns

A dataframe of rows and columns

default_partition()

Gets the default partition from this ResultSet

Returns

The default Partition

Return type

Partition

get_id()

Gets the ID of this ResultSet

Returns

The ID of this ResultSet

Return type

int

get_query()

Gets the Query that generated this ResultSet

Returns

The Query that generated this ResultSet

Return type

Query

classmethod load(result_set_id, analyze: bool = False)

Loads a new ResultSet from the given ID

Parameters
  • result_set_id (int) – The ResultSet ID

  • analyze (bool) – Whether to return detailed information about the ResultSet

Returns

The ResultSet with the given ID

Return type

ResultSet

metrics()

Gets an HTML representation of metrics about this ResultSet

Returns

An HTML representation of metrics about this ResultSet

pandas_dataframe()

Gets this ResultSet as a Pandas Dataframe

Returns

A Pandas Dataframe

partition(partition_name='_default')

Gets a specific partition from this ResultSet

Parameters

partition_name (str) – The name of the Partition

Returns

The partition with the given name

Return type

Partition

partition_names()

Gets a list of partition names from this ResultSet

Returns

A list of partition names

Return type

list

partitions()

Gets a list of Partitions from this ResultSet

Returns

A list of partitions

Return type

list

positive_row_count()

Gets the number of positive rows in this ResultSet

Returns

The number of positive rows

Return type

int

refresh()

Populates the non-interactive ResultSet fields with execution details

row_count()

Gets the number of rows in the ResultSet

Returns

The number of rows in the ResultSet

Return type

int

spark_dataframe(infer=True)

Gets this ResultSet as a Spark Dataframe

Parameters

infer (bool) – Whether to use Spark to infer the datatypes

Returns

A Spark Dataframe

class tql.resultset.ResultSetMetrics(metrics)

Bases: object

column_metrics(pandas=False)
column_value_metrics(pandas=False)

Getter for column_value_metrics from ResultSetMetrics.ColumnMetrics

debug(pandas=False)
dropped_expanded_columns(pandas=False)
event_runtime(pandas=False)
expression_compile_errors(pandas=False)
expression_timing_stats(pandas=False)
get_timelines_processed()
json()
query_summary(pandas=False)
timeline_runtime(pandas=False)
tql.resultset.load_resultset(id: int) tql.resultset.ResultSet

Loads the ResultSet with the given ID

Parameters

id (int) – The ID of the ResultSet to be loaded

Returns

The ResultSet

Return type

ResultSet

tql.sampling module

class tql.sampling.SamplingConf(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False)

Bases: object

Allows you to manually override parameters during sample generation. See generate_events() and generate_importance_events().

Variables and functions both get registered in the expression context prior to the execution of the sampling expression and event attribute expressions.

Parameters
  • sample_generator_expression (str) –

    ex. "[MILLIS(TO_DATETIME('2021-12-01'))]". An expression providing a list of values at which to generate samples for each timeline. Typically, these will be a list of timestamps. The following example creates midnight timestamps from 2021-12-01 through 2021-12-10 for the timestamps at which we will sample:

    days = 10; millis_in_day = 24*3600*1000;
    min_date = MILLIS(TO_DATETIME('2021-12-01')); iter = RANGE(0, days+1);
    MAP(iter, (x) -> min_date + millis_in_day * x)
    

A simple example would be "[0]" (the timestamp 1970-01-01 00:00:00.000) if you just want to generate a single record per timeline where dynamics are not relevant for any columns.

  • functions (list) – ex: ["function bar() { 2.0 + 2.0 }"]. User-Defined Functions (UDFs) that are attached to the sample_generator_expression and can be used like any other function.

  • variables (dict) – ex: {"foo" : "3.3"}. Key-value pairs that define variables to parameterize advanced forms of sampling, such as Importance Sampling; they can be referenced in the sampling/event attribute expressions as foo.

  • attribute_expressions (dict) –

    ex: {'timestamp': '${@sample}'}. A dictionary of key-value pairs to replace dot-notation expressions such as ‘id’, ‘timestamp’, and ‘type’. This is flexible enough to accommodate the development of more complex forms of sampling over time, space, and distributions of conversions (e.g., conversion name/type).

    • id: Identifier for each sample, e.g., MONOTONICALLY_INCREASING_ID().

    • timestamp: An expression to define the timestamp associated with the sample generated. If sample_generator_expressions generates timestamps, this field may be specified as sample.

    • type: User provided string to identify the type of sample generated (e.g., 'my_manual_sample').

  • inherit_attributes (bool) – If the generator expression is a list of events, rather than just timestamps, initialize attribute_expressions with all of the properties of the event via GET_PROPERTY().

describe()

Gets a description of the object. In Jupyter notebooks, this returns a set of HTML tables. In a regular Python interactive shell or script, it defaults to the string representation of the SamplingConf.

get_attribute_expressions()

Gets the attributes of this SamplingConf

get_functions()

Gets the functions of this SamplingConf

get_inherit_attributes()

Gets if this SamplingConf should inherit attributes

get_sample_generator_expression()

Gets the generator expression of this SamplingConf

get_variables()

Gets the variables of this SamplingConf

json() dict

Returns a dictionary representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns

The current settings instance as a dict

Return type

dict

set_attribute_expressions_type(expr_type)

Sets the type of the attribute expressions

Parameters

expr_type – The type of the attribute expression

tql.sampling.generate_events(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False) tql.sampling.SamplingConf

Manual sampling is similar to importance sampling but allows the user to manually override parameters such as the list of sample events (e.g., timestamps at which sampling occurs) and other properties of each generated sample event. See generate_importance_events() for an advanced use case.

tql.sampling.generate_importance_events(num_samples: float = -1.0, min_ts: Optional[str] = None, max_ts: Optional[str] = None, time_shift_factor: float = 0.0, fraction_uniform_time: float = 0.0, sampling_distribution: str = 'exponential', sampling_events_expression: str = 'FILTER(timeline.events, (x) -> x.type=null)', sampling_kernels: str = '5m,15m,1h,4h,1d,3d,7d') tql.sampling.SamplingConf

Importance sampling can be used to retain modeling unbiasedness (avoid introducing selection bias when sampling records) while still increasing the number of records where the modeling is most interesting. For example, when modeling the causal effect of a treatment on an outcome, we would like to ensure that most of our records (whether ‘positive’, an outcome event, or ‘negative’, a non-outcome sampling event) are in the vicinity of a treatment opportunity or an outcome. By so doing, we increase the model’s statistical power at deciphering the relationship between the two. In contrast, if most outcomes happen during only 10% of the sample time period, most of our observations will be during the “boring” portion of the timeline when no events of interest are occurring.

generate_importance_events helps you configure what time periods are “interesting.” You configure how many records to randomly sample for each timeline, which timestamps or events you want to increase your sampling around, the distribution (shape and scale) around each event from which you would like to randomly sample, and the probability you would like to draw from the background uniform distribution (e.g., a random point in the timeline).

In summary, one way to generate negative (non-outcome) records would be to simply draw uniformly between the start and end of the timeline’s observation window. However, we can improve upon that by instructing the extractor to generate these negatives using a configurable time-importance-weighted sampling methodology around the times of timeline events.

tql.timelines module

class tql.timelines.Project

Bases: object

A Project is an object that contains: a name, a description, UDFs, and links to TimeSeries. It is a conglomeration of one or more sets of timeseries data and timeline data. Projects with IDs under 10 are reserved as demonstration projects.

static all()

Gets all Projects

build_timelines(wait=True)

Builds this Project’s timelines

delete()

Deletes this Project from the database

description(description)

Sets a description of the Project

from_timeseries(*ts, append=False)

Creates a Project from one or more TimeSeries

Returns

A new project based on the TimeSeries

Return type

Project

get_annotations()

Gets this Project’s column annotations

get_description()

Gets a description of the Project

get_id()

Gets the ID of this Project

get_metadata()

Gets this Project’s metadata

get_name()

Gets the name of this Project

get_status()

Gets this Project’s status: one of SUCCESS, PENDING, COMPILING, or RUNNING

get_timelines()

Gets this Project’s Timelines

get_timeseries()

Gets this Project’s TimeSeries

get_timeseries_names()

Gets this Project’s TimeSeries names

json() dict

Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns

The current Project as a dict

Return type

dict

save()

Saves the metadata (description, etc.) of this Project to the database

class tql.timelines.TimelineStats(json)

Bases: object

Stats about the corresponding timeline

get_event_count()
get_event_counts_by_type()
get_largest_timeline_size()
get_max_timestamp()
get_min_timestamp()
get_smallest_timeline_size()
get_timeline_count()
get_timeseries_stats()
class tql.timelines.Timelines(json, timeseries=None, annotations=None)

Bases: object

A Timeline is a collection of events identified by a common join key (for example, a logged-in user_id or a web cookie) and sorted by timestamp

get_annotation(attr)
get_attributes()
get_attributes_table()
get_created()
get_data_path(for_spark=False)
get_example()
get_id()
get_sample_data_path(for_spark=False)
get_schema()
get_state()
get_statistics()
get_updated()
pandas_dataframe()
spark_dataframe()
tql.timelines.create_project(name: str, or_update: bool = False) tql.timelines.Project

Creates a new project. This project will require further configuration before timelines can be created. If the project already exists and you wish to update it, use load_project(name), or give the or_update=True option here.

Parameters
  • name (str) – The name of the Project

  • or_update (bool) – Whether to update the Project

Returns

The Project that has just been created

Return type

Project

tql.timelines.drop_project(name_or_id, if_exists: bool = False)

Deletes the project by name or ID

Parameters
  • name_or_id – The name (str) or ID (int) of the Project

  • if_exists (bool) – Only drop the Project if it exists

tql.timelines.load_project(project_identifier, fail_if_not_found: bool = True) tql.timelines.Project

Loads the specified project by name, or throws TQLAnalysisException if not found

Parameters
  • project_identifier – The name (str) or ID (int) of the Project

  • fail_if_not_found (bool) – Whether to throw an exception if the Project is not found

Returns

The Project with the specified name or ID

Return type

Project

tql.timelines.show_projects()

Shows the available projects

tql.timeseries module

class tql.timeseries.TimeSeries(name)

Bases: object

The mapping between your data source and TQL is called a TimeSeries. It is important to note that a TimeSeries object in TQL is not the data itself, but merely a specification of how to read the data out of storage into TQL.

analyze(columns=None, row_limit=-1, sample_rate=1.0, top_n_limit=5, print_json=False, wait=True)

Analyzes this TimeSeries

annotate_columns(annotations)

Sets the column annotations of this TimeSeries

from_files(data_path, format=None, has_csv_header=False)

Sets the files to be used for this TimeSeries

Parameters
  • data_path – The path of data to load

  • format – What format the data is in

  • has_csv_header (bool) – Whether the csv data has column headers

Returns

This TimeSeries for further chaining

Return type

TimeSeries

from_sql(*sql_stmts)

Sets the SQL statements for this TimeSeries

Parameters

*sql_stmts – SQL statements

Returns

This TimeSeries for further chaining

Return type

TimeSeries

from_url(data_url)

Sets the URL for this TimeSeries

Parameters

data_url (str) – The URL at which the data is located

Returns

This TimeSeries for further chaining

Return type

TimeSeries

get_annotation(col_name: str)

Gets the specific annotation of this TimeSeries by name

get_annotations()

Gets the column annotations of this TimeSeries

get_data_path()

Gets the data path of this TimeSeries

get_duration_col()

Gets the duration column of this TimeSeries

get_example()
get_format()

Gets the format of this TimeSeries data

get_metadata()

Gets the metadata of this TimeSeries

get_name()

Gets the name of this TimeSeries

get_sql_statements()

Gets the SQL statements of this TimeSeries

get_timeline_id_col()

Gets the timeline id column of this TimeSeries

get_timestamp_col()

Gets the timestamp column of this TimeSeries

identified_by(timeline_id_col, timestamp_col=None, duration_col=None)

Sets the columns by which this TimeSeries is identified

Parameters
  • timeline_id_col – The name of the id column

  • timestamp_col – The name of the timestamp column

  • duration_col – The name of the duration column

Returns

This TimeSeries for further chaining

Return type

TimeSeries

json() dict

Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.

Returns

The current TQL TimeSeries as a dict

metadata(meta)

Sets the metadata for this TimeSeries

Parameters

meta – The metadata

Returns

This TimeSeries for further chaining

Return type

TimeSeries

pandas_dataframe()

Gets a Pandas Dataframe of this TimeSeries

Returns

A Pandas Dataframe

read_option(key, value)

Sets the Spark read options for this TimeSeries

Parameters
  • key – The read option key

  • value – The read option value

Returns

This TimeSeries for further chaining

Return type

TimeSeries

spark_dataframe()

Gets a Spark Dataframe of this TimeSeries

Returns

A Spark Dataframe

validate()

Validates this TimeSeries

visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)

Shows a Jupyter Notebook/Labs visualizer widget

class tql.timeseries.TimeSeriesStats(json)

Bases: object

Statistics about the corresponding TimeSeries

static load(id)
tql.timeseries.create_timeseries(name: str) tql.timeseries.TimeSeries

Creates a TimeSeries with the given name

Parameters

name (str) – The name of the TimeSeries

Returns

A new TimeSeries with the given name

Return type

TimeSeries

tql.udf module

class tql.udf.UDF(udfs)

Bases: object

A User-Defined Function, written in the TQL expression language and associated with a Project

get_udf_map()

Gets the defined UDFs as a Python dictionary

tql.udf.delete_udf(project_id: int, function_name: str)

Deletes a UDF for a specific Project by function name

Parameters
  • project_id (int) – ID of the Project to delete the UDF from

  • function_name (str) – Name of the UDF to be deleted

tql.udf.get_udf(project_id: int, function_name: str) tql.udf.UDF

Retrieve a specific UDF from a Project

Parameters
  • project_id (int) – ID of the Project to retrieve the UDF from

  • function_name (str) – Name of the UDF to retrieve

Returns

The UDF, or throws a TQLAnalysisNotFound exception if not found

Return type

UDF

tql.udf.list_udfs(project_id: int) tql.udf.UDF

Retrieves all UDFs defined for a Project

Parameters

project_id (int) – ID of the Project to retrieve UDFs from

Returns

All existing UDFs for the Project

Return type

UDF

tql.udf.update_udf(project_id: int, *function_src)

Updates UDF(s) for the specified Project

Parameters
  • project_id (int) – ID of the Project to upload UDFs to

  • function_src – The source string(s) of one or more functions

Throws TQLAnalysisException if the function does not exist in the Project yet

tql.udf.upload_udf(project_id: int, *function_src)

Uploads UDF(s) to the specified Project

Parameters
  • project_id (int) – ID of the Project to upload UDFs to

  • function_src – The source string(s) of one or more functions

Throws TQLAnalysisException if the function already exists in the Project

tql.udf.validate_udf(project_id: int, *function_src)

Validates UDF(s)

Parameters
  • project_id (int) – ID of the Project to validate UDFs on

  • function_src – The source string(s) of one or more functions

Throws TQLAnalysisException, printing the error message and location, if there is any compilation error in the UDF

tql.validation module

exception tql.validation.TQLAnalysisException(reason)

Bases: Exception

General TQL Exception

get_expr_compile_errors()

Get expression compile errors from the backend

get_message()

Get the exception’s error message

get_readable_expr_compile_errors()

Return a pretty expression error message

exception tql.validation.TQLAnalysisNotFound(msg)

Bases: Exception

TQL not found error

get_message()

Get the exception’s error message

tql.validation.format_expression_error(expression, position, error, name=None)

Pretty print an expression error message

Parameters
  • expression – Expression with error

  • position – Location of the error as a (line, column) tuple

  • error – Error message

  • name – Column name for the expression with error

Returns

The expression with formatted error

Return type

str

tql.validation.hide_trace_back()

This method can be used to suppress the traceback for TQLAnalysisException only

tql.validation.show_trace_back()

This method can be used to show traceback for TQLAnalysisException

tql.validation.validate_no_exception(lambda_fcn, msg=None)

Run the given lambda and rethrow any Exceptions as a TQLAnalysisException

tql.validation.validate_tql_iterable_type(thing, type)

Check if the thing is an iterable type

tql.validation.validate_tql_state(cond, msg: str)

Throws an exception if cond is not truthy

tql.validation.validate_tql_type(thing, type, msg: Optional[str] = None)

Makes sure that ‘thing’ is of the given type, or throws an exception with the given message.

tql.visualizer module

class tql.visualizer.TimeSeriesVisualizer(query=None, timeseries=None)

Bases: object

Visualizes a TimeSeries dataframe

visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)

Creates an interactive visualizer object from the given input, rendered as a jupyter extension.

Parameters
  • rows – the maximum number of rows to return

  • timelines – the maximum number of timelines to evaluate

  • down_sampling – down sampling expression

  • size (str) – Name of column to use to encode each record’s marker’s size

  • shape (str) – Name of column to use to encode each record’s marker’s shape

  • opacity (str) – Name of column to use to encode each record’s marker’s opacity

  • jitter (float) – Amount of ‘jitter’ (random vertical offset to improve visibility)

  • size_rescale (float) – Multiplier to adjust the marker size. Defaults to 7

  • shape_dict (dict) –

    Dictionary to transform a column into shapes, e.g.:

    {'outcome':'star', 'opportunity':'square', 'treatment':'triangle',
     'default':'circle'}

  • opacity_rescale (float) – Multiplier to adjust the marker opacity. None defaults to rescaling by the column’s min/max to 0 and 1.

  • include_style_columns (bool) – Should the event data display include style columns?

  • ignore_nulls (bool) – Do not show key-value pairs that have null as the value

Returns

Timeline viewer widget