tql package
- tql.categorical(expr, name: Optional[str] = None, filters: Optional[dict] = None) zeenk.tql.column.FeatureColumn
Creates a categorical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.col(c: any, name: Optional[str] = None, type: Optional[str] = None, filters: Optional[dict] = None)
Creates a column from a TQL expression or given input, optionally providing name, type, and filters. The first argument can be a variety of formats, including:
classes or subclasses of type Column
dictionary containing keys for name, expression, and type
raw strings, which will be interpreted as TQL expressions.
The created column is by default unnamed, and will be assigned a name if used in a TQL
select(...)
statement. The default type of columns is ‘METADATA’.
- Parameters
c (any) – A TQL expression string, column object, dictionary, list, tuple, or FeatureColumn object
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
type (str) – Optionally provide a type for the column. If not provided, the column type will be METADATA
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A Column or FeatureColumn object to be provided to
select(...)
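Example (a minimal sketch; the request.bid field comes from the sample timeline shown later in this document):
col('request.bid')                                       # raw TQL expression, default type METADATA
col('request.bid', name='bid', type='NUMERICAL')         # explicitly named and typed
col({'name': 'bid', 'expression': 'request.bid', 'type': 'NUMERICAL'})   # dict form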
- tql.constant()
Used in select(…) statements to select the FeatureColumn “1.0” with name “constant”
- Returns
A tuple with the FeatureColumn “1.0” with name “constant”
- Return type
tuple
- tql.create_project(name: str, or_update: bool = False) zeenk.tql.timelines.Project
Creates a new project. This project will require further configuration before timelines can be created. If the project already exists and you wish to update it, use
load_project(name)
, or give the or_update=True option here.
- Parameters
name (str) – The name of the Project
or_update (bool) – Whether to update the Project
- Returns
The Project that has just been created
- Return type
Project
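Example (the project name is illustrative):
project = create_project('my_project')
project = create_project('my_project', or_update=True)   # update it if it already exists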
- tql.create_timeseries(name: str) zeenk.tql.timeseries.TimeSeries
Creates a TimeSeries with the given name
- Parameters
name (str) – The name of the TimeSeries
- Returns
A new TimeSeries with the given name
- Return type
TimeSeries
- tql.debugger(project, expression: str = '', theme: str = 'light')
Create an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
project – The Project ID or name to run the expression against
expression (str) – The initial value of the expression, if any
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
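Example (assuming the demo project 'lethe4' exists):
widget = debugger('lethe4', expression='timeline.id', theme='dark')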
- tql.delete_udf(project_id: int, function_name: str)
Deletes a UDF for a specific Project by function name
- Parameters
project_id (int) – ID of the Project to delete the UDF from
function_name (str) – Name of the UDF to be deleted
- tql.describe_project(project_identifier, fail_if_not_found: bool = True) zeenk.tql.timelines.Project
Loads the specified project by name, or throws TQLAnalysisException if not found
- Parameters
project_identifier – The name (str) or ID (int) of the Project
fail_if_not_found (bool) – Whether to throw an exception if the Project is not found
- Returns
The Project with the specified name or ID
- Return type
Project
- tql.drop_project(name_or_id, if_exists: bool = False)
Deletes the project by name or ID
- Parameters
name_or_id – The name (str) or ID (int) of the Project
if_exists (bool) – Only drop the Project if it exists
- tql.event_metadata() tuple
To be used in select(…) statements for returning timeline.id, id, datetime, and type
- Returns
A tuple of Columns
- Return type
tuple
- tql.event_time() tuple
To be used in select(…) statements for returning timestamp and duration
- Returns
A tuple of Columns
- Return type
tuple
- tql.generate_events(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False) zeenk.tql.sampling.SamplingConf
Manual sampling is similar to importance sampling but allows the user to manually override parameters such as the list of sample events (e.g., timestamps at which sampling occurs) and other properties of each generated sample event. See generate_importance_events() for an advanced use case.
- tql.generate_importance_events(num_samples: float = - 1.0, min_ts: Optional[str] = None, max_ts: Optional[str] = None, time_shift_factor: float = 0.0, fraction_uniform_time: float = 0.0, sampling_distribution: str = 'exponential', sampling_events_expression: str = 'FILTER(timeline.events, (x) -> x.type=null)', sampling_kernels: str = '5m,15m,1h,4h,1d,3d,7d') zeenk.tql.sampling.SamplingConf
Importance sampling can be used to retain modeling unbiasedness (avoid introducing selection bias when sampling records) while still increasing the number of records where the modeling is most interesting. For example, when modeling the causal effect of a treatment on an outcome, we would like to ensure that most of our records (whether ‘positive’, an outcome event, or ‘negative’, a non-outcome sampling event) are in the vicinity of a treatment opportunity or an outcome. By so doing, we increase the model’s statistical power at deciphering the relationship between the two. In contrast, if most outcomes happen during only 10% of the sample time period, most of our observations will be during the “boring” portion of the timeline when no events of interest are occurring.
generate_importance_events helps you configure what time periods are “interesting.” You configure how many records to randomly sample for each timeline, which timestamps or events you want to increase your sampling around, the distribution (shape and scale) around each event from which you would like to randomly sample, and the probability you would like to draw from the background uniform distribution (e.g., a random point in the timeline).
In summary, one way to generate negative (non-outcome) records would be to simply draw uniformly between the start and end of the timeline’s observation window. However, we can improve upon that by instructing the extractor to generate these negatives by a configurable time-importance-weighted sampling methodology around times of timeline events.
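Example (a minimal sketch; the 'request' event type and the 'lethe4' demo project are illustrative):
sampling = generate_importance_events(
    num_samples=10,
    sampling_events_expression="FILTER(timeline.events, (x) -> x.type='request')",
    sampling_kernels='15m,1h,1d')
query = select(event_metadata()).from_events('lethe4').from_union(sampling)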
- tql.get_udf(project_id: int, function_name: str) zeenk.tql.udf.UDF
Retrieve a specific UDF from a Project
- Parameters
project_id (int) – ID of the Project to retrieve the UDF from
function_name (str) – Name of the UDF to retrieve
- Returns
A UDF or throws TQLAnalysisNotFound exception
- Return type
UDF
- tql.kernelize(features, name: Optional[str] = None, treatment_filter_event: Optional[str] = None, event_group_where_event: Optional[str] = None, intended_action: Optional[list] = None, actual_action: Optional[list] = None, treatment_model: Optional[str] = None, kernel_types: Optional[list] = None, opportunity_conf: Optional[zeenk.tql.opportunity_conf.OpportunityConf] = None, kernel_parameters: Optional[str] = None, kernel_distribution: Optional[str] = None, filters: Optional[dict] = None, parse_kv: bool = False)
Takes a list of opportunity-level features based on user/opportunity/treatment/proxy-outcome-level fields and creates a transformed feature with SUM_OPPORTUNITIES() accumulating the effects across each opportunity. These “kernelized” features are for use in a Causmos timeline-based causal model where each observation record in a dataset is an outcome or a potential outcome (e.g., a moment in time when an outcome could have occurred). For example, we can power dataset generation using a treatment-propensity model to reduce the potential bias if treatment and the outcome are correlated. Apply this function to each list of features to be kernelized. There are several types of transformations, described as (KD = ‘treatment’, BKD = ‘baseline’, GKD = ‘ghost’, NKD = ‘nonrandom’).
- Parameters
features (list) – List of col() to be kernelized (KD, BKD, GKD, NKD). Examples: ["1.0", "IF(1.0,'L','R')"] or [{"name":"constant", "expression":"1.0", "type":"NUMERICAL"}]. Columns can be “CATEGORICAL” or “NUMERICAL”. Each categorical expression creates an expansion of features; each numerical expression multiplies/reweights the opportunity’s contribution to the SUM_OPPORTUNITIES() sum.
name (str) – Base string for the feature sets. The type of kernel will be appended, and “w” will be prepended to denote non-incrementality features for BKD, GKD, and NKD. If name='' (default), just return the list of features with the default prefixes: AKD_..., wBKD_..., etc.
treatment_filter_event (str) – Column that defines whether a treatment opportunity resulted in treatment. For example, was the advertiser’s impression shown after bidding? In a sample dataset, request.impression is the relevant field.
event_group_where_event (str) – Second where condition within the passed EventGroup (currently this is set to be the output from filtering such as dedupe, but should be made more explicit).
intended_action – List of numerical/categorical expressions defining the intended/optimal decisions, e.g., bid_amount or eligible_treatment_groups.
actual_action – List of numerical/categorical expressions defining the actual action/decision taken, e.g., IF(ghost_bid, 0, bid_amount) or assigned_treatment_group.
treatment_model (str) – String for the deployed/published treatment prediction model (GKD and NKD only). In practice, this will be the win-rate model based on the leaves.
kernel_types (list) – List of kernel types to include in the feature set. Subset of ['KD','BKD','GKD','NKD'].
opportunity_conf – An OpportunityConf() object that contains the defaults for opportunity_filter, kernel_parameters, and kernel_distribution.
kernel_parameters (str) – Calibration for kernel.days in the kernel parameters. Arrays of kernel features are created with suffixes that combine a number with units of seconds (s), minutes (m), hours (h), or days (d), such as '15m,4h,3d': 'name_feature' becomes 'name_feature-15m', etc.
kernel_distribution (str) – Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), halfnormal (h, hnorm), and halflogistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetrictriangular (s, stri), normal (n, norm), and logistic (o, log). The time-independent constant kernel (c, const) is also useful for various use cases, such as static models, in order to accumulate all opportunities and treatments for each outcome.
filters (dict) – Column filters to apply to all of the features.
parse_kv (boolean) – Try to parse ‘NUMERICAL’ input features as key-value pairs ‘k:v’. This has extra overhead, but supports more complex mixed ‘categorical:numerical’ input features.
- Returns
A list of feature sets for each kernel_type, each containing a list of kernelized features.
- Return type
list
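Example (a minimal sketch; the field, project, and feature names are illustrative):
conf = OpportunityConf.for_type('request')
feature_sets = kernelize(
    [col('1.0', name='constant', type='NUMERICAL')],
    name='base',
    treatment_filter_event='request.impression',
    kernel_types=['KD', 'BKD'],
    opportunity_conf=conf)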
- tql.label(expr, name: str = '_label') zeenk.tql.column.Column
Creates a label column from a TQL expression. The expression provided to label() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default label value of 0.0. Label columns will automatically be named “_label”. It is expected that a dataset will have at most one label column.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.list_udfs(project_id: int) zeenk.tql.udf.UDF
Retrieves all UDFs defined for a Project
- Parameters
project_id (int) – ID of the Project to retrieve UDFs from
- Returns
All existing UDFs for the Project
- Return type
UDF
- tql.load_project(project_identifier, fail_if_not_found: bool = True) zeenk.tql.timelines.Project
Loads the specified project by name, or throws TQLAnalysisException if not found
- Parameters
project_identifier – The name (str) or ID (int) of the Project
fail_if_not_found (bool) – Whether to throw an exception if the Project is not found
- Returns
The Project with the specified name or ID
- Return type
Project
- tql.load_query(id: int) zeenk.tql.query.Query
Loads the query from the given ID
- Parameters
id (int) – The ID of the query
- Returns
A new Query instance
- Return type
Query
- tql.load_resultset(id: int) zeenk.tql.resultset.ResultSet
Loads the ResultSet with the given ID
- Parameters
id (int) – The ID of the ResultSet to be loaded
- Returns
The ResultSet
- Return type
ResultSet
- tql.metadata(expr, name: Optional[str] = None) zeenk.tql.column.Column
Creates a metadata column from a TQL expression. Metadata columns will return the expression values “as is”, meaning they will not be post-processed with charset filtering or expansion of numerical columns. Metadata columns are also not subject to column filters. Metadata columns are the default column type and are often the correct choice for arbitrary datasets that are not specifically intended to be consumed by a ML training package.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.numerical(expr, name: Optional[str] = None, filters: Optional[dict] = None) zeenk.tql.column.FeatureColumn
Creates a numerical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.query(project='lethe4')
A simple example query with event_metadata() and a limit of 10
- Parameters
project – The Project or project ID to run the Query against
- Returns
An example Query object
- Return type
Query
- tql.random_partition(partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.8, 0.2], names: list = ['train', 'test'])
Use the arguments to create a random partition TQL expression like
col("MD5_PARTITIONS(timeline.id, 'my hashing seed', [0.008, 0.002], ['train','test'])")
. This can be used as a column or with .partition_by().
- Parameters
partition_var (str) – Single-line TQL subexpression
seed (str) – String to serve as the seed for the hash partitioning
shares (list) – List of the relative shares of each partition. Relative shares do not need to sum to one
names (list) – List of the names for each partition
- Returns
A metadata column with the random_partition expression for use in .partition_by() or as a column
- Return type
Column
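Example (an 80/20 train/test split keyed on timeline.id; the seed and project are illustrative):
part = random_partition(seed='my hashing seed', shares=[0.8, 0.2], names=['train', 'test'])
query = select(event_metadata(), part).from_events('lethe4')   # or pass it to .partition_by(...)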
- tql.select(*cols) zeenk.tql.query.Query
Create a new query object from one or more columns. Columns can be defined as TQL expression strings, or wrapped using one of the provided TQL column functions: label(), weight(), tag(), categorical(), numerical(), or metadata(). Typically select(...) will immediately be followed by .from_timelines(...) or .from_events(...) during query construction.
- Parameters
cols – One or more TQL column objects or expressions
- Returns
A new tql.query.Query instance for further chaining/modification
- Return type
Query
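Example (a sketch against the sample 'request' events; the expressions are illustrative):
query = (select(event_metadata(),
                label('request.impression'),
                categorical('request.bid', name='bid'))
         .from_events('lethe4', 'request'))
df = query.dataframe()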
- tql.show_projects()
Shows the available projects
- tql.tag(expr, name: str = '_tag') zeenk.tql.column.Column
Creates a tag column from a TQL expression. The expression provided to tag() is expected to return a non-null value for all rows. Typically this expression will uniquely identify the row, which is useful for debugging and tracing datasets later. Uniqueness is not required for the return value, but highly encouraged. Tag columns will automatically be named “_tag”. It is expected that a dataset will have at most one tag column.
- Parameters
expr – A TQL expression string that returns a unique identifier for the row
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.update_udf(project_id: int, *function_src)
Updates UDF(s) for the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function doesn’t exist in the Project yet
- tql.upload_udf(project_id: int, *function_src)
Uploads UDF(s) to the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function already exists in the Project
- tql.validate_udf(project_id: int, *function_src)
Validates UDF(s)
- Parameters
project_id (int) – ID of the Project to validate UDFs on
function_src – One or more function source strings
Throws TQLAnalysisException and prints the error message and location if there is any compilation error in the UDF
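Example of the UDF lifecycle (a sketch; project ID 4 and the function name are illustrative, and the UDF source format is not specified in this document, so the source string is left as a placeholder):
udf_src = "..."              # source string of a function named 'my_udf'
validate_udf(4, udf_src)     # throws if the source fails to compile
upload_udf(4, udf_src)       # throws if 'my_udf' already exists in the Project
update_udf(4, udf_src)       # throws if 'my_udf' does not exist yet
delete_udf(4, 'my_udf')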
- tql.weight(expr, name: str = '_weight') zeenk.tql.column.Column
Creates a weight column from a TQL expression. The expression provided to weight() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default weight value of 1.0. Weight columns will automatically be named “_weight”. It is expected that a dataset will have at most one weight column.
- Parameters
expr – A TQL expression string that returns a numeric value
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
Subpackages
Submodules
tql.column module
- class tql.column.Column(name: str, expression: str, type: str)
Bases:
object
A column is composed of a selectable TQL expression, a friendly name, and a data type. Columns can be of type: label, weight, tag, or metadata
- debugger(project, theme: str = 'light')
Creates an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
project – The project number or name to run the expression against
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
- get_expression() str
Gets the expression of this Column
- Returns
The expression of this Column
- Return type
str
- get_name() str
Gets the name of this Column
- Returns
The name of this Column
- Return type
str
- get_type() str
Gets the type of this Column
- Returns
The type of this Column
- Return type
str
- is_feature() bool
Gets whether this Column is a FeatureColumn
- Returns
Whether this Column is a feature
- Return type
bool
- is_label() bool
Gets whether this Column is a label Column
- Returns
Whether this Column is a label
- Return type
bool
- is_weight() bool
Gets whether this Column is a weight Column
- Returns
Whether this Column is a weight
- Return type
bool
- json() dict
Gets a Python dict representation of this Column
- Returns
A Python dict representation of this Column
- Return type
dict
- keys() tuple
Gets the keys in the Python dict that represents a Column object
- Returns
A tuple of the keys in the Python dict that represents a Column object
- Return type
tuple
- name(name: str)
Sets the name of this Column
- Parameters
name (str) – The name of this Column
- Returns
This Column
- Return type
Column
- type(type: str)
Sets the type of this Column. Available types:
‘CATEGORICAL’, ‘NUMERICAL’, ‘LABEL’, ‘WEIGHT’, ‘TAG’, ‘METADATA’
- Parameters
type (str) – One of the available Column types
- Returns
This Column
- Return type
Column
- validate(project_id: int)
Validates that the Column is syntactically correct before execution
- Parameters
project_id (int) – The project ID to validate against
- Returns
Throws an error if the column is not valid
- Return type
None
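Example (builder-style chaining; name() and type() each return this Column):
c = col('request.bid').name('bid').type('NUMERICAL')
c.get_name()   # 'bid'
c.json()       # Python dict representation of the column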
- class tql.column.ColumnFilters(**kwargs)
Bases:
object
A ColumnFilter is a filter that can be applied to a Column
- json() dict
Gets a Python dict representation of this ColumnFilters
- Returns
A Python dict representation of this ColumnFilters
- min_cardinality(val: int)
Filter columns that have a cardinality less than val
- Parameters
val (int) – The minimum cardinality
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_label_sum(val)
Filter label columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_negative_count(val: int)
Filter negative columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_negative_sum(val: int)
Filter negative columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_negative_weighted_count(val: int)
Filter negative weighted columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_positive_count(val: int)
Filter positive columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_positive_sum(val: int)
Filter positive columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_positive_weighted_count(val: int)
Filter positive weighted columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_total_count(val: int)
Filter columns that have a total count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_total_weighted_count(val: int)
Filter weighted columns that have a total count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_weighted_label_sum(val: int)
Filter weighted label columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
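Example (a sketch; the thresholds are illustrative, and passing the dict from ColumnFilters().json() as the filters argument is an assumption based on the json() method above):
filters = ColumnFilters().min_cardinality(2).min_total_count(10)   # chainable, each returns this ColumnFilters
feature = categorical('request.bid', name='bid', filters=filters.json())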
- class tql.column.FeatureColumn(name: str, expression: str, type: str, filters: Optional[dict] = None)
Bases:
tql.column.Column
A FeatureColumn is a Column of type numerical or categorical.
When a Column is defined as a categorical() or numerical() FeatureColumn, the following additional filters are available: global_min_total_count, apply_charset_filter, drop_empty_rows, expand_numerical_feature, drop_numerical_zero_features, etc.
- copy()
Makes a copy of the FeatureColumn
- Returns
A copy of this FeatureColumn
- Return type
FeatureColumn
- filter(filters: Optional[dict] = None, **kwargs)
Adds filter(s) to the FeatureColumn
- Parameters
filters (dict) – The filter(s) for the FeatureColumn
- Returns
This FeatureColumn
- Return type
FeatureColumn
- get_filters() dict
Gets a Python dict of the FeatureColumn’s filters
- Returns
A Python dict of this FeatureColumn’s filters
- Return type
dict
- json() dict
Gets a Python dict representation of this FeatureColumn
- Returns
A Python dict representation of this FeatureColumn
- Return type
dict
- keys() tuple
Gets the keys in the Python dict that represents this FeatureColumn object
- Returns
A tuple of the keys in the Python dict that represents this FeatureColumn object
- Return type
tuple
- tql.column.categorical(expr, name: Optional[str] = None, filters: Optional[dict] = None) tql.column.FeatureColumn
Creates a categorical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.column.col(c: any, name: Optional[str] = None, type: Optional[str] = None, filters: Optional[dict] = None)
Creates a column from a TQL expression or given input, optionally providing name, type, and filters. The first argument can be a variety of formats, including:
classes or subclasses of type Column
dictionary containing keys for name, expression, and type
raw strings, which will be interpreted as TQL expressions.
The created column is by default unnamed, and will be assigned a name if used in a TQL
select(...)
statement. The default type of columns is ‘METADATA’.
- Parameters
c (any) – A TQL expression string, column object, dictionary, list, tuple, or FeatureColumn object
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
type (str) – Optionally provide a type for the column. If not provided, the column type will be METADATA
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A Column or FeatureColumn object to be provided to
select(...)
- tql.column.label(expr, name: str = '_label') tql.column.Column
Creates a label column from a TQL expression. The expression provided to label() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default label value of 0.0. Label columns will automatically be named “_label”. It is expected that a dataset will have at most one label column.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.column.metadata(expr, name: Optional[str] = None) tql.column.Column
Creates a metadata column from a TQL expression. Metadata columns will return the expression values “as is”, meaning they will not be post-processed with charset filtering or expansion of numerical columns. Metadata columns are also not subject to column filters. Metadata columns are the default column type and are often the correct choice for arbitrary datasets that are not specifically intended to be consumed by a ML training package.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.column.numerical(expr, name: Optional[str] = None, filters: Optional[dict] = None) tql.column.FeatureColumn
Creates a numerical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.column.spaces(string: str, number: int = 2, pad: str = ' ')
Strip leading and trailing newlines and pad newlines with ‘number’ of ‘pad’ characters.
- tql.column.tag(expr, name: str = '_tag') tql.column.Column
Creates a tag column from a TQL expression. The expression provided to tag() is expected to return a non-null value for all rows. Typically this expression will uniquely identify the row, which is useful for debugging and tracing datasets later. Uniqueness is not required for the return value, but highly encouraged. Tag columns will automatically be named “_tag”. It is expected that a dataset will have at most one tag column.
- Parameters
expr – A TQL expression string that returns a unique identifier for the row
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.column.weight(expr, name: str = '_weight') tql.column.Column
Creates a weight column from a TQL expression. The expression provided to weight() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default weight value of 1.0. Weight columns will automatically be named “_weight”. It is expected that a dataset will have at most one weight column.
- Parameters
expr – A TQL expression string that returns a numeric value
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
tql.column_utils module
- tql.column_utils.parse_feature(s: str) tuple
Parse a feature string into its categorical and numerical components. For example, parse_feature(‘key:1.0’) produces (‘key’, 1.0)
- Parameters
s (str) – A feature string
- Returns
A tuple of (categorical_value, numerical_value)
- Return type
tuple
tql.columnset module
- tql.columnset.constant()
Used in select(…) statements to select the FeatureColumn “1.0” with name “constant”
- Returns
A tuple with the FeatureColumn “1.0” with name “constant”
- Return type
tuple
- tql.columnset.event_metadata() tuple
To be used in select(…) statements for returning timeline.id, id, datetime, and type
- Returns
A tuple of Columns
- Return type
tuple
- tql.columnset.event_time() tuple
To be used in select(…) statements for returning timestamp and duration
- Returns
A tuple of Columns
- Return type
tuple
- tql.columnset.history_value(project_id=None, name='', filter_type='request', event_fields='impression', custom_value='', custom_function='', days=7, offset='0', cumulative=True, recent_k=None, oldest_k=None, aggregation='COUNT', custom_agg='', weight='1', rate=False, return_value='COALESCE(value,0)', bins='[0,1,2,3]', feature_type='', filters={}, type='INCREMENTALITY', verbose=False)
- Aggregate over a history of events, by:
filtering to a specific type and time window of events,
extracting a field value for each event,
summarizing the events’ values using an aggregation function,
(optional) applying daily-rate transformations,
(optional) and/or binning transformations.
- Parameters
name (str) – Name of the feature set. If name=’’ (default), hash the inputs to create a name.
filter_type (str) – Filter timeline.events to only events with type = '{filter_type}', e.g., a Python string such as 'request'.
event_fields (list of str) – Create features that extract event_type.field for each field in event_fields (e.g., request.impression extracted using the string "impression" or the list ["impression", "timestamp"]).
custom_value (str) – An optional custom inline/lambda expression to extract arbitrary functions of values instead of using event_type.field. For example, IF(GET_PROPERTY(x,'request.impression'),5,0) will return 5 for all records with truthy values of request.impression. Use x or ${@x} to reference the inline/lambda function variable.
custom_function (str) – An optional custom inline/lambda expression to apply to the values extracted from event_type.field or custom_value. For example, COALESCE(x,0) will replace all null values with zeros. Use x or ${@x} to reference the inline/lambda function variable. custom_function may be unnecessary or redundant if custom_value is used.
days (list) – A list of at least two histogram knots/bin edges, e.g., [0, 1.5, 3, 7], in units of days. A scalar is also allowed, implying a list with zero prepended: 7 means [0,7].
offset (float/str) – Number of days to shift the edges away from the sample time. Typically small quantities such as 60 seconds = 60/(24*3600).
cumulative (boolean) – Should the histograms start at days[0] (cumulative=True) or days[d-1] (cumulative=False)?
recent_k – Input k = 0,1,2,… to truncate the time-filtered event list to just the k most recent events within the days time window.
oldest_k – Input k = 0,1,2,… to truncate the time-filtered event list to just the k oldest events within the days time window.
aggregation (str) – Aggregation function to apply to the event_fields values: COUNT, AVG, SUM, MIN/AG_MIN, MAX/AG_MAX, MODE, IS_NULL, RECENT, OLD, RECENT_#, OLD_#, CUSTOM. To ignore nulls, these additional aggregations can be used: COUNT_NS, AVG_NS, SUM_NS, MIN_NS/AG_MIN_NS, MAX_NS/AG_MAX_NS, MODE_NS.
custom_agg (str) – If aggregation=='CUSTOM', use this custom expression on events_value to define the aggregate value to return. For example, AVG(MAP(events_value, (x) -> EXP(x))) would average the exponential of each event’s value.
weight – (Testing) Event-level weight expression. Defaults to 1.
rate (boolean) – True transforms the post-aggregation value to a daily rate based on the difference in days bins.
return_value (str) – An optional custom final transformation to apply to the post-aggregation and post-rate value. For example, if you would like to return a log transformation of the value, use LOG(value). This is applied prior to bins (if enabled).
bins (str) – Bin the aggregate value using bins, e.g., [0,1,2,3]. If empty, skip binning and return the value.
feature_type (str) – Defaults to empty '': if binning is enabled this becomes “CATEGORICAL”; otherwise it defaults to “NUMERICAL”. The defaults can be explicitly overridden by setting this to “NUMERICAL” or “CATEGORICAL”.
filters (dict) – Feature filters to apply to all of the features.
verbose (boolean) – Print each feature’s name as it is created by the for-loops over the list-compatible inputs field, agg, and days.
- Returns
List of history features.
- Return type
list
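Example (a minimal sketch against the sample 'request' events):
features = history_value(
    filter_type='request',
    event_fields='impression',
    days=[0, 1, 7],           # histogram bin edges, in days
    aggregation='SUM',
    bins='')                  # empty: skip binning and return the numerical value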
- tql.columnset.kernelize(features, name: Optional[str] = None, treatment_filter_event: Optional[str] = None, event_group_where_event: Optional[str] = None, intended_action: Optional[list] = None, actual_action: Optional[list] = None, treatment_model: Optional[str] = None, kernel_types: Optional[list] = None, opportunity_conf: Optional[zeenk.tql.opportunity_conf.OpportunityConf] = None, kernel_parameters: Optional[str] = None, kernel_distribution: Optional[str] = None, filters: Optional[dict] = None, parse_kv: bool = False)
Takes a list of opportunity-level features based on user/opportunity/treatment/proxy-outcome-level fields and creates a transformed feature with SUM_OPPORTUNITIES() accumulating the effects across each opportunity. These “kernelized” features are for use in a Causmos timeline-based causal model where each observation record in a dataset is an outcome or a potential outcome (e.g., a moment in time when an outcome could have occurred). For example, we can power dataset generation using a treatment-propensity model to reduce the potential bias if treatment and the outcome are correlated. Apply this function to each list of features to be kernelized. There are several types of transformations, described as (KD = ‘treatment’, BKD = ‘baseline’, GKD = ‘ghost’, NKD = ‘nonrandom’).
- Parameters
features (list) – List of col() to be kernelized (KD, BKD, GKD, NKD). Examples: ["1.0", "IF(1.0,'L','R')"] or [{"name":"constant", "expression":"1.0", "type":"NUMERICAL"}]. Columns can be “CATEGORICAL” or “NUMERICAL”. Each categorical expression creates an expansion of features; each numerical expression multiplies/reweights the opportunity’s contribution to the SUM_OPPORTUNITIES() sum.
name (str) – Base string for the feature sets. The type of kernel will be appended, and “w” will be prepended to denote non-incrementality features for BKD, GKD, and NKD. If name='' (default), just return the list of features with the default prefixes: AKD_..., wBKD_..., etc.
treatment_filter_event (str) – Column that defines whether a treatment opportunity resulted in treatment. For example, was the advertiser’s impression shown after bidding? In a sample dataset, request.impression is the relevant field.
event_group_where_event (str) – Second where condition within the passed EventGroup (currently this is set to be the output from filtering such as dedupe, but should be made more explicit).
intended_action – List of numerical/categorical expressions defining the intended/optimal decisions, e.g., bid_amount or eligible_treatment_groups.
actual_action – List of numerical/categorical expressions defining the actual action/decision taken, e.g., IF(ghost_bid, 0, bid_amount) or assigned_treatment_group.
treatment_model (str) – String for the deployed/published treatment prediction model (GKD and NKD only). In practice, this will be the win-rate model based on the leaves.
kernel_types (list) – List of kernel types to include in the feature set. Subset of ['KD','BKD','GKD','NKD'].
opportunity_conf – An OpportunityConf() object that contains the defaults for opportunity_filter, kernel_parameters, and kernel_distribution.
kernel_parameters (str) – Calibration for kernel.days in the kernel parameters. Arrays of kernel features are created with suffixes that combine a number with units of seconds (s), minutes (m), hours (h), or days (d), such as '15m,4h,3d': 'name_feature' becomes 'name_feature-15m', etc.
kernel_distribution (str) – Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), halfnormal (h, hnorm), and halflogistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetrictriangular (s, stri), normal (n, norm), and logistic (o, log). The time-independent constant kernel (c, const) is also useful for various use cases, such as static models, in order to accumulate all opportunities and treatments for each outcome.
filters (dict) – Column filters to apply to all of the features.
parse_kv (boolean) – Try to parse ‘NUMERICAL’ input features as key-value pairs ‘k:v’. This has extra overhead, but supports more complex mixed ‘categorical:numerical’ input features.
- Returns
A list of feature sets for each kernel_type, each containing a list of kernelized features.
- Return type
list
- tql.columnset.random_partition(partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.8, 0.2], names: list = ['train', 'test'])
Use the arguments to create a random partition TQL expression like
col("MD5_PARTITIONS(timeline.id, 'my hashing seed', [0.008, 0.002], ['train','test'])")
. This can be used as a column or with .partition_by().
- Parameters
partition_var (str) – Single-line TQL subexpression
seed (str) – String to serve as the seed for the hash partitioning
shares (list) – List of the relative shares of each partition. Relative shares do not need to sum to one
names (list) – List of the names for each partition
- Returns
A metadata column with the random_partition expression for use in .partition_by() or as a column
- Return type
Column
tql.demo_projects module
- tql.demo_projects.build_acheron_1(force: bool = False)
Builds Project 1: Synthetic data from the original Acheron simulator. Loads the noumena-public dataset for 2000 users, 5 requests per user. Code for generating Acheron data is not included in the TQL product; Acheron was a purely statistical data generator.
- tql.demo_projects.build_acheron_2(force: bool = False)
Builds Project 2: Synthetic data from the original Acheron2 simulator. Same as old projects 2 and 99. Loads the noumena-public dataset for 200 users over 10 days. Acheron2 is the original name for Lethe; this dataset is equivalent to Lethe data built using the “reasonably_rich” config. This dataset has a bunch of features but was designed to exploit the configuration options of the simulator and memorialize realistic traffic distributions more than to set up specific incrementality behaviors.
- tql.demo_projects.build_lethe_3(force: bool = False)
Builds Project 3: Synthetic data from the Lethe simulator with treatment-level “ghosting” every 2 hours per user. Runs the Lethe simulation for the config “demo_set”, a config with the following populations:
A set of very active users who rarely convert
A set of users who convert well but are not at all influenced by ads
A set of users who visit due to ads but do not convert due to them
A set of highly incremental users
The population each user belongs to is captured in the “group” feature, but is influenced by the other demographic features. Additionally, there is an ad stock difference in incrementality based on Ad Size.
The idea is that anything short of an incremental conversion model will make inefficient decisions here.
Time-activity patterns are based on reasonably_rich, which drew them from RTB auction logs. The auction is always-win.
- tql.demo_projects.build_lethe_4(force: bool = False)
Builds Project 4: Synthetic data from the Lethe simulator with user-level randomization. Runs the Lethe simulation for the config “demo_set”, a config with the following populations:
A set of very active users who rarely convert
A set of users who convert well but are not at all influenced by ads
A set of users who visit due to ads but do not convert due to them
A set of highly incremental users
The population each user belongs to is captured in the “group” feature, but is influenced by the other demographic features. Additionally, there is an ad stock difference in incrementality based on Ad Size.
The idea is that anything short of an incremental conversion model will make inefficient decisions here.
Time-activity patterns are based on reasonably_rich, which drew them from RTB auction logs. The auction is always-win.
- tql.demo_projects.create_acheron2_time_series(force: bool = False)
Builds Acheron 2 TimeSeries - request, conversion, user
- tql.demo_projects.create_acheron_time_series(force: bool = False)
Creates the Acheron TimeSeries - request, conversion, user
- tql.demo_projects.create_lethe_time_series(data_url, force: bool = False)
Builds Lethe TimeSeries - bid, activity, user
- tql.demo_projects.get_project1_files_dir()
- tql.demo_projects.get_project2_files_dir()
- tql.demo_projects.get_project3_files_dir()
- tql.demo_projects.get_project4_files_dir()
- tql.demo_projects.rebuild_demo_projects(force: bool = False)
Rebuilds all demo projects
- Parameters
force (bool) – If the projects already exist, overwrite the data
tql.expression_debugger module
- tql.expression_debugger.debugger(project, expression: str = '', theme: str = 'light')
Create an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
project – The Project ID or name to run the expression against
expression (str) – The initial value of the expression, if any
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
tql.expression_utils module
- tql.expression_utils.add_lambda(expr, varname='x', project_id=None)
Add leading inline ‘varname.’ to attribute variables in a string expression or list of string expressions.
Note: An expression like '${type}' should return '${@x.type}' but instead returns '${x.type}'. For now, do not reference values using the deprecated delimiter '${type}'; just use 'type'.
- tql.expression_utils.as_variable(expr, varname) str
Save an expression to a DSL variable, e.g., ‘varname=expression;’
- Parameters
expr (str) – The expression to transform and save to variable ‘varname’
varname (str) – The variable name which the expression will be saved to
- Returns
expression with the final line set equal to varname
- Return type
str
Example:
as_variable("var1=2000; ${@var1}", 'input_name') # Returns "var1=2000; input_name=${@var1};"
- tql.expression_utils.remove_lambda(expr, varname='x') str
Remove leading inline @?varname. from variables in a string expression or list of string expressions. Regex handles many cases outlined in test_remove_lambda().
Note: An expression like ${x.var} should be returned unchanged, but returns ${var} when the inline variable collides with one of the input record types.
- tql.expression_utils.snippet(string: str, length: int = 20, postfix: str = '...')
Truncate ‘string’ to ‘length’ and append ‘postfix’.
- tql.expression_utils.spaces(string: str, number: int = 2, pad: str = ' ')
Strip leading and trailing newlines and pad newlines with ‘number’ of ‘pad’ characters.
- tql.expression_utils.to_list(x)
If x is not a list, make it a list
- tql.expression_utils.validate_dynamics(dynamics)
Validates that dynamics is a dict and has valid scale, shape, epsilon, and filters
- tql.expression_utils.validate_scale(scale)
Validate the dynamic scale format.
- tql.expression_utils.which_distribution(distribution)
Format the distribution name to use the ‘short’ version that SUM_OPPORTUNITIES() accepts.
tql.function_doc module
- class tql.function_doc.FunctionDoc(function_name=None, project_id=None)
Bases:
object
- tql.function_doc.find_function(pattern: str, project_id: Optional[int] = None, descriptions: bool = False) list
Returns DSL function names matching a regexp pattern
- Parameters
pattern (str) – The pattern to search for
project_id (int) – The Project ID
descriptions (bool) – Also search description fields
- Returns
A list of FunctionDoc
- Return type
list
- tql.function_doc.function_usage(function_name: str, project_id: Optional[int] = None)
Shows help on a DSL function (or returns an info dictionary)
- Parameters
function_name (str) – The name of the function to get usage
project_id (int) – The Project ID
- Returns
The function documentation
tql.opportunity_conf module
- class tql.opportunity_conf.OpportunityConf(opportunity_filter_expressions: list, kernels: str = '5m,15m,1h,4h,1d,3d,7d', decay_function: str = 'exp', sum_opportunities_epsilon: float = 1e-09)
Bases:
object
An OpportunityConf object is used to configure the dataset extractor for incrementality datasets when building features that use the DSL function SUM_OPPORTUNITIES().
When modeling the relationship between the time series of treatment opportunities and outcomes, assertions about timing and the shapes of those relationships must be made. For example, are opportunities typically followed by an increase in the number (or value) of outcomes due to the causal effect of treatment or merely due to temporal correlations (e.g., “activity bias” or other spurious form of selection bias)? Are opportunities preceded by a “run-up” in outcomes due to outcomes being a requirement for opportunities (e.g., retargeting, upselling, etc.)?
We define opportunities by asserting the time or events that represent an “opportunity to treat.” Then, when building features to model the treatment effects and control for sources of bias, we can accumulate the contributions of each opportunity and treatment across CATEGORICAL and NUMERICAL features in accordance with a hypothesized effect distribution shape. We do not have to know the shape in advance, but we have to establish boundaries on the shape by defining:
What is an opportunity?
What is the hypothesized time range relevant to the effect of the opportunity?
What set of basis shapes or distributions should we use to build up a mixture distribution of the dynamic effects/relationships between opportunities, treatments, and outcomes?
The parameters of this class encode the answers to these questions to facilitate computationally efficient strategies for modeling with SUM_OPPORTUNITIES().
See also: nanml.dsl.usage(‘SUM_OPPORTUNITIES’).
Extraction settings for incrementality datasets. OpportunityConf.SPEC has reasonable defaults which can be overridden:
- Parameters
opportunity_filter_expressions (list) – ex: [] - A list of opportunity filter expressions; a filter that defines opportunities to treat a user.
kernels (str) – ex: “5m,15m,1h” - Specify a window for effects stemming from opportunities. In the case of the exponential decay_function, this value controls the rate. Examples of valid inputs: ‘15m’, ‘2h’, ‘7d’ (15 minutes, 2 hours, 7 days). Multiple kernels are specified as a single string ‘15m, 2h, 7d’. Numbers must be integers or decimals. Letters are case insensitive. See also the PARSE_KERNELS() function. Each kernel can optionally set a distribution to override the default decay_function. For example, ‘15m-u’ designates a 15-minute scale parameter on a uniform distribution.
decay_function (str) – ex: “exp” - Effects from opportunities may take a particular shape. Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), half-normal (h, hnorm), and half-logistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetric triangular (s, stri), normal (n, norm), and logistic (o, log).
sum_opportunities_epsilon (float) – ex: 0.0 - Set a tolerance on the magnitude of the distribution’s effect below which the feature’s value will be rounded to zero to increase sparsity and improve computational performance. Larger values increase computational efficiency at the potential cost of bias from omitting small effects due to round-off.
- describe()
Get a description of the object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the OpportunityConf.
- classmethod for_type(event_type: str)
A very common case is to want the treatment filter expression to match a specific type. This builds and returns an appropriate OpportunityConf object
- get_decay_function()
Gets the decay function from this OpportunityConf
- get_kernels()
Gets the kernels from this OpportunityConf
- get_opportunity_filter_expressions()
Gets the filters from this OpportunityConf
- get_sum_opportunities_epsilon()
Gets the epsilon from this OpportunityConf
- json() dict
Gets a dictionary representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current configuration instance as a dict
- Return type
dict
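Example (a minimal sketch; the filter expression string is illustrative):
conf = OpportunityConf.for_type('request')   # common shorthand: match opportunities by event type
conf = OpportunityConf(["type = 'request'"], kernels='15m,1h,1d', decay_function='exp')
conf.describe()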
tql.query module
- class tql.query.Query
Bases:
object
The primary use of TQL is the Query API, which extracts machine-learning-ready ResultSets from timeline data. Each row in a ResultSet is an event on a timeline, and each column is a piece of data extracted from the event using the TQL Expression Language.
- abort()
Request this Query be aborted
- static abortById(query_id)
Request Query with the given ID be aborted
- Parameters
query_id – The ID of the Query to be aborted
- copy()
Returns a copy of this Query object
- dataframe(print_payloads=False, limit=None)
An alias for results().dataframe(). If there are multiple partitions, the rows from all partitions will be concatenated together.
- Returns
The results as a dataframe
- Return type
- debugger(theme: str = 'light')
Creates an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
- describe()
Gets a description of the query object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the query.
- Returns
A description of this Query object
- downsample_by(sample_rate: float = 1, pos_sample_rate: float = 1, pos_sampling_seed: str = 'label_downsample', neg_sample_rate: float = 1, neg_sampling_seed: str = 'label_downsample', key_expression: str = 'CONCAT(id, timeline.id)', salt: str = '', reweight: bool = True, max_records: Optional[int] = None, neg_pos_ratio: Optional[float] = None, interactive: Optional[bool] = None)
Downsample data specified by a query
- Parameters
sample_rate – sample rate for all records
pos_sample_rate – sample rate for positive records
pos_sampling_seed – seed for positive sampling expression
neg_sample_rate – sample rate for negative records
neg_sampling_seed – seed for negative sampling expression
key_expression – key for generating down sampling expression
salt – seed for generating down sampling expression
max_records – An approximate maximum number of records to return; if max_records is set, the interactive flag also needs to be set to get an accurate estimate
neg_pos_ratio – Desired negative-to-positive ratio, which typically should be between 1 and 10 to maximize statistical performance of model estimation. neg_pos_ratio=1 yields balanced samples, whereas neg_pos_ratio=3 has 2x as much data but only 33% lower variance, and neg_pos_ratio=10 has 5.5x as much data but only (roughly) 50% lower variance than a balanced sample.
interactive – Specify whether the query is interactive; needs to be provided if max_records is set to get an accurate estimate
reweight – Indicate whether positive/negative record weights need to be adjusted
- Returns
Query with down sampling where clause
- Return type
Query
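Example (a sketch; the label expression and ratio are illustrative):
query = select(label('request.impression'), event_metadata()).from_events('lethe4', 'request')
query = query.downsample_by(neg_pos_ratio=3, reweight=True)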
- event_var(name: str, expr: str)
Defines a single precomputed event-level variable that can be retrieved with the expression language function
EVENT_VAR('name')
. These variables will only be computed once per event and can be reused across multiple features as a speed optimization.
- Parameters
name (str) – An event variable name to use
expr (str) – An expression to be pre-computed
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- event_vars(vars: dict)
Defines a set of precomputed event-level variables that can be retrieved with the expression language function
EVENT_VAR('name')
. These variables will only be computed once per event, and can be reused across multiple features as a speed optimization.
- Parameters
vars (dict) – A dictionary of event variables to precompute
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
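Example (precompute a value once per event and reuse it across features; the expression is illustrative):
query = query.event_var('bid_value', 'COALESCE(request.bid, 0)')
bid = numerical("EVENT_VAR('bid_value')", name='bid')   # referenced from any column expression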
- external_timelines(timelines: list)
Instead of selecting from the pre-built timelines for the project specified in the .from() clause, process the query ONLY against the supplied example timelines. This is useful for unit testing or developing simple examples.
Example timelines:
timelines = [{ 'id': 'timeline_id1', 'events': [{ 'id': 1, 'timestamp': 1587772800000, 'type': 'request', 'request': { 'bid': 2.0, 'impression': 1 } }] }]
- Parameters
timelines (list) – A list of dictionaries representing timeline objects to run the query against
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- format(format)
Specify a format for the output of this query. This operator is only applicable for non-interactive queries where the results will be written to disk.
- Parameters
format – One of parquet, csv, or json
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- from_events(identifier, *types)
Selects from the given timelines, emitting one row per event. Identifier can either be a project id or project name. Optionally a list of event types can be provided. For example, if your timelines are composed of bid, activity, and user events, then .from_events(id, ‘user’) will select only user events from your timelines. This from clause is useful if you wish to build a dataset from your timeline events, and is the most common ‘from’ clause used when constructing machine learning datasets.
- Parameters
identifier – Project ID or name
*types – One or more event types to select from
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- from_timelines(identifier)
Selects from the given timelines, emitting one row per timeline. Identifier can either be a project id or project name. Functionally, this is accomplished by injecting one sampled (fake) event per timeline with timestamp at epoch time 0, filtering out all other events. This from clause is useful to compute a table of summary statistics per timeline, such as “how many click events has each user had in the last 28 days”.
- Parameters
identifier – Project ID or name
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- from_union(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)
Adds a sampling configuration object in the from_events() clause to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.
- Parameters
sampling (SamplingConf) – A SamplingConf object to append
append (bool) – Append the where conditions to the query instead of replacing
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- get_columns() list
Returns the Columns from the TQL query’s select() statement
- Returns
A list with a copy of the Column objects currently stored in this Query
- Return type
list
- get_event_vars() dict
Gets the event_vars of this Query if any exist
- Returns
The event_vars of this Query
- Return type
dict
- get_external_timelines() list
Returns the external timelines configured for this query, or None
- Returns
A list of timeline dictionaries, or None
- Return type
list
- get_format() str
Gets the output format (one of parquet, csv, or json) for this Query. This operator is only applicable for non-interactive queries where the results will be written to disk.
- Returns
The format of the current query
- Return type
str
- get_global_vars() dict
Gets the global variables from this Query, if any exist
- get_id() int
Gets the ID of this Query
- Returns
The ID of this Query
- Return type
int
- get_interactive() bool
Gets whether this Query is interactive
- Returns
Whether this Query is interactive
- Return type
bool
- get_label_column()
Gets the label Column if it exists
- Returns
The first label Column
- Return type
- get_limit() tuple
Gets the row limit and the timeline limit from this Query
- Returns
A tuple of the row limit and the timeline limit
- Return type
tuple
- get_opportunities()
Returns the OpportunityConf object in the current query
- Returns
OpportunityConf object in the current query
- Return type
- get_options() dict
Gets the options for this Query
- Returns
The options for this Query
- Return type
dict
- get_partition_by() str
Gets the partition key expression for this Query
- Returns
The expression that this Query was partitioned by
- Return type
str
- get_project()
Gets the Project with which this Query is associated
- Returns
The Project
- Return type
- get_project_id() int
Gets the ID of the Project with which this Query is associated
- Returns
The Project’s ID
- Return type
int
- get_row_limit() int
Gets the row limit of this Query
- Returns
The row limit
- Return type
int
- get_sampling() list
Returns the SamplingConf object(s) in the current query
- Returns
The SamplingConf object(s) in the current query
- Return type
list
- get_timeline_limit() int
Gets the timeline limit from this Query
- Returns
The timeline limit
- Return type
int
- get_timeline_sample_rate()
Gets the timeline sample rate
- Returns
The timeline sample rate
- get_timeline_vars() dict
Gets the timeline vars of this Query, if any exist
- Returns
The timeline vars
- Return type
dict
- get_vars() dict
Gets a dictionary of global_vars, timeline_vars, and event_vars
- Returns
A dictionary of global_vars, timeline_vars, and event_vars
- Return type
dict
- get_weight_column()
Gets the weight Column if it exists
- Returns
The first weight Column
- Return type
- get_where()
Gets the where filters of this Query
- Returns
The where filters of this Query
- global_var(name: str, expr: str)
Defines a single precomputed global variable that can be retrieved with the expression language function `GLOBAL_VAR('name')`. These variables will only be computed once per query and can be reused across multiple rows and multiple features per row as a speed optimization. The global variable 'timeline_stats' is pre-defined for all queries as a dictionary/map object with the following keys: [min_timestamp, max_timestamp, timeline_count, event_count, event_min_timestamp, event_max_timestamp, event_count_min, event_count_max, event_counts_by_type, time_series_min_max_timestamps]
- Parameters
name (str) – A variable name to use
expr (str) – An expression to be pre-computed
- Returns
The current TQL Query instance for further chaining/modification
- Return type
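A small sketch of the intended use: compute a value once per query, then reference it from column expressions (the import path and the arithmetic are illustrative assumptions):

```python
from zeenk import tql  # assumed import path

q = (
    tql.select("timestamp / GLOBAL_VAR('millis_in_day')")  # reuse the precomputed value
       .from_events('my_project')
       .global_var('millis_in_day', '24*3600*1000')        # computed once per query
)
```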
- global_vars(vars)
Defines a set of precomputed global variables that can be retrieved with the expression language function `GLOBAL_VAR('name')`. These variables will only be computed once per query, and can be reused across multiple rows and multiple features per row as a speed optimization. The global variable 'timeline_stats' is pre-defined for all queries as a dictionary/map object with the following keys: [min_timestamp, max_timestamp, timeline_count, event_count, event_min_timestamp, event_max_timestamp, event_count_min, event_count_max, event_counts_by_type, time_series_min_max_timestamps]
- Parameters
vars (dict) – A dict of global variables to precompute
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- json() dict
Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation
- Returns
The current TQL Query as a dict
- limit(rows=None, timelines=None)
Imposes a limit on the number of timelines iterated over, or rows returned. If timelines=<N> is specified, then at most N timelines will be evaluated in the query. If rows=<N> is specified, then at most N rows will be returned in the results. .limit(5) is a typical usage of this operator, and is the closest analog to the traditional SQL operator. For interactive queries, imposing a timeline limit is not required, as specifying limit(rows=N) is sufficient to shortcut the evaluation. However, for asynchronous queries executed in a distributed environment such as Spark, it is often useful to specify both timeline and row limits, as adding a timeline limit can result in less data being read from disk, and hence faster execution time.
- Parameters
rows (int) – The maximum number of rows to return
timelines (int) – The maximum number of timelines to evaluate.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- classmethod load(query_id)
Loads a Query object from the given ID
- Parameters
query_id (int) – The ID of the Query to load
- Returns
The Query from the given ID
- Return type
- opportunities(filters, distribution=None, scale=None, epsilon=None)
Defines events that constitute “opportunity to treat” for this dataset. For example, in the online advertising space, opportunities to treat would constitute all bid request events, i.e. “we had an opportunity to buy an ad on the user.” In a clinical drug trial, opportunity would constitute all persons who apply for a drug trial study. Filters will be evaluated as booleans (see docs on the .where() operator). Optionally a distribution function can also be supplied, which defines the hypothesized shape of the causal effect of the opportunities over time. Short, medium, or long string representations of the distribution functions are accepted:
'c', 'const', 'constant'
'e', 'exp', 'exponential'
'l', 'lap', 'laplace'
'u', 'unif', 'uniform'
'r', 'rect', 'rectangular'
't', 'tri', 'triangular'
's', 'stri', 'symmetrictriangular'
'h', 'hnorm', 'halfnormal'
'n', 'norm', 'normal'
'l', 'hlog', 'halflogistic'
'o', 'log', 'logistic'
Also optionally, a string list of time scales can be applied, defining the points in time at which the distribution should be evaluated. Any numerical value of time is allowed, at the scale of seconds (s), minutes (m), hours (h), and days (d). For example, this is a valid scale string: '.5m,4h,1d,3d', which would correspond to 30 seconds, 4 hours, 1 day, and 3 days.
Also optionally, an epsilon can be provided, which defines a minimal precision below which a feature should be rounded down to zero, e.g., 1e-8 would lead a feature value of 5e-9 to be returned as zero.
- Parameters
filters – One or more TQL filter expressions
distribution – The decay function shape (default exponential)
scale – The time scale over which to evaluate the causal effect of opportunities (default 5m,1h,4h,1d,3d,7d)
epsilon – The numerical epsilon to use (default 1E-6)
- Returns
The current TQL Query instance for further chaining/modification
- Return type
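A sketch marking bid events as opportunities, with an exponential decay evaluated at a custom set of time scales (the argument values are illustrative, not the defaults):

```python
# Assumes q is an existing Query; the filter expression is illustrative.
q = q.opportunities(
    "type = 'bid'",
    distribution='exp',    # short alias for exponential
    scale='.5m,4h,1d,3d',  # 30 seconds, 4 hours, 1 day, 3 days
    epsilon=1e-8,          # round feature values below 1e-8 down to zero
)
```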
- options(max_columns=None, global_min_total_count=None, apply_feature_filters=None, apply_charset_filter=None, drop_empty_rows=None, expand_numerical_features=None, drop_numerical_zero_features=None, throw_expression_errors=None, debug_expressions=None, fill_na=None, numerical_feature_precision=None, numerical_feature_epsilon=None, drop_constant_feature_columns=None, fix_column_names=None, allow_invalid_column_expressions=None)
Specify options for this query.
- Parameters
max_columns – the maximum number of columns to return; the top N columns are chosen by count of non-null values.
global_min_total_count – require that at least this many rows contain a non-null value, or drop the column.
apply_feature_filters – flag on/off applying feature filtering (default true)
apply_charset_filter – flag on/off cleaning the values of numerical and categorical feature columns (default true)
drop_empty_rows – if on, remove rows that have no non-null values. (default false)
expand_numerical_features – expand numerical feature arrays into multiple columns. (default false)
drop_numerical_zero_features – drop numerical feature columns that contain all zeros. (default false)
drop_constant_feature_columns – drop numerical feature columns that are constant (default false)
throw_expression_errors – use “fail fast” behavior with invalid expressions (default false)
debug_expressions – return extended debugging information about TQL expression evaluation with the result set.
fill_na – replace all numerical features with non-numeric values with 0.0.
numerical_feature_precision – how many decimal places to return.
numerical_feature_epsilon – abs(val) < eps will be rounded down to zero.
fix_column_names – specify if backend should rename duplicate column names. (default true)
allow_invalid_column_expressions – return errors about invalid column expressions
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- partition_by(expr=None, partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.008, 0.002], names: list = ['train', 'test'])
Specify a partition-key TQL expression for this dataset. Partition-key TQL expressions MUST return a non-null value for every row, and the value should have low cardinality, since a separate block of data (i.e. folder of files for asynchronous queries) will be created for each distinct partition value. One common use of partitioning is to generate reproducible train/test splits of your data for machine learning training. For example, `.partition_by("IF(MD5_MOD(timestamp, 10) > 8, 'train', 'test')")` creates a reproducible 80/20 split of your data, which can be read separately by your training and testing routines. It is reproducible because the split is computed based on an attribute of the event, such as timestamp. Another common use of partition_by() is to split your dataset into logical groupings such as by day: `.partition_by('date(timestamp)')`, to be read with reporting/BI software.
If the 'expr' argument is not set, the other arguments (partition_var, seed, shares, names) are used to create an expression like `Expression("MD5_PARTITIONS(timeline.id, 'the dead sea', [0.008, 0.002], ['train','test'])")` and set 'partition_key_expression' to that expression.
- Parameters
expr – the partition key expression to use.
partition_var – Single-line TQL subexpression.
seed – String to serve as the seed for the hash partitioning.
shares – List of the relative shares of each partition.
names – List of the names for each partition.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
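Two equivalent sketches of a reproducible 80/20 train/test split; the first passes the expression from the example above verbatim, while the second lets partition_by() build an MD5_PARTITIONS expression (the seed and shares here are illustrative, not the defaults):

```python
# Explicit partition-key expression:
q = q.partition_by("IF(MD5_MOD(timestamp, 10) > 8, 'train', 'test')")

# Or built from keyword arguments:
q = q.partition_by(partition_var='timeline.id', seed='my seed',
                   shares=[0.8, 0.2], names=['train', 'test'])
```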
- sampling(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)
Adds a sampling configuration object to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.
- Parameters
sampling (SamplingConf) – A SamplingConf object to append
append (bool) – Append the sampling configuration to the query instead of replacing it.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- select(*cols, append: bool = False)
Defines the columns on this TQL query object. Unless append=True, this replaces any pre-existing columns that were set with previous calls to select().
- Parameters
cols – One or more TQL column objects or expressions
append (bool) – Append cols to the query’s columns instead of replacing the query’s columns
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- show()
Shows the ResultSet dataframe as a pretty-printed table on stdout.
- state()
Gets the current state of this Query
- submit(interactive=True, wait=True, print_payloads=False, analyze=False, spark=None)
Executes this query in Icarus (via HTTP request) and fetches the results. If the interactive flag is true, this query will be executed immediately and the returned ResultSet will have .rows(). This is suitable for interactive development of TQL expressions, as the results will be immediate. If the interactive flag is false, the query will be executed asynchronously, possibly in a distributed computing environment such as on a Spark cluster. In this case, the ResultSet will not have .rows(), but will instead have .data_path(), a directory to which the query results will be written upon successful execution.
- Parameters
interactive (bool) – Whether this query should be executed synchronously or asynchronously
wait (bool) – Wait for the dataset compile to complete and show progress/status. Otherwise just make the webservice call and return.
print_payloads (bool) – Prints the request and response payloads from Icarus
analyze (bool) – Display extended ResultSetMetrics
spark (bool) – Whether to submit the query to a Spark Cluster
- Returns
A ResultSet object
- Return type
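A sketch of both execution modes (assuming q is a constructed Query):

```python
rs = q.submit()                    # interactive: rows are returned immediately
df = rs.pandas_dataframe()

rs = q.submit(interactive=False)   # asynchronous, e.g. on a Spark cluster
# Results are written to disk; fetch the output directory of the default partition.
print(rs.default_partition().data_path())
```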
- timeline_sample_rate(sample_rate)
Sets the sample rate on the Timelines
- Parameters
sample_rate – A floating point value in the interval (0, 1]
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- timeline_var(name: str, expr: str)
Defines a single precomputed timeline-level variable that can be retrieved with the expression language function `TIMELINE_VAR('name')`. These variables will only be computed once per timeline and can be reused across multiple rows and multiple features per row as a speed optimization.
- Parameters
name (str) – A variable name to use
expr (str) – An expression to be pre-computed
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- timeline_vars(vars: dict)
Defines a set of precomputed timeline-level variables that can be retrieved with the expression language function `TIMELINE_VAR('name')`. These variables will only be computed once per timeline, and can be reused across multiple rows and multiple features per row as a speed optimization.
- Parameters
vars (dict) – A dictionary of timeline variables to precompute
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- treatments(filters, scale=None, distribution=None, epsilon=None)
An alias for opportunities(...), i.e. events that constitute "opportunity to treat".
- udf(*udfs, append: bool = False)
Attaches UDFs (user-defined functions) to a Query
- Parameters
udfs – One or more UDFs; each UDF may be either a string or a TQL Column object
append (bool) – Append the UDFs to the query's existing UDFs instead of replacing them
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- union(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)
Adds a sampling configuration object to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.
- Parameters
sampling (SamplingConf) – A SamplingConf object to append
append (bool) – Append the sampling configuration to the query instead of replacing it.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- validate()
Validates that the Query is runnable before execution
- Returns
None. Throws an error if the query is not valid
- Return type
None
- vars(global_vars: Optional[dict] = None, timeline_vars: Optional[dict] = None, event_vars: Optional[dict] = None)
Convenience method for passing a dictionary of all of a query's precomputed variables. Optional keys: 'global_vars', 'timeline_vars', and/or 'event_vars'. Can be used in conjunction with `.get_vars()`:
>>> q = select("1").from_events(1).limit(1).vars(global_vars={'foo':'bar'})
>>> q.get_vars()
{'global_vars': {'foo': 'bar'}, 'timeline_vars': {}, 'event_vars': {}}
- Parameters
global_vars (dict) – A dictionary of key-values with valid inputs to
.global_vars()
timeline_vars (dict) – A dictionary of key-values with valid inputs to
.timeline_vars()
event_vars (dict) – A dictionary of key-values with valid inputs to
.event_vars()
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)
Visualize a timeline dataframe from query execution.
- Parameters
rows – the maximum number of rows to return
timelines – the maximum number of timelines to evaluate.
down_sampling – down sampling expression
size (str) – Name of column to use to encode each record’s marker’s size
shape (str) – Name of column to use to encode each record’s marker’s shape
opacity (str) – Name of column to use to encode each record’s marker’s opacity
jitter (float) – Amount of ‘jitter’ (random vertical offset to improve visibility)
size_rescale (float) – Multiplier to adjust the marker size. Defaults to 7
shape_dict (dict) – Dictionary to transform a column into shapes, e.g.: {'outcome': 'star', 'opportunity': 'square', 'treatment': 'triangle', 'default': 'circle'}
opacity_rescale (float) – Multiplier to adjust the marker opacity. None defaults to rescaling by the column’s min/max to 0 and 1
include_style_columns (bool) – Should the event data display include style columns?
ignore_nulls (bool) – Do not show key-value pairs that have `null` as the value
- Returns
A timeline viewer widget
- where(*filters, append=False)
Add one or more where clauses to the query, filtering the timeline events to only those that satisfy the given conditions. The given TQL expression(s) are evaluated as booleans. Non-boolean return values are evaluated in the following manner:
Numerical values greater than 0 evaluate to `true`, otherwise `false`
The strings 'true' and 'false' (case-insensitive) are evaluated to their boolean equivalents
Other non-numerical, non-boolean return values are considered true if the value is not null
In case of multiple where clauses, only the last one will be honored, i.e. the following constructs are logically equivalent: `.where('cond1').where('cond2') == .where('cond2')`
Multiple filters in a single where clause are applied using AND, i.e. the following constructs are logically equivalent: `.where('cond1', 'cond2') == .where('cond1 AND cond2')`. Use the following TQL syntax to define an OR clause: `.where('cond1 OR cond2')`
- Parameters
filters – One or more TQL columns or expressions that returns a boolean
append (bool) – Append the where conditions to the query instead of replacing
- Returns
The current TQL Query instance for further chaining/modification
- Return type
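A sketch of the AND/OR and append semantics described above (filter expressions are illustrative):

```python
q = q.where("type = 'click'", "bid > 0")   # equivalent to type = 'click' AND bid > 0
q = q.where("type = 'click' OR bid > 0")   # OR must be written inside one expression
q = q.where("timestamp > 0", append=True)  # add to, rather than replace, existing filters
```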
- tql.query.load_query(id: int) tql.query.Query
Loads the query from the given ID
- Parameters
id (int) – The ID of the query
- Returns
A new Query instance
- Return type
- tql.query.select(*cols) tql.query.Query
Creates a new query object from one or more columns. Columns can be defined as TQL expression strings, or wrapped using one of the provided TQL column functions `label()`, `weight()`, `tag()`, `categorical()`, `numerical()`, or `metadata()`. Typically `select(...)` will immediately be followed by `.from_timelines(...)` or `.from_events(...)` during query construction.
- Parameters
cols – One or more TQL column objects or expressions
- Returns
A new tql.query.Query instance for further chaining/modification
- Return type
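An end-to-end construction sketch tying the pieces together. The import path and the label()/categorical()/numerical() signatures are assumptions based on the column functions listed above:

```python
from zeenk import tql  # assumed import path

rs = (
    tql.select(
        tql.label('conversion'),                     # label column (signature assumed)
        tql.categorical('type', name='event_type'),  # categorical feature
        tql.numerical('bid'),                        # numerical feature (signature assumed)
    )
    .from_events('my_project')
    .where('timestamp > 0')
    .limit(rows=1000)
    .submit()
)
df = rs.pandas_dataframe()
```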
tql.query_templates module
tql.resultset module
- class tql.resultset.Partition(json, columns, data_format, interactive=True)
Bases:
object
A Partition is a subset of a ResultSet. If the ResultSet was partitioned, there will be a list of Partitions in the Resultset. If the ResultSet was not partitioned, there will only be the default Partition in the ResultSet.
- columns() list
Gets the Columns from the Query
- Returns
A list of Columns in the Query
- Return type
list
- data_path() str
Gets the data path of this Partition. This is only applicable for non-interactive Queries.
- Returns
The path to the data
- Return type
str
- dataframe(limit=None)
Gets this Partition’s Dataframe
- Parameters
limit (int) – The maximum number of rows to return
- Returns
A list of rows from a Spark Dataframe
- Return type
list
- name() str
Gets the name of this Partition
- Returns
The name of this Partition
- Return type
str
- pandas_dataframe()
Loads this Partition's data into a pandas dataframe and returns it
- Returns
A Pandas Dataframe
- positive_row_count() int
Gets the number of positive rows in this Partition
- Returns
The number of positive rows in this Partition
- Return type
int
- row_count() int
Gets the number of rows in the Partition
- Returns
The number of rows in this Partition
- Return type
int
- spark_dataframe(infer: bool = True)
Loads this Partition's data into a Spark dataframe and returns it
- Parameters
infer (bool) – Use inferSchema with CSV output
- Returns
A Spark Dataframe
- class tql.resultset.ResultSet(json, query=None, execution_time=None, analyze=False, interactive=True)
Bases:
object
A ResultSet is the result of a Query. It holds data as columns and rows.
- column_names() list
Gets the column names from this ResultSet as a list of strings
- Returns
A list of column names as strings
- Return type
list
- columns()
Gets the columns from this ResultSet as a list of Columns
- Returns
A list of Columns
- dataframe(limit=None)
Gets a dataframe from the ResultSet
- Parameters
limit (int) – The upper limit of rows to return
- Returns
A dataframe of rows and columns
- default_partition()
Gets the default partition from this ResultSet
- Returns
The default Partition
- Return type
- get_id()
Gets the ID of this ResultSet
- Returns
The ID of this ResultSet
- Return type
int
- get_query()
Gets the Query that generated this ResultSet
- Returns
The Query that generated this ResultSet
- Return type
- classmethod load(result_set_id, analyze: bool = False)
Loads a new ResultSet from the given ID
- Parameters
result_set_id (int) – The ResultSet ID
analyze (bool) – Whether to return detailed information about the ResultSet
- Returns
The ResultSet with the given ID
- Return type
- metrics()
Gets an HTML representation of metrics about this ResultSet
- Returns
An HTML representation of metrics about this ResultSet
- pandas_dataframe()
Gets this ResultSet as a Pandas Dataframe
- Returns
A Pandas Dataframe
- partition(partition_name='_default')
Gets a specific partition from this ResultSet
- Parameters
partition_name (str) – The name of the Partition
- Returns
The partition with the given name
- Return type
- partition_names()
Gets a list of partition names from this ResultSet
- Returns
A list of partition names
- Return type
list
- partitions()
Gets a list of Partitions from this ResultSet
- Returns
A list of partitions
- Return type
list
- positive_row_count()
Gets the number of positive rows in this ResultSet
- Returns
The number of positive rows
- Return type
int
- refresh()
Populates the non-interactive ResultSet fields with execution details
- row_count()
Gets the number of rows in the ResultSet
- Returns
The number of rows in the ResultSet
- Return type
int
- spark_dataframe(infer=True)
Gets this ResultSet as a Spark Dataframe
- Parameters
infer (bool) – Whether to use Spark to infer the datatypes
- Returns
A Spark Dataframe
- class tql.resultset.ResultSetMetrics(metrics)
Bases:
object
- column_metrics(pandas=False)
- column_value_metrics(pandas=False)
Getter for column_value_metrics from ResultSetMetrics.ColumnMetrics
- debug(pandas=False)
- dropped_expanded_columns(pandas=False)
- event_runtime(pandas=False)
- expression_compile_errors(pandas=False)
- expression_timing_stats(pandas=False)
- get_timelines_processed()
- json()
- query_summary(pandas=False)
- timeline_runtime(pandas=False)
- tql.resultset.load_resultset(id: int) tql.resultset.ResultSet
Loads the ResultSet with the given ID
- Parameters
id (int) – The ID of the ResultSet to be loaded
- Returns
The ResultSet
- Return type
tql.sampling module
- class tql.sampling.SamplingConf(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False)
Bases:
object
Allows you to manually override parameters during sample generation. See `generate_events` and `generate_importance_events`. Variables and functions both get registered in the expression context prior to the execution of the sampling expression and event attribute expressions.
- Parameters
sample_generator_expression (str) – ex. `"[MILLIS(TO_DATETIME('2021-12-01'))]"`. An expression providing a list of values at which to generate samples for each timeline. Typically, these will be a list of timestamps. The following example creates midnight timestamps from 2021-12-01 through 2021-12-10 for the timestamps at which we will sample: `days = 10; millis_in_day = 24*3600*1000; min_date = MILLIS(TO_DATETIME('2021-12-01')); iter = RANGE(0, days+1); MAP(iter, (x) -> min_date + millis_in_day * x)`. A simple example would be just `"[0]"` (timestamp of 1970-01-01 00:00:00.000) if you just wanted to generate a single record per timeline where dynamics are not relevant for any columns.
functions (list) – ex: `["function bar() { 2.0 + 2.0 }"]`. User-Defined Functions (UDFs) that get attached to the sample_generator_expression; they can be used like any other function.
variables (dict) – ex: `{"foo" : "3.3"}`. Key-value pairs used to define variables to parameterize advanced forms of sampling, such as Importance Sampling; they can be referenced in the sampling/event attribute expressions as `foo`.
attribute_expressions (dict) – ex: `{'timestamp': '${@sample}'}`. A dictionary of key-value pairs to replace dot-notation expressions such as 'id', 'timestamp', and 'type'. This is flexible enough to accommodate the development of more complex forms of sampling over time, space, and distributions of conversions (e.g., conversion name/type).
id: Identifier for each sample, e.g., `MONOTONICALLY_INCREASING_ID()`.
timestamp: An expression to define the timestamp associated with the sample generated. If sample_generator_expression generates timestamps, this field may be specified as `sample`.
type: User-provided string to identify the type of sample generated (e.g., `'my_manual_sample'`).
inherit_attributes (bool) – If the generator expression is a list of events, rather than just timestamps, initialize attribute_expressions with all of the properties of the event via `GET_PROPERTY()`.
- describe()
Get a description of the object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the SamplingConf.
- get_attribute_expressions()
Gets the attributes of this SamplingConf
- get_functions()
Gets the functions of this SamplingConf
- get_inherit_attributes()
Gets whether this SamplingConf should inherit attributes
- get_sample_generator_expression()
Gets the generator expression of this SamplingConf
- get_variables()
Gets the variables of this SamplingConf
- json() dict
Returns a dictionary representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current settings instance as a dict
- Return type
dict
- set_attribute_expressions_type(expr_type)
Sets the type of the attribute expressions
- Parameters
expr_type – The type of the attribute expression
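A sketch that wires the parameter examples above into a query: one generated event per day for ten days, each stamped with its sampled timestamp (the import path is an assumption):

```python
from zeenk.tql.sampling import SamplingConf  # assumed import path

conf = SamplingConf(
    sample_generator_expression=(
        "days = 10; millis_in_day = 24*3600*1000; "
        "min_date = MILLIS(TO_DATETIME('2021-12-01')); "
        "iter = RANGE(0, days+1); "
        "MAP(iter, (x) -> min_date + millis_in_day * x)"
    ),
    attribute_expressions={
        'timestamp': '${@sample}',     # stamp each generated event with the sampled value
        'type': "'my_manual_sample'",  # constant type string for generated events
    },
)
q = q.sampling(conf)  # assumes q is an existing Query
```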
- tql.sampling.generate_events(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False) tql.sampling.SamplingConf
Manual sampling is similar to importance sampling but allows the user to manually override parameters such as the list of sample events (e.g., timestamps at which sampling occurs) and other properties of each generated sample event. See generate_importance_events() for an advanced use case.
- tql.sampling.generate_importance_events(num_samples: float = -1.0, min_ts: Optional[str] = None, max_ts: Optional[str] = None, time_shift_factor: float = 0.0, fraction_uniform_time: float = 0.0, sampling_distribution: str = 'exponential', sampling_events_expression: str = 'FILTER(timeline.events, (x) -> x.type=null)', sampling_kernels: str = '5m,15m,1h,4h,1d,3d,7d') tql.sampling.SamplingConf
Importance sampling can be used to retain modeling unbiasedness (avoid introducing selection bias when sampling records) while still increasing the number of records where the modeling is most interesting. For example, when modeling the causal effect of a treatment on an outcome, we would like to ensure that most of our records (whether 'positive', an outcome event, or 'negative', a non-outcome sampling event) are in the vicinity of a treatment opportunity or an outcome. By so doing, we increase the model's statistical power at deciphering the relationship between the two. In contrast, if most outcomes happen during only 10% of the sample time period, most of our observations will be during the "boring" portion of the timeline when no events of interest are occurring.
generate_importance_events helps you configure what time periods are “interesting.” You configure how many records to randomly sample for each timeline, which timestamps or events you want to increase your sampling around, the distribution (shape and scale) around each event from which you would like to randomly sample, and the probability you would like to draw from the background uniform distribution (e.g., a random point in the timeline).
In summary, one way to generate negative (non-outcome) records would be to simply draw uniformly between the start and end of the timeline’s observation window. However, we can improve upon that by instructing the extractor to generate these negatives by a configurable time-importance-weighted sampling methodology around times of timeline events.
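A configuration sketch (argument values are illustrative; see the signature above): oversample around outcome events using exponential kernels, with a 10% chance of drawing from the uniform background:

```python
from zeenk.tql.sampling import generate_importance_events  # assumed import path

conf = generate_importance_events(
    num_samples=10,             # records to sample per timeline
    fraction_uniform_time=0.1,  # 10% drawn from the uniform background
    sampling_distribution='exponential',
    sampling_events_expression="FILTER(timeline.events, (x) -> x.type = 'outcome')",
    sampling_kernels='5m,1h,1d',
)
q = q.sampling(conf)  # assumes q is an existing Query
```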
tql.timelines module
- class tql.timelines.Project
Bases:
object
A Project is an object that contains: a name, a description, UDFs, and links to TimeSeries. It is a conglomeration of one or more sets of timeseries data and timeline data. Projects with IDs under 10 are reserved as demonstration projects.
- static all()
Gets all Projects
- build_timelines(wait=True)
Builds this Project’s timelines
- delete()
Deletes this Project from the database
- description(description)
Sets a description of the Project
- from_timeseries(*ts, append=False)
Creates a Project from a TimeSeries
- Returns
A new project based on the TimeSeries
- Return type
- get_annotations()
Gets this Project’s column annotations
- get_description()
Gets a description of the Project
- get_id()
Gets the ID of this Project
- get_metadata()
Gets this Project’s metadata
- get_name()
Gets the name of this Project
- get_status()
Gets this Project's status: `SUCCESS`, `PENDING`, `COMPILING`, `RUNNING`
- get_timelines()
Gets this Project’s Timelines
- get_timeseries()
Gets this Project’s TimeSeries
- get_timeseries_names()
Gets this Project’s TimeSeries names
- json() dict
Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current Project as a dict
- Return type
dict
- save()
Saves the metadata (description, etc.) of this Project to the database
- class tql.timelines.TimelineStats(json)
Bases:
object
Stats about the corresponding timeline
- get_event_count()
- get_event_counts_by_type()
- get_largest_timeline_size()
- get_max_timestamp()
- get_min_timestamp()
- get_smallest_timeline_size()
- get_timeline_count()
- get_timeseries_stats()
- class tql.timelines.Timelines(json, timeseries=None, annotations=None)
Bases:
object
A Timeline is a collection of events identified by a common join key, for example a logged in user_id or web cookie, and sorted by timestamp
- get_annotation(attr)
- get_attributes()
- get_attributes_table()
- get_created()
- get_data_path(for_spark=False)
- get_example()
- get_id()
- get_sample_data_path(for_spark=False)
- get_schema()
- get_state()
- get_statistics()
- get_updated()
- pandas_dataframe()
- spark_dataframe()
- tql.timelines.create_project(name: str, or_update: bool = False) tql.timelines.Project
Creates a new project. This project will require further configuration before timelines can be created. If the project already exists and you wish to update it, use `load_project(name)`, or give the `or_update=True` option here.
- Parameters
name (str) – The name of the Project
or_update (bool) – Whether to update the Project
- Returns
The Project that has just been created
- Return type
- tql.timelines.drop_project(name_or_id, if_exists: bool = False)
Deletes the project by name or ID
- Parameters
name_or_id – The name (str) or ID (int) of the Project
if_exists (bool) – Only drop the Project if it exists
- tql.timelines.load_project(project_identifier, fail_if_not_found: bool = True) tql.timelines.Project
Loads the specified project by name, or throws TQLAnalysisException if not found
- Parameters
project_identifier – The name (str) or ID (int) of the Project
fail_if_not_found (bool) – Whether to throw an exception if the Project is not found
- Returns
The Project with the specified name or ID
- Return type
- tql.timelines.show_projects()
Shows the available projects
tql.timeseries module
- class tql.timeseries.TimeSeries(name)
Bases:
object
The mapping between your data source and TQL is called a TimeSeries. It is important to note that a TimeSeries object in TQL is not the data itself, but merely a recipe for how to read the data out of storage into TQL.
- analyze(columns=None, row_limit=-1, sample_rate=1.0, top_n_limit=5, print_json=False, wait=True)
Analyzes this TimeSeries
- annotate_columns(annotations)
Sets the column annotations of this TimeSeries
- from_files(data_path, format=None, has_csv_header=False)
Sets the files to be used for this TimeSeries
- Parameters
data_path – The path of data to load
format – What format the data is in
has_csv_header (bool) – Whether the csv data has column headers
- Returns
This TimeSeries for further chaining
- Return type
- from_sql(*sql_stmts)
Sets the SQL statements for this TimeSeries
- Parameters
*sql_stmts – SQL statements
- Returns
This TimeSeries for further chaining
- Return type
- from_url(data_url)
Sets the URL for this TimeSeries
- Parameters
data_url (str) – The URL at which the data is located
- Returns
This TimeSeries for further chaining
- Return type
- get_annotation(col_name: str)
Gets the specific annotation of this TimeSeries by name
- get_annotations()
Gets the column annotations of this TimeSeries
- get_data_path()
Gets the data path of this TimeSeries
- get_duration_col()
Gets the duration column of this TimeSeries
- get_example()
- get_format()
Gets the format of this TimeSeries data
- get_metadata()
Gets the metadata of this TimeSeries
- get_name()
Gets the name of this TimeSeries
- get_sql_statements()
Gets the SQL statements of this TimeSeries
- get_timeline_id_col()
Gets the timeline id column of this TimeSeries
- get_timestamp_col()
Gets the timestamp column of this TimeSeries
- identified_by(timeline_id_col, timestamp_col=None, duration_col=None)
Sets what the TimeSeries is identified by
- Parameters
timeline_id_col – The name of the id column
timestamp_col – The name of the timestamp column
duration_col – The name of the duration column
- Returns
This TimeSeries for further chaining
- Return type
- json() dict
Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current TQL TimeSeries as a dict
- metadata(meta)
Sets the metadata for this TimeSeries
- Parameters
meta – The metadata
- Returns
This TimeSeries for further chaining
- Return type
- pandas_dataframe()
Gets a Pandas Dataframe of this TimeSeries
- Returns
A Pandas Dataframe
- read_option(key, value)
Sets the Spark read options for this TimeSeries
- Parameters
key – The read option key
value – The read option value
- Returns
This TimeSeries for further chaining
- Return type
- spark_dataframe()
Gets a Spark Dataframe of this TimeSeries
- Returns
A Spark Dataframe
- validate()
Validates this TimeSeries
- visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)
Shows a Jupyter Notebook/Labs visualizer widget
- class tql.timeseries.TimeSeriesStats(json)
Bases:
object
Statistics about the corresponding TimeSeries
- static load(id)
- tql.timeseries.create_timeseries(name: str) tql.timeseries.TimeSeries
Creates a TimeSeries with the given name
- Parameters
name (str) – The name of the TimeSeries
- Returns
A new TimeSeries with the given name
- Return type
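A sketch of the typical flow from raw files to built timelines, combining create_timeseries with the Project functions above (the paths, column names, and import path are illustrative assumptions):

```python
from zeenk import tql  # assumed import path

ts = (
    tql.create_timeseries('purchases')
       .from_files('/data/purchases/*.csv', format='csv', has_csv_header=True)
       .identified_by('user_id', timestamp_col='event_time')
)

project = tql.create_project('my_project', or_update=True).from_timeseries(ts)
project.build_timelines()  # compile the timeseries events into timelines
```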
tql.udf module
- class tql.udf.UDF(udfs)
Bases:
object
A User-Defined Function - written in TQL expression language and associated with a Project
- get_udf_map()
Gets the defined UDFs as a Python dictionary
- tql.udf.delete_udf(project_id: int, function_name: str)
Deletes a UDF for a specific Project by function name
- Parameters
project_id (int) – ID of the Project to delete udf from
function_name (str) – Name of the udf to be deleted
- tql.udf.get_udf(project_id: int, function_name: str) tql.udf.UDF
Retrieves a specific UDF from a Project
- Parameters
project_id (int) – ID of the Project from which to retrieve the UDF
function_name (str) – Name of the UDF to retrieve
- Returns
The UDF, or throws a TQLAnalysisNotFound exception if it does not exist
- Return type
- tql.udf.list_udfs(project_id: int) tql.udf.UDF
Retrieves all UDFs defined for a Project
- Parameters
project_id (int) – ID of the Project from which to retrieve UDFs
- Returns
All existing UDFs for the Project
- Return type
- tql.udf.update_udf(project_id: int, *function_src)
Updates UDF(s) for the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function doesn't exist in the Project yet
- tql.udf.upload_udf(project_id: int, *function_src)
Uploads UDF(s) to the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function already exists in the Project
- tql.udf.validate_udf(project_id: int, *function_src)
Validates UDF(s)
- Parameters
project_id (int) – ID of the Project to validate UDFs on
function_src – One or more function source strings
Throws TQLAnalysisException, printing the error message and location, if there is any compilation error in the UDF
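A sketch of the validate/upload/list flow, reusing the example function source from the SamplingConf docs (the project ID and import path are illustrative):

```python
from zeenk.tql import udf  # assumed import path

src = "function bar() { 2.0 + 2.0 }"
udf.validate_udf(1, src)    # throws TQLAnalysisException on compile errors
udf.upload_udf(1, src)      # throws if 'bar' already exists in the project
print(udf.list_udfs(1).get_udf_map())  # all UDFs for the project, as a dict
```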
tql.validation module
- exception tql.validation.TQLAnalysisException(reason)
Bases:
Exception
General TQL Exception
- get_expr_compile_errors()
Get expression compile errors from the backend
- get_message()
Get the exception’s error message
- get_readable_expr_compile_errors()
Return a pretty expression error message
- exception tql.validation.TQLAnalysisNotFound(msg)
Bases:
Exception
TQL not found error
- get_message()
Get the exception’s error message
- tql.validation.format_expression_error(expression, position, error, name=None)
Pretty print an expression error message
- Parameters
expression – Expression with error
position – Location of error message (line, column) tuple
error – Error message
name – Column name for the expression with error
- Returns
The expression with formatted error
- Return type
str
- tql.validation.hide_trace_back()
This method can be used to suppress traceback only for TQLAnalysisException
- tql.validation.show_trace_back()
This method can be used to show traceback for TQLAnalysisException
- tql.validation.validate_no_exception(lambda_fcn, msg=None)
Run the given lambda and rethrow any Exceptions as a TQLAnalysisException
- tql.validation.validate_tql_iterable_type(thing, type)
Check if the thing is an iterable type
- tql.validation.validate_tql_state(cond, msg: str)
Throws an exception if cond is not truthy
- tql.validation.validate_tql_type(thing, type, msg: Optional[str] = None)
Ensures that 'thing' has the expected type, or throws an exception with the given message.
tql.visualizer module
- class tql.visualizer.TimeSeriesVisualizer(query=None, timeseries=None)
Bases:
object
Visualizes a TimeSeries dataframe
- visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)
Creates an interactive visualizer object from the given input, rendered as a Jupyter extension.
- Parameters
rows – the maximum number of rows to return
timelines – the maximum number of timelines to evaluate
down_sampling – down sampling expression
size (str) – Name of column to use to encode each record’s marker’s size
shape (str) – Name of column to use to encode each record’s marker’s shape
opacity (str) – Name of column to use to encode each record’s marker’s opacity
jitter (float) – Amount of ‘jitter’ (random vertical offset to improve visibility)
size_rescale (float) – Multiplier to adjust the marker size. Defaults to 7
shape_dict (dict) – Dictionary to transform a column into shapes, e.g.: {'outcome': 'star', 'opportunity': 'square', 'treatment': 'triangle', 'default': 'circle'}
opacity_rescale (float) – Multiplier to adjust the marker opacity. None defaults to rescaling by the column’s min/max to 0 and 1.
include_style_columns (bool) – Should the event data display include style columns?
ignore_nulls (bool) – Do not show key-value pairs that have `null` as the value
- Returns
Timeline viewer widget