tql package
- tql.categorical(expr, name: Optional[str] = None, filters: Optional[dict] = None) zeenk.tql.column.FeatureColumn
Creates a categorical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.col(c: any, name: Optional[str] = None, type: Optional[str] = None, filters: Optional[dict] = None)
Creates a column from a TQL expression or given input, optionally providing name, type, and filters. The first argument can be a variety of formats, including:
classes or subclasses of type Column
dictionary containing keys for name, expression, and type
raw strings, which will be interpreted as TQL expressions.
The created column is by default unnamed, and will be assigned a name if used in a TQL
select(...)
statement. The default type of columns is ‘METADATA’.
- Parameters
c (any) – A TQL expression string, column object, dictionary, list, tuple, or FeatureColumn object
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
type (str) – Optionally provide a type for the column. If not provided, the column type will be METADATA
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A Column or FeatureColumn object to be provided to
select(...)
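Example (a minimal sketch; the request.bid field comes from the sample timeline shown later in this document):
col('request.bid')                                       # raw TQL expression, default type METADATA
col('request.bid', name='bid', type='NUMERICAL')         # explicitly named and typed
col({'name': 'bid', 'expression': 'request.bid', 'type': 'NUMERICAL'})   # dict form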
- tql.constant()
Used in select(…) statements to select the FeatureColumn “1.0” with name “constant”
- Returns
A tuple with the FeatureColumn “1.0” with name “constant”
- Return type
tuple
- tql.create_project(name: str, or_update: bool = False) zeenk.tql.timelines.Project
Creates a new project. This project will require further configuration before timelines can be created. If the project already exists and you wish to update it, use
load_project(name)
, or give the or_update=True option here.
- Parameters
name (str) – The name of the Project
or_update (bool) – Whether to update the Project
- Returns
The Project that has just been created
- Return type
Project
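Example (the project name is illustrative):
project = create_project('my_project')
project = create_project('my_project', or_update=True)   # update it if it already exists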
- tql.create_timeseries(name: str) zeenk.tql.timeseries.TimeSeries
Creates a TimeSeries with the given name
- Parameters
name (str) – The name of the TimeSeries
- Returns
A new TimeSeries with the given name
- Return type
TimeSeries
- tql.debugger(project, expression: str = '', theme: str = 'light')
Create an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
project – The Project ID or name to run the expression against
expression (str) – The initial value of the expression, if any
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
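Example (assuming the demo project 'lethe4' exists):
widget = debugger('lethe4', expression='timeline.id', theme='dark')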
- tql.delete_udf(project_id: int, function_name: str)
Deletes a UDF for a specific Project by function name
- Parameters
project_id (int) – ID of the Project to delete the UDF from
function_name (str) – Name of the UDF to be deleted
- tql.describe_project(project_identifier, fail_if_not_found: bool = True) zeenk.tql.timelines.Project
Loads the specified project by name, or throws TQLAnalysisException if not found
- Parameters
project_identifier – The name (str) or ID (int) of the Project
fail_if_not_found (bool) – Whether to throw an exception if the Project is not found
- Returns
The Project with the specified name or ID
- Return type
Project
- tql.drop_project(name_or_id, if_exists: bool = False)
Deletes the project by name or ID
- Parameters
name_or_id – The name (str) or ID (int) of the Project
if_exists (bool) – Only drop the Project if it exists
- tql.event_metadata() tuple
To be used in select(…) statements for returning timeline.id, id, datetime, and type
- Returns
A tuple of Columns
- Return type
tuple
- tql.event_time() tuple
To be used in select(…) statements for returning timestamp and duration
- Returns
A tuple of Columns
- Return type
tuple
- tql.generate_events(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False) zeenk.tql.sampling.SamplingConf
Manual sampling is similar to importance sampling but allows the user to manually override parameters such as the list of sample events (e.g., timestamps at which sampling occurs) and other properties of each generated sample event. See generate_importance_events() for an advanced use case.
- tql.generate_importance_events(num_samples: float = - 1.0, min_ts: Optional[str] = None, max_ts: Optional[str] = None, time_shift_factor: float = 0.0, fraction_uniform_time: float = 0.0, sampling_distribution: str = 'exponential', sampling_events_expression: str = 'FILTER(timeline.events, (x) -> x.type=null)', sampling_kernels: str = '5m,15m,1h,4h,1d,3d,7d') zeenk.tql.sampling.SamplingConf
Importance sampling can be used to retain modeling unbiasedness (avoid introducing selection bias when sampling records) while still increasing the number of records where the modeling is most interesting. For example, when modeling the causal effect of a treatment on an outcome, we would like to ensure that most of our records (whether ‘positive’, an outcome event, or ‘negative’, a non-outcome sampling event) are in the vicinity of a treatment opportunity or an outcome. By so doing, we increase the model’s statistical power at deciphering the relationship between the two. In contrast, if most outcomes happen during only 10% of the sample time period, most of our observations will be during the “boring” portion of the timeline when no events of interest are occurring.
generate_importance_events helps you configure what time periods are “interesting.” You configure how many records to randomly sample for each timeline, which timestamps or events you want to increase your sampling around, the distribution (shape and scale) around each event from which you would like to randomly sample, and the probability you would like to draw from the background uniform distribution (e.g., a random point in the timeline).
In summary, one way to generate negative (non-outcome) records would be to simply draw uniformly between the start and end of the timeline’s observation window. However, we can improve upon that by instructing the extractor to generate these negatives by a configurable time-importance-weighted sampling methodology around times of timeline events.
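Example (a minimal sketch; the 'request' event type and the 'lethe4' demo project are illustrative):
sampling = generate_importance_events(
    num_samples=10,
    sampling_events_expression="FILTER(timeline.events, (x) -> x.type='request')",
    sampling_kernels='15m,1h,1d')
query = select(event_metadata()).from_events('lethe4').from_union(sampling)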
- tql.get_udf(project_id: int, function_name: str) zeenk.tql.udf.UDF
Retrieve a specific UDF from a Project
- Parameters
project_id (int) – ID of the Project to retrieve the UDF from
function_name (str) – Name of the UDF to retrieve
- Returns
A UDF or throws TQLAnalysisNotFound exception
- Return type
UDF
- tql.kernelize(features, name: Optional[str] = None, treatment_filter_event: Optional[str] = None, event_group_where_event: Optional[str] = None, intended_action: Optional[list] = None, actual_action: Optional[list] = None, treatment_model: Optional[str] = None, kernel_types: Optional[list] = None, opportunity_conf: Optional[zeenk.tql.opportunity_conf.OpportunityConf] = None, kernel_parameters: Optional[str] = None, kernel_distribution: Optional[str] = None, filters: Optional[dict] = None, parse_kv: bool = False)
Takes a list of opportunity-level features based on user/opportunity/treatment/proxy-outcome-level fields and creates a transformed feature with SUM_OPPORTUNITIES() accumulating the effects across each opportunity. These “kernelized” features are for use in a Causmos timeline-based causal model where each observation record in a dataset is an outcome or a potential outcome (e.g., a moment in time when an outcome could have occurred). For example, we can power dataset generation using a treatment-propensity model to reduce the potential bias if treatment and the outcome are correlated. Apply this function to each list of features to be kernelized. There are several types of transformations, described as (KD = ‘treatment’, BKD = ‘baseline’, GKD = ‘ghost’, NKD = ‘nonrandom’).
- Parameters
features (list) – List of col() to be kernelized (KD, BKD, GKD, NKD). Examples: ["1.0", "IF(1.0,'L','R')"] or [{"name":"constant", "expression":"1.0", "type":"NUMERICAL"}]. Columns can be “CATEGORICAL” or “NUMERICAL”. Each categorical expression creates an expansion of features; each numerical expression multiplies/reweights the opportunity’s contribution to the SUM_OPPORTUNITIES() sum.
name (str) – Base string for the feature sets. The type of kernel will be appended, and “w” will be prepended to denote non-incrementality features for BKD, GKD, and NKD. If name='' (default), just return the list of features with the default prefixes: AKD_..., wBKD_..., etc.
treatment_filter_event (str) – Column that defines whether a treatment opportunity resulted in treatment. For example, was the advertiser’s impression shown after bidding? In a sample dataset, request.impression is the relevant field.
event_group_where_event (str) – Second where condition within the passed EventGroup (currently this is set to be the output from filtering such as dedupe, but should be made more explicit).
intended_action – List of numerical/categorical expressions defining the intended/optimal decisions, e.g., bid_amount or eligible_treatment_groups.
actual_action – List of numerical/categorical expressions defining the actual action/decision taken, e.g., IF(ghost_bid, 0, bid_amount) or assigned_treatment_group.
treatment_model (str) – String for the deployed/published treatment prediction model (GKD and NKD only). In practice, this will be the win-rate model based on the leaves.
kernel_types (list) – List of kernel types to include in the feature set. Subset of ['KD','BKD','GKD','NKD'].
opportunity_conf – An OpportunityConf() object that contains the defaults for opportunity_filter, kernel_parameters, and kernel_distribution.
kernel_parameters (str) – Calibration for kernel.days in the kernel parameters. Arrays of kernel features are created with suffixes that combine a number with units of seconds (s), minutes (m), hours (h), or days (d), such as '15m,4h,3d': 'name_feature' becomes 'name_feature-15m', etc.
kernel_distribution (str) – Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), halfnormal (h, hnorm), and halflogistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetrictriangular (s, stri), normal (n, norm), and logistic (o, log). The time-independent constant kernel (c, const) is also useful for various use cases, such as static models, in order to accumulate all opportunities and treatments for each outcome.
filters (dict) – Column filters to apply to all of the features.
parse_kv (boolean) – Try to parse ‘NUMERICAL’ input features as key-value pairs ‘k:v’. This has extra overhead, but supports more complex mixed ‘categorical:numerical’ input features.
- Returns
A list of feature sets for each kernel_type, each containing a list of kernelized features.
- Return type
list
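Example (a minimal sketch; the field, project, and feature names are illustrative):
conf = OpportunityConf.for_type('request')
feature_sets = kernelize(
    [col('1.0', name='constant', type='NUMERICAL')],
    name='base',
    treatment_filter_event='request.impression',
    kernel_types=['KD', 'BKD'],
    opportunity_conf=conf)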
- tql.label(expr, name: str = '_label') zeenk.tql.column.Column
Creates a label column from a TQL expression. The expression provided to label() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default label value of 0.0. Label columns will automatically be named “_label”. It is expected that a dataset will have at most one label column.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.list_udfs(project_id: int) zeenk.tql.udf.UDF
Retrieves all UDFs defined for a Project
- Parameters
project_id (int) – ID of the Project to retrieve UDFs from
- Returns
All existing UDFs for the Project
- Return type
UDF
- tql.load_project(project_identifier, fail_if_not_found: bool = True) zeenk.tql.timelines.Project
Loads the specified project by name, or throws TQLAnalysisException if not found
- Parameters
project_identifier – The name (str) or ID (int) of the Project
fail_if_not_found (bool) – Whether to throw an exception if the Project is not found
- Returns
The Project with the specified name or ID
- Return type
Project
- tql.load_query(id: int) zeenk.tql.query.Query
Loads the query from the given ID
- Parameters
id (int) – The ID of the query
- Returns
A new Query instance
- Return type
Query
- tql.load_resultset(id: int) zeenk.tql.resultset.ResultSet
Loads the ResultSet with the given ID
- Parameters
id (int) – The ID of the ResultSet to be loaded
- Returns
The ResultSet
- Return type
ResultSet
- tql.metadata(expr, name: Optional[str] = None) zeenk.tql.column.Column
Creates a metadata column from a TQL expression. Metadata columns will return the expression values “as is”, meaning they will not be post-processed with charset filtering or expansion of numerical columns. Metadata columns are also not subject to column filters. Metadata columns are the default column type and are often the correct choice for arbitrary datasets that are not specifically intended to be consumed by a ML training package.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.numerical(expr, name: Optional[str] = None, filters: Optional[dict] = None) zeenk.tql.column.FeatureColumn
Creates a numerical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.query(project='lethe4')
A simple example query with event_metadata() and a limit of 10
- Parameters
project – The Project or project ID to run the Query against
- Returns
An example Query object
- Return type
Query
- tql.random_partition(partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.8, 0.2], names: list = ['train', 'test'])
Use the arguments to create a random partition TQL expression like
col("MD5_PARTITIONS(timeline.id, 'my hashing seed', [0.008, 0.002], ['train','test'])")
. This can be used as a column or with .partition_by().
- Parameters
partition_var (str) – Single-line TQL subexpression
seed (str) – String to serve as the seed for the hash partitioning
shares (list) – List of the relative shares of each partition. Relative shares do not need to sum to one
names (list) – List of the names for each partition
- Returns
A metadata column with the random_partition expression for use in .partition_by() or as a column
- Return type
Column
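Example (an 80/20 train/test split keyed on timeline.id; the seed and project are illustrative):
part = random_partition(seed='my hashing seed', shares=[0.8, 0.2], names=['train', 'test'])
query = select(event_metadata(), part).from_events('lethe4')   # or pass it to .partition_by(...)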
- tql.select(*cols) zeenk.tql.query.Query
Create a new query object from one or more columns. Columns can be defined as TQL expression strings, or wrapped using one of the provided TQL column functions: label(), weight(), tag(), categorical(), numerical(), or metadata(). Typically select(...) will immediately be followed by .from_timelines(...) or .from_events(...) during query construction.
- Parameters
cols – One or more TQL column objects or expressions
- Returns
A new tql.query.Query instance for further chaining/modification
- Return type
Query
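Example (a sketch against the sample 'request' events; the expressions are illustrative):
query = (select(event_metadata(),
                label('request.impression'),
                categorical('request.bid', name='bid'))
         .from_events('lethe4', 'request'))
df = query.dataframe()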
- tql.show_projects()
Shows the available projects
- tql.tag(expr, name: str = '_tag') zeenk.tql.column.Column
Creates a tag column from a TQL expression. The expression provided to tag() is expected to return a non-null value for all rows. Typically this expression will uniquely identify the row, which is useful for debugging and tracing datasets later. Uniqueness is not required for the return value, but highly encouraged. Tag columns will automatically be named “_tag”. It is expected that a dataset will have at most one tag column.
- Parameters
expr – A TQL expression string that returns a unique identifier for the row
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.update_udf(project_id: int, *function_src)
Updates UDF(s) for the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function doesn’t exist in the Project yet
- tql.upload_udf(project_id: int, *function_src)
Uploads UDF(s) to the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function already exists in the Project
- tql.validate_udf(project_id: int, *function_src)
Validates UDF(s)
- Parameters
project_id (int) – ID of the Project to validate UDFs on
function_src – One or more function source strings
Throws TQLAnalysisException and prints the error message and location if there is any compilation error in the UDF
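Example of the UDF lifecycle (a sketch; project ID 4 and the function name are illustrative, and the UDF source format is not specified in this document, so the source string is left as a placeholder):
udf_src = "..."              # source string of a function named 'my_udf'
validate_udf(4, udf_src)     # throws if the source fails to compile
upload_udf(4, udf_src)       # throws if 'my_udf' already exists in the Project
update_udf(4, udf_src)       # throws if 'my_udf' does not exist yet
delete_udf(4, 'my_udf')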
- tql.weight(expr, name: str = '_weight') zeenk.tql.column.Column
Creates a weight column from a TQL expression. The expression provided to weight() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default weight value of 1.0. Weight columns will automatically be named “_weight”. It is expected that a dataset will have at most one weight column.
- Parameters
expr – A TQL expression string that returns a numeric value
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
Subpackages
Submodules
tql.column module
- class tql.column.Column(name: str, expression: str, type: str)
Bases:
object
A column is composed of a selectable TQL expression, a friendly name, and a data type. Columns can be of type: label, weight, tag, or metadata
- debugger(project, theme: str = 'light')
Creates an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
project – The project number or name to run the expression against
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
- get_expression() str
Gets the expression of this Column
- Returns
The expression of this Column
- Return type
str
- get_name() str
Gets the name of this Column
- Returns
The name of this Column
- Return type
str
- get_type() str
Gets the type of this Column
- Returns
The type of this Column
- Return type
str
- is_feature() bool
Gets whether this Column is a FeatureColumn
- Returns
Whether this Column is a feature
- Return type
bool
- is_label() bool
Gets whether this Column is a label Column
- Returns
Whether this Column is a label
- Return type
bool
- is_weight() bool
Gets whether this Column is a weight Column
- Returns
Whether this Column is a weight
- Return type
bool
- json() dict
Gets a Python dict representation of this Column
- Returns
A Python dict representation of this Column
- Return type
dict
- keys() tuple
Gets the keys in the Python dict that represents a Column object
- Returns
A tuple of the keys in the Python dict that represents a Column object
- Return type
tuple
- name(name: str)
Sets the name of this Column
- Parameters
name (str) – The name of this Column
- Returns
This Column
- Return type
Column
- type(type: str)
Sets the type of this Column. Available types:
‘CATEGORICAL’, ‘NUMERICAL’, ‘LABEL’, ‘WEIGHT’, ‘TAG’, ‘METADATA’
- Parameters
type (str) – One of the available Column types
- Returns
This Column
- Return type
Column
- validate(project_id: int)
Validates that the Column is syntactically correct before execution
- Parameters
project_id (int) – The project ID to validate against
- Returns
Throws an error if the column is not valid
- Return type
None
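Example (builder-style chaining; name() and type() each return this Column):
c = col('request.bid').name('bid').type('NUMERICAL')
c.get_name()   # 'bid'
c.json()       # Python dict representation of the column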
- class tql.column.ColumnFilters(**kwargs)
Bases:
object
A ColumnFilter is a filter that can be applied to a Column
- json() dict
Gets a Python dict representation of this ColumnFilters
- Returns
A Python dict representation of this ColumnFilters
- min_cardinality(val: int)
Filter columns that have a cardinality less than val
- Parameters
val (int) – The minimum cardinality
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_label_sum(val)
Filter label columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_negative_count(val: int)
Filter negative columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_negative_sum(val: int)
Filter negative columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_negative_weighted_count(val: int)
Filter negative weighted columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_positive_count(val: int)
Filter positive columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_positive_sum(val: int)
Filter positive columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_positive_weighted_count(val: int)
Filter positive weighted columns that have a count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_total_count(val: int)
Filter columns that have a total count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_total_weighted_count(val: int)
Filter weighted columns that have a total count less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
- min_weighted_label_sum(val: int)
Filter weighted label columns that have a sum less than val
- Parameters
val (int) – The minimum value
- Returns
This ColumnFilters
- Return type
ColumnFilters
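Example (a sketch; the thresholds are illustrative, and passing the dict from ColumnFilters().json() as the filters argument is an assumption based on the json() method above):
filters = ColumnFilters().min_cardinality(2).min_total_count(10)   # chainable, each returns this ColumnFilters
feature = categorical('request.bid', name='bid', filters=filters.json())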
- class tql.column.FeatureColumn(name: str, expression: str, type: str, filters: Optional[dict] = None)
Bases:
tql.column.Column
A FeatureColumn is a Column of type numerical or categorical.
When a Column is defined as a categorical() or numerical() FeatureColumn, the following additional filters are available: global_min_total_count, apply_charset_filter, drop_empty_rows, expand_numerical_feature, drop_numerical_zero_features, etc.
- copy()
Makes a copy of the FeatureColumn
- Returns
A copy of this FeatureColumn
- Return type
FeatureColumn
- filter(filters: Optional[dict] = None, **kwargs)
Adds filter(s) to the FeatureColumn
- Parameters
filters (dict) – The filter(s) for the FeatureColumn
- Returns
This FeatureColumn
- Return type
FeatureColumn
- get_filters() dict
Gets a Python dict of the FeatureColumn’s filters
- Returns
A Python dict of this FeatureColumn’s filters
- Return type
dict
- json() dict
Gets a Python dict representation of this FeatureColumn
- Returns
A Python dict representation of this FeatureColumn
- Return type
dict
- keys() tuple
Gets the keys in the Python dict that represents this FeatureColumn object
- Returns
A tuple of the keys in the Python dict that represents this FeatureColumn object
- Return type
tuple
- tql.column.categorical(expr, name: Optional[str] = None, filters: Optional[dict] = None) tql.column.FeatureColumn
Creates a categorical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.column.col(c: any, name: Optional[str] = None, type: Optional[str] = None, filters: Optional[dict] = None)
Creates a column from a TQL expression or given input, optionally providing name, type, and filters. The first argument can be a variety of formats, including:
classes or subclasses of type Column
dictionary containing keys for name, expression, and type
raw strings, which will be interpreted as TQL expressions.
The created column is by default unnamed, and will be assigned a name if used in a TQL
select(...)
statement. The default type of columns is ‘METADATA’.
- Parameters
c (any) – A TQL expression string, column object, dictionary, list, tuple, or FeatureColumn object
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
type (str) – Optionally provide a type for the column. If not provided, the column type will be METADATA
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A Column or FeatureColumn object to be provided to
select(...)
- tql.column.label(expr, name: str = '_label') tql.column.Column
Creates a label column from a TQL expression. The expression provided to label() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default label value of 0.0. Label columns will automatically be named “_label”. It is expected that a dataset will have at most one label column.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.column.metadata(expr, name: Optional[str] = None) tql.column.Column
Creates a metadata column from a TQL expression. Metadata columns will return the expression values “as is”, meaning they will not be post-processed with charset filtering or expansion of numerical columns. Metadata columns are also not subject to column filters. Metadata columns are the default column type and are often the correct choice for arbitrary datasets that are not specifically intended to be consumed by a ML training package.
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.column.numerical(expr, name: Optional[str] = None, filters: Optional[dict] = None) tql.column.FeatureColumn
Creates a numerical feature column from a TQL expression, optionally providing name and filters
- Parameters
expr – A TQL expression string
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
filters (dict) – Optionally provide a dictionary of column filter conditions for categorical or numerical filters
- Returns
A FeatureColumn object to be provided to
select(...)
- Return type
FeatureColumn
- tql.column.spaces(string: str, number: int = 2, pad: str = ' ')
Strip leading and trailing newlines and pad newlines with ‘number’ of ‘pad’ characters.
- tql.column.tag(expr, name: str = '_tag') tql.column.Column
Creates a tag column from a TQL expression. The expression provided to tag() is expected to return a non-null value for all rows. Typically this expression will uniquely identify the row, which is useful for debugging and tracing datasets later. Uniqueness is not required for the return value, but highly encouraged. Tag columns will automatically be named “_tag”. It is expected that a dataset will have at most one tag column.
- Parameters
expr – A TQL expression string that returns a unique identifier for the row
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
- tql.column.weight(expr, name: str = '_weight') tql.column.Column
Creates a weight column from a TQL expression. The expression provided to weight() is expected to return a numeric value for all rows. If a numeric or NaN/infinite value is not returned for any row, it will be replaced with the default weight value of 1.0. Weight columns will automatically be named “_weight”. It is expected that a dataset will have at most one weight column.
- Parameters
expr – A TQL expression string that returns a numeric value
name (str) – Optionally provide a name for the column. If not provided a default name will be assigned
- Returns
A Column object to be provided to
select(...)
- Return type
Column
tql.column_utils module
- tql.column_utils.parse_feature(s: str) tuple
Parse a feature string into its categorical and numerical components. For example, parse_feature(‘key:1.0’) produces (‘key’, 1.0)
- Parameters
s (str) – A feature string
- Returns
A tuple of (categorical_value, numerical_value)
- Return type
tuple
tql.columnset module
- tql.columnset.constant()
Used in select(…) statements to select the FeatureColumn “1.0” with name “constant”
- Returns
A tuple with the FeatureColumn “1.0” with name “constant”
- Return type
tuple
- tql.columnset.event_metadata() tuple
To be used in select(…) statements for returning timeline.id, id, datetime, and type
- Returns
A tuple of Columns
- Return type
tuple
- tql.columnset.event_time() tuple
To be used in select(…) statements for returning timestamp and duration
- Returns
A tuple of Columns
- Return type
tuple
- tql.columnset.history_value(project_id=None, name='', filter_type='request', event_fields='impression', custom_value='', custom_function='', days=7, offset='0', cumulative=True, recent_k=None, oldest_k=None, aggregation='COUNT', custom_agg='', weight='1', rate=False, return_value='COALESCE(value,0)', bins='[0,1,2,3]', feature_type='', filters={}, type='INCREMENTALITY', verbose=False)
- Aggregate over a history of events, by:
filtering to a specific type and time window of events,
extracting a field value for each event,
summarizing the events’ values using an aggregation function,
(optional) applying daily-rate transformations,
(optional) and/or binning transformations.
- Parameters
name (str) – Name of the feature set. If name=’’ (default), hash the inputs to create a name.
filter_type (str) – Filter timeline.events to only events with type = '{filter_type}', e.g., a Python string such as 'request'.
event_fields (list of str) – Create features that extract event_type.field for each field in event_fields (e.g., request.impression extracted using the string "impression" or the list ["impression", "timestamp"]).
custom_value (str) – An optional custom inline/lambda expression to extract arbitrary functions of values instead of using event_type.field. For example, IF(GET_PROPERTY(x,'request.impression'),5,0) will return 5 for all records with truthy values of request.impression. Use x or ${@x} to reference the inline/lambda function variable.
custom_function (str) – An optional custom inline/lambda expression to apply to the values extracted from event_type.field or custom_value. For example, COALESCE(x,0) will replace all null values with zeros. Use x or ${@x} to reference the inline/lambda function variable. custom_function may be unnecessary or redundant if custom_value is used.
days (list) – A list of at least two histogram knots/bin edges, e.g., [0, 1.5, 3, 7], in units of days. A scalar is also allowed, implying a list with zero prepended: 7 means [0,7].
offset (float/str) – Number of days to shift the edges away from the sample time. Typically small quantities such as 60 seconds = 60/(24*3600).
cumulative (boolean) – Should the histograms start at days[0] (cumulative=True) or days[d-1] (cumulative=False)?
recent_k – Input k = 0,1,2,… to truncate the time-filtered event list to just the k most recent events within the days time window.
oldest_k – Input k = 0,1,2,… to truncate the time-filtered event list to just the k oldest events within the days time window.
aggregation (str) – Aggregation function to apply to the event_fields values: COUNT, AVG, SUM, MIN/AG_MIN, MAX/AG_MAX, MODE, IS_NULL, RECENT, OLD, RECENT_#, OLD_#, CUSTOM. To ignore nulls, these additional aggregations can be used: COUNT_NS, AVG_NS, SUM_NS, MIN_NS/AG_MIN_NS, MAX_NS/AG_MAX_NS, MODE_NS.
custom_agg (str) – If aggregation=='CUSTOM', use this custom expression on events_value to define the aggregate value to return. For example, AVG(MAP(events_value, (x) -> EXP(x))) would average the exponential of each event’s value.
weight – (Testing) Event-level weight expression. Defaults to 1.
rate (boolean) – True transforms the post-aggregation value to a daily rate based on the difference in days bins.
return_value (str) – An optional custom final transformation to apply to the post-aggregation and post-rate value. For example, if you would like to return a log transformation of the value, use LOG(value). This is applied prior to bins (if enabled).
bins (str) – Bin the aggregate value using bins, e.g., [0,1,2,3]. If empty, skip binning and return the value.
feature_type (str) – Defaults to empty '': if binning is enabled this becomes “CATEGORICAL”; otherwise it defaults to “NUMERICAL”. The defaults can be explicitly overridden by setting this to “NUMERICAL” or “CATEGORICAL”.
filters (dict) – Feature filters to apply to all of the features.
verbose (boolean) – Print each feature’s name as it is created by the for-loops over the list-compatible inputs field, agg, and days.
- Returns
List of history features.
- Return type
list
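Example (a minimal sketch against the sample 'request' events):
features = history_value(
    filter_type='request',
    event_fields='impression',
    days=[0, 1, 7],           # histogram bin edges, in days
    aggregation='SUM',
    bins='')                  # empty: skip binning and return the numerical value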
- tql.columnset.kernelize(features, name: Optional[str] = None, treatment_filter_event: Optional[str] = None, event_group_where_event: Optional[str] = None, intended_action: Optional[list] = None, actual_action: Optional[list] = None, treatment_model: Optional[str] = None, kernel_types: Optional[list] = None, opportunity_conf: Optional[zeenk.tql.opportunity_conf.OpportunityConf] = None, kernel_parameters: Optional[str] = None, kernel_distribution: Optional[str] = None, filters: Optional[dict] = None, parse_kv: bool = False)
Takes a list of opportunity-level features based on user/opportunity/treatment/proxy-outcome-level fields and creates a transformed feature with SUM_OPPORTUNITIES() accumulating the effects across each opportunity. These “kernelized” features are for use in a Causmos timeline-based causal model where each observation record in a dataset is an outcome or a potential outcome (e.g., a moment in time when an outcome could have occurred). For example, we can power dataset generation using a treatment-propensity model to reduce the potential bias if treatment and the outcome are correlated. Apply this function to each list of features to be kernelized. There are several types of transformations, described as (KD = ‘treatment’, BKD = ‘baseline’, GKD = ‘ghost’, NKD = ‘nonrandom’).
- Parameters
features (list) – List of col() to be kernelized (KD, BKD, GKD, NKD). Examples: ["1.0", "IF(1.0,'L','R')"] or [{"name":"constant", "expression":"1.0", "type":"NUMERICAL"}]. Columns can be “CATEGORICAL” or “NUMERICAL”. Each categorical expression creates an expansion of features; each numerical expression multiplies/reweights the opportunity’s contribution to the SUM_OPPORTUNITIES() sum.
name (str) – Base string for the feature sets. The type of kernel will be appended, and “w” will be prepended to denote non-incrementality features for BKD, GKD, and NKD. If name='' (default), just return the list of features with the default prefixes: AKD_..., wBKD_..., etc.
treatment_filter_event (str) – Column that defines whether a treatment opportunity resulted in treatment. For example, was the advertiser’s impression shown after bidding? In a sample dataset, request.impression is the relevant field.
event_group_where_event (str) – Second where condition within the passed EventGroup (currently this is set to be the output from filtering such as dedupe, but should be made more explicit).
intended_action – List of numerical/categorical expressions defining the intended/optimal decisions, e.g., bid_amount or eligible_treatment_groups.
actual_action – List of numerical/categorical expressions defining the actual action/decision taken, e.g., IF(ghost_bid, 0, bid_amount) or assigned_treatment_group.
treatment_model (str) – String for the deployed/published treatment prediction model (GKD and NKD only). In practice, this will be the win-rate model based on the leaves.
kernel_types (list) – List of kernel types to include in the feature set. Subset of ['KD','BKD','GKD','NKD'].
opportunity_conf – An OpportunityConf() object that contains the defaults for opportunity_filter, kernel_parameters, and kernel_distribution.
kernel_parameters (str) – Calibration for kernel.days in the kernel parameters. Arrays of kernel features are created with suffixes that combine a number with units of seconds (s), minutes (m), hours (h), or days (d), such as '15m,4h,3d': 'name_feature' becomes 'name_feature-15m', etc.
kernel_distribution (str) – Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), halfnormal (h, hnorm), and halflogistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetrictriangular (s, stri), normal (n, norm), and logistic (o, log). The time-independent constant kernel (c, const) is also useful for various use cases, such as static models, in order to accumulate all opportunities and treatments for each outcome.
filters (dict) – Column filters to apply to all of the features.
parse_kv (boolean) – Try to parse ‘NUMERICAL’ input features as key-value pairs ‘k:v’. This has extra overhead, but supports more complex mixed ‘categorical:numerical’ input features.
- Returns
A list of feature sets for each kernel_type, each containing a list of kernelized features.
- Return type
list
- tql.columnset.random_partition(partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.8, 0.2], names: list = ['train', 'test'])
Use the arguments to create a random partition TQL expression like
col("MD5_PARTITIONS(timeline.id, 'my hashing seed', [0.008, 0.002], ['train','test'])")
. This can be used as a column or with .partition_by().
- Parameters
partition_var (str) – Single-line TQL subexpression
seed (str) – String to serve as the seed for the hash partitioning
shares (list) – List of the relative shares of each partition. Relative shares do not need to sum to one
names (list) – List of the names for each partition
- Returns
A metadata column with the random_partition expression for use in .partition_by() or as a column
- Return type
Column
tql.demo_projects module
- tql.demo_projects.build_acheron_1(force: bool = False)
Builds Project 1: Synthetic data from the original Acheron simulator. Loads the noumena-public dataset for 2000 users, 5 requests per user. Code for generating Acheron data is not included in the TQL product; Acheron was a purely statistical data generator.
- tql.demo_projects.build_acheron_2(force: bool = False)
Builds Project 2: Synthetic data from the original Acheron2 simulator. Same as old projects 2 and 99. Loads the noumena-public dataset for 200 users over 10 days. Acheron2 is the original name for Lethe; this dataset is equivalent to Lethe data built using the “reasonably_rich” config. This dataset has a bunch of features but was designed to exploit the configuration options of the simulator and memorialize realistic traffic distributions more than to set up specific incrementality behaviors.
- tql.demo_projects.build_lethe_3(force: bool = False)
Builds Project 3: Synthetic data from the Lethe simulator with treatment-level “ghosting” every 2 hours per user. Runs the Lethe simulation for the config “demo_set”, a config with the following populations:
A set of very active users who rarely convert
A set of users who convert well but are not at all influenced by ads
A set of users who visit due to ads but do not convert due to them
A set of highly incremental users
The population each user belongs to is captured in the “group” feature, but is influenced by the other demographic features. Additionally, there is an ad stock difference in incrementality based on Ad Size.
The idea is that anything short of an incremental conversion model will make inefficient decisions here.
Time-activity patterns are based on reasonably_rich, which drew them from RTB auction logs. The auction is always-win.
- tql.demo_projects.build_lethe_4(force: bool = False)
Builds Project 4: Synthetic data from the Lethe simulator with user-level randomization. Runs the Lethe simulation for the config “demo_set”, a config with the following populations:
A set of very active users who rarely convert
A set of users who convert well but are not at all influenced by ads
A set of users who visit due to ads but do not convert due to them
A set of highly incremental users
The population each user belongs to is captured in the “group” feature, but is influenced by the other demographic features. Additionally, there is an ad stock difference in incrementality based on Ad Size.
The idea is that anything short of an incremental conversion model will make inefficient decisions here.
Time-activity patterns are based on reasonably_rich, which drew them from RTB auction logs. The auction is always-win.
- tql.demo_projects.create_acheron2_time_series(force: bool = False)
Builds Acheron 2 TimeSeries - request, conversion, user
- tql.demo_projects.create_acheron_time_series(force: bool = False)
Creates the Acheron TimeSeries - request, conversion, user
- tql.demo_projects.create_lethe_time_series(data_url, force: bool = False)
Builds Lethe TimeSeries - bid, activity, user
- tql.demo_projects.get_project1_files_dir()
- tql.demo_projects.get_project2_files_dir()
- tql.demo_projects.get_project3_files_dir()
- tql.demo_projects.get_project4_files_dir()
- tql.demo_projects.rebuild_demo_projects(force: bool = False)
Rebuilds all demo projects
- Parameters
force (bool) – If the projects already exist, overwrite the data
tql.expression_debugger module
- tql.expression_debugger.debugger(project, expression: str = '', theme: str = 'light')
Create an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
project – The Project ID or name to run the expression against
expression (str) – The initial value of the expression, if any
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
tql.expression_utils module
- tql.expression_utils.add_lambda(expr, varname='x', project_id=None)
Add leading inline ‘varname.’ to attribute variables in a string expression or list of string expressions.
Note: An expression like '${type}' should return '${@x.type}' but instead returns '${x.type}'. For now, do not reference values using the deprecated delimiter '${type}'; just use 'type'.
- tql.expression_utils.as_variable(expr, varname) str
Save an expression to a DSL variable, e.g., ‘varname=expression;’
- Parameters
expr (str) – The expression to transform and save to variable ‘varname’
varname (str) – The variable name which the expression will be saved to
- Returns
expression with the final line set equal to varname
- Return type
str
Example:
as_variable("var1=2000; ${@var1}", 'input_name') # Returns "var1=2000; input_name=${@var1};"
- tql.expression_utils.remove_lambda(expr, varname='x') str
Remove leading inline @?varname. from variables in a string expression or list of string expressions. Regex handles many cases outlined in test_remove_lambda().
Note: An expression like ${x.var} should be returned unchanged, but returns ${var} when the inline variable collides with one of the input record types.
- tql.expression_utils.snippet(string: str, length: int = 20, postfix: str = '...')
Truncate ‘string’ to ‘length’ and append ‘postfix’.
- tql.expression_utils.spaces(string: str, number: int = 2, pad: str = ' ')
Strip leading and trailing newlines and pad newlines with ‘number’ of ‘pad’ characters.
- tql.expression_utils.to_list(x)
If x is not a list, make it a list
- tql.expression_utils.validate_dynamics(dynamics)
Validates that dynamics is a dict and has valid scale, shape, epsilon, and filters
- tql.expression_utils.validate_scale(scale)
Validate the dynamic scale format.
- tql.expression_utils.which_distribution(distribution)
Format the distribution name to use the ‘short’ version that SUM_OPPORTUNITIES() accepts.
tql.function_doc module
- class tql.function_doc.FunctionDoc(function_name=None, project_id=None)
Bases:
object
- tql.function_doc.find_function(pattern: str, project_id: Optional[int] = None, descriptions: bool = False) list
Returns DSL function names matching a regexp pattern
- Parameters
pattern (str) – The pattern to search for
project_id (int) – The Project ID
descriptions (bool) – Also search description fields
- Returns
A list of FunctionDoc
- Return type
list
- tql.function_doc.function_usage(function_name: str, project_id: Optional[int] = None)
Shows help on a DSL function (or returns an info dictionary)
- Parameters
function_name (str) – The name of the function to get usage
project_id (int) – The Project ID
- Returns
The function documentation
tql.opportunity_conf module
- class tql.opportunity_conf.OpportunityConf(opportunity_filter_expressions: list, kernels: str = '5m,15m,1h,4h,1d,3d,7d', decay_function: str = 'exp', sum_opportunities_epsilon: float = 1e-09)
Bases:
object
An OpportunityConf object is used to configure the dataset extractor for incrementality datasets when building features that use the DSL function SUM_OPPORTUNITIES().
When modeling the relationship between the time series of treatment opportunities and outcomes, assertions about timing and the shapes of those relationships must be made. For example, are opportunities typically followed by an increase in the number (or value) of outcomes due to the causal effect of treatment or merely due to temporal correlations (e.g., “activity bias” or other spurious form of selection bias)? Are opportunities preceded by a “run-up” in outcomes due to outcomes being a requirement for opportunities (e.g., retargeting, upselling, etc.)?
We define opportunities by asserting the time or events that represent an “opportunity to treat.” Then, when building features to model the treatment effects and control for sources of bias, we can accumulate the contributions of each opportunity and treatment across CATEGORICAL and NUMERICAL features in accordance with a hypothesized effect distribution shape. We do not have to know the shape in advance, but we have to establish boundaries on the shape by defining:
What is an opportunity?
What is the hypothesized time range relevant to the effect of the opportunity?
What set of basis shapes or distributions should we use to build up a mixture distribution of the dynamic effects/relationships between opportunities, treatments, and outcomes?
The parameters of this class encode the answers to these questions to facilitate computationally efficient strategies for modeling with SUM_OPPORTUNITIES().
See also: nanml.dsl.usage(‘SUM_OPPORTUNITIES’).
Extraction settings for incrementality datasets. OpportunityConf.SPEC has reasonable defaults which can be overridden:
- Parameters
opportunity_filter_expressions (list) – ex: [] - A list of opportunity filter expressions; a filter that defines opportunities to treat a user.
kernels (str) – ex: “5m,15m,1h” - Specify a window for effects stemming from opportunities. In the case of the exponential decay_function, this value controls the rate. Examples of valid inputs: ‘15m’, ‘2h’, ‘7d’ (15 minutes, 2 hours, 7 days). Multiple kernels are specified as a single string ‘15m, 2h, 7d’. Numbers must be integers or decimals. Letters are case insensitive. See also the PARSE_KERNELS() function. Each kernel can optionally set a distribution to override the default decay_function. For example, ‘15m-u’ designates a 15-minute scale parameter on a uniform distribution.
decay_function (str) – ex: “exp” - Effects from opportunities may take a particular shape. Positive-support distributions (short-form abbreviations): exponential (e, exp), uniform (u, unif), triangular (t, tri), half-normal (h, hnorm), and half-logistic (l, hlog). Positive- & negative-support distributions are the symmetric analogs of the positive-support distributions: laplace (a, lap), rectangular (r, rect), symmetric triangular (s, stri), normal (n, norm), and logistic (o, log).
sum_opportunities_epsilon (float) – ex: 0.0 - Set a tolerance on the magnitude of the distribution’s effect below which the feature’s value will be rounded to zero to increase sparsity and improve computational performance. Larger values increase computational efficiency at the potential cost of bias from omitting small effects due to round-off.
- describe()
Get a description of the object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the OpportunityConf.
- classmethod for_type(event_type: str)
A very common case is to want the treatment filter expression to match a specific type. This builds and returns an appropriate OpportunityConf object
- get_decay_function()
Gets the decay function from this OpportunityConf
- get_kernels()
Gets the kernels from this OpportunityConf
- get_opportunity_filter_expressions()
Gets the filters from this OpportunityConf
- get_sum_opportunities_epsilon()
Gets the epsilon from this OpportunityConf
- json() dict
Gets a dictionary representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current configuration instance as a dict
- Return type
dict
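Example (a minimal sketch; the filter expression string is illustrative):
conf = OpportunityConf.for_type('request')   # common shorthand: match opportunities by event type
conf = OpportunityConf(["type = 'request'"], kernels='15m,1h,1d', decay_function='exp')
conf.describe()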
tql.query module
- class tql.query.Query
Bases:
object
The primary use of TQL is the Query API, which extracts machine-learning-ready ResultSets from timeline data. Each row in a ResultSet is an event on a timeline, and each column is a piece of data extracted from the event using the TQL Expression Language.
- abort()
Request this Query be aborted
- static abortById(query_id)
Request Query with the given ID be aborted
- Parameters
query_id – The ID of the Query to be aborted
- copy()
Returns a copy of this Query object
- dataframe(print_payloads=False, limit=None)
An alias for results().dataframe(). If there are multiple partitions, the rows from all partitions will be concatenated together.
- Returns
The results as a dataframe
- Return type
- debugger(theme: str = 'light')
Creates an expression debugger widget - default row limit of 10 is applied automatically. Only works in Jupyter Notebooks or Jupyter Labs
- Parameters
theme (str) – The editor theme, ‘light’ or ‘dark’
- Returns
A Jupyter Notebook/Lab widget
- describe()
Gets a description of the query object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the query.
- Returns
A description of this Query object
- downsample_by(sample_rate: float = 1, pos_sample_rate: float = 1, pos_sampling_seed: str = 'label_downsample', neg_sample_rate: float = 1, neg_sampling_seed: str = 'label_downsample', key_expression: str = 'CONCAT(id, timeline.id)', salt: str = '', reweight: bool = True, max_records: Optional[int] = None, neg_pos_ratio: Optional[float] = None, interactive: Optional[bool] = None)
Downsample data specified by a query
- Parameters
sample_rate – sample rate for all records
pos_sample_rate – sample rate for positive records
pos_sampling_seed – seed for positive sampling expression
neg_sample_rate – sample rate for negative records
neg_sampling_seed – seed for negative sampling expression
key_expression – key for generating down sampling expression
salt – seed for generating down sampling expression
max_records – An approximate maximum number of records to return; if max_records is set, the interactive flag also needs to be set to get an accurate estimate
neg_pos_ratio – Desired negative-to-positive ratio, which typically should be between 1 and 10 to maximize statistical performance of model estimation. neg_pos_ratio=1 yields balanced samples, whereas neg_pos_ratio=3 has 2x as much data but only 33% lower variance, and neg_pos_ratio=10 has 5.5x as much data but only (roughly) 50% lower variance than a balanced sample.
interactive – Specify whether the query is interactive; needs to be provided if max_records is set to get an accurate estimate
reweight – Indicate whether positive/negative record weights need to be adjusted
- Returns
Query with down sampling where clause
- Return type
Query
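Example (a sketch; the label expression and ratio are illustrative):
query = select(label('request.impression'), event_metadata()).from_events('lethe4', 'request')
query = query.downsample_by(neg_pos_ratio=3, reweight=True)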
- event_var(name: str, expr: str)
Defines a single precomputed event-level variable that can be retrieved with the expression language function
EVENT_VAR('name')
. These variables will only be computed once per event and can be reused across multiple features as a speed optimization.
- Parameters
name (str) – An event variable name to use
expr (str) – An expression to be pre-computed
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- event_vars(vars: dict)
Defines a set of precomputed event-level variables that can be retrieved with the expression language function
EVENT_VAR('name')
. These variables will only be computed once per event, and can be reused across multiple features as a speed optimization.
- Parameters
vars (dict) – A dictionary of event variables to precompute
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
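Example (precompute a value once per event and reuse it across features; the expression is illustrative):
query = query.event_var('bid_value', 'COALESCE(request.bid, 0)')
bid = numerical("EVENT_VAR('bid_value')", name='bid')   # referenced from any column expression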
- external_timelines(timelines: list)
Instead of selecting from the pre-built timelines for the project specified in the .from() clause, process the query ONLY against the supplied example timelines. This is useful for unit testing or developing simple examples.
Example timelines:
timelines = [{ 'id': 'timeline_id1', 'events': [{ 'id': 1, 'timestamp': 1587772800000, 'type': 'request', 'request': { 'bid': 2.0, 'impression': 1 } }] }]
- Parameters
timelines (list) – A list of dictionaries representing timeline objects to run the query against
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- format(format)
Specify a format for the output of this query. This operator is only applicable for non-interactive queries where the results will be written to disk.
- Parameters
format – One of parquet, csv, or json
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- from_events(identifier, *types)
Selects from the given timelines, emitting one row per event. Identifier can either be a project id or project name. Optionally a list of event types can be provided. For example, if your timelines are composed of bid, activity, and user events, then .from_events(id, ‘user’) will select only user events from your timelines. This from clause is useful if you wish to build a dataset from your timeline events, and is the most common ‘from’ clause used when constructing machine learning datasets.
- Parameters
identifier – Project ID or name
*types – One or more event types to select from
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- from_timelines(identifier)
Selects from the given timelines, emitting one row per timeline. Identifier can either be a project id or project name. Functionally, this is accomplished by injecting one sampled (fake) event per timeline with timestamp at epoch time 0, filtering out all other events. This from clause is useful to compute a table of summary statistics per timeline, such as “how many click events has each user had in the last 28 days”.
- Parameters
identifier – Project ID or name
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- from_union(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)
Adds a sampling configuration object in the from_events() clause to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.
- Parameters
sampling (SamplingConf) – A SamplingConf object to append
append (bool) – Append the where conditions to the query instead of replacing
- Returns
The current TQL Query instance for further chaining/modification
- Return type
Query
- get_columns() list
Returns the Columns from the TQL query’s select() statement
- Returns
A list with a copy of the Column objects currently stored in this Query
- Return type
list
- get_event_vars() dict
Gets the event_vars of this Query if any exist
- Returns
The event_vars of this Query
- Return type
dict
- get_external_timelines() list
Returns the external timelines configured for this query, or None
- Returns
A list of timeline dictionaries, or None
- Return type
list
- get_format() str
Gets the output format (one of parquet, csv, or json) for this Query. This operator is only applicable for non-interactive queries where the results will be written to disk.
- Returns
The format of the current query
- Return type
str
- get_global_vars() dict
Gets the global variables from this Query, if any exist
- get_id() int
Gets the ID of this Query
- Returns
The ID of this Query
- Return type
int
- get_interactive() bool
Gets whether this Query is interactive
- Returns
Whether this Query is interactive
- Return type
bool
- get_label_column()
Gets the label Column if it exists
- Returns
The first label Column
- Return type
- get_limit() tuple
Gets the row limit and the timeline limit from this Query
- Returns
A tuple of the row limit and the timeline limit
- Return type
tuple
- get_opportunities()
Returns the OpportunityConf object in the current query
- Returns
OpportunityConf object in the current query
- Return type
- get_options() dict
Gets the options for this Query
- Returns
The options for this Query
- Return type
dict
- get_partition_by() str
Gets the partition key expression for this Query
- Returns
The expression that this Query was partitioned by
- Return type
str
- get_project()
Gets the Project with which this Query is associated
- Returns
The Project
- Return type
- get_project_id() int
Gets the ID of the Project with which this Query is associated
- Returns
The Project’s ID
- Return type
int
- get_row_limit() int
Gets the row limit of this Query
- Returns
The row limit
- Return type
int
- get_sampling() list
Returns the SamplingConf object(s) in the current query
- Returns
The SamplingConf object(s) in the current query
- Return type
list
- get_timeline_limit() int
Gets the timeline limit from this Query
- Returns
The timeline limit
- Return type
int
- get_timeline_sample_rate()
Gets the timeline sample rate
- Returns
The timeline sample rate
- get_timeline_vars() dict
Gets the timeline vars of this Query, if any exist
- Returns
The timeline vars
- Return type
dict
- get_vars() dict
Gets a dictionary of global_vars, timeline_vars, and event_vars
- Returns
A dictionary of global_vars, timeline_vars, and event_vars
- Return type
dict
- get_weight_column()
Gets the weight Column if it exists
- Returns
The first weight Column
- Return type
- get_where()
Gets the where filters of this Query
- Returns
The where filters of this Query
- global_var(name: str, expr: str)
Defines a single precomputed global variable that can be retrieved with the expression language function `GLOBAL_VAR('name')`. These variables will only be computed once per query and can be reused across multiple rows and multiple features per row as a speed optimization. The global variable 'timeline_stats' is pre-defined for all queries as a dictionary/map object with the following keys: [min_timestamp, max_timestamp, timeline_count, event_count, event_min_timestamp, event_max_timestamp, event_count_min, event_count_max, event_counts_by_type, time_series_min_max_timestamps]
- Parameters
name (str) – A variable name to use
expr (str) – An expression to be pre-computed
- Returns
The current TQL Query instance for further chaining/modification
- Return type
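A small sketch of the intended use: compute a value once per query, then reference it from column expressions (the import path and the arithmetic are illustrative assumptions):

```python
from zeenk import tql  # assumed import path

q = (
    tql.select("timestamp / GLOBAL_VAR('millis_in_day')")  # reuse the precomputed value
       .from_events('my_project')
       .global_var('millis_in_day', '24*3600*1000')        # computed once per query
)
```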
- global_vars(vars)
Defines a set of precomputed global variables that can be retrieved with the expression language function `GLOBAL_VAR('name')`. These variables will only be computed once per query, and can be reused across multiple rows and multiple features per row as a speed optimization. The global variable 'timeline_stats' is pre-defined for all queries as a dictionary/map object with the following keys: [min_timestamp, max_timestamp, timeline_count, event_count, event_min_timestamp, event_max_timestamp, event_count_min, event_count_max, event_counts_by_type, time_series_min_max_timestamps]
- Parameters
vars (dict) – A dict of global variables to precompute
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- json() dict
Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation
- Returns
The current TQL Query as a dict
- limit(rows=None, timelines=None)
Imposes a limit on the number of timelines iterated over, or rows returned. If timelines=<N> is specified, then at most N timelines will be evaluated in the query. If rows=<N> is specified, then at most N rows will be returned in the results. .limit(5) is a typical usage of this operator, and is the closest analog to the traditional SQL operator. For interactive queries, imposing a timeline limit is not required, as specifying limit(rows=N) is sufficient to shortcut the evaluation. However, for asynchronous queries executed in a distributed environment such as Spark, it is often useful to specify both timeline and row limits, as adding a timeline limit can result in less data being read from disk, and hence faster execution time.
- Parameters
rows (int) – The maximum number of rows to return
timelines (int) – The maximum number of timelines to evaluate.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- classmethod load(query_id)
Loads a Query object from the given ID
- Parameters
query_id (int) – The ID of the Query to load
- Returns
The Query from the given ID
- Return type
- opportunities(filters, distribution=None, scale=None, epsilon=None)
Defines events that constitute “opportunity to treat” for this dataset. For example, in the online advertising space, opportunities to treat would constitute all bid request events, i.e. “we had an opportunity to buy an ad on the user.” In a clinical drug trial, opportunity would constitute all persons who apply for a drug trial study. Filters will be evaluated as booleans (see docs on the .where() operator). Optionally a distribution function can also be supplied, which defines the hypothesized shape of the causal effect of the opportunities over time. Short, medium, or long string representations of the distribution functions are accepted:
'c', 'const', 'constant'
'e', 'exp', 'exponential'
'l', 'lap', 'laplace'
'u', 'unif', 'uniform'
'r', 'rect', 'rectangular'
't', 'tri', 'triangular'
's', 'stri', 'symmetrictriangular'
'h', 'hnorm', 'halfnormal'
'n', 'norm', 'normal'
'l', 'hlog', 'halflogistic'
'o', 'log', 'logistic'
Also optionally, a string list of time scales can be applied, defining the points in time at which the distribution should be evaluated. Any numerical value of time is allowed, at the scale of seconds (s), minutes (m), hours (h), and days (d). For example, this is a valid scale string: '.5m,4h,1d,3d', which would correspond to 30 seconds, 4 hours, 1 day, and 3 days.
Also optionally, an epsilon can be provided, which defines a minimal precision below which a feature should be rounded down to zero, e.g., 1e-8 would lead a feature value of 5e-9 to be returned as zero.
- Parameters
filters – One or more TQL filter expressions
distribution – The decay function shape (default exponential)
scale – The time scale over which to evaluate the causal effect of opportunities (default 5m,1h,4h,1d,3d,7d)
epsilon – The numerical epsilon to use (default 1E-6)
- Returns
The current TQL Query instance for further chaining/modification
- Return type
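A sketch marking bid events as opportunities, with an exponential decay evaluated at a custom set of time scales (the argument values are illustrative, not the defaults):

```python
# Assumes q is an existing Query; the filter expression is illustrative.
q = q.opportunities(
    "type = 'bid'",
    distribution='exp',    # short alias for exponential
    scale='.5m,4h,1d,3d',  # 30 seconds, 4 hours, 1 day, 3 days
    epsilon=1e-8,          # round feature values below 1e-8 down to zero
)
```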
- options(max_columns=None, global_min_total_count=None, apply_feature_filters=None, apply_charset_filter=None, drop_empty_rows=None, expand_numerical_features=None, drop_numerical_zero_features=None, throw_expression_errors=None, debug_expressions=None, fill_na=None, numerical_feature_precision=None, numerical_feature_epsilon=None, drop_constant_feature_columns=None, fix_column_names=None, allow_invalid_column_expressions=None)
Specify options for this query.
- Parameters
max_columns – the maximum number of columns to return; the top N columns are chosen by count of non-null values.
global_min_total_count – require that at least this many rows contain a non-null value, or drop the column.
apply_feature_filters – flag on/off applying feature filtering (default true)
apply_charset_filter – flag on/off cleaning the values of numerical and categorical feature columns (default true)
drop_empty_rows – if on, remove rows that have no non-null values. (default false)
expand_numerical_features – expand numerical feature arrays into multiple columns. (default false)
drop_numerical_zero_features – drop numerical feature columns that contain all zeros. (default false)
drop_constant_feature_columns – drop numerical feature columns that are constant (default false)
throw_expression_errors – use “fail fast” behavior with invalid expressions (default false)
debug_expressions – return extended debugging information about TQL expression evaluation with the result set.
fill_na – replace all numerical features with non-numeric values with 0.0.
numerical_feature_precision – how many decimal places to return.
numerical_feature_epsilon – abs(val) < eps will be rounded down to zero.
fix_column_names – specify if backend should rename duplicate column names. (default true)
allow_invalid_column_expressions – return errors about invalid column expressions
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- partition_by(expr=None, partition_var: str = 'timeline.id', seed: str = '', shares: list = [0.008, 0.002], names: list = ['train', 'test'])
Specify a partition-key TQL expression for this dataset. Partition-key TQL expressions MUST return a non-null value for every row, and the value should have low cardinality, since a separate block of data (i.e. folder of files for asynchronous queries) will be created for each distinct partition value. One common use of partitioning is to generate reproducible train/test splits of your data for machine learning training. For example, `.partition_by("IF(MD5_MOD(timestamp, 10) > 8, 'train', 'test')")` creates a reproducible 80/20 split of your data, which can be read separately by your training and testing routines. It is reproducible because the split is computed based on an attribute of the event, such as timestamp. Another common use of partition_by() is to split your dataset into logical groupings such as by day: `.partition_by('date(timestamp)')`, to be read with reporting/BI software.
If the 'expr' argument is not set, the other arguments (partition_var, seed, shares, names) are used to create an expression like `Expression("MD5_PARTITIONS(timeline.id, 'the dead sea', [0.008, 0.002], ['train','test'])")` and set 'partition_key_expression' to that expression.
- Parameters
expr – the partition key expression to use.
partition_var – Single-line TQL subexpression.
seed – String to serve as the seed for the hash partitioning.
shares – List of the relative shares of each partition.
names – List of the names for each partition.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
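Two equivalent sketches of a reproducible 80/20 train/test split; the first passes the expression from the example above verbatim, while the second lets partition_by() build an MD5_PARTITIONS expression (the seed and shares here are illustrative, not the defaults):

```python
# Explicit partition-key expression:
q = q.partition_by("IF(MD5_MOD(timestamp, 10) > 8, 'train', 'test')")

# Or built from keyword arguments:
q = q.partition_by(partition_var='timeline.id', seed='my seed',
                   shares=[0.8, 0.2], names=['train', 'test'])
```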
- sampling(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)
Adds a sampling configuration object to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.
- Parameters
sampling (SamplingConf) – A SamplingConf object to append
append (bool) – Append the sampling configuration to the query instead of replacing it.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- select(*cols, append: bool = False)
Defines the columns on this TQL query object. Unless append=True, this replaces any pre-existing columns that were set with previous calls to select().
- Parameters
cols – One or more TQL column objects or expressions
append (bool) – Append cols to the query’s columns instead of replacing the query’s columns
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- show()
Shows the ResultSet dataframe as a pretty-printed table on stdout.
- state()
Gets the current state of this Query
- submit(interactive=True, wait=True, print_payloads=False, analyze=False, spark=None)
Executes this query in Icarus (via HTTP request) and fetches the results. If the interactive flag is true, this query will be executed immediately and the returned ResultSet will have .rows(). This is suitable for interactive development of TQL expressions, as the results will be immediate. If the interactive flag is false, the query will be executed asynchronously, possibly in a distributed computing environment such as on a Spark cluster. In this case, the ResultSet will not have .rows(), but will instead have .data_path(), a directory to which the query results will be written upon successful execution.
- Parameters
interactive (bool) – Whether this query should be executed synchronously or asynchronously
wait (bool) – Wait for the dataset compile to complete and show progress/status. Otherwise just make the webservice call and return.
print_payloads (bool) – Prints the request and response payloads from Icarus
analyze (bool) – Display extended ResultSetMetrics
spark (bool) – Whether to submit the query to a Spark Cluster
- Returns
A ResultSet object
- Return type
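A sketch of both execution modes (assuming q is a constructed Query):

```python
rs = q.submit()                    # interactive: rows are returned immediately
df = rs.pandas_dataframe()

rs = q.submit(interactive=False)   # asynchronous, e.g. on a Spark cluster
# Results are written to disk; fetch the output directory of the default partition.
print(rs.default_partition().data_path())
```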
- timeline_sample_rate(sample_rate)
Sets the sample rate on the Timelines
- Parameters
sample_rate – A floating point value in the interval (0, 1]
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- timeline_var(name: str, expr: str)
Defines a single precomputed timeline-level variable that can be retrieved with the expression language function `TIMELINE_VAR('name')`. These variables will only be computed once per timeline and can be reused across multiple rows and multiple features per row as a speed optimization.
- Parameters
name (str) – A variable name to use
expr (str) – An expression to be pre-computed
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- timeline_vars(vars: dict)
Defines a set of precomputed timeline-level variables that can be retrieved with the expression language function `TIMELINE_VAR('name')`. These variables will only be computed once per timeline, and can be reused across multiple rows and multiple features per row as a speed optimization.
- Parameters
vars (dict) – A dictionary of timeline variables to precompute
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- treatments(filters, scale=None, distribution=None, epsilon=None)
An alias for opportunities(...), i.e. events that constitute "opportunity to treat".
- udf(*udfs, append: bool = False)
Attaches UDFs (user-defined functions) to a Query
- Parameters
udfs – One or more UDFs; each UDF may be either a string or a TQL Column object
append (bool) – Append the UDFs to the query's existing UDFs instead of replacing them
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- union(sampling: zeenk.tql.sampling.SamplingConf, append: bool = False)
Adds a sampling configuration object to this query. Sampling objects add new events to the timeline during execution with generated timestamps. This function can be called multiple times, in which case multiple blocks of generated events will be added to each timeline.
- Parameters
sampling (SamplingConf) – A SamplingConf object to append
append (bool) – Append the sampling configuration to the query instead of replacing it.
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- validate()
Validates that the Query is runnable before execution
- Returns
None. Throws an error if the query is not valid
- Return type
None
- vars(global_vars: Optional[dict] = None, timeline_vars: Optional[dict] = None, event_vars: Optional[dict] = None)
Convenience method for passing a dictionary of all of a query's precomputed variables. Optional keys: 'global_vars', 'timeline_vars', and/or 'event_vars'. Can be used in conjunction with `.get_vars()`:
>>> q = select("1").from_events(1).limit(1).vars(global_vars={'foo':'bar'})
>>> q.get_vars()
{'global_vars': {'foo': 'bar'}, 'timeline_vars': {}, 'event_vars': {}}
- Parameters
global_vars (dict) – A dictionary of key-values with valid inputs to
.global_vars()
timeline_vars (dict) – A dictionary of key-values with valid inputs to
.timeline_vars()
event_vars (dict) – A dictionary of key-values with valid inputs to
.event_vars()
- Returns
The current TQL Query instance for further chaining/modification
- Return type
- visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)
Visualize a timeline dataframe from query execution.
- Parameters
rows – the maximum number of rows to return
timelines – the maximum number of timelines to evaluate.
down_sampling – down sampling expression
size (str) – Name of column to use to encode each record’s marker’s size
shape (str) – Name of column to use to encode each record’s marker’s shape
opacity (str) – Name of column to use to encode each record’s marker’s opacity
jitter (float) – Amount of ‘jitter’ (random vertical offset to improve visibility)
size_rescale (float) – Multiplier to adjust the marker size. Defaults to 7
shape_dict (dict) – Dictionary to transform a column into shapes, e.g.: {'outcome': 'star', 'opportunity': 'square', 'treatment': 'triangle', 'default': 'circle'}
opacity_rescale (float) – Multiplier to adjust the marker opacity. None defaults to rescaling by the column’s min/max to 0 and 1
include_style_columns (bool) – Should the event data display include style columns?
ignore_nulls (bool) – Do not show key-value pairs that have `null` as the value
- Returns
A timeline viewer widget
- where(*filters, append=False)
Add one or more where clauses to the query, filtering the timeline events to only those that satisfy the given conditions. The given TQL expression(s) are evaluated as booleans. Non-boolean return values are evaluated in the following manner:
Numerical values greater than 0 evaluate to `true`, otherwise `false`
The strings 'true' and 'false' (case-insensitive) are evaluated to their boolean equivalents
Other non-numerical, non-boolean return values are considered true if the value is not null
In case of multiple where clauses, only the last one will be honored, i.e. the following constructs are logically equivalent: `.where('cond1').where('cond2') == .where('cond2')`
Multiple filters in a single where clause are applied using AND, i.e. the following constructs are logically equivalent: `.where('cond1', 'cond2') == .where('cond1 AND cond2')`. Use the following TQL syntax to define an OR clause: `.where('cond1 OR cond2')`
- Parameters
filters – One or more TQL columns or expressions that returns a boolean
append (bool) – Append the where conditions to the query instead of replacing
- Returns
The current TQL Query instance for further chaining/modification
- Return type
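A sketch of the AND/OR and append semantics described above (filter expressions are illustrative):

```python
q = q.where("type = 'click'", "bid > 0")   # equivalent to type = 'click' AND bid > 0
q = q.where("type = 'click' OR bid > 0")   # OR must be written inside one expression
q = q.where("timestamp > 0", append=True)  # add to, rather than replace, existing filters
```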
- tql.query.load_query(id: int) tql.query.Query
Loads the query from the given ID
- Parameters
id (int) – The ID of the query
- Returns
A new Query instance
- Return type
- tql.query.select(*cols) tql.query.Query
Creates a new query object from one or more columns. Columns can be defined as TQL expression strings, or wrapped using one of the provided TQL column functions `label()`, `weight()`, `tag()`, `categorical()`, `numerical()`, or `metadata()`. Typically `select(...)` will immediately be followed by `.from_timelines(...)` or `.from_events(...)` during query construction.
- Parameters
cols – One or more TQL column objects or expressions
- Returns
A new tql.query.Query instance for further chaining/modification
- Return type
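An end-to-end construction sketch tying the pieces together. The import path and the label()/categorical()/numerical() signatures are assumptions based on the column functions listed above:

```python
from zeenk import tql  # assumed import path

rs = (
    tql.select(
        tql.label('conversion'),                     # label column (signature assumed)
        tql.categorical('type', name='event_type'),  # categorical feature
        tql.numerical('bid'),                        # numerical feature (signature assumed)
    )
    .from_events('my_project')
    .where('timestamp > 0')
    .limit(rows=1000)
    .submit()
)
df = rs.pandas_dataframe()
```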
tql.query_templates module
tql.resultset module
- class tql.resultset.Partition(json, columns, data_format, interactive=True)
Bases:
object
A Partition is a subset of a ResultSet. If the ResultSet was partitioned, there will be a list of Partitions in the Resultset. If the ResultSet was not partitioned, there will only be the default Partition in the ResultSet.
- columns() list
Gets the Columns from the Query
- Returns
A list of Columns in the Query
- Return type
list
- data_path() str
Gets the data path of this Partition. This is only applicable for non-interactive Queries.
- Returns
The path to the data
- Return type
str
- dataframe(limit=None)
Gets this Partition’s Dataframe
- Parameters
limit (int) – The maximum number of rows to return
- Returns
A list of rows from a Spark Dataframe
- Return type
list
- name() str
Gets the name of this Partition
- Returns
The name of this Partition
- Return type
str
- pandas_dataframe()
Loads this Partition's data into a pandas dataframe and returns it
- Returns
A Pandas Dataframe
- positive_row_count() int
Gets the number of positive rows in this Partition
- Returns
The number of positive rows in this Partition
- Return type
int
- row_count() int
Gets the number of rows in the Partition
- Returns
The number of rows in this Partition
- Return type
int
- spark_dataframe(infer: bool = True)
Loads this Partition's data into a Spark dataframe and returns it
- Parameters
infer (bool) – Use inferSchema with CSV output
- Returns
A Spark Dataframe
- class tql.resultset.ResultSet(json, query=None, execution_time=None, analyze=False, interactive=True)
Bases:
object
A ResultSet is the result of a Query. It holds data as columns and rows.
- column_names() list
Gets the column names from this ResultSet as a list of strings
- Returns
A list of column names as strings
- Return type
list
- columns()
Gets the columns from this ResultSet as a list of Columns
- Returns
A list of Columns
- dataframe(limit=None)
Gets a dataframe from the ResultSet
- Parameters
limit (int) – The upper limit of rows to return
- Returns
A dataframe of rows and columns
- default_partition()
Gets the default partition from this ResultSet
- Returns
The default Partition
- Return type
- get_id()
Gets the ID of this ResultSet
- Returns
The ID of this ResultSet
- Return type
int
- get_query()
Gets the Query that generated this ResultSet
- Returns
The Query that generated this ResultSet
- Return type
- classmethod load(result_set_id, analyze: bool = False)
Loads a new ResultSet from the given ID
- Parameters
result_set_id (int) – The ResultSet ID
analyze (bool) – Whether to return detailed information about the ResultSet
- Returns
The ResultSet with the given ID
- Return type
- metrics()
Gets an HTML representation of metrics about this ResultSet
- Returns
An HTML representation of metrics about this ResultSet
- pandas_dataframe()
Gets this ResultSet as a Pandas Dataframe
- Returns
A Pandas Dataframe
- partition(partition_name='_default')
Gets a specific partition from this ResultSet
- Parameters
partition_name (str) – The name of the Partition
- Returns
The partition with the given name
- Return type
- partition_names()
Gets a list of partition names from this ResultSet
- Returns
A list of partition names
- Return type
list
- partitions()
Gets a list of Partitions from this ResultSet
- Returns
A list of partitions
- Return type
list
- positive_row_count()
Gets the number of positive rows in this ResultSet
- Returns
The number of positive rows
- Return type
int
- refresh()
Populates the non-interactive ResultSet fields with execution details
- row_count()
Gets the number of rows in the ResultSet
- Returns
The number of rows in the ResultSet
- Return type
int
- spark_dataframe(infer=True)
Gets this ResultSet as a Spark Dataframe
- Parameters
infer (bool) – Whether to use Spark to infer the datatypes
- Returns
A Spark Dataframe
- class tql.resultset.ResultSetMetrics(metrics)
Bases:
object
- column_metrics(pandas=False)
- column_value_metrics(pandas=False)
Getter for column_value_metrics from ResultSetMetrics.ColumnMetrics
- debug(pandas=False)
- dropped_expanded_columns(pandas=False)
- event_runtime(pandas=False)
- expression_compile_errors(pandas=False)
- expression_timing_stats(pandas=False)
- get_timelines_processed()
- json()
- query_summary(pandas=False)
- timeline_runtime(pandas=False)
- tql.resultset.load_resultset(id: int) tql.resultset.ResultSet
Loads the ResultSet with the given ID
- Parameters
id (int) – The ID of the ResultSet to be loaded
- Returns
The ResultSet
- Return type
tql.sampling module
- class tql.sampling.SamplingConf(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False)
Bases:
object
Allows you to manually override parameters during sample generation. See `generate_events` and `generate_importance_events`. Variables and functions both get registered in the expression context prior to the execution of the sampling expression and event attribute expressions.
- Parameters
sample_generator_expression (str) – ex. `"[MILLIS(TO_DATETIME('2021-12-01'))]"`. An expression providing a list of values at which to generate samples for each timeline. Typically, these will be a list of timestamps. The following example creates midnight timestamps from 2021-12-01 through 2021-12-10 for the timestamps at which we will sample: `days = 10; millis_in_day = 24*3600*1000; min_date = MILLIS(TO_DATETIME('2021-12-01')); iter = RANGE(0, days+1); MAP(iter, (x) -> min_date + millis_in_day * x)`. A simple example would be just `"[0]"` (timestamp of 1970-01-01 00:00:00.000) if you just wanted to generate a single record per timeline where dynamics are not relevant for any columns.
functions (list) – ex: `["function bar() { 2.0 + 2.0 }"]`. User-Defined Functions (UDFs) that get attached to the sample_generator_expression; they can be used like any other function.
variables (dict) – ex: `{"foo" : "3.3"}`. Key-value pairs used to define variables to parameterize advanced forms of sampling, such as Importance Sampling; they can be referenced in the sampling/event attribute expressions as `foo`.
attribute_expressions (dict) – ex: `{'timestamp': '${@sample}'}`. A dictionary of key-value pairs to replace dot-notation expressions such as 'id', 'timestamp', and 'type'. This is flexible enough to accommodate the development of more complex forms of sampling over time, space, and distributions of conversions (e.g., conversion name/type).
id: Identifier for each sample, e.g., `MONOTONICALLY_INCREASING_ID()`.
timestamp: An expression to define the timestamp associated with the sample generated. If sample_generator_expression generates timestamps, this field may be specified as `sample`.
type: User-provided string to identify the type of sample generated (e.g., `'my_manual_sample'`).
inherit_attributes (bool) – If the generator expression is a list of events, rather than just timestamps, initialize attribute_expressions with all of the properties of the event via `GET_PROPERTY()`.
- describe()
Get a description of the object. In Jupyter notebooks, this returns a set of HTML tables. In a regular python interactive shell or script, this will default to the String representation of the SamplingConf.
- get_attribute_expressions()
Gets the attributes of this SamplingConf
- get_functions()
Gets the functions of this SamplingConf
- get_inherit_attributes()
Gets whether this SamplingConf should inherit attributes
- get_sample_generator_expression()
Gets the generator expression of this SamplingConf
- get_variables()
Gets the variables of this SamplingConf
- json() dict
Returns a dictionary representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current settings instance as a dict
- Return type
dict
- set_attribute_expressions_type(expr_type)
Sets the type of the attribute expressions
- Parameters
expr_type – The type of the attribute expression
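A sketch that wires the parameter examples above into a query: one generated event per day for ten days, each stamped with its sampled timestamp (the import path is an assumption):

```python
from zeenk.tql.sampling import SamplingConf  # assumed import path

conf = SamplingConf(
    sample_generator_expression=(
        "days = 10; millis_in_day = 24*3600*1000; "
        "min_date = MILLIS(TO_DATETIME('2021-12-01')); "
        "iter = RANGE(0, days+1); "
        "MAP(iter, (x) -> min_date + millis_in_day * x)"
    ),
    attribute_expressions={
        'timestamp': '${@sample}',     # stamp each generated event with the sampled value
        'type': "'my_manual_sample'",  # constant type string for generated events
    },
)
q = q.sampling(conf)  # assumes q is an existing Query
```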
- tql.sampling.generate_events(sample_generator_expression: str, functions: Optional[list] = None, variables: Optional[dict] = None, attribute_expressions: Optional[dict] = None, inherit_attributes: bool = False) tql.sampling.SamplingConf
Manual sampling is similar to importance sampling but allows the user to manually override parameters such as the list of sample events (e.g., timestamps at which sampling occurs) and other properties of each generated sample event. See generate_importance_events() for an advanced use case.
- tql.sampling.generate_importance_events(num_samples: float = -1.0, min_ts: Optional[str] = None, max_ts: Optional[str] = None, time_shift_factor: float = 0.0, fraction_uniform_time: float = 0.0, sampling_distribution: str = 'exponential', sampling_events_expression: str = 'FILTER(timeline.events, (x) -> x.type=null)', sampling_kernels: str = '5m,15m,1h,4h,1d,3d,7d') tql.sampling.SamplingConf
Importance sampling can be used to retain modeling unbiasedness (avoid introducing selection bias when sampling records) while still increasing the number of records where the modeling is most interesting. For example, when modeling the causal effect of a treatment on an outcome, we would like to ensure that most of our records (whether 'positive', an outcome event, or 'negative', a non-outcome sampling event) are in the vicinity of a treatment opportunity or an outcome. By so doing, we increase the model's statistical power at deciphering the relationship between the two. In contrast, if most outcomes happen during only 10% of the sample time period, most of our observations will be during the "boring" portion of the timeline when no events of interest are occurring.
generate_importance_events helps you configure what time periods are “interesting.” You configure how many records to randomly sample for each timeline, which timestamps or events you want to increase your sampling around, the distribution (shape and scale) around each event from which you would like to randomly sample, and the probability you would like to draw from the background uniform distribution (e.g., a random point in the timeline).
In summary, one way to generate negative (non-outcome) records would be to simply draw uniformly between the start and end of the timeline’s observation window. However, we can improve upon that by instructing the extractor to generate these negatives by a configurable time-importance-weighted sampling methodology around times of timeline events.
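A configuration sketch (argument values are illustrative; see the signature above): oversample around outcome events using exponential kernels, with a 10% chance of drawing from the uniform background:

```python
from zeenk.tql.sampling import generate_importance_events  # assumed import path

conf = generate_importance_events(
    num_samples=10,             # records to sample per timeline
    fraction_uniform_time=0.1,  # 10% drawn from the uniform background
    sampling_distribution='exponential',
    sampling_events_expression="FILTER(timeline.events, (x) -> x.type = 'outcome')",
    sampling_kernels='5m,1h,1d',
)
q = q.sampling(conf)  # assumes q is an existing Query
```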
tql.timelines module
- class tql.timelines.Project
Bases:
object
A Project is an object that contains: a name, a description, UDFs, and links to TimeSeries. It is a conglomeration of one or more sets of timeseries data and timeline data. Projects with IDs under 10 are reserved as demonstration projects.
- static all()
Gets all Projects
- build_timelines(wait=True)
Builds this Project’s timelines
- delete()
Deletes this Project from the database
- description(description)
Sets a description of the Project
- from_timeseries(*ts, append=False)
Creates a Project from a TimeSeries
- Returns
A new project based on the TimeSeries
- Return type
- get_annotations()
Gets this Project’s column annotations
- get_description()
Gets a description of the Project
- get_id()
Gets the ID of this Project
- get_metadata()
Gets this Project’s metadata
- get_name()
Gets the name of this Project
- get_status()
Gets this Project's status: `SUCCESS`, `PENDING`, `COMPILING`, `RUNNING`
- get_timelines()
Gets this Project’s Timelines
- get_timeseries()
Gets this Project’s TimeSeries
- get_timeseries_names()
Gets this Project’s TimeSeries names
- json() dict
Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current Project as a dict
- Return type
dict
- save()
Saves the metadata (description, etc.) of this Project to the database
- class tql.timelines.TimelineStats(json)
Bases:
object
Stats about the corresponding timeline
- get_event_count()
- get_event_counts_by_type()
- get_largest_timeline_size()
- get_max_timestamp()
- get_min_timestamp()
- get_smallest_timeline_size()
- get_timeline_count()
- get_timeseries_stats()
- class tql.timelines.Timelines(json, timeseries=None, annotations=None)
Bases:
object
A Timeline is a collection of events identified by a common join key, for example a logged in user_id or web cookie, and sorted by timestamp
- get_annotation(attr)
- get_attributes()
- get_attributes_table()
- get_created()
- get_data_path(for_spark=False)
- get_example()
- get_id()
- get_sample_data_path(for_spark=False)
- get_schema()
- get_state()
- get_statistics()
- get_updated()
- pandas_dataframe()
- spark_dataframe()
- tql.timelines.create_project(name: str, or_update: bool = False) tql.timelines.Project
Creates a new project. This project will require further configuration before timelines can be created. If the project already exists and you wish to update it, use `load_project(name)`, or give the `or_update=True` option here.
- Parameters
name (str) – The name of the Project
or_update (bool) – Whether to update the Project
- Returns
The Project that has just been created
- Return type
- tql.timelines.drop_project(name_or_id, if_exists: bool = False)
Deletes the project by name or ID
- Parameters
name_or_id – The name (str) or ID (int) of the Project
if_exists (bool) – Only drop the Project if it exists
- tql.timelines.load_project(project_identifier, fail_if_not_found: bool = True) tql.timelines.Project
Loads the specified project by name, or throws TQLAnalysisException if not found
- Parameters
project_identifier – The name (str) or ID (int) of the Project
fail_if_not_found (bool) – Whether to throw an exception if the Project is not found
- Returns
The Project with the specified name or ID
- Return type
- tql.timelines.show_projects()
Shows the available projects
tql.timeseries module
- class tql.timeseries.TimeSeries(name)
Bases:
object
The mapping between your data source and TQL is called a TimeSeries. It is important to note that a TimeSeries object in TQL is not the data itself, but merely a recipe for how to read the data out of storage into TQL.
- analyze(columns=None, row_limit=-1, sample_rate=1.0, top_n_limit=5, print_json=False, wait=True)
Analyzes this TimeSeries
- annotate_columns(annotations)
Sets the column annotations of this TimeSeries
- from_files(data_path, format=None, has_csv_header=False)
Sets the files to be used for this TimeSeries
- Parameters
data_path – The path of data to load
format – What format the data is in
has_csv_header (bool) – Whether the csv data has column headers
- Returns
This TimeSeries for further chaining
- Return type
- from_sql(*sql_stmts)
Sets the SQL statements for this TimeSeries
- Parameters
*sql_stmts – SQL statements
- Returns
This TimeSeries for further chaining
- Return type
- from_url(data_url)
Sets the URL for this TimeSeries
- Parameters
data_url (str) – The URL at which the data is located
- Returns
This TimeSeries for further chaining
- Return type
- get_annotation(col_name: str)
Gets the specific annotation of this TimeSeries by name
- get_annotations()
Gets the column annotations of this TimeSeries
- get_data_path()
Gets the data path of this TimeSeries
- get_duration_col()
Gets the duration column of this TimeSeries
- get_example()
- get_format()
Gets the format of this TimeSeries data
- get_metadata()
Gets the metadata of this TimeSeries
- get_name()
Gets the name of this TimeSeries
- get_sql_statements()
Gets the SQL statements of this TimeSeries
- get_timeline_id_col()
Gets the timeline id column of this TimeSeries
- get_timestamp_col()
Gets the timestamp column of this TimeSeries
- identified_by(timeline_id_col, timestamp_col=None, duration_col=None)
Sets what the TimeSeries is identified by
- Parameters
timeline_id_col – The name of the id column
timestamp_col – The name of the timestamp column
duration_col – The name of the duration column
- Returns
This TimeSeries for further chaining
- Return type
- json() dict
Returns a dict representation of this object. Used when sending to Icarus (the web server) for evaluation.
- Returns
The current TQL TimeSeries as a dict
- metadata(meta)
Sets the metadata for this TimeSeries
- Parameters
meta – The metadata
- Returns
This TimeSeries for further chaining
- Return type
- pandas_dataframe()
Gets a Pandas Dataframe of this TimeSeries
- Returns
A Pandas Dataframe
- read_option(key, value)
Sets the Spark read options for this TimeSeries
- Parameters
key – The read option key
value – The read option value
- Returns
This TimeSeries for further chaining
- Return type
- spark_dataframe()
Gets a Spark Dataframe of this TimeSeries
- Returns
A Spark Dataframe
- validate()
Validates this TimeSeries
- visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)
Shows a Jupyter Notebook/Labs visualizer widget
- class tql.timeseries.TimeSeriesStats(json)
Bases:
object
Statistics about the corresponding TimeSeries
- static load(id)
- tql.timeseries.create_timeseries(name: str) tql.timeseries.TimeSeries
Creates a TimeSeries with the given name
- Parameters
name (str) – The name of the TimeSeries
- Returns
A new TimeSeries with the given name
- Return type
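A sketch of the typical flow from raw files to built timelines, combining create_timeseries with the Project functions above (the paths, column names, and import path are illustrative assumptions):

```python
from zeenk import tql  # assumed import path

ts = (
    tql.create_timeseries('purchases')
       .from_files('/data/purchases/*.csv', format='csv', has_csv_header=True)
       .identified_by('user_id', timestamp_col='event_time')
)

project = tql.create_project('my_project', or_update=True).from_timeseries(ts)
project.build_timelines()  # compile the timeseries events into timelines
```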
tql.udf module
- class tql.udf.UDF(udfs)
Bases:
object
A User-Defined Function - written in TQL expression language and associated with a Project
- get_udf_map()
Gets the defined UDFs as a Python dictionary
- tql.udf.delete_udf(project_id: int, function_name: str)
Deletes a UDF for a specific Project by function name
- Parameters
project_id (int) – ID of the Project to delete udf from
function_name (str) – Name of the udf to be deleted
- tql.udf.get_udf(project_id: int, function_name: str) tql.udf.UDF
Retrieves a specific UDF from a Project
- Parameters
project_id (int) – ID of the Project from which to retrieve the UDF
function_name (str) – Name of the UDF to retrieve
- Returns
The UDF, or throws a TQLAnalysisNotFound exception if it does not exist
- Return type
- tql.udf.list_udfs(project_id: int) tql.udf.UDF
Retrieves all UDFs defined for a Project
- Parameters
project_id (int) – ID of the Project from which to retrieve UDFs
- Returns
All existing UDFs for the Project
- Return type
- tql.udf.update_udf(project_id: int, *function_src)
Updates UDF(s) for the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function doesn't exist in the Project yet
- tql.udf.upload_udf(project_id: int, *function_src)
Uploads UDF(s) to the specified Project
- Parameters
project_id (int) – ID of the Project to upload UDFs to
function_src – One or more function source strings
Throws TQLAnalysisException if the function already exists in the Project
- tql.udf.validate_udf(project_id: int, *function_src)
Validates UDF(s)
- Parameters
project_id (int) – ID of the Project to validate UDFs on
function_src – One or more function source strings
Throws TQLAnalysisException, printing the error message and location, if there is any compilation error in the UDF
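A sketch of the validate/upload/list flow, reusing the example function source from the SamplingConf docs (the project ID and import path are illustrative):

```python
from zeenk.tql import udf  # assumed import path

src = "function bar() { 2.0 + 2.0 }"
udf.validate_udf(1, src)    # throws TQLAnalysisException on compile errors
udf.upload_udf(1, src)      # throws if 'bar' already exists in the project
print(udf.list_udfs(1).get_udf_map())  # all UDFs for the project, as a dict
```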
tql.validation module
- exception tql.validation.TQLAnalysisException(reason)
Bases:
Exception
General TQL Exception
- get_expr_compile_errors()
Get expression compile errors from the backend
- get_message()
Get the exception’s error message
- get_readable_expr_compile_errors()
Return a pretty expression error message
- exception tql.validation.TQLAnalysisNotFound(msg)
Bases:
Exception
TQL not found error
- get_message()
Get the exception’s error message
- tql.validation.format_expression_error(expression, position, error, name=None)
Pretty print an expression error message
- Parameters
expression – Expression with error
position – Location of error message (line, column) tuple
error – Error message
name – Column name for the expression with error
- Returns
The expression with formatted error
- Return type
str
- tql.validation.hide_trace_back()
This method can be used to suppress traceback only for TQLAnalysisException
- tql.validation.show_trace_back()
This method can be used to show traceback for TQLAnalysisException
- tql.validation.validate_no_exception(lambda_fcn, msg=None)
Run the given lambda and rethrow any Exceptions as a TQLAnalysisException
- tql.validation.validate_tql_iterable_type(thing, type)
Check if the thing is an iterable type
- tql.validation.validate_tql_state(cond, msg: str)
Throws an exception if cond is not truthy
- tql.validation.validate_tql_type(thing, type, msg: Optional[str] = None)
Ensures that 'thing' has the expected type, or throws an exception with the given message.
tql.visualizer module
- class tql.visualizer.TimeSeriesVisualizer(query=None, timeseries=None)
Bases:
object
Visualizes a TimeSeries dataframe
- visualize(size='', shape='', opacity='', jitter=0.2, rows=None, timelines=None, down_sampling=None, size_rescale=7, shape_dict=None, opacity_rescale=None, include_style_columns=True, ignore_nulls=True)
Creates an interactive visualizer object from the given input, rendered as a Jupyter extension.
- Parameters
rows – the maximum number of rows to return
timelines – the maximum number of timelines to evaluate
down_sampling – down sampling expression
size (str) – Name of column to use to encode each record’s marker’s size
shape (str) – Name of column to use to encode each record’s marker’s shape
opacity (str) – Name of column to use to encode each record’s marker’s opacity
jitter (float) – Amount of ‘jitter’ (random vertical offset to improve visibility)
size_rescale (float) – Multiplier to adjust the marker size. Defaults to 7
shape_dict (dict) – Dictionary to transform a column into shapes, e.g.: {'outcome': 'star', 'opportunity': 'square', 'treatment': 'triangle', 'default': 'circle'}
opacity_rescale (float) – Multiplier to adjust the marker opacity. None defaults to rescaling by the column’s min/max to 0 and 1.
include_style_columns (bool) – Should the event data display include style columns?
ignore_nulls (bool) – Do not show key-value pairs that have `null` as the value
- Returns
Timeline viewer widget