Writing Queries
The primary purpose of TQL is to use the Query API to extract machine-learning-ready ResultSets from Timeline data. Each row in a ResultSet is an event on a timeline, and each column is a piece of data extracted from the event using the TQL Expression Language.
Query Syntax
Similar to SQL, every query starts with a select statement, which contains one or more TQL column expressions; a from clause, which designates which set of timelines to select from; and optional operators that modify the output, such as where() or limit().

Each call to a Query function modifies the state of the Query and also returns the Query itself, so operators can be chained together in the fluent style of programming.
Once the Query has been built up to the desired state, the submit() function is called, which returns a TQL ResultSet containing the data, or a pointer to the data, as well as statistics about the query execution.
In the following example, we select the constant 1 from timelines of the demo project lethe4:
from zeenk.tql import *
# select emits a Query and submit emits a ResultSet
query = select(1).from_events('lethe4').limit(1)
resultset = query.submit()
print(type(query))
print(type(resultset))
<class 'zeenk.tql.query.Query'>
<class 'zeenk.tql.resultset.ResultSet'>
Query Columns
Each string passed to select() is actually a TQL expression, which describes how to translate timeline events into pieces of derived data. In the following example, we select the string literals “hello” and “world”. Since the selected data is constant, this query returns the same value for every row:
select("'hello'", "'world'")\
.from_events('lethe4')\
.limit(2) # run against only the first 2 events rather than the full ~200,000 rows
Query results:
partition "_default"_c0 | _c1 | |
---|---|---|
0 | hello | world |
1 | hello | world |
Expression Types
Expressions have several types: integer, long, float, string, null, boolean, array, and map:
select(
123467890,
1.0000001,
"'string'",
None,
True,
'[1, 2, 3]',
"{'key':'value'}"
).from_events('lethe4').limit(2)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | _c5 | _c6 | |
---|---|---|---|---|---|---|---|
0 | 123467890 | 1.0000001 | string | None | true | [1,2,3] | {key=value} |
1 | 123467890 | 1.0000001 | string | None | true | [1,2,3] | {key=value} |
Selecting Event Attributes
Expressions are applied to every event, and each row of the result corresponds to one event on the timeline; in a heterogeneous dataset, those events will have different types.
timeline.id and type are universal variables that exist in all projects.
There are also project-specific variables, such as activity.*. For example, activity.conversion and bid.ad_size are specific to this project's timeline data and may not show up in every row.
select(
'timeline.id',
'timestamp',
'type',
'activity.conversion',
'bid.ad_size',
'user.gender'
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | _c5 | |
---|---|---|---|---|---|---|
0 | u252 | 1609459200000 | user | None | None | male |
1 | u252 | 1609545756000 | bid | None | big | None |
2 | u252 | 1609546601000 | bid | None | big | None |
The Select Wildcard Operator
The wildcard operator selects all event attributes.
select('*').from_events('lethe4').limit(2)
Query results:
partition "_default"activity_activity_id | activity_conversion | activity_event_time | activity_user_id | bid_ad_size | bid_bid | bid_event_time | bid_ghosted | bid_request_id | bid_user_id | ... | outcome | timeline_id | timestamp | type | user_age | user_gender | user_group | user_region | user_timestamp | user_user_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | None | None | None | None | None | None | None | None | None | ... | None | u252 | 1609459200000 | user | 55plus | male | influenced | south | 2021-01-01 | u252 |
1 | None | None | None | None | big | 1.0 | 2021-01-02 00:02:36 | false | r13260 | u252 | ... | None | u252 | 1609545756000 | bid | None | None | None | None | None | None |
2 rows × 26 columns
The wildcard operator can also be scoped to a single event type to select a subset of event attributes.
select('bid.*').from_events('lethe4').limit(3)
Query results:
partition "_default"bid_ad_size | bid_bid | bid_event_time | bid_ghosted | bid_request_id | bid_user_id | bid_won | |
---|---|---|---|---|---|---|---|
0 | None | None | None | None | None | None | None |
1 | big | 1.0 | 2021-01-02 00:02:36 | false | r13260 | u252 | true |
2 | big | 1.0 | 2021-01-02 00:16:41 | false | r13376 | u252 | true |
The Validation Operator
Using validate() is a quick way to check whether your query is valid. In the expression below, the validator expects a value to be assigned to a.
select('a=').from_events('lethe4').validate()
Encountered 1 expression compilation error: Line 1:2 no viable alternative at input 'a=' (in column _c0)
Expression error: Line 1:2 no viable alternative at input 'a=' (in column _c0)
a=
--^
---------------------------------------------------------------------------
TQLAnalysisException Traceback (most recent call last)
<ipython-input-8-c1a43c6aef08> in <module>
----> 1 select('a=').from_events('lethe4').validate()
~/src/noumena/python/zeenk-tql/zeenk/tql/query.py in validate(self)
854 payload = self.json()
855 if payload != self._last_validated_spec:
--> 856 icarus.post('query/validate', json_data=payload)
857 self._last_validated_spec = payload
858
~/src/noumena/python/zeenk-tql/zeenk/tql/icarus.py in post(url, json_data, files_data, data, print_json)
14 print(json.dumps(json_data))
15 resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16 return _handle_response(url, resp, print_json=print_json)
17
18
~/src/noumena/python/zeenk-tql/zeenk/tql/icarus.py in _handle_response(url, resp, print_json)
70 else:
71 print(reason)
---> 72 raise TQLAnalysisException(reason)
73 else:
74 msg = ''
TQLAnalysisException: Encountered 1 expression compilation error: Line 1:2 no viable alternative at input 'a=' (in column _c0)Expression error: Line 1:2 no viable alternative at input 'a=' (in column _c0)
a=
--^
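For contrast, a well-formed expression passes validation without raising an error. A minimal sketch, completing the assignment that failed above (on success, validate() simply raises no exception):
# assigning a value to 'a' before using it satisfies the validator
select('a = 5; a').from_events('lethe4').validate()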
The Timeline Limit Operator
Here we limit the number of timelines the query runs against. In the following query, we select a maximum of 2 events from a maximum of 1 timeline.
select('timeline.id').from_events('lethe4').limit(rows=2, timelines=1)
Query results:
partition "_default"_c0 | |
---|---|
0 | u252 |
1 | u252 |
The Where Operator
The where operator emits only rows for which the given expression evaluates to true. It takes a truthy expression: anything non-zero, non-empty, and non-null evaluates to true. The example below selects events that are of type activity and whose activity was a purchase.
select(
'type',
'activity.conversion'
).from_events('lethe4')\
.where('type == "activity" and activity.conversion == "purchase"')\
.limit(2)
Query results:
partition "_default"_c0 | _c1 | |
---|---|---|
0 | activity | purchase |
1 | activity | purchase |
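Because the expression is truthy, a bare attribute also works as a condition. A minimal sketch of the same pattern (bid.bid is non-null only on bid events, so all other rows are dropped):
# rows where bid.bid is non-null and non-zero pass the filter
select('type', 'bid.bid').from_events('lethe4').where('bid.bid').limit(2)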
The From Timelines Operator
The from_timelines() operator is roughly equivalent to a “group by user/timeline_id” operation. The ResultSet of from_timelines() has exactly one row per timeline, and only timeline-level attributes can be selected. Aggregations over timeline.events are useful here, such as SIZE(timeline.events) or more complex transformations that operate on a per-timeline level. More details are provided in the Python API Doc.
select(
# the timeline id
'timeline.id',
# the number of events on the timeline
'SIZE(timeline.events)',
# the second bid event for each user (FILTER keeps only events with a bid; arrays are 0-indexed)
'events = FILTER(timeline.events, (x) -> x.bid); events[1]',
).from_timelines('lethe4')\
.limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | |
---|---|---|---|
0 | u252 | 549.0 | {"timestamp":1609546601000,"id":51539616491,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":false,"won":true,"request_id":"r13376","ad_size":"big","event_time":"2021-01-02 00:16:41","user_id":"u252"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null} |
1 | u577 | 802.0 | {"timestamp":1609461055000,"id":25769817455,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":true,"won":false,"request_id":"r267","ad_size":"big","event_time":"2021-01-01 00:30:55","user_id":"u577"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null} |
2 | u582 | 461.0 | {"timestamp":1609460058000,"id":25769817308,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":false,"won":true,"request_id":"r120","ad_size":"small","event_time":"2021-01-01 00:14:18","user_id":"u582"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null} |
The Submit Operator
TQL Queries submitted to Spark also return ResultSet objects; the difference is that each partition in the ResultSet holds a path to the data on local disk or a distributed file system, instead of the data itself. To read the data, see the load section of the ResultSet Guide.
select(
'type',
'timestamp'
).from_events('lethe4')\
.limit(4)\
.submit(spark=True) # we can submit jobs to a Spark cluster where the data is processed by worker nodes.
Click to Abort Job
NOTE: Abort link will open the result of the call in a new tab/window
SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:17, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:16
Query results:
partition "_default"key | value | |
---|---|---|
0 | name | _default |
1 | data_path | file:/Users/smaguire/.tql/files/datasets/2021-08-30/37066623867876/data/partition=_default |
2 | row_count | 32 |
3 | columns | [_c0, _c1] |
4 | positive_row_count | 0 |
The Format Operator for Spark ResultSets
Using format() sets the output format of a Spark ResultSet to CSV, JSON, or Parquet:
select(
'type',
'timestamp',
'activity.activity_id',
'activity.conversion'
).from_events('lethe4')\
.format('json')\
.limit(5)\
.submit(spark=True)
Click to Abort Job
NOTE: Abort link will open the result of the call in a new tab/window
SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:06, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:05
Query results:
partition "_default"key | value | |
---|---|---|
0 | name | _default |
1 | data_path | file:/Users/smaguire/.tql/files/datasets/2021-08-30/37093997083474/data/partition=_default |
2 | row_count | 40 |
3 | columns | [_c0, _c1, _c2, _c3] |
4 | positive_row_count | 0 |
Column
Columns have types. You must provide an expression, such as RANDOM(), 2.0, or a='5'. You can optionally provide names, or cast data by assigning the column a machine-learning type.
select(
# A labeled column
label('RANDOM()', 'my_label'),
# A weight column is for an ML resultset. Values must be numeric.
weight(2.0, name='my_weight_col'),
# A tag column is a unique identifier for the row. Not required but highly encouraged.
tag("'tag'", name='my_tag_col'),
# A categorical feature
categorical('type', name='my_categorical_col'),
# A numerical feature. Can be a name:value pair separated by ':',
# ex: foo:1.3
numerical("'foo:1.3'", name='my_numerical_col'),
# A metadata column.
metadata("'metadata'", name='my_metadata_col'),
# Universal casting operation from string, dict, tuple, etc to column.
# defaults to metadata if type is not provided
col('YEAR(MILLIS(TO_DATETIME("2020-05-19 03:17:02.000")))', name='my_date_col', type='label')
)\
.from_events('lethe4').limit(3)
Query results:
partition "_default"my_label | my_weight_col | my_tag_col | my_categorical_col | my_numerical_col | my_metadata_col | my_date_col | |
---|---|---|---|---|---|---|---|
0 | 2020.0 | 2.0 | tag | user | foo:1.300000000000000 | metadata | 2020.0 |
1 | 2020.0 | 2.0 | tag | bid | foo:1.300000000000000 | metadata | 2020.0 |
2 | 2020.0 | 2.0 | tag | bid | foo:1.300000000000000 | metadata | 2020.0 |
FeatureColumn
When a column is cast to a categorical() or numerical() feature, the following additional behavior is applied:
- the query option global_min_total_count can be applied
- the query option apply_charset_filter can be applied
- the query option drop_empty_rows can be applied
- the query option expand_numerical_features can be applied
- the query option drop_numerical_zero_features can be applied
- column filters can be configured on the column
- column value metrics are computed on the column
See the Query Options guide for more information on Query Options.
print(type(numerical('1')))
print(type(categorical('type')))
<class 'zeenk.tql.column.FeatureColumn'>
<class 'zeenk.tql.column.FeatureColumn'>
Filtering on Columns
Column filters are similar to the where() operator, but instead of dropping an entire row they conditionally null out a single feature. If a feature has only a small number of non-null values, you can use a filter like min_total_count to set a threshold on the minimum number of non-null rows. The min_total_count filter applies only to categorical and numerical features.
select(
# min_total_count: the column must be non-null in at least 3 rows, or it is nulled out for all rows
categorical('bid.ad_size').filter(min_total_count=3)
)\
.from_events('lethe4')\
.where("type == 'bid'")\
.limit(3)
Query results:
partition "_default"_c0 | |
---|---|
0 | big |
1 | big |
2 | small |
Using event_metadata() and event_time()
These functions define sets of default columns. Notice that they provide both a column name and an expression that contains a variable.
event_metadata()
(Column<name="_timeline_id", expression="timeline.id">,
Column<name="_event_id", expression="id">,
Column<name="_datetime", expression="datetime(timestamp)">,
Column<name="_type", expression="type">)
event_time()
(Column<name="_timestamp", expression="timestamp">,
Column<name="_duration", expression="duration">)
select(event_metadata(), event_time()) \
.from_events('lethe4') \
.limit(3)
Query results:
partition "_default"_timeline_id | _event_id | _datetime | _type | _timestamp | _duration | |
---|---|---|---|---|---|---|
0 | u252 | 252 | 2021-01-01 00:00:00 | user | 1609459200000 | None |
1 | u252 | 51539616375 | 2021-01-02 00:02:36 | bid | 1609545756000 | None |
2 | u252 | 51539616491 | 2021-01-02 00:16:41 | bid | 1609546601000 | None |
The Partition By Operator
The partition_by() operator splits the data into n partitions. It takes an expression that emits an enumerated value, assigning each row to a partition.
In addition to simple operations like randomly assigning rows to a partition, you can also use an attribute of the event to assign data to partitions more intelligently.
In all previous examples, the _default partition was used. We will now use partition_by to create named partitions. The following partitions the data so that roughly 80% lands in the train partition and 20% in the test partition.
select(
'type',
'timestamp'
).from_events('lethe4')\
.partition_by('IF(RANDOM() < 0.8, "train", "test")')\
.limit(5)
Query results:
partition "train"_c0 | _c1 | |
---|---|---|
0 | user | 1609459200000 |
1 | bid | 1609545756000 |
2 | bid | 1609546601000 |
3 | bid | 1609547455000 |
4 | activity | 1609549067000 |
# ex: put events of each ad_size into their own partition
select(
col('type', 'type'),
col('bid.ad_size', 'size')
).from_events('lethe4')\
.where("type=='bid'")\
.partition_by('IF(bid.ad_size == "small", "small", "big") ')\
.limit(7)
Query results:
partition "small"type | size | |
---|---|---|
0 | bid | small |
1 | bid | small |
partition "big"
 | type | size |
---|---|---|
0 | bid | big |
1 | bid | big |
2 | bid | big |
3 | bid | big |
4 | bid | big |
The External Timelines Operator
The external_timelines() operator can be used to run your query against custom timeline data you provide, instead of against the regular project timelines. Each timeline is a dictionary with two keys, id (string) and events (list), where each event is a dictionary with fields that match your Project’s schema. For example:
timeline = {
'id':'my_custom_timeline_id',
'events': [{
'bid': {
'ad_size': 'absolutely_enormous'
}
}
]
}
select(
'bid.ad_size'
)\
.from_events('lethe4')\
.external_timelines([timeline]) # This is an array. Multiple timelines can be provided here.
Query results:
partition "_default"column_name | result | |
---|---|---|
0 | _c0 | absolutely_enormous |
The Downsample By Operator
The downsample_by() operator allows you to conditionally downsample a ResultSet depending on the parameters. For more information, see the API reference on downsampling.
# Downsampling the data such that we have twice as many negative records as positive records
select(
label("IF(type == 'bid', 0, 1)")
).from_events('lethe4')\
.limit(3)\
.downsample_by(neg_pos_ratio=2, reweight=False)
Query results:
partition "_default"_label | |
---|---|
0 | 1.0 |
1 | 0.0 |
2 | 0.0 |
The Options Operator
More details about the options operator can be found in the Options Guide.
select(
numerical('SUM_OPPORTUNITIES("c")')
).from_events('lethe4')\
.options(
max_columns=2, # normally there are 7 kernels in the result; remove this line to see all of them
expand_numerical_features=True
).limit(3)
Query results:
partition "_default"_c0_EQ_c-3d-e | _c0_EQ_c-7d-e | |
---|---|---|
0 | 1.0 | 1.0 |
1 | 1.0 | 1.0 |
2 | 1.0 | 1.0 |
The *_var Operators
These operators define scoped variables that can be referenced one or more times in your columns, via functions such as TIMELINE_VAR() in TQL expressions.
There are 3 var operators:
- event_var (computed once per event)
- timeline_var (computed once per timeline)
- global_var (computed once per ResultSet)
Below, a timeline variable first_bid holds the first event of type bid on each timeline. It is computed and memoized once per timeline, and every row's column expressions can then reference it.
# compute and memoize this expensive operation once per timeline (avoids re-scanning all events to extract the first bid for every row)
select(
'TIMELINE_VAR("first_bid").type', # retrieve precomputed timeline variable 'first_bid'
'TIMELINE_VAR("first_bid").bid.request_id',
)\
.from_events("lethe4")\
.timeline_var('first_bid', 'FILTER(timeline.events, (x) -> x.type == "bid")[0]')\
.limit(3)
Query results:
partition "_default"_c0 | _c1 | |
---|---|---|
0 | bid | r13260 |
1 | bid | r13260 |
2 | bid | r13260 |
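The other two var operators follow the same pattern. A minimal sketch, assuming event_var() and global_var() mirror timeline_var()'s (name, expression) signature and are referenced with EVENT_VAR() and GLOBAL_VAR():
select(
'EVENT_VAR("is_bid")', # assumed retrieval function; recomputed once per event
'GLOBAL_VAR("cutoff")' # assumed retrieval function; computed once per ResultSet
)\
.from_events('lethe4')\
.event_var('is_bid', 'type == "bid"')\
.global_var('cutoff', 'TO_DATETIME("2021-01-02 00:00:00")')\
.limit(3)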
The Opportunities Operator
Compare the results below. The default kernels are 5 minutes, 15 minutes, 1 hour, 4 hours, 1 day, 3 days, and 7 days (1 week).
In the first example, SUM_OPPORTUNITIES aggregates bid.ad_size via a time-difference-weighted, categorical-grouped sum over events of type bid. It calculates the hypothesized value of each bid (opportunity) over each kernel (time window).
In the second example we do the same, only over events of type activity, and we specify an exponential-decay distribution to evaluate the “opportunity to treat” given the size of the ad.
# no opportunities() operator
select(
event_metadata(),
'bid.ad_size',
'SUM_OPPORTUNITIES(bid.ad_size)'
).from_events('lethe4')\
.where('type == "bid"')\
.limit(3)
Query results:
partition "_default"_timeline_id | _event_id | _datetime | _type | _c4 | _c5 | |
---|---|---|---|---|---|---|
0 | u252 | 51539616375 | 2021-01-02 00:02:36 | bid | big | [big-5m-e:1,big-15m-e:1,big-1h-e:1,big-4h-e:1,big-1d-e:1,big-3d-e:1,big-7d-e:1] |
1 | u252 | 51539616491 | 2021-01-02 00:16:41 | bid | big | [big-5m-e:1,big-15m-e:1,big-1h-e:1,big-4h-e:1,big-1d-e:1,big-3d-e:1,big-7d-e:1] |
2 | u252 | 51539616624 | 2021-01-02 00:30:55 | bid | small | [small-5m-e:1,small-15m-e:1,small-1h-e:1,small-4h-e:1,small-1d-e:1,small-3d-e:1,small-7d-e:1] |
# with opportunities() operator
select(
event_metadata(),
'bid.ad_size',
'SUM_OPPORTUNITIES(bid.ad_size)'
).from_events('lethe4')\
.where('type == "activity"')\
.opportunities('type == "bid"', distribution='exp')\
.limit(3)
# for the given outcome (activity), some number of opportunities preceded it;
# SUM_OPPORTUNITIES iterates over each opportunity and weights it by the time delta
Query results:
partition "_default"_timeline_id | _event_id | _datetime | _type | _c4 | _c5 | |
---|---|---|---|---|---|---|
0 | u252 | 51539608944 | 2021-01-02 00:57:47 | activity | None | [small-1d-e:0.9815155646112473,small-4h-e:5.364563568032985,big-3d-e:0.6592795151474853,small-1h-e:15.33712028705239,small-5m-e:1.335929289646685,big-5m-e:0.08217085284903825,big-1h-e:21.66527965846466,big-7d-e:0.2843530487444983,big-1d-e:1.934265027978346,big-4h-e:9.823198048668358,big-15m-e:8.622850831823627,small-3d-e:0.3312667209981967,small-15m-e:16.01037709354400,small-7d-e:0.1424768864228655] |
1 | u252 | 51539609033 | 2021-01-02 02:15:43 | activity | None | [small-1d-e:1.881235048884746,small-4h-e:8.327589681452498,big-3d-e:0.9778850917217536,small-1h-e:11.44939465428303,small-5m-e:0.0001706822397623980,big-5m-e:0.1361897575453083,big-1h-e:18.59069078435085,big-7d-e:0.4244786628919324,big-1d-e:2.806128853481813,big-4h-e:12.21484985754475,big-15m-e:7.526981406797988,small-3d-e:0.6531907947592189,small-15m-e:0.8947315945806531,small-7d-e:0.2832241682457595] |
2 | u252 | 51539609120 | 2021-01-02 03:36:02 | activity | None | [small-1d-e:1.779180809669550,small-4h-e:5.959110843185616,big-3d-e:1.950292695357807,small-1h-e:3.002141144325396,big-5m-e:20.72168280223276,big-1h-e:44.28131765794558,big-7d-e:0.8479149361292611,big-1d-e:5.568841224025949,big-4h-e:23.96490852181065,big-15m-e:50.11016424082054,small-3d-e:0.6411589816600961,small-15m-e:0.004229481186244135,small-7d-e:0.2809764266104696] |
This feature can be used to write validation test cases for your data or, in conjunction with the PREDICT() function, to build a real-time scoring service.
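For instance, a scoring flow might look roughly like the sketch below, which feeds a hand-built timeline to a trained model through external_timelines(). This is only an illustration: 'my_model' is a placeholder, and PREDICT()'s exact argument list is an assumption rather than a documented signature.
# hypothetical sketch: score an externally supplied timeline with a trained model
live_timeline = {'id': 'live_user', 'events': [{'bid': {'ad_size': 'big'}}]}
select(
'PREDICT("my_model")' # placeholder model identifier; see the PREDICT() docs for real arguments
).from_events('lethe4')\
.external_timelines([live_timeline])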
The Union or Sampling Operator
The union or sampling operator injects new events into each timeline, for the scope of the current query only. The generated events are described by the generate_events function, which takes one or more timestamps and, optionally, attribute expressions to set on the generated events.
The sampling operator is a powerful construct that can build panel datasets that would be difficult to create with SQL. An example of a query that is hard in SQL but easy in TQL: take a daily sample, count the number of events on the timeline, and count the number of events in the previous day.
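A minimal sketch of that daily-sample idea, under a few assumptions: the three sample timestamps are hand-picked consecutive days in epoch millis, the sampled rows are isolated by their default generated_outcome type, and a FILTER lambda may reference the current row's timestamp:
# three hand-picked daily sample points: 2021-01-01 through 2021-01-03
daily = generate_events('[1609459200000, 1609545600000, 1609632000000]')
select(
'timestamp',
# total number of events on the timeline
'SIZE(timeline.events)',
# events in the 24 hours before this sample point (86400000 ms = 1 day)
'SIZE(FILTER(timeline.events, (x) -> x.timestamp >= timestamp - 86400000 and x.timestamp < timestamp))'
).from_events('lethe4')\
.union(daily)\
.where('type == "generated_outcome"')\
.limit(3)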
In the example below, events are generated at epochs 0ms and 1000ms, and the union of the project’s events and those generated events is selected. A new type of conversion, simulated_purchase, has been injected into the selection; it shows up alongside the timeline’s real conversions. Generated events are of type generated_outcome by default.
select(
'timeline.id',
'type',
'timestamp',
'activity.activity_id',
'activity.conversion'
).from_events('lethe4')\
.union(generate_events('[0, 1000]',
attribute_expressions={
'activity.conversion': '"simulated_purchase"',
}))\
.limit(5)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | |
---|---|---|---|---|---|
0 | u252 | generated_outcome | 0 | None | simulated_purchase |
1 | u252 | generated_outcome | 1000 | None | simulated_purchase |
2 | u252 | user | 1609459200000 | None | None |
3 | u252 | bid | 1609545756000 | None | None |
4 | u252 | bid | 1609546601000 | None | None |
User-Defined Functions
More complex functions can be built on top of the expression language; the user-defined function language is a superset of the expression language.
A simple example is provided below. For a more complete guide to user-defined functions, refer to the User-Defined Functions Guide.
div2_func_src = '''
/**
* given an integer, long, or floating point number, return the value / 2
* @param x Number
* @return Double
*/
function div2(x) {
x / 2
}
'''
select(
'div2(4)'
).from_events('lethe4')\
.udf(div2_func_src)\
.limit(1)
Query results:
partition "_default"column_name | result | |
---|---|---|
0 | _c0 | 2.0 |