# Writing Queries

The primary purpose of TQL is to use the Query API to extract machine-learning-ready [ResultSets](resultsets) from Timeline data. Each row in a ResultSet is an event on a timeline, and each column is a piece of data extracted from the event using the [TQL Expression Language](expression-language).

## Query Syntax

Similar to SQL, every query starts with a `select` statement containing one or more TQL column expressions, a `from` clause that designates which set of timelines to select from, and optional operators that modify the output, such as `where()` or `limit()`.

Each call to a Query function modifies the state of the Query itself and also returns the Query, so operators can be chained together in the [fluent style](https://en.wikipedia.org/wiki/Fluent_interface) of programming. Once the Query has been built up to the desired state, the `submit()` function is called, which returns a TQL ResultSet containing the data (or a pointer to the data), as well as statistics about the query execution.

In the following example, we select the constant `1` from timelines of the demo project `lethe4`:

```python
from zeenk.tql import *

# select() emits a Query, and submit() emits a ResultSet
query = select(1).from_events('lethe4').limit(1)
resultset = query.submit()
print(type(query))
print(type(resultset))
```

## Query Columns

Each string passed to `select()` is actually a [TQL expression](expression-language), which describes how to translate timeline events into pieces of derived data. In the following example, we select the string literals `'hello'` and `'world'`. Since the selected data is constant, this query returns the same value for every row. The `limit(2)` runs the query against only the first 2 events of the project, avoiding a scan of the full ~200,000 rows:

```python
select("'hello'", "'world'")\
    .from_events('lethe4')\
    .limit(2)
```

Query results:

partition "_default"
_c0 _c1
0 hello world
1 hello world
query produced 2 rows x 2 columns in 0.45 seconds
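The fluent pattern described above, where each operator mutates the query and returns it, can be illustrated with a minimal toy builder. The `ToyQuery` class below is hypothetical and purely illustrative; it is not part of the `zeenk.tql` API:

```python
class ToyQuery:
    """A minimal sketch of a fluent builder, mimicking how TQL's
    Query operators mutate state and return self for chaining.
    (Illustrative only -- not the real zeenk.tql Query class.)"""

    def __init__(self, columns):
        self.columns = columns
        self.source = None
        self.max_rows = None

    def from_events(self, project):
        self.source = project
        return self  # returning self is what enables chaining

    def limit(self, rows):
        self.max_rows = rows
        return self

q = ToyQuery([1]).from_events('lethe4').limit(1)
print(q.source, q.max_rows)  # lethe4 1
```

Because every operator returns the same object, the chain `select(...).from_events(...).limit(...)` reads left to right while building up one query state.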

## Expression Types

Expressions have several types: integer, long, float, string, null, boolean, array, and map.

```python
select(
    123467890,
    1.0000001,
    "'string'",
    None,
    True,
    '[1, 2, 3]',
    "{'key':'value'}"
).from_events('lethe4').limit(2)
```

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5 _c6
0 123467890 1.0000001 string None true [1,2,3] {key=value}
1 123467890 1.0000001 string None true [1,2,3] {key=value}
query produced 2 rows x 7 columns in 0.53 seconds

## Selecting Event Attributes

Expressions are applied to every event, and each row of the result corresponds to one event on the timeline. In a heterogeneous dataset, different rows may come from events of different types. `timeline.id` and `type` are universal variables that exist in all projects. Other variables, such as those under `activity.*`, belong to a specific project: for example, `activity.conversion` and `bid.ad_size` are specific to this timeline data and may not be populated in every row.

```python
select(
    'timeline.id',
    'timestamp',
    'type',
    'activity.conversion',
    'bid.ad_size',
    'user.gender'
).from_events('lethe4').limit(3)
```

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5
0 u252 1609459200000 user None None male
1 u252 1609545756000 bid None big None
2 u252 1609546601000 bid None big None
query produced 3 rows x 6 columns in 0.61 seconds

## The Select Wildcard Operator

### The wildcard operator selects all event attributes.

```python
select('*').from_events('lethe4').limit(2)
```

Query results:

partition "_default"
activity_activity_id activity_conversion activity_event_time activity_user_id bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id ... outcome timeline_id timestamp type user_age user_gender user_group user_region user_timestamp user_user_id
0 None None None None None None None None None None ... None u252 1609459200000 user 55plus male influenced south 2021-01-01 u252
1 None None None None big 1.0 2021-01-02 00:02:36 false r13260 u252 ... None u252 1609545756000 bid None None None None None None

query produced 2 rows x 26 columns in 0.84 seconds

### The wildcard operator can also select a subset of event attributes.

```python
select('bid.*').from_events('lethe4').limit(3)
```

Query results:

partition "_default"
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
query produced 3 rows x 7 columns in 0.61 seconds

## The Validation Operator

Using `validate()` is a quick way to check whether your query is valid. In the expression below, the expression validator is expecting to see a value assigned to `a`:

```python
select('a=').from_events('lethe4').validate()
```

Encountered 1 expression compilation error:
Line 1:2 no viable alternative at input 'a=' (in column _c0)
Expression error: Line 1:2 no viable alternative at input 'a=' (in column _c0)
a=
--^
---------------------------------------------------------------------------
TQLAnalysisException                      Traceback (most recent call last)
in
----> 1 select('a=').from_events('lethe4').validate()

~/src/noumena/python/zeenk-tql/zeenk/tql/query.py in validate(self)
    854         payload = self.json()
    855         if payload != self._last_validated_spec:
--> 856             icarus.post('query/validate', json_data=payload)
    857             self._last_validated_spec = payload
    858

~/src/noumena/python/zeenk-tql/zeenk/tql/icarus.py in post(url, json_data, files_data, data, print_json)
     14         print(json.dumps(json_data))
     15     resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16     return _handle_response(url, resp, print_json=print_json)
     17
     18

~/src/noumena/python/zeenk-tql/zeenk/tql/icarus.py in _handle_response(url, resp, print_json)
     70         else:
     71             print(reason)
---> 72         raise TQLAnalysisException(reason)
     73     else:
     74         msg = ''

TQLAnalysisException: Encountered 1 expression compilation error:
Line 1:2 no viable alternative at input 'a=' (in column _c0)
Expression error: Line 1:2 no viable alternative at input 'a=' (in column _c0)
a=
--^

## The Timeline Limit Operator

The `limit()` operator can also cap the number of timelines the query runs against. In the following query, we select a _maximum_ of 2 events from a _maximum_ of 1 timeline:

```python
select('timeline.id').from_events('lethe4').limit(rows=2, timelines=1)
```

Query results:

partition "_default"
_c0
0 u252
1 u252
query produced 2 rows x 1 columns in 0.36 seconds

## The Where Operator

The `where()` operator only emits rows for which the given expression evaluates to true. It takes a truthy expression: anything non-zero, non-empty, and non-null evaluates to true. The example below selects events that are of type `activity` and whose activity was a purchase:

```python
select(
    'type',
    'activity.conversion'
).from_events('lethe4')\
    .where('type == "activity" and activity.conversion == "purchase"')\
    .limit(2)
```

Query results:

partition "_default"
_c0 _c1
0 activity purchase
1 activity purchase
query produced 2 rows x 2 columns in 0.43 seconds
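The truthiness rule described above (non-zero, non-empty, non-null) can be sketched in plain Python. This is an illustration of the documented semantics, not the actual TQL evaluator:

```python
def tql_truthy(value):
    """Approximate the where() truthiness rule described above:
    null, zero, and empty values are false; everything else is true.
    (A sketch of the documented semantics, not the real evaluator.)"""
    if value is None:
        return False
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, (str, list, dict)):
        return len(value) > 0
    return bool(value)

print([tql_truthy(v) for v in [0, 1, '', 'activity', None, [], [1]]])
# [False, True, False, True, False, False, True]
```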

## The From Timelines Operator

The `from_timelines()` operator is roughly equivalent to a "group by user/timeline_id" operation: its ResultSet has exactly one row per timeline. With `from_timelines()`, only timeline-level attributes can be selected. Operations that aggregate over `timeline.events` are useful here, such as `SIZE(timeline.events)` or more complex per-timeline transformations. More details are provided in the Python API Doc.

```python
select(
    # the timeline id
    'timeline.id',
    # the number of events on the timeline
    'SIZE(timeline.events)',
    # the second bid (index 1) on each user's timeline
    'events = FILTER(timeline.events, (x) -> x.bid); events[1]',
).from_timelines('lethe4')\
    .limit(3)
```

Query results:

partition "_default"
_c0 _c1 _c2
0 u252 549.0 {"timestamp":1609546601000,"id":51539616491,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":false,"won":true,"request_id":"r13376","ad_size":"big","event_time":"2021-01-02 00:16:41","user_id":"u252"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null}
1 u577 802.0 {"timestamp":1609461055000,"id":25769817455,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":true,"won":false,"request_id":"r267","ad_size":"big","event_time":"2021-01-01 00:30:55","user_id":"u577"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null}
2 u582 461.0 {"timestamp":1609460058000,"id":25769817308,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":false,"won":true,"request_id":"r120","ad_size":"small","event_time":"2021-01-01 00:14:18","user_id":"u582"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null}
query produced 3 rows x 3 columns in 0.51 seconds
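Conceptually, `from_timelines()` behaves like the following pure-Python group-by sketch: one output row per timeline, with aggregations computed over that timeline's events. The toy data structures here are assumed for illustration only:

```python
# toy timelines, shaped loosely like the project data above
timelines = [
    {'id': 'u252', 'events': [{'type': 'user'},
                              {'type': 'bid', 'request_id': 'r1'},
                              {'type': 'bid', 'request_id': 'r2'}]},
    {'id': 'u577', 'events': [{'type': 'bid', 'request_id': 'r3'}]},
]

rows = []
for tl in timelines:
    bids = [e for e in tl['events'] if e['type'] == 'bid']  # like FILTER(timeline.events, ...)
    rows.append({
        'timeline_id': tl['id'],
        'num_events': len(tl['events']),                    # like SIZE(timeline.events)
        'second_bid': bids[1] if len(bids) > 1 else None,   # like events[1]
    })

for r in rows:
    print(r)
```

Note that the output has one row per timeline, regardless of how many events each timeline contains.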

## The Submit Operator

TQL Queries submitted to Spark also return ResultSet objects, but each partition in the ResultSet will contain a path to the data on local disk or a distributed file system, instead of the data itself. To read the data, see the [load](resultsets.html#load) section of the ResultSet Guide.

```python
select(
    'type',
    'timestamp'
).from_events('lethe4')\
    .limit(4)\
    .submit(spark=True)  # submit the job to a Spark cluster, where the data is processed by worker nodes
```


SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:17, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:16

Query results:

partition "_default"
key value
0 name _default
1 data_path file:/Users/smaguire/.tql/files/datasets/2021-08-30/37066623867876/data/partition=_default
2 row_count 32
3 columns [_c0, _c1]
4 positive_row_count 0
query produced 4 rows x 2 columns in 17.92 seconds

## The Format Operator for Spark ResultSets

Using `format()`, the output can be written as CSV, JSON, or Parquet:

```python
select(
    'type',
    'timestamp',
    'activity.activity_id',
    'activity.conversion'
).from_events('lethe4')\
    .format('json')\
    .limit(5)\
    .submit(spark=True)
```


SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:06, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:05

Query results:

partition "_default"
key value
0 name _default
1 data_path file:/Users/smaguire/.tql/files/datasets/2021-08-30/37093997083474/data/partition=_default
2 row_count 40
3 columns [_c0, _c1, _c2, _c3]
4 positive_row_count 0
query produced 5 rows x 4 columns in 7.3 seconds

## Column

Columns have types. You must provide an expression, such as `RANDOM()`, `2.0`, or `a='5'`. You can optionally provide names, or cast data by assigning the column a machine-learning type.

```python
select(
    # a label column
    label('RANDOM()', 'my_label'),
    # a weight column for an ML resultset; values must be numeric
    weight(2.0, name='my_weight_col'),
    # a tag column is a unique identifier for the row; not required, but highly encouraged
    tag("'tag'", name='my_tag_col'),
    # a categorical feature
    categorical('type', name='my_categorical_col'),
    # a numerical feature; can be a mixed numerical separated by ':', e.g. foo:1.3
    numerical("'foo:1.3'", name='my_numerical_col'),
    # a metadata column
    metadata("'metadata'", name='my_metadata_col'),
    # universal casting operation from string, dict, tuple, etc. to a column;
    # defaults to metadata if type is not provided
    col('YEAR(MILLIS(TO_DATETIME("2020-05-19 03:17:02.000")))', name='my_date_col', type='label')
).from_events('lethe4').limit(3)
```

Query results:

partition "_default"
my_label my_weight_col my_tag_col my_categorical_col my_numerical_col my_metadata_col my_date_col
0 2020.0 2.0 tag user foo:1.300000000000000 metadata 2020.0
1 2020.0 2.0 tag bid foo:1.300000000000000 metadata 2020.0
2 2020.0 2.0 tag bid foo:1.300000000000000 metadata 2020.0
query produced 3 rows x 7 columns in 0.37 seconds

## FeatureColumn

When a column is cast to a `categorical()` or `numerical()` feature, the following additional behavior is applied:

- query option `global_min_total_count` can be applied
- query option `apply_charset_filter` can be applied
- query option `drop_empty_rows` can be applied
- query option `expand_numerical_feature` can be applied
- query option `drop_numerical_zero_features` can be applied
- column filters can be configured on the column
- column value metrics are computed on the column

See the [Query Options](options) guide for more information on Query Options.

```python
print(type(numerical('1')))
print(type(categorical('type')))
```

## Filtering on Columns

Filtering is similar in spirit to `where()`, but it operates on a single column rather than the whole row: a filter conditionally nulls out a feature. If a feature has only a small number of non-null values, you can use a filter like `min_total_count` to set a threshold on the minimum number of non-null rows. The `min_total_count` filter only applies to categorical and numerical features.

```python
select(
    # min_total_count: the column must have non-null values in at least 3 rows,
    # or it will be nulled out for all rows
    categorical('bid.ad_size').filter(min_total_count=3)
).from_events('lethe4')\
    .where("type == 'bid'")\
    .limit(3)
```

Query results:

partition "_default"
_c0
0 big
1 big
2 small
query produced 3 rows x 1 columns in 0.4 seconds
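The `min_total_count` rule can be sketched as a post-processing pass over a column: if fewer than the threshold of rows are non-null, the whole column is nulled out. This is an illustration of the documented behavior, not the server-side implementation:

```python
def apply_min_total_count(column_values, min_total_count):
    """Null out the entire column if it has fewer than
    min_total_count non-null values, per the filter described above.
    (Sketch of the documented behavior, not the real implementation.)"""
    non_null = sum(1 for v in column_values if v is not None)
    if non_null < min_total_count:
        return [None] * len(column_values)
    return list(column_values)

print(apply_min_total_count(['big', 'big', 'small'], 3))  # threshold met: column kept
print(apply_min_total_count(['big', None, None], 3))      # threshold missed: column nulled out
```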

## Using event_metadata() and event_time()

These functions define sets of default columns. Notice that they provide both a column name and an expression that contains a variable.

```python
event_metadata()
```

(Column, Column, Column, Column)

```python
event_time()
```

(Column, Column)

```python
select(event_metadata(), event_time()) \
    .from_events('lethe4') \
    .limit(3)
```

Query results:

partition "_default"
_timeline_id _event_id _datetime _type _timestamp _duration
0 u252 252 2021-01-01 00:00:00 user 1609459200000 None
1 u252 51539616375 2021-01-02 00:02:36 bid 1609545756000 None
2 u252 51539616491 2021-01-02 00:16:41 bid 1609546601000 None
query produced 3 rows x 6 columns in 0.4 seconds

## The Partition By Operator

The `partition_by()` operator splits the data into _n_ partitions. It takes an expression that emits an enumerated value, which assigns each row to a partition. In addition to simple operations like randomly assigning rows to a partition, you can also use an attribute of the event to assign data to partitions more intelligently.

In all previous examples, the `_default` partition was selected. We will now use `partition_by()` to create named partitions. The following partitions the data such that roughly 80% goes into the `train` partition and 20% into the `test` partition:

```python
select(
    'type',
    'timestamp'
).from_events('lethe4')\
    .partition_by('IF(RANDOM() < 0.8, "train", "test")')\
    .limit(5)
```

Query results:

partition "train"
_c0 _c1
0 user 1609459200000
1 bid 1609545756000
2 bid 1609546601000
3 bid 1609547455000
4 activity 1609549067000
query produced 5 rows x 2 columns in 0.42 seconds
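The random 80/20 split expression above is equivalent in spirit to the following Python sketch, where each row independently draws a random number to pick its partition:

```python
import random

random.seed(0)  # deterministic for the example

def assign_partition(_row):
    # mirrors IF(RANDOM() < 0.8, "train", "test")
    return 'train' if random.random() < 0.8 else 'test'

counts = {'train': 0, 'test': 0}
for row in range(10_000):
    counts[assign_partition(row)] += 1

print(counts)  # roughly 80% train / 20% test
```

Because each row is assigned independently, the split is only approximately 80/20 on any finite sample.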

```python
# ex: put events of each ad_size into their own partition
select(
    col('type', 'type'),
    col('bid.ad_size', 'size')
).from_events('lethe4')\
    .where("type=='bid'")\
    .partition_by('IF(bid.ad_size == "small", "small", "big")')\
    .limit(7)
```

Query results:

partition "small"
type size
0 bid small
1 bid small
partition "big"
type size
0 bid big
1 bid big
2 bid big
3 bid big
4 bid big
query produced 7 rows x 2 columns in 0.41 seconds

## The External Timelines Operator

The `external_timelines()` operator can be used to run your query against custom timeline data you provide, instead of against the regular project timelines. Each timeline is a dictionary with two keys, `id` (string) and `events` (list), where each event is a dictionary with fields that match your Project's schema. For example:

```python
timeline = {
    'id': 'my_custom_timeline_id',
    'events': [{
        'bid': {
            'ad_size': 'absolutely_enormous'
        }
    }]
}

select(
    'bid.ad_size'
).from_events('lethe4')\
    .external_timelines([timeline])  # this is a list; multiple timelines can be provided
```

Query results:

partition "_default"
column_name result
0 _c0 absolutely_enormous
query produced 1 rows x 1 columns in 0.37 seconds

## The Downsample By Operator

The `downsample_by()` operator allows you to conditionally downsample a resultset, depending on the parameters. For more information, see the [API reference](tql.query.Query.downsample_by) on downsampling.

```python
# downsample the data such that we have twice as many negative records as positive records
select(
    label("IF(type == 'bid', 0, 1)")
).from_events('lethe4')\
    .limit(3)\
    .downsample_by(neg_pos_ratio=2, reweight=False)
```

Query results:

partition "_default"
_label
0 1.0
1 0.0
2 0.0
query produced 3 rows x 1 columns in 0.5 seconds
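The `neg_pos_ratio` idea can be approximated in plain Python: keep every positive row, and randomly sample negatives until there are at most ratio × positives. This is a conceptual sketch only; the real operator also supports reweighting via `reweight=True`:

```python
import random

random.seed(0)

def downsample(labels, neg_pos_ratio):
    """Keep all positives; sample negatives down so that
    #negatives <= neg_pos_ratio * #positives.
    (Conceptual sketch of downsample_by(), not the real implementation.)"""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    target = int(neg_pos_ratio * len(positives))
    if len(negatives) > target:
        negatives = random.sample(negatives, target)
    return sorted(positives + negatives)

labels = [1] * 10 + [0] * 100   # heavily imbalanced toy labels
kept = downsample(labels, neg_pos_ratio=2)
print(len(kept))  # 10 positives + 20 negatives = 30
```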

## The Options Operator

More details about the options operator can be found in the [Options Guide](./options.html).

```python
select(
    numerical('SUM_OPPORTUNITIES("c")')
).from_events('lethe4')\
    .options(
        max_columns=2,  # normally there are 7 kernels in the result; remove this line to see all of them
        expand_numerical_features=True
    ).limit(3)
```

Query results:

partition "_default"
_c0_EQ_c-3d-e _c0_EQ_c-7d-e
0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
query produced 3 rows x 2 columns in 0.4 seconds

## The *_var Operators

These operators let you define scoped variables that can be referenced one or more times in your columns, via `TIMELINE_VAR()` and its relatives in TQL expressions. There are 3 var operators:

- `event_var` (evaluated once per event)
- `timeline_var` (evaluated once per timeline)
- `global_var` (evaluated once per ResultSet)

Below, a timeline variable `first_bid` filters the events to type `bid` and takes the first one; because it is a timeline variable, the expensive filter runs only once per timeline rather than once per expression evaluation:

```python
# compute and memoize this expensive operation
# (avoid scanning all events to extract the first bid every time an expression is evaluated)
select(
    'TIMELINE_VAR("first_bid").type',  # retrieve precomputed timeline variable 'first_bid'
    'TIMELINE_VAR("first_bid").bid.request_id',
).from_events("lethe4")\
    .timeline_var('first_bid', 'FILTER(timeline.events, (x) -> x.type == "bid")[0]')\
    .limit(3)
```

Query results:

partition "_default"
_c0 _c1
0 bid r13260
1 bid r13260
2 bid r13260
query produced 3 rows x 2 columns in 0.38 seconds
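The per-timeline memoization that `timeline_var()` provides can be sketched as: compute the variable the first time a timeline is seen, then reuse it for every event on that timeline. The toy code below illustrates the caching pattern, not the TQL engine:

```python
def first_bid(events):
    """The 'expensive' computation: scan for the first bid event."""
    for e in events:
        if e['type'] == 'bid':
            return e
    return None

calls = {'count': 0}

def timeline_var_cached(events, cache):
    # compute once per timeline, like timeline_var('first_bid', ...)
    if 'first_bid' not in cache:
        calls['count'] += 1
        cache['first_bid'] = first_bid(events)
    return cache['first_bid']

events = [{'type': 'user'},
          {'type': 'bid', 'request_id': 'r13260'},
          {'type': 'bid', 'request_id': 'r13376'}]
cache = {}
# three rows reference the variable, but the scan runs only once
rows = [(timeline_var_cached(events, cache) or {}).get('request_id') for _ in events]
print(rows, calls['count'])  # ['r13260', 'r13260', 'r13260'] 1
```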

## The Opportunities Operator

Compare the results below. The default kernels are 5 minutes, 15 minutes, 1 hour, 4 hours, 1 day, 3 days, and 7 days (1 week).

In the first example, `SUM_OPPORTUNITIES` aggregates `bid.ad_size` via a time-difference-weighted, categorical-grouped sum over events of type `bid`. It calculates the hypothesized value of that bid (opportunity) over each kernel (time window). In the second example, we do the same, but for events of type `activity`, and we set an exponential-decay distribution to evaluate the "opportunity to treat" given the size of the ad.

```python
# no opportunities() operator
select(
    event_metadata(),
    'bid.ad_size',
    'SUM_OPPORTUNITIES(bid.ad_size)'
).from_events('lethe4')\
    .where('type == "bid"')\
    .limit(3)
```

Query results:

partition "_default"
_timeline_id _event_id _datetime _type _c4 _c5
0 u252 51539616375 2021-01-02 00:02:36 bid big [big-5m-e:1,big-15m-e:1,big-1h-e:1,big-4h-e:1,big-1d-e:1,big-3d-e:1,big-7d-e:1]
1 u252 51539616491 2021-01-02 00:16:41 bid big [big-5m-e:1,big-15m-e:1,big-1h-e:1,big-4h-e:1,big-1d-e:1,big-3d-e:1,big-7d-e:1]
2 u252 51539616624 2021-01-02 00:30:55 bid small [small-5m-e:1,small-15m-e:1,small-1h-e:1,small-4h-e:1,small-1d-e:1,small-3d-e:1,small-7d-e:1]
query produced 3 rows x 6 columns in 0.41 seconds

```python
# with opportunities() operator
select(
    event_metadata(),
    'bid.ad_size',
    'SUM_OPPORTUNITIES(bid.ad_size)'
).from_events('lethe4')\
    .where('type == "activity"')\
    .opportunities('type == "bid"', distribution='exp')\
    .limit(3)
# for the given outcome (activity), there were some number of opportunities that preceded it;
# SUM_OPPORTUNITIES iterates over each opportunity and looks at the time delta
```

Query results:

partition "_default"
_timeline_id _event_id _datetime _type _c4 _c5
0 u252 51539608944 2021-01-02 00:57:47 activity None [small-1d-e:0.9815155646112473,small-4h-e:5.364563568032985,big-3d-e:0.6592795151474853,small-1h-e:15.33712028705239,small-5m-e:1.335929289646685,big-5m-e:0.08217085284903825,big-1h-e:21.66527965846466,big-7d-e:0.2843530487444983,big-1d-e:1.934265027978346,big-4h-e:9.823198048668358,big-15m-e:8.622850831823627,small-3d-e:0.3312667209981967,small-15m-e:16.01037709354400,small-7d-e:0.1424768864228655]
1 u252 51539609033 2021-01-02 02:15:43 activity None [small-1d-e:1.881235048884746,small-4h-e:8.327589681452498,big-3d-e:0.9778850917217536,small-1h-e:11.44939465428303,small-5m-e:0.0001706822397623980,big-5m-e:0.1361897575453083,big-1h-e:18.59069078435085,big-7d-e:0.4244786628919324,big-1d-e:2.806128853481813,big-4h-e:12.21484985754475,big-15m-e:7.526981406797988,small-3d-e:0.6531907947592189,small-15m-e:0.8947315945806531,small-7d-e:0.2832241682457595]
2 u252 51539609120 2021-01-02 03:36:02 activity None [small-1d-e:1.779180809669550,small-4h-e:5.959110843185616,big-3d-e:1.950292695357807,small-1h-e:3.002141144325396,big-5m-e:20.72168280223276,big-1h-e:44.28131765794558,big-7d-e:0.8479149361292611,big-1d-e:5.568841224025949,big-4h-e:23.96490852181065,big-15m-e:50.11016424082054,small-3d-e:0.6411589816600961,small-15m-e:0.004229481186244135,small-7d-e:0.2809764266104696]
query produced 3 rows x 6 columns in 0.49 seconds
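The exponential-decay weighting can be illustrated as follows: each opportunity preceding the outcome contributes a decayed weight such as `exp(-Δt/τ)` for a kernel with time constant τ, and contributions are summed per (value, kernel) pair. This is a conceptual sketch under an assumed kernel formula; the exact decay math and normalization TQL uses may differ:

```python
import math
from collections import defaultdict

def sum_opportunities(outcome_ts, opportunities, kernels):
    """For one outcome event, sum an exponentially decayed weight
    over each preceding opportunity, grouped by (value, kernel).
    Assumed kernel: exp(-delta_t / tau); TQL's exact math may differ."""
    features = defaultdict(float)
    for opp_ts, value in opportunities:
        delta = outcome_ts - opp_ts
        if delta < 0:
            continue  # only opportunities that precede the outcome count
        for name, tau in kernels.items():
            features[f'{value}-{name}-e'] += math.exp(-delta / tau)
    return dict(features)

kernels = {'1h': 3_600_000, '1d': 86_400_000}        # tau in milliseconds
opportunities = [(0, 'big'), (1_800_000, 'small')]   # bids at t=0 and t=30min
print(sum_opportunities(3_600_000, opportunities, kernels))
```

Notice how the same opportunity contributes a much larger weight under the short kernel when it happened recently, mirroring the per-kernel columns in the results above.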

This feature can be used to write validation test cases for your data, or to build a real-time scoring service if used in conjunction with the `PREDICT()` function.

## The Union or Sampling Operator

The union or sampling operator injects new events into each timeline, only for the scope of the current query. The generation of these events is described by the `generate_events()` function, which takes one or more timestamps and, optionally, attribute expressions to set on the generated events.

The sampling operator is a powerful tool for building panel datasets that would be difficult to create with SQL. An example of a query that is hard in SQL but easy in TQL: take a daily sample, count the number of events on the timeline, and count the number of events in the previous day.

Below, events are generated at epochs 0ms and 1000ms, and the union of the project's events and those generated events is selected. A new type of conversion, `simulated_purchase`, has been injected into the selection, and it appears alongside the timeline's real conversions. Generated events are of type `generated_outcome` by default.

```python
select(
    'timeline.id',
    'type',
    'timestamp',
    'activity.activity_id',
    'activity.conversion'
).from_events('lethe4')\
    .union(generate_events('[0, 1000]', attribute_expressions={
        'activity.conversion': '"simulated_purchase"',
    }))\
    .limit(5)
```

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4
0 u252 generated_outcome 0 None simulated_purchase
1 u252 generated_outcome 1000 None simulated_purchase
2 u252 user 1609459200000 None None
3 u252 bid 1609545756000 None None
4 u252 bid 1609546601000 None None
query produced 5 rows x 5 columns in 0.41 seconds
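Conceptually, `union()` merges the generated events with the timeline's real events before the query runs, as in this sketch. The helper and event dicts are toys for illustration; the real operator does this server-side:

```python
def make_generated_events(timestamps, attributes):
    # sketch: build synthetic events of type 'generated_outcome'
    return [{'type': 'generated_outcome', 'timestamp': ts, **attributes}
            for ts in timestamps]

real_events = [
    {'type': 'user', 'timestamp': 1609459200000},
    {'type': 'bid', 'timestamp': 1609545756000},
]
generated = make_generated_events([0, 1000],
                                  {'activity.conversion': 'simulated_purchase'})

# the union, ordered by timestamp, is what the query then selects from
merged = sorted(real_events + generated, key=lambda e: e['timestamp'])
print([e['type'] for e in merged])
# ['generated_outcome', 'generated_outcome', 'user', 'bid']
```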


## User-Defined Functions

More complex functions can be built on top of the expression language; user-defined functions are a superset of the expression language. A simple example is provided below. For a more complete guide, refer to the [User-Defined Functions Guide](user-defined-functions).

```python
div2_func_src = '''
/**
 * given an integer, long, or floating point number, return the value / 2
 * @param x Number
 * @return Double
 */
function div2(x) {
  x / 2
}
'''

select(
    'div2(4)'
).from_events('lethe4')\
    .udf(div2_func_src)\
    .limit(1)
```

Query results:

partition "_default"
column_name result
0 _c0 2.0
query produced 1 rows x 1 columns in 0.39 seconds