# Writing Queries

The primary purpose of TQL is to use the Query API to extract machine-learning-ready [ResultSets](resultsets) from Timeline data. Each row in a ResultSet is an event on a timeline, and each column is a piece of data extracted from the event using the [TQL Expression Language](expression-language).

## Query Syntax

Similar to SQL, every query starts with a `select` statement containing one or more TQL column expressions, a `from` clause that designates which set of timelines to select from, and optional operators that modify the output, such as `where()` or `limit()`.

Each call to a Query function modifies the state of the Query itself and also returns the Query, so operators can be chained together in the [fluent style](https://en.wikipedia.org/wiki/Fluent_interface) of programming. Once the Query has been built up to the desired state, the `submit()` function is called, which returns a TQL ResultSet containing the data (or a pointer to the data), as well as statistics about the query execution.

In the following example, we select the constant `1` from timelines of the demo project `lethe4`:

```python
from zeenk.tql import *

# select() emits a Query, and submit() emits a ResultSet
query = select(1).from_events('lethe4').limit(1)
resultset = query.submit()
print(type(query))
print(type(resultset))
```

## Query Columns

Each string passed to `select()` is actually a [TQL expression](expression-language), which describes how to translate timeline events into pieces of derived data. In the following example, we select the string literals `'hello'` and `'world'`. Since the selected data is constant, this query returns the same value for every row. The `limit(2)` runs the query against only the first 2 events of the project, avoiding a scan of the full ~200,000 rows:

```python
select("'hello'", "'world'")\
    .from_events('lethe4')\
    .limit(2)
```

Query results:

partition "_default"
_c0 _c1
0 hello world
1 hello world
query produced 2 rows x 2 columns in 0.45 seconds
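The fluent pattern described above, where each operator mutates the query and returns it, can be illustrated with a minimal toy builder. The `ToyQuery` class below is hypothetical and purely illustrative; it is not part of the `zeenk.tql` API:

```python
class ToyQuery:
    """A minimal sketch of a fluent builder, mimicking how TQL's
    Query operators mutate state and return self for chaining.
    (Illustrative only -- not the real zeenk.tql Query class.)"""

    def __init__(self, columns):
        self.columns = columns
        self.source = None
        self.max_rows = None

    def from_events(self, project):
        self.source = project
        return self  # returning self is what enables chaining

    def limit(self, rows):
        self.max_rows = rows
        return self

q = ToyQuery([1]).from_events('lethe4').limit(1)
print(q.source, q.max_rows)  # lethe4 1
```

Because every operator returns the same object, the chain `select(...).from_events(...).limit(...)` reads left to right while building up one query state.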

## Expression Types

Expressions have several types: integer, long, float, string, null, boolean, array, and map.

```python
select(
    123467890,
    1.0000001,
    "'string'",
    None,
    True,
    '[1, 2, 3]',
    "{'key':'value'}"
).from_events('lethe4').limit(2)
```

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5 _c6
0 123467890 1.0000001 string None true [1,2,3] {key=value}
1 123467890 1.0000001 string None true [1,2,3] {key=value}
query produced 2 rows x 7 columns in 0.53 seconds

## Selecting Event Attributes

Expressions are applied to every event, and each row of the result corresponds to one event on the timeline. In a heterogeneous dataset, different rows may come from events of different types. `timeline.id` and `type` are universal variables that exist in all projects. Other variables, such as those under `activity.*`, belong to a specific project: for example, `activity.conversion` and `bid.ad_size` are specific to this timeline data and may not be populated in every row.

```python
select(
    'timeline.id',
    'timestamp',
    'type',
    'activity.conversion',
    'bid.ad_size',
    'user.gender'
).from_events('lethe4').limit(3)
```

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5
0 u252 1609459200000 user None None male
1 u252 1609545756000 bid None big None
2 u252 1609546601000 bid None big None
query produced 3 rows x 6 columns in 0.61 seconds

## The Select Wildcard Operator

### The wildcard operator selects all event attributes.

```python
select('*').from_events('lethe4').limit(2)
```

Query results:

partition "_default"
activity_activity_id activity_conversion activity_event_time activity_user_id bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id ... outcome timeline_id timestamp type user_age user_gender user_group user_region user_timestamp user_user_id
0 None None None None None None None None None None ... None u252 1609459200000 user 55plus male influenced south 2021-01-01 u252
1 None None None None big 1.0 2021-01-02 00:02:36 false r13260 u252 ... None u252 1609545756000 bid None None None None None None

query produced 2 rows x 26 columns in 0.84 seconds

### The wildcard operator can also select a subset of event attributes.

```python
select('bid.*').from_events('lethe4').limit(3)
```

Query results:

partition "_default"
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
query produced 3 rows x 7 columns in 0.61 seconds

## The Validation Operator

Using `validate()` is a quick way to check whether your query is valid. In the expression below, the expression validator is expecting to see a value assigned to `a`:

```python
select('a=').from_events('lethe4').validate()
```

Encountered 1 expression compilation error:
Line 1:2 no viable alternative at input 'a=' (in column _c0)
Expression error: Line 1:2 no viable alternative at input 'a=' (in column _c0)
a=
--^
---------------------------------------------------------------------------
TQLAnalysisException                      Traceback (most recent call last)
in
----> 1 select('a=').from_events('lethe4').validate()

~/src/noumena/python/zeenk-tql/zeenk/tql/query.py in validate(self)
    854         payload = self.json()
    855         if payload != self._last_validated_spec:
--> 856             icarus.post('query/validate', json_data=payload)
    857             self._last_validated_spec = payload
    858

~/src/noumena/python/zeenk-tql/zeenk/tql/icarus.py in post(url, json_data, files_data, data, print_json)
     14         print(json.dumps(json_data))
     15     resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16     return _handle_response(url, resp, print_json=print_json)
     17
     18

~/src/noumena/python/zeenk-tql/zeenk/tql/icarus.py in _handle_response(url, resp, print_json)
     70         else:
     71             print(reason)
---> 72         raise TQLAnalysisException(reason)
     73     else:
     74         msg = ''

TQLAnalysisException: Encountered 1 expression compilation error:
Line 1:2 no viable alternative at input 'a=' (in column _c0)
Expression error: Line 1:2 no viable alternative at input 'a=' (in column _c0)
a=
--^

## The Timeline Limit Operator

The `limit()` operator can also cap the number of timelines the query runs against. In the following query, we select a _maximum_ of 2 events from a _maximum_ of 1 timeline:

```python
select('timeline.id').from_events('lethe4').limit(rows=2, timelines=1)
```

Query results:

partition "_default"
_c0
0 u252
1 u252
query produced 2 rows x 1 columns in 0.36 seconds

## The Where Operator

The `where()` operator only emits rows for which the given expression evaluates to true. It takes a truthy expression: anything non-zero, non-empty, and non-null evaluates to true. The example below selects events that are of type `activity` and whose activity was a purchase:

```python
select(
    'type',
    'activity.conversion'
).from_events('lethe4')\
    .where('type == "activity" and activity.conversion == "purchase"')\
    .limit(2)
```

Query results:

partition "_default"
_c0 _c1
0 activity purchase
1 activity purchase
query produced 2 rows x 2 columns in 0.43 seconds
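The truthiness rule described above (non-zero, non-empty, non-null) can be sketched in plain Python. This is an illustration of the documented semantics, not the actual TQL evaluator:

```python
def tql_truthy(value):
    """Approximate the where() truthiness rule described above:
    null, zero, and empty values are false; everything else is true.
    (A sketch of the documented semantics, not the real evaluator.)"""
    if value is None:
        return False
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, (str, list, dict)):
        return len(value) > 0
    return bool(value)

print([tql_truthy(v) for v in [0, 1, '', 'activity', None, [], [1]]])
# [False, True, False, True, False, False, True]
```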

## The From Timelines Operator

The `from_timelines()` operator is roughly equivalent to a "group by user/timeline_id" operation: its ResultSet has exactly one row per timeline. With `from_timelines()`, only timeline-level attributes can be selected. Operations that aggregate over `timeline.events` are useful here, such as `SIZE(timeline.events)` or more complex per-timeline transformations. More details are provided in the Python API Doc.

```python
select(
    # the timeline id
    'timeline.id',
    # the number of events on the timeline
    'SIZE(timeline.events)',
    # the second bid (index 1) on each user's timeline
    'events = FILTER(timeline.events, (x) -> x.bid); events[1]',
).from_timelines('lethe4')\
    .limit(3)
```

Query results:

partition "_default"
_c0 _c1 _c2
0 u252 549.0 {"timestamp":1609546601000,"id":51539616491,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":false,"won":true,"request_id":"r13376","ad_size":"big","event_time":"2021-01-02 00:16:41","user_id":"u252"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null}
1 u577 802.0 {"timestamp":1609461055000,"id":25769817455,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":true,"won":false,"request_id":"r267","ad_size":"big","event_time":"2021-01-01 00:30:55","user_id":"u577"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null}
2 u582 461.0 {"timestamp":1609460058000,"id":25769817308,"type":"bid","activity":null,"bid":{"bid":1.0,"ghosted":false,"won":true,"request_id":"r120","ad_size":"small","event_time":"2021-01-01 00:14:18","user_id":"u582"},"user":null,"generated":null,"index":null,"duration":null,"generated_weight":null}
query produced 3 rows x 3 columns in 0.51 seconds
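Conceptually, `from_timelines()` behaves like the following pure-Python group-by sketch: one output row per timeline, with aggregations computed over that timeline's events. The toy data structures here are assumed for illustration only:

```python
# toy timelines, shaped loosely like the project data above
timelines = [
    {'id': 'u252', 'events': [{'type': 'user'},
                              {'type': 'bid', 'request_id': 'r1'},
                              {'type': 'bid', 'request_id': 'r2'}]},
    {'id': 'u577', 'events': [{'type': 'bid', 'request_id': 'r3'}]},
]

rows = []
for tl in timelines:
    bids = [e for e in tl['events'] if e['type'] == 'bid']  # like FILTER(timeline.events, ...)
    rows.append({
        'timeline_id': tl['id'],
        'num_events': len(tl['events']),                    # like SIZE(timeline.events)
        'second_bid': bids[1] if len(bids) > 1 else None,   # like events[1]
    })

for r in rows:
    print(r)
```

Note that the output has one row per timeline, regardless of how many events each timeline contains.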

## The Submit Operator

TQL Queries submitted to Spark also return ResultSet objects, but each partition in the ResultSet will contain a path to the data on local disk or a distributed file system, instead of the data itself. To read the data, see the [load](resultsets.html#load) section of the ResultSet Guide.

```python
select(
    'type',
    'timestamp'
).from_events('lethe4')\
    .limit(4)\
    .submit(spark=True)  # submit the job to a Spark cluster, where the data is processed by worker nodes
```


SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:17, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:16

Query results:

partition "_default"
key value
0 name _default
1 data_path file:/Users/smaguire/.tql/files/datasets/2021-08-30/37066623867876/data/partition=_default
2 row_count 32
3 columns [_c0, _c1]
4 positive_row_count 0
query produced 4 rows x 2 columns in 17.92 seconds

## The Format Operator for Spark ResultSets

Using `format()`, the output can be written as CSV, JSON, or Parquet:

```python
select(
    'type',
    'timestamp',
    'activity.activity_id',
    'activity.conversion'
).from_events('lethe4')\
    .format('json')\
    .limit(5)\
    .submit(spark=True)
```


SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:06, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:05

Query results:

partition "_default"
key value
0 name _default
1 data_path file:/Users/smaguire/.tql/files/datasets/2021-08-30/37093997083474/data/partition=_default
2 row_count 40
3 columns [_c0, _c1, _c2, _c3]
4 positive_row_count 0
query produced 5 rows x 4 columns in 7.3 seconds

## Column

Columns have types. You must provide an expression, such as `RANDOM()`, `2.0`, or `a='5'`. You can optionally provide names, or cast data by assigning the column a machine-learning type.

```python
select(
    # a label column
    label('RANDOM()', 'my_label'),
    # a weight column for an ML resultset; values must be numeric
    weight(2.0, name='my_weight_col'),
    # a tag column is a unique identifier for the row; not required, but highly encouraged
    tag("'tag'", name='my_tag_col'),
    # a categorical feature
    categorical('type', name='my_categorical_col'),
    # a numerical feature; can be a mixed numerical separated by ':', e.g. foo:1.3
    numerical("'foo:1.3'", name='my_numerical_col'),
    # a metadata column
    metadata("'metadata'", name='my_metadata_col'),
    # universal casting operation from string, dict, tuple, etc. to a column;
    # defaults to metadata if type is not provided
    col('YEAR(MILLIS(TO_DATETIME("2020-05-19 03:17:02.000")))', name='my_date_col', type='label')
).from_events('lethe4').limit(3)
```

Query results:

partition "_default"
my_label my_weight_col my_tag_col my_categorical_col my_numerical_col my_metadata_col my_date_col
0 2020.0 2.0 tag user foo:1.300000000000000 metadata 2020.0
1 2020.0 2.0 tag bid foo:1.300000000000000 metadata 2020.0
2 2020.0 2.0 tag bid foo:1.300000000000000 metadata 2020.0
query produced 3 rows x 7 columns in 0.37 seconds

## FeatureColumn

When a column is cast to a `categorical()` or `numerical()` feature, the following additional behavior is applied:

- query option `global_min_total_count` can be applied
- query option `apply_charset_filter` can be applied
- query option `drop_empty_rows` can be applied
- query option `expand_numerical_feature` can be applied
- query option `drop_numerical_zero_features` can be applied
- column filters can be configured on the column
- column value metrics are computed on the column

See the [Query Options](options) guide for more information on Query Options.

```python
print(type(numerical('1')))
print(type(categorical('type')))
```

## Filtering on Columns

Filtering is similar in spirit to `where()`, but it operates on a single column rather than the whole row: a filter conditionally nulls out a feature. If a feature has only a small number of non-null values, you can use a filter like `min_total_count` to set a threshold on the minimum number of non-null rows. The `min_total_count` filter only applies to categorical and numerical features.

```python
select(
    # min_total_count: the column must have non-null values in at least 3 rows,
    # or it will be nulled out for all rows
    categorical('bid.ad_size').filter(min_total_count=3)
).from_events('lethe4')\
    .where("type == 'bid'")\
    .limit(3)
```

Query results:

partition "_default"
_c0
0 big
1 big
2 small
query produced 3 rows x 1 columns in 0.4 seconds
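The `min_total_count` rule can be sketched as a post-processing pass over a column: if fewer than the threshold of rows are non-null, the whole column is nulled out. This is an illustration of the documented behavior, not the server-side implementation:

```python
def apply_min_total_count(column_values, min_total_count):
    """Null out the entire column if it has fewer than
    min_total_count non-null values, per the filter described above.
    (Sketch of the documented behavior, not the real implementation.)"""
    non_null = sum(1 for v in column_values if v is not None)
    if non_null < min_total_count:
        return [None] * len(column_values)
    return list(column_values)

print(apply_min_total_count(['big', 'big', 'small'], 3))  # threshold met: column kept
print(apply_min_total_count(['big', None, None], 3))      # threshold missed: column nulled out
```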

## Using event_metadata() and event_time()

These functions define sets of default columns. Notice that they provide both a column name and an expression that contains a variable.

```python
event_metadata()
```

(Column, Column, Column, Column)

```python
event_time()
```

(Column, Column)

```python
select(event_metadata(), event_time()) \
    .from_events('lethe4') \
    .limit(3)
```

Query results:

partition "_default"
_timeline_id _event_id _datetime _type _timestamp _duration
0 u252 252 2021-01-01 00:00:00 user 1609459200000 None
1 u252 51539616375 2021-01-02 00:02:36 bid 1609545756000 None
2 u252 51539616491 2021-01-02 00:16:41 bid 1609546601000 None
query produced 3 rows x 6 columns in 0.4 seconds

## The Partition By Operator

The `partition_by()` operator splits the data into _n_ partitions. It takes an expression that emits an enumerated value, which assigns each row to a partition. In addition to simple operations like randomly assigning rows to a partition, you can also use an attribute of the event to assign data to partitions more intelligently.

In all previous examples, the `_default` partition was selected. We will now use `partition_by()` to create named partitions. The following partitions the data such that roughly 80% goes into the `train` partition and 20% into the `test` partition:

```python
select(
    'type',
    'timestamp'
).from_events('lethe4')\
    .partition_by('IF(RANDOM() < 0.8, "train", "test")')\
    .limit(5)
```

Query results:

partition "train"
_c0 _c1
0 user 1609459200000
1 bid 1609545756000
2 bid 1609546601000
3 bid 1609547455000
4 activity 1609549067000
query produced 5 rows x 2 columns in 0.42 seconds
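The random 80/20 split expression above is equivalent in spirit to the following Python sketch, where each row independently draws a random number to pick its partition:

```python
import random

random.seed(0)  # deterministic for the example

def assign_partition(_row):
    # mirrors IF(RANDOM() < 0.8, "train", "test")
    return 'train' if random.random() < 0.8 else 'test'

counts = {'train': 0, 'test': 0}
for row in range(10_000):
    counts[assign_partition(row)] += 1

print(counts)  # roughly 80% train / 20% test
```

Because each row is assigned independently, the split is only approximately 80/20 on any finite sample.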

```python
# ex: put events of each ad_size into their own partition
select(
    col('type', 'type'),
    col('bid.ad_size', 'size')
).from_events('lethe4')\
    .where("type=='bid'")\
    .partition_by('IF(bid.ad_size == "small", "small", "big")')\
    .limit(7)
```

Query results:

partition "small"
type size
0 bid small
1 bid small
partition "big"
type size
0 bid big
1 bid big
2 bid big
3 bid big
4 bid big
query produced 7 rows x 2 columns in 0.41 seconds

## The External Timelines Operator

The `external_timelines()` operator can be used to run your query against custom timeline data you provide, instead of against the regular project timelines. Each timeline is a dictionary with two keys, `id` (string) and `events` (list), where each event is a dictionary with fields that match your Project's schema. For example:

```python
timeline = {
    'id': 'my_custom_timeline_id',
    'events': [{
        'bid': {
            'ad_size': 'absolutely_enormous'
        }
    }]
}

select(
    'bid.ad_size'
).from_events('lethe4')\
    .external_timelines([timeline])  # this is a list; multiple timelines can be provided
```

Query results:

partition "_default"
column_name result
0 _c0 absolutely_enormous
query produced 1 rows x 1 columns in 0.37 seconds

## The Downsample By Operator

The `downsample_by()` operator allows you to conditionally downsample a resultset, depending on the parameters. For more information, see the [API reference](tql.query.Query.downsample_by) on downsampling.

```python
# downsample the data such that we have twice as many negative records as positive records
select(
    label("IF(type == 'bid', 0, 1)")
).from_events('lethe4')\
    .limit(3)\
    .downsample_by(neg_pos_ratio=2, reweight=False)
```

Query results:

partition "_default"
_label
0 1.0
1 0.0
2 0.0
query produced 3 rows x 1 columns in 0.5 seconds
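The `neg_pos_ratio` idea can be approximated in plain Python: keep every positive row, and randomly sample negatives until there are at most ratio × positives. This is a conceptual sketch only; the real operator also supports reweighting via `reweight=True`:

```python
import random

random.seed(0)

def downsample(labels, neg_pos_ratio):
    """Keep all positives; sample negatives down so that
    #negatives <= neg_pos_ratio * #positives.
    (Conceptual sketch of downsample_by(), not the real implementation.)"""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    target = int(neg_pos_ratio * len(positives))
    if len(negatives) > target:
        negatives = random.sample(negatives, target)
    return sorted(positives + negatives)

labels = [1] * 10 + [0] * 100   # heavily imbalanced toy labels
kept = downsample(labels, neg_pos_ratio=2)
print(len(kept))  # 10 positives + 20 negatives = 30
```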

## The Options Operator

More details about the options operator can be found in the [Options Guide](./options.html).

```python
select(
    numerical('SUM_OPPORTUNITIES("c")')
).from_events('lethe4')\
    .options(
        max_columns=2,  # normally there are 7 kernels in the result; remove this line to see all of them
        expand_numerical_features=True
    ).limit(3)
```

Query results:

partition "_default"
_c0_EQ_c-3d-e _c0_EQ_c-7d-e
0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
query produced 3 rows x 2 columns in 0.4 seconds

## The *_var Operators

These operators let you define scoped variables that can be referenced one or more times in your columns, via `TIMELINE_VAR()` and its relatives in TQL expressions. There are 3 var operators:

- `event_var` (evaluated once per event)
- `timeline_var` (evaluated once per timeline)
- `global_var` (evaluated once per ResultSet)

Below, a timeline variable `first_bid` filters the events to type `bid` and takes the first one; because it is a timeline variable, the expensive filter runs only once per timeline rather than once per expression evaluation:

```python
# compute and memoize this expensive operation
# (avoid scanning all events to extract the first bid every time an expression is evaluated)
select(
    'TIMELINE_VAR("first_bid").type',  # retrieve precomputed timeline variable 'first_bid'
    'TIMELINE_VAR("first_bid").bid.request_id',
).from_events("lethe4")\
    .timeline_var('first_bid', 'FILTER(timeline.events, (x) -> x.type == "bid")[0]')\
    .limit(3)
```

Query results:

partition "_default"
_c0 _c1
0 bid r13260
1 bid r13260
2 bid r13260
query produced 3 rows x 2 columns in 0.38 seconds
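The per-timeline memoization that `timeline_var()` provides can be sketched as: compute the variable the first time a timeline is seen, then reuse it for every event on that timeline. The toy code below illustrates the caching pattern, not the TQL engine:

```python
def first_bid(events):
    """The 'expensive' computation: scan for the first bid event."""
    for e in events:
        if e['type'] == 'bid':
            return e
    return None

calls = {'count': 0}

def timeline_var_cached(events, cache):
    # compute once per timeline, like timeline_var('first_bid', ...)
    if 'first_bid' not in cache:
        calls['count'] += 1
        cache['first_bid'] = first_bid(events)
    return cache['first_bid']

events = [{'type': 'user'},
          {'type': 'bid', 'request_id': 'r13260'},
          {'type': 'bid', 'request_id': 'r13376'}]
cache = {}
# three rows reference the variable, but the scan runs only once
rows = [(timeline_var_cached(events, cache) or {}).get('request_id') for _ in events]
print(rows, calls['count'])  # ['r13260', 'r13260', 'r13260'] 1
```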

## The Opportunities Operator

Compare the results below. The default kernels are 5 minutes, 15 minutes, 1 hour, 4 hours, 1 day, 3 days, and 7 days (1 week).

In the first example, `SUM_OPPORTUNITIES` aggregates `bid.ad_size` via a time-difference-weighted, categorical-grouped sum over events of type `bid`. It calculates the hypothesized value of that bid (opportunity) over each kernel (time window). In the second example, we do the same, but for events of type `activity`, and we set an exponential-decay distribution to evaluate the "opportunity to treat" given the size of the ad.

```python
# no opportunities() operator
select(
    event_metadata(),
    'bid.ad_size',
    'SUM_OPPORTUNITIES(bid.ad_size)'
).from_events('lethe4')\
    .where('type == "bid"')\
    .limit(3)
```

Query results:

partition "_default"
_timeline_id _event_id _datetime _type _c4 _c5
0 u252 51539616375 2021-01-02 00:02:36 bid big [big-5m-e:1,big-15m-e:1,big-1h-e:1,big-4h-e:1,big-1d-e:1,big-3d-e:1,big-7d-e:1]
1 u252 51539616491 2021-01-02 00:16:41 bid big [big-5m-e:1,big-15m-e:1,big-1h-e:1,big-4h-e:1,big-1d-e:1,big-3d-e:1,big-7d-e:1]
2 u252 51539616624 2021-01-02 00:30:55 bid small [small-5m-e:1,small-15m-e:1,small-1h-e:1,small-4h-e:1,small-1d-e:1,small-3d-e:1,small-7d-e:1]
query produced 3 rows x 6 columns in 0.41 seconds

```python
# with opportunities() operator
select(
    event_metadata(),
    'bid.ad_size',
    'SUM_OPPORTUNITIES(bid.ad_size)'
).from_events('lethe4')\
    .where('type == "activity"')\
    .opportunities('type == "bid"', distribution='exp')\
    .limit(3)
# for the given outcome (activity), there were some number of opportunities that preceded it;
# SUM_OPPORTUNITIES iterates over each opportunity and looks at the time delta
```

Query results:

partition "_default"
_timeline_id _event_id _datetime _type _c4 _c5
0 u252 51539608944 2021-01-02 00:57:47 activity None [small-1d-e:0.9815155646112473,small-4h-e:5.364563568032985,big-3d-e:0.6592795151474853,small-1h-e:15.33712028705239,small-5m-e:1.335929289646685,big-5m-e:0.08217085284903825,big-1h-e:21.66527965846466,big-7d-e:0.2843530487444983,big-1d-e:1.934265027978346,big-4h-e:9.823198048668358,big-15m-e:8.622850831823627,small-3d-e:0.3312667209981967,small-15m-e:16.01037709354400,small-7d-e:0.1424768864228655]
1 u252 51539609033 2021-01-02 02:15:43 activity None [small-1d-e:1.881235048884746,small-4h-e:8.327589681452498,big-3d-e:0.9778850917217536,small-1h-e:11.44939465428303,small-5m-e:0.0001706822397623980,big-5m-e:0.1361897575453083,big-1h-e:18.59069078435085,big-7d-e:0.4244786628919324,big-1d-e:2.806128853481813,big-4h-e:12.21484985754475,big-15m-e:7.526981406797988,small-3d-e:0.6531907947592189,small-15m-e:0.8947315945806531,small-7d-e:0.2832241682457595]
2 u252 51539609120 2021-01-02 03:36:02 activity None [small-1d-e:1.779180809669550,small-4h-e:5.959110843185616,big-3d-e:1.950292695357807,small-1h-e:3.002141144325396,big-5m-e:20.72168280223276,big-1h-e:44.28131765794558,big-7d-e:0.8479149361292611,big-1d-e:5.568841224025949,big-4h-e:23.96490852181065,big-15m-e:50.11016424082054,small-3d-e:0.6411589816600961,small-15m-e:0.004229481186244135,small-7d-e:0.2809764266104696]
query produced 3 rows x 6 columns in 0.49 seconds
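The exponential-decay weighting can be illustrated as follows: each opportunity preceding the outcome contributes a decayed weight such as `exp(-Δt/τ)` for a kernel with time constant τ, and contributions are summed per (value, kernel) pair. This is a conceptual sketch under an assumed kernel formula; the exact decay math and normalization TQL uses may differ:

```python
import math
from collections import defaultdict

def sum_opportunities(outcome_ts, opportunities, kernels):
    """For one outcome event, sum an exponentially decayed weight
    over each preceding opportunity, grouped by (value, kernel).
    Assumed kernel: exp(-delta_t / tau); TQL's exact math may differ."""
    features = defaultdict(float)
    for opp_ts, value in opportunities:
        delta = outcome_ts - opp_ts
        if delta < 0:
            continue  # only opportunities that precede the outcome count
        for name, tau in kernels.items():
            features[f'{value}-{name}-e'] += math.exp(-delta / tau)
    return dict(features)

kernels = {'1h': 3_600_000, '1d': 86_400_000}        # tau in milliseconds
opportunities = [(0, 'big'), (1_800_000, 'small')]   # bids at t=0 and t=30min
print(sum_opportunities(3_600_000, opportunities, kernels))
```

Notice how the same opportunity contributes a much larger weight under the short kernel when it happened recently, mirroring the per-kernel columns in the results above.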

This feature can be used to write validation test cases for your data, or to build a real-time scoring service if used in conjunction with the `PREDICT()` function.

## The Union or Sampling Operator

The union or sampling operator injects new events into each timeline, only for the scope of the current query. The generation of these events is described by the `generate_events()` function, which takes one or more timestamps and, optionally, attribute expressions to set on the generated events.

The sampling operator is a powerful tool for building panel datasets that would be difficult to create with SQL. An example of a query that is hard in SQL but easy in TQL: take a daily sample, count the number of events on the timeline, and count the number of events in the previous day.

Below, events are generated at epochs 0ms and 1000ms, and the union of the project's events and those generated events is selected. A new type of conversion, `simulated_purchase`, has been injected into the selection, and it appears alongside the timeline's real conversions. Generated events are of type `generated_outcome` by default.

```python
select(
    'timeline.id',
    'type',
    'timestamp',
    'activity.activity_id',
    'activity.conversion'
).from_events('lethe4')\
    .union(generate_events('[0, 1000]', attribute_expressions={
        'activity.conversion': '"simulated_purchase"',
    }))\
    .limit(5)
```

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4
0 u252 generated_outcome 0 None simulated_purchase
1 u252 generated_outcome 1000 None simulated_purchase
2 u252 user 1609459200000 None None
3 u252 bid 1609545756000 None None
4 u252 bid 1609546601000 None None
query produced 5 rows x 5 columns in 0.41 seconds
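Conceptually, `union()` merges the generated events with the timeline's real events before the query runs, as in this sketch. The helper and event dicts are toys for illustration; the real operator does this server-side:

```python
def make_generated_events(timestamps, attributes):
    # sketch: build synthetic events of type 'generated_outcome'
    return [{'type': 'generated_outcome', 'timestamp': ts, **attributes}
            for ts in timestamps]

real_events = [
    {'type': 'user', 'timestamp': 1609459200000},
    {'type': 'bid', 'timestamp': 1609545756000},
]
generated = make_generated_events([0, 1000],
                                  {'activity.conversion': 'simulated_purchase'})

# the union, ordered by timestamp, is what the query then selects from
merged = sorted(real_events + generated, key=lambda e: e['timestamp'])
print([e['type'] for e in merged])
# ['generated_outcome', 'generated_outcome', 'user', 'bid']
```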


## User-Defined Functions

More complex functions can be built on top of the expression language; user-defined functions are a superset of the expression language. A simple example is provided below. For a more complete guide, refer to the [User-Defined Functions Guide](user-defined-functions).

```python
div2_func_src = '''
/**
 * given an integer, long, or floating point number, return the value / 2
 * @param x Number
 * @return Double
 */
function div2(x) {
  x / 2
}
'''

select(
    'div2(4)'
).from_events('lethe4')\
    .udf(div2_func_src)\
    .limit(1)
```

Query results:

partition "_default"
column_name result
0 _c0 2.0
query produced 1 rows x 1 columns in 0.39 seconds