Quick Start 

This tutorial provides a basic overview of TQL. In this guide we will first build Timelines from a set of online advertising auctions and a set of user site activity data, then write a TQL query that produces a dataset we can use to train a simple model that predicts probability of winning a bid based on ad size, and finally deploy that model to make batch and realtime predictions.

Table of Contents 

Contents

Quick Start

In order to follow along with this guide, first install TQL using the instructions provided on the TQL Homepage. Afterwords, we recommend reading the guide to creating timelines, the manual on TQL queries, and on how to use the TQL expression language. Additionally, for useful bakground on TQL’s design and data processing model, refer to the TQL Concepts and Architecture Guide.

Starting the TQL Backend 

After installing TQL, start the backend from the command line using tql start. Learn more about configuring the TQL backend here.

Importing TQL 

Always start your work by importing TQL into your namespace. If you did not previously start the backend from the command line, one will be automatically started upon import.

from zeenk.tql import *

----------------------------------------------------------------
           Version: 20.1.17-SNAPSHOT
 Version Timestamp: 2021-08-31 17:30:26
       Version Age: 3 days, 21 minutes, 5 seconds
   Filesystem Root: /Users/clin/.tql/files
 Working Directory: /Users/clin/.tql
Configuration File: /Users/clin/.tql/tql-conf.yml
       Api Gateway: http://localhost:9000
    Service Status: Icarus: ONLINE, Daedalus: ONLINE
    Service Uptime: 1 days, 6 hours, 6 minutes, 29 seconds
----------------------------------------------------------------

Creating Timelines 

The first step of any TQL project is to map time series data from your data lake into TQL. We will use online advertising log data generated with TQL’s data-simulator package. Every TimeSeries must have a source of data, either from flat files or from an SQL store, as well as a timestamp and id field, passed to identified_by(id_col, ts_col). Here we create two TimeSeries objects, from a source of advertising bid logs, and a source of site activity logs, some of which may have been caused by the ads. We then use the create_project() and from_timeseries() functions to attach them to a project named tql_quickstart, and the build_timelines() function to pull the raw data into timelines.

from zeenk.tql.demo_projects import get_project4_files_dir

# this directory contains the time series data files for demo project 4.
base_dir = get_project4_files_dir()

ad_auctions = create_timeseries('bid')
ad_auctions.from_files(f'{base_dir}/bid*.csv', has_csv_header=True)
ad_auctions.identified_by('user_id', 'event_time')

site_activity = create_timeseries('activity')
site_activity.from_files(f'{base_dir}/activity*.csv', has_csv_header=True)
site_activity.identified_by('user_id', 'event_time')

project = create_project('tql_quickstart', or_update=True)
project.from_timeseries(ad_auctions, site_activity)
project.build_timelines()

Click to Abort Job

NOTE: Abort link will open the result of the call in a new tab/window

SUCCESS:  230/230|██████████|, TOTAL ELAPSED=0:00:23, WAIT=0:00:00, PENDING=0:00:02, COMPILING=0:00:21 

Timeline Statistics
0	1
Total Timelines	993
Smallest Timeline	5 events
Largest Timeline	903 events
Min Timestamp	2021-01-01 00:00:01
Max Timestamp	2021-01-14 23:59:59
Total Events	197393
Activity Events	25416
Bid Events	171977

Attributes
TQL Expression	Description	Type
activity.activity_id	None	string
activity.conversion	None	string
activity.event_time	None	string
activity.user_id	None	string
bid.ad_size	None	string
bid.bid	None	double
bid.event_time	None	string
bid.ghosted	None	boolean
bid.request_id	None	string
bid.user_id	None	string
bid.won	None	boolean
duration	The duration of the event. Typically used with generated events.	long
event	The current event object	event
generated	This event was created by generate_events()	boolean
generated_weight	The weight to apply to the the sample	double
id	The auto-generated event id	long
index	The index of the current event in the timeline	integer
outcome	The outcome event. Only used in conjunction with SUM_OPPORTUNITIES() and CAUSAL_EFFFCT	event
timeline.events	Time-ordered array of events	array
timeline.id	Shared timeline id	string
timestamp	The timestamp of the event	long
type	The event type of this event	string

Introduction to Queries 

Creating and executing queries 

Queries are created by chaining together select(), from(), and various other operators, ending with a call to submit(), which returns a TQL ResultSet. For example:

resultset = select('bid.*').from_events('tql_quickstart').limit(5).submit()
resultset

Query results:

partition "_default"

	bid_ad_size	bid_bid	bid_event_time	bid_ghosted	bid_request_id	bid_user_id	bid_won
0	big	1.0	2021-01-02 00:02:36	false	r13260	u252	true
1	big	1.0	2021-01-02 00:16:41	false	r13376	u252	true
2	small	1.0	2021-01-02 00:30:55	false	r13509	u252	true
3	None	None	None	None	None	None	None
4	small	1.0	2021-01-02 01:04:01	false	r13737	u252	true

query produced 5 rows x 7 columns in 0.28 seconds

Interactive Query Execution in Jupyter 

In a Jupyter notebook environment, if a Query is the last line of code in a cell when it is evaluated, submit() will automatically be called, and the resulting dataset will be displayed as the output. In Jupyter, the above query/resultset code is identical to the following, simpler code:

select('bid.*').from_events('tql_quickstart').limit(5)

Query results:

partition "_default"

	bid_ad_size	bid_bid	bid_event_time	bid_ghosted	bid_request_id	bid_user_id	bid_won
0	big	1.0	2021-01-02 00:02:36	false	r13260	u252	true
1	big	1.0	2021-01-02 00:16:41	false	r13376	u252	true
2	small	1.0	2021-01-02 00:30:55	false	r13509	u252	true
3	None	None	None	None	None	None	None
4	small	1.0	2021-01-02 01:04:01	false	r13737	u252	true

query produced 5 rows x 7 columns in 0.12 seconds

The TQL Data Processing Model 

TQL ResultSets contain exactly one row per event, and exactly one column per TQL expression defined in the select() clause.

In the following query, the expression bid.ad_size will contain null values for some rows, because the timelines contain both bid and activity events. The type of the row can be retrieved with the special event attribute type.

select('type', 'bid.ad_size', 'activity.conversion').from_events('tql_quickstart').limit(5)

Query results:

partition "_default"

	_c0	_c1	_c2
0	bid	big	None
1	bid	big	None
2	bid	small	None
3	activity	None	product_view
4	bid	small	None

query produced 5 rows x 3 columns in 0.09 seconds

The TQL Expression Language 

The select(...) operator takes one or more TQL expression strings, corresponding to columns in the output ResultSet. TQL expressions are a written in a custom DSL (domain-specific language) that describe one-to-one translations from an event object to a feature or column value. The attributes of each event are selectable using timeseries_name + . + column_name, and can be further transformed with TQL expression functions. The TQL Expression Language contains a rich set math, string, and array manipulation functions to perform common “data-wrangling” operations.

For example, here we select the bid ad_size using the attribute bid.ad_size, then further manipulate it into a standardized feature:

select('COALESCE(UPPER(bid.ad_size), "UNKNOWN")').from_events('tql_quickstart').limit(5)

Query results:

partition "_default"

	_c0
0	BIG
1	BIG
2	SMALL
3	UNKNOWN
4	SMALL

query produced 5 rows x 1 columns in 0.08 seconds

Timeline Expressions 

Because of TQL’s unique data processing model, the entire context of the timeline is available to each event at extraction time, using the special attribute timeline. Each timeline has an id, accessible by timeline.id, as well as an array of events, accessible by timeline.events. Here is an example of some timeline expressions:

select(
  #the id of each timeline 
  'timeline.id', 
  #the type of the current event
  'type',
   #the type of the first event
  'timeline.events[0].type',
   #the type of the last event
  'timeline.events[-1].type',
  #the number of bids that preceeded this event 
  'SIZE(FILTER(timeline.events, (x) -> x.type == "bid" and x.timestamp < timestamp))',
   #the total number of purchase events on the timeline
  'SIZE(FILTER(timeline.events, (x) -> x.type == "activity" AND x.activity.conversion == "purchase"))',
)\
.from_events('tql_quickstart')\
.limit(5)

Query results:

partition "_default"

	_c0	_c1	_c2	_c3	_c4	_c5
0	u252	bid	bid	bid	0.0	5.0
1	u252	bid	bid	bid	1.0	5.0
2	u252	bid	bid	bid	2.0	5.0
3	u252	activity	bid	bid	3.0	5.0
4	u252	bid	bid	bid	3.0	5.0

query produced 5 rows x 6 columns in 0.16 seconds

For a comprehensive guide to TQL expressions, please refer to the expression language manual, or the expression language cheat sheat for quick reference.

Building a Machine Learning Dataset 

To build a machine learning dataset, we will use a few more pieces of the query API, including typed column operators label() and categorical(), the where() operator, and the partition_by() operator:

select(
  label('if(bid.won, 1, 0)', 'won_bid'),
  categorical('lower(bid.ad_size)', 'ad_size')
)\
.from_events('tql_quickstart')\
.where('type == "bid"')\
.partition_by('IF(random() < 0.2, "test", "train")')\
.options(expand_numerical_features=True, fill_na=True)\
.limit(10)

Query results:

partition "test"

	won_bid	ad_size
0	1.0	small
1	1.0	big
2	1.0	small

partition "train"

	won_bid	ad_size
0	1.0	big
1	1.0	big
2	1.0	small
3	1.0	big
4	1.0	big
5	1.0	big
6	1.0	big

query produced 10 rows x 2 columns in 0.13 seconds

Now that our query results looks good, we store the output of the query in a ResultSet using submit(). We will also remove the limit(10) to make sure we have enough samples to train a model with:

rs = select(
  label('if(bid.won, 1, 0)', 'won_bid'),
  categorical('lower(bid.ad_size)', 'ad_size')
)\
.from_events('tql_quickstart')\
.where('type == "bid"')\
.partition_by('IF(random() < 0.2, "test", "train")')\
.options(expand_numerical_features=True, fill_na=True)\
.submit()

Training A Machine Learning Model 

We now will pass our resultset directly into TQL’s model trainer. We provide a convienence wrapper for H2O, but drop-in support for other training packages can be made avaiable upon request.

from zeenk.tql.modeling import H2OEstimator

model = H2OEstimator('logistic', rs, 'glm').train()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.

H2O_cluster_uptime:	3 days 22 hours 59 mins
H2O_cluster_timezone:	America/Denver
H2O_data_parsing_timezone:	UTC
H2O_cluster_version:	3.32.1.5
H2O_cluster_version_age:	9 days
H2O_cluster_name:	H2O_from_python_zkozick_t4cxa3
H2O_cluster_total_nodes:	1
H2O_cluster_free_memory:	3.237 Gb
H2O_cluster_total_cores:	8
H2O_cluster_allowed_cores:	8
H2O_cluster_status:	locked, healthy
H2O_connection_url:	http://localhost:54321
H2O_connection_proxy:	{"http": null, "https": null}
H2O_internal_security:	False
H2O_API_Extensions:	Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version:	3.7.3 final

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%

Model Deployment 

In order to use our model to make either batch or online predictions, we first must deploy the model into TQL using .publish(), assigning it a unique model type, which we will use to reference it later in queries:

model.publish('my_first_model')

Making Predictions 

Batch Predictions 

We can now write a query that leverages this model using the TQL function PREDICT('model_type'):

select('PREDICT("my_first_model")').from_events('tql_quickstart').limit(5)

Query results:

partition "_default"

	_c0
0	0.7102528549392564
1	0.7102528549392564
2	0.7101808325686126
3	0.7101808325686126
4	0.7101808325686126

query produced 5 rows x 1 columns in 0.35 seconds

Real-Time Predictions 

Additionally, we can make new predictions against previously unseen events using TQL’s external_timelines() operator. With this operator, the TQL query will be executed only against the provided timeline, instead of against the full set of Project Timelines.

new_bid = {
  'type':'bid', 
  'timestamp': '1628095396',
  'bid': {
    'ad_size': 'BIG',
  }
}

timeline = {
  'id':'',
  'events': [new_bid]
}

select('PREDICT("my_first_model")')\
  .from_events('tql_quickstart')\
  .where('type=="bid"')\
  .external_timelines([timeline])\
  .submit()\
  .dataframe()[0][0]

'0.7102528549392564'

Prediction As A Service 

All TQL operations occur by making requests to a REST API, the gateway to TQL’s Java backend. This means by publishing, your model is already deployed to a real-time prediction service, which can be invoked through Python as shown above, or directly over HTTP if your client is written in another language:

select('PREDICT("my_first_model")')\
  .from_events('tql_quickstart')\
  .where('type=="bid"')\
  .external_timelines([timeline])\
  .json()

{'columns': [{'name': None,
   'expression': 'PREDICT("my_first_model")',
   'type': 'METADATA'}],
 'project_id': 104,
 'where_filters': ['type=="bid"'],
 'external_timelines': [{'id': '',
   'events': [{'type': 'bid',
     'timestamp': '1628095396',
     'bid': {'ad_size': 'BIG'}}]}]}

!curl\
  --header "Content-Type: application/json" \
  --request POST \
  --data '{"columns": [{"name": null, "expression": "PREDICT(\"my_first_model\")", "type": "METADATA"}], "project_id": 4, "where_filters": ["type==\"bid\""], "external_timelines": [{"id": "", "events": [{"type": "bid", "timestamp": "1628095396", "bid": {"ad_size": "tiny"}}]}]}'\
  http://localhost:9000/v1/query/submit 

{"status_code":200,"data":{"query":{"id":-1,"project_id":4,"timeline_id":-1,"query_udfs":[],"timelines":{"id":138,"project_id":4,"project_history_id":27,"compute_stats":true,"is_universal_timelines":true,"statistics":{"timeline_count":1000,"event_count":198393,"event_min_timestamp":"2021-01-01 00:00:00.000","event_max_timestamp":"2021-01-14 23:59:59.000","event_count_min":1,"event_count_max":904,"event_counts_by_type":{"activity":25416,"bid":171977,"user":1000},"time_series_min_max_timestamps":{"activity":{"event_name":"activity","event_min_timestamp":"2021-01-01 00:00:00.000","event_max_timestamp":"2021-01-14 23:58:20.000","column_stats":[],"event_count":0,"timeline_count":0},"bid":{"event_name":"bid","event_min_timestamp":"2021-01-01 00:00:00.000","event_max_timestamp":"2021-01-14 23:59:59.000","column_stats":[],"event_count":0,"timeline_count":0},"user":{"event_name":"user","event_min_timestamp":"2021-01-01 00:00:00.000","event_max_timestamp":"2021-01-01 00:00:00.000","column_stats":[],"event_count":0,"timeline_count":0}}},"dot_notations":["activity.activity_id","activity.conversion","activity.event_time","activity.user_id","bid.ad_size","bid.bid","bid.event_time","bid.ghosted","bid.request_id","bid.user_id","bid.won","duration","event","generated","generated_weight","id","index","outcome","timeline.events","timeline.id","timestamp","type","user.age","user.gender","user.group","user.region","user.timestamp","user.user_id"],"schema":{"fields":[{"name":"id","type":{"type":"string"},"nullable":true},{"name":"events","type":{"element_type":{"fields":[{"name":"timestamp","type":{"type":"long"},"nullable":true},{"name":"id","type":{"type":"long"},"nullable":true},{"name":"type","type":{"type":"string"},"nullable":true},{"name":"activity","type":{"fields":[{"name":"activity_id","type":{"type":"string"},"nullable":true},{"name":"event_time","type":{"type":"string"},"nullable":true},{"name":"user_id","type":{"type":"string"},"nullable":true},{"name":"conversion","type":{"type":"string"},"nullable":true}],"type":"schema"},"nullable":true},{"name":"bid","type":{"fields":[{"name":"request_id","type":{"type":"string"},"nullable":true},{"name":"event_time","type":{"type":"string"},"nullable":true},{"name":"user_id","type":{"type":"string"},"nullable":true},{"name":"ad_size","type":{"type":"string"},"nullable":true},{"name":"bid","type":{"type":"double"},"nullable":true},{"name":"ghosted","type":{"type":"boolean"},"nullable":true},{"name":"won","type":{"type":"boolean"},"nullable":true}],"type":"schema"},"nullable":true},{"name":"user","type":{"fields":[{"name":"user_id","type":{"type":"string"},"nullable":true},{"name":"gender","type":{"type":"string"},"nullable":true},{"name":"region","type":{"type":"string"},"nullable":true},{"name":"age","type":{"type":"string"},"nullable":true},{"name":"group","type":{"type":"string"},"nullable":true},{"name":"timestamp","type":{"type":"string"},"nullable":true}],"type":"schema"},"nullable":true},{"name":"generated_weight","type":{"type":"double"},"nullable":true},{"name":"generated","type":{"type":"boolean"},"nullable":true},{"name":"index","type":{"type":"long"},"nullable":true},{"name":"duration","type":{"type":"long"},"nullable":true}],"type":"schema"},"type":"array"},"nullable":true}],"type":"schema"},"universal_timeline_sample_data_location":"file:/Users/zkozick/.noumena/files/universal_timelines/2021/08/05/project_id=4/id=138/data_sample/","universal_timeline_data_location":"file:/Users/zkozick/.noumena/files/universal_timelines/2021/08/05/project_id=4/id=138/data/","state":"SUCCESS","retry_count":0,"created":"2021-08-04 17:00:14.000","updated":"2021-08-04 10:01:14.000","metadata":{}},"interactive":true,"timeline_sample_rate":1.0,"timeline_limit":-1,"row_limit":-1,"from_filters":[],"where_filters":["type==\"bid\""],"columns":[{"type":"METADATA","name":"_c0","expression":"PREDICT(\"my_first_model\")"}],"global_variables":{},"timeline_variables":{},"event_variables":{},"max_columns":-1,"global_min_total_count":-1,"apply_feature_filters":true,"apply_charset_filter":true,"drop_empty_rows":false,"expand_numerical_features":false,"drop_numerical_zero_features":false,"drop_constant_feature_columns":false,"throw_expression_errors":false,"debug_expressions":false,"allow_invalid_column_expressions":false,"external_timelines":[{"id":"","events":[{"type":"bid","timestamp":"1628095396","bid":{"ad_size":"tiny"}}]}],"output_format":"PARQUET","published_model_context":{"published_models_table":{"328":{"model_id":328,"project_id":4,"model_type":"my_first_model","enabled":true,"columns":[{"type":"LABEL","name":"won_bid","expression":"if(bid.won, 1, 0)"},{"type":"CATEGORICAL","name":"ad_size","expression":"lower(bid.ad_size)"}],"model_estimates":[{"engine":"H2O","artifact_path":"/Users/zkozick/.noumena/files/model_artifacts/artifact_986adb49-cef1-468f-bac5-fe4fd7b11f4d.nml","test_cases":[]}],"constant_enabled":true,"keep_flags":[],"ignore_flags":[],"is_universal_timelines":true,"created":"2021-08-13 05:54:01.000","updated":"2021-08-13 05:54:01.000"},"my_first_model":{"model_id":328,"project_id":4,"model_type":"my_first_model","enabled":true,"columns":[{"type":"LABEL","name":"won_bid","expression":"if(bid.won, 1, 0)"},{"type":"CATEGORICAL","name":"ad_size","expression":"lower(bid.ad_size)"}],"model_estimates":[{"engine":"H2O","artifact_path":"/Users/zkozick/.noumena/files/model_artifacts/artifact_986adb49-cef1-468f-bac5-fe4fd7b11f4d.nml","test_cases":[]}],"constant_enabled":true,"keep_flags":[],"ignore_flags":[],"is_universal_timelines":true,"created":"2021-08-13 05:54:01.000","updated":"2021-08-13 05:54:01.000"}},"published_models":[{"model_id":328,"project_id":4,"model_type":"my_first_model","enabled":true,"columns":[{"type":"LABEL","name":"won_bid","expression":"if(bid.won, 1, 0)"},{"type":"CATEGORICAL","name":"ad_size","expression":"lower(bid.ad_size)"}],"model_estimates":[{"engine":"H2O","artifact_path":"/Users/zkozick/.noumena/files/model_artifacts/artifact_986adb49-cef1-468f-bac5-fe4fd7b11f4d.nml","test_cases":[]}],"constant_enabled":true,"keep_flags":[],"ignore_flags":[],"is_universal_timelines":true,"created":"2021-08-13 05:54:01.000","updated":"2021-08-13 05:54:01.000"}]},"published_models":[],"numerical_feature_precision":16,"numerical_feature_epsilon":1.0E-9,"metadata":{},"fill_na":true},"resultset":{"columns":[{"type":"METADATA","name":"_c0","expression":"PREDICT(\"my_first_model\")"}],"id":-1,"result_set_metrics":{"total_example_count":1,"column_metrics":[{"categorical_cardinality":0,"name":"_c0","expression":"PREDICT(\"my_first_model\")","debug_info":[],"column_value_metrics":{},"expression_timing_stats":{"min":674,"max":674,"mean":674,"total":674,"p50":674.0,"p75":674.0,"p95":674.0,"p99":674.0,"unit":"micros"}}],"dropped_empty_row_count":0,"event_runtime":{"min":681,"max":681,"mean":681,"total":681,"p50":681.0,"p75":681.0,"p95":681.0,"p99":681.0,"unit":"micros"},"timeline_runtime":{"min":6736,"max":6736,"mean":6736,"total":6736,"p50":6736.0,"p75":6736.0,"p95":6736.0,"p99":6736.0,"unit":"micros"},"query_runtime_micros":10055,"outcome_generated":0,"timelines_processed":1,"real_events_processed":1,"event_passed_filters":1,"opportunity_event_count":0,"null_label_evaluation_count":0,"null_weight_evaluation_count":0,"events_processed":1,"column_expanded":false,"dropped_expanded_columns":{},"dropped_column_count":0,"dropped_empty_rows_after_conversion":0,"partition_metrics":{"_default":{"pos_row_count":0,"total_row_count":1}},"warnings":[],"ml_dataset_id":0,"positive_example_count":0,"negative_example_count":1,"empty_example_count":1,"label_sum":0.0,"weighted_label_sum":0.0},"partitions":[{"name":"_default","rows":[["0.7097314931031173"]],"row_count":1,"positive_row_count":0}]},"state":"SUCCESS"},"meta":{"service_name":"icarus-webservice","deployment_level":"standalone","version":"20.1.17-SNAPSHOT"},"data_count":1}

What Next?

Congratulations! You have now built your first TQL application.

Learn more about getting your data into TQL with the Building Timelines Guide.
For an in-depth view of the query API, head over to the Query Manual.
Learn more about writing TQL expressions with the TQL Expression Language Manual.
Understand the background and design of TQL, check out the Core Concepts And Architecture Guide.
Scale TQL to big data applications with the TQL Scaling Guide.