# The TQL Expression Language The expression language is the core of TQL's data processing engine. Expressions are written as Python strings and express a one-to-one transformation between an input timeline event object and a piece of output data. These transformations form the building blocks of all TQL data processing queries. The language has strong syntactic similarities to SQL (basic functions, record-level structure) and Python (variables, list operations/aggregations, lambda/inline functions, user-defined functions). However, the expression language has several advantages: - **Performant**: TQL Expressions are compiled into Java byte code and hence, are very fast. - **Lazily Evaluated**: TQL Expressions are applied directly to raw timelines and events at query time, meaning all datasets are dynamically extracted from Timelines, eliminating the need for hard-to-maintain ETL pipelines. - **Built for ML Model Evaluation**: Trained ML Models deployed using TQL also store all the expressions expressions used to transform raw events into the training dataset, meaning feature extraction at scoring time is already built into the model. TQL Expressions only exist in the context of a project and query. For this guide, we will the demo project **lethe4** to learn the language. Lethe represents a typical online advertising funnel, with a set of online ad auctions `bid`, a set of website activity logs `activity`, and static user demographics `user`. Each Timeline represents the total of all observed events for a particular user. For this guide we will be manipulating data primarily from the `bid` TimeSeries. Further reference: **Expression language cheat sheet**, and the full [**expression function reference**](functions). ## Literals TQL accepts string, boolean, integer, decimal, array, and null literals. For example: ```python from zeenk.tql import * select( '"Hello World!"', 1, 1.3, True, "[1,2,3,4]", None ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2	_c3	_c4	_c5
0	Hello World!	1	1.3	true	[1,2,3,4]	None
1	Hello World!	1	1.3	true	[1,2,3,4]	None
2	Hello World!	1	1.3	true	[1,2,3,4]	None

query produced 3 rows x 6 columns in 0.69 seconds

## Event Attributes In TQL expressions, attributes of timeline events are addressable as variables. For example, the lethe4 Project is composed of the `bid` TimeSeries, which contains the attributes `ad_size`, `bid`, `event_time`, `ghosted`, `request_id`, `user_id`, and `won`. These attributes are directly selectable using the expression language as variables, for example: ```python select( 'bid.ad_size', 'bid.bid', 'bid.ghosted', 'bid.request_id', 'bid.user_id', 'bid.won' ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2	_c3	_c4	_c5
0	None	None	None	None	None	None
1	big	1.0	false	r13260	u252	true
2	big	1.0	false	r13376	u252	true

query produced 3 rows x 6 columns in 0.36 seconds

## Booleans and Logical Operators The expression language contains the case-insensitive primitives `true` and `false`. Non-null, non-empty, non-zero values are also considered "truthy" and will be intepreted as true in a boolean context, such as in a where clause, The expression language has the native boolean operators `AND` and `OR` to express compound conditionals. These may be used in isolation, or in conjunction with the `IF()` function to emit conditional values in columns. ```python select( 'true', #primitives of true 'FaLsE', #case insensitive primitives '1 OR "non-empty string!"', # "truthy" values with compound conditionals '0 AND ""', # "falsey" values 'IF(bid.ad_size == "big", "HUGE", "default")' #the IF function ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2	_c3	_c4
0	true	false	true	false	default
1	true	false	true	false	HUGE
2	true	false	true	false	HUGE

query produced 3 rows x 5 columns in 0.63 seconds

## TQL Functions TQL provides a rich set of functions for transforming event attributes into pieces of derived data, which will be covered in the following sections. ## Function Reference The expression function reference can be found [here](functions). Additionally, you can look up expression function usage directly in your Python code using the `helper` object. `helper.find_function('search_term')` lists functions matching the given search term: ```python helper.find_function('LOG') ``` ['LOG', 'LOG10', 'LOG1P', 'LOG2', 'NATLOG'] `heper.function_usage('function_name')` displays the usage of any function: ```python helper.function_usage('LOG') ```

Usage:	LOG(number, base)
description	Return the logarithm of a {number} for a given {base}. Any null, a base of 0, or a base of 1 will return null.
parameters	number: Double base: Double
return_type	Double
function_type	native

## Mathematical Operators The Expression Language contains a basic set of mathematical operators and functions to cover basic numerical transformations. ```python select( '1 * (3 + 4) / 2 % 2', 'EXP(1)', 'LOG(23, 2)', 'PI()', 'SIN(45)', ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2	_c3	_c4
0	1.5	2.718281828459045	4.523561956057013	3.141592653589793	0.8509035245341184
1	1.5	2.718281828459045	4.523561956057013	3.141592653589793	0.8509035245341184
2	1.5	2.718281828459045	4.523561956057013	3.141592653589793	0.8509035245341184

query produced 3 rows x 5 columns in 0.71 seconds

## Probabilibity Distributions ```python select( #computes the probability density using the specified, distribution, X value, and scale 'PDF("exp", 1, 1.0)', #computes the cumulative distribution using the specified, distribution, X value, and scale 'CDF("exp", 1, 1.0)', #computes the cumulative distribution using the specified, distribution, X value, and scale 'iCDF("exp", .95, 1.0)', ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2
0	0.36787944117144233	0.6321205588285577	2.99573227355399
1	0.36787944117144233	0.6321205588285577	2.99573227355399
2	0.36787944117144233	0.6321205588285577	2.99573227355399

query produced 3 rows x 3 columns in 0.68 seconds

## Generating Random Numbers ```python select( #generate a random number 'RANDOM()', #generate a random number from the given distribution 'SAMPLE("exp")', #create a random number generator from a known seed that can be reused multiple times in an expression. 'rng = RNG("my_seed"); [rng.next, rng.next]', ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1
0	[0.028381221364210618,0.2602691934427571]	1.5990840325478177
1	[0.028381221364210618,0.2602691934427571]	0.23555180093113778
2	[0.028381221364210618,0.2602691934427571]	3.4662128831513432

query produced 3 rows x 2 columns in 0.31 seconds

## Hashing ```python select( 'HASH("foo")',#hash the input to a number using java hashcode 'MD5("foo")', #compute the string MD5 sum of the input 'MD5_LONG("foo")',#compute the long MD5 sum of the input ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2
0	101574	ACBD18DB4CC2F85CEDEF654FCCC4A4D8	1301710373874522920
1	101574	ACBD18DB4CC2F85CEDEF654FCCC4A4D8	1301710373874522920
2	101574	ACBD18DB4CC2F85CEDEF654FCCC4A4D8	1301710373874522920

query produced 3 rows x 3 columns in 0.69 seconds

## String Manipulation The Expression Language provides the ability to perform string manipulation operations commonly needed for transforming categorical features: `CONCAT()`, `UPPER()`, `LOWER()`, `COALESCE()`, `STRING_SPLIT()`, `REGEX_EXTRACT()`, `REGEX_REPLACE()` ```python select( 'bid.ad_size', #base attribute of ad size 'UPPER(bid.ad_size)', #upper case 'lower(bid.ad_size)', #lower case 'COALESCE(bid.ad_size, "UNKNOWN")', #coalesce nulls 'STRING_SPLIT(bid.ad_size, "i")', #split strings into arrays 'REGEX_EXTRACT(bid.ad_size, "[a-zA-Z0-9]", 0)', #extract a regex group 'REGEX_REPLACE(bid.ad_size, ["b", "_", "i", "xx"])', #replace characters in a string 'TO_STRING(1.3)', #cast anything to a string 'bid.ad_size == "BIG"' #string equality ).from_events('lethe4').limit(3) ``` ## Date/Time Manipulation TODO`NOW()`, `DATE()`, `TO_DATETIME()`, `DAY_OF_WEEK()`, `MILLIS()`, `FROM_UNIXTIME()` ```python select( 'NOW()', #returns the current time in milliseconds 'DATE(NOW())', #parses the input into a date string 'TO_DATETIME("2021-03-21 12:51:00")', #parses the input string or number into a datetime object 'DAY_OF_WEEK(TO_DATETIME("2021-03-21"))', #emits the day of the week of the input in the range of [1,7] 'MILLIS(TO_DATETIME("2021-03-21"))', #emits the epoch milliseconds of the input 'FROM_UNIXTIME(NOW(), "yyyy/MM/dd")' #formats the input epoch timestamp using Joda date format. ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2	_c3	_c4	_c5
0	1630353837028	2021-08-30	2021-03-21T12:51:00.000Z	7	1616284800000	2021/08/30
1	1630353837028	2021-08-30	2021-03-21T12:51:00.000Z	7	1616284800000	2021/08/30
2	1630353837028	2021-08-30	2021-03-21T12:51:00.000Z	7	1616284800000	2021/08/30

query produced 3 rows x 6 columns in 0.36 seconds

## Array Manipulation Arrays first class citizens of TQL. Arrays can be defined using bracket notation `[obj1, obj2, ...]` individual data from an array can be addressed using brackets array[idx]. ```python select( 'RANGE(0, 5)', #generate an array of numbers in the specified range '[1,2,3,4][3]' #get the nth element of the array 'SORT([6, 1, 234, 2])', #sort the array 'SIZE([3, 4, 58])', #find the size of the array 'MIN([3, 4, 58])', #find the min value of the array 'MAX([3, 4, 58])', #find the max value of the array 'AVG([3, 4, 58])', #find the average value of the array 'STDEV([3, 4, 58])', #take the standard deviation of the array 'MAP([1, 2, 3, 4], (x) -> x*2)', #transform an array to another array 'FILTER([1, 2, 3, 4], (x) -> x > 3)', #filter an array to a smaller array 'REDUCE(["a", "b", "c"], (x,y) -> CONCAT(x, y))', #combine an array to a single value ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1	_c2	_c3	_c4	_c5	_c6	_c7	_c8	_c9	_c10
0	[2,4,6,8]	abc	[4]	[0,1,2,3,4]	[1,2,6,234]	3.0	3	58	21.666666666666668	25.69478978746902	4
1	[2,4,6,8]	abc	[4]	[0,1,2,3,4]	[1,2,6,234]	3.0	3	58	21.666666666666668	25.69478978746902	4
2	[2,4,6,8]	abc	[4]	[0,1,2,3,4]	[1,2,6,234]	3.0	3	58	21.666666666666668	25.69478978746902	4

query produced 3 rows x 11 columns in 0.62 seconds

## Array Manipulation of Timeline Events special variable `timeline.events` is also an array, and is commonly transformed using the array manipulation functions. For example, we can extract the number of small ad size bids we made for each timeline: ```python select( 'SIZE(FILTER(timeline.events, (x) -> x.type == "bid" and x.bid.ad_size == "small"))', ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0
0	221.0
1	221.0
2	221.0

query produced 3 rows x 1 columns in 0.37 seconds

## Local Variables and Multi-Statement Expressions Long transformations can be broken up into multiple statements, delimited by a `;` for readability. For example: ```python select(''' positive_value=1.0; negative_value=-1.0; ad_size=COALESCE(UPPER(bid.ad_size), 'UNKNOWN'); if(bid.ad_size == 'BIG', positive_value, negative_value) ''').from_events('lethe4').limit(3) #In multi-statement expressions, the last statement in the expression must return a value. ```

Query results:

partition "_default"

	_c0
0	-1.0
1	-1.0
2	-1.0

query produced 3 rows x 1 columns in 0.29 seconds

## The PREDICT Function Models are first-class citizens of TQL, and are accessible via the native function `PREDICT()`. Models are published into TQL with the column expressions used to produce the features, and because TQL expressions are applied to timeline events only when a query is executed, TQL can create result sets with model predictions as well as other attributes or metrics. For example: ```python select( 'bid.ad_size', 'PREDICT("my_model")' ).from_events('lethe4').limit(3) ``` Since the output of `PREDICT()` is simply a floating point number, it can be further manipulated using the expression language. Multiple models can even be composed together to compute derived values. For example, perhaps you would like to join together your predictions of winrate and conversion rate: ```python select( "PREDICT('winrate_model') * PREDICT('coversion_rate_model')" ).from_events('lethe4').limit(3) ``` Models must be previously published into TQL. Refer to [Model Training and Publishing](modeling) guide for more information on this topic. ## The SET_PROPERTY Function The native function `SET_PROPERTY()`can be used to temporarily change an attribute of the input event for the rest of the execution of the current expression. For example: ```python select( 'bid.ad_size', 'SET_PROPERTY("bid.ad_size", "tiny"); bid.ad_size' ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1
0	None	tiny
1	big	tiny
2	big	tiny

query produced 3 rows x 2 columns in 0.28 seconds

This functionality is powerful in conjunction with the [Model Predict Function](#Model-Predict-Function), allowing you to easily write queries that make counterfactual predictions. For example: ```python select( 'bid.ad_size', 'SET_PROPERTY("bid.ad_size", "tiny"); bid.ad_size' 'PREDICT("my_winrate_model")', 'SET_PROPERTY("bid.ad_size", "tiny"); PREDICT("my_winrate_model")' ).from_events('lethe4').limit(3) ``` ## Comments Expressions may contain comments. Both inline and multi-line comments are supported: ```python select(''' ad_size = bid.ad_size; //this is a comment /* this is a comment... ...on multiple lines */ ad_size ''').from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0
0	None
1	big
2	big

query produced 3 rows x 1 columns in 0.3 seconds

## User Defined Functions (UDFs) In addition to TQL's library of built-in functions, the user can write additional functions to extend the language's capabilities with custom behavior. Functions are defined with the syntax `function(...){ ... }`. A function may also provide an optional doc string preceding it. For example: ```python select( 'clean_ad_size()' ).from_events('lethe4').udf(''' function clean_ad_size() { COALESCE(UPPER(bid.ad_size), 'UNKNOWN') } ''').limit(3) ```

Query results:

partition "_default"

	_c0
0	UNKNOWN
1	BIG
2	BIG

query produced 3 rows x 1 columns in 0.27 seconds

Just like regular expressions, UDFs may contain multiple statements, delimited by a `;`, in which case the last statement must return a value. UDFs may also define input arguments. For example: ```python select( 'circumference(3)' ).from_events('lethe4').udf(''' function circumference(r) { diameter = 2 * r; circ = diameter * PI(); circ } ''').limit(3) ```

Query results:

partition "_default"

	_c0
0	18.84955592153876
1	18.84955592153876
2	18.84955592153876

query produced 3 rows x 1 columns in 0.26 seconds

Read full guide to writing user-defined functions [here](user-defined-functions). ## Compile and Runtime Error Handling If an expression fails to compile, an error will be shown when trying to submit the query. ```python #For example, 'blah' has no meaning here: select('blah').from_events('lethe4') ``` Encountered 1 expression compilation error: Line 1:0 Unknown variable '@blah' (in column _c0) Expression error: Line 1:0 Unknown variable '@blah' (in column _c0) blah ^ --------------------------------------------------------------------------- TQLAnalysisException Traceback (most recent call last) ~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj) 343 method = get_real_method(obj, self.print_method) 344 if method is not None: --> 345 return method() 346 return None 347 else: ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self) 893 894 def _repr_html_(self): --> 895 return self.submit(interactive=self._interactive)._repr_html_() 896 897 def __iter__(self): ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark) 779 from noumena.tql import icarus 780 start = time.time() --> 781 submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads) 782 if interactive: 783 if self.spec.get('allow_invalid_column_expressions') is None: ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json) 14 print(json.dumps(json_data)) 15 resp = requests.post(url, data=data, json=json_data, files=files_data) ---> 16 return _handle_response(url, resp, print_json=print_json) 17 18 ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json) 70 else: 71 print(reason) ---> 72 raise TQLAnalysisException(reason) 73 else: 74 msg = '' TQLAnalysisException: Encountered 1 expression compilation error: Line 1:0 Unknown variable '@blah' (in column _c0)Expression error: Line 1:0 Unknown variable '@blah' (in column _c0) blah ^ Query ```python #In this example, the `LOG` function requires two operands, the input number and the log base to use: select('log(3)').from_events('lethe4') ``` Encountered 1 expression compilation error: Line 1:0 LOG function needs 2 operands (in column _c0) Expression error: Line 1:0 LOG function needs 2 operands (in column _c0) log(3) ^ --------------------------------------------------------------------------- TQLAnalysisException Traceback (most recent call last) ~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj) 343 method = get_real_method(obj, self.print_method) 344 if method is not None: --> 345 return method() 346 return None 347 else: ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self) 893 894 def _repr_html_(self): --> 895 return self.submit(interactive=self._interactive)._repr_html_() 896 897 def __iter__(self): ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark) 779 from noumena.tql import icarus 780 start = time.time() --> 781 submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads) 782 if interactive: 783 if self.spec.get('allow_invalid_column_expressions') is None: ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json) 14 print(json.dumps(json_data)) 15 resp = requests.post(url, data=data, json=json_data, files=files_data) ---> 16 return _handle_response(url, resp, print_json=print_json) 17 18 ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json) 70 else: 71 print(reason) ---> 72 raise TQLAnalysisException(reason) 73 else: 74 msg = '' TQLAnalysisException: Encountered 1 expression compilation error: Line 1:0 LOG function needs 2 operands (in column _c0)Expression error: Line 1:0 LOG function needs 2 operands (in column _c0) log(3) ^ Query ```python #In this example, we simply forgot a closing parenthesis: select('log(3').from_events('lethe4') ``` Encountered 1 expression compilation error: Line 1:5 missing ')' at '' (in column _c0) Expression error: Line 1:5 missing ')' at '' (in column _c0) log(3 -----^ --------------------------------------------------------------------------- TQLAnalysisException Traceback (most recent call last) ~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj) 343 method = get_real_method(obj, self.print_method) 344 if method is not None: --> 345 return method() 346 return None 347 else: ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self) 893 894 def _repr_html_(self): --> 895 return self.submit(interactive=self._interactive)._repr_html_() 896 897 def __iter__(self): ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark) 779 from noumena.tql import icarus 780 start = time.time() --> 781 submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads) 782 if interactive: 783 if self.spec.get('allow_invalid_column_expressions') is None: ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json) 14 print(json.dumps(json_data)) 15 resp = requests.post(url, data=data, json=json_data, files=files_data) ---> 16 return _handle_response(url, resp, print_json=print_json) 17 18 ~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json) 70 else: 71 print(reason) ---> 72 raise TQLAnalysisException(reason) 73 else: 74 msg = '' TQLAnalysisException: Encountered 1 expression compilation error: Line 1:5 missing ')' at '' (in column _c0)Expression error: Line 1:5 missing ')' at '' (in column _c0) log(3 -----^ Query However, if an expression compiles successfully but fails during evaluation, then the result for that row will be null. In this example, `upper(bid.ad_size)` returns null because some events rows do not have ad size defined at all. ```python select( 'bid.ad_size', 'upper(bid.ad_size)' ).from_events('lethe4').limit(3) ```

Query results:

partition "_default"

	_c0	_c1
0	None	None
1	big	BIG
2	big	BIG

query produced 3 rows x 2 columns in 0.29 seconds

## Expression Language Cheat Sheet ![](_static/expression_cheat_sheet.png) ## Where To Next? Browse the full [**expression function reference**](functions).