The TQL Expression Language

The expression language is the core of TQL’s data processing engine. Expressions are written as Python strings and express a one-to-one transformation between an input timeline event object and a piece of output data. These transformations form the building blocks of all TQL data processing queries.

The language has strong syntactic similarities to SQL (basic functions, record-level structure) and Python (variables, list operations/aggregations, lambda/inline functions, user-defined functions). However, the expression language has several advantages:

  • Performant: TQL Expressions are compiled into Java byte code and hence, are very fast.

  • Lazily Evaluated: TQL Expressions are applied directly to raw timelines and events at query time, meaning all datasets are dynamically extracted from Timelines, eliminating the need for hard-to-maintain ETL pipelines.

  • Built for ML Model Evaluation: Trained ML Models deployed using TQL also store all the expressions expressions used to transform raw events into the training dataset, meaning feature extraction at scoring time is already built into the model.

TQL Expressions only exist in the context of a project and query. For this guide, we will the demo project lethe4 to learn the language. Lethe represents a typical online advertising funnel, with a set of online ad auctions bid, a set of website activity logs activity, and static user demographics user. Each Timeline represents the total of all observed events for a particular user. For this guide we will be manipulating data primarily from the bid TimeSeries.

Further reference: Expression language cheat sheet, and the full expression function reference.

Literals

TQL accepts string, boolean, integer, decimal, array, and null literals. For example:

from zeenk.tql import *

select(
	'"Hello World!"', 
	1, 
	1.3, 
	True,
	"[1,2,3,4]",
	None
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5
0 Hello World! 1 1.3 true [1,2,3,4] None
1 Hello World! 1 1.3 true [1,2,3,4] None
2 Hello World! 1 1.3 true [1,2,3,4] None
query produced 3 rows x 6 columns in 0.69 seconds

Event Attributes

In TQL expressions, attributes of timeline events are addressable as variables. For example, the lethe4 Project is composed of the bid TimeSeries, which contains the attributes ad_size, bid, event_time, ghosted, request_id, user_id, and won. These attributes are directly selectable using the expression language as variables, for example:

select(
	'bid.ad_size', 
	'bid.bid', 
	'bid.ghosted', 
	'bid.request_id', 
	'bid.user_id', 
	'bid.won'
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5
0 None None None None None None
1 big 1.0 false r13260 u252 true
2 big 1.0 false r13376 u252 true
query produced 3 rows x 6 columns in 0.36 seconds

Booleans and Logical Operators

The expression language contains the case-insensitive primitives true and false. Non-null, non-empty, non-zero values are also considered “truthy” and will be intepreted as true in a boolean context, such as in a where clause, The expression language has the native boolean operators AND and OR to express compound conditionals. These may be used in isolation, or in conjunction with the IF() function to emit conditional values in columns.

select(
	'true', #primitives of true
	'FaLsE', #case insensitive primitives
	'1 OR "non-empty string!"', # "truthy" values with compound conditionals
	'0 AND ""', # "falsey" values
	'IF(bid.ad_size == "big", "HUGE", "default")' #the IF function 
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4
0 true false true false default
1 true false true false HUGE
2 true false true false HUGE
query produced 3 rows x 5 columns in 0.63 seconds

TQL Functions

TQL provides a rich set of functions for transforming event attributes into pieces of derived data, which will be covered in the following sections.

Function Reference

The expression function reference can be found here. Additionally, you can look up expression function usage directly in your Python code using the helper object.

helper.find_function('search_term') lists functions matching the given search term:

helper.find_function('LOG')
['LOG', 'LOG10', 'LOG1P', 'LOG2', 'NATLOG']

heper.function_usage('function_name') displays the usage of any function:

helper.function_usage('LOG')
Usage: LOG(number, base)
description Return the logarithm of a {number} for a given {base}. Any null, a base of 0, or a base of 1 will return null.
parameters number: Double base: Double
return_type Double
function_type native

Mathematical Operators

The Expression Language contains a basic set of mathematical operators and functions to cover basic numerical transformations.

select(
  '1 * (3 + 4) / 2 % 2',
  'EXP(1)',
  'LOG(23, 2)',
  'PI()', 
  'SIN(45)',
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4
0 1.5 2.718281828459045 4.523561956057013 3.141592653589793 0.8509035245341184
1 1.5 2.718281828459045 4.523561956057013 3.141592653589793 0.8509035245341184
2 1.5 2.718281828459045 4.523561956057013 3.141592653589793 0.8509035245341184
query produced 3 rows x 5 columns in 0.71 seconds

Probabilibity Distributions

select(
    #computes the probability density using the specified, distribution, X value, and scale
    'PDF("exp", 1, 1.0)',
    #computes the cumulative distribution using the specified, distribution, X value, and scale
    'CDF("exp", 1, 1.0)',
    #computes the cumulative distribution using the specified, distribution, X value, and scale
    'iCDF("exp", .95, 1.0)',
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2
0 0.36787944117144233 0.6321205588285577 2.99573227355399
1 0.36787944117144233 0.6321205588285577 2.99573227355399
2 0.36787944117144233 0.6321205588285577 2.99573227355399
query produced 3 rows x 3 columns in 0.68 seconds

Generating Random Numbers

select(
    #generate a random number
    'RANDOM()',
    #generate a random number from the given distribution
    'SAMPLE("exp")',
    #create a random number generator from a known seed that can be reused multiple times in an expression.
    'rng = RNG("my_seed"); [rng.next, rng.next]',
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1
0 [0.028381221364210618,0.2602691934427571] 1.5990840325478177
1 [0.028381221364210618,0.2602691934427571] 0.23555180093113778
2 [0.028381221364210618,0.2602691934427571] 3.4662128831513432
query produced 3 rows x 2 columns in 0.31 seconds

Hashing

select(
    'HASH("foo")',#hash the input to a number using java hashcode
    'MD5("foo")', #compute the string MD5 sum of the input
    'MD5_LONG("foo")',#compute the long MD5 sum of the input
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2
0 101574 ACBD18DB4CC2F85CEDEF654FCCC4A4D8 1301710373874522920
1 101574 ACBD18DB4CC2F85CEDEF654FCCC4A4D8 1301710373874522920
2 101574 ACBD18DB4CC2F85CEDEF654FCCC4A4D8 1301710373874522920
query produced 3 rows x 3 columns in 0.69 seconds

String Manipulation

The Expression Language provides the ability to perform string manipulation operations commonly needed for transforming categorical features:

CONCAT(), UPPER(), LOWER(), COALESCE(), STRING_SPLIT(), REGEX_EXTRACT(), REGEX_REPLACE()

select(
	'bid.ad_size', #base attribute of ad size
	'UPPER(bid.ad_size)', #upper case
	'lower(bid.ad_size)', #lower case
	'COALESCE(bid.ad_size, "UNKNOWN")', #coalesce nulls
	'STRING_SPLIT(bid.ad_size, "i")', #split strings into arrays
	'REGEX_EXTRACT(bid.ad_size, "[a-zA-Z0-9]", 0)', #extract a regex group
	'REGEX_REPLACE(bid.ad_size, ["b", "_", "i", "xx"])', #replace characters in a string
	'TO_STRING(1.3)', #cast anything to a string
	'bid.ad_size == "BIG"' #string equality
).from_events('lethe4').limit(3)

Date/Time Manipulation

TODONOW(), DATE(), TO_DATETIME(), DAY_OF_WEEK(), MILLIS(), FROM_UNIXTIME()

select(
    'NOW()', #returns the current time in milliseconds
    'DATE(NOW())', #parses the input into a date string
    'TO_DATETIME("2021-03-21 12:51:00")', #parses the input string or number into a datetime object
    'DAY_OF_WEEK(TO_DATETIME("2021-03-21"))', #emits the day of the week of the input in the range of [1,7]
    'MILLIS(TO_DATETIME("2021-03-21"))', #emits the epoch milliseconds of the input
    'FROM_UNIXTIME(NOW(), "yyyy/MM/dd")' #formats the input epoch timestamp using Joda date format.
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5
0 1630353837028 2021-08-30 2021-03-21T12:51:00.000Z 7 1616284800000 2021/08/30
1 1630353837028 2021-08-30 2021-03-21T12:51:00.000Z 7 1616284800000 2021/08/30
2 1630353837028 2021-08-30 2021-03-21T12:51:00.000Z 7 1616284800000 2021/08/30
query produced 3 rows x 6 columns in 0.36 seconds

Array Manipulation

Arrays first class citizens of TQL. Arrays can be defined using bracket notation [obj1, obj2, ...] individual data from an array can be addressed using brackets array[idx].

select(
    'RANGE(0, 5)', #generate an array of numbers in the specified range
    '[1,2,3,4][3]' #get the nth element of the array
    'SORT([6, 1, 234, 2])', #sort the array
    'SIZE([3, 4, 58])', #find the size of the array
    'MIN([3, 4, 58])', #find the min value of the array
    'MAX([3, 4, 58])', #find the max value of the array
    'AVG([3, 4, 58])', #find the average value of the array
    'STDEV([3, 4, 58])', #take the standard deviation of the array
    'MAP([1, 2, 3, 4], (x) -> x*2)', #transform an array to another array
    'FILTER([1, 2, 3, 4], (x) -> x > 3)', #filter an array to a smaller array
    'REDUCE(["a", "b", "c"], (x,y) -> CONCAT(x, y))', #combine an array to a single value
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1 _c2 _c3 _c4 _c5 _c6 _c7 _c8 _c9 _c10
0 [2,4,6,8] abc [4] [0,1,2,3,4] [1,2,6,234] 3.0 3 58 21.666666666666668 25.69478978746902 4
1 [2,4,6,8] abc [4] [0,1,2,3,4] [1,2,6,234] 3.0 3 58 21.666666666666668 25.69478978746902 4
2 [2,4,6,8] abc [4] [0,1,2,3,4] [1,2,6,234] 3.0 3 58 21.666666666666668 25.69478978746902 4
query produced 3 rows x 11 columns in 0.62 seconds

Array Manipulation of Timeline Events

special variable timeline.events is also an array, and is commonly transformed using the array manipulation functions. For example, we can extract the number of small ad size bids we made for each timeline:

select(
    'SIZE(FILTER(timeline.events, (x) -> x.type == "bid" and x.bid.ad_size == "small"))',
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0
0 221.0
1 221.0
2 221.0
query produced 3 rows x 1 columns in 0.37 seconds

Local Variables and Multi-Statement Expressions

Long transformations can be broken up into multiple statements, delimited by a ; for readability. For example:

select('''
positive_value=1.0;
negative_value=-1.0;
ad_size=COALESCE(UPPER(bid.ad_size), 'UNKNOWN');
if(bid.ad_size == 'BIG', positive_value, negative_value)
''').from_events('lethe4').limit(3)
#In multi-statement expressions, the last statement in the expression must return a value.

Query results:

partition "_default"
_c0
0 -1.0
1 -1.0
2 -1.0
query produced 3 rows x 1 columns in 0.29 seconds

The PREDICT Function

Models are first-class citizens of TQL, and are accessible via the native function PREDICT(). Models are published into TQL with the column expressions used to produce the features, and because TQL expressions are applied to timeline events only when a query is executed, TQL can create result sets with model predictions as well as other attributes or metrics. For example:

select(
	'bid.ad_size',
	'PREDICT("my_model")'
).from_events('lethe4').limit(3)

Since the output of PREDICT() is simply a floating point number, it can be further manipulated using the expression language. Multiple models can even be composed together to compute derived values. For example, perhaps you would like to join together your predictions of winrate and conversion rate:

select(
	"PREDICT('winrate_model') * PREDICT('coversion_rate_model')"
).from_events('lethe4').limit(3)

Models must be previously published into TQL. Refer to Model Training and Publishing guide for more information on this topic.

The SET_PROPERTY Function

The native function SET_PROPERTY()can be used to temporarily change an attribute of the input event for the rest of the execution of the current expression. For example:

select(
	'bid.ad_size',
	'SET_PROPERTY("bid.ad_size", "tiny"); bid.ad_size'
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1
0 None tiny
1 big tiny
2 big tiny
query produced 3 rows x 2 columns in 0.28 seconds

This functionality is powerful in conjunction with the Model Predict Function, allowing you to easily write queries that make counterfactual predictions. For example:

select(
	'bid.ad_size',
	'SET_PROPERTY("bid.ad_size", "tiny"); bid.ad_size'
	'PREDICT("my_winrate_model")',
	'SET_PROPERTY("bid.ad_size", "tiny"); PREDICT("my_winrate_model")'
).from_events('lethe4').limit(3)

Comments

Expressions may contain comments. Both inline and multi-line comments are supported:

select('''
ad_size = bid.ad_size; //this is a comment
/*
this is a comment...
...on multiple lines
*/
ad_size 
''').from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0
0 None
1 big
2 big
query produced 3 rows x 1 columns in 0.3 seconds

User Defined Functions (UDFs)

In addition to TQL’s library of built-in functions, the user can write additional functions to extend the language’s capabilities with custom behavior. Functions are defined with the syntax function(...){ ... }. A function may also provide an optional doc string preceding it. For example:

select(
  'clean_ad_size()'
).from_events('lethe4').udf('''
function clean_ad_size() {
  COALESCE(UPPER(bid.ad_size), 'UNKNOWN')
}
''').limit(3)

Query results:

partition "_default"
_c0
0 UNKNOWN
1 BIG
2 BIG
query produced 3 rows x 1 columns in 0.27 seconds

Just like regular expressions, UDFs may contain multiple statements, delimited by a ;, in which case the last statement must return a value. UDFs may also define input arguments. For example:

select(
  'circumference(3)'
).from_events('lethe4').udf('''
function circumference(r) {
	diameter = 2 * r;
	circ = diameter *  PI();
	circ
}
''').limit(3)

Query results:

partition "_default"
_c0
0 18.84955592153876
1 18.84955592153876
2 18.84955592153876
query produced 3 rows x 1 columns in 0.26 seconds

Read full guide to writing user-defined functions here.

Compile and Runtime Error Handling

If an expression fails to compile, an error will be shown when trying to submit the query.

#For example, 'blah'  has no meaning here:
select('blah').from_events('lethe4')
Encountered 1 expression compilation error: Line 1:0 Unknown variable '@blah' (in column _c0)
Expression error: Line 1:0 Unknown variable '@blah' (in column _c0)
blah
^




---------------------------------------------------------------------------

TQLAnalysisException                      Traceback (most recent call last)

~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self)
    893 
    894     def _repr_html_(self):
--> 895         return self.submit(interactive=self._interactive)._repr_html_()
    896 
    897     def __iter__(self):


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark)
    779         from noumena.tql import icarus
    780         start = time.time()
--> 781         submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads)
    782         if interactive:
    783             if self.spec.get('allow_invalid_column_expressions') is None:


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json)
     14         print(json.dumps(json_data))
     15     resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16     return _handle_response(url, resp, print_json=print_json)
     17 
     18 


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json)
     70             else:
     71                 print(reason)
---> 72             raise TQLAnalysisException(reason)
     73         else:
     74             msg = ''


TQLAnalysisException: Encountered 1 expression compilation error: Line 1:0 Unknown variable '@blah' (in column _c0)Expression error: Line 1:0 Unknown variable '@blah' (in column _c0)
blah
^





Query<project_id=4, 
interactive=None, 
timeline_limit=None, 
row_limit=None, 
from_filters=None, 
where_filters=None, 
columns=[{'name': None, 'expression': 'blah', 'type': 'METADATA'}], 
sampling=None>
#In this example, the `LOG` function requires two operands, the input number and the log base to use:
select('log(3)').from_events('lethe4')
Encountered 1 expression compilation error: Line 1:0 LOG function needs 2 operands (in column _c0)
Expression error: Line 1:0 LOG function needs 2 operands (in column _c0)
log(3)
^




---------------------------------------------------------------------------

TQLAnalysisException                      Traceback (most recent call last)

~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self)
    893 
    894     def _repr_html_(self):
--> 895         return self.submit(interactive=self._interactive)._repr_html_()
    896 
    897     def __iter__(self):


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark)
    779         from noumena.tql import icarus
    780         start = time.time()
--> 781         submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads)
    782         if interactive:
    783             if self.spec.get('allow_invalid_column_expressions') is None:


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json)
     14         print(json.dumps(json_data))
     15     resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16     return _handle_response(url, resp, print_json=print_json)
     17 
     18 


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json)
     70             else:
     71                 print(reason)
---> 72             raise TQLAnalysisException(reason)
     73         else:
     74             msg = ''


TQLAnalysisException: Encountered 1 expression compilation error: Line 1:0 LOG function needs 2 operands (in column _c0)Expression error: Line 1:0 LOG function needs 2 operands (in column _c0)
log(3)
^





Query<project_id=4, 
interactive=None, 
timeline_limit=None, 
row_limit=None, 
from_filters=None, 
where_filters=None, 
columns=[{'name': None, 'expression': 'log(3)', 'type': 'METADATA'}], 
sampling=None>
#In this example, we simply forgot a closing parenthesis:
select('log(3').from_events('lethe4')
Encountered 1 expression compilation error: Line 1:5 missing ')' at '<EOF>' (in column _c0)
Expression error: Line 1:5 missing ')' at '<EOF>' (in column _c0)
log(3
-----^




---------------------------------------------------------------------------

TQLAnalysisException                      Traceback (most recent call last)

~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self)
    893 
    894     def _repr_html_(self):
--> 895         return self.submit(interactive=self._interactive)._repr_html_()
    896 
    897     def __iter__(self):


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark)
    779         from noumena.tql import icarus
    780         start = time.time()
--> 781         submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads)
    782         if interactive:
    783             if self.spec.get('allow_invalid_column_expressions') is None:


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json)
     14         print(json.dumps(json_data))
     15     resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16     return _handle_response(url, resp, print_json=print_json)
     17 
     18 


~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json)
     70             else:
     71                 print(reason)
---> 72             raise TQLAnalysisException(reason)
     73         else:
     74             msg = ''


TQLAnalysisException: Encountered 1 expression compilation error: Line 1:5 missing ')' at '<EOF>' (in column _c0)Expression error: Line 1:5 missing ')' at '<EOF>' (in column _c0)
log(3
-----^





Query<project_id=4, 
interactive=None, 
timeline_limit=None, 
row_limit=None, 
from_filters=None, 
where_filters=None, 
columns=[{'name': None, 'expression': 'log(3', 'type': 'METADATA'}], 
sampling=None>

However, if an expression compiles successfully but fails during evaluation, then the result for that row will be null. In this example, upper(bid.ad_size) returns null because some events rows do not have ad size defined at all.

select(
	'bid.ad_size', 
	'upper(bid.ad_size)'
).from_events('lethe4').limit(3)

Query results:

partition "_default"
_c0 _c1
0 None None
1 big BIG
2 big BIG
query produced 3 rows x 2 columns in 0.29 seconds

Expression Language Cheat Sheet

Where To Next?

Browse the full expression function reference.