The TQL Expression Language
The expression language is the core of TQL’s data processing engine. Expressions are written as Python strings and express a one-to-one transformation between an input timeline event object and a piece of output data. These transformations form the building blocks of all TQL data processing queries.
The language has strong syntactic similarities to SQL (basic functions, record-level structure) and Python (variables, list operations/aggregations, lambda/inline functions, user-defined functions). However, the expression language has several advantages:
Performant: TQL Expressions are compiled into Java byte code and hence, are very fast.
Lazily Evaluated: TQL Expressions are applied directly to raw timelines and events at query time, meaning all datasets are dynamically extracted from Timelines, eliminating the need for hard-to-maintain ETL pipelines.
Built for ML Model Evaluation: Trained ML Models deployed using TQL also store all the expressions expressions used to transform raw events into the training dataset, meaning feature extraction at scoring time is already built into the model.
TQL Expressions only exist in the context of a project and query. For this guide, we will the
demo project lethe4 to learn the language. Lethe represents a typical online advertising
funnel, with a set of online ad auctions bid
, a set of website activity logs activity
, and
static user demographics user
. Each Timeline represents the total of all observed events for
a particular user. For this guide we will be manipulating data primarily from the bid
TimeSeries.
Further reference: Expression language cheat sheet, and the full expression function reference.
Literals
TQL accepts string, boolean, integer, decimal, array, and null literals. For example:
from zeenk.tql import *
select(
'"Hello World!"',
1,
1.3,
True,
"[1,2,3,4]",
None
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | _c5 | |
---|---|---|---|---|---|---|
0 | Hello World! | 1 | 1.3 | true | [1,2,3,4] | None |
1 | Hello World! | 1 | 1.3 | true | [1,2,3,4] | None |
2 | Hello World! | 1 | 1.3 | true | [1,2,3,4] | None |
Event Attributes
In TQL expressions, attributes of timeline events are addressable as variables. For example, the lethe4
Project is composed of the bid
TimeSeries, which contains the attributes ad_size
, bid
, event_time
,
ghosted
, request_id
, user_id
, and won
. These attributes are directly selectable using the
expression language as variables, for example:
select(
'bid.ad_size',
'bid.bid',
'bid.ghosted',
'bid.request_id',
'bid.user_id',
'bid.won'
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | _c5 | |
---|---|---|---|---|---|---|
0 | None | None | None | None | None | None |
1 | big | 1.0 | false | r13260 | u252 | true |
2 | big | 1.0 | false | r13376 | u252 | true |
Booleans and Logical Operators
The expression language contains the case-insensitive primitives true
and false
. Non-null, non-empty,
non-zero values are also considered “truthy” and will be intepreted as true in a boolean context, such
as in a where clause, The expression language has the native boolean operators AND
and OR
to express
compound conditionals. These may be used in isolation, or in conjunction with the IF()
function to
emit conditional values in columns.
select(
'true', #primitives of true
'FaLsE', #case insensitive primitives
'1 OR "non-empty string!"', # "truthy" values with compound conditionals
'0 AND ""', # "falsey" values
'IF(bid.ad_size == "big", "HUGE", "default")' #the IF function
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | |
---|---|---|---|---|---|
0 | true | false | true | false | default |
1 | true | false | true | false | HUGE |
2 | true | false | true | false | HUGE |
TQL Functions
TQL provides a rich set of functions for transforming event attributes into pieces of derived data, which will be covered in the following sections.
Function Reference
The expression function reference can be found here. Additionally, you can look up
expression function usage directly in your Python code using the helper
object.
helper.find_function('search_term')
lists functions matching the given search term:
helper.find_function('LOG')
['LOG', 'LOG10', 'LOG1P', 'LOG2', 'NATLOG']
heper.function_usage('function_name')
displays the usage of any function:
helper.function_usage('LOG')
Usage: | LOG(number, base) |
---|---|
description | Return the logarithm of a {number} for a given {base}. Any null, a base of 0, or a base of 1 will return null. |
parameters | number: Double base: Double |
return_type | Double |
function_type | native |
Mathematical Operators
The Expression Language contains a basic set of mathematical operators and functions to cover basic numerical transformations.
select(
'1 * (3 + 4) / 2 % 2',
'EXP(1)',
'LOG(23, 2)',
'PI()',
'SIN(45)',
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | |
---|---|---|---|---|---|
0 | 1.5 | 2.718281828459045 | 4.523561956057013 | 3.141592653589793 | 0.8509035245341184 |
1 | 1.5 | 2.718281828459045 | 4.523561956057013 | 3.141592653589793 | 0.8509035245341184 |
2 | 1.5 | 2.718281828459045 | 4.523561956057013 | 3.141592653589793 | 0.8509035245341184 |
Probabilibity Distributions
select(
#computes the probability density using the specified, distribution, X value, and scale
'PDF("exp", 1, 1.0)',
#computes the cumulative distribution using the specified, distribution, X value, and scale
'CDF("exp", 1, 1.0)',
#computes the cumulative distribution using the specified, distribution, X value, and scale
'iCDF("exp", .95, 1.0)',
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | |
---|---|---|---|
0 | 0.36787944117144233 | 0.6321205588285577 | 2.99573227355399 |
1 | 0.36787944117144233 | 0.6321205588285577 | 2.99573227355399 |
2 | 0.36787944117144233 | 0.6321205588285577 | 2.99573227355399 |
Generating Random Numbers
select(
#generate a random number
'RANDOM()',
#generate a random number from the given distribution
'SAMPLE("exp")',
#create a random number generator from a known seed that can be reused multiple times in an expression.
'rng = RNG("my_seed"); [rng.next, rng.next]',
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | |
---|---|---|
0 | [0.028381221364210618,0.2602691934427571] | 1.5990840325478177 |
1 | [0.028381221364210618,0.2602691934427571] | 0.23555180093113778 |
2 | [0.028381221364210618,0.2602691934427571] | 3.4662128831513432 |
Hashing
select(
'HASH("foo")',#hash the input to a number using java hashcode
'MD5("foo")', #compute the string MD5 sum of the input
'MD5_LONG("foo")',#compute the long MD5 sum of the input
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | |
---|---|---|---|
0 | 101574 | ACBD18DB4CC2F85CEDEF654FCCC4A4D8 | 1301710373874522920 |
1 | 101574 | ACBD18DB4CC2F85CEDEF654FCCC4A4D8 | 1301710373874522920 |
2 | 101574 | ACBD18DB4CC2F85CEDEF654FCCC4A4D8 | 1301710373874522920 |
String Manipulation
The Expression Language provides the ability to perform string manipulation operations commonly needed for transforming categorical features:
CONCAT()
, UPPER()
, LOWER()
, COALESCE()
, STRING_SPLIT()
, REGEX_EXTRACT()
, REGEX_REPLACE()
select(
'bid.ad_size', #base attribute of ad size
'UPPER(bid.ad_size)', #upper case
'lower(bid.ad_size)', #lower case
'COALESCE(bid.ad_size, "UNKNOWN")', #coalesce nulls
'STRING_SPLIT(bid.ad_size, "i")', #split strings into arrays
'REGEX_EXTRACT(bid.ad_size, "[a-zA-Z0-9]", 0)', #extract a regex group
'REGEX_REPLACE(bid.ad_size, ["b", "_", "i", "xx"])', #replace characters in a string
'TO_STRING(1.3)', #cast anything to a string
'bid.ad_size == "BIG"' #string equality
).from_events('lethe4').limit(3)
Date/Time Manipulation
TODONOW()
, DATE()
, TO_DATETIME()
, DAY_OF_WEEK()
, MILLIS()
, FROM_UNIXTIME()
select(
'NOW()', #returns the current time in milliseconds
'DATE(NOW())', #parses the input into a date string
'TO_DATETIME("2021-03-21 12:51:00")', #parses the input string or number into a datetime object
'DAY_OF_WEEK(TO_DATETIME("2021-03-21"))', #emits the day of the week of the input in the range of [1,7]
'MILLIS(TO_DATETIME("2021-03-21"))', #emits the epoch milliseconds of the input
'FROM_UNIXTIME(NOW(), "yyyy/MM/dd")' #formats the input epoch timestamp using Joda date format.
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | _c5 | |
---|---|---|---|---|---|---|
0 | 1630353837028 | 2021-08-30 | 2021-03-21T12:51:00.000Z | 7 | 1616284800000 | 2021/08/30 |
1 | 1630353837028 | 2021-08-30 | 2021-03-21T12:51:00.000Z | 7 | 1616284800000 | 2021/08/30 |
2 | 1630353837028 | 2021-08-30 | 2021-03-21T12:51:00.000Z | 7 | 1616284800000 | 2021/08/30 |
Array Manipulation
Arrays first class citizens of TQL. Arrays can be defined using bracket notation [obj1, obj2, ...]
individual data from an array can be addressed using brackets array[idx].
select(
'RANGE(0, 5)', #generate an array of numbers in the specified range
'[1,2,3,4][3]' #get the nth element of the array
'SORT([6, 1, 234, 2])', #sort the array
'SIZE([3, 4, 58])', #find the size of the array
'MIN([3, 4, 58])', #find the min value of the array
'MAX([3, 4, 58])', #find the max value of the array
'AVG([3, 4, 58])', #find the average value of the array
'STDEV([3, 4, 58])', #take the standard deviation of the array
'MAP([1, 2, 3, 4], (x) -> x*2)', #transform an array to another array
'FILTER([1, 2, 3, 4], (x) -> x > 3)', #filter an array to a smaller array
'REDUCE(["a", "b", "c"], (x,y) -> CONCAT(x, y))', #combine an array to a single value
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | _c2 | _c3 | _c4 | _c5 | _c6 | _c7 | _c8 | _c9 | _c10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | [2,4,6,8] | abc | [4] | [0,1,2,3,4] | [1,2,6,234] | 3.0 | 3 | 58 | 21.666666666666668 | 25.69478978746902 | 4 |
1 | [2,4,6,8] | abc | [4] | [0,1,2,3,4] | [1,2,6,234] | 3.0 | 3 | 58 | 21.666666666666668 | 25.69478978746902 | 4 |
2 | [2,4,6,8] | abc | [4] | [0,1,2,3,4] | [1,2,6,234] | 3.0 | 3 | 58 | 21.666666666666668 | 25.69478978746902 | 4 |
Array Manipulation of Timeline Events
special variable timeline.events
is also an array, and is commonly transformed using the array manipulation
functions. For example, we can extract the number of small ad size bids we made for each timeline:
select(
'SIZE(FILTER(timeline.events, (x) -> x.type == "bid" and x.bid.ad_size == "small"))',
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | |
---|---|
0 | 221.0 |
1 | 221.0 |
2 | 221.0 |
Local Variables and Multi-Statement Expressions
Long transformations can be broken up into multiple statements, delimited by a ;
for readability. For example:
select('''
positive_value=1.0;
negative_value=-1.0;
ad_size=COALESCE(UPPER(bid.ad_size), 'UNKNOWN');
if(bid.ad_size == 'BIG', positive_value, negative_value)
''').from_events('lethe4').limit(3)
#In multi-statement expressions, the last statement in the expression must return a value.
Query results:
partition "_default"_c0 | |
---|---|
0 | -1.0 |
1 | -1.0 |
2 | -1.0 |
The PREDICT Function
Models are first-class citizens of TQL, and are accessible via the native function PREDICT()
. Models are
published into TQL with the column expressions used to produce the features, and because TQL expressions
are applied to timeline events only when a query is executed, TQL can create result sets with model predictions
as well as other attributes or metrics. For example:
select(
'bid.ad_size',
'PREDICT("my_model")'
).from_events('lethe4').limit(3)
Since the output of PREDICT()
is simply a floating point number, it can be further manipulated using the
expression language. Multiple models can even be composed together to compute derived values. For example,
perhaps you would like to join together your predictions of winrate and conversion rate:
select(
"PREDICT('winrate_model') * PREDICT('coversion_rate_model')"
).from_events('lethe4').limit(3)
Models must be previously published into TQL. Refer to Model Training and Publishing guide for more information on this topic.
The SET_PROPERTY Function
The native function SET_PROPERTY()
can be used to temporarily change an attribute of the input event for the
rest of the execution of the current expression. For example:
select(
'bid.ad_size',
'SET_PROPERTY("bid.ad_size", "tiny"); bid.ad_size'
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | |
---|---|---|
0 | None | tiny |
1 | big | tiny |
2 | big | tiny |
This functionality is powerful in conjunction with the Model Predict Function, allowing you to easily write queries that make counterfactual predictions. For example:
select(
'bid.ad_size',
'SET_PROPERTY("bid.ad_size", "tiny"); bid.ad_size'
'PREDICT("my_winrate_model")',
'SET_PROPERTY("bid.ad_size", "tiny"); PREDICT("my_winrate_model")'
).from_events('lethe4').limit(3)
User Defined Functions (UDFs)
In addition to TQL’s library of built-in functions, the user can write additional functions to extend the
language’s capabilities with custom behavior. Functions are defined with the syntax function(...){ ... }
.
A function may also provide an optional doc string preceding it. For example:
select(
'clean_ad_size()'
).from_events('lethe4').udf('''
function clean_ad_size() {
COALESCE(UPPER(bid.ad_size), 'UNKNOWN')
}
''').limit(3)
Query results:
partition "_default"_c0 | |
---|---|
0 | UNKNOWN |
1 | BIG |
2 | BIG |
Just like regular expressions, UDFs may contain multiple statements, delimited by a ;
, in which case the last
statement must return a value. UDFs may also define input arguments. For example:
select(
'circumference(3)'
).from_events('lethe4').udf('''
function circumference(r) {
diameter = 2 * r;
circ = diameter * PI();
circ
}
''').limit(3)
Query results:
partition "_default"_c0 | |
---|---|
0 | 18.84955592153876 |
1 | 18.84955592153876 |
2 | 18.84955592153876 |
Read full guide to writing user-defined functions here.
Compile and Runtime Error Handling
If an expression fails to compile, an error will be shown when trying to submit the query.
#For example, 'blah' has no meaning here:
select('blah').from_events('lethe4')
Encountered 1 expression compilation error: Line 1:0 Unknown variable '@blah' (in column _c0)
Expression error: Line 1:0 Unknown variable '@blah' (in column _c0)
blah
^
---------------------------------------------------------------------------
TQLAnalysisException Traceback (most recent call last)
~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self)
893
894 def _repr_html_(self):
--> 895 return self.submit(interactive=self._interactive)._repr_html_()
896
897 def __iter__(self):
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark)
779 from noumena.tql import icarus
780 start = time.time()
--> 781 submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads)
782 if interactive:
783 if self.spec.get('allow_invalid_column_expressions') is None:
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json)
14 print(json.dumps(json_data))
15 resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16 return _handle_response(url, resp, print_json=print_json)
17
18
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json)
70 else:
71 print(reason)
---> 72 raise TQLAnalysisException(reason)
73 else:
74 msg = ''
TQLAnalysisException: Encountered 1 expression compilation error: Line 1:0 Unknown variable '@blah' (in column _c0)Expression error: Line 1:0 Unknown variable '@blah' (in column _c0)
blah
^
Query<project_id=4,
interactive=None,
timeline_limit=None,
row_limit=None,
from_filters=None,
where_filters=None,
columns=[{'name': None, 'expression': 'blah', 'type': 'METADATA'}],
sampling=None>
#In this example, the `LOG` function requires two operands, the input number and the log base to use:
select('log(3)').from_events('lethe4')
Encountered 1 expression compilation error: Line 1:0 LOG function needs 2 operands (in column _c0)
Expression error: Line 1:0 LOG function needs 2 operands (in column _c0)
log(3)
^
---------------------------------------------------------------------------
TQLAnalysisException Traceback (most recent call last)
~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self)
893
894 def _repr_html_(self):
--> 895 return self.submit(interactive=self._interactive)._repr_html_()
896
897 def __iter__(self):
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark)
779 from noumena.tql import icarus
780 start = time.time()
--> 781 submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads)
782 if interactive:
783 if self.spec.get('allow_invalid_column_expressions') is None:
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json)
14 print(json.dumps(json_data))
15 resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16 return _handle_response(url, resp, print_json=print_json)
17
18
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json)
70 else:
71 print(reason)
---> 72 raise TQLAnalysisException(reason)
73 else:
74 msg = ''
TQLAnalysisException: Encountered 1 expression compilation error: Line 1:0 LOG function needs 2 operands (in column _c0)Expression error: Line 1:0 LOG function needs 2 operands (in column _c0)
log(3)
^
Query<project_id=4,
interactive=None,
timeline_limit=None,
row_limit=None,
from_filters=None,
where_filters=None,
columns=[{'name': None, 'expression': 'log(3)', 'type': 'METADATA'}],
sampling=None>
#In this example, we simply forgot a closing parenthesis:
select('log(3').from_events('lethe4')
Encountered 1 expression compilation error: Line 1:5 missing ')' at '<EOF>' (in column _c0)
Expression error: Line 1:5 missing ')' at '<EOF>' (in column _c0)
log(3
-----^
---------------------------------------------------------------------------
TQLAnalysisException Traceback (most recent call last)
~/dev/.virtualenvs/noumena-dev/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
343 method = get_real_method(obj, self.print_method)
344 if method is not None:
--> 345 return method()
346 return None
347 else:
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in _repr_html_(self)
893
894 def _repr_html_(self):
--> 895 return self.submit(interactive=self._interactive)._repr_html_()
896
897 def __iter__(self):
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/query.py in submit(self, interactive, wait, print_payloads, analyze, spark)
779 from noumena.tql import icarus
780 start = time.time()
--> 781 submit_reply = icarus.post('query/submit', self.json(), print_json=print_payloads)
782 if interactive:
783 if self.spec.get('allow_invalid_column_expressions') is None:
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in post(url, json_data, files_data, data, print_json)
14 print(json.dumps(json_data))
15 resp = requests.post(url, data=data, json=json_data, files=files_data)
---> 16 return _handle_response(url, resp, print_json=print_json)
17
18
~/dev/nanigans/noumena/noumena-python/noumena/noumena/tql/icarus.py in _handle_response(url, resp, print_json)
70 else:
71 print(reason)
---> 72 raise TQLAnalysisException(reason)
73 else:
74 msg = ''
TQLAnalysisException: Encountered 1 expression compilation error: Line 1:5 missing ')' at '<EOF>' (in column _c0)Expression error: Line 1:5 missing ')' at '<EOF>' (in column _c0)
log(3
-----^
Query<project_id=4,
interactive=None,
timeline_limit=None,
row_limit=None,
from_filters=None,
where_filters=None,
columns=[{'name': None, 'expression': 'log(3', 'type': 'METADATA'}],
sampling=None>
However, if an expression compiles successfully but fails during evaluation, then the result for that row will
be null. In this example, upper(bid.ad_size)
returns null because some events rows do not have ad size
defined at all.
select(
'bid.ad_size',
'upper(bid.ad_size)'
).from_events('lethe4').limit(3)
Query results:
partition "_default"_c0 | _c1 | |
---|---|---|
0 | None | None |
1 | big | BIG |
2 | big | BIG |
Expression Language Cheat Sheet
Where To Next?
Browse the full expression function reference.
Comments
Expressions may contain comments. Both inline and multi-line comments are supported:
Query results:
partition "_default"