Using ResultSets

A TQL ResultSet is the result of calling the Query object’s submit() method. A ResultSet contains exactly one row per event and exactly one column per TQL expression defined in the select() clause.

When submitting a Query, you can make it either interactive with submit(interactive=True) or asynchronous with submit(interactive=False). Interactive queries store their ResultSet data in memory; asynchronous queries store their ResultSets on disk. ResultSets stored on disk can be recalled at a later time (see the Load section below).

from zeenk.tql import *
------------------------------------------------------------------
           Version: 20.1.17-SNAPSHOT
 Version Timestamp: 2021-08-27 20:56:17
       Version Age: 4 days, 13 hours, 39 minutes, 56 seconds
   Filesystem Root: /Users/zkozick/.tql/files
 Working Directory: /Users/zkozick/.tql
Configuration File: /Users/zkozick/.tql/tql-conf.yml
       Api Gateway: http://localhost:9000
    Service Status: Icarus: ONLINE, Daedalus: ONLINE
    Service Uptime: 1 days, 18 hours, 7 minutes, 55 seconds
------------------------------------------------------------------
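
For example, a minimal sketch of the two submission modes described above, using the same 'lethe4' event source as the rest of this page:

# interactive: ResultSet data is held in memory and available immediately
interactive_rs = select('bid.*').from_events('lethe4').limit(5).submit(interactive=True)

# asynchronous: ResultSet data is written to disk and can be recalled later by ID
async_rs = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
rs_id = async_rs.get_id()  # keep the ID so the ResultSet can be reloaded later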

Loading ResultSet Data

Loading Data as a Python List

The .dataframe() method returns the ResultSet data as a Python list of lists, combining rows from all partitions:

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
# all rows of the ResultSet, as a list of lists
resultset.dataframe()
[[None, None, None, None, None, None, None],
 ['big', '1.0', '2021-01-02 00:02:36', 'false', 'r13260', 'u252', 'true'],
 ['big', '1.0', '2021-01-02 00:16:41', 'false', 'r13376', 'u252', 'true'],
 ['small', '1.0', '2021-01-02 00:30:55', 'false', 'r13509', 'u252', 'true'],
 [None, None, None, None, None, None, None]]
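
Because .dataframe() returns plain lists, one way to work with the rows is to pair them with the column names (see column_names() further down); a minimal sketch:

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
names = resultset.column_names()
# pair each row's values with its column names to get dict-style records
records = [dict(zip(names, row)) for row in resultset.dataframe()]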

Loading Data as a Pandas Dataframe

Similar to .dataframe(), the .pandas_dataframe() method returns the same data, but as a Pandas DataFrame.

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.pandas_dataframe()
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None

Loading Data as a Spark Dataframe

Similar to .dataframe(), the .spark_dataframe() method returns the data, but as a Spark DataFrame. The backend Java process that produces the data outputs it as strings, so Spark cannot always infer data types automatically.

resultset = select("'test'", '1').from_events('lethe4').limit(5).submit()
spark_df = resultset.spark_dataframe()
spark_df.show(2)
+----+---+
|  _1| _2|
+----+---+
|test|  1|
|test|  1|
+----+---+
only showing top 2 rows
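
Because the columns arrive as strings, you may need to cast them to the types you expect; a minimal sketch using standard PySpark casts (the bid_bid and bid_event_time column names are taken from the earlier examples on this page):

from pyspark.sql.functions import col

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
spark_df = resultset.spark_dataframe()

# cast the string columns to more useful types
typed_df = spark_df \
    .withColumn('bid_bid', col('bid_bid').cast('double')) \
    .withColumn('bid_event_time', col('bid_event_time').cast('timestamp'))
typed_df.printSchema()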

ResultSet Partitions

ResultSet data is always partitioned. If the query did not define a partition key, there will be only a single default partition named _default. If .partition_by(expr) was defined, there will be N partitions, one for each unique value produced by the partition key expression.

Default Partition

The method default_partition() outputs the contents of the default partition.

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.default_partition()
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None

Partition Names

The partition_names() method outputs a tuple of partition names.

resultset = select('activity.*')\
.from_events('lethe4')\
.limit(7)\
.partition_by('IF(RANDOM() < 0.5, "train", "test")')\
.submit()

resultset.partition_names()
('test', 'train')

Partitions

The partitions() method outputs Partition objects.

for part in resultset.partitions():
  print(f"""
name: {part.name()}
row_count: {part.row_count()}""")
name: test
row_count: 5

name: train
row_count: 2

Partition

The partition() method returns the partition whose name is supplied as an argument; if no argument is supplied, it falls back to the default partition.

resultset.partition('test')
activity_activity_id activity_conversion activity_event_time activity_user_id
0 None None None None
1 None None None None
2 None None None None
3 None None None None
4 None None None None
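
Combining partition_names() and partition(), you can walk every partition by name; a minimal sketch (it assumes the object returned by partition() exposes the same row_count() accessor as the Partition objects shown above):

for name in resultset.partition_names():
    part = resultset.partition(name)
    print(name, part.row_count())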

Retrieving Column Names

Calling column_names() retrieves the column names of the ResultSet.

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.column_names()

Row Count

The row_count() method returns the total number of rows in the ResultSet over all partitions.

resultset.row_count()
7
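
Since row_count() spans all partitions, it should agree with the per-partition counts reported by partitions(); a quick sanity check:

total = sum(part.row_count() for part in resultset.partitions())
assert total == resultset.row_count()  # 7 for the partitioned example above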

Positive Row Count

The positive_row_count() method returns the number of positive rows.

resultset.positive_row_count()
0

Get Query

The method get_query() returns the Query that was used to generate the ResultSet.

query = resultset.get_query()
query.describe()
Query
Attribute Value
project_id 4
timeline_limit -1
row_limit 7
partition_key_expression IF(RANDOM() < 0.5, "train", "test")
interactive True
allow_invalid_column_expressions True
Query.columns
name expression type
activity_activity_id activity.activity_id METADATA
activity_conversion activity.conversion METADATA
activity_event_time activity.event_time METADATA
activity_user_id activity.user_id METADATA
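
Because get_query() hands back a regular Query object, it can be resubmitted to reproduce the ResultSet; a minimal sketch:

query = resultset.get_query()
# re-run the same query; submit() returns a fresh ResultSet
new_resultset = query.submit()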

Get ID

The get_id() method returns the ID of the ResultSet if the query was executed as a non-interactive job. Otherwise, it returns -1.

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.get_id()

SUCCESS:  8/8|██████████|, TOTAL ELAPSED=0:00:17, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:16
33
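
For an interactive query, by contrast, there is no stored ResultSet ID and get_id() returns -1:

interactive_rs = select('bid.*').from_events('lethe4').limit(5).submit(interactive=True)
interactive_rs.get_id()  # -1, because the query ran interactively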

Load

The load_resultset() function takes a ResultSet ID as an argument and loads the corresponding ResultSet into the notebook. You can optionally set the analyze parameter to True to get an analysis of the Query.

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
rs_id = resultset.get_id()
load_resultset(rs_id).pandas_dataframe()

SUCCESS:  8/8|██████████|, TOTAL ELAPSED=0:00:06, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:05
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None
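
If you also want the analysis mentioned above, the analyze flag can be passed when loading; a sketch (the exact keyword placement is an assumption based on the description above):

load_resultset(rs_id, analyze=True).pandas_dataframe()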

Refresh

The refresh() method refreshes a non-interactive ResultSet's fields once query execution has finished.

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.refresh()
resultset

SUCCESS:  8/8|██████████|, TOTAL ELAPSED=0:00:07, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:06

Query results:

partition "_default"
key value
0 name _default
1 data_path file:/Users/smaguire/.tql/files/datasets/2021-08-31/94096521418144/data/partition=_default
2 row_count 40
3 columns [bid_ad_size, bid_bid, bid_event_time, bid_ghosted, bid_request_id, bid_user_id, bid_won]
4 positive_row_count 0
query produced 5 rows x 7 columns in 8.2 seconds
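
A typical pattern is to refresh a non-interactive ResultSet once its job completes and then read its fields; a minimal sketch:

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.refresh()           # pull the finished job's fields into the ResultSet
print(resultset.row_count())  # reflects the completed query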

Metrics

The method metrics() returns timing and runtime metadata about the ResultSet.

resultset = select(
  # column value metrics are computed when using FeatureColumns such as numerical() and categorical()
  numerical('1'),
  categorical('type', name='my_categorical_col')
).from_events('lethe4').limit(5).submit(analyze=True)
resultset.metrics()

Expression timing stats:

name expression mean(ms) p95(ms)
0 _c0 1 0.007 0.022
1 my_categorical_col type 0.015 0.046

Timeline & Event runtime:

min max mean total p50 p75 p95 p99 unit
timeline_runtime 7405 7405 7405 7405 7405.0 7405.0 7405.0 7405.00 micros
event_runtime 29 652 193 773 42.0 50.0 531.6 627.92 micros

Query summary:

key value
0 query_runtime_micros 21732
1 outcome_generated 0
2 timelines_processed 1
3 real_events_processed 5
4 event_passed_filters 5
5 opportunity_event_count 0
6 null_label_evaluation_count 0
7 null_weight_evaluation_count 0
8 events_processed 5
9 column_expanded False
10 dropped_empty_rows_after_conversion 0
11 positive_example_count 0
12 negative_example_count 5
13 empty_example_count 0
14 label_sum 0.0
15 weighted_label_sum 0.0

Column value metrics:

name column_index positive_count negative_count non_null_count non_zero_count positive_sum negative_sum positive_weighted_count negative_weighted_count label_sum weighted_label_sum min_numerical_value max_numerical_value total_weighted_count total_sum total_count
0 _c0 0 0 5 5 5 0.0 5.0 0.0 5.0 0.0 0.0 1.0 1.0 5.0 5.0 5
1 my_categorical_col_EQ_bid 1 0 3 3 0 0.0 0.0 0.0 3.0 0.0 0.0 NaN NaN 3.0 0.0 3
2 my_categorical_col_EQ_activity 1 0 1 1 0 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN 1.0 0.0 1
3 my_categorical_col_EQ_user 1 0 1 1 0 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN 1.0 0.0 1