Using ResultSets

A TQL ResultSet is the result of calling the Query object’s submit() method. A ResultSet contains exactly one row per event and exactly one column per TQL expression defined in the select() clause.

When submitting a Query, you can make it either interactive with submit(interactive=True) or asynchronous with submit(interactive=False). Interactive queries store their ResultSet data in memory; asynchronous queries store their ResultSets on disk. ResultSets stored on disk can be recalled at a later time (see the Load section below).

from zeenk.tql import *
------------------------------------------------------------------
           Version: 20.1.17-SNAPSHOT
 Version Timestamp: 2021-08-27 20:56:17
       Version Age: 4 days, 13 hours, 39 minutes, 56 seconds
   Filesystem Root: /Users/zkozick/.tql/files
 Working Directory: /Users/zkozick/.tql
Configuration File: /Users/zkozick/.tql/tql-conf.yml
       Api Gateway: http://localhost:9000
    Service Status: Icarus: ONLINE, Daedalus: ONLINE
    Service Uptime: 1 days, 18 hours, 7 minutes, 55 seconds
------------------------------------------------------------------
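
For example, a minimal sketch of the two submission modes described above, using the same 'lethe4' event source as the rest of this page:

# interactive: ResultSet data is held in memory and available immediately
interactive_rs = select('bid.*').from_events('lethe4').limit(5).submit(interactive=True)

# asynchronous: ResultSet data is written to disk and can be recalled later by ID
async_rs = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
rs_id = async_rs.get_id()  # keep the ID so the ResultSet can be reloaded later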

Loading ResultSet Data

Loading Data as a Python List

The .dataframe() method returns the ResultSet data as a Python list of lists, combining rows from all partitions:

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
# all rows of the ResultSet, as a list of lists
resultset.dataframe()
[[None, None, None, None, None, None, None],
 ['big', '1.0', '2021-01-02 00:02:36', 'false', 'r13260', 'u252', 'true'],
 ['big', '1.0', '2021-01-02 00:16:41', 'false', 'r13376', 'u252', 'true'],
 ['small', '1.0', '2021-01-02 00:30:55', 'false', 'r13509', 'u252', 'true'],
 [None, None, None, None, None, None, None]]
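
Because .dataframe() returns plain lists, one way to work with the rows is to pair them with the column names (see column_names() further down); a minimal sketch:

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
names = resultset.column_names()
# pair each row's values with its column names to get dict-style records
records = [dict(zip(names, row)) for row in resultset.dataframe()]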

Loading Data as a Pandas Dataframe

Similar to .dataframe(), the .pandas_dataframe() method returns the same data, but as a Pandas DataFrame.

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.pandas_dataframe()
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None

Loading Data as a Spark Dataframe

Similar to .dataframe(), the .spark_dataframe() method returns the data, but as a Spark DataFrame. The backend Java process that produces the data outputs it as strings, so Spark cannot always infer data types automatically.

resultset = select("'test'", '1').from_events('lethe4').limit(5).submit()
spark_df = resultset.spark_dataframe()
spark_df.show(2)
+----+---+
|  _1| _2|
+----+---+
|test|  1|
|test|  1|
+----+---+
only showing top 2 rows
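
Because the columns arrive as strings, you may need to cast them to the types you expect; a minimal sketch using standard PySpark casts (the bid_bid and bid_event_time column names are taken from the earlier examples on this page):

from pyspark.sql.functions import col

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
spark_df = resultset.spark_dataframe()

# cast the string columns to more useful types
typed_df = spark_df \
    .withColumn('bid_bid', col('bid_bid').cast('double')) \
    .withColumn('bid_event_time', col('bid_event_time').cast('timestamp'))
typed_df.printSchema()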

ResultSet Partitions

ResultSet data is always partitioned. If the query did not define a partition key, there will be only a single default partition named _default. If .partition_by(expr) was defined, there will be N partitions, one for each unique value produced by the partition key expression.

Default Partition

The method default_partition() outputs the contents of the default partition.

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.default_partition()
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None

Partition Names

The partition_names() method outputs a tuple of partition names.

resultset = select('activity.*')\
.from_events('lethe4')\
.limit(7)\
.partition_by('IF(RANDOM() < 0.5, "train", "test")')\
.submit()

resultset.partition_names()
('test', 'train')

Partitions

The partitions() method outputs Partition objects.

for part in resultset.partitions():
  print(f"""
name: {part.name()}
row_count: {part.row_count()}""")
name: test
row_count: 5

name: train
row_count: 2

Partition

The partition() method returns the partition whose name is supplied as an argument; if no argument is supplied, it falls back to the default partition.

resultset.partition('test')
activity_activity_id activity_conversion activity_event_time activity_user_id
0 None None None None
1 None None None None
2 None None None None
3 None None None None
4 None None None None
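
Combining partition_names() and partition(), you can walk every partition by name; a minimal sketch (it assumes the object returned by partition() exposes the same row_count() accessor as the Partition objects shown above):

for name in resultset.partition_names():
    part = resultset.partition(name)
    print(name, part.row_count())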

Retrieving Column Names

Calling column_names() retrieves the column names of the ResultSet.

resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.column_names()

Row Count

The row_count() method returns the total number of rows in the ResultSet over all partitions.

resultset.row_count()
7
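
Since row_count() spans all partitions, it should agree with the per-partition counts reported by partitions(); a quick sanity check:

total = sum(part.row_count() for part in resultset.partitions())
assert total == resultset.row_count()  # 7 for the partitioned example above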

Positive Row Count

The positive_row_count() method returns the number of positive rows.

resultset.positive_row_count()
0

Get Query

The method get_query() returns the Query that was used to generate the ResultSet.

query = resultset.get_query()
query.describe()
Query
Attribute Value
project_id 4
timeline_limit -1
row_limit 7
partition_key_expression IF(RANDOM() < 0.5, "train", "test")
interactive True
allow_invalid_column_expressions True
Query.columns
name expression type
activity_activity_id activity.activity_id METADATA
activity_conversion activity.conversion METADATA
activity_event_time activity.event_time METADATA
activity_user_id activity.user_id METADATA
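
Because get_query() hands back a regular Query object, it can be resubmitted to reproduce the ResultSet; a minimal sketch:

query = resultset.get_query()
# re-run the same query; submit() returns a fresh ResultSet
new_resultset = query.submit()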

Get ID

The get_id() method returns the ID of the ResultSet if the query was executed as a non-interactive job. Otherwise, it returns -1.

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.get_id()

SUCCESS:  8/8|██████████|, TOTAL ELAPSED=0:00:17, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:16
33
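
For an interactive query, by contrast, there is no stored ResultSet ID and get_id() returns -1:

interactive_rs = select('bid.*').from_events('lethe4').limit(5).submit(interactive=True)
interactive_rs.get_id()  # -1, because the query ran interactively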

Load

The load_resultset() function takes a ResultSet ID as an argument and loads the corresponding ResultSet into the notebook. You can optionally set the analyze parameter to True to get an analysis of the Query.

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
rs_id = resultset.get_id()
load_resultset(rs_id).pandas_dataframe()

SUCCESS:  8/8|██████████|, TOTAL ELAPSED=0:00:06, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:05
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None
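
If you also want the analysis mentioned above, the analyze flag can be passed when loading; a sketch (the exact keyword placement is an assumption based on the description above):

load_resultset(rs_id, analyze=True).pandas_dataframe()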

Refresh

The refresh() method refreshes a non-interactive ResultSet's fields once query execution has finished.

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.refresh()
resultset

SUCCESS:  8/8|██████████|, TOTAL ELAPSED=0:00:07, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:06

Query results:

partition "_default"
key value
0 name _default
1 data_path file:/Users/smaguire/.tql/files/datasets/2021-08-31/94096521418144/data/partition=_default
2 row_count 40
3 columns [bid_ad_size, bid_bid, bid_event_time, bid_ghosted, bid_request_id, bid_user_id, bid_won]
4 positive_row_count 0
query produced 5 rows x 7 columns in 8.2 seconds
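
A typical pattern is to refresh a non-interactive ResultSet once its job completes and then read its fields; a minimal sketch:

resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.refresh()           # pull the finished job's fields into the ResultSet
print(resultset.row_count())  # reflects the completed query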

Metrics

The method metrics() returns timing and runtime metadata about the ResultSet.

resultset = select(
  # column value metrics are computed when using FeatureColumns such as numerical() and categorical()
  numerical('1'),
  categorical('type', name='my_categorical_col')
).from_events('lethe4').limit(5).submit(analyze=True)
resultset.metrics()

Expression timing stats:

name expression mean(ms) p95(ms)
0 _c0 1 0.007 0.022
1 my_categorical_col type 0.015 0.046

Timeline & Event runtime:

min max mean total p50 p75 p95 p99 unit
timeline_runtime 7405 7405 7405 7405 7405.0 7405.0 7405.0 7405.00 micros
event_runtime 29 652 193 773 42.0 50.0 531.6 627.92 micros

Query summary:

key value
0 query_runtime_micros 21732
1 outcome_generated 0
2 timelines_processed 1
3 real_events_processed 5
4 event_passed_filters 5
5 opportunity_event_count 0
6 null_label_evaluation_count 0
7 null_weight_evaluation_count 0
8 events_processed 5
9 column_expanded False
10 dropped_empty_rows_after_conversion 0
11 positive_example_count 0
12 negative_example_count 5
13 empty_example_count 0
14 label_sum 0.0
15 weighted_label_sum 0.0

Column value metrics:

name column_index positive_count negative_count non_null_count non_zero_count positive_sum negative_sum positive_weighted_count negative_weighted_count label_sum weighted_label_sum min_numerical_value max_numerical_value total_weighted_count total_sum total_count
0 _c0 0 0 5 5 5 0.0 5.0 0.0 5.0 0.0 0.0 1.0 1.0 5.0 5.0 5
1 my_categorical_col_EQ_bid 1 0 3 3 0 0.0 0.0 0.0 3.0 0.0 0.0 NaN NaN 3.0 0.0 3
2 my_categorical_col_EQ_activity 1 0 1 1 0 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN 1.0 0.0 1
3 my_categorical_col_EQ_user 1 0 1 1 0 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN 1.0 0.0 1