# Using ResultSets

A TQL ResultSet is the result of calling the [Query](queries) object's `submit()` method. ResultSets contain ***exactly one row per event***, and ***exactly one column per TQL expression*** defined in the `select()` clause.

When submitting a [Query](queries), you can define it to be either interactive with `submit(interactive=True)` or asynchronous with `submit(interactive=False)`. Interactive queries store their ResultSet data in memory; asynchronous queries store their ResultSets on disk. ResultSets stored on disk can be recalled at a later time; see the [load() function](#Load) below.

```python
from zeenk.tql import *
```

------------------------------------------------------------------
Version:            20.1.17-SNAPSHOT
Version Timestamp:  2021-08-27 20:56:17
Version Age:        4 days, 13 hours, 39 minutes, 56 seconds
Filesystem Root:    /Users/zkozick/.tql/files
Working Directory:  /Users/zkozick/.tql
Configuration File: /Users/zkozick/.tql/tql-conf.yml
Api Gateway:        http://localhost:9000
Service Status:     Icarus: ONLINE, Daedalus: ONLINE
Service Uptime:     1 days, 18 hours, 7 minutes, 55 seconds
------------------------------------------------------------------

## Loading ResultSet Data

### Loading Data as a Python List

The `.dataframe()` function returns the data as a Python list of lists, drawn from all partitions:

```python
resultset = select('bid.*').from_events('lethe4').limit(5).submit()

# all rows of the dataframe, as a list of lists
resultset.dataframe()
```

[[None, None, None, None, None, None, None],
 ['big', '1.0', '2021-01-02 00:02:36', 'false', 'r13260', 'u252', 'true'],
 ['big', '1.0', '2021-01-02 00:16:41', 'false', 'r13376', 'u252', 'true'],
 ['small', '1.0', '2021-01-02 00:30:55', 'false', 'r13509', 'u252', 'true'],
 [None, None, None, None, None, None, None]]

### Loading Data as a Pandas Dataframe

Similar to `dataframe()`, `.pandas_dataframe()` returns the data, but as a [Pandas](https://pandas.pydata.org/) dataframe.
```python
resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.pandas_dataframe()
```
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None
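As the listing above shows, `pandas_dataframe()` delivers every value as a string (`'1.0'`, `'false'`, `'true'`). A minimal post-processing sketch in plain pandas (the column names and sample values below just mirror the example output; none of this is TQL API):

```python
import pandas as pd

# Stand-in for resultset.pandas_dataframe(): TQL delivers all values as strings
df = pd.DataFrame({
    'bid_bid': ['1.0', '1.0', '1.0'],
    'bid_won': ['true', 'true', 'false'],
    'bid_event_time': ['2021-01-02 00:02:36',
                       '2021-01-02 00:16:41',
                       '2021-01-02 00:30:55'],
})

# Coerce the string columns to numeric, boolean, and datetime dtypes
df['bid_bid'] = pd.to_numeric(df['bid_bid'])
df['bid_won'] = df['bid_won'].map({'true': True, 'false': False})
df['bid_event_time'] = pd.to_datetime(df['bid_event_time'])

print(df.dtypes)
```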
### Loading Data as a Spark Dataframe

Similar to `dataframe()`, `.spark_dataframe()` returns the data, but as a [Spark](https://spark.apache.org/docs/3.1.1/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations) dataframe. The backend Java process that manipulates the data outputs it as strings, so Spark cannot always automatically infer data types.

```python
resultset = select("'test'", '1').from_events('lethe4').limit(5).submit()
spark_df = resultset.spark_dataframe()
spark_df.show(2)
```

+----+---+
|  _1| _2|
+----+---+
|test|  1|
|test|  1|
+----+---+
only showing top 2 rows

## ResultSet Partitions

ResultSet data is always partitioned. If the query did not define a `partition_by(expr)`, there will be a single "default" partition named `_default`. If `partition_by(expr)` was defined, there will be N partitions, one for each unique value output by the partition key expression.

### Default Partition

The `default_partition()` method outputs the contents of the default partition.

```python
resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.default_partition()
```
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None
### Partition Names

The `partition_names()` method outputs a tuple of partition names.

```python
resultset = select('activity.*')\
    .from_events('lethe4')\
    .limit(7)\
    .partition_by('IF(RANDOM() < 0.5, "train", "test")')\
    .submit()

resultset.partition_names()
```

('test', 'train')

### Partitions

The `partitions()` method outputs [Partition](tql.html#tql.resultset.Partition) objects.

```python
for part in resultset.partitions():
    print(f"""
    name: {part.name()}
    row_count: {part.row_count()}""")
```

name: test
row_count: 5

name: train
row_count: 2

### Partition

The `partition()` method gets the partition whose name is supplied as an argument. If no argument is supplied, it tries to get the default partition.

```python
resultset.partition('test')
```
activity_activity_id activity_conversion activity_event_time activity_user_id
0 None None None None
1 None None None None
2 None None None None
3 None None None None
4 None None None None
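The partition key expression used above, `IF(RANDOM() < 0.5, "train", "test")`, assigns each row independently at query time, which is why the split can be uneven (5 vs. 2) over only 7 rows. The per-row logic is equivalent to this plain-Python sketch (stand-in code, not the TQL evaluator):

```python
import random
from collections import Counter

random.seed(0)  # seed chosen only to make this sketch reproducible

# Mirror IF(RANDOM() < 0.5, "train", "test") for each of 7 rows
assignments = ['train' if random.random() < 0.5 else 'test' for _ in range(7)]

# Every row lands in exactly one partition; counts vary run to run without a seed
print(Counter(assignments))
```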
## Retrieving Column Names

Calling `column_names()` retrieves the column names of the ResultSet.

```python
resultset = select('bid.*').from_events('lethe4').limit(5).submit()
resultset.column_names()
```

## Row Count

The `row_count()` method returns the total number of rows in the ResultSet across all partitions.

```python
resultset.row_count()
```

7

## Positive Row Count

The `positive_row_count()` method returns the number of positive rows.

```python
resultset.positive_row_count()
```

0

## Get Query

The `get_query()` method returns the query that was used to generate the ResultSet.

```python
query = resultset.get_query()
query.describe()
```
Query
Attribute Value
project_id 4
timeline_limit -1
row_limit 7
partition_key_expression IF(RANDOM() < 0.5, "train", "test")
interactive True
allow_invalid_column_expressions True
Query.columns
name expression type
activity_activity_id activity.activity_id METADATA
activity_conversion activity.conversion METADATA
activity_event_time activity.event_time METADATA
activity_user_id activity.user_id METADATA
## Get ID

The `get_id()` method returns the ID of the ResultSet if the query was executed as a non-interactive job. Otherwise, it returns `-1`.

```python
resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.get_id()
```


SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:17, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:16

33

## Load

The `load_resultset()` function takes a ResultSet ID as an argument and loads the corresponding ResultSet into the notebook. You can optionally set the `analyze` parameter to `True` to get an analysis of the query.

```python
resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
rs_id = resultset.get_id()
load_resultset(rs_id).pandas_dataframe()
```


SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:06, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:05
bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id bid_won
0 None None None None None None None
1 big 1.0 2021-01-02 00:02:36 false r13260 u252 true
2 big 1.0 2021-01-02 00:16:41 false r13376 u252 true
3 small 1.0 2021-01-02 00:30:55 false r13509 u252 true
4 None None None None None None None
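The workflow above — submit non-interactively, keep the ID, recall the ResultSet later — can be pictured with a toy registry. Everything here (`_STORE`, `save_result`, `load_result`) is a hypothetical stand-in for illustration, not the TQL API:

```python
# Toy registry keyed by result ID, standing in for on-disk ResultSet storage
_STORE = {}

def save_result(rs_id, rows):
    """Persist rows under an ID, as a non-interactive submit() would."""
    _STORE[rs_id] = rows

def load_result(rs_id):
    """Recall previously stored rows by ID, analogous to load_resultset(rs_id)."""
    return _STORE[rs_id]

save_result(33, [['big', '1.0'], ['small', '1.0']])
print(load_result(33))  # the same rows, recalled later by ID
```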
## Refresh

The `refresh()` method refreshes the fields of a non-interactive ResultSet after query execution finishes.

```python
resultset = select('bid.*').from_events('lethe4').limit(5).submit(interactive=False)
resultset.refresh()
resultset
```


SUCCESS: 8/8|██████████|, TOTAL ELAPSED=0:00:07, WAIT=0:00:00, PENDING=0:00:01, RUNNING=0:00:06

Query results:

partition "_default"
key value
0 name _default
1 data_path file:/Users/smaguire/.tql/files/datasets/2021-08-31/94096521418144/data/partition=_default
2 row_count 40
3 columns [bid_ad_size, bid_bid, bid_event_time, bid_ghosted, bid_request_id, bid_user_id, bid_won]
4 positive_row_count 0
query produced 5 rows x 7 columns in 8.2 seconds

## Metrics

The `metrics()` method returns timing and runtime metadata about the ResultSet.

```python
resultset = select(
    # column value metrics are computed when using FeatureColumns such as numerical() and categorical()
    numerical('1'),
    categorical('type', name='my_categorical_col')
).from_events('lethe4').limit(5).submit(analyze=True)

resultset.metrics()
```

Expression timing stats:

name expression mean(ms) p95(ms)
0 _c0 1 0.007 0.022
1 my_categorical_col type 0.015 0.046

Timeline & Event runtime:

min max mean total p50 p75 p95 p99 unit
timeline_runtime 7405 7405 7405 7405 7405.0 7405.0 7405.0 7405.00 micros
event_runtime 29 652 193 773 42.0 50.0 531.6 627.92 micros

Query summary:

key value
0 query_runtime_micros 21732
1 outcome_generated 0
2 timelines_processed 1
3 real_events_processed 5
4 event_passed_filters 5
5 opportunity_event_count 0
6 null_label_evaluation_count 0
7 null_weight_evaluation_count 0
8 events_processed 5
9 column_expanded False
10 dropped_empty_rows_after_conversion 0
11 positive_example_count 0
12 negative_example_count 5
13 empty_example_count 0
14 label_sum 0.0
15 weighted_label_sum 0.0

Column value metrics:

name column_index positive_count negative_count non_null_count non_zero_count positive_sum negative_sum positive_weighted_count negative_weighted_count label_sum weighted_label_sum min_numerical_value max_numerical_value total_weighted_count total_sum total_count
0 _c0 0 0 5 5 5 0.0 5.0 0.0 5.0 0.0 0.0 1.0 1.0 5.0 5.0 5
1 my_categorical_col_EQ_bid 1 0 3 3 0 0.0 0.0 0.0 3.0 0.0 0.0 NaN NaN 3.0 0.0 3
2 my_categorical_col_EQ_activity 1 0 1 1 0 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN 1.0 0.0 1
3 my_categorical_col_EQ_user 1 0 1 1 0 0.0 0.0 0.0 1.0 0.0 0.0 NaN NaN 1.0 0.0 1
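Several of the query-summary counters follow mechanically from the per-row labels and weights. A small stand-in sketch in plain Python (the labels are assumed, and the counting semantics here — e.g. treating label 0 as negative — are an assumption, not documented TQL behavior):

```python
# Five rows, all with label 0.0 and weight 1.0, mirroring the example query
rows = [(0.0, 1.0)] * 5

# Assumed definitions: positive = label > 0, negative = label == 0
positive_example_count = sum(1 for label, _ in rows if label > 0)
negative_example_count = sum(1 for label, _ in rows if label == 0)
label_sum = sum(label for label, _ in rows)
weighted_label_sum = sum(label * weight for label, weight in rows)

print(positive_example_count, negative_example_count, label_sum, weighted_label_sum)
```

This reproduces the summary above: 0 positive examples, 5 negative examples, and zero label sums.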