Creating Timelines

Before you can write TQL queries, you must import your raw time series data into TQL and construct timelines. A Timeline is simply a collection of events identified by a common join key, for example a logged in user_id or web cookie, and sorted by timestamp. In this guide we will create a mappings from source data into TQL, inspect the data, then finally create a Project and Timelines that TQL queries can be executed against.

Creating A TimeSeries

Your source data is typically be stored in your data lake in flat files, and/or in a SQL-like data warehouse such as Big Query or Snowflake. The mapping between your data source and TQL is called a TimeSeries. It is important to note that a TimeSeries object in TQL is not the data itself, but merely a script on how to read the data out of storage into TQL. We start by using the TQL function create_timeseries(), providing a source using a from_files() or from_sql() clause, and providing the names of the common id and timestamp fields in the source data.

from zeenk.tql import *
-----------------------------------------------------------------
           Version: 20.1.17-SNAPSHOT
 Version Timestamp: 2021-08-27 20:56:17
       Version Age: 2 days, 19 hours, 33 minutes, 6 seconds
   Filesystem Root: /Users/zkozick/.tql/files
 Working Directory: /Users/zkozick/.tql
Configuration File: /Users/zkozick/.tql/tql-conf.yml
       Api Gateway: http://localhost:9000
    Service Status: Icarus: ONLINE, Daedalus: ONLINE
    Service Uptime: 1 minutes, 5 seconds
-----------------------------------------------------------------
from zeenk.tql.demo_projects import get_project4_files_dir

# this directory contains the time series data files for demo project 4.
base_dir = get_project4_files_dir()

activities = create_timeseries('activity')
activities.identified_by('user_id', 'event_time')
activities.from_files(
    f'{base_dir}/activity*.csv',
     has_csv_header=True
  ) 
TimeSeries(name=activity, 
	timeline_id_col=user_id, 
	timestamp_col=event_time, 
	format=CSV, 
	data_path=gs://noumena...ctivity*.csv, 
	sql_statements=None)

Validating A TimeSeries

the first step after your TimeSeries has been created is to validate it with validate(). Validation checks to make sure your data can be loaded from the external source, the identity fields are not empty, and the timestamp format is parsable.

activities.validate()
just a moment, setting up a PySpark session...
PySpark is now available via the function zeenk.tql.spark()
or just spark() if you used from zeenk.tql import *
validating 'activity'... no errors found.

Loading A TimeSeries into Pandas or Spark

Often it is useful to load your timeseries data as a dataframe for inspection or external analysis. For this we can use the convienence functions pandas_dataframe() and spark_dataframe():

activities.pandas_dataframe()
activity_id event_time user_id conversion
0 a15784 2021-01-11 00:00:03 u196 product_view
1 a15785 2021-01-11 00:00:51 u88 product_view
2 a15786 2021-01-11 00:01:18 u668 product_view
3 a15787 2021-01-11 00:01:19 u88 product_view
4 a15788 2021-01-11 00:01:22 u116 product_view
... ... ... ... ...
25411 a2924 2021-01-02 23:54:36 u718 product_view
25412 a2925 2021-01-02 23:55:02 u80 product_view
25413 a2926 2021-01-02 23:59:07 u819 product_view
25414 a2927 2021-01-02 23:59:35 u264 product_view
25415 a2928 2021-01-02 23:59:59 u169 product_view

25416 rows × 4 columns

activities.spark_dataframe().show(5)
+-----------+-------------------+-------+------------+
|activity_id|         event_time|user_id|  conversion|
+-----------+-------------------+-------+------------+
|     a15784|2021-01-11 00:00:03|   u196|product_view|
|     a15785|2021-01-11 00:00:51|    u88|product_view|
|     a15786|2021-01-11 00:01:18|   u668|product_view|
|     a15787|2021-01-11 00:01:19|    u88|product_view|
|     a15788|2021-01-11 00:01:22|   u116|product_view|
+-----------+-------------------+-------+------------+
only showing top 5 rows

Getting an Example TimeSeries Row

You can get an example from your time series using get_example():

activities.get_example()
{'activity_id': 'a15784',
 'event_time': '2021-01-11 00:00:03',
 'user_id': 'u196',
 'conversion': 'product_view'}

Visualizing A TimeSeries

It is often helpful to inspect your timeseries visually. For this we provide the function visualize(), available as a Jupyter extension:

activities.visualize()
preparing data...



Timeline(events='[{"_type": "activity", "_timeline_id": "u196", "_datetime": "2021-01-11 00:00:03", "activity_…

Analyzing A TimeSeries

TQL also provides a quick way to understand the shape of your TimeSeries, via the function analyze():

activities.analyze()

Click to Abort Job

NOTE: Abort link will open the result of the call in a new tab/window

SUCCESS:  224/224|██████████|, TOTAL ELAPSED=0:00:10, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:09
TimeSeries Name: activity
Timeline Identity Column: user_id
Timestamp Column: event_time
File Format: CSV
Number of Events: 25416
Number of Timelines: 876
Data Path: gs://noumena-public/data/lethe_demo_set_randomize_users_id4_v1.3/activity*.csv

TimeSeries Statistics
Column Cardinality Range Type
user_id 876 u0 - u999 string
activity_id 25416 a0 - i999 string
conversion 3 add_to_cart - purchase string

Top 5 Values of activity_id:
Value Number of Occurrences Pct of All Events
a15822 1 0.0%
a15890 1 0.0%
a16415 1 0.0%
a16450 1 0.0%
a16462 1 0.0%
Top 5 Values of conversion:
Value Number of Occurrences Pct of All Events
product_view 22251 87.55%
add_to_cart 2165 8.52%
purchase 1000 3.93%
Top 5 Values of user_id:
Value Number of Occurrences Pct of All Events
u134 171 0.67%
u634 156 0.61%
u482 151 0.59%
u677 149 0.59%
u963 148 0.58%

Creating Another TimeSeries

Usually, a set of timelines will consist of more than a single time series. For example, we may wish to augment our site activity with a set of bids we made to show ads to the same set of users. For this we simply create another TimeSeries object, which we will eventually add to our project to build timelines.

bids = create_timeseries('bid')
bids.identified_by('user_id', 'event_time')
bids.from_files(
    f'{base_dir}/bid*.csv',
     has_csv_header=True
  ) 
TimeSeries(name=bid, 
	timeline_id_col=user_id, 
	timestamp_col=event_time, 
	format=CSV, 
	data_path=gs://noumena...1.3/bid*.csv, 
	sql_statements=None)

Creating a TimeSeries from an External SQL Store

It is common to already have your time series data loaded into a data warehouse engine for analytics. TQL supports directly ingesting data from an external source with the function from_sql() Flat files, Big Query, and Snowflake (with the appropriate credentials) are supported out of the box. Using SQL to map your data into TQL has the added benefit of being able to manipulate your data upon ingestion. Other common uses for SQL-style ingestion would be to filter out bot, duplicate, or garbage data, or to augment your timeseries with additional attributes from another table using a JOIN statement. In the following example, we assign a proxy timestamp to all our user records, as well as add a duration field:

users = create_timeseries('user')
users.identified_by('user_id', 'timestamp', duration_col='duration')
users.from_sql(f'''
  CREATE OR REPLACE TEMPORARY VIEW users 
  USING CSV options (
    path="{base_dir}/user.csv",
    inferSchema="true",
    header="true"
  )''',
#   '''create or replace temporary view users_filtered as select u.* from users where <fancy bot filter statement here>''',
  'SELECT u.*, "2021-01-01" timestamp, "1" duration FROM users u'
)
TimeSeries(name=user, 
	timeline_id_col=user_id, 
	timestamp_col=timestamp, 
	format=SQL, 
	data_path=None, 
	sql_statements=['\n  CREATE OR...r="true"\n  )', 'SELECT u.*, ...FROM users u'])
project = create_project('advertising_funnel_data', or_update=True)
project.from_timeseries(bids, activities, users)
timelines = project.build_timelines()

Click to Abort Job

NOTE: Abort link will open the result of the call in a new tab/window

SUCCESS:  234/234|██████████|, TOTAL ELAPSED=0:00:22, WAIT=0:00:00, PENDING=0:00:04, COMPILING=0:00:18  

Listing All Available Projects

You can list all available projects by using show_projects(). Note that TQL ships with four demo projects to help you learn the basics of TQL, without having to ingest any time series data:

show_projects()
id name description timeseries timelines events status
1 acheron Acheron synthetic data, 2000 users, 5 requests per user, v1.1 ['request', 'conversion', 'user'] 2000.000000 14220.000000 ready for queries
2 acheron2 Acheron 2 synthetic data, reasonably_rich, 200 users, 10 days, v1.1 ['request', 'conversion', 'user'] 200.000000 13464.000000 ready for queries
3 lethe3 Lethe synthetic data, demo_set, random treatment, 1000 users, 14 days, v1.3 ['bid', 'activity', 'user'] 1000.000000 198341.000000 ready for queries
4 lethe4 Lethe synthetic data, demo_set, random users, 1000 users, 14 days, v1.3 ['bid', 'activity', 'user'] 1000.000000 198393.000000 ready for queries
10 unittest this is a unit test project ['request'] 1990.000000 12220.000000 ready for queries
101 None unittest ['unittest'] nan nan timelines not yet created
102 advertising_funnel_data None ['bid', 'activity', 'user'] 1000.000000 198393.000000 ready for queries
103 demo_project None ['activity', 'bid', 'user'] 1000.000000 198393.000000 ready for queries

Loading a Project Definition

Previously built projects can be recalled using load_project(). This is often useful to inspect the TQL expression attributes that are adressable in queries.

load_project('advertising_funnel_data')
Table Overview
id name description timeseries status
102 advertising_funnel_data No description yet added ['bid', 'activity', 'user'] ready for queries

Timeline Statistics
0 1
Total Timelines 1000
Smallest Timeline 1 events
Largest Timeline 904 events
Min Timestamp 2021-01-01 00:00:00
Max Timestamp 2021-01-14 23:59:59
Total Events 198393
Activity Events 25416
Bid Events 171977
User Events 1000

Attributes
TQL Expression Description Type
activity.activity_id None string
activity.conversion None string
activity.event_time None string
activity.user_id None string
bid.ad_size None string
bid.bid None double
bid.event_time None string
bid.ghosted None boolean
bid.request_id None string
bid.user_id None string
bid.won None boolean
duration The duration of the event. Typically used with generated events. long
event The current event object event
generated This event was created by generate_events() boolean
generated_weight The weight to apply to the the sample double
id The auto-generated event id long
index The index of the current event in the timeline integer
outcome The outcome event. Only used in conjunction with SUM_OPPORTUNITIES() and CAUSAL_EFFFCT event
timeline.events Time-ordered array of events array
timeline.id Shared timeline id string
timestamp The timestamp of the event long
type The event type of this event string
user.age None string
user.duration None string
user.gender None string
user.group None string
user.region None string
user.timestamp None string
user.user_id None string

Using A Project In A TQL Query

use select(...).from_events('project_name') to write Queries that produce ResultSets from your timelines.

select('*').from_events('advertising_funnel_data').limit(10)

Query results:

partition "_default"
activity_activity_id activity_conversion activity_event_time activity_user_id bid_ad_size bid_bid bid_event_time bid_ghosted bid_request_id bid_user_id ... timeline_id timestamp type user_age user_duration user_gender user_group user_region user_timestamp user_user_id
0 None None None None None None None None None None ... u252 1609459200000 user 55plus None male influenced south 2021-01-01 u252
1 None None None None big 1.0 2021-01-02 00:02:36 false r13260 u252 ... u252 1609545756000 bid None None None None None None None
2 None None None None big 1.0 2021-01-02 00:16:41 false r13376 u252 ... u252 1609546601000 bid None None None None None None None
3 None None None None small 1.0 2021-01-02 00:30:55 false r13509 u252 ... u252 1609547455000 bid None None None None None None None
4 i176 product_view 2021-01-02 00:57:47 u252 None None None None None None ... u252 1609549067000 activity None None None None None None None
5 None None None None small 1.0 2021-01-02 01:04:01 false r13737 u252 ... u252 1609549441000 bid None None None None None None None
6 None None None None big 1.0 2021-01-02 01:37:26 false r13985 u252 ... u252 1609551446000 bid None None None None None None None
7 i187 product_view 2021-01-02 02:15:43 u252 None None None None None None ... u252 1609553743000 activity None None None None None None None
8 None None None None big 1.0 2021-01-02 02:18:30 false r14310 u252 ... u252 1609553910000 bid None None None None None None None
9 None None None None big 1.0 2021-01-02 03:01:48 false r14607 u252 ... u252 1609556508000 bid None None None None None None None

10 rows × 27 columns

query produced 10 rows x 27 columns in 1.12 seconds

Go here for the full guide to writing TQL Queries.