Creating Timelines
Before you can write TQL queries, you must import your raw time series data into TQL and construct timelines. A Timeline is simply a collection of events that share a common join key, for example a logged-in user_id or a web cookie, sorted by timestamp. In this guide we will create mappings from source data into TQL, inspect the data, and finally create a Project and Timelines that TQL queries can be executed against.
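Conceptually, then, a timeline is just a per-key, time-ordered list of events. The grouping can be sketched in plain Python (an illustration of the concept with made-up rows, not TQL's internal representation):

```python
from collections import defaultdict

# Illustrative events keyed by user_id (the join key), in arbitrary order.
events = [
    {"user_id": "u88", "event_time": "2021-01-11 00:01:19", "conversion": "product_view"},
    {"user_id": "u196", "event_time": "2021-01-11 00:00:03", "conversion": "product_view"},
    {"user_id": "u88", "event_time": "2021-01-11 00:00:51", "conversion": "product_view"},
]

# A timeline is one key's events, sorted by timestamp.
timelines = defaultdict(list)
for e in events:
    timelines[e["user_id"]].append(e)
for key in timelines:
    timelines[key].sort(key=lambda e: e["event_time"])
```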
Creating A TimeSeries
Your source data is typically stored in your data lake as flat files, and/or in a SQL data warehouse such as BigQuery or Snowflake. The mapping between your data source and TQL is called a TimeSeries. It is important to note that a TimeSeries object in TQL is not the data itself, but merely a recipe for reading the data out of storage into TQL. We start by calling the TQL function create_timeseries(), providing a source with a from_files() or from_sql() clause, and supplying the names of the common id and timestamp fields in the source data.
from zeenk.tql import *
-----------------------------------------------------------------
Version: 20.1.17-SNAPSHOT
Version Timestamp: 2021-08-27 20:56:17
Version Age: 2 days, 19 hours, 33 minutes, 6 seconds
Filesystem Root: /Users/zkozick/.tql/files
Working Directory: /Users/zkozick/.tql
Configuration File: /Users/zkozick/.tql/tql-conf.yml
Api Gateway: http://localhost:9000
Service Status: Icarus: ONLINE, Daedalus: ONLINE
Service Uptime: 1 minutes, 5 seconds
-----------------------------------------------------------------
from zeenk.tql.demo_projects import get_project4_files_dir
# this directory contains the time series data files for demo project 4.
base_dir = get_project4_files_dir()
activities = create_timeseries('activity')
activities.identified_by('user_id', 'event_time')
activities.from_files(
f'{base_dir}/activity*.csv',
has_csv_header=True
)
TimeSeries(name=activity,
timeline_id_col=user_id,
timestamp_col=event_time,
format=CSV,
data_path=gs://noumena...ctivity*.csv,
sql_statements=None)
Validating A TimeSeries
The first step after your TimeSeries has been created is to validate it with validate(). Validation checks that your data can be loaded from the external source, that the identity fields are not empty, and that the timestamp format is parsable.
activities.validate()
just a moment, setting up a PySpark session...
PySpark is now available via the function zeenk.tql.spark()
or just spark() if you used from zeenk.tql import *
validating 'activity'... no errors found.
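The three checks validate() performs can be sketched in plain Python over a list of row dicts (a minimal illustration of the idea, not TQL's actual implementation; the helper name check_rows is ours):

```python
from datetime import datetime

def check_rows(rows, id_col, ts_col, ts_format="%Y-%m-%d %H:%M:%S"):
    """Sketch of the three validation checks: rows can be loaded,
    ids are non-empty, and timestamps parse with the expected format."""
    errors = []
    if not rows:
        errors.append("no rows could be loaded")
    for i, row in enumerate(rows):
        if not row.get(id_col):
            errors.append(f"row {i}: empty {id_col}")
        try:
            datetime.strptime(row[ts_col], ts_format)
        except (KeyError, ValueError):
            errors.append(f"row {i}: unparsable {ts_col}")
    return errors

check_rows([{"user_id": "u196", "event_time": "2021-01-11 00:00:03"}],
           "user_id", "event_time")  # returns [] -- no errors found
```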
Loading A TimeSeries into Pandas or Spark
Often it is useful to load your time series data as a dataframe for inspection or external analysis. For this we can use the convenience functions pandas_dataframe() and spark_dataframe():
activities.pandas_dataframe()
 | activity_id | event_time | user_id | conversion |
---|---|---|---|---|
0 | a15784 | 2021-01-11 00:00:03 | u196 | product_view |
1 | a15785 | 2021-01-11 00:00:51 | u88 | product_view |
2 | a15786 | 2021-01-11 00:01:18 | u668 | product_view |
3 | a15787 | 2021-01-11 00:01:19 | u88 | product_view |
4 | a15788 | 2021-01-11 00:01:22 | u116 | product_view |
... | ... | ... | ... | ... |
25411 | a2924 | 2021-01-02 23:54:36 | u718 | product_view |
25412 | a2925 | 2021-01-02 23:55:02 | u80 | product_view |
25413 | a2926 | 2021-01-02 23:59:07 | u819 | product_view |
25414 | a2927 | 2021-01-02 23:59:35 | u264 | product_view |
25415 | a2928 | 2021-01-02 23:59:59 | u169 | product_view |
25416 rows × 4 columns
activities.spark_dataframe().show(5)
+-----------+-------------------+-------+------------+
|activity_id| event_time|user_id| conversion|
+-----------+-------------------+-------+------------+
| a15784|2021-01-11 00:00:03| u196|product_view|
| a15785|2021-01-11 00:00:51| u88|product_view|
| a15786|2021-01-11 00:01:18| u668|product_view|
| a15787|2021-01-11 00:01:19| u88|product_view|
| a15788|2021-01-11 00:01:22| u116|product_view|
+-----------+-------------------+-------+------------+
only showing top 5 rows
Getting an Example TimeSeries Row
You can get an example row from your time series using get_example():
activities.get_example()
{'activity_id': 'a15784',
'event_time': '2021-01-11 00:00:03',
'user_id': 'u196',
'conversion': 'product_view'}
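Because examples come back as plain dicts, standard Python tooling applies directly; for instance, tallying conversion types over a few hand-made rows in the same shape:

```python
from collections import Counter

# Hypothetical sample rows shaped like the get_example() output above.
rows = [
    {"activity_id": "a15784", "user_id": "u196", "conversion": "product_view"},
    {"activity_id": "a15785", "user_id": "u88", "conversion": "product_view"},
    {"activity_id": "a15786", "user_id": "u668", "conversion": "purchase"},
]

counts = Counter(row["conversion"] for row in rows)
print(counts)  # Counter({'product_view': 2, 'purchase': 1})
```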
Visualizing A TimeSeries
It is often helpful to inspect your time series visually. For this we provide the function visualize(), available as a Jupyter extension:
activities.visualize()
preparing data...
Timeline(events='[{"_type": "activity", "_timeline_id": "u196", "_datetime": "2021-01-11 00:00:03", "activity_…
Analyzing A TimeSeries
TQL also provides a quick way to understand the shape of your TimeSeries, via the function analyze():
activities.analyze()
SUCCESS: 224/224|██████████|, TOTAL ELAPSED=0:00:10, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:09
Timeline Identity Column: user_id
Timestamp Column: event_time
File Format: CSV
Number of Events: 25416
Number of Timelines: 876
Data Path: gs://noumena-public/data/lethe_demo_set_randomize_users_id4_v1.3/activity*.csv
Column | Cardinality | Range | Type |
---|---|---|---|
user_id | 876 | u0 - u999 | string |
activity_id | 25416 | a0 - i999 | string |
conversion | 3 | add_to_cart - purchase | string |
Value | Number of Occurrences | Pct of All Events |
---|---|---|
a15822 | 1 | 0.0% |
a15890 | 1 | 0.0% |
a16415 | 1 | 0.0% |
a16450 | 1 | 0.0% |
a16462 | 1 | 0.0% |
Value | Number of Occurrences | Pct of All Events |
---|---|---|
product_view | 22251 | 87.55% |
add_to_cart | 2165 | 8.52% |
purchase | 1000 | 3.93% |
Value | Number of Occurrences | Pct of All Events |
---|---|---|
u134 | 171 | 0.67% |
u634 | 156 | 0.61% |
u482 | 151 | 0.59% |
u677 | 149 | 0.59% |
u963 | 148 | 0.58% |
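The Pct of All Events column is each value's share of the 25416 total events; the conversion table above can be reproduced arithmetically:

```python
# Shares of all events, matching the conversion table from analyze().
total = 25416
for value, n in [("product_view", 22251), ("add_to_cart", 2165), ("purchase", 1000)]:
    print(f"{value}: {100 * n / total:.2f}%")
# product_view: 87.55%
# add_to_cart: 8.52%
# purchase: 3.93%
```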
Creating Another TimeSeries
Usually, a set of timelines will consist of more than a single time series. For example, we may wish to augment our site activity with a set of bids we made to show ads to the same set of users. For this we simply create another TimeSeries object, which we will eventually add to our project to build timelines.
bids = create_timeseries('bid')
bids.identified_by('user_id', 'event_time')
bids.from_files(
f'{base_dir}/bid*.csv',
has_csv_header=True
)
TimeSeries(name=bid,
timeline_id_col=user_id,
timestamp_col=event_time,
format=CSV,
data_path=gs://noumena...1.3/bid*.csv,
sql_statements=None)
Creating a TimeSeries from an External SQL Store
It is common to already have your time series data loaded into a data warehouse engine for analytics. TQL supports directly ingesting data from an external source with the function from_sql(). Flat files, BigQuery, and Snowflake (with the appropriate credentials) are supported out of the box. Using SQL to map your data into TQL has the added benefit of letting you manipulate your data upon ingestion: common uses include filtering out bot, duplicate, or garbage data, or augmenting your time series with additional attributes from another table using a JOIN. In the following example, we assign a proxy timestamp to all our user records, as well as add a duration field:
users = create_timeseries('user')
users.identified_by('user_id', 'timestamp', duration_col='duration')
users.from_sql(f'''
CREATE OR REPLACE TEMPORARY VIEW users
USING CSV options (
path="{base_dir}/user.csv",
inferSchema="true",
header="true"
)''',
# '''create or replace temporary view users_filtered as select u.* from users where <fancy bot filter statement here>''',
'SELECT u.*, "2021-01-01" timestamp, "1" duration FROM users u'
)
TimeSeries(name=user,
timeline_id_col=user_id,
timestamp_col=timestamp,
format=SQL,
data_path=None,
sql_statements=['\n CREATE OR...r="true"\n )', 'SELECT u.*, ...FROM users u'])
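The manipulate-on-read pattern is plain SQL, so it can be illustrated with any engine. Here is a standalone sketch using Python's built-in sqlite3 with a stand-in table and data; in TQL, the equivalent statements would be passed to from_sql() and run against your warehouse:

```python
import sqlite3

# Stand-in users table with a duplicate row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("u1", "south"), ("u1", "south"), ("u2", "west")],
)

# Deduplicate on read and attach a proxy timestamp and duration column,
# mirroring the SELECT in the from_sql() example above.
rows = conn.execute("""
    SELECT DISTINCT u.*, '2021-01-01' AS timestamp, '1' AS duration
    FROM users u
""").fetchall()
```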
Creating A Project and Building Timelines
With our TimeSeries objects defined, we combine them into a Project and build the timelines that queries will run against:
project = create_project('advertising_funnel_data', or_update=True)
project.from_timeseries(bids, activities, users)
timelines = project.build_timelines()
SUCCESS: 234/234|██████████|, TOTAL ELAPSED=0:00:22, WAIT=0:00:00, PENDING=0:00:04, COMPILING=0:00:18
Listing All Available Projects
You can list all available projects using show_projects(). Note that TQL ships with four demo projects to help you learn the basics of TQL without having to ingest any time series data:
show_projects()
id | name | description | timeseries | timelines | events | status |
---|---|---|---|---|---|---|
1 | acheron | Acheron synthetic data, 2000 users, 5 requests per user, v1.1 | ['request', 'conversion', 'user'] | 2000.000000 | 14220.000000 | ready for queries |
2 | acheron2 | Acheron 2 synthetic data, reasonably_rich, 200 users, 10 days, v1.1 | ['request', 'conversion', 'user'] | 200.000000 | 13464.000000 | ready for queries |
3 | lethe3 | Lethe synthetic data, demo_set, random treatment, 1000 users, 14 days, v1.3 | ['bid', 'activity', 'user'] | 1000.000000 | 198341.000000 | ready for queries |
4 | lethe4 | Lethe synthetic data, demo_set, random users, 1000 users, 14 days, v1.3 | ['bid', 'activity', 'user'] | 1000.000000 | 198393.000000 | ready for queries |
10 | unittest | this is a unit test project | ['request'] | 1990.000000 | 12220.000000 | ready for queries |
101 | None | unittest | ['unittest'] | nan | nan | timelines not yet created |
102 | advertising_funnel_data | None | ['bid', 'activity', 'user'] | 1000.000000 | 198393.000000 | ready for queries |
103 | demo_project | None | ['activity', 'bid', 'user'] | 1000.000000 | 198393.000000 | ready for queries |
Loading a Project Definition
Previously built projects can be recalled using load_project(). This is often useful to inspect the TQL expression attributes that are addressable in queries.
load_project('advertising_funnel_data')
id | name | description | timeseries | status |
---|---|---|---|---|
102 | advertising_funnel_data | No description yet added | ['bid', 'activity', 'user'] | ready for queries |
Statistic | Value |
---|---|
Total Timelines | 1000 |
Smallest Timeline | 1 events |
Largest Timeline | 904 events |
Min Timestamp | 2021-01-01 00:00:00 |
Max Timestamp | 2021-01-14 23:59:59 |
Total Events | 198393 |
Activity Events | 25416 |
Bid Events | 171977 |
User Events | 1000 |
TQL Expression | Description | Type |
---|---|---|
activity.activity_id | None | string |
activity.conversion | None | string |
activity.event_time | None | string |
activity.user_id | None | string |
bid.ad_size | None | string |
bid.bid | None | double |
bid.event_time | None | string |
bid.ghosted | None | boolean |
bid.request_id | None | string |
bid.user_id | None | string |
bid.won | None | boolean |
duration | The duration of the event. Typically used with generated events. | long |
event | The current event object | event |
generated | This event was created by generate_events() | boolean |
generated_weight | The weight to apply to the sample | double |
id | The auto-generated event id | long |
index | The index of the current event in the timeline | integer |
outcome | The outcome event. Only used in conjunction with SUM_OPPORTUNITIES() and CAUSAL_EFFECT | event |
timeline.events | Time-ordered array of events | array |
timeline.id | Shared timeline id | string |
timestamp | The timestamp of the event | long |
type | The event type of this event | string |
user.age | None | string |
user.duration | None | string |
user.gender | None | string |
user.group | None | string |
user.region | None | string |
user.timestamp | None | string |
user.user_id | None | string |
Using A Project In A TQL Query
Use select(...).from_events('project_name') to write Queries that produce ResultSets from your timelines.
select('*').from_events('advertising_funnel_data').limit(10)
Query results:
partition "_default"
 | activity_activity_id | activity_conversion | activity_event_time | activity_user_id | bid_ad_size | bid_bid | bid_event_time | bid_ghosted | bid_request_id | bid_user_id | ... | timeline_id | timestamp | type | user_age | user_duration | user_gender | user_group | user_region | user_timestamp | user_user_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | None | None | None | None | None | None | None | None | None | ... | u252 | 1609459200000 | user | 55plus | None | male | influenced | south | 2021-01-01 | u252 |
1 | None | None | None | None | big | 1.0 | 2021-01-02 00:02:36 | false | r13260 | u252 | ... | u252 | 1609545756000 | bid | None | None | None | None | None | None | None |
2 | None | None | None | None | big | 1.0 | 2021-01-02 00:16:41 | false | r13376 | u252 | ... | u252 | 1609546601000 | bid | None | None | None | None | None | None | None |
3 | None | None | None | None | small | 1.0 | 2021-01-02 00:30:55 | false | r13509 | u252 | ... | u252 | 1609547455000 | bid | None | None | None | None | None | None | None |
4 | i176 | product_view | 2021-01-02 00:57:47 | u252 | None | None | None | None | None | None | ... | u252 | 1609549067000 | activity | None | None | None | None | None | None | None |
5 | None | None | None | None | small | 1.0 | 2021-01-02 01:04:01 | false | r13737 | u252 | ... | u252 | 1609549441000 | bid | None | None | None | None | None | None | None |
6 | None | None | None | None | big | 1.0 | 2021-01-02 01:37:26 | false | r13985 | u252 | ... | u252 | 1609551446000 | bid | None | None | None | None | None | None | None |
7 | i187 | product_view | 2021-01-02 02:15:43 | u252 | None | None | None | None | None | None | ... | u252 | 1609553743000 | activity | None | None | None | None | None | None | None |
8 | None | None | None | None | big | 1.0 | 2021-01-02 02:18:30 | false | r14310 | u252 | ... | u252 | 1609553910000 | bid | None | None | None | None | None | None | None |
9 | None | None | None | None | big | 1.0 | 2021-01-02 03:01:48 | false | r14607 | u252 | ... | u252 | 1609556508000 | bid | None | None | None | None | None | None | None |
10 rows × 27 columns
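Note that the timestamp column in query results is epoch milliseconds; converting the first row's value back to a readable datetime:

```python
from datetime import datetime, timezone

# The first row's timestamp from the query results above.
ts_millis = 1609459200000
dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-01-01 00:00:00
```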
See the full guide to writing TQL Queries to continue from here.