Creating Timelines
Before you can write TQL queries, you must import your raw time series data into TQL and construct timelines. A Timeline is simply a collection of events that share a common join key, for example a logged-in user_id or a web cookie, sorted by timestamp. In this guide we will create mappings from source data into TQL, inspect the data, and finally create a Project and Timelines that TQL queries can be executed against.
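Conceptually, then, a timeline is just a per-key, time-ordered list of events. The grouping can be sketched in plain Python (an illustration of the concept with made-up rows, not TQL's internal representation):

```python
from collections import defaultdict

# Illustrative events keyed by user_id (the join key), in arbitrary order.
events = [
    {"user_id": "u88", "event_time": "2021-01-11 00:01:19", "conversion": "product_view"},
    {"user_id": "u196", "event_time": "2021-01-11 00:00:03", "conversion": "product_view"},
    {"user_id": "u88", "event_time": "2021-01-11 00:00:51", "conversion": "product_view"},
]

# A timeline is one key's events, sorted by timestamp.
timelines = defaultdict(list)
for e in events:
    timelines[e["user_id"]].append(e)
for key in timelines:
    timelines[key].sort(key=lambda e: e["event_time"])
```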
Creating A TimeSeries
Your source data is typically stored in your data lake as flat files, and/or in a SQL data warehouse such as BigQuery or Snowflake. The mapping between your data source and TQL is called a TimeSeries. It is important to note that a TimeSeries object in TQL is not the data itself, but merely a recipe for reading the data out of storage into TQL. We start by calling the TQL function create_timeseries(), providing a source with a from_files() or from_sql() clause, and supplying the names of the common id and timestamp fields in the source data.
from zeenk.tql import *
-----------------------------------------------------------------
Version: 20.1.17-SNAPSHOT
Version Timestamp: 2021-08-27 20:56:17
Version Age: 2 days, 19 hours, 33 minutes, 6 seconds
Filesystem Root: /Users/zkozick/.tql/files
Working Directory: /Users/zkozick/.tql
Configuration File: /Users/zkozick/.tql/tql-conf.yml
Api Gateway: http://localhost:9000
Service Status: Icarus: ONLINE, Daedalus: ONLINE
Service Uptime: 1 minutes, 5 seconds
-----------------------------------------------------------------
from zeenk.tql.demo_projects import get_project4_files_dir
# this directory contains the time series data files for demo project 4.
base_dir = get_project4_files_dir()
activities = create_timeseries('activity')
activities.identified_by('user_id', 'event_time')
activities.from_files(
f'{base_dir}/activity*.csv',
has_csv_header=True
)
TimeSeries(name=activity,
timeline_id_col=user_id,
timestamp_col=event_time,
format=CSV,
data_path=gs://noumena...ctivity*.csv,
sql_statements=None)
Validating A TimeSeries
The first step after your TimeSeries has been created is to validate it with validate(). Validation checks that your data can be loaded from the external source, that the identity fields are not empty, and that the timestamp format is parsable.
activities.validate()
just a moment, setting up a PySpark session...
PySpark is now available via the function zeenk.tql.spark()
or just spark() if you used from zeenk.tql import *
validating 'activity'... no errors found.
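The three checks validate() performs can be sketched in plain Python over a list of row dicts (a minimal illustration of the idea, not TQL's actual implementation; the helper name check_rows is ours):

```python
from datetime import datetime

def check_rows(rows, id_col, ts_col, ts_format="%Y-%m-%d %H:%M:%S"):
    """Sketch of the three validation checks: rows can be loaded,
    ids are non-empty, and timestamps parse with the expected format."""
    errors = []
    if not rows:
        errors.append("no rows could be loaded")
    for i, row in enumerate(rows):
        if not row.get(id_col):
            errors.append(f"row {i}: empty {id_col}")
        try:
            datetime.strptime(row[ts_col], ts_format)
        except (KeyError, ValueError):
            errors.append(f"row {i}: unparsable {ts_col}")
    return errors

check_rows([{"user_id": "u196", "event_time": "2021-01-11 00:00:03"}],
           "user_id", "event_time")  # returns [] -- no errors found
```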
Loading A TimeSeries into Pandas or Spark
Often it is useful to load your time series data as a dataframe for inspection or external analysis. For this we can use the convenience functions pandas_dataframe() and spark_dataframe():
activities.pandas_dataframe()
 | activity_id | event_time | user_id | conversion |
---|---|---|---|---|
0 | a15784 | 2021-01-11 00:00:03 | u196 | product_view |
1 | a15785 | 2021-01-11 00:00:51 | u88 | product_view |
2 | a15786 | 2021-01-11 00:01:18 | u668 | product_view |
3 | a15787 | 2021-01-11 00:01:19 | u88 | product_view |
4 | a15788 | 2021-01-11 00:01:22 | u116 | product_view |
... | ... | ... | ... | ... |
25411 | a2924 | 2021-01-02 23:54:36 | u718 | product_view |
25412 | a2925 | 2021-01-02 23:55:02 | u80 | product_view |
25413 | a2926 | 2021-01-02 23:59:07 | u819 | product_view |
25414 | a2927 | 2021-01-02 23:59:35 | u264 | product_view |
25415 | a2928 | 2021-01-02 23:59:59 | u169 | product_view |
25416 rows × 4 columns
activities.spark_dataframe().show(5)
+-----------+-------------------+-------+------------+
|activity_id| event_time|user_id| conversion|
+-----------+-------------------+-------+------------+
| a15784|2021-01-11 00:00:03| u196|product_view|
| a15785|2021-01-11 00:00:51| u88|product_view|
| a15786|2021-01-11 00:01:18| u668|product_view|
| a15787|2021-01-11 00:01:19| u88|product_view|
| a15788|2021-01-11 00:01:22| u116|product_view|
+-----------+-------------------+-------+------------+
only showing top 5 rows
Getting an Example TimeSeries Row
You can get an example row from your time series using get_example():
activities.get_example()
{'activity_id': 'a15784',
'event_time': '2021-01-11 00:00:03',
'user_id': 'u196',
'conversion': 'product_view'}
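Because examples come back as plain dicts, standard Python tooling applies directly; for instance, tallying conversion types over a few hand-made rows in the same shape:

```python
from collections import Counter

# Hypothetical sample rows shaped like the get_example() output above.
rows = [
    {"activity_id": "a15784", "user_id": "u196", "conversion": "product_view"},
    {"activity_id": "a15785", "user_id": "u88", "conversion": "product_view"},
    {"activity_id": "a15786", "user_id": "u668", "conversion": "purchase"},
]

counts = Counter(row["conversion"] for row in rows)
print(counts)  # Counter({'product_view': 2, 'purchase': 1})
```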
Visualizing A TimeSeries
It is often helpful to inspect your time series visually. For this we provide the function visualize(), available as a Jupyter extension:
activities.visualize()
preparing data...
Timeline(events='[{"_type": "activity", "_timeline_id": "u196", "_datetime": "2021-01-11 00:00:03", "activity_…
Analyzing A TimeSeries
TQL also provides a quick way to understand the shape of your TimeSeries, via the function analyze():
activities.analyze()
SUCCESS: 224/224|██████████|, TOTAL ELAPSED=0:00:10, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:09
Timeline Identity Column: user_id
Timestamp Column: event_time
File Format: CSV
Number of Events: 25416
Number of Timelines: 876
Data Path: gs://noumena-public/data/lethe_demo_set_randomize_users_id4_v1.3/activity*.csv
Column | Cardinality | Range | Type |
---|---|---|---|
user_id | 876 | u0 - u999 | string |
activity_id | 25416 | a0 - i999 | string |
conversion | 3 | add_to_cart - purchase | string |
Value | Number of Occurrences | Pct of All Events |
---|---|---|
a15822 | 1 | 0.0% |
a15890 | 1 | 0.0% |
a16415 | 1 | 0.0% |
a16450 | 1 | 0.0% |
a16462 | 1 | 0.0% |
Value | Number of Occurrences | Pct of All Events |
---|---|---|
product_view | 22251 | 87.55% |
add_to_cart | 2165 | 8.52% |
purchase | 1000 | 3.93% |
Value | Number of Occurrences | Pct of All Events |
---|---|---|
u134 | 171 | 0.67% |
u634 | 156 | 0.61% |
u482 | 151 | 0.59% |
u677 | 149 | 0.59% |
u963 | 148 | 0.58% |
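The Pct of All Events column is each value's share of the 25416 total events; the conversion table above can be reproduced arithmetically:

```python
# Shares of all events, matching the conversion table from analyze().
total = 25416
for value, n in [("product_view", 22251), ("add_to_cart", 2165), ("purchase", 1000)]:
    print(f"{value}: {100 * n / total:.2f}%")
# product_view: 87.55%
# add_to_cart: 8.52%
# purchase: 3.93%
```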
Creating Another TimeSeries
Usually, a set of timelines will consist of more than a single time series. For example, we may wish to augment our site activity with a set of bids we made to show ads to the same set of users. For this we simply create another TimeSeries object, which we will eventually add to our project to build timelines.
bids = create_timeseries('bid')
bids.identified_by('user_id', 'event_time')
bids.from_files(
f'{base_dir}/bid*.csv',
has_csv_header=True
)
TimeSeries(name=bid,
timeline_id_col=user_id,
timestamp_col=event_time,
format=CSV,
data_path=gs://noumena...1.3/bid*.csv,
sql_statements=None)
Creating a TimeSeries from an External SQL Store
It is common to already have your time series data loaded into a data warehouse engine for analytics. TQL supports directly ingesting data from an external source with the function from_sql(). Flat files, BigQuery, and Snowflake (with the appropriate credentials) are supported out of the box. Using SQL to map your data into TQL has the added benefit of letting you manipulate your data upon ingestion: common uses include filtering out bot, duplicate, or garbage data, or augmenting your time series with additional attributes from another table using a JOIN. In the following example, we assign a proxy timestamp to all our user records, as well as add a duration field:
users = create_timeseries('user')
users.identified_by('user_id', 'timestamp', duration_col='duration')
users.from_sql(f'''
CREATE OR REPLACE TEMPORARY VIEW users
USING CSV options (
path="{base_dir}/user.csv",
inferSchema="true",
header="true"
)''',
# '''create or replace temporary view users_filtered as select u.* from users where <fancy bot filter statement here>''',
'SELECT u.*, "2021-01-01" timestamp, "1" duration FROM users u'
)
TimeSeries(name=user,
timeline_id_col=user_id,
timestamp_col=timestamp,
format=SQL,
data_path=None,
sql_statements=['\n CREATE OR...r="true"\n )', 'SELECT u.*, ...FROM users u'])
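The manipulate-on-read pattern is plain SQL, so it can be illustrated with any engine. Here is a standalone sketch using Python's built-in sqlite3 with a stand-in table and data; in TQL, the equivalent statements would be passed to from_sql() and run against your warehouse:

```python
import sqlite3

# Stand-in users table with a duplicate row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("u1", "south"), ("u1", "south"), ("u2", "west")],
)

# Deduplicate on read and attach a proxy timestamp and duration column,
# mirroring the SELECT in the from_sql() example above.
rows = conn.execute("""
    SELECT DISTINCT u.*, '2021-01-01' AS timestamp, '1' AS duration
    FROM users u
""").fetchall()
```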
Creating A Project and Building Timelines
With our TimeSeries objects defined, we combine them into a Project and build the timelines that queries will run against:
project = create_project('advertising_funnel_data', or_update=True)
project.from_timeseries(bids, activities, users)
timelines = project.build_timelines()
SUCCESS: 234/234|██████████|, TOTAL ELAPSED=0:00:22, WAIT=0:00:00, PENDING=0:00:04, COMPILING=0:00:18
Listing All Available Projects
You can list all available projects using show_projects(). Note that TQL ships with four demo projects to help you learn the basics of TQL without having to ingest any time series data:
show_projects()
id | name | description | timeseries | timelines | events | status |
---|---|---|---|---|---|---|
1 | acheron | Acheron synthetic data, 2000 users, 5 requests per user, v1.1 | ['request', 'conversion', 'user'] | 2000.000000 | 14220.000000 | ready for queries |
2 | acheron2 | Acheron 2 synthetic data, reasonably_rich, 200 users, 10 days, v1.1 | ['request', 'conversion', 'user'] | 200.000000 | 13464.000000 | ready for queries |
3 | lethe3 | Lethe synthetic data, demo_set, random treatment, 1000 users, 14 days, v1.3 | ['bid', 'activity', 'user'] | 1000.000000 | 198341.000000 | ready for queries |
4 | lethe4 | Lethe synthetic data, demo_set, random users, 1000 users, 14 days, v1.3 | ['bid', 'activity', 'user'] | 1000.000000 | 198393.000000 | ready for queries |
10 | unittest | this is a unit test project | ['request'] | 1990.000000 | 12220.000000 | ready for queries |
101 | None | unittest | ['unittest'] | nan | nan | timelines not yet created |
102 | advertising_funnel_data | None | ['bid', 'activity', 'user'] | 1000.000000 | 198393.000000 | ready for queries |
103 | demo_project | None | ['activity', 'bid', 'user'] | 1000.000000 | 198393.000000 | ready for queries |
Loading a Project Definition
Previously built projects can be recalled using load_project(). This is often useful to inspect the TQL expression attributes that are addressable in queries.
load_project('advertising_funnel_data')
id | name | description | timeseries | status |
---|---|---|---|---|
102 | advertising_funnel_data | No description yet added | ['bid', 'activity', 'user'] | ready for queries |
Statistic | Value |
---|---|
Total Timelines | 1000 |
Smallest Timeline | 1 events |
Largest Timeline | 904 events |
Min Timestamp | 2021-01-01 00:00:00 |
Max Timestamp | 2021-01-14 23:59:59 |
Total Events | 198393 |
Activity Events | 25416 |
Bid Events | 171977 |
User Events | 1000 |
TQL Expression | Description | Type |
---|---|---|
activity.activity_id | None | string |
activity.conversion | None | string |
activity.event_time | None | string |
activity.user_id | None | string |
bid.ad_size | None | string |
bid.bid | None | double |
bid.event_time | None | string |
bid.ghosted | None | boolean |
bid.request_id | None | string |
bid.user_id | None | string |
bid.won | None | boolean |
duration | The duration of the event. Typically used with generated events. | long |
event | The current event object | event |
generated | This event was created by generate_events() | boolean |
generated_weight | The weight to apply to the sample | double |
id | The auto-generated event id | long |
index | The index of the current event in the timeline | integer |
outcome | The outcome event. Only used in conjunction with SUM_OPPORTUNITIES() and CAUSAL_EFFECT | event |
timeline.events | Time-ordered array of events | array |
timeline.id | Shared timeline id | string |
timestamp | The timestamp of the event | long |
type | The event type of this event | string |
user.age | None | string |
user.duration | None | string |
user.gender | None | string |
user.group | None | string |
user.region | None | string |
user.timestamp | None | string |
user.user_id | None | string |
Using A Project In A TQL Query
Use select(...).from_events('project_name') to write Queries that produce ResultSets from your timelines.
select('*').from_events('advertising_funnel_data').limit(10)
Query results:
partition "_default"
 | activity_activity_id | activity_conversion | activity_event_time | activity_user_id | bid_ad_size | bid_bid | bid_event_time | bid_ghosted | bid_request_id | bid_user_id | ... | timeline_id | timestamp | type | user_age | user_duration | user_gender | user_group | user_region | user_timestamp | user_user_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | None | None | None | None | None | None | None | None | None | ... | u252 | 1609459200000 | user | 55plus | None | male | influenced | south | 2021-01-01 | u252 |
1 | None | None | None | None | big | 1.0 | 2021-01-02 00:02:36 | false | r13260 | u252 | ... | u252 | 1609545756000 | bid | None | None | None | None | None | None | None |
2 | None | None | None | None | big | 1.0 | 2021-01-02 00:16:41 | false | r13376 | u252 | ... | u252 | 1609546601000 | bid | None | None | None | None | None | None | None |
3 | None | None | None | None | small | 1.0 | 2021-01-02 00:30:55 | false | r13509 | u252 | ... | u252 | 1609547455000 | bid | None | None | None | None | None | None | None |
4 | i176 | product_view | 2021-01-02 00:57:47 | u252 | None | None | None | None | None | None | ... | u252 | 1609549067000 | activity | None | None | None | None | None | None | None |
5 | None | None | None | None | small | 1.0 | 2021-01-02 01:04:01 | false | r13737 | u252 | ... | u252 | 1609549441000 | bid | None | None | None | None | None | None | None |
6 | None | None | None | None | big | 1.0 | 2021-01-02 01:37:26 | false | r13985 | u252 | ... | u252 | 1609551446000 | bid | None | None | None | None | None | None | None |
7 | i187 | product_view | 2021-01-02 02:15:43 | u252 | None | None | None | None | None | None | ... | u252 | 1609553743000 | activity | None | None | None | None | None | None | None |
8 | None | None | None | None | big | 1.0 | 2021-01-02 02:18:30 | false | r14310 | u252 | ... | u252 | 1609553910000 | bid | None | None | None | None | None | None | None |
9 | None | None | None | None | big | 1.0 | 2021-01-02 03:01:48 | false | r14607 | u252 | ... | u252 | 1609556508000 | bid | None | None | None | None | None | None | None |
10 rows × 27 columns
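Note that the timestamp column in query results is epoch milliseconds; converting the first row's value back to a readable datetime:

```python
from datetime import datetime, timezone

# The first row's timestamp from the query results above.
ts_millis = 1609459200000
dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-01-01 00:00:00
```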
See the full guide to writing TQL Queries to continue from here.