# Creating Timelines
Before you can write TQL queries, you must import your raw time series data into TQL and construct timelines. A Timeline is simply a collection of events identified by a common join key (for example, a logged-in user_id or a web cookie) and sorted by timestamp. In this guide we will create mappings from source data into TQL, inspect the data, and finally create a Project and Timelines that TQL queries can be executed against.
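Conceptually, a timeline is what you get by grouping a flat event table on its join key and sorting each group by timestamp. A minimal pandas sketch of the idea (column and event names are made up for illustration):

```python
import pandas as pd

# A flat event table: user_id is the join key, event_time the timestamp.
events = pd.DataFrame({
    "user_id": ["u2", "u1", "u1", "u2"],
    "event_time": ["2021-01-02", "2021-01-01", "2021-01-03", "2021-01-01"],
    "event": ["view", "view", "purchase", "click"],
})

# One "timeline" per user: that user's events in chronological order.
timelines = {
    uid: grp.sort_values("event_time")["event"].tolist()
    for uid, grp in events.groupby("user_id")
}
print(timelines)  # {'u1': ['view', 'purchase'], 'u2': ['click', 'view']}
```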
### Creating A TimeSeries
Your source data is typically stored in your data lake as flat files, and/or in a SQL-like data warehouse such as BigQuery or Snowflake. The mapping between your data source and TQL is called a ***TimeSeries***. It is important to note that a TimeSeries object in TQL is not the data itself, but merely a recipe for reading the data out of storage into TQL. We start with the TQL function `create_timeseries()`, providing a source using a `from_files()` or `from_sql()` clause, along with the names of the common id and timestamp fields in the source data.
```python
from zeenk.tql import *
```
-----------------------------------------------------------------
Version: 20.1.17-SNAPSHOT
Version Timestamp: 2021-08-27 20:56:17
Version Age: 2 days, 19 hours, 33 minutes, 6 seconds
Filesystem Root: /Users/zkozick/.tql/files
Working Directory: /Users/zkozick/.tql
Configuration File: /Users/zkozick/.tql/tql-conf.yml
Api Gateway: http://localhost:9000
Service Status: Icarus: ONLINE, Daedalus: ONLINE
Service Uptime: 1 minutes, 5 seconds
-----------------------------------------------------------------
```python
from zeenk.tql.demo_projects import get_project4_files_dir
# this directory contains the time series data files for demo project 4.
base_dir = get_project4_files_dir()
activities = create_timeseries('activity')
activities.identified_by('user_id', 'event_time')
activities.from_files(
    f'{base_dir}/activity*.csv',
    has_csv_header=True
)
```
TimeSeries(name=activity,
timeline_id_col=user_id,
timestamp_col=event_time,
format=CSV,
data_path=gs://noumena...ctivity*.csv,
sql_statements=None)
### Validating A TimeSeries
The first step after creating your TimeSeries is to validate it with `validate()`. Validation checks that your data can be loaded from the external source, that the identity fields are not empty, and that the timestamp format is parsable.
```python
activities.validate()
```
just a moment, setting up a PySpark session...
PySpark is now available via the function zeenk.tql.spark()
or just spark() if you used from zeenk.tql import *
validating 'activity'... no errors found.
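The three checks described above can be approximated by hand on a pandas DataFrame. This is only an illustrative sketch of what validation looks for, not TQL's actual implementation:

```python
import pandas as pd

def rough_validate(df: pd.DataFrame, id_col: str, ts_col: str) -> list:
    """Approximate the validation checks; returns a list of error strings."""
    errors = []
    if df.empty:
        errors.append("no rows could be loaded from the source")
    if df[id_col].isna().any() or (df[id_col].astype(str) == "").any():
        errors.append(f"identity column '{id_col}' contains empty values")
    # errors='coerce' turns unparsable timestamps into NaT
    if pd.to_datetime(df[ts_col], errors="coerce").isna().any():
        errors.append(f"timestamp column '{ts_col}' is not fully parsable")
    return errors

df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "event_time": ["2021-01-11 00:00:03", "2021-01-11 00:00:51"],
})
print(rough_validate(df, "user_id", "event_time"))  # []
```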
### Loading A TimeSeries into Pandas or Spark
Often it is useful to load your timeseries data as a dataframe for inspection or external analysis. For this we can use the convenience functions `pandas_dataframe()` and `spark_dataframe()`:
```python
activities.pandas_dataframe()
```
| | activity_id | event_time | user_id | conversion |
|---|---|---|---|---|
| 0 | a15784 | 2021-01-11 00:00:03 | u196 | product_view |
| 1 | a15785 | 2021-01-11 00:00:51 | u88 | product_view |
| 2 | a15786 | 2021-01-11 00:01:18 | u668 | product_view |
| 3 | a15787 | 2021-01-11 00:01:19 | u88 | product_view |
| 4 | a15788 | 2021-01-11 00:01:22 | u116 | product_view |
| ... | ... | ... | ... | ... |
| 25411 | a2924 | 2021-01-02 23:54:36 | u718 | product_view |
| 25412 | a2925 | 2021-01-02 23:55:02 | u80 | product_view |
| 25413 | a2926 | 2021-01-02 23:59:07 | u819 | product_view |
| 25414 | a2927 | 2021-01-02 23:59:35 | u264 | product_view |
| 25415 | a2928 | 2021-01-02 23:59:59 | u169 | product_view |

25416 rows × 4 columns
```python
activities.spark_dataframe().show(5)
```
+-----------+-------------------+-------+------------+
|activity_id| event_time|user_id| conversion|
+-----------+-------------------+-------+------------+
| a15784|2021-01-11 00:00:03| u196|product_view|
| a15785|2021-01-11 00:00:51| u88|product_view|
| a15786|2021-01-11 00:01:18| u668|product_view|
| a15787|2021-01-11 00:01:19| u88|product_view|
| a15788|2021-01-11 00:01:22| u116|product_view|
+-----------+-------------------+-------+------------+
only showing top 5 rows
### Getting an Example TimeSeries Row
You can fetch a single example row from your time series using `get_example()`:
```python
activities.get_example()
```
{'activity_id': 'a15784',
'event_time': '2021-01-11 00:00:03',
'user_id': 'u196',
'conversion': 'product_view'}
### Visualizing A TimeSeries
It is often helpful to inspect your timeseries visually. For this we provide the function `visualize()`,
available as a Jupyter extension:
```python
activities.visualize()
```
preparing data...
Timeline(events='[{"_type": "activity", "_timeline_id": "u196", "_datetime": "2021-01-11 00:00:03", "activity_…
### Analyzing A TimeSeries
TQL also provides a quick way to understand the shape of your TimeSeries, via the function `analyze()`:
```python
activities.analyze()
```
SUCCESS: 224/224|██████████|, TOTAL ELAPSED=0:00:10, WAIT=0:00:00, PENDING=0:00:00, RUNNING=0:00:09
TimeSeries Name: activity
Timeline Identity Column: user_id
Timestamp Column: event_time
File Format: CSV
Number of Events: 25416
Number of Timelines: 876
Data Path: gs://noumena-public/data/lethe_demo_set_randomize_users_id4_v1.3/activity*.csv
TimeSeries Statistics

| Column | Cardinality | Range | Type |
|---|---|---|---|
| user_id | 876 | u0 - u999 | string |
| activity_id | 25416 | a0 - i999 | string |
| conversion | 3 | add_to_cart - purchase | string |

Top 5 Values of activity_id

| Value | Number of Occurrences | Pct of All Events |
|---|---|---|
| a15822 | 1 | 0.0% |
| a15890 | 1 | 0.0% |
| a16415 | 1 | 0.0% |
| a16450 | 1 | 0.0% |
| a16462 | 1 | 0.0% |

Top 5 Values of conversion

| Value | Number of Occurrences | Pct of All Events |
|---|---|---|
| product_view | 22251 | 87.55% |
| add_to_cart | 2165 | 8.52% |
| purchase | 1000 | 3.93% |

Top 5 Values of user_id

| Value | Number of Occurrences | Pct of All Events |
|---|---|---|
| u134 | 171 | 0.67% |
| u634 | 156 | 0.61% |
| u482 | 151 | 0.59% |
| u677 | 149 | 0.59% |
| u963 | 148 | 0.58% |
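Since the timeseries can also be loaded with `pandas_dataframe()`, the "Top 5 Values" breakdowns above can be reproduced with ordinary pandas. A sketch of the equivalent computation on toy data (this is not what `analyze()` actually runs internally):

```python
import pandas as pd

# Toy stand-in for the conversion column of the activity timeseries.
df = pd.DataFrame({"conversion": ["product_view"] * 5 + ["add_to_cart"] * 2 + ["purchase"]})

counts = df["conversion"].value_counts()  # occurrences per value, descending
pct = (counts / len(df) * 100).round(2)   # pct of all events
summary = pd.DataFrame({"occurrences": counts, "pct_of_all_events": pct})
print(summary.head(5))
```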
### Creating Another TimeSeries
Usually, a set of timelines will consist of more than a single time series. For example, we may wish to augment
our site activity with a set of bids we made to show ads to the same set of users. For this we simply create another
`TimeSeries` object, which we will eventually add to our project to build timelines.
```python
bids = create_timeseries('bid')
bids.identified_by('user_id', 'event_time')
bids.from_files(
    f'{base_dir}/bid*.csv',
    has_csv_header=True
)
```
TimeSeries(name=bid,
timeline_id_col=user_id,
timestamp_col=event_time,
format=CSV,
data_path=gs://noumena...1.3/bid*.csv,
sql_statements=None)
### Creating a TimeSeries from an External SQL Store
It is common to already have your time series data loaded into a data warehouse engine for analytics.
TQL supports directly ingesting data from an external source with the function `from_sql()`. Flat files,
BigQuery, and Snowflake (with the appropriate credentials) are supported out of the box. Using SQL
to map your data into TQL has the added benefit of letting you manipulate your data upon ingestion. Other common uses for SQL-style ingestion are filtering out bot, duplicate, or garbage data, or augmenting your timeseries with additional attributes from another table using a JOIN. In the following example, we assign a proxy `timestamp` to all our user records, as well as add a `duration` field:
```python
users = create_timeseries('user')
users.identified_by('user_id', 'timestamp', duration_col='duration')
users.from_sql(f'''
    CREATE OR REPLACE TEMPORARY VIEW users
    USING CSV options (
        path="{base_dir}/user.csv",
        inferSchema="true",
        header="true"
    )''',
    'SELECT u.*, "2021-01-01" timestamp, "1" duration FROM users u'
)
```
TimeSeries(name=user,
timeline_id_col=user_id,
timestamp_col=timestamp,
format=SQL,
data_path=None,
sql_statements=['\n CREATE OR...r="true"\n )', 'SELECT u.*, ...FROM users u'])
### Creating a Project and Building Timelines
Once your TimeSeries objects are defined, combine them into a Project and build the timelines:
```python
project = create_project('advertising_funnel_data', or_update=True)
project.from_timeseries(bids, activities, users)
timelines = project.build_timelines()
```
SUCCESS: 234/234|██████████|, TOTAL ELAPSED=0:00:22, WAIT=0:00:00, PENDING=0:00:04, COMPILING=0:00:18
### Listing All Available Projects
You can list all available projects by using `show_projects()`. Note that TQL ships with four demo
projects to help you learn the basics of TQL, without having to ingest any time series data:
```python
show_projects()
```
| id | name | description | timeseries | timelines | events | status |
|---|---|---|---|---|---|---|
| 1 | acheron | Acheron synthetic data, 2000 users, 5 requests per user, v1.1 | ['request', 'conversion', 'user'] | 2000 | 14220 | ready for queries |
| 2 | acheron2 | Acheron 2 synthetic data, reasonably_rich, 200 users, 10 days, v1.1 | ['request', 'conversion', 'user'] | 200 | 13464 | ready for queries |
| 3 | lethe3 | Lethe synthetic data, demo_set, random treatment, 1000 users, 14 days, v1.3 | ['bid', 'activity', 'user'] | 1000 | 198341 | ready for queries |
| 4 | lethe4 | Lethe synthetic data, demo_set, random users, 1000 users, 14 days, v1.3 | ['bid', 'activity', 'user'] | 1000 | 198393 | ready for queries |
| 10 | unittest | this is a unit test project | ['request'] | 1990 | 12220 | ready for queries |
| 101 | None | unittest | ['unittest'] | nan | nan | timelines not yet created |
| 102 | advertising_funnel_data | None | ['bid', 'activity', 'user'] | 1000 | 198393 | ready for queries |
| 103 | demo_project | None | ['activity', 'bid', 'user'] | 1000 | 198393 | ready for queries |
### Loading a Project Definition
Previously built projects can be recalled using `load_project()`. This is often useful for inspecting the
TQL expression attributes that are addressable in queries.
```python
load_project('advertising_funnel_data')
```
Table Overview

| id | name | description | timeseries | status |
|---|---|---|---|---|
| 102 | advertising_funnel_data | No description yet added | ['bid', 'activity', 'user'] | ready for queries |

Timeline Statistics

| Statistic | Value |
|---|---|
| Total Timelines | 1000 |
| Smallest Timeline | 1 events |
| Largest Timeline | 904 events |
| Min Timestamp | 2021-01-01 00:00:00 |
| Max Timestamp | 2021-01-14 23:59:59 |
| Total Events | 198393 |
| Activity Events | 25416 |
| Bid Events | 171977 |
| User Events | 1000 |

Attributes

| TQL Expression | Description | Type |
|---|---|---|
| activity.activity_id | None | string |
| activity.conversion | None | string |
| activity.event_time | None | string |
| activity.user_id | None | string |
| bid.ad_size | None | string |
| bid.bid | None | double |
| bid.event_time | None | string |
| bid.ghosted | None | boolean |
| bid.request_id | None | string |
| bid.user_id | None | string |
| bid.won | None | boolean |
| duration | The duration of the event. Typically used with generated events. | long |
| event | The current event object | event |
| generated | This event was created by generate_events() | boolean |
| generated_weight | The weight to apply to the sample | double |
| id | The auto-generated event id | long |
| index | The index of the current event in the timeline | integer |
| outcome | The outcome event. Only used in conjunction with SUM_OPPORTUNITIES() and CAUSAL_EFFECT() | event |
| timeline.events | Time-ordered array of events | array |
| timeline.id | Shared timeline id | string |
| timestamp | The timestamp of the event | long |
| type | The event type of this event | string |
| user.age | None | string |
| user.duration | None | string |
| user.gender | None | string |
| user.group | None | string |
| user.region | None | string |
| user.timestamp | None | string |
| user.user_id | None | string |
### Using A Project In A TQL Query
Use `select(...).from_events('project_name')` to write queries that produce ResultSets from your timelines.
```python
select('*').from_events('advertising_funnel_data').limit(10)
```
Query results:
partition "_default"

| | activity_activity_id | activity_conversion | activity_event_time | activity_user_id | bid_ad_size | bid_bid | bid_event_time | bid_ghosted | bid_request_id | bid_user_id | ... | timeline_id | timestamp | type | user_age | user_duration | user_gender | user_group | user_region | user_timestamp | user_user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | None | None | None | None | None | None | None | None | None | ... | u252 | 1609459200000 | user | 55plus | None | male | influenced | south | 2021-01-01 | u252 |
| 1 | None | None | None | None | big | 1.0 | 2021-01-02 00:02:36 | false | r13260 | u252 | ... | u252 | 1609545756000 | bid | None | None | None | None | None | None | None |
| 2 | None | None | None | None | big | 1.0 | 2021-01-02 00:16:41 | false | r13376 | u252 | ... | u252 | 1609546601000 | bid | None | None | None | None | None | None | None |
| 3 | None | None | None | None | small | 1.0 | 2021-01-02 00:30:55 | false | r13509 | u252 | ... | u252 | 1609547455000 | bid | None | None | None | None | None | None | None |
| 4 | i176 | product_view | 2021-01-02 00:57:47 | u252 | None | None | None | None | None | None | ... | u252 | 1609549067000 | activity | None | None | None | None | None | None | None |
| 5 | None | None | None | None | small | 1.0 | 2021-01-02 01:04:01 | false | r13737 | u252 | ... | u252 | 1609549441000 | bid | None | None | None | None | None | None | None |
| 6 | None | None | None | None | big | 1.0 | 2021-01-02 01:37:26 | false | r13985 | u252 | ... | u252 | 1609551446000 | bid | None | None | None | None | None | None | None |
| 7 | i187 | product_view | 2021-01-02 02:15:43 | u252 | None | None | None | None | None | None | ... | u252 | 1609553743000 | activity | None | None | None | None | None | None | None |
| 8 | None | None | None | None | big | 1.0 | 2021-01-02 02:18:30 | false | r14310 | u252 | ... | u252 | 1609553910000 | bid | None | None | None | None | None | None | None |
| 9 | None | None | None | None | big | 1.0 | 2021-01-02 03:01:48 | false | r14607 | u252 | ... | u252 | 1609556508000 | bid | None | None | None | None | None | None | None |

10 rows × 27 columns
query produced 10 rows x 27 columns in 1.12 seconds
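Note that the `timestamp` column in the results is a long; row 0's value of 1609459200000 lines up with its `user_timestamp` of 2021-01-01, which suggests epoch milliseconds. Assuming that encoding, the values convert back to datetimes with the standard library:

```python
from datetime import datetime, timezone

ts_millis = 1609459200000  # the `timestamp` of row 0 above
dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-01-01 00:00:00
```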
Go [here](queries.html) for the full guide to writing TQL Queries.