Concepts and Architecture

The Timeline

In TQL, a Timeline is simply a collection of Events observed for a single user, ordered by timestamp. Events may come from one or more TimeSeries, which are added to a Project, then synthesized into a set of Timelines and saved to persistent storage.

The TimeSeries

Every TQL project starts with one or more sources of event log data, stored in files on disk, in cloud storage, or in a database system. The TimeSeries object tells TQL how to read this data out of your storage system into your workspace. Each TimeSeries mapping must declare:

A name for the TimeSeries, such as bids
A source of the data, either flat files or SQL SELECT statements
A user_id field and a timestamp field

For example:

from zeenk.tql import *
# define the bids timeseries
bids = create_timeseries('bids')
bids.from_files('/path/to/bid/logs')
bids.identified_by('user_id', 'timestamp')

# define the site activity timeseries
site_activity = create_timeseries('activity')
...

The Project

A Project consists of TimeSeries objects, usually more than one. A typical online advertising project might consist of a set of advertising auction logs and a separate set of user site activity logs that share a common user_id. For example:

project = create_project('my_project')
project.from_timeseries(bids, site_activity)

Building Timelines

Timelines are constructed when project.build_timelines() is called. At this point, the data is actually read from your storage system (files or SQL). The build groups all events in all TimeSeries by user_id, sorts each user's events by timestamp, and then writes them to persistent storage using a nested object schema that contains all fields from all the input TimeSeries side by side.

project.build_timelines()
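
The grouping and sorting step can be illustrated in plain Python. This is a conceptual sketch only, not TQL's implementation: the function group_into_timelines, the dictionary-based event records, and the sample fields bid_price and page are all assumptions made for illustration.

```python
from collections import defaultdict

def group_into_timelines(series_by_name):
    """Group events from every named series by user_id, then sort by timestamp.

    `series_by_name` maps a series name to a list of event dicts, each with
    'user_id' and 'timestamp' keys (a hypothetical in-memory stand-in for
    the data a TimeSeries would read from storage).
    """
    grouped = defaultdict(list)
    for name, events in series_by_name.items():
        for event in events:
            # Nest each event's fields under its series name so fields from
            # different series sit side by side without colliding.
            grouped[event["user_id"]].append(
                {"timestamp": event["timestamp"], name: event}
            )
    # Sort each user's events by timestamp to form that user's timeline.
    return {
        user_id: sorted(events, key=lambda e: e["timestamp"])
        for user_id, events in grouped.items()
    }

# Hypothetical sample data for two series sharing a user_id.
bid_events = [
    {"user_id": "u1", "timestamp": 30, "bid_price": 1.5},
    {"user_id": "u2", "timestamp": 10, "bid_price": 0.7},
]
activity_events = [
    {"user_id": "u1", "timestamp": 20, "page": "/home"},
]

timelines = group_into_timelines(
    {"bids": bid_events, "activity": activity_events}
)
```

In this sketch, user u1 ends up with a two-event timeline in which the activity event precedes the bid event, even though the two events came from different inputs.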

The Timeline Data Structure

TODO

Executing Queries against Timelines

Each Timeline contains all the events from every TimeSeries for one particular user. Timelines are the source of data that TQL queries are executed against, analogous to a table in a traditional SQL system. When a TQL query SELECTs one or more columns FROM a set of Timelines, one row is emitted for each event in each Timeline. The data processing model can be expressed in the following pseudocode:

for Timeline t in all Timelines:
    for Event e in t.events:
        for expression exp in SELECT(exp1, exp2, ...):
            evaluate exp against e
        emit row vector with one value for each expression

For example, if your input data has 10 timelines with 5 events each, a TQL query with no where clauses or limits will emit a 50-row dataset, one row for each event.
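
The same processing model can be written as a small runnable Python sketch. It is not the TQL query engine: run_query is a hypothetical helper, timelines are modeled as plain lists of event dicts, and each expression is modeled as an ordinary function of one event.

```python
def run_query(timelines, expressions):
    """Emit one row per event; each row holds one value per expression."""
    rows = []
    for timeline in timelines:
        for event in timeline:
            rows.append([expr(event) for expr in expressions])
    return rows

# 10 timelines with 5 events each (synthetic data for illustration).
timelines = [
    [{"user_id": user, "timestamp": ts} for ts in range(5)]
    for user in range(10)
]

# Two "SELECT" expressions, modeled as functions of a single event.
expressions = [lambda e: e["user_id"], lambda e: e["timestamp"]]

rows = run_query(timelines, expressions)
print(len(rows))  # 50
```

With no filtering or limits, the number of rows equals the total number of events, and the number of columns equals the number of expressions.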

[Diagram of Timelines + TQL Query (expressions) -> TQL ResultSet with one row per event]

For many machine learning projects, Events originate from more than one source, and you may wish to combine them. For example, in a typical online advertising funnel use case, a user’s Timeline might be the combination of a stream of ad opportunities, treatments (i.e. impressions), and conversion activity such as product views, add-to-carts, and purchases.

Roughly, construction of these timelines corresponds to the following SQL pseudocode:

SELECT collect_list(o.*, t.*, c.*)
  FROM opportunities o
 UNION treatments t
 UNION conversions c
 GROUP BY user_id
 ORDER GROUPED VALUES BY timestamp

TQL queries process each timeline as a unit, emitting a row for each event in the timeline (subject to where() and limit() clauses, among other things).

Because a particular user’s timeline contains all of the events observed for that user, sorted by timestamp, computing panel or windowed quantities becomes much simpler than with a traditional SQL system.
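
To see why a pre-sorted, per-user timeline helps, consider counting how many earlier events fall within a trailing time window of each event. The sketch below is plain Python rather than TQL: the function trailing_window_counts and the sample timestamps are assumptions for illustration. Because the events are already sorted, a single forward pass with a moving start index is enough, with no self-join required.

```python
def trailing_window_counts(timeline, window):
    """For each event, count earlier events in the same timeline whose
    timestamp is within `window` time units before it (inclusive).

    Assumes `timeline` is a list of event dicts sorted by 'timestamp'.
    """
    counts = []
    start = 0  # index of the oldest event still inside the window
    for i, event in enumerate(timeline):
        # Advance past events that have fallen out of the trailing window.
        while timeline[start]["timestamp"] < event["timestamp"] - window:
            start += 1
        counts.append(i - start)
    return counts

# One user's timeline, already sorted by timestamp (synthetic data).
timeline = [{"timestamp": ts} for ts in (0, 5, 8, 20, 21)]
counts = trailing_window_counts(timeline, window=10)
print(counts)  # [0, 1, 2, 0, 1]
```

The event at timestamp 8 sees two prior events within 10 time units, while the event at timestamp 20 sees none, since all earlier events are more than 10 units old.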