Concepts and Architecture

The Timeline

In TQL, a Timeline is simply a collection of Events observed for a single user, ordered by timestamp. Events may come from one or more TimeSeries, which are added to a Project, then synthesized into a set of Timelines and saved to persistent storage.

The TimeSeries

Every TQL project starts with one or more sources of event log data, stored in files on disk or in cloud storage, or in a database system. The TimeSeries object tells TQL how to read this data out of your storage system and into your workspace. Each TimeSeries mapping must declare:

  1. A name for the TimeSeries, such as bids

  2. A source for the data, either flat files or a SQL SELECT statement

  3. The user_id and timestamp fields used to identify the user and order the events

For example:

from zeenk.tql import *
# define the bids timeseries
bids = create_timeseries('bids')
bids.from_files('/path/to/bid/logs')
bids.identified_by('user_id', 'timestamp')

# define the site activity timeseries
site_activity = create_timeseries('activity')
...

The Project

A Project consists of one or more TimeSeries objects, usually more than one. A typical online advertising project, for example, might combine a set of advertising auction logs with a separate set of user site activity logs that share a common user_id:

project = create_project('my_project')
project.from_timeseries(bids, site_activity)

Building Timelines

Timelines are constructed when project.build_timelines() is called. At this point the data is actually read from your storage system (files or SQL): TQL groups all events from all TimeSeries by user_id, sorts each group by timestamp, and then writes the resulting Timelines to persistent storage using a nested object schema that contains the fields from all of the input TimeSeries side by side.

project.build_timelines()
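Conceptually, the build step performs the grouping and sorting described above. The plain-Python sketch below illustrates only those semantics; it is not TQL's actual implementation, and the dict-based events and field names are assumptions for illustration:

from collections import defaultdict

def build_timelines_sketch(events):
    """Group events from every TimeSeries by user_id, then sort each group by timestamp."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event['user_id']].append(event)
    for user_events in timelines.values():
        user_events.sort(key=lambda e: e['timestamp'])
    return dict(timelines)  # user_id -> time-ordered list of that user's events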

The Timeline Data Structure

Each Timeline is stored as a nested record: a user_id together with the list of that user's events, sorted by timestamp. Every event carries the unified set of fields from all of the input TimeSeries, so fields from different sources sit side by side in a single event schema.
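As a rough illustration (the exact layout and field names here are assumptions, not TQL's published schema), a single Timeline for a user with one site activity event and one bid event might look like:

timeline = {
    'user_id': 'u1',
    'events': [
        # an event from the 'activity' TimeSeries; the 'bids' fields are not populated
        {'timestamp': 1001, 'page': '/home', 'bid_price': None},
        # an event from the 'bids' TimeSeries; the 'activity' fields are not populated
        {'timestamp': 1002, 'page': None, 'bid_price': 1.25},
    ],
}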

Executing Queries against Timelines

Each Timeline contains all the events from each TimeSeries for that particular user, and Timelines are the source of data that TQL queries execute against, analogous to a table in a traditional SQL system. When a TQL query SELECTs one or more columns FROM a set of Timelines, one row is emitted for each event in each Timeline. The data processing model can be expressed in the following pseudocode:

for Timeline t in all Timelines:
    for Event e in t.events:
        for expression exp in SELECT(exp1, exp2, ...):
            emit row vector with one value for each expression

For example, if your input data has 10 timelines with 5 events each, a TQL query with no where() or limit() clauses will emit a 50-row dataset, one row per event.
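The same model can be written out as ordinary Python, with plain callables standing in for the SELECT expressions (a sketch of the semantics only, not TQL's API; the timeline dicts are the illustrative shape used above):

def run_query(timelines, expressions):
    """Emit one row per event, with one value per SELECT expression."""
    rows = []
    for timeline in timelines:
        for event in timeline['events']:
            rows.append([expr(event) for expr in expressions])
    return rows

# 10 timelines with 5 events each yields a 50-row result
timelines = [{'user_id': f'u{i}', 'events': [{'timestamp': t} for t in range(5)]}
             for i in range(10)]
assert len(run_query(timelines, [lambda e: e['timestamp']])) == 50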

[Diagram of Timelines + TQL Query (expressions) -> TQL ResultSet with one row per event]

For many machine learning projects, you may wish to combine Events from more than one source. In a typical online advertising funnel use case, for example, a user’s Timeline might be the combination of a stream of ad opportunities, treatments (i.e. impressions), and conversion activity such as product views, add-to-carts, and purchases.

Roughly, construction of these timelines corresponds to the following SQL pseudocode:

SELECT collect_list(o.*, t.*, c.*)
  FROM opportunities o
 UNION treatments t
 UNION conversions c
 GROUP BY user_id
 ORDER GROUPED VALUES BY timestamp

TQL queries process each timeline as a unit, emitting a row for each event in the timeline (subject to where() and limit() clauses, among other things).

Because a particular user’s timeline contains all of the events observed for that user, sorted by timestamp, computing panel or windowed quantities becomes much simpler than in a traditional SQL system.
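For instance, because each timeline arrives already grouped by user and sorted by time, a trailing-window quantity can be computed in a single pass over the events. The sketch below is ordinary Python over the illustrative event dicts used earlier, not TQL syntax, and the seven-day window and 'conversion' marker are assumptions for the example:

def conversions_in_trailing_week(events, window_seconds=7 * 24 * 3600):
    """For each event, count prior conversion events within the trailing window.
    Assumes events are already sorted by timestamp, as they are within a Timeline."""
    counts = []
    conversion_times = []
    for event in events:
        cutoff = event['timestamp'] - window_seconds
        # keep only conversions that are still inside the trailing window
        conversion_times = [t for t in conversion_times if t >= cutoff]
        counts.append(len(conversion_times))
        if event.get('event_type') == 'conversion':
            conversion_times.append(event['timestamp'])
    return counts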