Concepts and Architecture
The Timeline
In TQL, a Timeline is simply a collection of Events observed for a single user, ordered by timestamp. Events may come from one or more TimeSeries, which are added to a Project, then synthesized into a set of Timelines and saved to persistent storage.
The TimeSeries
Every TQL project starts with one or more sources of event log data, stored in files on disk, in cloud storage, or in a database system. The TimeSeries object tells TQL how to read this data out of your storage system into your workspace. Each TimeSeries mapping must declare:
A name for the TimeSeries, such as bids
A source of the data, either flat files or SQL SELECT statements
A user_id field and a timestamp field
For example:
from zeenk.tql import *
# define the bids timeseries
bids = create_timeseries('bids')
bids.from_files('/path/to/bid/logs')
bids.identified_by('user_id', 'timestamp')
# define the site activity timeseries
site_activity = create_timeseries('activity')
...
The Project
A Project consists of one or more TimeSeries objects, usually more than one. A typical online advertising project might consist of a set of advertising auction logs and a separate set of user site activity logs that share a common user_id. For example:
project = create_project('my_project')
project.from_timeseries(bids, site_activity)
Building Timelines
Timelines are constructed when project.build_timelines() is called. At this point, the data is actually read from your storage system (files or SQL). TQL groups all events in all TimeSeries by user_id, sorts the events by timestamp, and then writes them to persistent storage using a nested object schema that contains all fields from all the input time series side by side.
project.build_timelines()
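The group-and-sort step described above can be illustrated in plain Python. This is only a minimal sketch, not TQL's implementation: the function name sketch_build_timelines, the in-memory dict-based events, and the sample field names (bid, page) are all assumptions made for illustration, and nothing is written to persistent storage.

```python
from collections import defaultdict


def sketch_build_timelines(*timeseries):
    """Group events from several (name, events) pairs by user_id,
    then sort each user's events by timestamp.

    Hypothetical helper for illustration only; not part of zeenk.tql.
    """
    grouped = defaultdict(list)
    for name, events in timeseries:
        for event in events:
            # Nest the original fields under the series name so that fields
            # from different series sit side by side without colliding.
            grouped[event["user_id"]].append({
                "user_id": event["user_id"],
                "timestamp": event["timestamp"],
                name: event,
            })
    return {
        user_id: sorted(user_events, key=lambda e: e["timestamp"])
        for user_id, user_events in grouped.items()
    }


# Made-up sample data standing in for the bids and activity logs.
bids = ("bids", [
    {"user_id": "u1", "timestamp": 30, "bid": 1.5},
    {"user_id": "u2", "timestamp": 10, "bid": 0.7},
])
activity = ("activity", [
    {"user_id": "u1", "timestamp": 20, "page": "/home"},
])

timelines = sketch_build_timelines(bids, activity)
```

Here timelines maps each user_id to a single list that interleaves that user's events from both sources in timestamp order.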
The Timeline Data Structure
TODO
Executing Queries against Timelines
Each Timeline contains all the events from each TimeSeries for that particular user, and the set of Timelines is the source of data that TQL queries are executed against. This is analogous to a table in a traditional SQL system. When a TQL query SELECTs one or more columns FROM a set of Timelines, one row is emitted for each event in each Timeline. The data processing model can be expressed in the following pseudocode:
for Timeline t in all Timelines:
    for Event e in t.events:
        for expression exp in SELECT(exp1, exp2, ...):
            evaluate exp against e
        emit row vector with one value for each expression
For example, if your input data has 10 timelines with 5 events each, a TQL query with no where() clauses or limits will emit a 50-row dataset, one row for each event.
[Diagram of Timelines + TQL Query (expressions) -> TQL ResultSet with one row per event]
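The row-per-event model can be checked with a small plain-Python sketch. It is an illustration under stated assumptions rather than TQL's query engine: run_query is a hypothetical function, expressions are modeled as ordinary callables applied to each event, and the optional where predicate is a simplified stand-in for a where() clause.

```python
def run_query(timelines, expressions, where=None):
    """Emit one row per event, with one value per expression.

    Hypothetical stand-in for a TQL query; not the zeenk.tql API.
    """
    rows = []
    for timeline in timelines:
        for event in timeline:
            # Skip events rejected by the optional filter.
            if where is not None and not where(event):
                continue
            rows.append([expr(event) for expr in expressions])
    return rows


# 10 timelines with 5 events each, as in the example above.
timelines = [
    [{"user_id": user, "timestamp": ts} for ts in range(5)]
    for user in range(10)
]
expressions = [lambda e: e["user_id"], lambda e: e["timestamp"]]

rows = run_query(timelines, expressions)
first_events = run_query(timelines, expressions,
                         where=lambda e: e["timestamp"] == 0)
```

With no filter, rows holds 50 two-element row vectors. The filtered call keeps only the first event of each timeline, so it yields 10 rows.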
For many machine learning projects, you may wish to combine more than one source of Events into a single Timeline. For example, in a typical online advertising funnel use case, a user’s Timeline might be the combination of a stream of ad opportunities, treatments (i.e. impressions), and conversion activity such as product views, add-to-carts, and purchases.
Roughly, construction of these timelines corresponds to the following SQL pseudocode:
SELECT collect_list(o.*, t.*, c.*)
FROM opportunities o
UNION treatments t
UNION conversions c
GROUP BY user_id
ORDER GROUPED VALUES BY timestamp
TQL queries process each timeline as a unit, emitting a row for each event in the timeline, subject to where() and limit() clauses, among other things.
Because a particular user’s timeline contains all of the events observed for that user, sorted by timestamp, computing panel or windowed quantities becomes much simpler than with a traditional SQL system.
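To see why a pre-sorted, per-user timeline helps, consider a windowed quantity such as "how many earlier events did this user have in the trailing N time units?". The sketch below computes it in a single pass over one timeline. It is plain Python written for illustration, not a TQL function: the name trailing_counts, the integer timestamps, and the window size are all assumptions.

```python
def trailing_counts(timeline, window):
    """For each event, count the earlier events in the same timeline whose
    timestamp falls within `window` time units before it.

    Assumes `timeline` is already sorted by timestamp and window >= 0.
    """
    counts = []
    start = 0  # index of the oldest event still inside the window
    for i, event in enumerate(timeline):
        # Advance the left edge past events that have fallen out of the window.
        while timeline[start]["timestamp"] < event["timestamp"] - window:
            start += 1
        counts.append(i - start)
    return counts


# One made-up user's timeline, already sorted by timestamp.
timeline = [{"timestamp": ts} for ts in (0, 5, 8, 20, 21)]
counts = trailing_counts(timeline, window=10)
```

Because the events are already grouped by user and sorted, no self-join or window partitioning is needed. A single moving index is enough to track the trailing window.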