# Concepts and Architecture

## The Timeline

In TQL, a **Timeline** is simply a collection of **Events** observed for a single user, ordered by timestamp. Events may come from one or more **TimeSeries**, which are added to a **Project**, then synthesized into a set of **Timelines** and saved to persistent storage.

### The TimeSeries

Every TQL project starts with one or more sources of event log data, stored in files on disk, in cloud storage, or in a database system. The TimeSeries object tells TQL how to read this data out of your storage system and into your workspace. Each TimeSeries mapping must declare:

1. A name for the TimeSeries, such as `bids`
2. A source for the data: either flat files or a SQL SELECT statement
3. A user_id field and a timestamp field

For example:

```python
from zeenk.tql import *

# define the bids timeseries
bids = create_timeseries('bids')
bids.from_files('/path/to/bid/logs')
bids.identified_by('user_id', 'timestamp')

# define the site activity timeseries
site_activity = create_timeseries('activity')
...
```

### The Project

A Project consists of TimeSeries objects, usually more than one. For example, a typical online advertising project might consist of a set of advertising auction logs and a separate set of user site activity logs that share a common user_id. For example:

```python
project = create_project('my_project')
project.from_timeseries(bids, site_activity)
```

### Building Timelines

Timelines are constructed when `project.build_timelines()` is called. At this point, the data is actually read from your storage system (files or SQL). TQL groups all events from all TimeSeries by user_id, sorts each group by timestamp, and then writes the result to persistent storage using a nested object schema that contains the fields of all the input TimeSeries side by side.
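Conceptually, this group-and-sort step can be sketched in plain Python. The flat event records and field names below are illustrative stand-ins, not TQL's actual storage schema or API:

```python
from collections import defaultdict

# Hypothetical flat event records, as they might arrive from two
# input TimeSeries ("bids" and "activity"); field names are illustrative.
events = [
    {"user_id": "u1", "timestamp": 300, "source": "activity"},
    {"user_id": "u2", "timestamp": 100, "source": "bids"},
    {"user_id": "u1", "timestamp": 100, "source": "bids"},
    {"user_id": "u1", "timestamp": 200, "source": "bids"},
]

def group_into_timelines(events):
    """Group events by user_id, then sort each user's events by timestamp."""
    timelines = defaultdict(list)
    for e in events:
        timelines[e["user_id"]].append(e)
    for user_events in timelines.values():
        user_events.sort(key=lambda e: e["timestamp"])
    return dict(timelines)

timelines = group_into_timelines(events)
print([e["timestamp"] for e in timelines["u1"]])  # [100, 200, 300]
```

In TQL itself this work happens inside the engine when you build the project; the sketch only shows the grouping semantics.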
```python
project.build_timelines()
```

### The Timeline Data Structure

TODO

## Executing Queries against Timelines

Each Timeline contains **all** the events from each TimeSeries for that particular user, and Timelines are the source of data that TQL queries are executed against. This is analogous to a table in a traditional SQL system. When a TQL query `SELECT`s one or more columns `FROM` a set of Timelines, **one row is emitted for each event in each Timeline**. The data processing model can be expressed in the following pseudocode:

```python
for Timeline t in all Timelines:
    for Event e in t.events:
        for expression exp in SELECT(exp1, exp2, ...):
            emit row vector with one value for each expression
```

For example, if your input data has 10 timelines with 5 events each, a TQL query with no where clauses or limits will emit a 50-row dataset, one row for each event.

[Diagram of Timelines + TQL Query (expressions) -> TQL ResultSet with one row per event]

For many machine learning projects, Events may originate from more than one source. For example, in a typical online advertising funnel use case, a user's Timeline might be the combination of a stream of ad opportunities, treatments (i.e. impressions), and conversion activity such as product-views, add-to-carts, and purchases. Roughly, construction of these timelines corresponds to the following SQL pseudocode:

```
SELECT collect_list(o.*, t.*, c.*) FROM
  opportunities o UNION treatments t UNION conversions c
GROUP BY user_id
ORDER GROUPED VALUES BY timestamp
```

TQL queries process each timeline as a unit, emitting a row for each event in the timeline (subject to `where()` and `limit()` clauses, among other things). Because a particular user's timeline contains all of the events observed for that user, sorted by timestamp, computing panel or windowed quantities becomes much simpler than in a traditional SQL system.
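The one-row-per-event processing model described above can be sketched as runnable Python. The timelines and SELECT expressions here are plain Python stand-ins, not TQL's actual API:

```python
# Each timeline is one user's events, already sorted by timestamp;
# the records and field names are illustrative.
timelines = [
    [{"ts": 1, "price": 1.0}, {"ts": 2, "price": 2.0}],  # user 1's events
    [{"ts": 5, "price": 3.0}],                           # user 2's events
]

# Stand-ins for SELECT expressions: each maps an event to one output value.
expressions = [lambda e: e["ts"], lambda e: e["price"] * 2]

# One output row per event, one value per expression.
rows = [
    [exp(event) for exp in expressions]
    for timeline in timelines
    for event in timeline
]
print(rows)  # [[1, 2.0], [2, 4.0], [5, 6.0]]
```

With 2 timelines holding 3 events total and no filtering, the sketch emits 3 rows, mirroring how a TQL query with no `where()` or `limit()` emits one row per event.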
![](https://lh5.googleusercontent.com/8DKADTmvm2395GHBPeMWepSgddW2hNjHkqtPLUx5To1z62wdNK3YxHhuMUdaXnz0oL9MHnspi8SzEE6SCoUhUGR1Xau39XY05GS5Gm4zvPjCUrZKfJX7TiX9ATyOV3-trIi1qQaKzgY)