# Concepts and Architecture

## The Timeline

In TQL, a **Timeline** is simply a collection of **Events** observed for a single user, ordered by timestamp. Events may come from one or more **TimeSeries**, which are added to a **Project**, then synthesized into a set of **Timelines** and saved to persistent storage.

### The TimeSeries

Every TQL project starts with one or more sources of event log data, stored in files on disk, in cloud storage, or in a database system. The TimeSeries object tells TQL how to read this data out of your storage system and into your workspace. Each TimeSeries mapping must declare:

1. A name for the TimeSeries, such as `bids`
2. A source for the data: either flat files or a SQL SELECT statement
3. A user_id field and a timestamp field

For example:

```python
from zeenk.tql import *

# define the bids timeseries
bids = create_timeseries('bids')
bids.from_files('/path/to/bid/logs')
bids.identified_by('user_id', 'timestamp')

# define the site activity timeseries
site_activity = create_timeseries('activity')
...
```

### The Project

A Project consists of TimeSeries objects, usually more than one. For example, a typical online advertising project might consist of a set of advertising auction logs and a separate set of user site activity logs that share a common user_id. For example:

```python
project = create_project('my_project')
project.from_timeseries(bids, site_activity)
```

### Building Timelines

Timelines are constructed when `project.build_timelines()` is called. At this point, the data is actually read from your storage system (files or SQL). TQL groups all events from all TimeSeries by user_id, sorts each group by timestamp, and then writes the result to persistent storage using a nested object schema that contains the fields of all the input TimeSeries side by side.
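Conceptually, this group-and-sort step can be sketched in plain Python. The flat event records and field names below are illustrative stand-ins, not TQL's actual storage schema or API:

```python
from collections import defaultdict

# Hypothetical flat event records, as they might arrive from two
# input TimeSeries ("bids" and "activity"); field names are illustrative.
events = [
    {"user_id": "u1", "timestamp": 300, "source": "activity"},
    {"user_id": "u2", "timestamp": 100, "source": "bids"},
    {"user_id": "u1", "timestamp": 100, "source": "bids"},
    {"user_id": "u1", "timestamp": 200, "source": "bids"},
]

def group_into_timelines(events):
    """Group events by user_id, then sort each user's events by timestamp."""
    timelines = defaultdict(list)
    for e in events:
        timelines[e["user_id"]].append(e)
    for user_events in timelines.values():
        user_events.sort(key=lambda e: e["timestamp"])
    return dict(timelines)

timelines = group_into_timelines(events)
print([e["timestamp"] for e in timelines["u1"]])  # [100, 200, 300]
```

In TQL itself this work happens inside the engine when you build the project; the sketch only shows the grouping semantics.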
```python
project.build_timelines()
```

### The Timeline Data Structure

TODO

## Executing Queries against Timelines

Each Timeline contains **all** the events from each TimeSeries for that particular user, and Timelines are the source of data that TQL queries are executed against. This is analogous to a table in a traditional SQL system. When a TQL query `SELECT`s one or more columns `FROM` a set of Timelines, **one row is emitted for each event in each Timeline**. The data processing model can be expressed in the following pseudocode:

```python
for Timeline t in all Timelines:
    for Event e in t.events:
        for expression exp in SELECT(exp1, exp2, ...):
            emit row vector with one value for each expression
```

For example, if your input data has 10 timelines with 5 events each, a TQL query with no where clauses or limits will emit a 50-row dataset, one row for each event.

[Diagram of Timelines + TQL Query (expressions) -> TQL ResultSet with one row per event]

For many machine learning projects, Events may originate from more than one source. For example, in a typical online advertising funnel use case, a user's Timeline might be the combination of a stream of ad opportunities, treatments (i.e. impressions), and conversion activity such as product-views, add-to-carts, and purchases. Roughly, construction of these timelines corresponds to the following SQL pseudocode:

```
SELECT collect_list(o.*, t.*, c.*) FROM
  opportunities o UNION treatments t UNION conversions c
GROUP BY user_id
ORDER GROUPED VALUES BY timestamp
```

TQL queries process each timeline as a unit, emitting a row for each event in the timeline (subject to `where()` and `limit()` clauses, among other things). Because a particular user's timeline contains all of the events observed for that user, sorted by timestamp, computing panel or windowed quantities becomes much simpler than in a traditional SQL system.
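The one-row-per-event processing model described above can be sketched as runnable Python. The timelines and SELECT expressions here are plain Python stand-ins, not TQL's actual API:

```python
# Each timeline is one user's events, already sorted by timestamp;
# the records and field names are illustrative.
timelines = [
    [{"ts": 1, "price": 1.0}, {"ts": 2, "price": 2.0}],  # user 1's events
    [{"ts": 5, "price": 3.0}],                           # user 2's events
]

# Stand-ins for SELECT expressions: each maps an event to one output value.
expressions = [lambda e: e["ts"], lambda e: e["price"] * 2]

# One output row per event, one value per expression.
rows = [
    [exp(event) for exp in expressions]
    for timeline in timelines
    for event in timeline
]
print(rows)  # [[1, 2.0], [2, 4.0], [5, 6.0]]
```

With 2 timelines holding 3 events total and no filtering, the sketch emits 3 rows, mirroring how a TQL query with no `where()` or `limit()` emits one row per event.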
![](https://lh5.googleusercontent.com/8DKADTmvm2395GHBPeMWepSgddW2hNjHkqtPLUx5To1z62wdNK3YxHhuMUdaXnz0oL9MHnspi8SzEE6SCoUhUGR1Xau39XY05GS5Gm4zvPjCUrZKfJX7TiX9ATyOV3-trIi1qQaKzgY)