Configuring TQL

This guide will show you how to configure and customize TQL to fit your needs.

The TQL Configuration File

TQL is configured by a YAML file. To find the location of this configuration file, run the command tql status with TQL’s command line tool:

(tql-venv) jshmoe$ tql status
------------------------------------------------------------------
           Version: 20.1.16
 Version Timestamp: 2021-07-21 17:34:23
       Version Age: 6 days, 17 hours, 56 minutes, 6 seconds
   Filesystem Root: /home/jshmoe/.tql/files
 Working Directory: /home/jshmoe/.tql
Configuration File: /home/jshmoe/.tql/tql_conf.yml
       Api Gateway: http://localhost:9000
    Service Status: Icarus: OFFLINE, Daedalus: OFFLINE
    Service Uptime: 
------------------------------------------------------------------

You can also see the current configuration by running the command tql conf:

(tql-venv) jshmoe$ tql conf
conf_path: /home/jshmoe/.tql/tql_conf.yml
database:
  nanml.standalone.db.h2-disk.directory: /home/jshmoe/.tql/db
  nanml.standalone.db.instance_type: h2-disk
filesystem:
  root: /home/jshmoe/.tql/files
icarus:
...
NOTE: The TQL configuration file is generated only on the first startup of the TQL backend. If you have just installed TQL but have not yet used it, run the command tql start first to generate a configuration file to edit.

Modifying the Configuration

Edit the configuration file (in the example above, /home/jshmoe/.tql/tql_conf.yml) using your editor of choice. Afterward, restart TQL using the command tql restart to apply your configuration changes.

Configuration Sections

The configuration file is broken up into four sections: filesystem, pyspark, icarus, and database.

Filesystem

The backend reads and writes files as part of its normal operation. To set the location it uses, add the following section to your configuration file:

filesystem:
  root: /path/to/filesystem/root

By default, the backend reads and writes to the local filesystem, in the directory ~/.tql/files/. However, if you deploy the backend to a cluster environment, such as an AWS EMR Cluster or GCP Dataproc, all processing nodes must have access to the filesystem. In that case, configure the backend to write to cloud storage, such as S3 or GCS.

Amazon S3 Filesystem

To read/write to Amazon S3, use the following configuration:

filesystem:
  root: s3://your-bucket/sub-folder/
  s3:
    s3_access_key: YOUR_ACCESS_KEY
    s3_secret_key: YOUR_SECRET_KEY
    s3_region: YOUR_REGION

Google GCS Filesystem

To read/write to GCS, use the following configuration:

filesystem:
  root: gs://your-bucket/sub-folder/
  gcs:
    type: "service_account"
    project_id: YOUR_PROJECT_ID
    private_key_id: YOUR_PRIVATE_KEY_ID
    private_key: "YOUR_PRIVATE_KEY"
    client_email: YOUR_PROJECT_ID@appspot.gserviceaccount.com
    client_id: YOUR_CLIENT_ID
    auth_uri: "https://accounts.google.com/o/oauth2/auth"
    token_uri: "https://oauth2.googleapis.com/token"
    auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs"
    client_x509_cert_url: "https://www.googleapis.com..."

PySpark

TQL installs with a basic default configuration of PySpark. By default, TQL uses a local, embedded instance of Spark, denoted by local[*], with 1024m of memory each for the driver and the executor. You can customize the Spark instance using the following configuration, changing the values from their defaults shown here:

pyspark:
  conf:
    spark.master: local[*]
    spark.driver.memory: 1024m
    spark.executor.memory: 1024m
    ...

You may add any Spark application properties you wish under pyspark -> conf. In particular, you may wish to connect to a different type of Spark master, such as yarn. This configuration change is required in order to use TQL on an Amazon EMR or Google Dataproc Cluster.
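For instance, a minimal sketch of pointing TQL at a YARN-managed cluster might look like the following. The value yarn is Spark's standard master setting for YARN; the memory values are illustrative placeholders rather than recommendations, and a real cluster may need further Spark properties that are not shown here:

```yaml
pyspark:
  conf:
    spark.master: yarn              # use the cluster's YARN resource manager
    spark.driver.memory: 2048m      # placeholder value; size for your workload
    spark.executor.memory: 4096m    # placeholder value; size for your workload
```

As with any configuration change, run tql restart afterward so the new settings take effect.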

Icarus

TQL’s REST API server, named Icarus, acts as a gateway between the Python user interface and the Java query execution engine. By default, Icarus runs on ports 9000 (application) and 9001 (admin) with 512m of memory. You can customize the web service’s memory and ports using the following configuration, changing the values from their defaults shown here:

icarus:
  configuration:
    server:
      adminConnectors:
      - port: 9001
        type: http
      applicationConnectors:
      - port: 9000
        type: http
  memory: 512m

Under the hood, Icarus uses the Dropwizard web framework, and you can further customize its configuration in this section of the config file. See the Dropwizard configuration documentation for the available settings.
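As an illustration, the sketch below moves Icarus to different ports and gives it more memory. The port numbers and memory size are arbitrary example values, and the logging block assumes that standard Dropwizard settings are accepted under icarus -> configuration, as the paragraph above suggests:

```yaml
icarus:
  configuration:
    server:
      adminConnectors:
      - port: 9101          # example value; the default is 9001
        type: http
      applicationConnectors:
      - port: 9100          # example value; the default is 9000
        type: http
    logging:
      level: INFO           # standard Dropwizard setting, assumed to be honored here
  memory: 1024m             # example value; the default is 512m
```

After tql restart, the Api Gateway line of tql status is a reasonable place to confirm which port is in use.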

Database

TQL uses a SQL database to store its metadata about Projects, Timelines, Queries, and Resultsets. By default, TQL uses a file-based H2 database, with the following configuration:

database:
  nanml.standalone.db.instance_type: h2-disk
  nanml.standalone.db.h2-disk.directory: ~/.tql/db

However, you may wish to use a persistent MySQL datastore. To do so, use the following configuration instead:

database:
  nanml.db.instance_type: mysql
  nanml.mysql.db.host: 172.25.0.99