Configuring TQL
This guide will show you how to configure and customize TQL to fit your needs.
The TQL Configuration File
TQL is configured by a YAML file. To find the location of this configuration file, run the command tql status:
(tql-venv) jshmoe$ tql status
------------------------------------------------------------------
Version: 20.1.16
Version Timestamp: 2021-07-21 17:34:23
Version Age: 6 days, 17 hours, 56 minutes, 6 seconds
Filesystem Root: /home/jshmoe/.tql/files
Working Directory: /home/jshmoe/.tql
Configuration File: /home/jshmoe/.tql/tql_conf.yml
Api Gateway: http://localhost:9000
Service Status: Icarus: OFFLINE, Daedalus: OFFLINE
Service Uptime:
------------------------------------------------------------------
You can also view the current configuration by running the command tql conf:
(tql-venv) jshmoe$ tql conf
conf_path: /home/jshmoe/.tql/tql_conf.yml
database:
  nanml.standalone.db.h2-disk.directory: /home/jshmoe/.tql/db
  nanml.standalone.db.instance_type: h2-disk
filesystem:
  root: /home/jshmoe/.tql/files
icarus:
...
Modifying the Configuration
Edit the configuration file (in this example, /home/jshmoe/.tql/tql_conf.yml) using your editor of choice. Afterward, restart TQL with the command tql restart to apply your configuration changes.
Configuration Sections
The configuration file is divided into four sections: filesystem, pyspark, icarus, and database.
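As a rough orientation, the four sections could sit side by side in one file as sketched below. Every value is a default quoted elsewhere in this guide; the nesting is reconstructed from those snippets, and the placement of memory directly under icarus is an assumption rather than something the guide confirms.

```yaml
# Sketch of tql_conf.yml showing all four top-level sections.
# Values are the defaults quoted in this guide; nesting is reconstructed.
filesystem:
  root: ~/.tql/files/
pyspark:
  conf:
    spark.master: local[*]
    spark.driver.memory: 1024m
    spark.executor.memory: 1024m
icarus:
  configuration:
    server:
      applicationConnectors:
        - port: 9000
          type: http
      adminConnectors:
        - port: 9001
          type: http
  memory: 512m  # assumed to sit beside "configuration", not inside it
database:
  nanml.standalone.db.instance_type: h2-disk
  nanml.standalone.db.h2-disk.directory: ~/.tql/db
```

Compare the result against the output of tql conf after restarting to confirm how TQL actually interpreted the file.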
Filesystem
The backend reads and writes files as part of its normal operation. Set the root directory for these files by adding the following section to your configuration file:
filesystem:
  root: /path/to/filesystem/root
By default, the backend reads and writes to the local filesystem, in the directory ~/.tql/files/. However, if you deploy the backend to a cluster environment, such as an AWS EMR cluster or GCP Dataproc, all processing nodes must have access to the filesystem. In that case, configure the backend to write to cloud storage, such as S3 or GCS.
Amazon S3 Filesystem
To read/write to Amazon S3, use the following configuration:
filesystem:
  root: s3://your-bucket/sub-folder/
  s3:
    s3_access_key: YOUR_ACCESS_KEY
    s3_secret_key: YOUR_SECRET_KEY
    s3_region: YOUR_REGION
Google GCS Filesystem
To read/write to GCS, use the following configuration:
filesystem:
  root: gs://your-bucket/sub-folder/
  gcs:
    type: "service_account"
    project_id: YOUR_PROJECT_ID
    private_key_id: YOUR_PRIVATE_KEY_ID
    private_key: "YOUR_PRIVATE_KEY"
    client_email: YOUR_PROJECT_ID@appspot.gserviceaccount.com
    client_id: YOUR_CLIENT_ID
    auth_uri: "https://accounts.google.com/o/oauth2/auth"
    token_uri: "https://oauth2.googleapis.com/token"
    auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs"
    client_x509_cert_url: "https://www.googleapis.com..."
PySpark
TQL installs with a basic default configuration of PySpark. By default, TQL uses a local, embedded instance of Spark, denoted by local[*], with 1024m of memory for both the driver and the executor. You can customize the Spark instance using the following configuration, changing the values from the defaults shown here:
pyspark:
  conf:
    spark.master: local[*]
    spark.driver.memory: 1024m
    spark.executor.memory: 1024m
    ...
You may add any Spark application properties you wish under pyspark -> conf. In particular, you may wish to connect to a different type of Spark master, such as yarn. This configuration change is required to use TQL on an Amazon EMR or Google Dataproc cluster.
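For instance, a cluster deployment might switch the master to yarn and raise the memory settings. The fragment below is only a sketch: the property names are standard Spark application properties, but the memory sizes and executor count are illustrative values, not TQL recommendations.

```yaml
# Sketch: point TQL's PySpark at a YARN-managed cluster.
# Sizes and counts are placeholder values; tune them for your cluster.
pyspark:
  conf:
    spark.master: yarn
    spark.driver.memory: 2g
    spark.executor.memory: 4g
    spark.executor.instances: 4
```

As with any other configuration change, run tql restart afterward so the new Spark settings take effect.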
Icarus
TQL’s REST API server, named Icarus, acts as a gateway between the Python user interface and the Java query execution engine. By default, Icarus listens on port 9000 (application) and port 9001 (admin) and runs with 512m of memory. You can customize the web service’s memory and ports using the following configuration, changing the values from the defaults shown here:
icarus:
  configuration:
    server:
      adminConnectors:
        - port: 9001
          type: http
      applicationConnectors:
        - port: 9000
          type: http
  memory: 512m
Under the hood, Icarus uses the Dropwizard web framework, and you can further customize its configuration in this section of the config file. See the Dropwizard configuration reference for the available settings.
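As one possible customization, Dropwizard's standard logging block could be added alongside server. This sketch assumes that TQL passes everything under icarus -> configuration through to Dropwizard unchanged, which this guide implies but does not state; the chosen level is illustrative.

```yaml
# Sketch: raise the default log threshold for Icarus.
# "logging" and "level" are standard Dropwizard settings; the assumption
# is that TQL forwards this block to Dropwizard as-is.
icarus:
  configuration:
    logging:
      level: WARN
```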
Database
TQL uses a SQL database to store its metadata about Projects, Timelines, Queries, and Resultsets. By default, TQL uses a file-based H2 database, with the following configuration:
database:
  nanml.standalone.db.instance_type: h2-disk
  nanml.standalone.db.h2-disk.directory: ~/.tql/db
However, you may wish to use a persistent MySQL datastore. To do so, use the following configuration instead:
database:
  nanml.db.instance_type: mysql
  nanml.mysql.db.host: 172.25.0.99