{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2020 The TensorFlow IO Authors." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2021-01-14T19:12:55.892327Z", "iopub.status.busy": "2021-01-14T19:12:55.891680Z", "iopub.status.idle": "2021-01-14T19:12:55.894233Z", "shell.execute_reply": "2021-01-14T19:12:55.893731Z" }, "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "qFdPvlXBOdUN" }, "source": [ "# Reading PostgreSQL database from TensorFlow IO" ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
\n", " View on TensorFlow.org\n", " \n", " Run in Google Colab\n", " \n", " View source on GitHub\n", " \n", " Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "## Overview\n", "\n", "This tutorial shows how to create `tf.data.Dataset` from a PostgreSQL database server, so that the created `Dataset` could be passed to `tf.keras` for training or inference purposes.\n", "\n", "A SQL database is an important source of data for data scientist. As one of the most popular open source SQL database, [PostgreSQL](https://www.postgresql.org) is widely used in enterprises for storing critial and transactional data across the board. Creating `Dataset` from a PostgreSQL database server directly and pass the `Dataset` to `tf.keras` for training or inference, could greatly simplify the data pipeline and help data scientist to focus on building machine learning models." ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "## Setup and usage" ] }, { "cell_type": "markdown", "metadata": { "id": "upgCc3gXybsA" }, "source": [ "### Install required tensorflow-io packages, and restart runtime" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2021-01-14T19:12:55.904936Z", "iopub.status.busy": "2021-01-14T19:12:55.904274Z", "iopub.status.idle": "2021-01-14T19:12:58.590466Z", "shell.execute_reply": "2021-01-14T19:12:58.590925Z" }, "id": "uUDYyMZRfkX4" }, "outputs": [], "source": [ "try:\n", " %tensorflow_version 2.x\n", "except Exception:\n", " pass\n", "\n", "!pip install -q tensorflow-io" ] }, { "cell_type": "markdown", "metadata": { "id": "yZmI7l_GykcW" }, "source": [ "### Install and setup PostgreSQL (optional)\n", "\n", "**Warning: This notebook is designed to be run in a Google Colab only**. *It installs packages on the system and requires sudo access. If you want to run it in a local Jupyter notebook, please proceed with caution.*\n", "\n", "In order to demo the usage on Google Colab you will install PostgreSQL server. 
A password and an empty database are also needed.\n", "\n", "If you are not running this notebook on Google Colab, or you prefer to use an existing database, please skip the following setup and proceed to the next section." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2021-01-14T19:12:58.602136Z", "iopub.status.busy": "2021-01-14T19:12:58.601434Z", "iopub.status.idle": "2021-01-14T19:13:16.177670Z", "shell.execute_reply": "2021-01-14T19:13:16.177063Z" }, "id": "YUj0878jPyz7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Preconfiguring packages ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Selecting previously unselected package libpq5:amd64.\r\n", "(Reading database ... \r", "(Reading database ... 5%\r", "(Reading database ... 10%\r", "(Reading database ... 15%\r", "(Reading database ... 20%\r", "(Reading database ... 25%\r", "(Reading database ... 30%\r", "(Reading database ... 35%\r", "(Reading database ... 40%\r", "(Reading database ... 45%\r", "(Reading database ... 50%\r", "(Reading database ... 55%\r", "(Reading database ... 60%\r", "(Reading database ... 65%\r", "(Reading database ... 70%\r", "(Reading database ... 75%\r", "(Reading database ... 80%\r", "(Reading database ... 85%\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(Reading database ... 90%\r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "(Reading database ... 95%\r", "(Reading database ... 100%\r", "(Reading database ... 
254633 files and directories currently installed.)\r\n", "Preparing to unpack .../0-libpq5_10.15-0ubuntu0.18.04.1_amd64.deb ...\r\n", "Unpacking libpq5:amd64 (10.15-0ubuntu0.18.04.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Selecting previously unselected package postgresql-client-common.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Preparing to unpack .../1-postgresql-client-common_190ubuntu0.1_all.deb ...\r\n", "Unpacking postgresql-client-common (190ubuntu0.1) ...\r\n", "Selecting previously unselected package postgresql-client-10.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Preparing to unpack .../2-postgresql-client-10_10.15-0ubuntu0.18.04.1_amd64.deb ...\r\n", "Unpacking postgresql-client-10 (10.15-0ubuntu0.18.04.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Selecting previously unselected package ssl-cert.\r\n", "Preparing to unpack .../3-ssl-cert_1.0.39_all.deb ...\r\n", "Unpacking ssl-cert (1.0.39) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Selecting previously unselected package postgresql-common.\r\n", "Preparing to unpack .../4-postgresql-common_190ubuntu0.1_all.deb ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Adding 'diversion of /usr/bin/pg_config to /usr/bin/pg_config.libpq-dev by postgresql-common'\r\n", "Unpacking postgresql-common (190ubuntu0.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Selecting previously unselected package postgresql-10.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Preparing to unpack .../5-postgresql-10_10.15-0ubuntu0.18.04.1_amd64.deb ...\r\n", "Unpacking postgresql-10 (10.15-0ubuntu0.18.04.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Selecting previously unselected package postgresql.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Preparing to unpack 
.../6-postgresql_10+190ubuntu0.1_all.deb ...\r\n", "Unpacking postgresql (10+190ubuntu0.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Selecting previously unselected package sysstat.\r\n", "Preparing to unpack .../7-sysstat_11.6.1-1ubuntu0.1_amd64.deb ...\r\n", "Unpacking sysstat (11.6.1-1ubuntu0.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting up sysstat (11.6.1-1ubuntu0.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Creating config file /etc/default/sysstat with new version\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "update-alternatives: using /usr/bin/sar.sysstat to provide /usr/bin/sar (sar) in auto mode\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Created symlink /etc/systemd/system/multi-user.target.wants/sysstat.service → /lib/systemd/system/sysstat.service.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting up ssl-cert (1.0.39) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting up libpq5:amd64 (10.15-0ubuntu0.18.04.1) ...\r\n", "Setting up postgresql-client-common (190ubuntu0.1) ...\r\n", "Setting up postgresql-common (190ubuntu0.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Adding user postgres to group ssl-cert\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Creating config file /etc/postgresql-common/createcluster.conf with new version\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Building PostgreSQL dictionaries from installed myspell/hunspell packages...\r\n", "Removing obsolete dictionary files:\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Created symlink /etc/systemd/system/multi-user.target.wants/postgresql.service → /lib/systemd/system/postgresql.service.\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting up postgresql-client-10 
(10.15-0ubuntu0.18.04.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "update-alternatives: using /usr/share/postgresql/10/man/man1/psql.1.gz to provide /usr/share/man/man1/psql.1.gz (psql.1.gz) in auto mode\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting up postgresql-10 (10.15-0ubuntu0.18.04.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Creating new PostgreSQL cluster 10/main ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "/usr/lib/postgresql/10/bin/initdb -D /var/lib/postgresql/10/main --auth-local peer --auth-host md5\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "The files belonging to this database system will be owned by user \"postgres\".\r\n", "This user must also own the server process.\r\n", "\r\n", "The database cluster will be initialized with locale \"C.UTF-8\".\r\n", "The default database encoding has accordingly been set to \"UTF8\".\r\n", "The default text search configuration will be set to \"english\".\r\n", "\r\n", "Data page checksums are disabled.\r\n", "\r\n", "fixing permissions on existing directory /var/lib/postgresql/10/main ... ok\r\n", "creating subdirectories ... ok\r\n", "selecting default max_connections ... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "100\r\n", "selecting default shared_buffers ... 128MB\r\n", "selecting default timezone ... Etc/UTC\r\n", "selecting dynamic shared memory implementation ... posix\r\n", "creating configuration files ... ok\r\n", "running bootstrap script ... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "ok\r\n", "performing post-bootstrap initialization ... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "ok\r\n", "syncing data to disk ... " ] }, { "name": "stdout", "output_type": "stream", "text": [ "ok\r\n", "\r\n", "Success. 
You can now start the database server using:\r\n", "\r\n", " /usr/lib/postgresql/10/bin/pg_ctl -D /var/lib/postgresql/10/main -l logfile start\r\n", "\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Ver Cluster Port Status Owner Data directory Log file\r\n", "\u001b[31m10 main 5432 down postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log\u001b[0m\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "update-alternatives: using /usr/share/postgresql/10/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Setting up postgresql (10+190ubuntu0.1) ...\r\n", "Processing triggers for man-db (2.8.3-2ubuntu0.1) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Processing triggers for ureadahead (0.100.0-21) ...\r\n", "Processing triggers for libc-bin (2.27-3ubuntu1.2) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Processing triggers for systemd (237-3ubuntu10.38) ...\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "ALTER ROLE\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "NOTICE: database \"tfio_demo\" does not exist, skipping\r\n", "DROP DATABASE\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CREATE DATABASE\r\n" ] } ], "source": [ "# Install postgresql server\n", "!sudo apt-get -y -qq update\n", "!sudo apt-get -y -qq install postgresql\n", "!sudo service postgresql start\n", "\n", "# Setup a password `postgres` for username `postgres`\n", "!sudo -u postgres psql -U postgres -c \"ALTER USER postgres PASSWORD 'postgres';\"\n", "\n", "# Setup a database with name `tfio_demo` to be used\n", "!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;'\n", "!sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;'" ] }, { "cell_type": "markdown", "metadata": { "id": "e4_59-RLMRgT" }, 
"source": [ "### Setup necessary environmental variables\n", "\n", "The following environmental variables are based on the PostgreSQL setup in the last section. If you have a different setup or you are using an existing database, they should be changed accordingly:\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2021-01-14T19:13:16.186713Z", "iopub.status.busy": "2021-01-14T19:13:16.186039Z", "iopub.status.idle": "2021-01-14T19:13:16.188471Z", "shell.execute_reply": "2021-01-14T19:13:16.188952Z" }, "id": "0dRotqDMswcK" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: TFIO_DEMO_DATABASE_NAME=tfio_demo\n", "env: TFIO_DEMO_DATABASE_HOST=localhost\n", "env: TFIO_DEMO_DATABASE_PORT=5432\n", "env: TFIO_DEMO_DATABASE_USER=postgres\n", "env: TFIO_DEMO_DATABASE_PASS=postgres\n" ] } ], "source": [ "%env TFIO_DEMO_DATABASE_NAME=tfio_demo\n", "%env TFIO_DEMO_DATABASE_HOST=localhost\n", "%env TFIO_DEMO_DATABASE_PORT=5432\n", "%env TFIO_DEMO_DATABASE_USER=postgres\n", "%env TFIO_DEMO_DATABASE_PASS=postgres" ] }, { "cell_type": "markdown", "metadata": { "id": "9reCVv0mE_9O" }, "source": [ "### Prepare data in PostgreSQL server\n", "\n", "For demo purposes this tutorial will create a database and populate the database with some data. 
The data used in this tutorial is from the [Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality), available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml).\n", "\n", "Below is a sneak preview of a subset of the Air Quality Data Set:\n", "\n", "Date|Time|CO(GT)|PT08.S1(CO)|NMHC(GT)|C6H6(GT)|PT08.S2(NMHC)|NOx(GT)|PT08.S3(NOx)|NO2(GT)|PT08.S4(NO2)|PT08.S5(O3)|T|RH|AH|\n", "----|----|------|-----------|--------|--------|-------------|----|----------|-------|------------|-----------|-|--|--|\n", "10/03/2004|18.00.00|2,6|1360|150|11,9|1046|166|1056|113|1692|1268|13,6|48,9|0,7578|\n", "10/03/2004|19.00.00|2|1292|112|9,4|955|103|1174|92|1559|972|13,3|47,7|0,7255|\n", "10/03/2004|20.00.00|2,2|1402|88|9,0|939|131|1140|114|1555|1074|11,9|54,0|0,7502|\n", "10/03/2004|21.00.00|2,2|1376|80|9,2|948|172|1092|122|1584|1203|11,0|60,0|0,7867|\n", "10/03/2004|22.00.00|1,6|1272|51|6,5|836|131|1205|116|1490|1110|11,2|59,6|0,7888|\n", "\n", "More information about the Air Quality Data Set and the UCI Machine Learning Repository is available in the [References](#references) section.\n", "\n", "To help simplify the data preparation, a SQL version of the Air Quality Data Set has been prepared and is available as [AirQualityUCI.sql](https://github.com/tensorflow/io/blob/master/docs/tutorials/postgresql/AirQualityUCI.sql).\n", "\n", "The statement to create the table is:\n", "```\n", "CREATE TABLE AirQualityUCI (\n", " Date DATE,\n", " Time TIME,\n", " CO REAL,\n", " PT08S1 INT,\n", " NMHC REAL,\n", " C6H6 REAL,\n", " PT08S2 INT,\n", " NOx REAL,\n", " PT08S3 INT,\n", " NO2 REAL,\n", " PT08S4 INT,\n", " PT08S5 INT,\n", " T REAL,\n", " RH REAL,\n", " AH REAL\n", ");\n", "```\n", "\n", "The complete commands to create the table in the database and populate the data are:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2021-01-14T19:13:16.196033Z", "iopub.status.busy": "2021-01-14T19:13:16.193914Z", 
"iopub.status.idle": "2021-01-14T19:13:24.152303Z", "shell.execute_reply": "2021-01-14T19:13:24.152767Z" }, "id": "W1eVidg3JrPV" }, "outputs": [], "source": [ "!curl -s -OL https://github.com/tensorflow/io/raw/master/docs/tutorials/postgresql/AirQualityUCI.sql\n", "\n", "!PGPASSWORD=$TFIO_DEMO_DATABASE_PASS psql -q -h $TFIO_DEMO_DATABASE_HOST -p $TFIO_DEMO_DATABASE_PORT -U $TFIO_DEMO_DATABASE_USER -d $TFIO_DEMO_DATABASE_NAME -f AirQualityUCI.sql" ] }, { "cell_type": "markdown", "metadata": { "id": "acEST3amdyDI" }, "source": [ "### Create Dataset from PostgreSQL server and use it in TensorFlow\n", "\n", "Create a Dataset from PostgreSQL server is as easy as calling `tfio.experimental.IODataset.from_sql` with `query` and `endpoint` arguments. The `query` is the SQL query for select columns in tables and the `endpoint` argument is the address and database name:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2021-01-14T19:13:24.158642Z", "iopub.status.busy": "2021-01-14T19:13:24.157967Z", "iopub.status.idle": "2021-01-14T19:13:31.202803Z", "shell.execute_reply": "2021-01-14T19:13:31.202255Z" }, "id": "h21RdP7meGzP" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'co': TensorSpec(shape=(), dtype=tf.float32, name=None), 'pt08s1': TensorSpec(shape=(), dtype=tf.int32, name=None)}\n" ] } ], "source": [ "import os\n", "import tensorflow_io as tfio\n", "\n", "endpoint=\"postgresql://{}:{}@{}?port={}&dbname={}\".format(\n", " os.environ['TFIO_DEMO_DATABASE_USER'],\n", " os.environ['TFIO_DEMO_DATABASE_PASS'],\n", " os.environ['TFIO_DEMO_DATABASE_HOST'],\n", " os.environ['TFIO_DEMO_DATABASE_PORT'],\n", " os.environ['TFIO_DEMO_DATABASE_NAME'],\n", ")\n", "\n", "dataset = tfio.experimental.IODataset.from_sql(\n", " query=\"SELECT co, pt08s1 FROM AirQualityUCI;\",\n", " endpoint=endpoint)\n", "\n", "print(dataset.element_spec)" ] }, { "cell_type": "markdown", "metadata": { "id": "8y-VpwcWNYTF" }, 
"source": [ "As you could see from the output of `dataset.element_spec` above, the element of the created `Dataset` is a python dict object with column names of the database table as keys.\n", "It is quite convenient to apply further operations. For example, you could select both `nox` and `no2` field of the `Dataset`, and calculate the difference:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2021-01-14T19:13:31.208127Z", "iopub.status.busy": "2021-01-14T19:13:31.207477Z", "iopub.status.idle": "2021-01-14T19:13:31.270351Z", "shell.execute_reply": "2021-01-14T19:13:31.269740Z" }, "id": "qCoueXYZOvqZ" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NOx - NO2:\n", "53.0\n", "11.0\n", "17.0\n", "50.0\n", "15.0\n", "-7.0\n", "-15.0\n", "-14.0\n", "-15.0\n", "0.0\n", "-13.0\n", "-12.0\n", "-14.0\n", "16.0\n", "62.0\n", "28.0\n", "14.0\n", "3.0\n", "9.0\n", "34.0\n" ] } ], "source": [ "dataset = tfio.experimental.IODataset.from_sql(\n", " query=\"SELECT nox, no2 FROM AirQualityUCI;\",\n", " endpoint=endpoint)\n", "\n", "dataset = dataset.map(lambda e: (e['nox'] - e['no2']))\n", "\n", "# check only the first 20 record\n", "dataset = dataset.take(20)\n", "\n", "print(\"NOx - NO2:\")\n", "for difference in dataset:\n", " print(difference.numpy())" ] }, { "cell_type": "markdown", "metadata": { "id": "xO2pheWEPQSU" }, "source": [ "The created `Dataset` is ready to be passed to `tf.keras` directly for either training or inference purposes now." ] }, { "cell_type": "markdown", "metadata": { "id": "AzQoxSlHVv1k" }, "source": [ "## References\n", "\n", "- Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.\n", "- S. De Vito, E. Massera, M. Piga, L. Martinotto, G. 
Di Francia, On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005" ] } ], "metadata": { "colab": { "collapsed_sections": [ "Tce3stUlHN0L" ], "name": "postgresql.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 0 }