PostgreSQL¶
Home url : https://www.postgresql.org/
Here is a good comparison: PostgreSQL vs MySQL.
PostgreSQL was chosen because the closest managed service we use on AWS, Redshift, is based on it.
Setup¶
#install ubuntu packages
sudo apt-get install postgresql postgresql-contrib
# check the version
sudo -u postgres psql -c "SELECT version();"
# test the installation
sudo su - postgres
psql #to launch the terminal
\q #to quit
# or to run psql directly
sudo -i -u postgres psql
To keep the permission configuration for new users short, let's create an Ubuntu user with the same name:
sudo adduser sparkstreaming #password sparkstreaming
We need to log into PostgreSQL to create users before they can use the DB.
For our case we are going to create a user called sparkstreaming
and a DB called sparkstreamingdb.
sudo su - postgres
psql #to launch the terminal
# drop user sparkstreaming;
CREATE USER sparkstreaming WITH PASSWORD 'sparkstreaming';
\du #list users
CREATE DATABASE sparkstreamingdb;
grant all privileges on database sparkstreamingdb to sparkstreaming;
\list # to see the DB created
\q
# test the new user and DB
sudo -i -u sparkstreaming psql -d sparkstreamingdb
CREATE TABLE authors (code char(5) NOT NULL, name varchar(40) NOT NULL, city varchar(40) NOT NULL, joined_on date NOT NULL, PRIMARY KEY (code));
INSERT INTO authors VALUES('1','Ravi Saive','Mumbai','2012-08-15');
\dt #list tables
\conninfo #get connection info
To find where Postgres stores its files on disk (useful to make sure the data in Docker can be mapped to a volume for offline storage):
SELECT name, setting FROM pg_settings WHERE category = 'File Locations';
If you would like to restart the service:
sudo service postgresql restart
sudo systemctl restart postgresql
To see how many active connections to the DB are made:
SELECT pid, application_name, state FROM pg_stat_activity;
This is important: sometimes a connection becomes stale and is left hanging there, making new connections bounce back.
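Those stale sessions can be cleared from psql. A minimal sketch, assuming superuser rights (otherwise pg_terminate_backend only works on your own sessions); `__pid__` is a placeholder for a pid taken from the first query:

```sql
-- list idle connections to sparkstreamingdb, oldest state change first
SELECT pid, application_name, state, state_change
FROM pg_stat_activity
WHERE datname = 'sparkstreamingdb' AND state = 'idle'
ORDER BY state_change;

-- terminate a hung backend by its pid; returns true on success
SELECT pg_terminate_backend(__pid__);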
Service management commands:
sudo service postgresql stop
sudo service postgresql start
sudo service postgresql restart
Integration with Spark¶
As part of our workflow we need to read from and write to PostgreSQL from Apache Spark.
Let's test whether we can do it from a local pyspark
terminal.
# the PostgreSQL JDBC driver Maven coordinates must be supplied at runtime
pyspark --packages postgresql:postgresql:9.1-901-1.jdbc4
#read the table "authors"
df = spark.read. \
format("jdbc"). \
option("url", "jdbc:postgresql://localhost:5432/sparkstreamingdb"). \
option("dbtable", "authors"). \
option("user", "sparkstreaming"). \
option("password", "sparkstreaming"). \
option("driver", "org.postgresql.Driver"). \
load()
# display the table schema
df.printSchema()
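Reading works; the write path can be tested in the same pyspark session with the same connection settings. A sketch only: the target table name `authors_copy` is just an illustration (Spark creates it if it does not exist):

```python
# write the dataframe back to Postgres as a new table
df.write. \
    format("jdbc"). \
    option("url", "jdbc:postgresql://localhost:5432/sparkstreamingdb"). \
    option("dbtable", "authors_copy"). \
    option("user", "sparkstreaming"). \
    option("password", "sparkstreaming"). \
    option("driver", "org.postgresql.Driver"). \
    mode("append"). \
    save()
```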
Hive Metastore DB setup¶
We could use a different metastore DB for Hive; the steps below set up PostgreSQL as its external metastore, which can then be shared with Spark.
sudo adduser hive #password hive
sudo su - postgres
psql #to launch the terminal
CREATE USER hive WITH PASSWORD 'hive'; -- DROP USER hive; (if needed)
\du
CREATE DATABASE hive;
grant all privileges on database hive to hive;
\list # to see the DB created
\q
We need to make some changes to the connection configs to enable logins from different services:
# change the 3rd (user) column values to "all" as below, then restart PostgreSQL
sudo vim /etc/postgresql/10/main/pg_hba.conf
# "local" is for Unix domain socket connections only
local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5
Fire up a python shell and test the connection:
import pandas as pd
import psycopg2
conn = psycopg2.connect(host="localhost", port=5432, database="hive", user="hive", password="hive")
sql_command = 'SELECT * FROM "CDS";'
print(sql_command)
# Load the data into a pandas dataframe
data = pd.read_sql(sql_command, conn)
print(data)
Use the Hive-provided schematool to set up the metastore tables and schema:
/path/to/hive/bin/schematool -dbType postgres -initSchema
Then try running the following commands; you should see a bunch of tables there:
sudo -i -u hive psql -d hive
#asks for two passwords: one for sudo, the other for the DB user `hive`, which is `hive`
hive=> \dt
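To sanity-check that schematool really initialized the metastore, the `VERSION` table (part of the standard Hive metastore schema) records the schema version; note the quotes, since Hive creates its tables with uppercase names:

```sql
SELECT * FROM "VERSION";
```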
PSQL¶
Magic words:
psql -U postgres
Some interesting flags (to see all, use -h or --help depending on your psql version):
- -E : will describe the underlying queries of the \ commands (cool for learning!)
- -l : psql will list all databases and then exit (useful if the user you connect with doesn't have a default database, like at AWS RDS)
Most \d commands support an additional param of __schema__.__name__ and accept wildcards like *.*
- \q : Quit/Exit
- \c __database__ : Connect to a database
- \d __table__ : Show table definition including triggers
- \d+ __table__ : More detailed table definition including description and physical disk size
- \l : List databases
- \dy : List events
- \df : List functions
- \di : List indexes
- \dn : List schemas
- \dt *.* : List tables from all schemas (if *.* is omitted it will only show SEARCH_PATH ones)
- \dT+ : List all data types
- \dv : List views
- \df+ __function__ : Show function SQL code
- \x : Pretty-format query results instead of the not-so-useful ASCII tables
- \copy (SELECT * FROM __table_name__) TO 'file_path_and_name.csv' WITH CSV : Export a table as CSV
User Related:
- \du : List users
- \du __username__ : List a username if present
- create role __test1__ : Create a role with an existing username
- create role __test2__ noinherit login password __password__; : Create a role with username and password
- set role __test__; : Change role for current session to __test__
- grant __test2__ to __test1__; : Allow __test1__ to set its role as __test2__
Drop all the tables in a database:
DROP SCHEMA public CASCADE;
CREATE SCHEMA public AUTHORIZATION {USER};
GRANT ALL ON schema public TO {USER};
Configuration¶
Changing verbosity & querying Postgres log:
1) First edit the config file, set a decent verbosity, save and restart postgres:
sudo vim /etc/postgresql/9.3/main/postgresql.conf
# Uncomment/Change inside:
log_min_messages = debug5
log_min_error_statement = debug5
log_min_duration_statement = -1
sudo service postgresql restart
Now you will get tons of details of every statement, error, and even background tasks like VACUUMs
tail -f /var/log/postgresql/postgresql-9.3-main.log
How to add the user who executed a PG statement to the log (editing postgresql.conf):
log_line_prefix = '%t %u %d %a '
Create command¶
There are many CREATE choices, like CREATE DATABASE __database_name__, CREATE TABLE __table_name__… Parameters differ but can be checked at the official documentation.
Handy queries¶
- SELECT * FROM pg_proc WHERE proname='__procedurename__'; : List procedure/function
- SELECT * FROM pg_views WHERE viewname='__viewname__'; : List view (including the definition)
- SELECT pg_size_pretty(pg_total_relation_size('__table_name__')); : Show DB table space in use
- SELECT pg_size_pretty(pg_database_size('__database_name__')); : Show DB space in use
- show statement_timeout; : Show current user's statement timeout
- SELECT * FROM pg_indexes WHERE tablename='__table_name__' AND schemaname='__schema_name__'; : Show table indexes
Get all indexes from all tables of a schema:
SELECT
t.relname AS table_name,
i.relname AS index_name,
a.attname AS column_name
FROM
pg_class t,
pg_class i,
pg_index ix,
pg_attribute a,
pg_namespace n
WHERE
t.oid = ix.indrelid
AND i.oid = ix.indexrelid
AND a.attrelid = t.oid
AND a.attnum = ANY(ix.indkey)
AND t.relnamespace = n.oid
AND n.nspname = 'kartones'
ORDER BY
t.relname,
i.relname
Execution data:
Queries being executed at a certain DB:
SELECT datname, application_name, pid, backend_start, query_start, state_change, state, query
FROM pg_stat_activity
WHERE datname='__database_name__';
Get all queries from all dbs waiting for data (might be hung):
SELECT * FROM pg_stat_activity WHERE waiting='t'; -- on PostgreSQL 9.6+ the waiting column is gone; use WHERE wait_event IS NOT NULL instead
Currently running queries with process pid:
SELECT pg_stat_get_backend_pid(s.backendid) AS procpid,
pg_stat_get_backend_activity(s.backendid) AS current_query
FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
Casting:
- CAST (column AS type) or column::type
- '__table_name__'::regclass::oid : Get oid having a table name
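A few concrete casts as illustration (the authors table is the one created in the Setup section):

```sql
SELECT CAST('42' AS integer);     -- explicit CAST syntax
SELECT '2012-08-15'::date;        -- shorthand :: syntax
SELECT 'authors'::regclass::oid;  -- oid of the authors table
```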
Query analysis:
- EXPLAIN __query__ : see the query plan for the given query
- EXPLAIN ANALYZE __query__ : see and execute the query plan for the given query
- ANALYZE [__table__] : collect statistics
Generating random data:
INSERT INTO some_table (a_float_value) SELECT random() * 100000 FROM generate_series(1, 1000000) i;
Keyboard shortcuts
- CTRL + R : reverse-i-search
References
https://www.tecmint.com/install-postgresql-on-ubuntu/
https://linuxize.com/post/how-to-install-postgresql-on-ubuntu-18-04/
https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
https://linuxize.com/post/how-to-create-a-sudo-user-on-ubuntu/
https://stackoverflow.com/questions/21898152/why-cant-you-start-postgres-in-docker-using-service-postgres-start
https://markheath.net/post/exploring-postgresql-with-docker
ptop and pg_top: top for PG. Available on the APT repository from apt.postgresql.org.
pg_activity: Command line tool for PostgreSQL server activity monitoring.
To enable the CTRL+R reverse-i-search when psql is linked against libedit (e.g. on macOS):
$ echo "bind "^R" em-inc-search-prev" > $HOME/.editrc
$ source $HOME/.editrc
PostgreSQL Exercises: An awesome resource to learn SQL, teaching you with simple examples in a great visual way. Highly recommended.
A Performance Cheat Sheet for PostgreSQL: Great explanations of EXPLAIN, EXPLAIN ANALYZE, VACUUM, configuration parameters and more. Quite interesting if you need to tune up a postgres setup.