PostgreSQL¶
Home url : https://www.postgresql.org/
Here is a good comparison: PostgreSQL vs MySQL.
PostgreSQL was chosen because the closest managed service we use on AWS, Redshift, is based on it.
Setup¶
#install ubuntu packages
sudo apt-get install postgresql postgresql-contrib
# check the version
sudo -u postgres psql -c "SELECT version();"
# test the installation
sudo su - postgres
psql #to launch the terminal
\q #to quit
# or to run psql directly
sudo -i -u postgres psql
To keep the permission configuration for new users short, let's create an Ubuntu user with the same name:
sudo adduser sparkstreaming #password sparkstreaming
We need to log into PostgreSQL to create users before they can use the DB.
For our case we are going to create a user called sparkstreaming
and a DB called sparkstreamingdb.
sudo su - postgres
psql #to launch the terminal
# drop user sparkstreaming;
CREATE USER sparkstreaming WITH PASSWORD 'sparkstreaming';
\du #list users
CREATE DATABASE sparkstreamingdb;
grant all privileges on database sparkstreamingdb to sparkstreaming;
\list # to see the DB created
\q
# test the new user and DB
sudo -i -u sparkstreaming psql -d sparkstreamingdb
CREATE TABLE authors (code char(5) NOT NULL, name varchar(40) NOT NULL, city varchar(40) NOT NULL, joined_on date NOT NULL, PRIMARY KEY (code));
INSERT INTO authors VALUES('1','Ravi Saive','Mumbai','2012-08-15');
\dt #list tables
\conninfo #get connection info
To find where Postgres stores its files on disk (useful to make sure the data in Docker can be mapped to a volume for offline storage):
SELECT name, setting FROM pg_settings WHERE category = 'File Locations';
If you would like to restart the service:
sudo service postgresql restart
sudo systemctl restart postgresql
To see how many active connections to the DB are made:
SELECT pid, application_name, state FROM pg_stat_activity;
This is important: sometimes a connection becomes stale and is left hanging there, making new connections bounce back.
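Those stale sessions can be cleared from psql. A minimal sketch, assuming superuser rights (otherwise pg_terminate_backend only works on your own sessions); `__pid__` is a placeholder for a pid taken from the first query:

```sql
-- list idle connections to sparkstreamingdb, oldest state change first
SELECT pid, application_name, state, state_change
FROM pg_stat_activity
WHERE datname = 'sparkstreamingdb' AND state = 'idle'
ORDER BY state_change;

-- terminate a hung backend by its pid; returns true on success
SELECT pg_terminate_backend(__pid__);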
Service management commands:
sudo service postgresql stop
sudo service postgresql start
sudo service postgresql restart
Integration with Spark¶
As part of our workflow we need to read from and write to PostgreSQL from Apache Spark.
Let's test whether we can do it from a local pyspark
terminal.
# the PostgreSQL JDBC driver Maven coordinates must be supplied at runtime
pyspark --packages postgresql:postgresql:9.1-901-1.jdbc4
#read the table "authors"
df = spark.read. \
format("jdbc"). \
option("url", "jdbc:postgresql://localhost:5432/sparkstreamingdb"). \
option("dbtable", "authors"). \
option("user", "sparkstreaming"). \
option("password", "sparkstreaming"). \
option("driver", "org.postgresql.Driver"). \
load()
# display the table schema
df.printSchema()
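Reading works; the write path can be tested in the same pyspark session with the same connection settings. A sketch only: the target table name `authors_copy` is just an illustration (Spark creates it if it does not exist):

```python
# write the dataframe back to Postgres as a new table
df.write. \
    format("jdbc"). \
    option("url", "jdbc:postgresql://localhost:5432/sparkstreamingdb"). \
    option("dbtable", "authors_copy"). \
    option("user", "sparkstreaming"). \
    option("password", "sparkstreaming"). \
    option("driver", "org.postgresql.Driver"). \
    mode("append"). \
    save()
```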
Hive Metastore DB setup¶
We could use a different metastore DB for Hive; the steps below set up PostgreSQL as its external metastore, which can then be shared with Spark.
sudo adduser hive #password hive
sudo su - postgres
psql #to launch the terminal
CREATE USER hive WITH PASSWORD 'hive'; -- DROP USER hive; (if needed)
\du
CREATE DATABASE hive;
grant all privileges on database hive to hive;
\list # to see the DB created
\q
We need to make some changes to the connection configs to enable logins from different services:
# change the 3rd (user) column values to "all" as below, then restart PostgreSQL
sudo vim /etc/postgresql/10/main/pg_hba.conf
# "local" is for Unix domain socket connections only
local all all md5
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5
Fire up a python shell and test the connection:
import pandas as pd
import psycopg2
conn = psycopg2.connect(host="localhost", port=5432, database="hive", user="hive", password="hive")
sql_command = 'SELECT * FROM "CDS";'
print(sql_command)
# Load the data into a pandas dataframe
data = pd.read_sql(sql_command, conn)
print(data)
Use the Hive-provided schematool to set up the metastore tables and schema:
/path/to/hive/bin/schematool -dbType postgres -initSchema
Then try running the following commands; you should see a bunch of tables there:
sudo -i -u hive psql -d hive
#asks for two passwords: one for sudo, the other for the DB user `hive`, which is `hive`
hive=> \dt
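To sanity-check that schematool really initialized the metastore, the `VERSION` table (part of the standard Hive metastore schema) records the schema version; note the quotes, since Hive creates its tables with uppercase names:

```sql
SELECT * FROM "VERSION";
```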
PSQL¶
Magic words:
psql -U postgres
Some interesting flags (to see all, use -h or --help depending on your psql version):
- -E : will describe the underlying queries of the \ commands (cool for learning!)
- -l : psql will list all databases and then exit (useful if the user you connect with doesn't have a default database, like at AWS RDS)
Most \d commands support an additional param of __schema__.__name__ and accept wildcards like *.*
- \q : Quit/Exit
- \c __database__ : Connect to a database
- \d __table__ : Show table definition including triggers
- \d+ __table__ : More detailed table definition including description and physical disk size
- \l : List databases
- \dy : List events
- \df : List functions
- \di : List indexes
- \dn : List schemas
- \dt *.* : List tables from all schemas (if *.* is omitted it will only show SEARCH_PATH ones)
- \dT+ : List all data types
- \dv : List views
- \df+ __function__ : Show function SQL code
- \x : Pretty-format query results instead of the not-so-useful ASCII tables
- \copy (SELECT * FROM __table_name__) TO 'file_path_and_name.csv' WITH CSV : Export a table as CSV
User Related:
- \du : List users
- \du __username__ : List a username if present
- create role __test1__ : Create a role with an existing username
- create role __test2__ noinherit login password __password__; : Create a role with username and password
- set role __test__; : Change role for current session to __test__
- grant __test2__ to __test1__; : Allow __test1__ to set its role as __test2__
Drop all the tables in a database:
DROP SCHEMA public CASCADE;
CREATE SCHEMA public AUTHORIZATION {USER};
GRANT ALL ON schema public TO {USER};
Configuration¶
Changing verbosity & querying Postgres log:
1) First edit the config file, set a decent verbosity, save and restart postgres:
sudo vim /etc/postgresql/9.3/main/postgresql.conf
# Uncomment/Change inside:
log_min_messages = debug5
log_min_error_statement = debug5
log_min_duration_statement = -1
sudo service postgresql restart
Now you will get tons of details of every statement, error, and even background tasks like VACUUMs
tail -f /var/log/postgresql/postgresql-9.3-main.log
How to add the user who executed a PG statement to the log (editing postgresql.conf):
log_line_prefix = '%t %u %d %a '
Create command¶
There are many CREATE choices, like CREATE DATABASE __database_name__, CREATE TABLE __table_name__… Parameters differ but can be checked at the official documentation.
Handy queries¶
- SELECT * FROM pg_proc WHERE proname='__procedurename__'; : List procedure/function
- SELECT * FROM pg_views WHERE viewname='__viewname__'; : List view (including the definition)
- SELECT pg_size_pretty(pg_total_relation_size('__table_name__')); : Show DB table space in use
- SELECT pg_size_pretty(pg_database_size('__database_name__')); : Show DB space in use
- show statement_timeout; : Show current user's statement timeout
- SELECT * FROM pg_indexes WHERE tablename='__table_name__' AND schemaname='__schema_name__'; : Show table indexes
Get all indexes from all tables of a schema:
SELECT
t.relname AS table_name,
i.relname AS index_name,
a.attname AS column_name
FROM
pg_class t,
pg_class i,
pg_index ix,
pg_attribute a,
pg_namespace n
WHERE
t.oid = ix.indrelid
AND i.oid = ix.indexrelid
AND a.attrelid = t.oid
AND a.attnum = ANY(ix.indkey)
AND t.relnamespace = n.oid
AND n.nspname = 'kartones'
ORDER BY
t.relname,
i.relname
Execution data:
Queries being executed at a certain DB:
SELECT datname, application_name, pid, backend_start, query_start, state_change, state, query
FROM pg_stat_activity
WHERE datname='__database_name__';
Get all queries from all dbs waiting for data (might be hung):
SELECT * FROM pg_stat_activity WHERE waiting='t'; -- on PostgreSQL 9.6+ the waiting column is gone; use WHERE wait_event IS NOT NULL instead
Currently running queries with process pid:
SELECT pg_stat_get_backend_pid(s.backendid) AS procpid,
pg_stat_get_backend_activity(s.backendid) AS current_query
FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
Casting:
- CAST (column AS type) or column::type
- '__table_name__'::regclass::oid : Get oid having a table name
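A few concrete casts as illustration (the authors table is the one created in the Setup section):

```sql
SELECT CAST('42' AS integer);     -- explicit CAST syntax
SELECT '2012-08-15'::date;        -- shorthand :: syntax
SELECT 'authors'::regclass::oid;  -- oid of the authors table
```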
Query analysis:
- EXPLAIN __query__ : see the query plan for the given query
- EXPLAIN ANALYZE __query__ : see and execute the query plan for the given query
- ANALYZE [__table__] : collect statistics
Generating random data:
INSERT INTO some_table (a_float_value) SELECT random() * 100000 FROM generate_series(1, 1000000) i;
Keyboard shortcuts
- CTRL + R : reverse-i-search
References
https://www.tecmint.com/install-postgresql-on-ubuntu/
https://linuxize.com/post/how-to-install-postgresql-on-ubuntu-18-04/
https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
https://linuxize.com/post/how-to-create-a-sudo-user-on-ubuntu/
https://stackoverflow.com/questions/21898152/why-cant-you-start-postgres-in-docker-using-service-postgres-start
https://markheath.net/post/exploring-postgresql-with-docker
ptop and pg_top: top for PG. Available on the APT repository from apt.postgresql.org.
pg_activity: Command line tool for PostgreSQL server activity monitoring.
To enable the CTRL+R reverse-i-search when psql is linked against libedit (e.g. on macOS):
$ echo "bind "^R" em-inc-search-prev" > $HOME/.editrc
$ source $HOME/.editrc
PostgreSQL Exercises: An awesome resource to learn SQL, teaching you with simple examples in a great visual way. Highly recommended.
A Performance Cheat Sheet for PostgreSQL: Great explanations of EXPLAIN, EXPLAIN ANALYZE, VACUUM, configuration parameters and more. Quite interesting if you need to tune up a postgres setup.