# PostgreSQL

Home url : [https://www.postgresql.org/](https://www.postgresql.org/)

Here is a good comparision [PostgreSQL vs MySQL](https://www.postgresqltutorial.com/postgresql-vs-mysql/) 

The choice of PostgresSQL was done since we are using closest managed service in AWS called [Redshift](https://aws.amazon.com/redshift/)
 
## Setup

```
#install ubuntu packages
sudo apt-get install postgresql postgresql-contrib

# check the version
sudo -u postgres psql -c "SELECT version();"

# test the installation
sudo su - postgres
    psql #to launch the terminal
    \q #to quit

# or to run psql directly
sudo -i -u postgres psql

```

To cut short the permission configurations for new users, lets create a Ubuntu user with same name: 
`sudo adduser sparkstreaming #password sparkstreaming` 

We need log into PostgresSQL to create users before they can use the DB.
For our case we are going ot create a user called `sparkstreaming` and DB called `sparkstreamingdb`

```
sudo su - postgres
    psql #to launch the terminal
    # drop user sparkstreaming;
    CREATE USER sparkstreaming WITH PASSWORD 'sparkstreaming'; 
    \du #list users
    CREATE DATABASE sparkstreamingdb;
    grant all privileges on database sparkstreamingdb to sparkstreaming;
    \list # to see the DB created
    \q

# test the new user and DB
sudo -i -u sparkstreaming  psql -d sparkstreamingdb
    CREATE TABLE authors (code char(5) NOT NULL, name varchar(40) NOT NULL, city varchar(40) NOT NULL, joined_on date NOT NULL, PRIMARY KEY (code));
    INSERT INTO authors VALUES(1,'Ravi Saive','Mumbai','2012-08-15');
    \dt #list tables
    \conninfo #get connection info
```

To know the file storage in Postgres, this is to make sure the data in Docker can be mapped to a volume for offline storage.

`SELECT name, setting FROM pg_settings WHERE category = 'File Locations';`

If you would like to restart the service:

```
sudo service postgresql restart
sudo systemctl restart postgresql
```

To see how many active connections to the DB are made:
`SELECT pid, application_name, state FROM pg_stat_activity;`
This is important some times the connetion becomes stale and left hanging there, making new connection bounce back.

Service management commands:

```
sudo service postgresql stop
sudo service postgresql start
sudo service postgresql restart
```

## Integration with Spark

Part of our work flow we need to read and write to Postresql from Apache Spark. 
Lets test whether we can do it from local `pyspark` terminal.

```
#it is mandate to give the posgres maven ids for runtime
pyspark --packages postgresql:postgresql:9.1-901-1.jdbc4

#read the table "authors" 
df = spark.read. \
    format("jdbc"). \
    option("url", "jdbc:postgresql://localhost:5432/sparkstreamingdb"). \
    option("dbtable", "authors"). \
    option("user", "sparkstreaming"). \
    option("password", "sparkstreaming"). \
    option("driver", "org.postgresql.Driver"). \
    load()

# display the table
df.printSchema()
```

## Hive Metastore DB setup

We could use different metastore DB for Hive, below steps helps to use Postgresql as its external metastore,
which then can be shared with Spark.

```
sudo adduser hive #password hive

sudo su - postgres
    psql #to launch the terminal
    CREATE USER hive WITH PASSWORD 'hive'; # drop user hive; (if needed)
    \du
    CREATE DATABASE hive;
    grant all privileges on database hive to hive;
    \list # to see the DB created
    \q
```

We need do some changes to connection configs, to enable login from different services:

```
# change the 3rd colum values to "all"
sudo vim /etc/postgresql/10/main/pg_hba.conf
    # "local" is for Unix domain socket connections only
    local   all   all                                     md5
    # IPv4 local connections:
    host    all   all             127.0.0.1/32            md5
    # IPv6 local connections:
    host    all   all             ::1/128                 md5

```

Fire a python shell and test out the connection

```    
import psycopg2
conn = psycopg2.connect(host="localhost", port=5432, database="hive", user="hive", password="hive")
sql_command = "SELECT * FROM \"CDS\";"
print (sql_command)
# Load the data
data = pd.read_sql(sql_command, conn)
print(data)
```

Use Hive provided tool to setup the metastore tables an schema:
`/path/to/hive/bin/schematool -dbType postgres -initSchema`

And then try running following commands, you should see bunch of tables there:
```shell script
sudo -i -u hive  psql -d hive
#asks for two password, one for sudo and other one for DB `hive` which is `hive`
    hive=> \dt
```

------------------------------------------------------------------------------------------------------------------------

## PSQL

Magic words:
```bash
psql -U postgres
```
Some interesting flags (to see all, use `-h` or `--help` depending on your psql version):
- `-E`: will describe the underlaying queries of the `\` commands (cool for learning!)
- `-l`: psql will list all databases and then exit (useful if the user you connect with doesn't has a default database, like at AWS RDS)

Most `\d` commands support additional param of `__schema__.name__` and accept wildcards like `*.*`

- `\q`: Quit/Exit
- `\c __database__`: Connect to a database
- `\d __table__`: Show table definition including triggers
- `\d+ __table__`: More detailed table definition including description and physical disk size
- `\l`: List databases
- `\dy`: List events
- `\df`: List functions
- `\di`: List indexes
- `\dn`: List schemas
- `\dt *.*`: List tables from all schemas (if `*.*` is omitted will only show SEARCH_PATH ones)
- `\dT+`: List all data types
- `\dv`: List views
- `\df+ __function__` : Show function SQL code. 
- `\x`: Pretty-format query results instead of the not-so-useful ASCII tables
- `\copy (SELECT * FROM __table_name__) TO 'file_path_and_name.csv' WITH CSV`: Export a table as CSV

User Related:
- `\du`: List users
- `\du __username__`: List a username if present.
- `create role __test1__`: Create a role with an existing username.
- `create role __test2__ noinherit login password __passsword__;`: Create a role with username and password.
- `set role __test__;`: Change role for current session to `__test__`.
- `grant __test2__ to __test1__;`: Allow `__test1__` to set its role as `__test2__`.

Drop all the tables in a database:
```
DROP SCHEMA public CASCADE;
CREATE SCHEMA public AUTHORIZATION {USER};
GRANT ALL ON schema public TO {USER};
```

## Configuration

- Changing verbosity & querying Postgres log:
  <br/>1) First edit the config file, set a decent verbosity, save and restart postgres:
```
sudo vim /etc/postgresql/9.3/main/postgresql.conf

# Uncomment/Change inside:
log_min_messages = debug5
log_min_error_statement = debug5
log_min_duration_statement = -1

sudo service postgresql restart
```
  2) Now you will get tons of details of every statement, error, and even background tasks like VACUUMs
```
tail -f /var/log/postgresql/postgresql-9.3-main.log
```
  3) How to add user who executed a PG statement to log (editing `postgresql.conf`):
```
log_line_prefix = '%t %u %d %a '
```

## Create command

There are many `CREATE` choices, like `CREATE DATABASE __database_name__`, `CREATE TABLE __table_name__` ... Parameters differ but can be checked [at the official documentation](https://www.postgresql.org/search/?u=%2Fdocs%2F9.1%2F&q=CREATE).


## Handy queries
- `SELECT * FROM pg_proc WHERE proname='__procedurename__'`: List procedure/function
- `SELECT * FROM pg_views WHERE viewname='__viewname__';`: List view (including the definition)
- `SELECT pg_size_pretty(pg_total_relation_size('__table_name__'));`: Show DB table space in use
- `SELECT pg_size_pretty(pg_database_size('__database_name__'));`: Show DB space in use
- `show statement_timeout;`: Show current user's statement timeout
- `SELECT * FROM pg_indexes WHERE tablename='__table_name__' AND schemaname='__schema_name__';`: Show table indexes
- Get all indexes from all tables of a schema:
```sql
SELECT
   t.relname AS table_name,
   i.relname AS index_name,
   a.attname AS column_name
FROM
   pg_class t,
   pg_class i,
   pg_index ix,
   pg_attribute a,
    pg_namespace n
WHERE
   t.oid = ix.indrelid
   AND i.oid = ix.indexrelid
   AND a.attrelid = t.oid
   AND a.attnum = ANY(ix.indkey)
   AND t.relnamespace = n.oid
    AND n.nspname = 'kartones'
ORDER BY
   t.relname,
   i.relname
```
- Execution data:
  - Queries being executed at a certain DB:
```sql
SELECT datname, application_name, pid, backend_start, query_start, state_change, state, query 
  FROM pg_stat_activity 
  WHERE datname='__database_name__';
```
  - Get all queries from all dbs waiting for data (might be hung): 
```sql
SELECT * FROM pg_stat_activity WHERE waiting='t'
```
  - Currently running queries with process pid:
```sql
SELECT pg_stat_get_backend_pid(s.backendid) AS procpid, 
  pg_stat_get_backend_activity(s.backendid) AS current_query
FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
```

Casting:
- `CAST (column AS type)` or `column::type`
- `'__table_name__'::regclass::oid`: Get oid having a table name

Query analysis:
- `EXPLAIN __query__`: see the query plan for the given query
- `EXPLAIN ANALYZE __query__`: see and query_to_df the query plan for the given query
- `ANALYZE [__table__]`: collect statistics  

Generating random data ([source](https://www.citusdata.com/blog/2019/07/17/postgres-tips-for-average-and-power-user/)):
- `INSERT INTO some_table (a_float_value) SELECT random() * 100000 FROM generate_series(1, 1000000) i;`

Keyboard shortcuts
- `CTRL` + `R`: reverse-i-search


**References**

- https://www.tecmint.com/install-postgresql-on-ubuntu/
- https://linuxize.com/post/how-to-install-postgresql-on-ubuntu-18-04/
- https://medium.com/@thomaspt748/how-to-upsert-data-into-relational-database-using-spark-7d2d92e05bb9
- https://linuxize.com/post/how-to-create-a-sudo-user-on-ubuntu/
- https://stackoverflow.com/questions/21898152/why-cant-you-start-postgres-in-docker-using-service-postgres-start
- https://markheath.net/post/exploring-postgresql-with-docker
- `ptop` and `pg_top`: `top` for PG. Available on the APT repository from `apt.postgresql.org`.
- [pg_activity](https://github.com/julmon/pg_activity): Command line tool for PostgreSQL server activity monitoring.
- [Unix-like reverse search in psql](https://dba.stackexchange.com/questions/63453/is-there-a-psql-equivalent-of-bashs-reverse-search-history):
```bash
$ echo "bind "^R" em-inc-search-prev" > $HOME/.editrc
$ source $HOME/.editrc
``` 
- [PostgreSQL Exercises](https://pgexercises.com/): An awesome resource to learn to learn SQL, teaching you with simple examples in a great visual way. **Highly recommended**.
- [A Performance Cheat Sheet for PostgreSQL](https://severalnines.com/blog/performance-cheat-sheet-postgresql): Great explanations of `EXPLAIN`, `EXPLAIN ANALYZE`, `VACUUM`, configuration parameters and more. Quite interesting if you need to tune-up a postgres setup.