How to Run?

Most of the material is presented as reference material based on a local machine setup; depending on your machine, some steps may vary.

Based on the available memory and cores on your machine, adjust the Spark core/RAM memory values in the *.sh files located in bin.
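
For illustration, the values to tune usually show up as spark-submit flags like the ones below; the script name and the exact flag values here are placeholders, not the actual contents of the bin scripts.

# hypothetical launcher excerpt -- tune the memory/core flags to match your machine
spark-submit \
    --master spark://$(hostname):7077 \
    --driver-memory 2g \
    --executor-memory 4g \
    --total-executor-cores 4 \
    some_streaming_job.py   # placeholder job script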

Configuration

We use Gin-config and Python configparser to read the configs from *.ini files.

If your setup differs from what is described here, update the Spark master URL in default_ssp_config.gin. By default, the Spark master URL is constructed from the machine hostname, e.g. spark://IMCHLT276:7077, where IMCHLT276 is replaced by your machine name.
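
To see which hostname the URL will be built from on your machine, a quick check from the shell:

hostname
# e.g. if this prints IMCHLT276, the constructed master URL is spark://IMCHLT276:7077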

On Local Machine

Start Services

  • Start the services manually as background processes (look out for errors in the jungle of service logs…)

# start HiveServer2
/opt/binaries/hive/bin/hiveserver2 &
# start ZooKeeper
/opt/binaries/kafka/bin/zookeeper-server-start.sh /opt/binaries/kafka/config/zookeeper.properties &
# start the three Kafka brokers
/opt/binaries/kafka/bin/kafka-server-start.sh /opt/binaries/kafka/config/server.properties &
/opt/binaries/kafka/bin/kafka-server-start.sh /opt/binaries/kafka/config/server1.properties &
/opt/binaries/kafka/bin/kafka-server-start.sh /opt/binaries/kafka/config/server2.properties &
# create the Kafka topics
/opt/binaries/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 10 --topic ai_tweets_topic
/opt/binaries/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 10 --topic mix_tweets_topic
# start the remaining services under supervisord
/usr/bin/supervisord -c docker/supervisor.conf # restart if you see any error
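
To confirm the two topics were created (an optional check, not part of the original steps), list them:

/opt/binaries/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181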
  • Start Hadoop and Spark services…

# start hdfs
$HADOOP_HOME/sbin/start-dfs.sh
# start yarn
$HADOOP_HOME/sbin/start-yarn.sh
# Start Spark standalone cluster
$SPARK_HOME/sbin/start-all.sh

To make sure all the services are up and running, execute the jps command; you should see a list similar to the following:

13601 Jps
11923 NameNode
12109 DataNode
12431 SecondaryNameNode
12699 ResourceManager
12919 NodeManager
9292 Kafka
9294 Kafka
10358 Kafka
3475 Main
13338 Master
13484 Worker

HDFS: NameNode, DataNode, SecondaryNameNode
Hadoop: ResourceManager, NodeManager
Kafka: Kafka, Kafka, Kafka
Spark: Main, Master, Worker
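
As an additional sanity check, the web UIs should also be reachable; the ports below are the stock defaults (the same ones exposed by the Docker command later), so adjust if your setup differs:

# HDFS NameNode UI : http://localhost:50070 (Hadoop 2.x) or http://localhost:9870 (Hadoop 3.x)
# YARN ResourceManager UI : http://localhost:8088
# Spark master UI : http://localhost:8080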

  • If you wanna stop playing…

/opt/binaries/kafka/bin/kafka-server-stop.sh
/opt/binaries/kafka/bin/zookeeper-server-stop.sh
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/stop-yarn.sh
$SPARK_HOME/sbin/stop-all.sh
  • Activate ssp Python environment

source activate ssp

Since we stop and start the Spark Structured Streaming Kafka consumer and restart the Kafka servers, the consumer offsets can sometimes go for a toss. To fix this, clear the Kafka data and the Spark warehouse data:

# delete the topics first, while ZooKeeper is still running
/opt/binaries/kafka/bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic ai_tweets_topic
/opt/binaries/kafka/bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic mix_tweets_topic

# stop the Kafka brokers and ZooKeeper
/opt/binaries/kafka/bin/kafka-server-stop.sh
/opt/binaries/kafka/bin/zookeeper-server-stop.sh

# wipe ZooKeeper and Kafka broker data
rm -rf /var/lib/zookeeper/
rm -rf /tmp/kafka-logs*

# clear the Spark warehouse and the streaming checkpoints on HDFS
rm -rf /opt/spark-warehouse/
hdfs dfs -rm -r /tmp/ssp/data/lake/checkpoint/
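
Note that after this cleanup ZooKeeper and the Kafka brokers are down and the topics are gone, so restart them and recreate the topics (the same commands as in Start Services) before streaming again:

/opt/binaries/kafka/bin/zookeeper-server-start.sh /opt/binaries/kafka/config/zookeeper.properties &
/opt/binaries/kafka/bin/kafka-server-start.sh /opt/binaries/kafka/config/server.properties &
/opt/binaries/kafka/bin/kafka-server-start.sh /opt/binaries/kafka/config/server1.properties &
/opt/binaries/kafka/bin/kafka-server-start.sh /opt/binaries/kafka/config/server2.properties &
/opt/binaries/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 10 --topic ai_tweets_topic
/opt/binaries/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 10 --topic mix_tweets_topic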

From here on you can try the use cases. (The use case documents also repeat some of these steps for clarity!)

Docker

  • Start the container

docker run -v $(pwd):/host/ --hostname=$(hostname) \
    -p 50075:50075 -p 50070:50070 -p 8020:8020 -p 2181:2181 \
    -p 9870:9870 -p 9000:9000 -p 8088:8088 -p 10000:10000 \
    -p 7077:7077 -p 10001:10001 -p 8080:8080 -p 9092:9092 \
    -it sparkstructuredstreaming-pg:latest
  • Get the bash shell

# to get bash shell from running instance
docker exec -it $(docker ps | grep sparkstructuredstreaming-pg | cut -d' ' -f1) bash
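
Once inside the container, the same jps check from the local-machine section applies; you can also run it directly from the host, for example:

# run jps inside the running container (expect the same process list as shown above)
docker exec -it $(docker ps | grep sparkstructuredstreaming-pg | cut -d' ' -f1) jps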

From here on you can try the use cases. (The use case documents also repeat some of these steps for clarity!) Get your Twitter App credentials and update them in twitter_ssp_config.gin.