# Architecting a Real-time Streaming Deep Learning Text Classification Pipeline With Python

A client startup wants to retweet every AI-related tweet that contains a link, hoping this will attract more followers to its handle.

## Requirements

- Refer to the following link for the general architecture of our continuous-integration ML pipeline: [theory on Architecting a Machine Learning Pipeline](https://towardsdatascience.com/architecting-a-machine-learning-pipeline-a847f094d1c7)
- Create a pipeline similar to the following illustration ![](../images/ml_pipeline.png)
- Build a ground-up dataset for tweet classification (AI tweet or not) using the live streaming data
- Dump the raw tweet data into a table in the Postgresql DB called `streamingdb`
- Provide configuration to prefix the raw tweet table name and a version id when dumping the tables into the DB
- Use semi-supervised methods to tag the dataset, for example with frameworks like [https://www.snorkel.org/](https://www.snorkel.org/)
- Build a UI tool to annotate the semi-supervised tagged data and create a golden dataset for ML training
- Build a naive Deep Learning / neural network model and evaluate it on the golden/semi-supervised dataset
- Deploy the model and classify the streamed text
- Extract the web links from the classified tweets and store the URLs

------------------------------------------------------------------------------------------------------------------------

## Implementation

1. Problem Definition - Build a modular streaming ML/DL pipeline
2. Data Ingestion / Data Collection:
```
Tweets ---> Twitter Stream ---> Tweepy ---> Kafka Producer ---> Kafka Stream
       ---> Spark Structured Streaming Consumer ---> Postgresql as one single raw data table
```
3. Data Preparation / Data Labelling and Segregation
```shell script
Postgresql ---> Raw Dataset Table ---> Split ---> Train/Test/Dev/Snorkel dataset tables
           ---> SSPLabeler (Snorkel Labeler)
           ---> Labelled Train/Test/Dev datasets stored in Postgresql & on disk

Labelled Train/Test/Dev datasets on Postgresql ---> Manual UI Tagger
           ---> Train/Test/Dev datasets with a golden label column on Postgresql
```
4. Model Training and Evaluation
```shell script
Labelled Train/Test/Dev dataset ---> DL Model ---> Model Store
```
5. Deployment / Prediction on Live Stream
```shell script
Model Store ---> Tensorflow Serving ---> TF API End Point

Tweets ---> Twitter Stream ---> Tweepy ---> Kafka Producer ---> Kafka Stream
       ---> Spark Structured Streaming Consumer ---> UDF (TF API End Point)
       ---> Filtered AI Tweets ---> Postgresql
```
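Steps 2 and 5 share the same Kafka -> Spark Structured Streaming backbone. The sketch below is only a rough illustration of that consumer path, not the repo's actual `TwitterDataset` implementation; the topic name, JDBC URL and credentials are assumptions. Step 5 adds a UDF on top of this stream that calls the TF Serving endpoint before writing the filtered tweets back to Postgresql.

```python
# Hedged sketch of the Kafka -> Spark Structured Streaming -> Postgresql path (step 2).
# Topic name, JDBC URL and credentials are illustrative placeholders; the Kafka source
# needs the spark-sql-kafka-0-10 package on the Spark classpath (e.g. via --packages).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw_tweet_consumer").getOrCreate()

# Each Kafka record's value holds the JSON payload published by the producer
raw_stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "ai_tweets_topic")          # assumed topic name
              .load()
              .selectExpr("CAST(value AS STRING) AS tweet_json"))

def write_to_postgres(batch_df, batch_id):
    # Append every micro-batch to the raw tweet table over JDBC
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://localhost:5432/streamingdb")
     .option("dbtable", "raw_tweet_dataset_0")
     .option("user", "sparkstreaming")                          # assumed credentials
     .option("password", "sparkstreaming")
     .mode("append")
     .save())

(raw_stream.writeStream
 .foreachBatch(write_to_postgres)
 .start()
 .awaitTermination())
```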
6. Monitoring / Dashboard
```
Postgresql ---> Flask API ---> Dashboard
```

![](../drawio/usecase6.png)

Dataset tables:

- Tables are suffixed with a `run/version id` starting from `0`; refer to the respective `bin/*.sh` files for the version configuration.

|Table Name                        |Records|Info                            |
|----------------------------------|-------|--------------------------------|
|raw_tweet_dataset_0               | 50K+  |Full raw dataset                |
|deduplicated_raw_tweet_dataset_0  | ~     |Deduplicated on the text column |
|test_dataset_0                    | 1000  |Test dataset                    |
|dev_dataset_0                     | 500   |Dev dataset                     |
|snorkel_train_dataset_0           | 10K   |Snorkel train dataset           |
|train_dataset_0                   | ~     |Model train dataset             |

![](../drawio/6_full_ml_model_cycle.png)

------------------------------------------------------------------------------------------------------------------------

## Configuration

- [Tweets Keywords Used](https://gyan42.github.io/spark-streaming-playground/build/html/ssp/ssp.utils.html#ssp.utils.ai_key_words.AIKeyWords)
- Config file used: [default_ssp_config.gin](https://github.com/gyan42/spark-streaming-playground/blob/756ee7c204039c8a3bc890a95e1da78ac2d6a9ee/config/default_ssp_config.gin)
- [TwitterProducer](https://gyan42.github.io/spark-streaming-playground/build/html/ssp/ssp.kafka.producer.html)
- [TwitterDataset](https://gyan42.github.io/spark-streaming-playground/build/html/ssp/ssp.spark.streaming.consumer.html?highlight=twitterdataset#ssp.spark.streaming.consumer.twiteer_stream_consumer.TwitterDataset)
- [SSPMLDataset](https://gyan42.github.io/spark-streaming-playground/build/html/ssp/ssp.ml.dataset.html?highlight=sspmldataset#ssp.ml.dataset.prepare_dataset.SSPMLDataset)
- [tagger.gin](https://github.com/gyan42/spark-streaming-playground/blob/756ee7c204039c8a3bc890a95e1da78ac2d6a9ee/config/tagger.gin)
- [NaiveTextClassifier](https://gyan42.github.io/spark-streaming-playground/build/html/ssp/ssp.dl.tf.classifier.html?highlight=naivetextclassifier#ssp.dl.tf.classifier.naive_text_classifier.NaiveTextClassifier)
- [SreamingTextClassifier](https://gyan42.github.io/spark-streaming-playground/build/html/ssp/ssp.spark.streaming.nlp.html?highlight=sreamingtextclassifier#ssp.spark.streaming.nlp.spark_dl_text_classification.SreamingTextClassifier)

------------------------------------------------------------------------------------------------------------------------

## How to run?

There are two ways to run the pipeline: inside Docker or on your local machine. The only difference is how you open the terminals; once a terminal is launched, the steps are the same.
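Conceptually, the Kafka producer driven by the data-collection step below (`bin/data/start_kafka_producer.sh`) is a Tweepy stream listener that forwards each tweet to a Kafka topic. The following is a minimal sketch of that idea only; it assumes tweepy 3.x and kafka-python, and the credentials, tracked keywords and topic name are placeholders rather than the repo's actual `TwitterProducer` / gin configuration.

```python
# Hedged sketch of a Tweepy -> Kafka producer (assumes tweepy 3.x and kafka-python).
# Credentials, tracked keywords and the topic name are placeholders.
import json

import tweepy
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"))

class TweetForwarder(tweepy.StreamListener):
    def on_status(self, status):
        # Forward a minimal payload for each incoming tweet onto the Kafka topic
        producer.send("ai_tweets_topic", {"id": status.id_str, "text": status.text})

    def on_error(self, status_code):
        # Returning False on HTTP 420 disconnects instead of hammering the API
        return status_code != 420

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=TweetForwarder())
stream.filter(track=["machine learning", "deep learning", "data science"], languages=["en"])
```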
Start the docker container, if needed:

```
docker run -v $(pwd):/host/ --hostname=$(hostname) -p 50075:50075 -p 50070:50070 -p 8020:8020 -p 2181:2181 -p 9870:9870 -p 9000:9000 -p 8088:8088 -p 10000:10000 -p 7077:7077 -p 10001:10001 -p 8080:8080 -p 9092:9092 -it sparkstructuredstreaming-pg:latest
```

To get a new terminal for our docker instance, run: `docker exec -it $(docker ps | grep sparkstructuredstreaming-pg | cut -d' ' -f1) bash`

Note: we pull the container run id with `$(docker ps | grep sparkstructuredstreaming-pg | cut -d' ' -f1)`.

This example needs multiple terminals:

- On each terminal, move to the source root folder:
```shell script
# Local machine
cd /path/to/spark-streaming-playground/

# On Docker, 'spark-streaming-playground' is mounted as a volume at /host/
cd /host

export PYTHONPATH=$(pwd)/src/:$PYTHONPATH
```
- Data collection - there are two ways:
  - The first and easiest way is to use the dump that is part of this repo
```shell script
export PYTHONPATH=$(pwd)/src/:$PYTHONPATH
python src/ssp/posgress/dataset_base.py --mode=upload
```
  - The second way is to dump data from the live stream, which may take a few hours depending on the frequency of the AI/ML/Big Data tweets
```shell script
#[producer] Guake terminal name!
vim bin/data/start_kafka_producer.sh
bin/data/start_kafka_producer.sh

#[dump data]
# by default 50K tweets (25K AI tweets + 25K false positives) will be collected and dumped into the table
vim bin/data/dump_raw_data_into_postgresql.sh
bin/data/dump_raw_data_into_postgresql.sh
```
- Data preparation for model training, with the default Snorkel labeller
```shell script
#[ssp data]
# check that the version points to the one we want; to begin with it has to be 0
# the --mode switch has two options: 'split' for creating the train/test set from the initial full data table and
# 'download' to dump the labelled data; the same output path is used except it is annotated with '_labelled'
vim bin/data/prepare_ssp_dataset.sh
vim config/default_ssp_config.gin # check the `SSPMLDataset` params
bin/data/prepare_ssp_dataset.sh
```

  Snorkel Label Function coverage will be printed as part of the logs, as follows:

```shell script
                     j Polarity  Coverage  Overlaps  Conflicts
is_ai_tweet          0      [1]    0.3125    0.0587     0.0587
is_not_ai_tweet      1      [0]    0.3469    0.1597     0.0000
not_data_science     2      [0]    0.1084    0.0983     0.0482
not_neural_network   3      [0]    0.0036    0.0030     0.0030
not_big_data         4      [0]    0.1133    0.1048     0.0057
not_nlp              5      [0]    0.0132    0.0120     0.0004
not_ai               6      [0]    0.0084    0.0067     0.0059
not_cv               7      [0]    0.0081    0.0068     0.0016
```
- Manual tagger

![](../images/text_tagger.png)

```shell script
#[tagger]
bin/flask/tagger.sh
```
- Evaluate the Snorkel labeller with respect to the hand labels
```shell script
#[snorkel]
vim bin/models/evalaute_snorkel_labeller.sh
bin/models/evalaute_snorkel_labeller.sh
```
- Train the Deep Learning model
```shell script
#[DL Text classification Model]
vim config/default_ssp_config.gin # check the paths in the NaiveTextClassifier params
bin/models/build_naive_dl_text_classifier.sh
```
- Start Tensorflow Serving
  - Through **installation**
```shell script
#[Tensorflow Serving]
# give the appropriate model path that needs to be served
export MODEL_DIR=/home/mageswarand/ssp/model/raw_tweet_dataset_0/naive_text_classifier/exported/

# test the model
saved_model_cli show --dir ${MODEL_DIR}/1/ --all

# start the serving server
tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name="naive_text_clf" \
  --model_base_path="${MODEL_DIR}"
```
  - Through **docker**
```shell script
# Download the TensorFlow Serving Docker image
docker pull tensorflow/serving
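# (optional checks, not part of the original walkthrough)
# confirm the image is available locally
docker images tensorflow/serving
# once the container below is up, the model status endpoint can be queried with:
#   curl http://localhost:8501/v1/models/naive_text_clf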
export MODEL_DIR=/home/mageswarand/ssp/model/raw_tweet_dataset_0/naive_text_classifier/exported/

docker run -p 8501:8501 \
  --mount type=bind,source=${MODEL_DIR},target=/models/naive_text_clf \
  -e MODEL_NAME=naive_text_clf -t tensorflow/serving
```
  - Through **Kubernetes**, using the local machine as the driver (TODO: try creating a private cluster with docker as the driver)
    - You may have to delete a previous minikube setup
```shell script
minikube stop
minikube delete
```
    - Start minikube
```shell script
sudo minikube start --vm-driver=none #--cpus=6 --memory=16000mb
# this will take a few mins
#eval $(minikube docker-env)
#export DOCKER_CERT_PATH=${HOME}/.minikube/certs #bug fix
```
    - Get the default tensorflow serving docker image
```shell script
docker pull tensorflow/serving

# Make sure the port 8501 is free and not used by any application
sudo netstat -tulpen | grep 8501 # you should see blank output
```
    - Prepare a new serving image with the model
```shell script
# Start the default tensorflow serving docker image with the name "serving_base"
docker run -d --name serving_base tensorflow/serving

export MODEL_DIR=/home/mageswarand/ssp/model/raw_tweet_dataset_0/naive_text_classifier/exported/

# copy the model artifacts
docker cp ${MODEL_DIR} serving_base:/models/naive_text_clf

# save the image with the model
docker commit --change "ENV MODEL_NAME naive_text_clf" serving_base naive-text-clf-serving

# kill and remove the default image
docker kill serving_base
docker rm serving_base
```
    - Test the image
```shell script
# Run the image and test that the APIs are working fine
# run the test script below: tensorflow_serving_api_udf.py
# after testing, kill the docker image; just make sure there are no multiple docker images running
docker inspect $(docker ps | grep naive-text-clf-serving | cut -d' ' -f1)
docker kill $(docker ps | grep naive-text-clf-serving | cut -d' ' -f1)
```
    - Create the kubernetes deployment and service pods
```shell script
kubectl create -f kubernetes/tensorflow_naive_classifier.yaml
sudo netstat -tulpen | grep 30125
kubectl get pods
kubectl get deployments
kubectl get services
kubectl describe service naive-text-clf-service
```

![](../images/tf_kubernetes_text_classifier.png)

    - Delete the kubernetes service cluster
```shell script
# if you want to clear out all the kubernetes stuff
kubectl delete -f kubernetes/tensorflow_naive_classifier.yaml
```
- Test the serving REST end point
```shell script
export PYTHONPATH=$(pwd)/src/:$PYTHONPATH
python src/ssp/spark/udf/tensorflow_serving_api_udf.py
```
- Start the live Spark streaming for AI/Data Science tweet classification
```shell script
#[Spark Streaming]
bin/nlp/spark_dl_text_classification_main.sh
```
- Start the AI Tweets dashboard
```shell script
#[dashboard]
bin/flask/ai_tweets_dashboard.sh
```

![](../images/ai_tweets_dashboard.png)

A Flask Web UI with the text and its prediction probability!
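As a rough idea of what such a dashboard endpoint does, here is a minimal Flask + psycopg2 sketch; it is not the repo's actual `ai_tweets_dashboard` implementation, and the table name `ai_tweets_prediction`, its columns, and the connection settings are assumptions for illustration only.

```python
# Hedged sketch of a dashboard endpoint reading classified tweets from Postgresql.
# Table/column names and connection settings are assumptions, not the repo's actual schema.
import psycopg2
from flask import Flask, jsonify

app = Flask(__name__)

def get_connection():
    return psycopg2.connect(host="localhost", dbname="streamingdb",
                            user="sparkstreaming", password="sparkstreaming")

@app.route("/ai_tweets")
def ai_tweets():
    # Fetch the most recently classified AI tweets along with their prediction probability
    with get_connection() as conn, conn.cursor() as cur:
        cur.execute("SELECT text, probability FROM ai_tweets_prediction "
                    "ORDER BY created_at DESC LIMIT 50")
        rows = cur.fetchall()
    return jsonify(tweets=[{"text": text, "probability": float(prob)} for text, prob in rows])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```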
- Clean all Postgresql DB tables
```
DROP SCHEMA public CASCADE;
CREATE SCHEMA public AUTHORIZATION sparkstreaming;
GRANT ALL ON SCHEMA public TO sparkstreaming;
```

------------------------------------------------------------------------------------------------------------------------

## Takeaways / Learnings

- TODOs

------------------------------------------------------------------------------------------------------------------------

Medium post on this use case: [https://medium.com/@mageswaran1989/big-data-play-ground-for-engineers-architecting-a-realtime-streaming-deep-learning-pipeline-with-c0305407f21d](https://medium.com/@mageswaran1989/big-data-play-ground-for-engineers-architecting-a-realtime-streaming-deep-learning-pipeline-with-c0305407f21d)

------------------------------------------------------------------------------------------------------------------------

**References**

- [https://towardsdatascience.com/deploy-your-machine-learning-models-with-tensorflow-serving-and-kubernetes-9d9e78e569db](https://towardsdatascience.com/deploy-your-machine-learning-models-with-tensorflow-serving-and-kubernetes-9d9e78e569db)
- [https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65](https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65)
- [https://towardsdatascience.com/how-to-build-a-complex-reporting-dashboard-using-dash-and-plotl-4f4257c18a7f](https://towardsdatascience.com/how-to-build-a-complex-reporting-dashboard-using-dash-and-plotl-4f4257c18a7f)
- [https://github.com/ucg8j/awesome-dash](https://github.com/ucg8j/awesome-dash)
- [https://github.com/tensorflow/serving](https://github.com/tensorflow/serving)
- [https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/serving_kubernetes.md](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/serving_kubernetes.md)