ssp.spark.streaming.nlp


class ssp.spark.streaming.nlp.ner_extraction.NerExtraction(kafka_bootstrap_servers='localhost:9092', kafka_topic='ai_tweets_topic', checkpoint_dir='hdfs://localhost:9000/tmp/ssp/data/lake/checkpoint/', bronze_parquet_dir='hdfs://localhost:9000/tmp/ssp/data/lake/bronze/', warehouse_location='/opt/spark-warehouse/', spark_master='spark://IMCHLT276:7077', postgresql_host='localhost', postgresql_port='5432', postgresql_database='sparkstreamingdb', postgresql_user='sparkstreaming', postgresql_password='sparkstreaming', processing_time='5 seconds', is_live_stream=True, is_docker=False)[source]

Bases: ssp.spark.streaming.common.twitter_streamer_base.TwitterStreamerBase

Uses an external REST endpoint to get NER tags.

Parameters
  • kafka_bootstrap_servers – (str) host_url:port

  • kafka_topic – (str) Live stream Kafka topic

  • checkpoint_dir – (str) Spark Streaming checkpoint directory

  • bronze_parquet_dir – (str) Input stream directory path. For local paths, prefix it with “file:///”

  • warehouse_location – (str) Spark warehouse location

  • spark_master – (str) Spark master url

  • postgresql_host – (str) Postgresql host url

  • postgresql_port – (str) Postgres port

  • postgresql_database – (str) Database name

  • postgresql_user – (str) Postgresql user name

  • postgresql_password – (str) Postgresql user password

  • processing_time – (str) Spark Streaming process interval

  • is_live_stream – (bool) Whether to use the live Kafka stream or the streamed directory as input

  • is_docker – (bool) Whether the run environment is the local machine or Docker, so the appropriate host name is used in REST endpoints

hdfs_process()[source]
online_process()[source]
process()[source]
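
A minimal usage sketch (an assumed invocation pattern, not part of the documented API; hosts and paths are placeholders for your environment):

    from ssp.spark.streaming.nlp.ner_extraction import NerExtraction

    # Instantiate the streaming NER job; unspecified arguments fall back to
    # the defaults shown in the class signature above.
    ner_job = NerExtraction(
        kafka_bootstrap_servers="localhost:9092",
        kafka_topic="ai_tweets_topic",
        checkpoint_dir="hdfs://localhost:9000/tmp/ssp/data/lake/checkpoint/",
        bronze_parquet_dir="hdfs://localhost:9000/tmp/ssp/data/lake/bronze/",
        is_live_stream=True,   # read from the live Kafka topic
        is_docker=False,       # resolve REST endpoint host names for a local run
    )

    # process() starts the streaming job; based on is_live_stream it is
    # presumed to delegate to online_process() (live Kafka stream) or
    # hdfs_process() (streamed bronze parquet directory).
    ner_job.process()
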
class ssp.spark.streaming.nlp.spark_dl_text_classification.SreamingTextClassifier(kafka_bootstrap_servers='localhost:9092', kafka_topic='ai_tweets_topic', checkpoint_dir='hdfs://localhost:9000/tmp/ssp/data/lake/checkpoint/', bronze_parquet_dir='hdfs://localhost:9000/tmp/ssp/data/lake/bronze/', warehouse_location='/opt/spark-warehouse/', spark_master='spark://IMCHLT276:7077', postgresql_host='localhost', postgresql_port='5432', postgresql_database='sparkstreamingdb', postgresql_user='sparkstreaming', postgresql_password='sparkstreaming', processing_time='5 seconds', tokenizer_path=<object object>, is_live_stream=True, is_docker=False)[source]

Bases: ssp.spark.streaming.common.twitter_streamer_base.TwitterStreamerBase

Classifies the incoming tweet text using the DL model built with TensorFlow Serving.

Parameters
  • kafka_bootstrap_servers – (str) host_url:port

  • kafka_topic – (str) Live stream Kafka topic

  • checkpoint_dir – (str) Spark Streaming checkpoint directory

  • bronze_parquet_dir – (str) Input stream directory path. For local paths, prefix it with “file:///”

  • warehouse_location – (str) Spark warehouse location

  • spark_master – (str) Spark master url

  • postgresql_host – (str) Postgresql host url

  • postgresql_port – (str) Postgres port

  • postgresql_database – (str) Database name

  • postgresql_user – (str) Postgresql user name

  • postgresql_password – (str) Postgresql user password

  • processing_time – (str) Spark Streaming process interval

  • is_live_stream – (bool) Whether to use the live Kafka stream or the streamed directory as input

  • is_docker – (bool) Whether the run environment is the local machine or Docker, so the appropriate host name is used in REST endpoints

  • tokenizer_path – (str) Path to the saved Keras tokenizer

hdfs_process()[source]
online_process()[source]
process()[source]
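
A minimal usage sketch (an assumed invocation pattern, not part of the documented API; the tokenizer path and connection details are placeholders for your environment):

    from ssp.spark.streaming.nlp.spark_dl_text_classification import SreamingTextClassifier

    # Instantiate the streaming text-classification job; the tokenizer_path
    # must point at the Keras tokenizer saved during model training
    # (the file name below is hypothetical).
    classifier_job = SreamingTextClassifier(
        kafka_bootstrap_servers="localhost:9092",
        kafka_topic="ai_tweets_topic",
        tokenizer_path="/path/to/keras_tokenizer.pickle",
        is_live_stream=False,  # read the streamed bronze parquet directory instead of Kafka
        is_docker=True,        # use Docker host names when calling the TensorFlow Serving endpoint
    )

    # process() starts the streaming job, presumably dispatching to
    # online_process() or hdfs_process() depending on is_live_stream.
    classifier_job.process()
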