ssp.spark.streaming.consumer¶
SSP modules that handles all data at ingestion level from Twitter stream
-
class
ssp.spark.streaming.consumer.twiteer_stream_consumer.
TwitterDataset
(kafka_bootstrap_servers='localhost:9092', kafka_topic='mix_tweets_topic', checkpoint_dir='hdfs://localhost:9000/tmp/ssp/data/lake/checkpoint/', bronze_parquet_dir='hdfs://localhost:9000/tmp/ssp/data/lake/bronze/', warehouse_location='/opt/spark-warehouse/', spark_master='spark://IMCHLT276:7077', postgresql_host='localhost', postgresql_port='5432', postgresql_database='sparkstreamingdb', postgresql_user='sparkstreaming', postgresql_password='sparkstreaming', raw_tweet_table_name_prefix='raw_tweet_dataset', processing_time='5 seconds')[source]¶ Bases:
ssp.spark.streaming.common.twitter_streamer_base.TwitterStreamerBase
Twitter ingestion class
Starts the Spark Structured Streaming against the Kafka topic and dumps the data to HDFS / Postgresql
- Parameters
kafka_bootstrap_servers – (str) Kafka bootstram server
kafka_topic – (str) Kafka topic
checkpoint_dir – (str) Checkpoint directory for fault tolerance
bronze_parquet_dir – (str) Store path to dump tweet data
warehouse_location – Spark warehouse location
spark_master – Spark Master URL
postgresql_host – Postgresql host
postgresql_port – Postgresql port
postgresql_database – Postgresql Database
postgresql_user – Postgresql user name
postgresql_password – Postgresql user password
raw_tweet_table_name_prefix – Postgresql table name prefix
processing_time – Processing time trigger
-
ssp.spark.streaming.consumer.twiteer_stream_consumer.
check_n_mk_dirs
(path, is_remove=False)[source]¶
-
ssp.spark.streaming.consumer.twiteer_stream_consumer.
check_table_exists
(table_name, schema='public')[source]¶
-
ssp.spark.streaming.consumer.twiteer_stream_consumer.
filter_possible_ai_tweet_udf
(text)¶
-
ssp.spark.streaming.consumer.twiteer_stream_consumer.
get_raw_dataset
(path)[source]¶ Combines all parquet file as one in given path :param path: Folder path :return:
SSP modules that handles all data at ingestion level from Twitter stream