ssp.spark.streaming.common


class ssp.spark.streaming.common.streamer_base.StreamerBase(spark_master, checkpoint_dir, warehouse_location, kafka_bootstrap_servers, kafka_topic, processing_time='5 seconds')[source]

Bases: object

Good read on Spark Streaming: https://www.slideshare.net/databricks/deep-dive-into-stateful-stream-processing-in-structured-streaming-with-tathagata-das

When it comes to describing the semantics of a delivery mechanism, there are three basic categories:

at-most-once delivery means that for each message handed to the mechanism, that message is delivered once or not at all; in more casual terms it means that messages may be lost. at-least-once delivery means that for each message handed to the mechanism potentially multiple attempts are made at delivering it, such that at least one succeeds; again, in more casual terms this means that messages may be duplicated but not lost. exactly-once delivery means that for each message handed to the mechanism exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.

The first one is the cheapest—highest performance, least implementation overhead—because it can be done in a fire-and-forget fashion without keeping state at the sending end or in the transport mechanism. The second one requires retries to counter transport losses, which means keeping state at the sending end and having an acknowledgement mechanism at the receiving end. The third is most expensive—and has consequently worst performance—because in addition to the second it requires state to be kept at the receiving end in order to filter out duplicate deliveries.

structured_streaming_dump(path, termination_time=None)[source]
visualize()[source]

For debugging purporse :return:

class ssp.spark.streaming.common.twitter_streamer_base.TwitterStreamerBase(spark_master, checkpoint_dir, warehouse_location, kafka_bootstrap_servers, kafka_topic, processing_time='5 seconds')[source]

Bases: ssp.spark.streaming.common.streamer_base.StreamerBase

Parameters
  • spark_master

  • checkpoint_dir

  • warehouse_location

  • kafka_bootstrap_servers

  • kafka_topic

  • processing_time

ssp.spark.streaming.common.twitter_streamer_base.pick_text(text, rtext, etext)[source]
ssp.spark.streaming.common.twitter_streamer_base.pick_text_udf(text, rtext, etext)