ssp.ml.dataset


class ssp.ml.dataset.prepare_dataset.SSPMLDataset(text_column='text', label_output_column='slabel', raw_tweet_table_name_prefix='raw_tweet_dataset', postgresql_host='localhost', postgresql_port='5432', postgresql_database='sparkstreamingdb', postgresql_user='sparkstreaming', postgresql_password='sparkstreaming', overwrite=False)

Bases: ssp.posgress.dataset_base.PostgresqlDatasetBase

Reads the raw tweet data dump from PostgreSQL, splits the data, and annotates the text with Snorkel.

Dumps the data into PostgreSQL for annotation and continuous improvement.

Dumps the data to the given path as train/dev/test/Snorkel train data for model building.

Parameters
  • text_column – Name of the text column

  • label_output_column – Name of the label column to be created using the snorkel labeler

  • raw_tweet_table_name_prefix – Raw tweet table dump name prefix

  • postgresql_host – PostgreSQL host

  • postgresql_port – PostgreSQL port

  • postgresql_database – PostgreSQL database name

  • postgresql_user – PostgreSQL user name

  • postgresql_password – PostgreSQL user password

  • overwrite – Overwrite the table and disk data

|Table Name                        |Records|Info                       |
|----------------------------------|-------|---------------------------|
|raw_tweet_dataset_0               | 50K+  |Full Raw Dataset           |
|deduplicated_raw_tweet_dataset_0  | ~     |Deduplicated on text column|
|test_dataset_0                    |1000   |Test dataset               |
|dev_dataset_0                     |500    |Dev dataset                |
|snorkel_train_dataset_0           |10K    |Snorkel train dataset      |
|train_dataset_0                   |~      |Model train dataset        |

download_n_store(version=0)

split_n_store(version=0)
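
The record counts in the table above suggest how the splits might be cut. The following is a minimal sketch in plain Python, not the library's implementation. It assumes that the version argument becomes the table-name suffix, that test, dev, and Snorkel train rows are taken in that order from a shuffled, deduplicated set, and that train_dataset receives the remainder. The function name split_rows and all of its parameters are hypothetical.

```python
import random


def split_rows(rows, version=0, text_key="text",
               test_size=1000, dev_size=500, snorkel_size=10_000, seed=42):
    """Hypothetical helper: deduplicate on the text column, then cut splits.

    Returns a dict keyed by table names that follow the naming pattern
    shown in the table (assumed to be "<name>_<version>").
    """
    # Keep the first occurrence of each distinct text value.
    seen = set()
    unique = []
    for row in rows:
        if row[text_key] not in seen:
            seen.add(row[text_key])
            unique.append(row)

    # Shuffle a copy so the deduplicated set keeps its original order.
    shuffled = unique[:]
    random.Random(seed).shuffle(shuffled)

    # Consecutive, non-overlapping slices (the ordering is an assumption).
    test_end = test_size
    dev_end = test_end + dev_size
    snorkel_end = dev_end + snorkel_size
    return {
        f"deduplicated_raw_tweet_dataset_{version}": unique,
        f"test_dataset_{version}": shuffled[:test_end],
        f"dev_dataset_{version}": shuffled[test_end:dev_end],
        f"snorkel_train_dataset_{version}": shuffled[dev_end:snorkel_end],
        f"train_dataset_{version}": shuffled[snorkel_end:],
    }


# Scaled-down usage: 100 rows, 90 of them with distinct text.
rows = [{"text": f"tweet {i % 90}"} for i in range(100)]
splits = split_rows(rows, version=0, test_size=10, dev_size=5, snorkel_size=20)
for name, part in splits.items():
    print(name, len(part))
```

With 100 rows of which 90 are unique, the scaled-down call yields 10 test, 5 dev, 20 Snorkel train, and 55 train rows.
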
ssp.ml.dataset.prepare_dataset.insert_id_col(df)

Inserts a text_id column, based on the number of rows, and a label column. The label column copies the values of the slabel column if it exists; otherwise it is filled with -1.

Parameters
  • df – Pandas DataFrame

Returns
  Pandas DataFrame
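
The behaviour described for insert_id_col can be illustrated with a short pandas sketch. This is not the library's source: the name insert_id_col_sketch is hypothetical, and numbering text_id sequentially from 0 is an assumption, since the description only says the ids depend on the number of rows.

```python
import pandas as pd


def insert_id_col_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the documented behaviour."""
    out = df.copy()  # leave the caller's DataFrame untouched
    # Assumed numbering: one id per row, 0 .. n-1.
    out["text_id"] = list(range(len(out)))
    # Copy slabel when present, otherwise fill every row with -1.
    out["label"] = out["slabel"] if "slabel" in out.columns else -1
    return out


unlabeled = pd.DataFrame({"text": ["a", "b", "c"]})
labeled = pd.DataFrame({"text": ["a", "b"], "slabel": [1, 0]})
print(insert_id_col_sketch(unlabeled))
print(insert_id_col_sketch(labeled))
```

The sketch works on a copy, so the input frame is not modified. Whether the real function mutates its argument is not stated in the documentation.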