ssp.ml.dataset

class ssp.ml.dataset.prepare_dataset.SSPMLDataset(text_column='text', label_output_column='slabel', raw_tweet_table_name_prefix='raw_tweet_dataset', postgresql_host='localhost', postgresql_port='5432', postgresql_database='sparkstreamingdb', postgresql_user='sparkstreaming', postgresql_password='sparkstreaming', overwrite=False)

Bases: ssp.posgress.dataset_base.PostgresqlDatasetBase

Reads the raw tweet data dump from PostgreSQL, splits the data, and annotates the text with Snorkel.

Dumps the data into PostgreSQL for annotation and continuous improvement.

Dumps the data to the given path as train/dev/test/Snorkel train data for model building.

Parameters

text_column – Name of the text column
label_output_column – Name of the label column to be created by the Snorkel labeler
raw_tweet_table_name_prefix – Name prefix of the raw tweet dump table
postgresql_host – PostgreSQL host
postgresql_port – PostgreSQL port
postgresql_database – PostgreSQL database name
postgresql_user – PostgreSQL user name
postgresql_password – PostgreSQL user password
overwrite – If True, overwrite the existing tables and on-disk data

|Table Name                        |Records|Info                       |
|----------------------------------|-------|---------------------------|
|raw_tweet_dataset_0               | 50K+  |Full Raw Dataset           |
|deduplicated_raw_tweet_dataset_0  | ~     |Deduplicated on text column|
|test_dataset_0                    |1000   |Test dataset               |
|dev_dataset_0                     |500    |Dev dataset                |
|snorkel_train_dataset_0           |10K    |Snorkel train dataset      |
|train_dataset_0                   |~      |Model train dataset        |
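
The table above suggests a deduplicate-then-split flow. The following is a minimal sketch of that idea in pandas, not the library's actual implementation: the function names `split_dataset` and `table_name`, the shuffle seed, and the order in which the splits are carved out are all assumptions made for illustration. The demo uses small split sizes in place of the 1000/500/10K record counts listed in the table.

```python
import pandas as pd


def table_name(prefix, version=0):
    """Hypothetical helper: append the version suffix, e.g. test_dataset_0."""
    return f"{prefix}_{version}"


def split_dataset(df, text_column="text", test_size=1000, dev_size=500,
                  snorkel_train_size=10_000, seed=42):
    """Deduplicate on the text column, then carve out fixed-size splits.

    Assumption: test, dev, and Snorkel train rows are taken first and the
    remainder becomes the model train set.
    """
    # Drop rows whose text repeats, keeping the first occurrence.
    dedup = df.drop_duplicates(subset=[text_column]).reset_index(drop=True)
    # Shuffle once so every split is a random sample of the deduplicated data.
    shuffled = dedup.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    test_end = test_size
    dev_end = test_end + dev_size
    snorkel_end = dev_end + snorkel_train_size
    return {
        "deduplicated_raw_tweet_dataset": dedup,
        "test_dataset": shuffled.iloc[:test_end],
        "dev_dataset": shuffled.iloc[test_end:dev_end],
        "snorkel_train_dataset": shuffled.iloc[dev_end:snorkel_end],
        "train_dataset": shuffled.iloc[snorkel_end:],
    }


# Toy data: 50 rows, of which only 40 have distinct text.
raw = pd.DataFrame({"text": [f"tweet {i % 40}" for i in range(50)]})
splits = split_dataset(raw, test_size=10, dev_size=5, snorkel_train_size=15)
for prefix, frame in splits.items():
    print(table_name(prefix), len(frame))
```

Because the splits are disjoint slices of one shuffled frame, no text appears in more than one of the test, dev, Snorkel train, and train sets.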

download_n_store(version=0)

split_n_store(version=0)

ssp.ml.dataset.prepare_dataset.insert_id_col(df)

Inserts a text_id column based on the number of rows, and a label column. The label column copies the values of the slabel column if it exists; otherwise it is filled with -1.

Parameters

df – Pandas DataFrame

Returns

Pandas DataFrame
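
The behavior described for insert_id_col can be approximated as follows. This is a sketch based only on the description above, not the library source: the name `insert_id_col_sketch` is hypothetical, and both the zero-based numbering of text_id and the choice to return a copy rather than modify the input are assumptions.

```python
import pandas as pd


def insert_id_col_sketch(df):
    """Add a text_id column and a label column to a copy of df."""
    # Assumption: work on a copy so the caller's DataFrame is left untouched.
    df = df.copy()
    # Assumption: text_id is a zero-based running row number.
    df["text_id"] = list(range(len(df)))
    if "slabel" in df.columns:
        # Reuse the Snorkel-generated labels when they are present.
        df["label"] = df["slabel"]
    else:
        # No labels yet: mark every row as unlabeled with -1.
        df["label"] = -1
    return df


unlabeled = insert_id_col_sketch(pd.DataFrame({"text": ["a", "b", "c"]}))
labeled = insert_id_col_sketch(
    pd.DataFrame({"text": ["a", "b"], "slabel": [1, 0]})
)
print(unlabeled)
print(labeled)
```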