ssp.ml.dataset
class ssp.ml.dataset.prepare_dataset.SSPMLDataset(text_column='text', label_output_column='slabel', raw_tweet_table_name_prefix='raw_tweet_dataset', postgresql_host='localhost', postgresql_port='5432', postgresql_database='sparkstreamingdb', postgresql_user='sparkstreaming', postgresql_password='sparkstreaming', overwrite=False)
Bases: ssp.posgress.dataset_base.PostgresqlDatasetBase
Reads the raw tweet data dump from PostgreSQL, splits the data, and annotates the text with Snorkel.
Dumps the data into PostgreSQL for annotation and continuous improvement.
Dumps the data to the given path as train/dev/test/Snorkel train data for model building.
- Parameters
text_column – Name of the text column
label_output_column – Name of the label column to be created by the Snorkel labeler
raw_tweet_table_name_prefix – Name prefix of the raw tweet dump table
postgresql_host – PostgreSQL host
postgresql_port – PostgreSQL port
postgresql_database – PostgreSQL database name
postgresql_user – PostgreSQL user name
postgresql_password – PostgreSQL user password
overwrite – Whether to overwrite the existing tables and the data on disk
| Table Name                       | Records | Info                        |
|----------------------------------|---------|-----------------------------|
| raw_tweet_dataset_0              | 50K+    | Full raw dataset            |
| deduplicated_raw_tweet_dataset_0 | ~       | Deduplicated on text column |
| test_dataset_0                   | 1000    | Test dataset                |
| dev_dataset_0                    | 500     | Dev dataset                 |
| snorkel_train_dataset_0          | 10K     | Snorkel train dataset       |
| train_dataset_0                  | ~       | Model train dataset         |
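The relationship between the tables listed above can be illustrated with a minimal, standard-library-only sketch. It is not the SSPMLDataset implementation: the function name `split_dataset`, the shuffle, the seed, and the order in which the splits are cut are all assumptions made for illustration. Only the deduplication on the text column and the split sizes (1000, 500, 10K, remainder) come from the table.

```python
import random


def split_dataset(rows, text_column="text", test_size=1000, dev_size=500,
                  snorkel_train_size=10_000, seed=42):
    """Hypothetical helper: deduplicate on text_column, then cut four splits.

    The sizes mirror the table above; the shuffling and the cut order are
    assumptions, not the library's documented behaviour.
    """
    # Keep the first occurrence of every distinct text value.
    seen = set()
    deduplicated = []
    for row in rows:
        text = row[text_column]
        if text not in seen:
            seen.add(text)
            deduplicated.append(row)

    needed = test_size + dev_size + snorkel_train_size
    if len(deduplicated) < needed:
        raise ValueError(
            f"need at least {needed} unique rows, got {len(deduplicated)}"
        )

    # Shuffle with a fixed seed so the splits are reproducible.
    rng = random.Random(seed)
    rng.shuffle(deduplicated)

    test_end = test_size
    dev_end = test_end + dev_size
    snorkel_end = dev_end + snorkel_train_size
    return {
        "test": deduplicated[:test_end],
        "dev": deduplicated[test_end:dev_end],
        "snorkel_train": deduplicated[dev_end:snorkel_end],
        "train": deduplicated[snorkel_end:],  # everything that is left
    }
```

Under these assumptions, the `train` split receives whatever remains after the fixed-size splits are taken, which would explain why its record count (like that of the deduplicated table) is shown as `~` rather than as a fixed number.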