DEBAT
[Getting Started Notebook] DEBAT Challenge
This is a baseline notebook to get you started with the challenge.
You can use this code to understand the data and build a baseline model for further improvements.
Starter Code for the DEBAT Practice Challenge
Note: Create a copy of the notebook and use the copy for submission. Go to File > Save a Copy in Drive to create a new copy.
Downloading Dataset¶
Installing aicrowd-cli
!pip install aicrowd-cli
%load_ext aicrowd.magic
%aicrowd login
!rm -rf data
!mkdir data
%aicrowd ds dl -c debat -o data
Importing Libraries¶
In this baseline, we will be using the sklearn (scikit-learn) library to train the model and generate the predictions.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize, LabelEncoder
from scipy.sparse import hstack
import os
from IPython.display import display
Reading the dataset¶
Here, we will read the train.csv
which contains both training samples & labels, and test.csv
which contains testing samples.
# Reading the CSV files
train_data_df = pd.read_csv("data/train.csv")
test_data_df = pd.read_csv("data/test.csv")
display(train_data_df.head())
display(test_data_df.head())
print(train_data_df.shape, test_data_df.shape)
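Before any modeling, it can also help to glance at the label distribution (the sentiment column, which is used as the target later in this notebook):
# Quick look at how balanced the target classes are
train_data_df['sentiment'].value_counts()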
Data Preprocessing¶
The dataset is mostly textual, so in preprocessing we will first one-hot encode the categorical features and then use TF-IDF to turn the tweet text into numerical features the classifier can use.
# removing some unnecessary columns
train_data_df.drop(['tweet_id', 'tweet_created', 'name'], axis=1, inplace=True)
test_data_df.drop(['tweet_id', 'tweet_created', 'name'], axis=1, inplace=True)
# utility function to one-hot encode the categorical columns
def one_hot_df(df):
    df = pd.concat([df, pd.get_dummies(df["candidate"])], axis=1)
    df.drop("candidate", axis=1, inplace=True)
    df = pd.concat([df, pd.get_dummies(df["subject_matter"])], axis=1)
    df.drop("subject_matter", axis=1, inplace=True)
    df = pd.concat([df, pd.get_dummies(df["relevant_yn"])], axis=1)
    df.drop("relevant_yn", axis=1, inplace=True)
    return df
train_data_df = one_hot_df(train_data_df)
test_data_df = one_hot_df(test_data_df)
display(train_data_df)
display(test_data_df)
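One caveat with get_dummies: if a category appears in only one of the splits, train and test end up with different columns, and the test matrix built later will not line up with the training features. A minimal sketch to align them (assuming the dataframes produced above, with sentiment as the train-only label column):
# Columns the model will be trained on (everything except the label)
feature_cols = [c for c in train_data_df.columns if c != "sentiment"]
# reindex adds any one-hot columns missing from test (filled with 0)
# and drops columns the training frame never saw
test_data_df = test_data_df.reindex(columns=feature_cols, fill_value=0)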
Transforming the Train Data¶
# First, transform train_data_df['text'] to lowercase
train_data_df['text'] = train_data_df['text'].str.lower()
# Then replace everything except letters and numbers with spaces;
# this makes it easier to split the text into words later.
train_data_df['text'] = train_data_df['text'].replace('[^a-zA-Z0-9]', ' ', regex=True)
# Convert a collection of raw documents to a matrix of TF-IDF features with TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5)
X_tfidf = vectorizer.fit_transform(train_data_df['text'])
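As a quick sanity check, you can inspect how large the learned vocabulary is and what some of the tokens look like (get_feature_names_out requires scikit-learn >= 1.0; older versions use get_feature_names):
# Number of TF-IDF features kept after the min_df=5 cutoff
print(len(vectorizer.vocabulary_))
# A few sample tokens from the fitted vocabulary
print(vectorizer.get_feature_names_out()[:10])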
# merging the final features into the DataFrame and removing the redundant columns
train_data_df = pd.concat([train_data_df,pd.DataFrame(X_tfidf.toarray())], axis=1)
train_data_df.drop("text", axis=1, inplace=True)
display(train_data_df)
# Separating data from the dataframe for final training
X = normalize(train_data_df.drop(["sentiment"], axis=1).to_numpy())
label_encoder = LabelEncoder()
label_encoder.fit(train_data_df.sentiment)
train_data_df.sentiment = label_encoder.transform(train_data_df.sentiment)
y = train_data_df.sentiment.to_numpy()
print(X.shape, y.shape)
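As an aside, hstack from scipy.sparse is imported above but never used. A leaner variant (just a sketch, assuming you skip the dense toarray() merge above so the 'text' and 'sentiment' columns are still in the frame) keeps the TF-IDF matrix sparse, which scales better with vocabulary size:
from scipy.sparse import csr_matrix, hstack
# One-hot/categorical part as a sparse matrix
one_hot_part = csr_matrix(train_data_df.drop(["sentiment", "text"], axis=1).to_numpy(dtype=float))
# Sparse feature matrix: one-hot columns followed by TF-IDF columns
X_sparse = hstack([one_hot_part, X_tfidf])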
Transforming the Test Data for Submission¶
# Apply the same cleaning to test_data_df['text']: lowercase first
test_data_df['text'] = test_data_df['text'].str.lower()
# Then replace everything except letters and numbers with spaces,
# matching the preprocessing used on the training text.
test_data_df['text'] = test_data_df['text'].replace('[^a-zA-Z0-9]', ' ', regex=True)
# Transform the test documents with the already-fitted TfidfVectorizer
X_tfidf_test = vectorizer.transform(test_data_df['text'])
# merging the final features into the DataFrame and removing the redundant columns
test_data_df = pd.concat([test_data_df,pd.DataFrame(X_tfidf_test.toarray())], axis=1)
test_data_df.drop("text", axis=1, inplace=True)
display(test_data_df)
Splitting the data¶
# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(y_train.shape)
X_train[0], y_train[0]
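Note that without a fixed seed the split changes on every run. If you want reproducible results, with label proportions preserved across both splits, a common variant is:
# random_state fixes the shuffle; stratify keeps class ratios in both splits
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)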
Training the Model¶
model = MLPClassifier()
model.fit(X_train, y_train)
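With default settings, MLPClassifier stops after max_iter=200 iterations and may emit a ConvergenceWarning on TF-IDF features. Raising the iteration budget is a simple first tweak (the values below are illustrative, not tuned):
# Illustrative settings only; tune against the actual challenge metric
model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
model.fit(X_train, y_train)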
Validation¶
model.score(X_val, y_val)
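score reports plain accuracy; if the classes are imbalanced, a per-class breakdown is more informative. A quick sketch using sklearn's classification_report:
from sklearn.metrics import classification_report
# Per-class precision/recall/F1 on the validation split
print(classification_report(y_val, model.predict(X_val), target_names=label_encoder.classes_))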
So, we are done with the baseline. Let's run it on the real test data and see how to submit the predictions to the challenge.
Predictions¶
# Separating data from the dataframe for final testing
X_test = normalize(test_data_df.to_numpy())
print(X_test.shape)
# Predicting the labels
predictions = model.predict(X_test)
predictions = label_encoder.inverse_transform(predictions)
# Converting the predictions array into a pandas DataFrame
submission = pd.DataFrame({"sentiment":predictions})
submission
# Saving the pandas dataframe
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"), index=False)
Submitting our Predictions¶
Note: Please save the notebook before submitting it (Ctrl + S).
!!aicrowd submission create -c debat -f assets/submission.csv