BRAND
[Getting Started Notebook] BRAND Challenge
This is baseline code to get you started with the challenge.
You can use it to understand the data and build a baseline model for further improvements.
Starter Code for BRAND Practice Challenge
Note: Create a copy of the notebook and use the copy for submission. Go to File > Save a Copy in Drive to create a new copy.
Author: Gauransh Kumar
Downloading Dataset
Installing aicrowd-cli
!pip install aicrowd-cli
%load_ext aicrowd.magic
%aicrowd login
!rm -rf data
!mkdir data
%aicrowd ds dl -c brand -o data
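To check that the download worked, you can list the contents of the data directory; the cells below assume it contains train.csv and test.csv.
# Sanity check: the dataset files should now be in data/
!ls -lh data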
Importing Libraries
In this baseline, we will be using the scikit-learn library to train the model and generate the predictions.
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import normalize, LabelEncoder
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import os
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import nltk
nltk.download('wordnet')
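Depending on your NLTK version, the WordNet lemmatizer may also need the omw-1.4 resource; downloading it up front avoids a LookupError later (it is a harmless no-op if already present).
# Needed by WordNetLemmatizer on some NLTK versions
nltk.download('omw-1.4')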
Reading the dataset
Here, we will read train.csv, which contains both training samples and labels, and test.csv, which contains the testing samples.
# Reading the CSV
train_data_df = pd.read_csv("data/train.csv")
test_data_df = pd.read_csv("data/test.csv")
# train_data_df.shape, test_data_df.shape
display(train_data_df.head())
display(test_data_df.head())
def tweet_processor(tweets_df, column):
    # Removing the @mentions and RT markers in the tweet
    def remove_pattern(text, pattern_regex):
        matches = re.findall(pattern_regex, text)
        for match in matches:
            text = re.sub(re.escape(match), '', text)
        return text

    tweets_df[column] = np.vectorize(remove_pattern)(tweets_df[column], r"@[\w]*: | *RT*")

    # Filtering out all the words that contain a link
    cleaned_tweets = []
    for index, row in tweets_df.iterrows():
        words_without_links = [word for word in row[column].split() if 'http' not in word]
        cleaned_tweets.append(' '.join(words_without_links))
    tweets_df[column] = cleaned_tweets

    # Keeping only letters, hashes and spaces
    tweets_df[column] = tweets_df[column].str.replace(r"[^a-zA-Z# ]", "", regex=True)

    # Tokenization
    tokenized_tweet = tweets_df[column].apply(lambda x: x.split())

    # Finding the lemma for each word
    word_lemmatizer = WordNetLemmatizer()
    tokenized_tweet = tokenized_tweet.apply(lambda x: [word_lemmatizer.lemmatize(i) for i in x])

    # Joining the words back into sentences
    for i, tokens in enumerate(tokenized_tweet):
        tokenized_tweet[i] = ' '.join(tokens)

    tweets_df[column] = tokenized_tweet
    return tweets_df
def tweet_vectorizer(tweets_df, column):
    # TF-IDF features
    tfidf_word_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2)
    # TF-IDF feature matrix
    tfidf_word_vectorizer = tfidf_word_vectorizer.fit(tweets_df[column])
    tfidf_word_feature = tfidf_word_vectorizer.transform(tweets_df[column])
    return tfidf_word_feature, tfidf_word_vectorizer
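As a quick sanity check, here is what tweet_processor does to a couple of made-up tweets (this toy DataFrame is purely illustrative and not part of the challenge data):
# Illustrative only: fake tweets, not challenge data
demo_df = pd.DataFrame({"tweet_text": [
    "RT @someone: Loving the new phone! http://example.com #happy",
    "@brand my battery died again #fail",
]})
print(tweet_processor(demo_df, "tweet_text")["tweet_text"].tolist())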
tweet_processor(train_data_df, "tweet_text")
# Encoding the labels using sklearn.preprocessing.LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(train_data_df["is_there_an_emotion_directed_at_a_brand_or_product"])
train_data_df["is_there_an_emotion_directed_at_a_brand_or_product"] = label_encoder.transform(train_data_df["is_there_an_emotion_directed_at_a_brand_or_product"])
train_data_df.head()
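To see which integer the encoder assigned to each emotion class, you can inspect label_encoder.classes_ (the position in the array is the encoded label):
# Mapping from encoded integer back to the original class name
dict(enumerate(label_encoder.classes_))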
Data Preprocessing
# Building the feature matrix and target labels for final training
X, tfidf_vectorizer = tweet_vectorizer(train_data_df, "tweet_text")
y = train_data_df.is_there_an_emotion_directed_at_a_brand_or_product
print(X.shape, y.shape)
# Visualising the final label classes for training
sns.countplot(x=y)
Splitting the data
# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(y_train.shape)
# Inspecting the first training sample and its label
X_train[0], y_train.iloc[0]
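The split above is shuffled differently on every run. If you want a reproducible split that also preserves the class proportions, one option is a stratified split with a fixed seed:
# Optional: reproducible, class-balanced split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)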
Training the Model
model = MLPClassifier()
model.fit(X_train, y_train)
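MLPClassifier runs with its defaults here (a single hidden layer of 100 units, at most 200 iterations), which may emit a convergence warning on sparse TF-IDF features. If you want more control, an illustrative (untuned) variant with early stopping and a fixed seed looks like this:
# Illustrative hyperparameters, not tuned for this task
model = MLPClassifier(hidden_layer_sizes=(128,), early_stopping=True, max_iter=300, random_state=42)
model.fit(X_train, y_train)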
Validation
model.score(X_val, y_val)
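model.score reports plain accuracy. Since the emotion classes may be imbalanced, a per-class breakdown can be more informative:
from sklearn.metrics import classification_report
# Per-class precision, recall and F1 on the validation split
print(classification_report(y_val, model.predict(X_val)))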
So, we are done with the baseline. Let's generate predictions on the real test data and see how to submit them to the challenge.
Predictions
# Preprocessing the test data and vectorizing it with the fitted vectorizer
test_data_df = tweet_processor(test_data_df, "tweet_text")
X_test = tfidf_vectorizer.transform(test_data_df.tweet_text)
print(X_test.shape)
# Predicting the labels
predictions = model.predict(X_test)
predictions = label_encoder.inverse_transform(predictions)
predictions.shape
# Converting the predictions array into a pandas DataFrame
submission = pd.DataFrame({"directed_emotion":predictions})
submission
# Saving the DataFrame as the submission CSV
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"), index=False)
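A quick peek at the saved file confirms it has the expected directed_emotion column before submitting:
# Sanity check on the submission file
!head assets/submission.csv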
Submitting our Predictions
Note: Please save the notebook before submitting it (Ctrl + S).
!!aicrowd submission create -c brand -f assets/submission.csv