[Getting Started Notebook] BRAND Challenge

This is a Baseline Code to get you started with the challenge.

gauransh_k

You can use this code to start understanding the data and to create a baseline model for further improvements.

Starter Code for BRAND Practice Challenge

Note: Create a copy of the notebook and use the copy for submission. Go to File > Save a Copy in Drive to create a new copy.

Author: Gauransh Kumar

Downloading Dataset

Installing aicrowd-cli

In [1]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Requirement already satisfied: aicrowd-cli in /home/gauransh/anaconda3/lib/python3.8/site-packages (0.1.10)
Requirement already satisfied: requests-toolbelt<1,>=0.9.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (0.9.1)
Requirement already satisfied: rich<11,>=10.0.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (10.15.2)
Requirement already satisfied: toml<1,>=0.10.2 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: requests<3,>=2.25.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (2.26.0)
Requirement already satisfied: pyzmq==22.1.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (22.1.0)
Requirement already satisfied: GitPython==3.1.18 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (3.1.18)
Requirement already satisfied: tqdm<5,>=4.56.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (4.62.2)
Requirement already satisfied: click<8,>=7.1.2 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (7.1.2)
Requirement already satisfied: gitdb<5,>=4.0.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from GitPython==3.1.18->aicrowd-cli) (4.0.9)
Requirement already satisfied: smmap<6,>=3.0.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from gitdb<5,>=4.0.1->GitPython==3.1.18->aicrowd-cli) (5.0.0)
Requirement already satisfied: idna<4,>=2.5 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (3.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.6)
Requirement already satisfied: certifi>=2017.4.17 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.0)
Requirement already satisfied: colorama<0.5.0,>=0.4.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.4.4)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.9.1)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.10.0)
In [2]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/7pcsnH-ZIilqPgflYsgbVWATLbbySqVJRzZhFH9jZWQ
Opening in existing browser session.
API Key valid
Saved API Key successfully!
In [3]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c brand -o data

Importing Libraries

In this baseline, we will use the scikit-learn (sklearn) library to train the model and generate the predictions.

In [4]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import normalize, LabelEncoder
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import os
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

import nltk
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to /home/gauransh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[4]:
True

Reading the dataset

Here, we will read the train.csv which contains both training samples & labels, and test.csv which contains testing samples.

In [5]:
# Reading the CSV
train_data_df = pd.read_csv("data/train.csv")
test_data_df = pd.read_csv("data/test.csv")

# Preview the first five rows of each dataframe
display(train_data_df.head())
display(test_data_df.head())
   tweet_text                                         is_there_an_emotion_directed_at_a_brand_or_product
0  RT @mention Google to Launch Major New Soci...     No emotion toward brand or product
1  RT @mention &quot;Google maps: Route around tr...  Positive emotion
2  RT @mention Server Challenge is a huge hit at ...  No emotion toward brand or product
3  Going to the urinal while holding ur ipad and ...  No emotion toward brand or product
4  Digging John McRee's talk about designing for ...  No emotion toward brand or product

   tweet_text
0  RT @mention Need a Workspace? Book it from yo...
1  RT @mention On its second day in business, the...
2  @mention @mention finishing up beta on Android...
3  Hey foodies: if you're in Austin for SXSW down...
4  I guess no Google social network just yet. But...
In [6]:
def tweet_processor(tweets_df, column):
    
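    # Remove "@user: " prefixes and retweet markers ("RT").
    # Caution: " *RT*" matches any capital "R" followed by zero or more "T"s,
    # so it also strips the "R" from words such as "Route" (see Out[7]).
    def remove_pattern(text, pattern_regex):
        # Each matched substring is reused as a pattern and removed everywhere
        r = re.findall(pattern_regex, text)
        for i in r:
            text = re.sub(i, '', text)

        return text
    # Raw string so that "\w" is not treated as a string escape sequence
    tweets_df[column] = np.vectorize(remove_pattern)(tweets_df[column], r"@[\w]*: | *RT*")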
    
    # Filter out every word that contains a link
    cleaned_tweets = []
    for index, row in tweets_df.iterrows():
        words_without_links = [word for word in row[column].split() if 'http' not in word]
        cleaned_tweets.append(' '.join(words_without_links))
    tweets_df[column] = cleaned_tweets
    tweets_df[column] = tweets_df[column].str.replace(r"[^a-zA-Z# ]","", regex=True)
    
    # Tokenization
    tokenized_tweet = tweets_df[column].apply(lambda x: x.split())
    
    # Finding Lemma for each word
    word_lemmatizer = WordNetLemmatizer()
    tokenized_tweet = tokenized_tweet.apply(lambda x: [word_lemmatizer.lemmatize(i) for i in x])
    # Join the lemmatized tokens of each tweet back into a single string
    tokenized_tweet = tokenized_tweet.apply(' '.join)

    tweets_df[column] = tokenized_tweet
    return tweets_df

def tweet_vectorizer(tweets_df, column):
    # TF-IDF features
    tfidf_word_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2)
    # TF-IDF feature matrix
    tfidf_word_vectorizer = tfidf_word_vectorizer.fit(tweets_df[column])
    tfidf_word_feature = tfidf_word_vectorizer.transform(tweets_df[column])
    return tfidf_word_feature, tfidf_word_vectorizer
In [7]:
tweet_processor(train_data_df, "tweet_text")
Out[7]:
      tweet_text                                         is_there_an_emotion_directed_at_a_brand_or_product
0     mention Google to Launch Major New Social Netw...  No emotion toward brand or product
1     mention quotGoogle mapsoute around traffic is ...  Positive emotion
2     mention Server Challenge is a huge hit at #sxs...  No emotion toward brand or product
3     Going to the urinal while holding ur ipad and ...  No emotion toward brand or product
4     Digging John Mcees talk about designing for Bo...  No emotion toward brand or product
...   ...                                                ...
7268  mention mention GO BEYOND BODES link #edchat #...  No emotion toward brand or product
7269  mention #seenatsxsw Best thing seen at #sxsw s...  No emotion toward brand or product
7270  mention Hey Taariq howdy from Texas fav #sXsw ...  Positive emotion
7271  EDCOSS to T mention If you can afford to atten...  No emotion toward brand or product
7272  Hey #SXSW mover and shaker mention is publishi...  No emotion toward brand or product

7273 rows × 2 columns
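
The cleaned output above shows a side effect of the ` *RT*` pattern: "Route" becomes "oute" and "McRee's" becomes "Mcees". The following is a minimal sketch of a stricter cleaner, assuming the intent is to drop only links, @mentions and the standalone retweet marker "RT". `clean_tweet` is a hypothetical helper, not part of the original notebook, and unlike the notebook it removes mentions entirely instead of leaving the word "mention" behind.

```python
import re

def clean_tweet(text):
    """Hypothetical alternative cleaner (an assumption, not the notebook's method)."""
    text = re.sub(r"http\S+", " ", text)      # drop links
    text = re.sub(r"@\w+:?", " ", text)       # drop @mentions and a trailing colon
    text = re.sub(r"\bRT\b", " ", text)       # drop "RT" only when it is a whole word
    text = re.sub(r"[^a-zA-Z# ]", "", text)   # same character filter as the notebook
    return " ".join(text.split())             # collapse repeated spaces

print(clean_tweet("RT @mention Google maps: Route around traffic http://t.co/x #sxsw"))
print(clean_tweet("Digging John McRee's talk"))
```

Because the word boundary `\b` anchors the match, capital letters inside ordinary words are left alone.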

In [8]:
# Encoding the labels using sklearn.preprocessing.LabelEncoder()
label_encoder = LabelEncoder()
label_encoder.fit(train_data_df["is_there_an_emotion_directed_at_a_brand_or_product"])
train_data_df["is_there_an_emotion_directed_at_a_brand_or_product"] = label_encoder.transform(train_data_df["is_there_an_emotion_directed_at_a_brand_or_product"])
train_data_df.head()
Out[8]:
   tweet_text                                         is_there_an_emotion_directed_at_a_brand_or_product
0  mention Google to Launch Major New Social Netw...  2
1  mention quotGoogle mapsoute around traffic is ...  3
2  mention Server Challenge is a huge hit at #sxs...  2
3  Going to the urinal while holding ur ipad and ...  2
4  Digging John Mcees talk about designing for Bo...  2
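
To see which integer stands for which class, the fitted encoder can be inspected. The sketch below uses a toy list with only the two class names visible in the outputs above, so the codes here (0 and 1) differ from the notebook's (2 and 3), where the full training set contains more classes. `LabelEncoder` assigns codes in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (illustrative); the real column has more classes than these two.
labels = [
    "No emotion toward brand or product",
    "Positive emotion",
    "No emotion toward brand or product",
]
encoder = LabelEncoder().fit(labels)

# classes_ is sorted, and the position of each name is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original strings
print(encoder.inverse_transform([1, 0]))
```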

Data Preprocessing

In [9]:
# Separating data from the dataframe for final training
X, tfidf_vectorizer = tweet_vectorizer(train_data_df, "tweet_text")
y = train_data_df.is_there_an_emotion_directed_at_a_brand_or_product
print(X.shape, y.shape)
(7273, 4784) (7273,)
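
The 4784 columns are the vocabulary that the vectorizer learned from the training tweets. A small sketch on a made-up three-sentence corpus (an assumption for illustration) shows how `min_df=2` prunes rare terms and why new text must go through `transform` on the already fitted vectorizer: unseen words are ignored and the column count stays the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only
train_texts = [
    "google launch new social network",
    "new ipad launch at sxsw",
    "google map at sxsw",
]
vectorizer = TfidfVectorizer(min_df=2)   # keep terms found in at least 2 documents
X_toy = vectorizer.fit_transform(train_texts)

# "android" was never seen during fit, so it contributes no feature
X_new = vectorizer.transform(["google android launch"])

print(sorted(vectorizer.vocabulary_))
print(X_toy.shape, X_new.shape)
```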
In [10]:
# Visualising the final label classes for training
sns.countplot(x=y)
Out[10]:
<AxesSubplot:xlabel='is_there_an_emotion_directed_at_a_brand_or_product', ylabel='count'>
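
The class balance that the count plot visualises can also be read as numbers with `value_counts`. This sketch uses only the five encoded labels shown by `head()` earlier, so the shares are illustrative and say nothing about the full training set.

```python
import pandas as pd

# Encoded labels of the five preview rows shown in Out[8]
y_toy = pd.Series([2, 3, 2, 2, 2])

counts = y_toy.value_counts().sort_index()     # rows per class
shares = (counts / counts.sum()).round(2)      # fraction of rows per class
print(pd.DataFrame({"count": counts, "share": shares}))
```

On the real data the same two lines can be applied to `y` directly.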

Splitting the data

In [11]:
# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(y_train.shape)
(5818, 4784)
(5818,)
In [12]:
# Note: X_train[0] is positional, while y_train[0] looks up index label 0 in the Series
X_train[0], y_train[0]
Out[12]:
(<1x4784 sparse matrix of type '<class 'numpy.float64'>'
 	with 14 stored elements in Compressed Sparse Row format>,
 2)
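
The split above is random on every run and does not take the class balance into account. A hedged sketch of a reproducible, stratified variant on toy arrays follows; `random_state=42` is an arbitrary choice and the toy data is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)     # imbalanced toy labels (75% / 25%)

# stratify keeps the class proportions the same in both parts;
# random_state makes the shuffle repeatable
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print(X_tr.shape, X_va.shape)
print(np.bincount(y_tr), np.bincount(y_va))
```

For a pandas Series such as `y_train`, `y_train.iloc[0]` gives the first element by position, which pairs correctly with `X_train[0]`.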

Training the Model

In [13]:
model = MLPClassifier()
model.fit(X_train, y_train)
/home/gauransh/anaconda3/lib/python3.8/site-packages/sklearn/neural_network/_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
Out[13]:
MLPClassifier()
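
The ConvergenceWarning above means the default limit of 200 iterations was reached before the optimizer converged. One possible response, sketched here on synthetic data, is to raise `max_iter` and enable `early_stopping`, which holds out part of the training data and stops when the validation score no longer improves. The parameter values are illustrative assumptions, not tuned settings for this challenge.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

clf = MLPClassifier(
    max_iter=500,           # allow more epochs than the default 200
    early_stopping=True,    # hold out 10% of the data and monitor its score
    n_iter_no_change=10,    # stop after 10 epochs without improvement
    random_state=0,
)
clf.fit(X_toy, y_toy)
train_accuracy = clf.score(X_toy, y_toy)
print(clf.n_iter_, round(train_accuracy, 3))
```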

Validation

In [14]:
model.score(X_val, y_val)
Out[14]:
0.643298969072165
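
`model.score` reports plain accuracy, which can look good on imbalanced labels even when a minority class is predicted poorly. The sketch below compares accuracy with macro-averaged F1 on made-up labels; the challenge's official metric is not stated in this notebook, so treating F1 as relevant is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: four rows of class 2 and two rows of class 3
y_true = [2, 2, 2, 2, 3, 3]
y_pred = [2, 2, 2, 2, 2, 3]    # one class-3 row is misclassified as class 2

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean of per-class F1
print(round(acc, 3), round(macro_f1, 3))
```

On the notebook's data the same calls would take `y_val` and `model.predict(X_val)`.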

So, we are done with the baseline. Let's run it on the real test data and see how to submit the predictions to the challenge.

Predictions

In [15]:
# Separating data from the dataframe for final testing
test_data_df = tweet_processor(test_data_df, "tweet_text")
X_test = tfidf_vectorizer.transform(test_data_df.tweet_text)
print(X_test.shape)
(1819, 4784)
In [16]:
# Predicting the labels
predictions = model.predict(X_test)
predictions = label_encoder.inverse_transform(predictions)
predictions.shape
Out[16]:
(1819,)
In [21]:
# Converting the predictions array into a pandas DataFrame
submission = pd.DataFrame({"directed_emotion":predictions})
submission
Out[21]:
      directed_emotion
0     Positive emotion
1     Positive emotion
2     No emotion toward brand or product
3     Positive emotion
4     No emotion toward brand or product
...   ...
1814  No emotion toward brand or product
1815  No emotion toward brand or product
1816  No emotion toward brand or product
1817  No emotion toward brand or product
1818  Positive emotion

1819 rows × 1 columns

In [22]:
# Saving the pandas dataframe
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"), index=False)
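
Before uploading, it can help to read the file back and confirm its shape. The sketch below writes a two-row stand-in file to a temporary directory so that it does not touch `assets/`; the column name `directed_emotion` is taken from the cell above, while the stand-in predictions are made up.

```python
import os
import tempfile

import pandas as pd

# Stand-in predictions, for illustration only
preds = ["Positive emotion", "No emotion toward brand or product"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    pd.DataFrame({"directed_emotion": preds}).to_csv(path, index=False)
    check = pd.read_csv(path)   # read the file back exactly as a grader would

# One column, one row per prediction, no missing values
print(list(check.columns), len(check), int(check["directed_emotion"].isna().sum()))
```

For the real file, the row count should equal the 1819 test tweets.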

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [23]:
!!aicrowd submission create -c brand -f assets/submission.csv
Out[23]:
['submission.csv ━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 52.8/51.1 KB • 7.2 MB/s • 0:00:00',
 '                                  ╭─────────────────────────╮                                  ',
 '                                  │ Successfully submitted! │                                  ',
 '                                  ╰─────────────────────────╯                                  ',
 '                                        Important links                                        ',
 '┌──────────────────┬──────────────────────────────────────────────────────────────────────────┐',
 '│  This submission │ https://www.aicrowd.com/challenges/brand/submissions/169775              │',
 '│                  │                                                                          │',
 '│  All submissions │ https://www.aicrowd.com/challenges/brand/submissions?my_submissions=true │',
 '│                  │                                                                          │',
 '│      Leaderboard │ https://www.aicrowd.com/challenges/brand/leaderboards                    │',
 '│                  │                                                                          │',
 '│ Discussion forum │ https://discourse.aicrowd.com/c/brand                                    │',
 '│                  │                                                                          │',
 '│   Challenge page │ https://www.aicrowd.com/challenges/brand                                 │',
 '└──────────────────┴──────────────────────────────────────────────────────────────────────────┘',
 "{'submission_id': 169775, 'created_at': '2021-12-25T07:04:32.477Z'}"]