[Getting Started Notebook] AGREE Challenge
This is baseline code to get you started with the challenge.
You can use this code to start understanding the data and to build a baseline model for further improvements.
Starter Code for the AGREE Practice Challenge
Note: Create a copy of the notebook and use the copy for submission. Go to File > Save a Copy in Drive to create a new copy.
Downloading Dataset¶
Installing aicrowd-cli
!pip install aicrowd-cli
%load_ext aicrowd.magic
%aicrowd login
!rm -rf data
!mkdir data
%aicrowd ds dl -c agree -o data
# remove the extra blank line at the end of each dataset file
!sed -i '$ d' data/train.csv
!sed -i '$ d' data/test.csv
Importing Libraries¶
In this baseline, we will be using the sklearn library to train the model and generate the predictions.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from scipy.sparse import hstack
import os
from IPython.display import display
Reading the dataset¶
Here, we will read train.csv, which contains both training samples & labels, and test.csv, which contains testing samples.
# Reading the CSVs (column names are supplied explicitly)
name = ["unit_id", "golden_or_not", "unit_state", "trusted_judgments", "last_judgment_at", "agree_or_not_variance", "sentence", "agree_or_not"]
train_data_df = pd.read_csv("data/train.csv", names=name, encoding='ISO-8859-1')
test_data_df = pd.read_csv("data/test.csv", names=name[:-1], encoding='ISO-8859-1')
display(train_data_df.head())
display(test_data_df.head())
print(train_data_df.shape, test_data_df.shape)
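Before preprocessing, it can help to get a quick picture of the column types and any missing values. This is a small optional check, using only the pandas calls already imported above:
# Optional sanity check: column dtypes and missing-value counts
print(train_data_df.dtypes)
print(train_data_df.isna().sum())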
Data Preprocessing¶
The dataset mixes categorical and textual columns, so we will first one-hot encode the categorical features, then use TF-IDF to turn each sentence into numerical features we can feed to the regressor.
# utility function to one-hot encode the categorical columns
def one_hot_df(df):
    # expand each categorical column into dummy columns, then drop the original
    df = pd.concat([df, pd.get_dummies(df["golden_or_not"])], axis=1)
    df.drop("golden_or_not", axis=1, inplace=True)
    df = pd.concat([df, pd.get_dummies(df["unit_state"])], axis=1)
    df.drop("unit_state", axis=1, inplace=True)
    return df
train_data_df = one_hot_df(train_data_df)
test_data_df = one_hot_df(test_data_df)
display(train_data_df)
display(test_data_df)
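One caveat with encoding train and test separately: if a category value appears in only one split, the two frames end up with different dummy columns. A minimal safeguard (an optional addition, not part of the original baseline; feature_cols is a helper name introduced here) is to align the test columns to the train columns, minus the label:
# Align the test frame to the train frame's columns (excluding the label);
# any dummy column missing from test is filled with zeros.
feature_cols = [c for c in train_data_df.columns if c != "agree_or_not"]
test_data_df = test_data_df.reindex(columns=feature_cols, fill_value=0)
This also guarantees that the feature columns appear in the same order in both frames, which matters later when we convert them to plain numpy arrays.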
Transforming Train Data for Training¶
# To begin, transform train_data_df['sentence'] to lowercase using str.lower()
train_data_df['sentence'] = train_data_df['sentence'].str.lower()
# Then replace everything except letters and numbers with spaces;
# this makes it easier to split the text into words later.
train_data_df['sentence'] = train_data_df['sentence'].replace('[^a-zA-Z0-9]', ' ', regex=True)
# Convert a collection of raw documents to a matrix of TF-IDF features with TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5)
X_tfidf = vectorizer.fit_transform(train_data_df['sentence'])
# merging the TF-IDF features into the dataframe and removing the redundant columns
train_data_df = pd.concat([train_data_df,pd.DataFrame(X_tfidf.toarray())], axis=1)
train_data_df.drop("sentence", axis=1, inplace=True)
display(train_data_df)
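To sanity-check what the TF-IDF step learned, you can peek at a few vocabulary terms. Note this assumes scikit-learn >= 1.0, where get_feature_names_out is available (older versions use get_feature_names):
# Peek at a handful of the learned vocabulary terms
print(vectorizer.get_feature_names_out()[:10])
print("vocabulary size:", len(vectorizer.vocabulary_))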
# Separating data from the dataframe for final training
X = normalize(train_data_df.drop(["unit_id", "last_judgment_at", "agree_or_not"], axis=1).to_numpy())
y = train_data_df["agree_or_not"].to_numpy()
print(X.shape, y.shape)
Transforming Test Data for Submission¶
# As with the train data, transform test_data_df['sentence'] to lowercase using str.lower()
test_data_df['sentence'] = test_data_df['sentence'].str.lower()
# Then replace everything except letters and numbers with spaces;
# this makes it easier to split the text into words later.
test_data_df['sentence'] = test_data_df['sentence'].replace('[^a-zA-Z0-9]', ' ', regex=True)
# Convert a collection of raw documents to a matrix of TF-IDF features with TfidfVectorizer
X_tfidf_test = vectorizer.transform(test_data_df['sentence'])
# merging the TF-IDF features into the dataframe and removing the redundant columns
test_data_df = pd.concat([test_data_df,pd.DataFrame(X_tfidf_test.toarray())], axis=1)
test_data_df.drop("sentence", axis=1, inplace=True)
display(test_data_df)
Splitting the data¶
# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)  # fixed seed for reproducible splits
print(X_train.shape)
print(y_train.shape)
X_train[0], y_train[0]
Training the Model¶
model = GradientBoostingRegressor()
model.fit(X_train, y_train)
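The defaults are a reasonable starting point. As one possible next step, you can pass the hyperparameters explicitly and experiment from there; tuned_model and the values below are illustrative only, not tuned for this dataset:
# A hypothetical alternative with explicit hyperparameters (illustrative values)
tuned_model = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting stages
    learning_rate=0.05,  # step size per stage
    max_depth=3,         # depth of each regression tree
    random_state=42,     # reproducible runs
)
tuned_model.fit(X_train, y_train)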
Validation¶
model.score(X_val, y_val)
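For a regressor, score returns the coefficient of determination (R²), so values closer to 1 are better. To see the error in the target's own units, here is a small optional check using standard scikit-learn metrics, not part of the original baseline:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Evaluate the validation predictions with explicit error metrics
val_preds = model.predict(X_val)
print("MSE:", mean_squared_error(y_val, val_preds))
print("MAE:", mean_absolute_error(y_val, val_preds))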
With the baseline done, let's generate predictions on the real test data and see how to submit them to the challenge.
Predictions¶
# Separating data from the dataframe for final testing
X_test = normalize(test_data_df.drop(["unit_id", "last_judgment_at"], axis=1).to_numpy())
print(X_test.shape)
# Predicting the labels
predictions = model.predict(X_test)
predictions.shape
# Converting the predictions array into a pandas dataframe
submission = pd.DataFrame({"agree_or_not":predictions})
submission
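If the agree_or_not target is bounded (this is an assumption; check the training labels to confirm), clipping out-of-range regression outputs to the range observed in training can be a cheap improvement:
# Optional: clip predictions to the range seen in the training labels
# (assumes the true target cannot fall outside that range).
submission["agree_or_not"] = submission["agree_or_not"].clip(y.min(), y.max())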
# Saving the pandas dataframe
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"), index=False)
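A quick read-back of the saved file helps catch formatting problems before submitting:
# Sanity check: re-read the saved submission and confirm its shape
print(pd.read_csv(os.path.join("assets", "submission.csv")).shape)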
Submitting our Predictions¶
Note: Please save the notebook before submitting it (Ctrl + S).
!!aicrowd submission create -c agree -f assets/submission.csv