AI Blitz #9: Completed #educational Weight: 25.0

Introduction

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. If an ML model is a small child, we want to feed it accurate information so that it can interpret that information well. A data-focused approach often gives better results than a model-focused one: feature engineering produces data that the model understands better, which in turn improves results.

Our starter kit provides an easy-to-follow guide on implementing these steps and getting a quick solution.

💪 Getting Started

Word2Vec is a technique for learning word embeddings. A word embedding is a vector representation of each word in the documents' vocabulary. It captures semantic and syntactic similarities and relations among words, which gives a model better context for the data. Word2Vec builds these embeddings with one of two common architectures: skip-gram and continuous bag of words (CBOW).
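
To make the difference between the two Word2Vec architectures concrete, here is a minimal, dependency-free sketch of how training examples are formed from a token list. The function names and the `window` parameter are illustrative assumptions, not part of the starter kit, and no embedding is actually trained here: skip-gram predicts each context word from the center word, while continuous bag of words (CBOW) predicts the center word from its surrounding context.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: skip-gram predicts context from center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples: CBOW predicts target from context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:  # skip tokens that have no neighbours at all
            examples.append((context, target))
    return examples


# Hypothetical tokenized abstract fragment, used only for illustration.
tokens = ["neural", "networks", "classify", "faults"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

A real Word2Vec implementation would feed such examples to a small neural network and keep the learned input weights as the word vectors.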

Using the AIcrowd starter kit, you can easily set up the libraries, download the dataset, and create the submission template. It also guides you through simple tokenization.

In our starter kit, we use the Bag of Words method, with count vectorization followed by TF-IDF. Lastly, a Word2Vec model is preprocessed, trained, and tested.
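
The count-then-TF-IDF idea can be sketched without any external library. The example below is not the starter kit's code: it is a simplified stand-in that assumes a regex tokenizer, a vocabulary of the 512 tokens with the highest document frequency, a smoothed IDF, and zero-padding so that every vector has the required length. All function names are hypothetical.

```python
import math
import re
from collections import Counter

VECTOR_SIZE = 512  # feature length required by the challenge


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_vocabulary(texts, size=VECTOR_SIZE):
    """Pick the `size` tokens found in the most documents and compute their IDF."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))
    vocab = {tok: i for i, (tok, _) in enumerate(doc_freq.most_common(size))}
    n_docs = len(texts)
    # Smoothed IDF (an assumption): rarer tokens get a larger weight.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    return vocab, idf


def tfidf_vector(text, vocab, idf, size=VECTOR_SIZE):
    """Return an L2-normalized TF-IDF vector, zero-padded to `size` elements."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocab)
    vec = [0.0] * size
    for tok, count in counts.items():
        vec[vocab[tok]] = count * idf[tok]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Hypothetical two-document corpus, used only for illustration.
texts = [
    "Gaussian mixture models classify faults",
    "This paper proposes a neuro-rough model",
]
vocab, idf = fit_vocabulary(texts)
features = [tfidf_vector(text, vocab, idf) for text in texts]
```

With only a handful of documents most of the 512 positions stay at zero. On a larger corpus the vocabulary fills up and the cut-off at 512 tokens starts to matter.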

💾 Dataset

The dataset is similar to the Research Paper Classification dataset (but is a completely different set). The text column contains the abstract of a research paper. The feature column is the vector your model will generate for the corresponding text. Each vector should contain only 512 elements.

The original data.csv has 30,000 samples and is used for evaluation after the notebook is submitted. Because the public dataset has only 10 samples (meant for testing your code locally), we suggest experimenting with the Research Paper Classification dataset (without labels) to build intuition for how to generate good features.

text	feature
Gaussian mixture models (GMM) and support vector machines (SVM) are introduced to classify faults in ... while the MLP produces 88% classification rates.	[0.34, 0.56, ....,0.2, 0.38]
This paper proposes a neuro-rough model ... and the transparency of Bayesian rough set model.	[0.39, 0.23, ...., 0.22, 0.24]
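
Since the sample evaluation code reads each feature cell back with `literal_eval`, it can help to check locally that a cell round-trips to a list of the expected length. The following is a minimal sketch under the assumption that every vector must have exactly 512 elements. The helper names `serialize_feature` and `parse_feature` are hypothetical and not part of the starter kit.

```python
from ast import literal_eval

EXPECTED_LENGTH = 512  # vector length stated above; treated here as exact


def serialize_feature(vector):
    """Turn a sequence of numbers into the string stored in the CSV cell."""
    return str([float(value) for value in vector])


def parse_feature(cell, expected_length=EXPECTED_LENGTH):
    """Parse a CSV cell back into a list of floats and check its length."""
    vector = literal_eval(cell)
    if not isinstance(vector, (list, tuple)):
        raise ValueError("feature cell must contain a list of numbers")
    if len(vector) != expected_length:
        raise ValueError(
            "expected %d elements, got %d" % (expected_length, len(vector))
        )
    return [float(value) for value in vector]


# Round trip for a dummy vector, used only for illustration.
cell = serialize_feature([0.5] * EXPECTED_LENGTH)
restored = parse_feature(cell)
```

Running such a check on the 10-sample data.csv before submitting may catch malformed cells early.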

📁 Files

The following files are available in the resources section:

data.csv - (10 samples) This CSV file contains a text column with the abstract and a feature column with the vector of the corresponding text. It is only for testing your code/notebook.

🚀 Submission

This challenge accepts the notebook as a submission.
During the evaluation, the Define preprocessing code 💻 and Prediction phase 🔎 parts of the notebook will be run, so please make sure they run without any errors before submitting.
The notebook follows a particular format, please stick to it.
Do not delete the header of the cells in the notebook.

Let us know in the Discussion section if you have any doubts or issues :)

Make your first submission here 🚀 !!

🖊 Evaluation

We are using a very different evaluation pipeline than we usually use in other Blitz challenges. In this evaluator, after you submit your notebook, the notebook is run with the actual data.csv (containing 30k samples).

The resulting submission file is then split into 3 parts: 50% for training, 25% for the public score, and the remaining 25% for the private score. The first split (50%) is used to train a machine learning model on your features and the corresponding labels (categories) of each text/abstract.

The second split (25%) is used for the public evaluation and the third split (25%) for the private evaluation.
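
The 50/25/25 split can be illustrated with a small, dependency-free sketch. This is a simplified stand-in and not the evaluator's implementation: it shuffles the indices of each label with a fixed seed and slices them by the three fractions, rounding per class. The function name and its parameters are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, fractions=(0.5, 0.25, 0.25), seed=42):
    """Split sample indices into train/public/private, label by label."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, public, private = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_public = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        public.extend(indices[n_train:n_train + n_public])
        private.extend(indices[n_train + n_public:])  # whatever remains
    return train, public, private


# Hypothetical labels: 8 samples of class 0 and 4 samples of class 1.
labels = [0] * 8 + [1] * 4
train, public, private = stratified_three_way_split(labels)
```

Each part keeps the 2:1 class ratio of the full label list, which is the point of stratification. For 30,000 samples the same fractions give 15,000, 7,500 and 7,500 rows.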

The F1 score and accuracy score are used to measure the performance of the model, where

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$

$\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)$
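
As a worked example of the two formulas, the sketch below computes accuracy and a support-weighted F1 score in plain Python. Averaging the per-class F1 by class support is an assumption based on the `average="weighted"` argument in the sample evaluation code. The function names are hypothetical, and this is not the scikit-learn implementation.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Hypothetical labels for three classes, used only for illustration.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
```

Here four of six predictions are correct, so the accuracy is 4/6. The per-class F1 scores are 0.5, 0.8 and 2/3, giving a weighted F1 of 59/90. Because the weights come from the true labels, swapping the two arguments changes the weighted result.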

We use a fixed seed so that no randomness affects the training or splitting process!

Here is the sample evaluation code. Names such as CLASSIFIED_SKLEARN_MODEL are intentionally left undefined.

# Importing Libraries
import random
from ast import literal_eval

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Setting seed to eliminate randomization in training and splitting dataset
random.seed(42)
np.random.seed(42)

# https://stackoverflow.com/a/60804119
def split_stratified_into_train_val_test(
    data,
    stratify_colname="label",
    frac_train=0.5,
    frac_val=0.25,
    frac_test=0.25,
    random_state=42,
):
    """
    Splits a Pandas dataframe into three subsets (train, val, and test)
    following fractional ratios provided by the user, where each subset is
    stratified by the values in a specific column (that is, each subset has
    the same relative frequency of the values in the column). It performs this
    splitting by running train_test_split() twice.

    Parameters
    ----------
    data : Pandas dataframe
        Input dataframe to be split.
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val   : float
    frac_test  : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    """

    # Compare with a tolerance: float fractions may not sum to exactly 1.0.
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-9:
        raise ValueError(
            "fractions %f, %f, %f do not add up to 1.0"
            % (frac_train, frac_val, frac_test)
        )

    if stratify_colname not in data.columns:
        raise ValueError("%s is not a column in the dataframe" % (stratify_colname))

    X = data  # Contains all columns.
    y = data[[stratify_colname]]  # Dataframe of just the column on which to stratify.

    # Split original dataframe into train and temp dataframes.
    df_train, df_temp, y_train, y_temp = train_test_split(
        X, y, stratify=y, test_size=(1.0 - frac_train), random_state=random_state
    )

    # Split the temp dataframe into val and test dataframes.
    relative_frac_test = frac_test / (frac_val + frac_test)
    df_val, df_test, y_val, y_test = train_test_split(
        df_temp,
        y_temp,
        stratify=y_temp,
        test_size=relative_frac_test,
        random_state=random_state,
    )

    assert len(data) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test


# Reading the submission file
submission = pd.read_csv(SUBMISSION_PATH)
ground_truth = pd.read_csv(GROUND_TRUTH_PATH)

# Sorting the datasets in case a participant uploads shuffled data
ground_truth = ground_truth.sort_values("id")
submission = submission.sort_values("id")

# Merging the datasets
data = pd.merge(submission, ground_truth, on="id")

# Getting the feature and converting them from python string to python list
data["feature"] = data["feature"].apply(literal_eval)

# Splitting the dataset into training dataset, public and private dataset
df_train, df_val, df_test = split_stratified_into_train_val_test(data)

X_train, y_train = df_train["feature"].tolist(), df_train["label"].tolist()
X_public, y_public = df_val["feature"].tolist(), df_val["label"].tolist()
X_private, y_private = df_test["feature"].tolist(), df_test["label"].tolist()

# Training the model
try:
    clf = CLASSIFIED_SKLEARN_MODEL(random_state=42)
    clf.fit(X_train, y_train)

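except Exception as exc:
    # Re-raise so that a failed fit stops the evaluation instead of passing silently
    raise Exception("Error while training model, check your inputs!") from exc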

# Public predictions (predict() takes only the features, not the labels)
y_pred_public = clf.predict(X_public)

# Private predictions
y_pred_private = clf.predict(X_private)

# Public F1 and Accuracy Score
public_f1 = f1_score(
    y_public,
    y_pred_public,
    average="weighted",
)
public_acc = accuracy_score(
    y_public,
    y_pred_public,
)

# Private F1 and Accuracy Score (true labels first, predictions second)
private_f1 = f1_score(y_private, y_pred_private, average="weighted")
private_acc = accuracy_score(y_private, y_pred_private)

🔗 Links

💪 Challenge Page: https://www.aicrowd.com/challenges/nlp-feature-engineering
🗣️ Discussion Forum: https://www.aicrowd.com/challenges/nlp-feature-engineering/discussion
🏆 Leaderboard: https://www.aicrowd.com/challenges/nlp-feature-engineering/leaderboards

📱 Contact

Shubhamai