Data Purchasing Challenge 2022
[Explainer + Baseline] Get your Baseline right! (+0.84 LB)
High Accuracy Baseline
This is a very simple notebook with the implementation of the run.py code snippet to achieve a Baseline Accuracy of +0.84. Enjoy and leave a like if it was useful for you!
This notebook is different!
If you are new to the challenge, I suggest you visit the following link for an amazing explanation of how to fork the repository, set up your SSH keys, set up your environment, and make your first submission:
Gaurav_Singhal's: [Create your baseline with 0.4+ on LB (Git Repo and Video)](https://www.aicrowd.com/showcase/create-your-baseline-with-0-4-on-lb-git-repo-and-video)
This notebook uses parts of the snippet provided by this gentleman, so go ahead and leave him a like.
This notebook implements the ZEWDPCBaseRun class that you need to create in the run.py file to get your submission up to +0.84 on the LB.
Wait... WHAT? Why would you even do that? That is probably what you are asking...
Well, if you are like me, you probably just started your journey and your best way to learn is to look at someone else's code and slowly try to get your head around what is going on... but you also like good-scoring models!
So bear with me, and let me guide you through the code so you understand what is going on behind the scenes (or at least an approximation), and get your name up there, buddy!
And remember, the goal of this challenge is to address the buying stage, so I am still leaving you the fun part.
Disclaimer: this notebook is not intended for direct submissions; please follow the steps described on AIcrowd to make GitLab submissions.
a) Understanding the Datasets
The directory structure looks like this:
Quick preview of images and labels.csv is as follows:
This means that our images and labels are split by "type". We don't really care how these are obtained, since the datasets are instantiated by the provided code when run.py calls local_evaluation.py.
b) Stages
As you may have already seen, there are different stages for this challenge:
- Pre-Training Phase
- Purchase Phase
- Evaluation Phase
Each of these stages is implemented as a function (method) within the ZEWDPCBaseRun class inside the run.py file. Our AIcrowd friends then run these stages on their end with different datasets.
So basically, we have to define those functions inside the ZEWDPCBaseRun class and save it in the run.py file.
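To make that structure concrete, here is a minimal sketch. The method names and arguments match the ones used later in this notebook, but `SkeletonRun` and `run_all_phases` are hypothetical stand-ins: the order of calls is my assumption about what the evaluator does, not AIcrowd's actual code.

```python
class SkeletonRun:
    """Hypothetical stand-in that mirrors the method names of ZEWDPCBaseRun."""

    def __init__(self):
        self.calls = []  # records which phases ran, in order

    def pre_training_phase(self, training_dataset, register_progress=lambda x: False):
        self.calls.append("pre_training")

    def purchase_phase(self, unlabelled_dataset, training_dataset, budget=1000,
                       register_progress=lambda x: False):
        self.calls.append("purchase")

    def prediction_phase(self, test_dataset, register_progress=lambda x: False):
        self.calls.append("prediction")
        # One row of 4 labels per test sample (all zeros in this dummy version)
        return [[0, 0, 0, 0] for _ in test_dataset]


def run_all_phases(run, train, unlabelled, test, budget):
    """Assumed order of the stages; the real evaluator may do more in between."""
    run.pre_training_phase(train)
    run.purchase_phase(unlabelled, train, budget=budget)
    return run.prediction_phase(test)


run = SkeletonRun()
predictions = run_all_phases(run, train=[], unlabelled=[], test=["img0", "img1"], budget=10)
print(run.calls)
```

The real class only has to fill in the bodies of these three methods (plus evaluation and checkpoint helpers).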
To sum up, at the end of the notebook you will find a cell that compiles everything we are going to discuss; by copying it into your run.py you are ready to get that juicy +0.84 on the LB. But beware... with great power comes great responsibility, so try to read what is going on.
1) Packages
The first part loads everything we need. Keep in mind that you may need to add packages to the requirements.txt file; I will provide the text for this file at the end.
#!/usr/bin/env python
import torch
from torch import nn
from torchvision import models
from torch.optim import Adam, SGD, lr_scheduler
from torchvision import transforms
from torch.utils.data import DataLoader
import numpy as np
import datetime
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from sklearn.metrics import hamming_loss
from evaluator.dataset import ZEWDPCBaseDataset, ZEWDPCProtectedDataset
2) Define some parameters and our model
Here we define some parameters for the class that will be accessible throughout the run.
We also define our MVP for the challenge, the model!
In my case I am using a small version of EfficientNet with pretrained weights (efficientnet_b0 in this snippet; the full code at the end uses efficientnet_b1).
- Here is one of the most important steps: adding layers to fit the original output from the network, into our specific case-scenario!
- We also instantiate our model and move it to our CUDA friends.
- Then we define other required parameters like our zero-searching buddy, the Optimizer.
- We got our "Hey! you are stuck" parameter: Reduce on Plateau for our Learning Rate
- And finally our "I'll tell you if you suck": our Criterion
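If you wonder what the Criterion from the last bullet actually measures: the code below uses BCEWithLogitsLoss, which takes raw model outputs (logits) and averages a binary cross-entropy over every label of every image. Here is a rough pure-Python sketch of that computation. The function name `bce_with_logits` and the example numbers are mine, and PyTorch's implementation has more options (weights, reductions), so take it as an illustration only.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (sample, label) pairs, computed from raw logits."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for x, y in zip(row_logits, row_targets):
            # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], with s = sigmoid(x)
            total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
            count += 1
    return total / count


# One image, 4 labels: confident and right vs. confident and wrong
confident_right = bce_with_logits([[4.0, -4.0, -4.0, 4.0]], [[1, 0, 0, 1]])
confident_wrong = bce_with_logits([[-4.0, 4.0, 4.0, -4.0]], [[1, 0, 0, 1]])
print(confident_right, confident_wrong)
```

A small loss means the model is confidently right; a large one means it is confidently wrong, which is exactly the "I'll tell you if you suck" part.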
class ZEWDPCBaseRun:
    def __init__(self):
        self.evaluation_state = {}
        # Model parameters
        self.BATCH_SIZE = 32
        self.NUM_WORKERS = 2
        self.LEARNING_RATE = 0.001
        self.NUM_CLASSES = 4
        self.TOPK = 3
        self.THRESHOLD = 0.5
        self.NUM_EPOCS = 50
        self.EVAL_FREQ = 5

        class Classifier(nn.Module):
            def __init__(self):
                super(Classifier, self).__init__()
                self.resnet = models.efficientnet_b0(pretrained=True)  # The pretrained network (an EfficientNet, despite the attribute name)
                self.l1 = nn.Linear(1000, 256)  # We make arrangements for these two to meet: 1000 ImageNet outputs in, 256 features out
                self.dropout = nn.Dropout(0.5)  # do whatever you want from here on
                self.l2 = nn.Linear(256, 4)  # one output (logit) per label
                self.relu = nn.ReLU()

            def forward(self, input):
                x = self.resnet(input)
                x = x.view(x.size(0), -1)
                x = self.dropout(self.relu(self.l1(x)))
                x = self.l2(x)
                return x

        self.model = Classifier()
        self.model.cuda()
        # Only parameters that require gradients are handed to the optimizer
        self.trainable_parameters = filter(lambda param: param.requires_grad, self.model.parameters())
        self.optimizer = Adam(self.trainable_parameters, lr=self.LEARNING_RATE)
        self.epoch = 0
        self.lr_scheduler_ = lr_scheduler.ReduceLROnPlateau(
            self.optimizer, mode='max', patience=2, verbose=True
        )
        self.criterion = nn.BCEWithLogitsLoss()
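A note on the "Hey! you are stuck" scheduler created above: in PyTorch, ReduceLROnPlateau only does something when you call its `step(metric)` method, and the training loop in this notebook never does, so the learning rate stays at 0.001 unless you add that call yourself. To show the idea, here is a simplified pure-Python sketch. `PlateauTracker` is a hypothetical name; the factor of 0.1 is PyTorch's default, and the sketch deliberately ignores PyTorch's threshold, cooldown and minimum learning rate.

```python
class PlateauTracker:
    """Simplified reduce-on-plateau rule (no threshold, cooldown or minimum lr)."""

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")  # mode='max': a higher metric is better
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # More than `patience` epochs without improvement: shrink the learning rate
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


tracker = PlateauTracker(lr=0.001)
# Hypothetical validation accuracies, one per epoch
history = [tracker.step(acc) for acc in [0.50, 0.60, 0.60, 0.60, 0.60]]
print(history)
```

With patience=2, the learning rate only drops on the third epoch in a row without improvement.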
3) Pre-training phase
This snippet follows Gaurav_Singhal's original notebook and AIcrowd's original snippet!
This function basically creates a data loader for our training images, defines the steps for training, and then runs through the number of epochs we defined.
Check the code to understand what is going on in the training stage!
    def pre_training_phase(
        self, training_dataset: ZEWDPCBaseDataset, register_progress=lambda x: False
    ):
        print("\n================> Pre-Training Phase\n")
        # Creating transformations
        train_transform = transforms.Compose([
            transforms.ToTensor(),
            #transforms.RandomHorizontalFlip(p=0.5), #no augmentation for you buddy
            #transforms.RandomVerticalFlip(p=0.5),
        ])
        training_dataset.set_transform(train_transform)  # We transform to Tensors basically
        train_loader = DataLoader(
            dataset=training_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )  # we create our PyTorch DataLoader

        def run_epoch():
            for _, batch in enumerate(train_loader):
                x, y = batch["image"].cuda(), batch["label"]  # Here we are telling the model which is the data, and which are the labels
                pred_y = self.model(x)  # We make the model predict, whatever comes out!
                # Reshape the labels to (batch, NUM_CLASSES); the last batch can be smaller
                y = torch.cat(y, dim=0).reshape(
                    self.NUM_CLASSES, pred_y.shape[0]
                ).T.type(torch.FloatTensor)
                ## CHANGE CPU CUDA HERE. Comment for CPU
                y = y.cuda()
                loss = self.criterion(pred_y, y)  # Applying our Criterion (loss function) we decide if it is good or bad
                self.optimizer.zero_grad()  # We make our gradients zero
                loss.backward()  # We compute our gradients (might have heard of that somewhere)
                self.optimizer.step()  # And we update the weights by letting the optimizer take a step!
                # 416 = BATCH_SIZE*13, so we log every 13 batches
                if self.global_step % 416 == 0:
                    print("[{}] Training [epoch {}, step {}], loss: {:.4f}".format(
                        datetime.datetime.now(), self.epoch, self.global_step, loss.item()))
                self.global_step += self.BATCH_SIZE

        epoch_range = tqdm(range(self.epoch, self.NUM_EPOCS))  # Finally, we run our training schema for the number of epochs defined
        for i in epoch_range:
            epoch_range.set_description(f"Epoch: {i}")
            self.global_step = 0
            run_epoch()
            register_progress(i)
            self.epoch += 1
        print("Execution Complete of Training Phase.")
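The `torch.cat(...).reshape(...).T` line in `run_epoch` is the least obvious part, so here is what I understand it to be doing. The reshape suggests that the loader hands us the labels "label-major" (4 entries, each holding that label's value for every image in the batch), while the loss wants one row of 4 values per image. That layout is my reading of the code, not something I checked in the dataset class. In plain Python the same rearrangement is just a transpose:

```python
def to_sample_major(label_major):
    """Turn [label][sample] nesting into [sample][label] rows of floats."""
    return [[float(v) for v in row] for row in zip(*label_major)]


# Hypothetical batch of 3 images with 4 labels each, stored label-major
batch_labels = [
    [1, 0, 0],  # label 0 for images 0..2
    [0, 0, 1],  # label 1
    [0, 1, 1],  # label 2
    [1, 1, 0],  # label 3
]
targets = to_sample_major(batch_labels)
print(targets)
```

Each row of `targets` now lines up with one row of `pred_y`, which is what the criterion needs.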
4) Purchase Phase
Here you are on your own, buddy!
Remember, we want to select the best subset of images from the Purchase Dataset so as to make our model better!
    def purchase_phase(
        self,
        unlabelled_dataset: ZEWDPCProtectedDataset,
        training_dataset: ZEWDPCBaseDataset,
        budget=1000,
        register_progress=lambda x: False,
    ):
        """
        # Purchase Phase
        -------------------------
        In this phase of the competition, you have access to
        the unlabelled_dataset (an instance of `ZEWDPCProtectedDataset`)
        and the training_dataset (an instance of `ZEWDPCBaseDataset`)
        {see datasets.py for more details}, and a purchase budget.

        You can iterate over both the datasets and access the images without restrictions.
        However, you can probe the labels of the unlabelled_dataset only until you
        run out of the label purchasing budget.

        PARTICIPANT_TODO: Add your code here
        """
        print("\n================> Purchase Phase | Budget = {}\n".format(budget))
        register_progress(0.0)  # Register Progress
        for sample in tqdm(unlabelled_dataset):
            idx = sample["idx"]
            # image = unlabelled_dataset.__getitem__(idx)
            # print(image)
            # Budgeting & Purchasing Labels
            if budget > 0:
                label = unlabelled_dataset.purchase_label(idx)
                budget -= 1
        register_progress(1.0)  # Register Progress
        print("Execution Complete of Purchase Phase.")
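The loop above simply buys the first labels it sees until the budget runs out. If you want a starting point for something smarter, one common active-learning heuristic (not what this notebook does, and not necessarily the best choice here) is uncertainty sampling: buy the images the model is least sure about. A minimal sketch, assuming you have already collected the model's sigmoid outputs per image; `pick_most_uncertain` and the numbers are hypothetical.

```python
def pick_most_uncertain(probabilities, budget):
    """Return up to `budget` indices whose predictions sit closest to 0.5."""
    def uncertainty(item):
        _, probs = item
        # Average distance from 0.5 across the labels; smaller means less certain
        return sum(abs(p - 0.5) for p in probs) / len(probs)

    ranked = sorted(probabilities.items(), key=uncertainty)
    return [idx for idx, _ in ranked[:budget]]


# Hypothetical sigmoid outputs for four unlabelled images (idx -> 4 probabilities)
predicted = {
    0: [0.98, 0.01, 0.02, 0.97],
    1: [0.55, 0.45, 0.60, 0.40],
    2: [0.90, 0.10, 0.50, 0.20],
    3: [0.51, 0.49, 0.52, 0.48],
}
to_buy = pick_most_uncertain(predicted, budget=2)
print(to_buy)
```

The selected indices would then go through `unlabelled_dataset.purchase_label(idx)` exactly as in the loop above.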
5) Prediction Phase
Here we take our lovely model, and predict over the Validation Dataset, let's see how it goes!
    def prediction_phase(
        self,
        test_dataset: ZEWDPCBaseDataset,
        register_progress=lambda x: False,
    ):
        """
        # Prediction Phase
        -------------------------
        In this phase of the competition, you have access to the test dataset, and you
        are supposed to make predictions using your trained models.

        Returns:
            np.ndarray of shape (n, 4)
                where n is the number of samples in the test set
                and 4 refers to the 4 labels to be predicted for each sample
                for the multi-label classification problem.

        PARTICIPANT_TODO: Add your code here
        """
        print(
            "\n================> Prediction Phase : - on {} images\n".format(
                len(test_dataset)
            )
        )
        test_transform = transforms.Compose([
            transforms.ToTensor(),
        ])  # We only transform to Tensors
        test_dataset.set_transform(test_transform)
        test_loader = DataLoader(
            dataset=test_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )  # We load the data

        def convert_to_label(preds):
            # a simple function to convert our predictions to labels based on whether their estimated probability is over 0.5
            return np.array((torch.sigmoid(preds) > 0.5), dtype=int).tolist()

        predictions = []
        self.model.eval()  # We will start to predict!
        with torch.no_grad():
            for _, batch in enumerate(test_loader):
                X = batch['image'].cuda()
                pred_y = self.model(X)  # applying our model!
                # Convert to labels
                pred_y_labels = []
                for arr in pred_y:
                    ## CHANGE CPU CUDA HERE
                    pred_y_labels.append(convert_to_label(arr.cpu()))
                # Save the results
                predictions.extend(pred_y_labels)
        register_progress(1.0)
        predictions = np.array(predictions)  # array of shape (n, 4)
        print("Execution Complete of Prediction Phase.")
        return predictions
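The `convert_to_label` helper is where raw model outputs become 0/1 labels. Here is the same idea in plain Python so you can play with the threshold without a GPU. `logits_to_labels` is a hypothetical name and the logits are made up.

```python
import math


def logits_to_labels(logits, threshold=0.5):
    """Apply a sigmoid to each logit and mark the label present if it exceeds the threshold."""
    return [int(1.0 / (1.0 + math.exp(-x)) > threshold) for x in logits]


labels = logits_to_labels([2.0, -1.0, 0.3, -0.2])
print(labels)
```

Since sigmoid(0) is exactly 0.5, a threshold of 0.5 is the same as asking whether the logit is positive. Raising the threshold makes the model more reluctant to say a label is present.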
6) Evaluate and Checkpoints
This last stage implements the Evaluation function to get our metrics, and the Checkpoints, which are essential to get our model through the different stages.
    def evaluation(self, predictions, val_dataset_gt: ZEWDPCBaseDataset):
        from evaluator.evaluation_metrics import accuracy_score, hamming_loss, exact_match_ratio
        y_true = val_dataset_gt._get_all_labels()
        y_pred = predictions
        # Distinct variable names so the imported metric functions are not overwritten
        accuracy = accuracy_score(y_true, y_pred)
        hamming_loss_score = hamming_loss(y_true, y_pred)
        exact_match_ratio_score = exact_match_ratio(y_true, y_pred)
        print("Accuracy Score : ", accuracy)
        print("Hamming Loss : ", hamming_loss_score)
        print("Exact Match Ratio : ", exact_match_ratio_score)

    def save_checkpoint(self, checkpoint_path):
        """
        Saves the checkpoint in the checkpoint_path directory. Each checkpoint will be saved for epoch_x
        """
        save_dict = {
            'epoch': self.epoch + 1,
            'model_state_dict': self.model.state_dict(),
            'optim_state_dict': self.optimizer.state_dict(),
        }
        torch.save(save_dict, checkpoint_path)
        print(f"Checkpoint epoch:{self.epoch} Model saved at {checkpoint_path}")

    def load_checkpoint(self, checkpoint_path):
        """
        Load the latest checkpoint from the experiment
        """
        ## CHANGE CPU CUDA HERE
        checkpoint_model = torch.load(checkpoint_path, map_location="cuda:0")
        # checkpoint_model = torch.load(checkpoint_path, map_location="cpu")
        self.latest_epoch = checkpoint_model['epoch']
        self.model.load_state_dict(checkpoint_model['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint_model['optim_state_dict'])
        print('loading checkpoint success (epoch {})'.format(self.latest_epoch))
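About the metrics printed by `evaluation`: they come from the challenge's own `evaluator.evaluation_metrics` module, which I have not reproduced here. The sketch below uses the textbook definitions of Hamming loss and exact match ratio for multi-label problems, so the evaluator's numbers may be computed differently. The example labels are made up.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted wrongly."""
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / (len(y_true) * len(y_true[0]))


def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted correctly."""
    return sum(list(tr) == list(pr) for tr, pr in zip(y_true, y_pred)) / len(y_true)


# Three hypothetical samples with 4 labels each
y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 0, 0]]
y_pred = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0]]
print(hamming_loss(y_true, y_pred), exact_match_ratio(y_true, y_pred))
```

Here only 2 of the 12 individual labels are wrong, yet 2 of the 3 samples fail the exact match, which is why the exact match ratio is the stricter of the two.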
Finally, we add the entry point so that the file runs properly when executed.
if __name__ == "__main__":
    ####################################################################################
    ## You need to implement `ZEWDPCBaseRun` class in this file for this challenge.
    ## Code for running all the phases locally is written in `main.py` for illustration
    ## purposes.
    ##
    ## Checkout the inline documentation of `ZEWDPCBaseRun` for more details.
    ####################################################################################
    import local_evaluation
THE END
And there you go, that is the whole definition for the run.py file, where you are able to get all this running!
Full code:
#!/usr/bin/env python
import torch
from torch import nn
from torchvision import models
from torch.optim import Adam, SGD, lr_scheduler
from torchvision import transforms
from torch.utils.data import DataLoader
import numpy as np
import datetime
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from sklearn.metrics import hamming_loss
from evaluator.dataset import ZEWDPCBaseDataset, ZEWDPCProtectedDataset
class ZEWDPCBaseRun:
    def __init__(self):
        self.evaluation_state = {}
        # Model parameters
        self.BATCH_SIZE = 32
        self.NUM_WORKERS = 2
        self.LEARNING_RATE = 0.001
        self.NUM_CLASSES = 4
        self.TOPK = 3
        self.THRESHOLD = 0.5
        self.NUM_EPOCS = 50
        self.EVAL_FREQ = 5

        class Classifier(nn.Module):
            def __init__(self):
                super(Classifier, self).__init__()
                self.resnet = models.efficientnet_b1(pretrained=True)
                self.l1 = nn.Linear(1000, 256)
                self.dropout = nn.Dropout(0.5)
                self.l2 = nn.Linear(256, 4)
                self.relu = nn.ReLU()

            def forward(self, input):
                x = self.resnet(input)
                x = x.view(x.size(0), -1)
                x = self.dropout(self.relu(self.l1(x)))
                x = self.l2(x)
                return x

        #device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        #classifier = Classifier().to(device)
        self.model = Classifier()
        #self.model = models.efficientnet_b0(pretrained=True,num_classes = self.NUM_CLASSES)
        ## CHANGE CPU CUDA HERE
        self.model.cuda()
        # self.model.cpu()
        self.trainable_parameters = filter(lambda param: param.requires_grad, self.model.parameters())
        self.optimizer = Adam(self.trainable_parameters, lr=self.LEARNING_RATE)
        self.epoch = 0
        self.lr_scheduler_ = lr_scheduler.ReduceLROnPlateau(
            self.optimizer, mode='max', patience=2, verbose=True
        )
        self.criterion = nn.BCEWithLogitsLoss()

    def pre_training_phase(
        self, training_dataset: ZEWDPCBaseDataset, register_progress=lambda x: False
    ):
        print("\n================> Pre-Training Phase\n")
        # Creating transformations
        train_transform = transforms.Compose([
            #transforms.Grayscale(num_output_channels=3),
            transforms.ToTensor(),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomVerticalFlip(p=0.5),
        ])
        training_dataset.set_transform(train_transform)
        train_loader = DataLoader(
            dataset=training_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )

        def run_epoch():
            for _, batch in enumerate(train_loader):
                ## CHANGE CPU CUDA HERE
                x, y = batch["image"].cuda(), batch["label"]
                # x, y = batch["image"].cpu(), batch["label"]
                pred_y = self.model(x)
                # Change the shape of true labels here. Because for last batch the no. of images can be less
                y = torch.cat(y, dim=0).reshape(
                    self.NUM_CLASSES, pred_y.shape[0]
                ).T.type(torch.FloatTensor)
                ## CHANGE CPU CUDA HERE. Comment for CPU
                y = y.cuda()
                loss = self.criterion(pred_y, y)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
                # 416 = BATCH_SIZE*13, so we log every 13 batches
                if self.global_step % 416 == 0:
                    print("[{}] Training [epoch {}, step {}], loss: {:.4f}".format(
                        datetime.datetime.now(), self.epoch, self.global_step, loss.item()))
                self.global_step += self.BATCH_SIZE

        epoch_range = tqdm(range(self.epoch, self.NUM_EPOCS))
        for i in epoch_range:
            epoch_range.set_description(f"Epoch: {i}")
            self.global_step = 0
            run_epoch()
            register_progress(i)  # Epoch as progress
            #if (i+1)%self.EVAL_FREQ == 0:
            #    predictions = self.prediction_phase(val_dataset)
            #    self.evaluation(predictions)
            self.epoch += 1
        print("Execution Complete of Training Phase.")

    def purchase_phase(
        self,
        unlabelled_dataset: ZEWDPCProtectedDataset,
        training_dataset: ZEWDPCBaseDataset,
        budget=1000,
        register_progress=lambda x: False,
    ):
        """
        # Purchase Phase
        -------------------------
        In this phase of the competition, you have access to
        the unlabelled_dataset (an instance of `ZEWDPCProtectedDataset`)
        and the training_dataset (an instance of `ZEWDPCBaseDataset`)
        {see datasets.py for more details}, and a purchase budget.

        You can iterate over both the datasets and access the images without restrictions.
        However, you can probe the labels of the unlabelled_dataset only until you
        run out of the label purchasing budget.

        PARTICIPANT_TODO: Add your code here
        """
        print("\n================> Purchase Phase | Budget = {}\n".format(budget))
        register_progress(0.0)  # Register Progress
        for sample in tqdm(unlabelled_dataset):
            idx = sample["idx"]
            # image = unlabelled_dataset.__getitem__(idx)
            # print(image)
            # Budgeting & Purchasing Labels
            if budget > 0:
                label = unlabelled_dataset.purchase_label(idx)
                budget -= 1
        register_progress(1.0)  # Register Progress
        print("Execution Complete of Purchase Phase.")

    def prediction_phase(
        self,
        test_dataset: ZEWDPCBaseDataset,
        register_progress=lambda x: False,
    ):
        """
        # Prediction Phase
        -------------------------
        In this phase of the competition, you have access to the test dataset, and you
        are supposed to make predictions using your trained models.

        Returns:
            np.ndarray of shape (n, 4)
                where n is the number of samples in the test set
                and 4 refers to the 4 labels to be predicted for each sample
                for the multi-label classification problem.

        PARTICIPANT_TODO: Add your code here
        """
        print(
            "\n================> Prediction Phase : - on {} images\n".format(
                len(test_dataset)
            )
        )
        test_transform = transforms.Compose([
            transforms.ToTensor(),
        ])
        test_dataset.set_transform(test_transform)
        test_loader = DataLoader(
            dataset=test_dataset,
            batch_size=self.BATCH_SIZE,
            shuffle=False,
            num_workers=self.NUM_WORKERS,
        )

        def convert_to_label(preds):
            return np.array((torch.sigmoid(preds) > 0.5), dtype=int).tolist()

        predictions = []
        self.model.eval()
        with torch.no_grad():
            for _, batch in enumerate(test_loader):
                ## CHANGE CPU CUDA HERE
                # X= batch['image'].cpu()
                X = batch['image'].cuda()
                pred_y = self.model(X)
                # Convert to labels
                pred_y_labels = []
                for arr in pred_y:
                    ## CHANGE CPU CUDA HERE
                    pred_y_labels.append(convert_to_label(arr.cpu()))  # For CUDA
                    # pred_y_labels.append(convert_to_label(arr)) # For CPU
                # Save the results
                predictions.extend(pred_y_labels)
        register_progress(1.0)
        predictions = np.array(predictions)  # array of shape (n, 4)
        print("Execution Complete of Prediction Phase.")
        return predictions

    def evaluation(self, predictions, val_dataset_gt: ZEWDPCBaseDataset):
        from evaluator.evaluation_metrics import accuracy_score, hamming_loss, exact_match_ratio
        y_true = val_dataset_gt._get_all_labels()
        y_pred = predictions
        # Distinct variable names so the imported metric functions are not overwritten
        accuracy = accuracy_score(y_true, y_pred)
        hamming_loss_score = hamming_loss(y_true, y_pred)
        exact_match_ratio_score = exact_match_ratio(y_true, y_pred)
        print("Accuracy Score : ", accuracy)
        print("Hamming Loss : ", hamming_loss_score)
        print("Exact Match Ratio : ", exact_match_ratio_score)

    def save_checkpoint(self, checkpoint_path):
        """
        Saves the checkpoint in the checkpoint_path directory. Each checkpoint will be saved for epoch_x
        """
        save_dict = {
            'epoch': self.epoch + 1,
            'model_state_dict': self.model.state_dict(),
            'optim_state_dict': self.optimizer.state_dict(),
        }
        torch.save(save_dict, checkpoint_path)
        print(f"Checkpoint epoch:{self.epoch} Model saved at {checkpoint_path}")

    def load_checkpoint(self, checkpoint_path):
        """
        Load the latest checkpoint from the experiment
        """
        ## CHANGE CPU CUDA HERE
        checkpoint_model = torch.load(checkpoint_path, map_location="cuda:0")
        # checkpoint_model = torch.load(checkpoint_path, map_location="cpu")
        self.latest_epoch = checkpoint_model['epoch']
        self.model.load_state_dict(checkpoint_model['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint_model['optim_state_dict'])
        print('loading checkpoint success (epoch {})'.format(self.latest_epoch))

if __name__ == "__main__":
    ####################################################################################
    ## You need to implement `ZEWDPCBaseRun` class in this file for this challenge.
    ## Code for running all the phases locally is written in `main.py` for illustration
    ## purposes.
    ##
    ## Checkout the inline documentation of `ZEWDPCBaseRun` for more details.
    ####################################################################################
    import local_evaluation
Requirements.txt file!
click==8.0.3
imageio==2.14.1
jinja2==3.0.3
pandas
scikit-image
scikit-learn
scipy
timeout-decorator==0.5.0
tqdm==4.60.0
torch==1.10.2
torchvision
torchaudio
Remember to leave a like!
This notebook wouldn't have been possible without the snippets that the user gaurav_singhal provided to the community.
If this notebook also helps you to move forward, please, leave a LIKE 💗
Hope to see you up there!