Data Purchasing Challenge 2022
[First Baseline + Explainer] Getting Started With ROUND 2
Boilerplate Code That Abides by the Compute and Purchase Budgets, for Getting Started With Round 2
Hey Guys
First of all, I would like to thank you for taking the time to check out this notebook.
A lot has changed in Round 2. What exactly? Check the changelog. In short, the focus of this round is strictly on purchasing, not on training bulky networks (you can try, but you'll be defeated by the new time constraints).
I understand how tedious setting up boilerplate code can be. Almost all the participants here have a full-time job and, TRUST ME, you don't want to waste your precious time on the basic details; you want to focus on the main problem, i.e. PURCHASE.
I have put together boilerplate code that will help you get started with the competition in no time 😎😎😎😎. This notebook is a stand-alone solution that runs on Google Colab and sets up all the basics. You can also start from the official notebook, but let's face it, it's just a skeleton.
NOTE: The code sets the PyTorch and NumPy seeds, which gives you somewhat reproducible results. For stricter reproducibility, you should also enable deterministic algorithms.
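If you want to go all the way, here is a minimal sketch of the extra switches PyTorch exposes for determinism. This is my own addition (the notebook itself only sets the seeds and CUBLAS_WORKSPACE_CONFIG), so treat it as optional:

import os
import random
import numpy as np
import torch

# Seed every source of randomness used in this notebook
random.seed(17)
np.random.seed(17)
torch.manual_seed(17)

# Prefer deterministic kernels; some ops will raise an error if no
# deterministic implementation exists, in which case you can relax this.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
torch.use_deterministic_algorithms(True)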
With this notebook you will be able to do the following things:
1. Download the dataset and helper code
2. Boilerplate code for training, validation, purchase, and prediction, implemented with EfficientNet-B0
3. Evaluation that respects the compute and purchase budgets, so you can test your full pipeline locally under realistic conditions.
Credits:
1. [AICrowd Official Notebook](https://colab.research.google.com/drive/1ZJQBK9DKus1zSjm97aEc6bQ2mSS3vSTD)
2. [Official Starter Kit](https://gitlab.aicrowd.com/zew/data-purchasing-challenge-2022-starter-kit)
What more?
Well, you can take the same `ZEWDPCBaseRun` code, make some minor changes, and you will be ready to make a submission.
Let's begin
1) Login to AIcrowd 🤩
#@title Login to AIcrowd
!pip install -U aicrowd-cli > /dev/null
!aicrowd login 2> /dev/null
2) Setup magically, run the cell below 😉
#@title Magic Box ⬛ { vertical-output: true, display-mode: "form" }
try:
import os
if first_run and os.path.exists("/content/data-purchasing-challenge-2022-starter-kit/data/public_training"):
first_run = False
except:
first_run = True
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
purchased_labels = None
if first_run:
%cd /content/
!git clone http://gitlab.aicrowd.com/zew/data-purchasing-challenge-2022-starter-kit.git > /dev/null
%cd data-purchasing-challenge-2022-starter-kit
!aicrowd dataset list -c data-purchasing-challenge-2022
!aicrowd dataset download -c data-purchasing-challenge-2022 *-v0.2-rc4.zip
!mkdir -p data/
!mv *.zip data/ && cd data && echo "Extracting dataset..." && ls *.zip | xargs -n1 -I{} bash -c "unzip -q {}"
def run_pre_training_phase():
from run import ZEWDPCBaseRun
run = ZEWDPCBaseRun()
run.pre_training_phase = pre_training_phase
run.pre_training_phase(self=run, training_dataset=training_dataset)
# NOTE: It is critical that the checkpointing works in a self-contained way,
# as the evaluators might choose to run the different phases separately.
run.save_checkpoint("/tmp/pretraining_phase_checkpoint.pickle")
def run_purchase_phase():
from run import ZEWDPCBaseRun
run = ZEWDPCBaseRun()
run.pre_training_phase = pre_training_phase
run.purchase_phase = purchase_phase
run.load_checkpoint("/tmp/pretraining_phase_checkpoint.pickle")
# Hacky way to make it work in notebook
unlabelled_dataset.purchases = set()
global purchased_labels
purchased_labels = run.purchase_phase(self=run, unlabelled_dataset=unlabelled_dataset, training_dataset=training_dataset, purchase_budget=1500, compute_budget=51*60)
run.save_checkpoint("/tmp/purchase_phase_checkpoint.pickle")
del run
def run_prediction_phase():
from run import ZEWDPCBaseRun
run = ZEWDPCBaseRun()
run.pre_training_phase = pre_training_phase
run.purchase_phase = purchase_phase
run.prediction_phase = prediction_phase
run.load_checkpoint("/tmp/purchase_phase_checkpoint.pickle")
run.prediction_phase(self=run, test_dataset=val_dataset)
del run
def run_post_purchase_training_phase():
import torch
from evaluator.evaluation_metrics import get_zew_dpc_metrics
from evaluator.utils import instantiate_purchased_dataset
from evaluator.trainer import ZEWDPCTrainer
purchased_dataset = instantiate_purchased_dataset(unlabelled_dataset, purchased_labels)
aggregated_dataset = torch.utils.data.ConcatDataset(
[training_dataset, purchased_dataset]
)
print("Training Dataset Size: ", len(training_dataset))
print("Purchased Dataset Size: ", len(purchased_dataset))
print("Aggregated Dataset Size: ", len(aggregated_dataset))
trainer = ZEWDPCTrainer(num_classes=6, use_pretrained=True)
trainer.train(
aggregated_dataset, num_epochs=10, validation_percentage=0.1, batch_size=5
)
y_pred = trainer.predict(val_dataset)
y_true = val_dataset_gt._get_all_labels()
metrics = get_zew_dpc_metrics(y_true, y_pred)
f1_score = metrics["F1_score_macro"]
accuracy_score = metrics["accuracy_score"]
hamming_loss_score = metrics["hamming_loss"]
print("\n\n==================")
print("F1 Score: ", f1_score)
print("Accuracy Score: ", accuracy_score)
print("Hamming Loss: ", hamming_loss_score)
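These helpers mirror the order in which the evaluator runs your submission. As a rough sketch (assuming pre_training_phase, purchase_phase and prediction_phase are available as standalone functions in the notebook's global scope, as in the original AIcrowd notebook; in this notebook the same logic lives inside the ZEWDPCBaseRun class further down and is exercised by the Evaluator section at the end), the intended call order is:

# Rough sketch of the intended phase order (see the Evaluator section for the
# self-contained version used in this notebook):
run_pre_training_phase()            # train and checkpoint the model
run_purchase_phase()                # spend the purchase budget, fill `purchased_labels`
run_prediction_phase()              # predict on the validation set
run_post_purchase_training_phase()  # retrain on training + purchased data and score it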
3) Writing your code implementation! ✍️
a) Runtime Packages
#@title a) Runtime Packages<br/><small>Important: Add the packages required by your code here. (space separated)</small> { run: "auto", display-mode: "form" }
apt_packages = "build-essential vim" #@param {type:"string"}
pip_packages = "scikit-image pandas timeout-decorator==0.5.0 numpy torchmetrics" #@param {type:"string"}
!apt install -y $apt_packages git-lfs
!pip install $pip_packages
b) Load Dataset
from evaluator.dataset import ZEWDPCBaseDataset, ZEWDPCProtectedDataset
DATASET_SHUFFLE_SEED = 1022022
# Instantiate Training Dataset
training_dataset = ZEWDPCBaseDataset(
images_dir="./data/training/images",
labels_path="./data/training/labels.csv",
shuffle_seed=DATASET_SHUFFLE_SEED,
)
# Instantiate Unlabelled Dataset
unlabelled_dataset = ZEWDPCProtectedDataset(
images_dir="./data/unlabelled/images",
labels_path="./data/unlabelled/labels.csv",
purchase_budget=1500, # Configurable Parameter
shuffle_seed=DATASET_SHUFFLE_SEED,
)
# Instantiate Validation Dataset
val_dataset = ZEWDPCBaseDataset(
images_dir="./data/validation/images",
labels_path="./data/validation/labels.csv",
drop_labels=True,
shuffle_seed=DATASET_SHUFFLE_SEED,
)
# A second instantiation of the validation set, with the labels present
# - helpful later, when computing the scores.
val_dataset_gt = ZEWDPCBaseDataset(
images_dir="./data/validation/images",
labels_path="./data/validation/labels.csv",
drop_labels=False,
shuffle_seed=DATASET_SHUFFLE_SEED,
)
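As a quick sanity check (my own addition, based on how samples are consumed in the training loop below, where each item is a dict with "idx", "image" and "label" keys), you can inspect the datasets like this:

# Sanity check: dataset sizes and the structure of a single sample.
# Assumes each sample is a dict with "idx", "image" and "label",
# matching the training loop further down.
print("Training samples   :", len(training_dataset))
print("Unlabelled samples :", len(unlabelled_dataset))
print("Validation samples :", len(val_dataset))

sample = training_dataset[0]
print("Sample keys :", sorted(sample.keys()))
print("Label       :", sample["label"])  # one binary flag per defect class (6 in total)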
Training
import torch
from torch import nn
from torchvision import models
from torch.optim import Adam, SGD, lr_scheduler
from torchvision import transforms as T
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import abc
import datetime
from tqdm import tqdm
import copy
from evaluator.exceptions import OutOfBudetException
from evaluator.evaluation_metrics import get_zew_dpc_metrics
from evaluator.dataset import ZEWDPCBaseDataset, ZEWDPCProtectedDataset, ZEWDPCRuntimeDataset
from evaluator.utils import (
instantiate_purchased_dataset,
AverageMeter,
)
import torchmetrics
torch.manual_seed(17)
np.random.seed(17)
Base Code
class ZEWDPCBaseRun:
def __init__(self):
self.evaluation_state = {}
self.BATCH_SIZE = 32
self.NUM_WORKERS = 2
self.LEARNING_RATE = 0.00009
self.NUM_CLASSES = 6
self.THRESHOLD = 0.5
self.NUM_EPOCS = 2
self.CHECKPOINT_FREQUENCY = 10
self.EVAL_FREQ = 1
self.validation_percentage = 0.1
self.seed = 42
# Use any torchvision model you like here
self.model = models.efficientnet_b0(pretrained=True)
# Change last layer if using pretrained model
self.model.classifier = torch.nn.Sequential(
torch.nn.Dropout(p=self.model.classifier[0].p),
torch.nn.Linear(self.model.classifier[1].in_features, out_features=self.NUM_CLASSES)
)
self.model.cuda()
self.device = "cuda:0"
self.activation = torch.nn.Sigmoid()
self.optimizer = Adam(self.model.parameters(), lr=self.LEARNING_RATE, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0009, amsgrad=False)
self.lr_sched = lr_scheduler.ReduceLROnPlateau(
self.optimizer, mode='max', patience=2, verbose=True
)
self.criterion = nn.BCEWithLogitsLoss()
def pre_training_phase(
self,
training_dataset: ZEWDPCBaseDataset,
compute_budget=10**10,
register_progress=lambda x: False,
):
print("\n================> Pre-Training Phase\n")
# Setup Transforms
self.setup_transforms(training_dataset)
# Prepare Validation Set
training_dataset, validation_dataset = self.setup_validation_set(
training_dataset, validation_percentage=self.validation_percentage
)
# Setup Dataloaders
train_dataloader, val_dataloader = self.setup_dataloaders(
training_dataset, validation_dataset, batch_size=self.BATCH_SIZE
)
# Setup Metric Meters
val_loss_avg_meter = AverageMeter()
train_loss_avg_meter = AverageMeter()
val_f1 = torchmetrics.F1Score(num_classes=self.NUM_CLASSES, average="macro")
train_f1 = torchmetrics.F1Score(num_classes=self.NUM_CLASSES, average="macro")
########################################################################
########################################################################
#
# Iterate over Epochs
########################################################################
for epoch in range(self.NUM_EPOCS):
self.epoch = epoch
self.model.train()
train_loss_avg_meter.reset()
train_f1.reset()
tqdm_iter = tqdm(train_dataloader, total=len(train_dataloader))
tqdm_iter.set_description(f"Epoch {epoch}")
for sample in tqdm_iter:
# Reset Optimizer Gradients
self.optimizer.zero_grad()
# Gather Data Sample
idx = sample["idx"].to(self.device)
image = sample["image"].to(self.device)
label = torch.vstack(sample["label"]).T
# Forward Pass
output = self.model(image)
# Compute Loss
loss = self.criterion(output, label.to(self.device).float())
# Update Metric Meters
train_loss_avg_meter.update(loss.item(), image.shape[0])
output_with_activation = self.activation(output.detach()).cpu()
train_f1.update(output_with_activation, label)
tqdm_iter.set_postfix(
iter_train_loss=loss.item(), avg_train_loss=train_loss_avg_meter.avg
)
# Backpropagate
loss.backward()
self.optimizer.step()
print(
"Epoch %d - Average Train Loss: %.5f \t Train F1: %.5f"
% (epoch, train_loss_avg_meter.avg, train_f1.compute().item())
)
# Checkpointing policy
if (self.epoch+1)%self.CHECKPOINT_FREQUENCY == 0:
pass
# self.save_checkpoint("./")
####################################################################################
####################################################################################
#
# Validation
####################################################################################
VALIDATION_INTERVAL = self.EVAL_FREQ
if (
validation_dataset is not None
and (epoch + 1) % VALIDATION_INTERVAL == 0
):
self.model.eval()
val_loss_avg_meter.reset()
val_f1.reset()
for sample in val_dataloader:
with torch.no_grad():
idx = sample["idx"].to(self.device)
image = sample["image"].to(self.device)
label = torch.vstack(sample["label"]).T
output = self.model(image)
loss = self.criterion(output, label.to(self.device).float())
output_with_activation = self.activation(
output.detach()
).cpu()
val_f1.update(output_with_activation, label)
val_loss_avg_meter.update(loss.item(), image.shape[0])
self.lr_sched.step(val_loss_avg_meter.avg)
print(
"Epoch %d - Average Val Loss: %.5f \t Val F1: %.5f \t Learning Rate %0.5f"
% (
epoch,
val_loss_avg_meter.avg,
val_f1.compute().item(),
self.optimizer.param_groups[0]["lr"],
)
)
print()
train_metrics = {"f1": train_f1.compute().item()}
val_metrics = {"f1": val_f1.compute().item()}
info = {"learning_rate": self.optimizer.param_groups[0]["lr"]}
print("Execution Complete of Training Phase.")
def purchase_phase(
self,
unlabelled_dataset: ZEWDPCProtectedDataset,
training_dataset: ZEWDPCBaseDataset,
purchase_budget=1000,
compute_budget=10**10,
register_progress=lambda x: False,
):
"""
# Purchase Phase
-------------------------
In this phase of the competition, you have access to
the unlabelled_dataset (an instance of `ZEWDPCProtectedDataset`)
and the training_dataset (an instance of `ZEWDPCBaseDataset`)
{see datasets.py for more details}, a purchase budget, and a compute budget.
You can iterate over both the datasets and access the images without restrictions.
However, you can probe the labels of the unlabelled_dataset only until you
run out of the label purchasing budget.
The `compute_budget` argument holds a floating point number representing
the time available (in seconds) for **BOTH** the pre_training_phase and
the `purchase_phase`.
Exceeding the time will lead to a TimeOut error.
PARTICIPANT_TODO: Add your code here
"""
print("\n================> Purchase Phase | Budget = {}\n".format(purchase_budget))
register_progress(0.0) # Register Progress
purchased_labels = {}
for sample in tqdm(unlabelled_dataset):
idx = sample["idx"]
# Budgeting & Purchasing Labels
if purchase_budget > 0:
label = unlabelled_dataset.purchase_label(idx)
purchased_labels[idx] = label
purchase_budget -= 1
register_progress(1.0) # Register Progress
print("Execution Complete of Purchase Phase.")
return purchased_labels
def prediction_phase(
self,
test_dataset: ZEWDPCBaseDataset,
register_progress=lambda x: False,
):
"""
# Prediction Phase
-------------------------
In this phase of the competition, you have access to the test dataset, and you
are supposed to make predictions using your trained models.
Returns:
np.ndarray of shape (n, 6)
where n is the number of samples in the test set
and 6 refers to the 6 labels to be predicted for each sample
for the multi-label classification problem.
PARTICIPANT_TODO: Add your code here
"""
print(
"\n================> Prediction Phase : - on {} images\n".format(
len(test_dataset)
)
)
test_transform = T.Compose([
T.ToTensor(),
])
test_dataset.set_transform(test_transform)
test_loader = DataLoader(
dataset=test_dataset,
batch_size=self.BATCH_SIZE,
shuffle=False,
num_workers=self.NUM_WORKERS,
)
self.model.eval()
predictions = []
with torch.no_grad():
for _, batch in enumerate(test_loader):
X = batch['image'].cuda()
output = self.model(X)
output_with_activation = self.activation(
output.detach()
).cpu()
predictions.extend(output_with_activation)
register_progress(1.0)
predictions = np.array(predictions)
print("Execution Complete of Prediction Phase.")
return predictions
def save_checkpoint(self, checkpoint_folder):
"""
Self-contained checkpoint code to be included here,
which can capture the state of your run (including any trained models, etc)
at the provided folder path.
This is critical to implement, as the execution of the different phases can
happen using different instances of the BaseRun. See below for examples.
PARTICIPANT_TODO: Add your code here
"""
# checkpoint_path = os.path.join(checkpoint_folder, "model.pth")
save_dict = {
'model_state_dict': self.model.state_dict(),
'optim_state_dict': self.optimizer.state_dict(),
}
torch.save(save_dict, checkpoint_folder)
print(f"Saving checkpoint at {checkpoint_folder}")
def load_checkpoint(self, checkpoint_folder):
"""
Self-contained checkpoint code to be included here,
which can load the state of your run (including any trained models, etc)
from a provided checkpoint_folder path
(previously saved using `self.save_checkpoint`)
This is critical to implement, as the execution of the different phases can
happen using different instances of the BaseRun. See below for examples.
PARTICIPANT_TODO: Add your code here
"""
# checkpoint_path = os.path.join(checkpoint_folder, "model.pth")
checkpoint_model = torch.load(checkpoint_folder, map_location=self.device)
self.model.load_state_dict(checkpoint_model['model_state_dict'])
self.optimizer.load_state_dict(checkpoint_model['optim_state_dict'])
print('Loading checkpoint success')
def setup_validation_set(self, training_dataset, validation_percentage=0.05):
"""
Creates a Validation Set from the Training Dataset
"""
assert (
0 < validation_percentage < 1
), "Expected: validation_percentage ∈ (0, 1). Received validation_percentage = {}".format(
validation_percentage
)
validation_size = int(validation_percentage * len(training_dataset))
training_dataset, validation_dataset = torch.utils.data.random_split(
training_dataset,
[
len(training_dataset) - validation_size,
validation_size,
],
generator=torch.Generator().manual_seed(self.seed),
)
return training_dataset, validation_dataset
def setup_dataloaders(self, training_dataset, validation_dataset, batch_size=32):
"""
Sets up necessary dataloader
"""
train_dataloader = torch.utils.data.DataLoader(
training_dataset, batch_size=batch_size, shuffle=True
)
val_dataloader = torch.utils.data.DataLoader(
validation_dataset, batch_size=batch_size, shuffle=True
)
return train_dataloader, val_dataloader
def setup_transforms(self, training_dataset):
"""
Sets up the necessary transforms for the training_dataset
"""
## Setup necessary Transformations
train_transform = T.Compose(
[
T.ToTensor(), # Converts image to [0, 1]
T.RandomVerticalFlip(p=0.5),
T.RandomHorizontalFlip(p=0.5),
T.GaussianBlur(kernel_size=3),
T.ColorJitter(brightness=0.2, contrast=0.2),
# *self.model.required_transforms,
]
)
if isinstance(training_dataset, ZEWDPCBaseDataset):
training_dataset.set_transform(train_transform)
elif isinstance(training_dataset, torch.utils.data.ConcatDataset):
for dataset in training_dataset.datasets:
if isinstance(dataset, ZEWDPCRuntimeDataset):
dataset.set_transform(train_transform)
elif isinstance(dataset, ZEWDPCBaseDataset):
dataset.set_transform(train_transform)
else:
raise NotImplementedError()
else:
raise NotImplementedError()
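The purchase_phase above is the simplest possible baseline: it buys labels for the first purchase_budget images it iterates over. A common next step is to rank the unlabelled images by model uncertainty and spend the budget on the least confident ones. The sketch below is my own illustration of that idea, not part of the official baseline; it reuses the model, activation, and purchase_label calls seen above, and assumes the protected dataset supports set_transform like the base dataset:

import torch
from torch.utils.data import DataLoader
from torchvision import transforms as T

def uncertainty_purchase(run, unlabelled_dataset, purchase_budget, batch_size=32):
    """Rank unlabelled images by prediction entropy and purchase the most uncertain.

    Assumes `run` is a ZEWDPCBaseRun whose model was trained in the
    pre-training phase, and that samples are dicts with "idx" and "image".
    """
    # Assumption: ZEWDPCProtectedDataset exposes set_transform like the base dataset.
    unlabelled_dataset.set_transform(T.Compose([T.ToTensor()]))
    loader = DataLoader(unlabelled_dataset, batch_size=batch_size, shuffle=False)

    run.model.eval()
    scores = []  # (entropy, idx) pairs
    with torch.no_grad():
        for batch in loader:
            images = batch["image"].to(run.device)
            probs = run.activation(run.model(images)).cpu()
            eps = 1e-8
            # Binary entropy per label, summed over the 6 labels
            entropy = -(probs * (probs + eps).log() + (1 - probs) * (1 - probs + eps).log()).sum(dim=1)
            scores.extend(zip(entropy.tolist(), batch["idx"].tolist()))

    scores.sort(reverse=True)  # most uncertain first
    purchased = {}
    for _, idx in scores[:purchase_budget]:
        purchased[idx] = unlabelled_dataset.purchase_label(idx)
    return purchased

You could then call purchased_labels = uncertainty_purchase(run, unlabelled_dataset, purchase_budget) from inside purchase_phase, keeping an eye on the remaining compute_budget.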
Evaluator
import os
import tempfile
import time
from evaluator.trainer import ZEWDPCTrainer, ZEWDPCDebugTrainer
# Location to save your checkpoint
checkpoint_folder_path = tempfile.TemporaryDirectory().name
### NOTE: This folder does not clean itself up.
### You are responsible for cleaning up its contents after
### the desired usage.
####################################################################################
####################################################################################
##
## Setup Compute & Purchase Budgets
####################################################################################
time_started = time.time()
PURCHASE_BUDGET = 500
COMPUTE_BUDGET = 60 * 60 # 1 hour
####################################################################################
####################################################################################
##
## Phase 1 : Pre-Training Phase
####################################################################################
run = ZEWDPCBaseRun()
run.pre_training_phase(training_dataset, compute_budget=COMPUTE_BUDGET)
run.save_checkpoint(checkpoint_folder_path)
# NOTE: It is critical that the checkpointing works in a self-contained way,
# as the evaluators might choose to run the different phases separately.
del run
time_available = COMPUTE_BUDGET - (time.time() - time_started)
print("Time remaining: ", time_available)
####################################################################################
####################################################################################
##
## Phase 2 : Purchase Phase
####################################################################################
run = ZEWDPCBaseRun()
run.load_checkpoint(checkpoint_folder_path)
purchased_labels = run.purchase_phase(
unlabelled_dataset, training_dataset, purchase_budget=PURCHASE_BUDGET, compute_budget=time_available
)
run.save_checkpoint(checkpoint_folder_path)
del run
####################################################################################
####################################################################################
##
## Phase 3 : Post Purchase Training Phase
####################################################################################
# Create a runtime instance of the purchased dataset with the right labels
purchased_dataset = instantiate_purchased_dataset(unlabelled_dataset, purchased_labels)
aggregated_dataset = torch.utils.data.ConcatDataset(
[training_dataset, purchased_dataset]
)
print("Training Dataset Size : ", len(training_dataset))
print("Purchased Dataset Size : ", len(purchased_dataset))
print("Aggregated Dataset Size : ", len(aggregated_dataset))
DEBUG_MODE = os.getenv("AICROWD_DEBUG_MODE", False)
if DEBUG_MODE:
TRAINER_CLASS = ZEWDPCDebugTrainer
else:
TRAINER_CLASS = ZEWDPCTrainer
trainer = TRAINER_CLASS(num_classes=6, use_pretrained=True)
trainer.train(
aggregated_dataset, num_epochs=10, validation_percentage=0.1, batch_size=5
)
y_pred = trainer.predict(val_dataset)
y_true = val_dataset_gt._get_all_labels()
####################################################################################
####################################################################################
##
## Phase 4 : Evaluation Phase
####################################################################################
metrics = get_zew_dpc_metrics(y_true, y_pred)
f1_score = metrics["F1_score_macro"]
accuracy_score = metrics["accuracy_score"]
hamming_loss_score = metrics["hamming_loss"]
print()
print("F1 Score : ", f1_score)
print("Accuracy Score : ", accuracy_score)
print("Hamming Loss : ", hamming_loss_score)
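One more note: prediction_phase returns raw sigmoid probabilities, while self.THRESHOLD = 0.5 is defined but never used. Depending on how get_zew_dpc_metrics treats continuous scores, you may want to binarise the outputs yourself. A tiny, hedged example (the scores array below is just a stand-in for whatever prediction_phase returned):

import numpy as np

# Hypothetical post-processing: turn (n, 6) sigmoid scores into hard multi-label predictions.
scores = np.random.rand(4, 6)                 # placeholder for your real predictions
y_pred_binary = (scores >= 0.5).astype(int)   # 0.5 mirrors self.THRESHOLD in ZEWDPCBaseRun
print(y_pred_binary)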
I hope this notebook succeeds in helping you get started. If it does, how about leaving some love 🤎🤎🤎🤎🤎...