sergey_zlobin
Sergey Zlobin

Location

RU


Challenges Entered

A benchmark for image-based food recognition

Latest submissions

No submissions made in this challenge.

Using AI For Building’s Energy Management

Latest submissions

No submissions made in this challenge.

What data should you label to get the most value for your money?

Latest submissions

failed 184202
graded 179185
graded 179000

Behavioral Representation Learning from Animal Poses.

Latest submissions

No submissions made in this challenge.

sergey_zlobin has not joined any teams yet...

Data Purchasing Challenge 2022

🚀 Share your solutions! 🚀

Over 2 years ago

Hello, I want to share my solution.

The competition was very interesting and unusual. It was also my first competition on the AIcrowd platform, and the guides, pages, and discussions were very helpful to me. Thanks to the organizers!

Actually my solution is very similar to xiaozhou_wang’s.

I have two strategies. The first is based on the idea of collecting samples from “hard” classes (it carried over from Round 1). Suppose we have a trained model and know the validation F1-measure for each of the six classes. We sum the class predictions with weights equal to 1 - f1_validation and then choose the samples with the largest weighted sums.


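def choose_unlabelled_by_sum_probs(self, unlabelled_indices, unlabelled_preds, choose_size):
    # Requires `import numpy as np`; n_classes is assumed to be defined at module level.
    assert len(unlabelled_indices) == len(unlabelled_preds)

    # Nothing to rank if the whole pool fits into the requested size.
    if len(unlabelled_indices) <= choose_size:
        return unlabelled_indices

    # Per-class validation F1 scores of the current model.
    _, best_f1s = self.best_states['best_thrs_0']

    # Weight each class probability by (1 - F1), so harder classes count more.
    choose_scores = unlabelled_preds[:, 0] * (1 - best_f1s[0])
    for x in range(1, n_classes):
        choose_scores += unlabelled_preds[:, x] * (1 - best_f1s[x])
    # Highest weighted score first.
    sorted_indices = np.argsort(-choose_scores)
    return [unlabelled_indices[x] for x in sorted_indices[:choose_size]]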
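
For readers who want to try the weighting idea outside the class, here is a minimal standalone sketch. The function name `select_by_weighted_probs` and the toy numbers are assumptions made for illustration and are not part of the original solution; it assumes `preds` is an (n_samples, n_classes) array of per-class probabilities and `f1s` holds one validation F1 per class.

```python
import numpy as np

def select_by_weighted_probs(indices, preds, f1s, k):
    """Return the k indices whose predictions lean most toward low-F1 classes."""
    preds = np.asarray(preds, dtype=float)
    weights = 1.0 - np.asarray(f1s, dtype=float)   # harder class -> larger weight
    scores = preds @ weights                       # weighted sum over classes
    order = np.argsort(-scores, kind="stable")     # highest score first
    return [indices[i] for i in order[:k]]

# Toy example (made-up numbers): class 1 is the "hard" one with F1 = 0.5.
picked = select_by_weighted_probs(
    indices=[10, 11, 12],
    preds=[[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
    f1s=[0.9, 0.5],
    k=2,
)
print(picked)
```

In the toy data the sample predicted mostly as the hard class is ranked first, which is the behaviour the strategy is after.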

The second strategy is to collect samples with higher uncertainty. I treat a prediction of 0.5 as the most uncertain, so I sum 0.5 - |p - 0.5| over all classes, where p is the predicted probability for a class.

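def choose_unlabelled_by_uncertainty(self, unlabelled_indices, unlabelled_preds, choose_size):
    # Requires `import numpy as np`.
    assert len(unlabelled_indices) == len(unlabelled_preds)

    # Nothing to rank if the whole pool fits into the requested size.
    if len(unlabelled_indices) <= choose_size:
        return unlabelled_indices

    # 0.5 - |p - 0.5| peaks at p = 0.5 and is 0 at p = 0 or p = 1; sum it over classes.
    # (The unused best_f1s lookup from the first strategy is dropped here.)
    choose_scores = np.sum(0.5 - np.abs(unlabelled_preds - 0.5), axis=1)
    # Most uncertain samples first.
    sorted_indices = np.argsort(-choose_scores)
    return [unlabelled_indices[x] for x in sorted_indices[:choose_size]]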
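
The uncertainty score can be checked on a few hand-made predictions. This is a small sketch under the same assumptions as the method above (multi-label probabilities in [0, 1]); the name `select_by_uncertainty` and the example values are hypothetical.

```python
import numpy as np

def select_by_uncertainty(indices, preds, k):
    """Return the k indices whose per-class probabilities sit closest to 0.5."""
    preds = np.asarray(preds, dtype=float)
    # Per class: 0.5 at p = 0.5 (most uncertain), 0.0 at p = 0 or p = 1.
    scores = np.sum(0.5 - np.abs(preds - 0.5), axis=1)
    order = np.argsort(-scores, kind="stable")     # most uncertain first
    return [indices[i] for i in order[:k]]

# Toy example (made-up numbers): the second sample is the least confident.
uncertain = select_by_uncertainty(
    indices=[20, 21, 22],
    preds=[[0.99, 0.01], [0.5, 0.6], [0.8, 0.3]],
    k=1,
)
print(uncertain)
```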

I also considered the third strategy from the hosts, “match labels to target distribution”, but results were worse with it than without it. PS to the organizers: this code is still in my solution because I experimented with it, but it selects very few samples and I don’t think it affects the score.

I tried several budget ratios between the first two strategies, but neither showed an obvious advantage. So in the end I gave both strategies an equal budget.

I saw the idea of “Active Learning” in one of the papers and decided to run several iterations (let’s say L):

  1. Train a model with current known samples
  2. Take ~purchase_budget//L samples using the two strategies (the last batch can be bigger by 1).
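
The two steps above can be sketched as a small loop. Everything here is an illustrative assumption rather than the original code: `train`, `predict`, and `select` are hypothetical callables standing in for the real model and the two selection strategies, and the whole remainder of the integer division is simply given to the last batch.

```python
def batch_sizes(purchase_budget, n_loops):
    """Split the budget into n_loops batches; the last batch absorbs the remainder."""
    base, extra = divmod(purchase_budget, n_loops)
    sizes = [base] * n_loops
    sizes[-1] += extra
    return sizes

def run_rounds(unlabelled, purchase_budget, n_loops, train, predict, select):
    """Alternate training and purchasing for n_loops rounds."""
    labelled = []
    pool = list(unlabelled)
    for size in batch_sizes(purchase_budget, n_loops):
        model = train(labelled)                # step 1: train on what is known so far
        preds = predict(model, pool)
        chosen = select(pool, preds, size)     # step 2: buy the next batch
        chosen_set = set(chosen)
        labelled.extend(chosen)
        pool = [i for i in pool if i not in chosen_set]
    return labelled

sizes = batch_sizes(1000, 3)

# Toy run: the "model" is a dummy and the score of a sample is just its index.
bought = run_rounds(
    unlabelled=range(10), purchase_budget=5, n_loops=2,
    train=lambda labelled: None,
    predict=lambda model, pool: [float(i) for i in pool],
    select=lambda pool, preds, k: [i for _, i in sorted(zip(preds, pool), reverse=True)[:k]],
)
print(sizes, bought)
```

Retraining between batches lets each purchase use predictions from a model that has already seen the previous purchases, which is the point of iterating instead of buying everything at once.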

The problem was how to choose the number of iterations L. My approach is not as clever as xiaozhou_wang’s. I noticed that ~300 samples are enough for one iteration. Moreover, in my experiments more iterations sometimes made the result worse. I looked at the submissions table to estimate training and inference time, and arrived at this formula (I have a pretraining phase, so the first iteration doesn’t need training):

max_choose_size = min(len(unlabelled_dataset), purchase_budget)
n_loops = max(1, min(1 + (compute_budget - 50) // 220, int_ceil(max_choose_size, 290)))
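
To see how the formula behaves, here is a small worked sketch. The post does not show `int_ceil`, so it is assumed here to be integer ceiling division; the budget values in the calls are made up for illustration, while the constants 50, 220, and 290 are taken from the formula above.

```python
def int_ceil(a, b):
    # Assumed meaning of the helper: ceiling of a / b for positive integers.
    return -(-a // b)

def n_loops_for(compute_budget, purchase_budget, n_unlabelled):
    max_choose_size = min(n_unlabelled, purchase_budget)
    return max(1, min(1 + (compute_budget - 50) // 220,
                      int_ceil(max_choose_size, 290)))

# Compute-limited case: 1 + 950 // 220 = 5 loops, although 3000 samples would allow 11.
print(n_loops_for(compute_budget=1000, purchase_budget=3000, n_unlabelled=10000))
# Purchase-limited case: only ceil(500 / 290) = 2 loops are worth running.
print(n_loops_for(compute_budget=10000, purchase_budget=500, n_unlabelled=10000))
# Very small compute budget: the outer max keeps at least one loop.
print(n_loops_for(compute_budget=30, purchase_budget=3000, n_unlabelled=10000))
```

The first term caps the loop count by the available compute, and the second caps it so that each loop buys roughly 290 samples or more.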

For training I used efficientnet_b3, 5 epochs with

CosineAnnealingLR(optimizer, T_max=5, eta_min=1e-5)
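
For intuition about this schedule, the closed-form cosine annealing curve from the PyTorch documentation can be evaluated without PyTorch. The base learning rate of 1e-3 is an assumption made for the example; the post does not state the optimizer's learning rate.

```python
import math

def cosine_lr(epoch, base_lr, t_max=5, eta_min=1e-5):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# base_lr=1e-3 is a hypothetical value chosen only to show the shape of the curve.
schedule = [cosine_lr(e, base_lr=1e-3) for e in range(6)]
for epoch, lr in enumerate(schedule):
    print(epoch, f"{lr:.6f}")
```

With T_max equal to the number of epochs, the learning rate decays smoothly over the whole run and reaches eta_min only after the final scheduler step.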

and the following augmentations

return A.Compose([
    A.OneOf([A.GaussianBlur(), A.MotionBlur()], p=0.5),
    A.ToGray(p=0.01),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
])

Code for End of Competition Training pipelines

Almost 3 years ago

Each of the 5 training pipelines will run with its own budget, right?

:aicrowd: [Update] Round 2 of Data Purchasing Challenge is now live!

Almost 3 years ago

Hi, it seems there’s a bug in local_evaluation.py.
I think you should change
time_available = COMPUTE_BUDGET - (time_started - time.time())
→
time_available = COMPUTE_BUDGET - (time.time() - time_started)
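
A quick numeric check shows why the sign matters. The timestamps below are fixed, made-up values used in place of `time.time()` so the result is reproducible.

```python
COMPUTE_BUDGET = 100.0   # hypothetical budget in seconds
time_started = 1000.0    # hypothetical start timestamp
now = 1030.0             # stands in for time.time(), i.e. 30 s later

# Original expression: elapsed time comes out negative, so the budget grows.
buggy = COMPUTE_BUDGET - (time_started - now)
# Proposed fix: elapsed time is positive, so the budget shrinks.
fixed = COMPUTE_BUDGET - (now - time_started)

print(buggy, fixed)
```

With the original expression the available time increases as the run goes on, so the compute limit would never be reached locally.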

0.9+ Baseline Solution for Part 1 of Challenge

Almost 3 years ago

Thanks for publishing your solution!
Do you know how much of a boost in accuracy “pseudolabel remaining dataset” gives?
I didn’t use it.

Experiments with “unlabelled” data

Almost 3 years ago

I’ve checked it locally.
Using all 10K images is better than my 3K selection by 0.006. Maybe I can recover some of that gap by changing the purchasing algorithm. But I still feel I need to tune my model.

Experiments with “unlabelled” data

Almost 3 years ago

The scores I posted are from the leaderboard. I can’t check 10K there…
Local scores are a little bit higher than LB, but correlated with LB.
Yeah maybe I’ll check it locally.

Experiments with “unlabelled” data

Almost 3 years ago

Here are just my results. I used the same model, but different purchase modes.

  1. Train with initial 5000 images only: LB 0.869
  2. Add 3000 random images from unlabelled dataset: 0.881
  3. “smart” purchasing (at least non-random): 0.888

So “smart” purchasing helps, but not by much, maybe ~0.01.
Probably tuning models would be more helpful to push further.

First round doesn't matter?

Almost 3 years ago

If I understood correctly, the first round counts for little and is only preliminary. The second round is decisive, right?

Size of Datasets

Almost 3 years ago

Ahh… I see so AICrowd runs the whole pipeline twice, and I can see logs only from the debug version.
Great, thanks!

Size of Datasets

Almost 3 years ago

Hello!
During submission, the datasets contain only 100 samples each (both the training dataset and the unlabelled dataset).
Probably it is the debug version.
Is that intentional?

Potential loop hole in purchasing phase

Almost 3 years ago

I think local evaluation can be modified somehow.
Maybe the ZEWDPCProtectedDataset class could be changed so that it doesn’t return the label with a sample.

Allowance of Pre-trained Model

Almost 3 years ago

Sorry, what’s the right way to use a pre-trained model?
I’ve tried “models.resnet18(pretrained=True)” but it failed with
urllib.error.URLError: <urlopen error [Errno 99] Cannot assign requested address>

sergey_zlobin has not provided any information yet.