Data Purchasing Challenge 2022

Ways to Select Which Data to Purchase - Episode 1

Intro

Saw an interesting discussion post by @gaurav_singhal here.

I managed to read some great resources about active learning:

A Sequential Algorithm for Training Text Classifiers, SIGIR, 1994

Active Hidden Markov Models for Information Extraction, IDA, 2001

Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009

Deep Bayesian Active Learning with Image Data, ICML, 2017

DeepAL: Deep Active Learning in Python, Kuan-Hao Huang, 2021 (this implementation is heavily based on it)

Human-in-the-Loop Machine Learning

And I ended up trying lots of them.

PURCHASE METHOD / ACTIVE LEARNING LIST

So here are some implementation notebooks using this challenge's data:

LeastConfidence

For each unlabelled sample, take the highest predicted probability across the labels, then pick the samples where that highest probability is the lowest.
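To make the idea concrete, here is a minimal NumPy sketch of that selection step. The function name `least_confidence_query` and the toy `probs` array are just for illustration and are not taken from the notebook; it assumes `probs` has one row per unlabelled sample and one column per label.

```python
import numpy as np

def least_confidence_query(probs: np.ndarray, n: int) -> np.ndarray:
    """Return the positions of the n samples whose top probability is lowest."""
    top_prob = probs.max(axis=1)                  # confidence of the most likely label
    return np.argsort(top_prob, kind="stable")[:n]  # least confident samples first

# Toy predictions (made up): 3 unlabelled samples, 3 labels.
probs = np.array([
    [0.90, 0.05, 0.05],   # very confident
    [0.40, 0.35, 0.25],   # least confident
    [0.60, 0.30, 0.10],
])
picked = least_confidence_query(probs, n=2)
print(picked)
```

The returned positions index into the unlabelled pool, so they still have to be mapped back to the real sample ids before buying.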

Bayesian Active Learning Disagreement (BALD)

Take several predictions per sample, compute the difference between the mean of their entropies and the entropy of their mean, then pick the samples where that difference is the lowest.
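A rough NumPy sketch of that score, as I understand it. It assumes the predictions come from several stochastic forward passes (for example with dropout left on, as in the Deep Bayesian Active Learning paper) stacked into an array of shape `(n_passes, n_samples, n_labels)`. The name `bald_query`, the `eps` guard and the toy numbers are illustrative only.

```python
import numpy as np

def bald_query(probs: np.ndarray, n: int, eps: float = 1e-12) -> np.ndarray:
    """probs has shape (n_passes, n_samples, n_labels)."""
    mean_probs = probs.mean(axis=0)
    # Entropy of the averaged prediction, one value per sample.
    entropy_of_mean = -(mean_probs * np.log(mean_probs + eps)).sum(axis=1)
    # Average of the per-pass entropies, one value per sample.
    mean_of_entropy = -(probs * np.log(probs + eps)).sum(axis=2).mean(axis=0)
    # This difference is never above zero (up to rounding); the more negative
    # it is, the more the passes disagree with each other.
    score = mean_of_entropy - entropy_of_mean
    return np.argsort(score, kind="stable")[:n]

# Toy predictions (made up): 2 passes, 3 samples, 2 labels.
probs = np.array([
    [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]],
    [[0.5, 0.5], [0.1, 0.9], [0.9, 0.1]],
])
picked = bald_query(probs, n=1)
print(picked)
```

Note that sample 0 is uncertain in every pass but the passes agree, so it scores about zero; only sample 1, where the passes contradict each other, gets a clearly negative score.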

MarginSampling

Sort each sample's label probabilities from highest to lowest, then take the difference between the top two; the samples with the smallest difference get picked. The formula is quite weird, I'm not confident about using this one.
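Here is a small NumPy sketch of the usual margin formula (top probability minus runner-up), in case it helps to see it written out. `margin_query` and the toy `probs` values are illustrative names and numbers, not code from the notebook.

```python
import numpy as np

def margin_query(probs: np.ndarray, n: int) -> np.ndarray:
    """Return the positions of the n samples with the smallest top-2 margin."""
    sorted_probs = np.sort(probs, axis=1)[:, ::-1]     # each row sorted high to low
    margin = sorted_probs[:, 0] - sorted_probs[:, 1]   # best label vs runner-up
    return np.argsort(margin, kind="stable")[:n]       # smallest margin first

# Toy predictions (made up): 3 unlabelled samples, 3 labels.
probs = np.array([
    [0.45, 0.44, 0.11],   # two labels almost tied -> tiny margin
    [0.40, 0.30, 0.30],   # lowest top probability, but a larger margin
    [0.90, 0.05, 0.05],
])
picked = margin_query(probs, n=1)
print(picked)
```

On this toy data the margin picks sample 0, whereas picking by lowest top probability would pick sample 1, so the two methods can rank the same pool differently.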

KmeansSampling

It's the slowest one! Basically, collect the embeddings, cluster them, calculate the distance of each unlabelled sample to the clusters, and pick the ones farthest from any cluster.

RESULTS
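The steps above can be sketched roughly like this. To keep it dependency-free it uses a tiny hand-rolled k-means in NumPy instead of a library clusterer, so treat `kmeans`, `kmeans_farthest_query`, the iteration count and the toy 2-D "embeddings" as assumptions for illustration only; a real run would use the model's embeddings.

```python
import numpy as np

def kmeans(embeddings: np.ndarray, k: int, n_iter: int = 20, seed: int = 0) -> np.ndarray:
    """Very small Lloyd's k-means; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):                 # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def kmeans_farthest_query(embeddings: np.ndarray, n: int, k: int) -> np.ndarray:
    """Return the positions of the n samples farthest from their nearest center."""
    centers = kmeans(embeddings, k)
    dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)              # distance to the closest cluster center
    return np.argsort(-nearest, kind="stable")[:n]

# Toy 2-D embeddings (made up): two tight groups plus one stray point at the end.
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [10.0, 10.0], [10.1, 10.0], [10.0, 10.1],
    [4.0, 0.0],
])
picked = kmeans_farthest_query(embeddings, n=1, k=2)
print(picked)
```

The pairwise distance matrix is `n_samples x k`, which is part of why this method gets slow on a large unlabelled pool.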

Continuing from my previous experiment here: https://www.aicrowd.com/showcase/lb-0-880-my-experiment-results-baseline-too-i-guess

here are the results from each method!

Method	% Score Increase*
Random	0.12%
LeastConfidence	2.33%
BALD	0.28%
MarginSampling	-0.59%
KmeansSampling	1.29%

*I'll rerun each method multiple times to get the standard deviation (±) of the results

TIPS!!:

In the papers' implementations, there are usually multiple 'rounds' of buying labels so that the end result is good, which I think is difficult to achieve within this competition's limited runtime (well, that's the challenge). So make sure to balance the number of training epochs against the number of rounds however you like, and still pay attention to the 3-hour running time. The notebook's default settings are obviously not the best ones!
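For anyone unsure what 'rounds' means in practice, here is a rough sketch of the loop: split the label budget into chunks, and retrain before each purchase. Everything here is hypothetical scaffolding (`purchase_in_rounds`, the dummy `train` / `predict_proba` callables, the budget numbers); it is not the competition API or the notebook's code, just the shape of the idea.

```python
import numpy as np

def purchase_in_rounds(n_unlabelled, budget, n_rounds, train, predict_proba, query):
    """Split a label budget over several train -> score -> buy rounds."""
    unlabelled = np.arange(n_unlabelled)
    purchased = []
    per_round = budget // n_rounds
    for r in range(n_rounds):
        # The last round also gets whatever is left over from the integer division.
        n_buy = per_round if r < n_rounds - 1 else budget - per_round * (n_rounds - 1)
        train(np.array(purchased, dtype=int))          # retrain with labels bought so far
        probs = predict_proba(unlabelled)              # score the remaining pool
        picked = unlabelled[query(probs, n_buy)]       # query returns positions in probs
        purchased.extend(picked.tolist())
        unlabelled = np.setdiff1d(unlabelled, picked)  # never buy the same label twice
    return purchased

# Dummy stand-ins so the sketch runs; a real model would go here.
rng = np.random.default_rng(0)

def dummy_train(purchased_idx):
    pass  # placeholder: this is where the training epochs would be spent

def dummy_predict_proba(idx):
    p = rng.random((len(idx), 4))
    return p / p.sum(axis=1, keepdims=True)

def least_confidence(probs, n):
    return np.argsort(probs.max(axis=1))[:n]

bought = purchase_in_rounds(100, 10, 3, dummy_train, dummy_predict_proba, least_confidence)
print(len(bought))
```

More rounds means more retraining calls, so each extra round has to be paid for with fewer epochs per round if the total runtime is fixed.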

I'm planning to add more methods soon.

Feel free to comment or correct me if there's an improvement, correction, or anything for this implementation!

Hope this will help you guys!

Pls leave some likes 💖 too, thanks!
