Data Purchasing Challenge 2022
Ways to Select Which Data to Purchase - Episode 1
Active Learning Methods
Intro
I saw an interesting discussion post by @gaurav_singhal here.
I managed to read some great resources about active learning:
A Sequential Algorithm for Training Text Classifiers, SIGIR, 1994
Active Hidden Markov Models for Information Extraction, IDA, 2001
Deep Bayesian Active Learning with Image Data, ICML, 2017
DeepAL: Deep Active Learning in Python, Kuan-Hao Huang, 2021 -> the implementations here are heavily based on this one
Human-in-the-Loop Machine Learning
And I ended up trying lots of them.
PURCHASE METHOD / ACTIVE LEARNING LIST
So here are some implementation notebooks using this challenge's data:
LeastConfidence: take the maximum probability across the labels for each sample, then pick the samples where that maximum is lowest.
BALD: compute the difference between the entropy of the probabilities and the mean of the entropies, then pick the samples with the lowest value.
MarginSampling: sort the probabilities from highest to lowest, then take the difference between the label probabilities. The formula is quite weird, and I'm not confident about using this one.
KmeansSampling: the slowest one! Basically, collect the embeddings, cluster them, calculate the distance of each unlabelled sample to the clusters, and pick the samples that are farthest from any cluster.
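The "lowest maximum probability" and "difference between sorted probabilities" ideas above can be sketched in a few lines of plain Python. This is only an illustration, not the notebook code: the function names and the `probs` layout (one list of label probabilities per unlabelled sample) are assumptions, and the margin here is the common "top probability minus second-highest probability" variant, which may differ from the formula used in the notebooks.

```python
def least_confidence_scores(probs):
    """Per-sample confidence: the highest probability across the labels."""
    return [max(p) for p in probs]


def margin_scores(probs):
    """Per-sample margin: gap between the two highest probabilities.

    Assumes every sample has at least two label probabilities.
    """
    scores = []
    for p in probs:
        top = sorted(p, reverse=True)
        scores.append(top[0] - top[1])
    return scores


def pick_lowest(scores, budget):
    """Indices of the `budget` samples with the lowest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:budget]


# Hypothetical predictions for three unlabelled images (made-up numbers).
probs = [
    [0.90, 0.10, 0.20],
    [0.55, 0.50, 0.10],
    [0.60, 0.30, 0.20],
]
to_buy_lc = pick_lowest(least_confidence_scores(probs), budget=2)
to_buy_margin = pick_lowest(margin_scores(probs), budget=1)
```

In both cases a lower score means the model is less sure about the sample, so those are the labels worth buying first.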
RESULTS
Continuing my earlier experiment here: https://www.aicrowd.com/showcase/lb-0-880-my-experiment-results-baseline-too-i-guess
Here are the results for each method!
Method | % Score Increase* |
---|---|
Random | 0.12% |
LeastConfidence | 2.33% |
BALD | 0.28% |
MarginSampling | -0.59% |
KmeansSampling | 1.29% |
*I'll rerun each method multiple times to get the standard deviation (±) of the results.
TIPS!!:
In the papers' implementations, labels are usually bought over multiple 'rounds' so that the end result is good, which I think is difficult to achieve within this competition's limited runtime (well, that's the challenge). So make sure to balance your training epochs against the number of rounds however you like, and still pay attention to the 3-hour running time limit. The notebook's default settings are obviously not the best ones!
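To make the 'rounds' idea concrete, here is a rough sketch of how a purchase budget could be split over several rounds. Everything in it is hypothetical: `run_rounds`, `train_fn`, and `score_fn` are placeholder names rather than the challenge's API, and the sketch assumes a lower score means a sample is more worth buying.

```python
def run_rounds(unlabelled_ids, total_budget, n_rounds, train_fn, score_fn):
    """Spend `total_budget` label purchases over `n_rounds` rounds."""
    pool = list(unlabelled_ids)
    purchased = []
    per_round = total_budget // n_rounds
    for round_idx in range(n_rounds):
        if round_idx < n_rounds - 1:
            budget = per_round
        else:
            # The last round gets whatever is left of the budget.
            budget = total_budget - len(purchased)
        budget = min(budget, len(pool))
        model = train_fn(purchased)      # retrain on everything bought so far
        scores = score_fn(model, pool)   # one score per item still in the pool
        ranked = sorted(range(len(pool)), key=lambda i: scores[i])
        chosen = [pool[i] for i in ranked[:budget]]
        purchased.extend(chosen)
        chosen_set = set(chosen)
        pool = [x for x in pool if x not in chosen_set]
    return purchased


# Toy usage: the "model" is just a counter, and higher ids score lower.
bought = run_rounds(
    unlabelled_ids=range(10),
    total_budget=5,
    n_rounds=2,
    train_fn=lambda purchased: len(purchased),
    score_fn=lambda model, pool: [-x for x in pool],
)
```

More rounds mean more retraining, so the number of rounds times the epochs per round is what has to fit inside the 3-hour limit.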
I'm planning to add more methods soon.
Feel free to comment or correct me if there's an improvement, correction, or anything for this implementation!
Hope this will help you guys!
Pls leave some likes 💖 too, thanks!