🕵️ Introduction
There are countless distinct odors in everything we see or interact with. Our reactions to different smells are almost always instant and instinctive, not cultivated, and a particular smell can sometimes trigger a specific memory. Still, most of us could not say how our brain categorizes the different smells it receives from our sensory inputs.
What happens when particles responsible for smell enter our nose?
Our noses have more than 400 types of olfactory receptors expressed in over a million olfactory sensory neurons, all located on a small patch of tissue called the olfactory epithelium. The olfactory sensory neurons send signals to the olfactory bulb in the brain, and from there to further structures, to make sense of the smell.
We are turning this process digital!
What finally enters our noses are particles carrying the odorant molecules responsible for the smell. These molecules are the actual building blocks of all fragrances. For this challenge, we take these molecular compounds as input, parse them, and predict which of 100+ different fragrance notes they carry.
Example odor sentences for three molecules: ethereal,jasmin,aldehydic,fruity; green,herbal,powdery,grass; cacao,floral,honey
💾 Dataset
The dataset describes each molecule (as its SMILES string) and the odors it possesses. The challenge is a multi-label classification problem: each molecule has multiple odors, written in the form of a sentence with a single , between each odor. The following are the columns in the dataset, with their descriptions:
- SMILES: Simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.
- SENTENCE (target): A combination of the odors of the molecule; each odor is separated by a , to form an (odor) sentence.
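Because SMILES is just a line of ASCII text, any cheminformatics toolkit can parse it into a molecule object for downstream featurization. A minimal sketch, assuming RDKit is installed (a common choice, but not mandated by the challenge):

from rdkit import Chem

# Parse a SMILES string into an RDKit molecule; returns None for invalid input.
mol = Chem.MolFromSmiles("CCO")  # ethanol
print(mol.GetNumAtoms())         # 3 heavy atoms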
📁 Files
The following files are available in the Resources section:
- train.csv (4,316 molecules): This CSV file contains the attributes describing the molecules along with their "SENTENCE".
- test.csv (1,079 molecules, Round-1): The file used for the actual leaderboard evaluation; it does not include the "SENTENCE" for the molecules.
- vocabulary.txt: A file containing the list of all odors present in the dataset.
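As a quick sanity check, the training data can be loaded and the odor sentences split into label lists with a few lines of Python (a minimal sketch, assuming pandas is installed and train.csv is in the working directory):

import pandas as pd

# Load the training set; columns are SMILES and SENTENCE, as described above.
train = pd.read_csv("train.csv")

# Split each odor sentence on "," to get a list of odor labels per molecule.
train["labels"] = train["SENTENCE"].str.split(",")

# Collect the distinct odor words seen in training; compare with vocabulary.txt
# (the full challenge vocabulary spans 109 words across train and Round-1 test).
vocabulary = sorted({odor for labels in train["labels"] for odor in labels})
print(len(vocabulary))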
🖊 Evaluation Criteria
The evaluation of the submissions is done using the Jaccard Index / Tanimoto Similarity Score.
Descriptions of odour can be heterogeneous, varying with personal experience, perfumer, and company, so it is hard to expect a unique and perfect description. We therefore evaluate the best-matching sentence among the proposed top-5 sentences.
For example, if for a single molecule the ground truth is: floral,green,rose
and the top-5 proposed sentences are:
- rose,green,apricot
- floral,muguet,jasmin
- floral,rose,green
- floral,green,melon
- muguet,rose,woody
Then the Jaccard Index is computed for each of the top-5 sentences against the ground truth, and the best score across the 5 predictions is taken as the score for that molecule. When comparing the individual sentences with the ground-truth sentence, only the first 3 words of the ground-truth sentence are considered.
The overall score is computed by taking the mean of the said score across all the molecules in the test set.
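This per-molecule score is easy to reproduce in a few lines (an illustrative sketch, not the official grader; the helper names are our own):

def jaccard(a, b):
    # Jaccard / Tanimoto similarity between two collections of odor words.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def best_of_5(ground_truth, predictions):
    # Keep only the first 3 ground-truth words, then take the best Jaccard
    # score across the (up to 5) predicted sentences.
    gt = ground_truth.split(",")[:3]
    return max(jaccard(gt, p.split(",")) for p in predictions[:5])

# Worked example from above: the third sentence matches the ground truth
# exactly as a set, so the molecule scores 1.0.
top5 = ["rose,green,apricot", "floral,muguet,jasmin", "floral,rose,green",
        "floral,green,melon", "muguet,rose,woody"]
print(best_of_5("floral,green,rose", top5))  # 1.0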
Round-2 Evaluation Criteria
For Round 2, you can choose a subset of the whole vocabulary (composed of 109 smell words) and create your own, if you believe that doing so improves your accuracy.
Read on to understand how it works 👇
Let's define:
- voc_gt (the ground truth vocabulary): the set of smell words in the actual challenge dataset (ground truth), i.e. the 109 distinct smell words present in the training set and the test set of Round-1.
- voc_x (the submission vocabulary): a subset of voc_gt on which participants choose to train their models, and from which they sample their predictions. voc_x has to be composed of at least 60 distinct smell words. It is estimated as the set of all distinct smell words used across all the predictions made by the model.
- model_compression: defined as 1 - [len(voc_x) / len(voc_gt)]. For every 1% of model compression, we expect an improvement in accuracy of at least 0.5%.
- top_5_TSS: the Jaccard Index computed using the top-5 sentences in comparison to the ground truth (as described for Round 1 above).
- top_2_TSS: the Jaccard Index computed using the top-2 sentences in comparison to the ground truth (as opposed to the top 5 for top_5_TSS).
- top_5_TSS_voc_x, top_2_TSS_voc_x: top_5_TSS and top_2_TSS computed using the vocabulary used by the participant. When computing these metrics, any smell word which is not present in voc_x is removed from the ground-truth sentences.
- top_5_TSS_voc_gt, top_2_TSS_voc_gt: top_5_TSS and top_2_TSS computed using the vocabulary present in the ground-truth data. Here, these are exactly the same as top_5_TSS and top_2_TSS.
- Finally, adjusted_top_5_TSS, adjusted_top_2_TSS: the adjusted scores are computed like this 👇
if (top_5_TSS_voc_x - top_5_TSS_voc_gt) >= 0.5 * model_compression:
adjusted_top_5_TSS = top_5_TSS_voc_x
adjusted_top_2_TSS = top_2_TSS_voc_x
else:
adjusted_top_5_TSS = top_5_TSS_voc_gt
adjusted_top_2_TSS = top_2_TSS_voc_gt
So, if the improvement in accuracy between voc_x and voc_gt is greater than the expected 0.5 * model_compression, we use the improved voc_x accuracy; otherwise we use the original voc_gt accuracy.
The leaderboard is sorted by adjusted_top_5_TSS as the primary score, with adjusted_top_2_TSS as the secondary score.
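As a worked example of the rule above (the scores below are made up purely for illustration):

def adjusted_score(tss_voc_x, tss_voc_gt, len_voc_x, len_voc_gt=109):
    # Apply the Round-2 adjustment rule described above.
    model_compression = 1 - (len_voc_x / len_voc_gt)
    # The voc_x score counts only if it beats voc_gt by 0.5 * compression.
    if tss_voc_x - tss_voc_gt >= 0.5 * model_compression:
        return tss_voc_x
    return tss_voc_gt

# An 80-word vocabulary gives compression 1 - 80/109 ≈ 0.266, so the voc_x
# score must exceed the voc_gt score by ≈ 0.133 to be kept.
print(adjusted_score(tss_voc_x=0.62, tss_voc_gt=0.45, len_voc_x=80))  # 0.62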
During the course of Round-2, all the scores are based on 60% of the whole test data, and the final leaderboards on the whole test data will be released at the end of Round-2.
Round-2 submissions are code-based, as opposed to the CSV-based submissions of Round 1. More on that below.
🚀 Submission
Round-1
- Prepare a CSV file with the header SMILES,PREDICTIONS.
- The SMILES column has to contain the SMILES values as mentioned in the test set.
- The PREDICTIONS column has to contain the top-5 predictions of your model separated by ; where each of the odors in each sentence is separated by ,
For example, if the value of the PREDICTIONS column for a particular row is:
coconut,cooling,watery;ambergris,plum,ripe;almond,gourmand,pungent;cognac,dry,medicinal;geranium,lactonic,medicinal
Then, the top-5 predictions of your model are:
- coconut,cooling,watery
- ambergris,plum,ripe
- almond,gourmand,pungent
- cognac,dry,medicinal
- geranium,lactonic,medicinal
Note: If any of the sentences contains more than 3 words, only the first 3 words will be considered for evaluation.
- A sample submission format is available at sample_submission.csv in the Resources section.
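Putting it together, a valid Round-1 submission file can be produced as follows (a minimal sketch; predict_top5 is a hypothetical stand-in for your model):

import pandas as pd

def predict_top5(smiles):
    # Hypothetical placeholder: return 5 comma-separated odor sentences.
    return ["coconut,cooling,watery"] * 5

test = pd.read_csv("test.csv")
rows = [{"SMILES": s, "PREDICTIONS": ";".join(predict_top5(s))}
        for s in test["SMILES"]]
pd.DataFrame(rows).to_csv("submission.csv", index=False)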
Round-2
Round-2 requires participants to submit their code, which will be evaluated on our evaluation infrastructure. Each submission will have access to the following resources during evaluation:
- 4 CPU cores
- 16 GB RAM
- 1 NVIDIA K80 (optional, needs to be enabled in aicrowd.json)
All submissions will have a 10-minute setup time for loading their models and doing any preprocessing they need; after that, they are expected to make a single prediction in less than 1 second (per SMILES string).
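In practice, that budget means doing all of the expensive work once, up front. A hypothetical shape for such a predictor (class and method names are our own, not the starter kit's actual interface):

import pickle

class SmellPredictor:
    def __init__(self, model_path="model.pkl"):
        # Setup phase (10-minute budget): load the trained model exactly once.
        with open(model_path, "rb") as f:
            self.model = pickle.load(f)

    def featurize(self, smiles):
        # Keep per-molecule featurization lightweight to stay under 1 second.
        ...

    def predict(self, smiles):
        # Per-call phase (< 1 s): featurize one SMILES string and return the
        # top-5 odor sentences.
        return self.model.predict_top5(self.featurize(smiles))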
🚀 For more instructions on how to make a submission, check out this getting_started_kit
📅 Rounds
The competition consists of 3 separate rounds.
- Round-1: September 8th, 2020 - October 27th, 2020
- Round-2: November 23rd, 2020 - January 10th, 2021
- Round-3: January 15th, 2021 - February 15th, 2021
🏆 Prizes
The top 2 participants of Round-3 will be awarded a cash prize of:
- 1st Prize: CHF 4,000
- 2nd Prize: CHF 2,000
Round-1 Community Contribution Prize: CHF 1,000 prize pool
🔗 Links
- 💪 Challenge Page: https://www.aicrowd.com/challenges/learning-to-smell
- 🗣️ Discussion Forum: https://www.aicrowd.com/challenges/learning-to-smell/discussion
- 🏆 Leaderboard: https://www.aicrowd.com/challenges/learning-to-smell/leaderboards
📱 Contact
📚 Acknowledgement
We have permission to use the olfactive descriptions and molecules from the "PMP database", authored by Mans Boelens and distributed by Leffingwell & Associates, for this challenge.