Learning to Smell
Right fingerprint is all you need
Using PubChem along with k-nearest neighbours and more!
Dear Community,
My submission is also a very basic one, even though it gives a high score on the current leaderboard. I hope to find some spare time to write something more interesting for the upcoming rounds, which is why I've decided to publish this one.
I wrote a Medium post with a short explanation and some thoughts about what to do next.
The Google Colab notebook can be found here, and the GitHub repository with the full source code is there.
import os
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
!wget https://www.dropbox.com/s/3b2ta3qr706d1ua/aicrowd-learning-to-smell-data.zip
!unzip -o aicrowd-learning-to-smell-data.zip
os.listdir("./data")
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
vocab = pd.read_csv("data/vocabulary.txt", header=None)
I used precomputed fingerprints from PubChem. To reproduce, you can run python download_data_from_pubchem.py, which is available on GitHub, or simply download the file with the collected fingerprints from there.
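For reference, here is a minimal sketch of how such fingerprints could be collected from PubChem's PUG REST service. This is an illustrative assumption rather than the actual download_data_from_pubchem.py: the Fingerprint2D property comes back base64-encoded, and the exact encoding stored in the CSV may differ.

# Hypothetical sketch of fetching PubChem 2D fingerprints per SMILES (not the author's script).
import base64
import time
import requests

PUG_URL = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
           "compound/smiles/property/Fingerprint2D/TXT")

def fetch_fingerprint_hex(smiles):
    # Passing the SMILES as a query parameter avoids problems with '/' and '#' in the URL path.
    response = requests.get(PUG_URL, params={"smiles": smiles}, timeout=30)
    if response.status_code != 200:
        return None  # molecule not found in PubChem; these stay empty in the CSV
    return base64.b64decode(response.text.strip()).hex()

rows = []
for smi in pd.concat([train.SMILES, test.SMILES]).unique():
    rows.append({"SMILES": smi, "fingerprint": fetch_fingerprint_hex(smi)})
    time.sleep(0.2)  # stay well under PubChem's request-rate limit
pd.DataFrame(rows).to_csv("pubchem_fingerprints.csv", index=None)

Here we skip that step and just grab the precomputed file: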
!wget https://raw.githubusercontent.com/latticetower/learning-to-smell-baseline/main/pubchem_fingerprints.csv
fingerprints = pd.read_csv("pubchem_fingerprints.csv")
train_df = train.merge(fingerprints, on="SMILES", how="left")
test_df = test.merge(fingerprints, on="SMILES", how="left")
print(train_df.fingerprint.isnull().sum(), "train molecules have no associated fingerprint")
print(test_df.fingerprint.isnull().sum(), "test molecules have no associated fingerprint")
Only molecules with an available fingerprint are used to find the k nearest neighbours, so I filter both the train and test data and unpack the fingerprints into bit vectors before running k-NN.
def to_bits(x):
    # Convert a hex-encoded fingerprint string into a flat array of 0/1 bits.
    try:
        unpacked = np.unpackbits(np.frombuffer(bytes.fromhex(x), dtype=np.uint8))
    except Exception as e:
        print(e)
        print(x)
        return None
    return unpacked
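For example, on a toy two-byte hex string (not a real PubChem fingerprint) the unpacking looks like this:

to_bits("f0a1")
# -> array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1], dtype=uint8)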
# Keep only molecules that have a fingerprint and unpack the hex strings into bit matrices.
train_df = train_df[~train_df.fingerprint.isnull()]
train_fingerprints = train_df.fingerprint.apply(to_bits)
train_fingerprints = np.stack(train_fingerprints.values)
test_df = test_df[~test_df.fingerprint.isnull()]
test_fingerprints = test_df.fingerprint.apply(to_bits)
test_fingerprints = np.stack(test_fingerprints.values)
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(train_fingerprints)
distances, neighbour_indices = nbrs.kneighbors(test_fingerprints)
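A possible variation I did not use here: for binary fingerprints, Jaccard distance (one minus the Tanimoto similarity, the usual choice in cheminformatics) may suit better than the default Euclidean metric, and scikit-learn's ball tree supports it on boolean data. A minimal sketch, assuming the same bit matrices as above:

# Variant sketch (not used in this submission): k-NN under Jaccard (1 - Tanimoto) distance.
nbrs_jaccard = NearestNeighbors(n_neighbors=5, algorithm='ball_tree', metric='jaccard')
nbrs_jaccard.fit(train_fingerprints.astype(bool))
distances_j, neighbour_indices_j = nbrs_jaccard.kneighbors(test_fingerprints.astype(bool))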
# For each test molecule, join the scent sentences of its 5 nearest train neighbours
# (neighbour_indices are row positions in train_fingerprints, mapped back via train_df.index).
for i, neighbours in zip(test_df.index, neighbour_indices):
    test.loc[i, "PREDICTIONS"] = ";".join([train.loc[train_df.index[x], "SENTENCE"] for x in neighbours])
test.PREDICTIONS.isnull().sum()
We still need to fill in the remaining predictions; for these we use the five most common scent sentences from the train dataset.
train.SENTENCE.value_counts()[:5]
default_prediction = ";".join(train.SENTENCE.value_counts()[:5].index)
test.loc[test.PREDICTIONS.isnull(), "PREDICTIONS"] = default_prediction
test.to_csv("baseline_submission.csv", index=None)
from google.colab import files
files.download("baseline_submission.csv")