Learning to Smell
Tutorial on how to build a baseline logistic regression model
Basics of SMILES, data exploration and evaluating with Jaccard Index/Tanimoto Similarity Score
Hi everyone,
I wrote a tutorial on how to build a baseline logistic regression model, hope you find it useful. Some of the topics I covered are:
- Basics of the SMILES format
- Data exploration - frequency of the odors
- Converting the data into a format that can be used as input to the model
- Building / training a logistic regression model with Keras
- Making and formatting predictions for submission
- Evaluating the model on the training set with Jaccard Index / Tanimoto Similarity Score
Here is the Colab notebook:
https://colab.research.google.com/drive/143hNxXLAje5_HvbEPnt4N6l7ZHcv9yWY?usp=sharing
Download the data and install the dependencies
References: https://stackoverflow.com/questions/63152656/installing-rdkit-in-google-colab
!pip install numpy matplotlib pandas scikit-learn tensorflow gdown
!gdown https://drive.google.com/uc?id=1McblfXbazPNWpWyU7gYEAzZrK88yBZ7K
!unzip data.zip
!wget -q -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -q -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import rdMolDescriptors, Draw
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer
Introduction
SMILES (Simplified Molecular Input Line Entry System) is a chemical notation in string format that allows a user to represent a chemical structure in a way that can be used by the computer. SMILES supports all elements in the periodic table.
- Upper case letters refer to non-aromatic atoms; lower case letters refer to aromatic atoms. If the atomic symbol has more than one letter, the second letter must be lower case.
- Structures entered using SMILES are hydrogen-suppressed: molecules are represented without explicit hydrogen atoms.
- A branch from a chain is specified by placing the SMILES symbol(s) for the branch in parentheses.
- The charge on an atom is written in curly brackets; the number of charges may be stated explicitly ({-1}) or omitted ({-}).
- Atomic bonds are denoted as follows:
| Bond type | Symbol |
|---|---|
| Single bond | `-` |
| Double bond | `=` |
| Triple bond | `#` |
| Aromatic bond | `*` |
| Disconnected structures | `.` |
Reference: https://archive.epa.gov/med/med_archive_03/web/html/smiles.html
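To make the notation concrete, here is a minimal sketch (not part of the original notebook) that uses the RDKit import from above to parse toluene, written in SMILES as Cc1ccccc1, and inspect which atoms are aromatic; the molecule is only an illustrative choice.
# Parse a small example molecule: toluene (a benzene ring with a methyl branch)
example_mol = Chem.MolFromSmiles("Cc1ccccc1")
# Hydrogens are suppressed, so only the 7 heavy atoms are listed
for atom in example_mol.GetAtoms():
    print(atom.GetIdx(), atom.GetSymbol(), "aromatic" if atom.GetIsAromatic() else "aliphatic")
# RDKit can also write the molecule back out as canonical SMILES
print(Chem.MolToSmiles(example_mol))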
This tutorial is divided into five main sections:
- Setup
- Data exploration
- Training and test set generation
- Model building
- Model evaluation
i) Setup
- Set environment seeds to ensure experiment reproducibility
- Load training and test sets and the vocabulary
# Set environment seeds
RANDOM_SEED = 17
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
# Training and test set paths
TRAINING_DATASET_PATH = "data/train.csv"
TEST_DATASET_PATH = "data/test.csv"
# Vocabulary path
VOCABULARY_PATH = "data/vocabulary.txt"
# Load datasets and vocabulary
train_df = pd.read_csv(TRAINING_DATASET_PATH)
test_df = pd.read_csv(TEST_DATASET_PATH)
vocabulary = open(VOCABULARY_PATH).read().split()
# Training set preview
train_df.head()
ii) Data Exploration
- What are the most / least common odors?
- What does the data distribution look like?
- Are all odors represented in the training set?
# Get a flattened list of all labels
labels = [label.split(',') for label in train_df.SENTENCE.tolist()]
labels = [item for sublist in labels for item in sublist]
# Count the number of occurrences of each label
counts = dict((x, labels.count(x)) for x in set(labels))
counts
# 10 most common labels
most_common = sorted(counts.items(), key=lambda x:-x[1])[:10]
for x in most_common:
print("{}: {}".format(x[0], x[1]))
# 10 least common labels
least_common = sorted(counts.items(), key=lambda x:-x[1], reverse=True)[:10]
for x in least_common:
print("{}: {}".format(x[0], x[1]))
# Check that every label in the training set appears in the vocabulary (an empty list means full coverage)
[label for label in labels if label not in vocabulary]
# Plot the proportion of training samples that contain each label
keys = counts.keys()
vals = counts.values()
vals_perc = [round(val/len(train_df), 2) for val in vals]
fr = plt.bar(keys, vals_perc, label="Proportion of label occurrences")
plt.ylabel('Proportion')
plt.xlabel('Labels')
plt.xticks([])
plt.legend(bbox_to_anchor=(1, 1), loc="upper right", borderaxespad=0.)
plt.show()
Notes: As can be seen in the analysis above, the dataset is highly imbalanced. While this notebook doesn't cover it, it might be useful to experiment with data augmentation techniques or perform a more in-depth analysis of the under-represented labels.
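One simple augmentation idea, sketched below under the assumption that you want more training strings for the rare odors: the same molecule can be written as many different non-canonical SMILES strings, so shuffling the atom order with RDKit yields extra samples that keep the original labels. The helper name random_smiles and the default of 3 variants are illustrative choices, not part of the notebook.
def random_smiles(smiles_string, n_variants=3):
    # Generate up to n_variants non-canonical SMILES strings for the same molecule
    mol = Chem.MolFromSmiles(smiles_string)
    atom_order = list(range(mol.GetNumAtoms()))
    variants = set()
    for _ in range(n_variants):
        random.shuffle(atom_order)
        shuffled = Chem.RenumberAtoms(mol, atom_order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return sorted(variants)
# Example: extra SMILES strings that could be added with the same SENTENCE labels
print(random_smiles(train_df.SMILES[0]))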
iii) Training and Test Set Generation
We need to convert each SMILES string into a format we can work with. In this notebook, we will use the Morgan fingerprint of each molecule to convert its SMILES string into a 256-bit vector. Molecular fingerprints are one of the ways to represent molecules as mathematical objects (bit vectors).
We will also need to process the labels and convert them into a numerical format so that they can be fed into the model. In order to do this, we will:
- Convert the SENTENCE column of the dataframe to a list of lists
- Create a tokenizer with the available vocabulary and assign an integer value to each unique label
- Use the tokenizer to transform the values in the list of lists (of labels) into their respective token ids
- Use scikit-learn's CountVectorizer to convert the label set of each molecule into a one-hot-encoded format.
References:
https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
https://docs.chemaxon.com/display/docs/Extended+Connectivity+Fingerprint+ECFP
https://towardsdatascience.com/a-practical-introduction-to-the-use-of-molecular-fingerprints-in-drug-discovery-7f15021be2b1
def convert_to_bits(SMILES):
    # Convert SMILES strings into RDKit molecule objects
    mols = [Chem.rdmolfiles.MolFromSmiles(SMILES_string) for SMILES_string in SMILES]
    # Compute a 256-bit Morgan fingerprint (radius 2) for each molecule
    fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(m, radius=2, bitInfo={}, nBits=256, useChirality=True) for m in mols]
    # Convert the fingerprints into numpy arrays
    np_fps = []
    for fp in fps:
        arr = np.zeros((1,), dtype=int)
        DataStructs.ConvertToNumpyArray(fp, arr)
        np_fps.append(arr)
    return np_fps
# Convert training and test set samples to 256 bits (arrays of size (256, ))
np_fps = convert_to_bits(train_df.SMILES.tolist())
test_np_fps = convert_to_bits(test_df.SMILES.tolist())
# Print the fingerprint of first training sample
np_fps[0]
# Tokenize the vocabulary - assigns a unique id to each odor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(vocabulary)
word_index = tokenizer.word_index
# Convert SENTENCE column to a list of lists, replace values with their respective id
labels = [label.split(',') for label in train_df.SENTENCE.tolist()]
labels_arr = []
for label in labels:
labels_arr.append([word_index[elem] for elem in label])
# Print the labels (integer ids) of the first molecule
print(labels_arr[0])
# Training and test inputs
train_x = np.array(np_fps).astype(np.float32)
test_x = np.array(test_np_fps).astype(np.float32)
# Training set labels - vectorize the tokens
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(","), binary=True)
train_y = vectorizer.fit_transform(train_df.SENTENCE).toarray()
# Index - label map
ind2word = {i: x for i, x in enumerate(vectorizer.get_feature_names())}
print(ind2word)
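As a quick sanity check (not part of the original pipeline), we can decode the one-hot target of the first molecule back into odor names with ind2word and compare it with the raw SENTENCE; the order may differ because CountVectorizer sorts its features alphabetically.
# Decode the first one-hot row back into odor labels
decoded = [ind2word[int(idx)] for idx in np.nonzero(train_y[0])[0]]
print(decoded)
# Should contain the same labels as the raw sentence (possibly in a different order)
print(train_df.SENTENCE[0])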
iv) Model Building
For this part we will build a simple multinomial logistic regression model with Keras. Since we have 109 unique labels, the model outputs a probability for each label, with the probabilities summing to 1: $\sum_{i=1}^{109} p(i) = 1$. Note that each molecule can carry several labels, so this softmax formulation is only a simple baseline for the multi-label problem.
# Print input and target shapes of the training data
print(train_x.shape)
print(train_y.shape)
# Number of unique labels
N_OUTPUT = 109
# Number of epochs and batch size
N_EPOCHS = 50
BATCH_SIZE = 32
# Define logistic regression model
model = Sequential()
model.add(Dense(N_OUTPUT, input_dim=256, activation='softmax'))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
# Start training
model.fit(train_x, train_y, validation_split=0.1, batch_size=BATCH_SIZE, epochs= N_EPOCHS, shuffle=True, verbose=1)
v) Model Evaluation
Finally, we will evaluate the model and create a submission file with the following steps:
- Predict on the test set
- For each sample, filter the top 15 predictions (labels) and sort them by their probability
- Use the top 15 labels to generate 5 sets of 3 labels for each input sample
- Predict on the training set and compute the average Jaccard Index / Tanimoto Similarity Score for the top-5 proposed sentences
# Predict on the test set
preds = model.predict(test_x)
# Choose the top 15 predictions for each sample and group by 3
preds_clean = []
for i in range(preds.shape[0]):
    # Take the 15 highest-probability labels, sorted by descending probability
    labels = [ind2word[idx] for idx in list(preds[i, :].argsort()[-15:][::-1])]
    labels_seq = []
    for j in range(0, 15, 3):
        labels_seq.append(",".join(labels[j:(j+3)]))
    preds_clean.append(";".join(labels_seq))
# Top-5 proposed sentences for the first test sample
preds_clean[0]
Note: Since the model predicts a probability for each individual label rather than for whole label combinations, the proposed sentences are simply built from the 15 distinct top-ranked labels. Hence, each label occurs at most once across the top-5 proposed sentences of a sample.
# Compute the jaccard index
def jaccard_index(list1, list2):
s1 = set(list1)
s2 = set(list2)
return len(s1.intersection(s2)) / len(s1.union(s2))
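As a small worked example (with made-up labels): the two lists below share one label ('fruity') out of four unique labels in total, so the Jaccard index is 1/4.
# One shared label out of four unique labels -> 1/4
print(jaccard_index(['woody', 'fruity'], ['fruity', 'citrus', 'resinous']))  # 0.25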
# Compute the jaccard index / TSS of the training set predictions
train_labels = [label.split(',') for label in train_df.SENTENCE.tolist()]
train_preds = model.predict(train_x)
jaccard_scores = []
for i in range(train_preds.shape[0]):
    _preds = [ind2word[idx] for idx in list(train_preds[i, :].argsort()[-15:][::-1])]
    _true = train_labels[i]
    # Compute the Jaccard index of each prediction set (e.g. ['woody', 'fruity', 'resinous'])
    scores = [jaccard_index(_true, _preds[j:(j+3)]) for j in range(0, 15, 3)]
    jaccard_scores.append(max(scores))
score = sum(jaccard_scores) / len(jaccard_scores)
print("Top-5 TSS: {}".format(round(score, 3)))
# Create the submission.csv file
test_df["PREDICTIONS"] = preds_clean
test_df.to_csv("submission.csv",index=False)
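As a final check, you can reload the generated file and preview the first few rows to confirm that the PREDICTIONS column is formatted as expected.
# Preview the submission file
pd.read_csv("submission.csv").head()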