Tutorial 2. Training the Machine Learning Model

Learning Objectives

  • Use Scikit-learn for simple classification tasks
  • Use RDKit for molecular representations, such as Morgan fingerprints
  • Use the ROC curve to evaluate model performance

Definition of Classification vs. Regression Problems

Classification and regression are two fundamental types of supervised machine learning tasks. Classification involves predicting a discrete label or category. For example, identifying if an email is “spam” or “not spam” or determining whether a tumor is “benign” or “malignant.” Regression, on the other hand, predicts continuous values, such as forecasting house prices or estimating a person’s weight based on their height.
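
To make the distinction concrete, here is a minimal sketch (with made-up numbers, not data from this tutorial) that fits a classifier and a regressor on the same one-feature input:


from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[1.0], [2.0], [3.0], [4.0]]   # one feature per sample

# Classification: discrete labels (e.g., 0 = benign, 1 = malignant)
y_class = [0, 0, 1, 1]
clf_demo = LogisticRegression().fit(X, y_class)
print(clf_demo.predict([[2.5]]))   # -> a discrete label, 0 or 1

# Regression: continuous targets (e.g., boiling points)
y_reg = [10.2, 19.8, 30.1, 39.7]
reg_demo = LinearRegression().fit(X, y_reg)
print(reg_demo.predict([[2.5]]))   # -> a continuous value, about 25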

Examples of Classification vs. Regression Tasks

Classification example: determining whether a molecule is cancerous or non-cancerous. Regression example: predicting the boiling point of a chemical compound based on its structure.

This Is a Classification Task

In our case, distinguishing between cancerous and non-cancerous PAH molecules is a classification task: the model learns to assign a molecule to one of these two categories based on its features.

Commonly Used Classification Algorithms

Logistic Regression: This algorithm models the probability that an input belongs to a particular category. It uses a logistic function to squeeze predicted values between 0 and 1, making it effective for binary classification problems. To learn more, visit Logistic Regression - Scikit-Learn

Random Forest: This is an ensemble method that builds multiple decision trees during training and combines their outputs for improved accuracy and robustness. It reduces overfitting and handles large datasets efficiently. More details can be found here: Random Forest - Scikit-Learn

Support Vector Machine (SVM): SVMs classify data by finding the optimal hyperplane that separates different classes with the largest margin. They are particularly effective in high-dimensional spaces. Learn more at Support Vector Machines - Scikit-Learn

These algorithms are widely used in classification tasks and can be explored further through these resources; a minimal comparison of all three appears in the sketch below.
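
As a rough comparison, the sketch below fits all three classifiers on a small synthetic dataset; make_classification and the hyperparameters shown here are illustrative choices, not the settings used later in this tutorial.


from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0),
              SVC(kernel='rbf')):
    model.fit(X_tr, y_tr)
    # .score() reports test-set accuracy for classifiers
    print(type(model).__name__, model.score(X_te, y_te))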

How to Train a Machine Learning Model

This tutorial builds on the previous one, where we curated molecular data and prepared it for machine learning (ML) model training. Now, with the data processed and ready, we can use ML techniques to classify molecular structures and predict their properties.

Training an ML model requires dividing your data into two essential subsets: the training set and the test set. The training set is used to teach the model, while the test set evaluates how well it generalizes to unseen data. A common and effective split ratio is 80% for training and 20% for testing. For classification tasks, metrics such as accuracy, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) offer insight into the model’s performance, with the ROC curve providing a deeper look at the trade-off between sensitivity and specificity. Comparing training-set performance to test-set performance reveals whether the model is overfitting or underfitting.
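
A minimal sketch of this workflow, using a synthetic placeholder dataset rather than the PAH data handled below, might look like this:


from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X_demo, y_demo = make_classification(n_samples=100, random_state=0)  # placeholder data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)  # 80% train, 20% test

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# TP/FP/TN/FN come straight from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
print("accuracy:", accuracy_score(y_te, y_pred))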

Code:

 

Set up


import os
import tempfile
os.environ['MPLCONFIGDIR'] = tempfile.mkdtemp()  # give matplotlib a writable config dir

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc

import pubchempy as pcp            # PubChem lookups by compound name
from rdkit import Chem             # SMILES parsing
from rdkit.Chem import AllChem     # Morgan fingerprint generation



fs = 10 # font size
fs_label = 10 # tick label size
fs_lgd = 10 # legend font size
ss = 20 # symbol size
ts = 3 # tick size
slw = 1 # symbol line width
framelw = 1 # line width of frame
lw = 2 # line width of the bar box
rc('axes', linewidth=framelw)
plt.rcParams.update({
    "text.usetex": False,
    "font.weight":"bold",
    "axes.labelweight":"bold",
    "font.size":fs,
    'pdf.fonttype':'truetype'
})
plt.rcParams['mathtext.fontset']='stix'
  • These code blocks set up the basic infrastructure for the tutorial, importing the necessary programming libraries and defining font and line-width parameters for the plots.

 
 

Pre-processing the data set

 


datapath = 'PAH/' # path to your data folder
filein_test = os.path.join(datapath,'testset_0.ds')   # space-separated file listing the test molecules (example data)
filein_train = os.path.join(datapath,'trainset_0.ds') # space-separated file listing the training molecules
# DataFrames holding each molecule's name and its class label
df_test = pd.read_csv(filein_test, sep=" ", header=None, names=['molecule', 'cancerous'])
df_train = pd.read_csv(filein_train, sep=" ", header=None, names=['molecule', 'cancerous'])
  • This code block reads in the data (cancerous and non-cancerous molecules) listed in the specified data folder.


df_test
  • This line displays the df_test DataFrame, showing the data loaded by the previous code block in a table, as seen below:

[Output: the df_test table of molecule names and cancerous labels]

  • This data is from the test set
  • The testing set consists of data that is not used during training. It acts as a reference to evaluate the model’s performance. These molecules have known cancerous or non-cancerous properties, allowing you to measure how accurately the model can make predictions based on the patterns it learned. By keeping this data separate, we prevent the model from simply memorizing the training data, ensuring it can generalize to new, unseen examples.

       


df_train

[Output: the df_train table of molecule names and cancerous labels]

  • This data is from the training set
  • The training set contains the data that the model uses to learn patterns and relationships. In this case, these are molecules whose cancerous or non-cancerous labels are known. The model analyzes their structures and attempts to find patterns that correlate with being cancerous or not. The goal is to expose the model to diverse data so it can generalize and make predictions about unseen molecules.

 
 


def getSMILES(df):
    mols = df['molecule'].values
    smiles_list = []
    for mol in mols:
        # Strip the ".ct" suffix and search PubChem by compound name
        results = pcp.get_compounds(mol[:-3], 'name')
        smiles = ""
        if len(results) > 0:
            # Take the SMILES string of the first matching compound
            smiles = results[0].isomeric_smiles
            smiles_list.append(smiles)
            print(mol[:-3], smiles)
        else:
            smiles_list.append(smiles)  # keep an empty placeholder so rows stay aligned
            print(mol[:-3], 'molecule not found in PubChem')
    df['SMILES'] = smiles_list

# Fetch the 3D record for naphthalene from PubChem
mymol = pcp.get_compounds('naphthalene', 'name', record_type='3d')[0]

# Convert the compound to a dictionary containing its atom list
mydict = mymol.to_dict(properties=['atoms'])

mydict['atoms']
  • This code block uses the external library PubChemPy, a tool for accessing PubChem, the world’s largest collection of freely accessible chemical information.
  • Each molecule’s name is searched in the PubChem database, and the SMILES string recorded for its structure is retrieved.
  • The data is converted into a dictionary with the to_dict function, giving the molecule (in this case, naphthalene) as a list of its atoms and their positions, shown in part below:

[Output (partial): naphthalene’s atoms with their elements and 3D coordinates]

       


getSMILES(df_train)
  • This function converts all the molecules in the training set into SMILES strings.

[Output: training-set molecule names with their retrieved SMILES strings]

 

 


getSMILES(df_test)
  • This function converts all the molecules in the testing set into SMILES strings.

[Output: test-set molecule names with their retrieved SMILES strings]

 

 
 


fpgen = AllChem.GetMorganGenerator(radius=2)
mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C")  # caffeine, as an example
fp = fpgen.GetFingerprintAsNumPy(mol)

# Print each bit of the fingerprint
for i in fp:
    print(i)
  • This code block uses AllChem’s Morgan fingerprint generator on a molecule parsed from a SMILES string (caffeine in this example), representing the molecule as a fixed-length array of 0s and 1s.
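
A quick way to inspect the fingerprint fp produced above is to check its shape, dtype, and how many bits are set (the exact numbers depend on the generator’s defaults):


import numpy as np

print(fp.shape)              # fingerprint length (2048 bits by default)
print(fp.dtype)              # a small integer dtype holding 0s and 1s
print(np.count_nonzero(fp))  # number of substructure bits that are "on"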
     


def getData(df):
    fpgen = AllChem.GetMorganGenerator(radius=2)
    # Feature matrix: one Morgan fingerprint per molecule
    MFP_list = []
    for smiles in df['SMILES'].values:
        mol = Chem.MolFromSmiles(smiles)
        MFP = fpgen.GetFingerprintAsNumPy(mol)
        MFP_list.append(MFP)
    X = np.array(MFP_list)

    # Label vector: map the 1 / -1 class labels to binary 1 / 0
    y_list = []
    for y in df['cancerous']:
        if y == 1:
            y_list.append(1)
        else:
            y_list.append(0)
    y = np.array(y_list)
    return X, y

X_test,y_test = getData(df_test)

print(X_test)
X_train,y_train = getData(df_train)
  • This code block converts the SMILES strings from each set into an array of Morgan fingerprints. It also converts the cancerous / non-cancerous labels into a binary format.
  • The output can be seen below:

[Output: the test-set fingerprint array X_test]

 
 

 

Train an SVM and get the ROC curve

 


from sklearn import svm
# Train a support vector classifier with an RBF kernel
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
  • This code block imports the SVM module from scikit-learn and trains an SVM classifier (with an RBF kernel) on the training data X_train and y_train.


clf.predict(X_test)
  • This call uses the trained SVM to predict the cancerous labels of the molecules in X_test.

[Output: the SVM’s predicted labels for the test set]

 
 


y_test
  • y_test is the actual set of cancerous labels that apply to the molecules in the test set

[Output: the true labels y_test]

  • A comparison between the predicted labels and the actual labels shows that the SVM is not fully accurate    
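
One way to quantify that mismatch, not shown in the original output, is to score the predictions directly:


from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class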


from sklearn.metrics import RocCurveDisplay
svc_disp = RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.show()
  • This code block plots the ROC (Receiver Operating Characteristic) curve of the previously trained classifier. Evaluating a machine learning model is essential to ensure it performs well on unseen data. Metrics like accuracy are useful but may not tell the full story; the ROC curve and its AUC (Area Under the Curve) provide deeper insight, showing the trade-off between the true positive and false positive rates. A significant gap between training and test set performance often signals overfitting, where the model memorizes training data instead of generalizing patterns. To address this, techniques like cross-validation, regularization, or adding more diverse training data can help improve the model’s accuracy and reliability for computational chemistry applications.

[Figure: ROC curve for the SVM classifier, with the AUC shown in the legend]

  • The true positive rate and false positive rate are plotted along the two axes, with the area under the curve indicating the classifier’s accuracy.
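
The AUC reported in the plot’s legend can also be computed directly; for an SVC, the decision_function scores take the place of predicted probabilities:


from sklearn.metrics import roc_auc_score

scores = clf.decision_function(X_test)
print("AUC:", roc_auc_score(y_test, scores))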

 
 

Train a random forest and get the ROC curve

 
 


from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(X_train, y_train)
# Plot the random forest's ROC curve and overlay the SVM's curve for comparison
ax = plt.gca()
rfc_disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test, ax=ax, alpha=0.8)
svc_disp.plot(ax=ax, alpha=0.8)
plt.show()

 
 


from sklearn.linear_model import LogisticRegression

clf_lg = LogisticRegression(random_state=0).fit(X_train, y_train)
clf_lg.predict(X_test)

[Output: the logistic regression model’s predicted labels for the test set]

  • This shows the accuracy of the model when the Random Forest Classifier and Logistic Regression are used

 
 

 
 

Upsampling

  • Upsampling in machine learning is a way to balance datasets by increasing the number of samples in a smaller class. This is done by duplicating data, slightly modifying existing examples, or creating new ones. It helps the model treat all classes fairly and avoid bias toward the larger class.  


from sklearn.utils import resample
# Split the training set by class label
df_0 = df_train[df_train['cancerous'] == -1]
df_1 = df_train[df_train['cancerous'] == 1]

len(df_0), len(df_1)
# Draw 50 samples (with replacement) from the df_0 class to balance the two classes
df_0_upsampled = resample(df_0, random_state=42, n_samples=50, replace=True)
len(df_0_upsampled)
df_0_upsampled

 
 

[Output: class counts before upsampling, and the upsampled df_0 DataFrame]
 


df_upsampled = pd.concat([df_0_upsampled, df_1])

X_train_up, y_train_up = getData(df_upsampled)

clf_lg = LogisticRegression(random_state=0).fit(X_train_up, y_train_up)
clf_lg.predict(X_test)
# Plot the ROC curve for the model retrained on the balanced data
RocCurveDisplay.from_estimator(clf_lg, X_test, y_test)
plt.show()
  • This code trains Logistic Regression on the upsampled data and plots its ROC curve to show its accuracy, as seen below:

[Figure: ROC curve for logistic regression trained on the upsampled data]

 

Written on August 14, 2024