Tutorial 2. Training the Machine Learning Model
Learning Objectives
- Use Scikit-learn for simple classification tasks
- Use RDKit for molecular representations, such as Morgan Fingerprint
- Use the ROC curve to evaluate model performance
Definition of Classification vs. Regression Problems
Classification and regression are two fundamental types of supervised machine learning tasks. Classification involves predicting a discrete label or category. For example, identifying if an email is “spam” or “not spam” or determining whether a tumor is “benign” or “malignant.” Regression, on the other hand, predicts continuous values, such as forecasting house prices or estimating a person’s weight based on their height.
Examples of Classification vs. Regression Tasks
Classification example: determining whether a molecule is cancerous or non-cancerous. Regression example: predicting the boiling point of a chemical compound based on its structure.
This Is a Classification Task
In our case, distinguishing between cancerous and non-cancerous PAH molecules is a classification task. The model learns to assign a molecule to one of these two categories based on its features.
Commonly Used Classification Algorithms
Logistic Regression: This algorithm models the probability that an input belongs to a particular category. It uses a logistic function to squeeze predicted values between 0 and 1, making it effective for binary classification problems. To learn more, visit Logistic Regression - Scikit-Learn
Random Forest: This is an ensemble method that builds multiple decision trees during training and combines their outputs for improved accuracy and robustness. It reduces overfitting and handles large datasets efficiently. More details can be found here: Random Forest - Scikit-Learn
Support Vector Machine (SVM): SVMs classify data by finding the optimal hyperplane that separates different classes with the largest margin. They are particularly effective in high-dimensional spaces. Learn more at Support Vector Machines - Scikit-Learn
These algorithms are widely used in classification tasks and can be explored further through these resources.
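As a minimal sketch, the three algorithms above map onto scikit-learn classes as shown below; the toy data here is purely illustrative, not the tutorial's PAH dataset.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Tiny illustrative dataset: four 2-feature samples with binary labels.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

# Each classifier shares the same fit/predict interface.
for clf in (LogisticRegression(), RandomForestClassifier(n_estimators=100), SVC()):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict([[1, 1]]))
```

Because all three share the same `fit`/`predict` API, they can be swapped in and out of the same training pipeline with minimal code changes.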
How to Train a Machine Learning Model
Training a machine learning (ML) model requires dividing your data into two essential subsets: the training set and the test set. The training set is used to teach the model, while the test set evaluates its generalization to unseen data. A common and effective split ratio is 80% for training and 20% for testing. For classification tasks, metrics such as accuracy, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) offer insight into the model's performance, with the ROC curve providing a deeper understanding of the trade-off between sensitivity and specificity. Comparing training-set performance to test-set performance reveals whether the model is overfitting or underfitting.
This tutorial builds on the previous one, where we curated molecular data and prepared it for ML model training. Now, with the data processed and ready, we can use ML techniques to classify molecular structures and predict their properties.
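The 80/20 split described above can be sketched with scikit-learn's `train_test_split`; the arrays here are placeholders, not the tutorial's molecular data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features each
y = np.array([0, 1] * 25)           # binary labels

# test_size=0.2 reserves 20% of the samples for evaluation;
# stratify=y keeps the class balance the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_train), len(X_test))    # 40 10
```

Fixing `random_state` makes the split reproducible, which matters when comparing models trained on the same data.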
Code:
Set up
- These code blocks set up the basic infrastructure for the model, including importing the necessary programming libraries and defining parameters for the data tables.
Pre-processing the data set
- This code block processes the data (different cancerous and non-cancerous molecules) listed in a specified data folder.
- This code block runs the function df_test, outputting data from a folder into a table with parameters defined by the previous code blocks, as shown below:
- This data is from the test set
- The testing set consists of data that is not used during training. It acts as a reference to evaluate the model's performance. These molecules have known cancerous or non-cancerous properties, allowing you to measure how accurately the model can make predictions based on the patterns it learned. By keeping this data separate, we prevent the model from simply memorizing the training data, ensuring it can generalize to new, unseen examples.
- This data is from the training set
- The training set contains data that the model uses to learn patterns and relationships. In this case, these are molecules with known cancerous or non-cancerous labels. The model analyzes their structures and attempts to find patterns that correlate with being cancerous or not. The goal is to expose the model to diverse data so it can generalize and make predictions about unseen molecules.
- This code block uses the external library PubChemPy, a tool to access PubChem, the world’s largest collection of freely accessible chemical information.
- The name of each molecule will be found in the PubChem database, and a SMILES string will then be generated based on its molecular structure.
- The data is converted into a dictionary using the to_dict function, producing output for the molecule (in this case, naphthalene) as a list of its elements and their positions; a partial representation is shown below:
- This function converts all the molecules in the training set into SMILES strings.
- This function converts all the molecules in the testing set into SMILES strings.
- This code block utilizes AllChem's Morgan fingerprint generator to take the structural data encoded in the SMILES strings, representing each molecule as an array of integers.
- This code block converts the SMILES strings from the test set into arrays representing their Morgan fingerprints. It also converts the cancerous or non-cancerous labels into a binary format.
- The output can be seen below:
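The fingerprinting and label-encoding steps above can be sketched as follows, assuming RDKit is installed; the SMILES strings and labels below are illustrative examples, not the tutorial's dataset.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles_list = ["c1ccc2ccccc2c1", "c1ccccc1"]   # naphthalene, benzene
labels = ["cancerous", "non-cancerous"]        # illustrative labels only

X = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    # Morgan fingerprint with radius 2, folded into 2048 bits.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    X.append(np.array(list(fp)))               # one 2048-element 0/1 array per molecule

# Map the text labels to binary form: 1 = cancerous, 0 = non-cancerous.
y = np.array([1 if lab == "cancerous" else 0 for lab in labels])
print(np.array(X).shape, y)                    # (2, 2048) [1 0]
```

Each bit in the fingerprint marks the presence of a circular substructure around an atom, which is what lets a numeric classifier work with molecular structure.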
Train an SVM and get the ROC curve
- This code block imports the SVM module from scikit-learn, which is then used to train the SVM classifier on the training data X_train and y_train.
- This function uses the SVM to predict the cancerous labels of the molecules in X_test.
- y_test is the actual set of cancerous labels for the molecules in the test set.
- A comparison between the predicted labels and the actual labels shows that the SVM is not fully accurate.
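The train-predict-compare steps above can be sketched as follows; synthetic data stands in for the fingerprint arrays X_train and X_test.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data in place of the molecular fingerprints.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = svm.SVC()
clf.fit(X_train, y_train)            # train on the training set
y_pred = clf.predict(X_test)         # predict labels for the test set

# Comparing predictions with the true labels shows the model is not perfect.
print("accuracy:", accuracy_score(y_test, y_pred))
```

`accuracy_score` simply counts the fraction of test molecules whose predicted label matches the true label.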
- This code block plots the ROC (receiver operating characteristic) curve of the previously trained classifier. Evaluating a machine learning model is essential to ensure it performs well on unseen data. Metrics like accuracy are useful but may not tell the full story. The ROC curve and its AUC (area under the curve) provide deeper insight, showing the trade-off between true positive and false positive rates. A significant gap between training and test set performance often signals overfitting, where the model memorizes training data instead of generalizing patterns. To address this, techniques like cross-validation, regularization, or adding more diverse training data can help improve the model's accuracy and reliability for computational chemistry applications.
- The true positive rate and false positive rate are plotted along the two axes, with the area under the curve summarizing the model's accuracy.
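A minimal ROC-plotting sketch with scikit-learn and Matplotlib is shown below; the data and classifier are synthetic placeholders for the trained SVM.

```python
import matplotlib
matplotlib.use("Agg")                     # headless backend for scripts
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

clf = SVC().fit(X_train, y_train)
scores = clf.decision_function(X_test)    # continuous scores, not hard labels

fpr, tpr, _ = roc_curve(y_test, scores)   # false/true positive rates
roc_auc = auc(fpr, tpr)                   # area under the curve

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # diagonal = random guessing
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc.png")
```

`roc_curve` needs continuous scores rather than 0/1 predictions, which is why `decision_function` (or `predict_proba`) is used instead of `predict`.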
Train a random forest and get the ROC curve
- This shows the accuracy of the model when a Random Forest classifier and Logistic Regression are used.
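Comparing the two classifiers by ROC AUC can be sketched as below; synthetic data again stands in for the fingerprint features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Fit both models on the same split and score each by ROC AUC.
results = {}
for clf in (RandomForestClassifier(random_state=2),
            LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
    results[type(clf).__name__] = roc_auc_score(y_test, probs)

print(results)
```

Evaluating both models on an identical train/test split keeps the comparison fair: any AUC difference reflects the models, not the data.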
Upsampling
- Upsampling in machine learning is a way to balance datasets by increasing the number of samples in a smaller class. This is done by duplicating data, slightly modifying existing examples, or creating new ones. It helps the model treat all classes fairly and avoid bias toward the larger class.
- This code will use Logistic Regression on the Upsampled data, plotting it on an ROC curve to show its accuracy as seen below:
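The upsample-then-fit procedure can be sketched with scikit-learn's resample utility; the imbalanced data below is synthetic (90 majority vs. 10 minority samples).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, size=(90, 5))   # 90 majority-class samples
X_minor = rng.normal(1, 1, size=(10, 5))   # 10 minority-class samples

# Duplicate minority samples (drawing with replacement) until the classes balance.
X_minor_up = resample(X_minor, replace=True,
                      n_samples=len(X_major), random_state=0)

X = np.vstack([X_major, X_minor_up])
y = np.array([0] * len(X_major) + [1] * len(X_minor_up))

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("class counts:", np.bincount(y))     # class counts: [90 90]
```

One caveat worth noting: upsampling should be applied only to the training set, never before the train/test split, or duplicated samples leak into the test set and inflate the measured accuracy.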