Tutorial 1. Configuring the Data Set

Learning Objectives

Understand the basics of Chemoinformatics
Interpret the raw dataset
Use APIs like PubChem and RDKit to transform data

Background

Identifying cancerous molecules is crucial in scientific research, especially when these molecules are used in various studies. Polycyclic aromatic hydrocarbons (PAHs) are molecules formed by multiple carbon rings. While some PAH molecules are carcinogenic, others are not, making it essential to distinguish between them in research settings.

Chemoinformatics is the scientific field that studies or predicts a molecule’s properties through informational techniques. By applying the similarity principle—a fundamental concept in chemistry stating that molecules with similar structures exhibit similar properties—we can develop a machine learning model to classify PAH molecules. This model will compare the structures of unknown PAHs with those of molecules that have already been classified, predicting their carcinogenic potential.

For more on Chemoinformatics, this paper is a useful resource: Cheminformatics Paper.

Additionally, we will use external libraries such as PubChemPy and RDKit. These tools will first represent molecular structures as SMILES (Simplified Molecular Input Line Entry System) strings. The SMILES strings will then be converted into Morgan fingerprints, a binary array format that represents a molecule’s structure. This conversion is essential, as molecular names stored as strings are insufficient for computational processing.

Setting up the Data set

For our purposes, we will be using a this set of PAH molecules

PAH.tar.gz

In the file there should be 82 different chemical compounds, as seen below:

There will be 10 chemical compounds within the test set, as seen below:

Regression in machine learning is a technique used to model and understand the relationship between input features (x) and a continuous target output (y). The goal is to find a function that best maps inputs to outputs, minimizing the difference between the predicted and actual values. In our case, regression helps predict a molecule’s carcinogenicity score based on its structural features. By learning these patterns from the data, the model can make accurate and continuous predictions for new chemical compounds.

We cannot create a machine learning (ML) model using only the names of chemical compounds; we need a different representation of the chemicals that the model can actually read and interpret.

Using one of our APIs (Application Programming Interfaces), PubChem, we can access the structural information of each molecule from just its name. For example:

When searching up 1,2-Dimethylbenzo[a]pyrene in the PubChem website, all of its strucural data will be given, including a 2D depiction like this one.

PubChem also provides a “SMILES String” for each molecule. A SMILES (Simplified Molecular Input Line Entry System) string is a simple notation for describing the structure of chemical species using short ASCII strings, like so:

We can directly convert a set of molecule names into a set of SMILE strings using code by defining a function to do so, as shown in the code below:

def getSMILES(df):
    mols=df['molecule'].values
    smiles_list = []
    for mol in mols:
        # Get rid of the ".ct" suffix
        # Search Pubchem by the compound name
        try:
            results = pcp.get_compounds(mol[:-3], 'name')
            smiles = ""
            if len(results) > 0:
                # Get the SMILES string of the compound
                smiles = results[0].isomeric_smiles
                smiles_list.append(smiles)
                print(mol[:-3],smiles)
            else:
                smiles_list.append(smiles)
                print(mol[:-3],'molecule not found in PubChem')
        except:
            print(f"{mol} can not be extracted")
            smiles_list.append('nan')
    df['SMILES'] = smiles_list

Some SMILES strings extracted using this code are shown below:

Morgan fingerprints represent chemical molecules as binary vectors by capturing local structural features. The algorithm starts by assigning identifiers to each atom and iteratively updates them based on neighboring atoms within a specified radius. These substructure identifiers are then hashed into a binary vector, encoding the molecule’s unique features. For a more detailed description of Morgan fingerprints in chemoinformatics, you can read this article on Morgan Fingerprints.

Morgan Finger prints

An example of a Morgan finger print is as shown below:

Usig the Rdkit get Morgan finger print function, you can turn a SMILES string into a morgan finger print, as shown below:

fpgen = AllChem.GetMorganGenerator(radius=2)
mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C")
fp = fpgen.GetFingerprintAsNumPy(mol)

There are also other fingerprints or different ways to represent a molecule. For example, MACCS Keys are keys that typically cointain 166 bits, each bit representing whether a particular molecular substructure or feature is present. As seen below, each bit of the key can be assigned to a particular structure or grouping, with a 1 or a zero representing whether they exist or not in the molecule.

Written on August 14, 2024

Machine Learning Tutorial

Classification Machine Learning for Computational Chemistry

Tutorial 1. Configuring the Data Set

Learning Objectives

Background

Setting up the Data set