Machine Learning Methods for LogP Prediction: Pt. 1

The octanol-water partition coefficient, or logP, is one of the most important properties for determining a compound’s suitability as a drug. Currently, most of the available regression models for in silico logP prediction are trained on the PHYSPROP database of experimental logP values. However, most of the compounds in this database are not highly representative of drug-like chemical space. Unfortunately, there is currently a lack of publicly available experimental logP datasets of drug-like compounds that could be used to train better prediction tools.

In this small test, I have decided to use the experimental logP data released in the paper “Large, chemically diverse dataset of logP measurements for benchmarking studies” by Martel et al. [1]. As this is a preliminary study, we are interested in finding out which featurization methods work best for predicting logP.

Most of the popular tools for logP prediction are based either on physical descriptors, such as atom type counts or polar surface area, or on topological descriptors. Here, we will calculate different physical descriptors as well as structural fingerprints for the molecules, and benchmark their performance using three different regression models: a neural network, a random forest, and support vector machines.

We first import some libraries, including RDKit and scikit-learn tools (the utility script contains custom functions for generating the TPATF and TPAPF fingerprints):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Descriptors

from utility import FeatureGenerator

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

The utility script can be found in this Gist.

Reading experimental logP data

The supplementary PDF file from the Martel et al. paper was converted to CSV text format using the Linux pdftotext utility from the Poppler library, as sketched below. The experimental data is then read as a CSV file, and the SMILES strings are converted to RDKit molecules.
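
That conversion step might have looked roughly like the following sketch, run from Python via subprocess (the file names here are hypothetical):

import subprocess

# pdftotext ships with the Poppler utilities; -layout preserves the
# table structure of the supplementary PDF (file names are hypothetical)
subprocess.run(["pdftotext", "-layout", "martel_supplementary.pdf",
                "martel_supplementary.txt"], check=True)
# the text output was then cleaned up into training_data/logp_759_data.csv

With the CSV in hand, we load it and keep only the validated entries: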

data = pd.read_csv("training_data/logp_759_data.csv")
data_logp = data[data.Status == "Validated"].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
print("Shape:", data_logp.shape)
data_logp.head()

Shape: (707, 7)

ID ZINC (2010) Status Supplier SMILES logPexp pH_of_analysis
0 1 ZINC00036522 Validated Specs Cc1cc2c(cc1C)NC(=O)C[C@H]2c3ccccc3OC 4.17 5.0
1 3 ZINC00185379 Validated ChemBridge COc1ccc2c(c1)O[C@@](CC2=O)(C(F)(F)F)O 2.79 5.0
2 4 ZINC12402487 Validated ChemBridge CC1(O[C@H]([C@H](O1)C(=O)N)C(=O)N)C(C)(C)C 1.60 6.5
3 5 ZINC00055459 Validated Specs CCOc1cc(cc(c1OCC)OCC)c2nnc(o2)c3ccco3 3.96 10.5
4 6 ZINC00056871 Validated Enamine CN(C)c1ccc(cc1)C(=C)c2ccc(cc2)N(C)C 5.30 7.3

Convert SMILES to 2D molecules:

molecules = data_logp.SMILES.apply(Chem.MolFromSmiles)
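
Chem.MolFromSmiles returns None for any SMILES it cannot parse, so a quick sanity check is worthwhile before computing descriptors (a minimal sketch; all entries in this curated set are expected to parse):

# MolFromSmiles yields None for unparseable SMILES strings
print("Unparseable SMILES:", molecules.isna().sum())
assert molecules.notna().all(), "fix or drop unparseable SMILES before continuing"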

Next, we use RDKit to calculate some physical descriptors:

data_logp.loc[:, 'MolLogP'] = molecules.apply(Descriptors.MolLogP)
data_logp.loc[:, 'HeavyAtomCount'] = molecules.apply(Descriptors.HeavyAtomCount)
data_logp.loc[:, 'HAccept'] = molecules.apply(Descriptors.NumHAcceptors)
data_logp.loc[:, 'Heteroatoms'] = molecules.apply(Descriptors.NumHeteroatoms)
data_logp.loc[:, 'HDonor'] = molecules.apply(Descriptors.NumHDonors)
data_logp.loc[:, 'MolWt'] = molecules.apply(Descriptors.MolWt)
data_logp.loc[:, 'RotableBonds'] = molecules.apply(Descriptors.NumRotatableBonds)
data_logp.loc[:, 'RingCount'] = molecules.apply(Descriptors.RingCount)
data_logp.loc[:, 'Ipc'] = molecules.apply(Descriptors.Ipc)
data_logp.loc[:, 'HallKierAlpha'] = molecules.apply(Descriptors.HallKierAlpha)
data_logp.loc[:, 'NumValenceElectrons'] = molecules.apply(Descriptors.NumValenceElectrons)
data_logp.loc[:, 'SaturatedRings'] = molecules.apply(Descriptors.NumSaturatedRings)
data_logp.loc[:, 'AliphaticRings'] = molecules.apply(Descriptors.NumAliphaticRings)
data_logp.loc[:, 'AromaticRings'] = molecules.apply(Descriptors.NumAromaticRings)
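
As an aside, the repetitive block above can be written more compactly as a loop over a name-to-function mapping; this is purely a stylistic alternative and produces the same columns:

# Mapping of column names to RDKit descriptor functions (abbreviated)
descriptor_fns = {"MolLogP": Descriptors.MolLogP,
                  "HeavyAtomCount": Descriptors.HeavyAtomCount,
                  "HAccept": Descriptors.NumHAcceptors,
                  # ... the remaining descriptors follow the same pattern
                  }
for name, fn in descriptor_fns.items():
    data_logp.loc[:, name] = molecules.apply(fn)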

As a baseline, we evaluate the performance of RDKit’s MolLogP predictions against the experimental logP values.

r2 = r2_score(data_logp.logPexp, data_logp.MolLogP)
mse = mean_squared_error(data_logp.logPexp, data_logp.MolLogP)
plt.scatter(data_logp.logPexp, data_logp.MolLogP,
            label = "MSE: {:.2f}\nR^2: {:.2f}".format(mse, r2))
plt.legend()
plt.show()

[Figure: scatter plot of experimental logP vs. RDKit MolLogP, with MSE and R^2 in the legend]
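
Since mean_absolute_error was imported above but not yet used, we can also report the baseline MAE, which is less dominated by large outliers than the MSE:

# MAE complements the MSE shown in the plot legend
mae = mean_absolute_error(data_logp.logPexp, data_logp.MolLogP)
print("Baseline MAE: {:.2f}".format(mae))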

As we can see above, RDKit’s logP predictions have a relatively high mean squared error and a weak coefficient of determination on this dataset. RDKit’s MolLogP implementation is based on atomic contributions. We will therefore first try to train our own simple logP model using the RDKit physical descriptors that we generated above.

Model with simple descriptors

These are the descriptors that we will use for the model:

X = data_logp.iloc[:, 8:]  # all computed descriptors except MolLogP (column 7)
y = data_logp.logPexp
X.head()
HeavyAtomCount HAccept Heteroatoms HDonor MolWt RotableBonds RingCount Ipc HallKierAlpha NumValenceElectrons SaturatedRings AliphaticRings AromaticRings
0 21 2 3 1 281.355 2 3 69759.740168 -2.29 108 0 1 2
1 18 4 7 1 262.183 1 2 7977.096898 -1.76 98 0 1 1
2 16 4 6 2 230.264 2 1 2165.098769 -1.14 92 1 1 0
3 25 7 7 0 344.367 8 3 819166.201010 -2.96 132 0 0 3
4 20 2 2 0 266.388 4 2 32168.378171 -2.22 104 0 0 2

For the regression, we will use a Random Forest with the default parameters from scikit-learn, and set aside one third of the data for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

models = {"rf": RandomForestRegressor(n_estimators=100, random_state=42)}

scores = {}
for m in models:
    models[m].fit(X_train, y_train)
    scores[m + "_train"] = models[m].score(X_train, y_train, )
    y_pred = models[m].predict(X_test)
    scores[m + "_test"] = r2_score(y_test, y_pred)
    scores[m + "_mse_test"] = mean_squared_error(y_test, y_pred)

The scores of our model are:

scores = pd.Series(scores)
scores

rf_train 0.909276
rf_test 0.451319
rf_mse_test 0.792195
dtype: float64

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
plt.scatter(y_test, y_pred, label = "MSE: {:.2f}\nR^2: {:.2f}".format(mse, r2))
plt.legend()
plt.show()

[Figure: scatter plot of experimental vs. random-forest-predicted logP on the test set, with MSE and R^2 in the legend]

As we can see, using these simple descriptors coupled with scikit-learn’s default random forest already gives us a higher R^2 and a lower MSE than the RDKit logP predictor. This, however, is likely due in part to the difference between our training set and the one used to develop RDKit’s model. It would be interesting to see how much we can improve the performance by tuning the random forest parameters (a quick sketch follows), and then to measure the performance on the PHYSPROP dataset.
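
As a preview of that tuning, a minimal sketch using scikit-learn’s GridSearchCV might look like the following (the parameter grid is an arbitrary starting point, not a recommendation):

from sklearn.model_selection import GridSearchCV

# Hypothetical starting grid; refine based on the results
param_grid = {"n_estimators": [100, 300, 500],
              "max_depth": [None, 10, 20],
              "min_samples_leaf": [1, 3, 5]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    cv=5, scoring="r2", n_jobs=-1)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test R^2: {:.3f}".format(grid.best_estimator_.score(X_test, y_test)))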

Calculating fingerprints

Now that we have seen the performance of the simple molecular descriptors, we would like to assess the performance of some of the most popular molecular fingerprints. Among the many available methods, we will test Morgan fingerprints (ECFP4 and ECFP6), the RDKit fingerprint, and topological pharmacophore fingerprints (TPAPF and TPATF), the scripts for which are available from MayaChemTools.

I created a function for parallelizing pandas’ apply() over the chunks of a Series. This makes the TPATF and TPAPF fingerprint calculations much faster, and it has become one of my most useful snippets of code:

import multiprocessing
from joblib import Parallel, delayed

def applyParallel(df, func):
    """Split a pandas Series into n chunks, one per available CPU
    core, apply the given function to each chunk in parallel, and
    return the concatenated output."""
    n_jobs = multiprocessing.cpu_count()
    groups = np.array_split(df, n_jobs)
    results = Parallel(n_jobs=n_jobs)(delayed(lambda g: g.apply(func))(group)
                                      for group in groups)
    return pd.concat(results)

Calculate fingerprints:

fps = {"ECFP4": molecules.apply(lambda m: AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048)),
       "ECFP6": molecules.apply(lambda m: AllChem.GetMorganFingerprintAsBitVect(m, radius=3, nBits=2048)),
       "RDKFP": molecules.apply(lambda m: AllChem.RDKFingerprint(m, fpSize=2048)),
       "TPATF": applyParallel(data_logp.SMILES, lambda m: FeatureGenerator(m).toTPATF()),
       "TPAPF": applyParallel(data_logp.SMILES, lambda m: FeatureGenerator(m).toTPAPF())}

Comparing fingerprint models

Finally, we apply three different types of regression models to estimate the performance of each of the fingerprints.

y = data_logp.logPexp

models = {"rf": RandomForestRegressor(n_estimators=100, random_state=42),
          "nnet": MLPRegressor(random_state=42),
          "svr": SVR(gamma='auto')}

scores = {}

for f in fps:
    scores[f] = {}
    # Convert fps to 2D numpy array
    X = np.array(fps[f].tolist())
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                        random_state=42)
    for m in models:
        models[m].fit(X_train, y_train)
        #scores[f][m + "_r2_train"] = models[m].score(X_train, y_train)
        y_pred = models[m].predict(X_test)
        scores[f][m + "_r2_test"] = r2_score(y_test, y_pred)
        scores[f][m + "_mse_test"] = mean_squared_error(y_test, y_pred)
scores_df = pd.DataFrame(scores).T
scores_df
nnet_mse_test nnet_r2_test rf_mse_test rf_r2_test svr_mse_test svr_r2_test
ECFP4 1.378013 0.045576 1.216157 0.157679 1.359439 0.058440
ECFP6 1.238698 0.142066 1.182595 0.180924 1.340282 0.071709
RDKFP 1.236841 0.143353 1.068570 0.259899 1.069886 0.258988
TPATF 3.357452 -1.325401 0.704787 0.511858 0.970373 0.327911
TPAPF 1.391893 0.035962 0.829020 0.425813 0.830663 0.424675

Overall, the TPATF fingerprint performed best, even outperforming the simple descriptor model. The default random forest had the best performance of all the regression methods, although it is quite possible that this would change after some optimization of the model parameters.

In future posts, we will further tune models using the simple physical descriptors as well as the TPATF fingerprints, and compare their performance to existing logP predictors on this dataset as well as on the PHYSPROP set. It would also be interesting to observe the effect of consensus scoring with several models; a quick sketch of the idea follows.
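
As a rough illustration, a consensus prediction can be as simple as averaging the test-set predictions of the three regressors fitted above (at this point in the loop they were trained on the last fingerprint set, TPAPF):

# Unweighted average of the predictions of the three fitted regressors
y_consensus = np.mean([models[m].predict(X_test) for m in models], axis=0)
print("Consensus R^2: {:.3f}".format(r2_score(y_test, y_consensus)))
print("Consensus MSE: {:.3f}".format(mean_squared_error(y_test, y_consensus)))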

  1. Martel, S., Gillerat, F., Carosati, E., Maiarelli, D., Tetko, I. V., Mannhold, R., & Carrupt, P.-A. (2013). Large, chemically diverse dataset of logP measurements for benchmarking studies. European Journal of Pharmaceutical Sciences, 48(1-2), 21–29. doi: 10.1016/j.ejps.2012.10.019