Ricardo Avila

Switching Your GTK Theme Based on Time of Day

2020-04-24T00:00:00-06:00

A recent trend in software UI/UX is the introduction of dark and light options for user interfaces. Windows and macOS have both added support for dark window themes in their latest versions, and many iOS and Android apps offer this feature through their application settings. Many websites also support it through the prefers-color-scheme CSS media feature, which tells a website about the user’s theme preference at the OS level. Linux, naturally, has had the ability to customize user interface themes for decades; what is new is the focus on continuously adapting user interfaces.

For example, if you are using the GNOME desktop, you may be aware that the default GNOME wallpaper transitions in color and tone throughout the day, providing brighter colors in the morning and gradually shifting to darker tones towards the evening. Furthermore, most operating systems now support “Night Light”, a feature that gradually shifts a device’s screen from blue tones to red in order to reduce eye strain at night.

A neat feature missing from most desktop environments is the ability to automatically switch GTK themes, in order to provide a darker environment at night, and improve brightness and contrast during the day. The good news is that in Linux it is not difficult to implement this functionality with systemd timers.

Cron Jobs vs Systemd Timers

So, what are systemd timers? Long-time UNIX users will be familiar with cron jobs. Cron is a utility that runs commands on a schedule, typically read from a file such as /etc/crontab.

An example crontab job

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed
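To make the field layout above concrete, here is a minimal Python sketch that splits one system crontab entry into its named fields. The helper name parse_crontab_line and the sample entry (including the script path) are hypothetical, and the sketch ignores special forms such as @daily.

```python
# Hypothetical helper: split a system crontab entry into named fields.
FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week", "user"]


def parse_crontab_line(line):
    """Return a dict with the five time fields, the user name, and the command."""
    # Split at most len(FIELDS) times so the command keeps its own spaces.
    parts = line.split(None, len(FIELDS))
    if len(parts) != len(FIELDS) + 1:
        raise ValueError("expected 5 time fields, a user name, and a command")
    entry = dict(zip(FIELDS, parts))
    entry["command"] = parts[-1]
    return entry


# Illustrative entry: run a made-up script every day at 16:00 as root.
entry = parse_crontab_line("0 16 * * * root /usr/local/bin/switch-theme.sh dark")
print(entry["hour"], entry["user"], entry["command"])
```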

Relevant xkcd

“Take THAT, piece of 1980s-era infrastructure I’ve inexplicably maintained on my systems for 15 years despite never really learning how it works.”

Systemd, on the other hand, is a manager for system processes and services. It has replaced much of the functionality that was previously handled by the UNIX init daemon, and has several nice features that the cron utility lacks:

The big benefit for our purpose is that with cron, if the computer is powered off at the scheduled time, the job simply does not run. A systemd timer with Persistent=true, on the other hand, can run a task that it missed the next time the computer powers on.

Other advantages of systemd timers:

CPU and memory limits
Randomized scheduling
Jobs can be easily started independently of their timers
Jobs are logged in the systemd journal, which makes debugging easier

Creating a Systemd Service

Systemd service files have the extension .service. All user-created systemd units are stored in ~/.config/systemd/user/. Here are the contents of “dark-theme.service”, our systemd service for switching to the Adwaita-dark GTK theme:

[Unit]
Description=Change the GTK theme to dark mode.
After=graphical.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'gsettings set org.gnome.desktop.interface gtk-theme Adwaita-dark'

[Install]
WantedBy=default.target

We will need to create a separate “light-theme.service” file to switch to a light GTK theme.
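Since the two service files differ only in the description and the theme name, one way to avoid copy-paste mistakes is to render both from a single template. The sketch below is just an illustration: render_service is a hypothetical helper, and it assumes the light theme is called Adwaita (substitute whichever light GTK theme you have installed).

```python
# Template mirroring dark-theme.service; only the mode and theme name vary.
SERVICE_TEMPLATE = """[Unit]
Description=Change the GTK theme to {mode} mode.
After=graphical.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'gsettings set org.gnome.desktop.interface gtk-theme {theme}'

[Install]
WantedBy=default.target
"""


def render_service(mode, theme):
    """Return the text of a <mode>-theme.service unit for the given GTK theme."""
    return SERVICE_TEMPLATE.format(mode=mode, theme=theme)


# Assumption: "Adwaita" is the name of the light theme.
light_unit = render_service("light", "Adwaita")
print(light_unit)
```

The printed text can then be saved as ~/.config/systemd/user/light-theme.service.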

Creating a Systemd Timer

Now, to create a timer, we make another file in the same directory with the same rootname plus the extension .timer. In this case, we named our file “dark-theme.timer”. Here are the contents of this file:

[Unit]
Description=Change the GTK theme daily at a given time.

[Timer]
OnCalendar=*-*-* 16:00:00
Persistent=true

[Install]
WantedBy=timers.target

The OnCalendar setting specifies that this particular timer should run every day at 16:00. You can also create timers that run every other day, or on specific days of the week. For more on this, I have found that the best documentation is the Arch Linux Wiki: https://wiki.archlinux.org/index.php/Systemd/Timers.

We will additionally need to create a “light-theme.timer” to change theme in the morning.
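The morning timer only differs in its OnCalendar time, so it can be generated the same way. This is a rough sketch: render_timer is a hypothetical helper, and 07:00 is my own assumption for the morning switch, since the time is a matter of taste.

```python
# Template mirroring dark-theme.timer; only the time of day varies.
TIMER_TEMPLATE = """[Unit]
Description=Change the GTK theme daily at a given time.

[Timer]
OnCalendar=*-*-* {hour:02d}:{minute:02d}:00
Persistent=true

[Install]
WantedBy=timers.target
"""


def render_timer(hour, minute=0):
    """Return the text of a timer unit that fires daily at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return TIMER_TEMPLATE.format(hour=hour, minute=minute)


# Assumption: switch to the light theme at 07:00.
light_timer = render_timer(7)
print(light_timer)
```

Because a timer activates the service with the same root name by default, saving this text as light-theme.timer pairs it with light-theme.service.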

Running Services as a User

When we are done creating the four configuration files (a service and a timer each for dark and light), we need to enable them. It is recommended that we enable them at the user level, since requiring sudo access would be a hassle and a potential security risk.

Enabling the services:

systemctl --user enable dark-theme.service
systemctl --user enable light-theme.service
systemctl --user enable dark-theme.timer
systemctl --user enable light-theme.timer

Now the operating system will automatically change the GTK theme at the specified times. Furthermore, if we wish to manually change the theme, or test that the service works, we may do so using:

systemctl --user start light-theme.service

or:

systemctl --user start dark-theme.service

What can we use this for?

A couple of other applications come to mind:

Daily data backups
Daily data ingestion

Let me know if you come up with other interesting applications!

Web-scraping AbeBooks.com (Reverse Engineering a REST API)

2020-04-10T00:00:00-06:00

Motivation
REST APIs
Exploring Network Packets
Wrapping the API in Python
- Sending POST requests
- Sending a GET request
An Object-Oriented Module
Using the AbeBooks Module

Motivation

I have a large collection of electronic books, which I manage using Calibre. Using Calibre’s “Extract ISBN” plugin, I am able to parse the ISBN identifier from most of my files, which then makes fetching the rest of the metadata very easy. (Below is an example of my library’s metadata.)

Thus, I have access to a very convenient and ever-growing virtual library of books, which I like to use on the go, and for exploratory research. Nevertheless, whenever I find a particularly good book, the thing that I want most, is to own a physical copy.

Enter AbeBooks.com. Next to Amazon, and occasionally eBay, it is my go-to site for buying cheap used textbooks. Given that I have stored the ISBN data for most of my electronic books, I would like to be able to automatically fetch pricing information for any book in my virtual library, perhaps even keeping track of changes in price over time.

However, until now, the main problem stopping me from writing a script to do this was that AbeBooks does not have a publicly available API… or at the very least, none that is explicitly documented.

REST APIs

REST, or Representational State Transfer, is an architectural style for web services, typically built on top of the HTTP protocol, that provides interoperability between systems. It is based on a request/response model: a request may carry a “payload”, normally formatted as HTML, XML, or JSON, and the response can be a link to a resource, a data payload in any of the aforementioned formats, or a confirmation that some data was modified on the server.

HTTP defines several request methods that REST APIs use: GET, HEAD, POST, PUT, PATCH, DELETE, CONNECT, OPTIONS and TRACE. Among these, the two most common are GET and POST:

GET

Used to request data from a server.
Parameter data is stored in the URL of the query as string parameters
Number of parameters is limited to the length that can fit in the URL
Not suitable for sensitive information, because parameters such as passwords are visible in the URL.

POST

Used to submit data to a server, and can modify server contents.
Parameters are passed in the message body, rather than the URL.
It has no restrictions on the number of parameters.
Better suited to sensitive information, since parameters are kept out of the URL (the connection still needs HTTPS to be secure).
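The difference between the two methods is easy to see without sending anything over the network. The sketch below uses only Python's standard library to build (not send) a GET and a POST request with the same parameters; the endpoint URL and the parameter names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, used only for illustration.
url = "https://example.com/api/search"
params = {"isbn": "9781250297662", "sortby": "17"}
query = urlencode(params)

# GET: the parameters travel in the URL's query string.
get_req = Request(url + "?" + query)

# POST: the same parameters travel in the request body instead.
post_req = Request(url, data=query.encode("ascii"))

# Nothing is sent here; we only inspect the request objects.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Note how the POST request's URL stays clean, while the GET request exposes every parameter in the address itself.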

Exploring Network Packets

I found that inspecting the network packets for an AbeBooks search results page is simple, and yields promising results. If we open Firefox’s developer tools, under the Network tab, we can see a list of all the packets that are loaded. In particular we are interested in those that have a JSON response, highlighted in red below:

We can see that there are four POST requests to a service called “pricingservice”, and one GET request to a “RecommendationsApi”.

If we look more closely at one of the POST requests, we can see which parameters it takes in:

ISBN! Just what we needed! Furthermore, looking at the response tab, we can see that this request returns the prices for new and used books, among other things:

Wrapping the API in Python

Now that we know a bit more about how AbeBooks works under the hood, we can start implementing our API wrapper in Python. We will need the requests module:

import requests

Sending POST requests

The first REST method that we will implement is the POST method that fetches prices for a given book. From inspecting the page elements, we know that the URL for this service is:

url = "https://www.abebooks.com/servlet/DWRestService/pricingservice"

There seem to be three main parameter groups, and we can infer their purpose. (Values such as isbn, author, and title in the tables below are placeholders to be replaced by user values.)

Searching prices by ISBN:

Parameter	Value
action	getPricingDataByISBN
isbn	isbn
container	pricingService-isbn

Searching prices by title and author:

Parameter	Value
action	getPricingDataForAuthorTitleStandardAddToBasket
an	author
tn	title
container	oe-search-all

Searching prices by title, author, and hardcover/softcover binding:

Parameter	Value
action	getPricingDataForAuthorTitleBindingRefinements
isbn	9781250297662
an	author
tn	title
container	priced-from-soft OR priced-from-hard

The parameters can be stored as a dictionary, and sent to the request’s post method. For example:

#- Search prices by ISBN
payload1 = {'action': 'getPricingDataByISBN',
            'isbn': 9781250297662,
            'container': 'pricingService-9781250297662'}

#- Search prices by author and title
payload2 = {'action': 'getPricingDataForAuthorTitleStandardAddToBasket',
            'an': 'liu ken',
            'tn': 'broken stars',
            'container': 'oe-search-all'}

#- Sending a request
resp = requests.post(url, data=payload1)
print(resp.status_code, resp.reason)
resp.json()

The response is:

200 OK


{'errorTexts': [None],
 'errorCodes': [None],
 'success': True,
 'newExists': True,
 'usedExists': True,
 'pricingInfoForBestNew': {'bestListingid': 30410510568,
  'totalResults': 16,
  'bestPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 7.26',
  'bestPriceInSurferCurrencyWithCurrencySymbol': 'US$ 7.26',
  'domesticShippingPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 4.50',
  'shippingToDestinationPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 6.00',
  'shippingToDestinationPriceInSurferCurrencyWithCurrencySymbol': 'US$ 6.00',
  'shippingDestinationNameInSurferLanguage': 'U.S.A.',
  'vendorCountryNameInSurferLanguage': 'Canada',
  'vendorId': 71361,
  'bestPriceInPurchaseCurrencyValueOnly': '7.26',
  'bestShippingToDestinationPriceInPurchaseCurrencyValueOnly': '6.0',
  'listingCurrencySymbol': 'US$',
  'purchaseCurrencySymbol': 'US$',
  'nonPaddedPriceInListingCurrencyValueOnly': '7.26',
  'refinementList': None,
  'internationalEdition': False,
  'bookCondition': 'New',
  'bookDescription': 'Hardcover. Publisher overstock,...',
  'freeShipping': False},
 'pricingInfoForBestUsed': {'bestListingid': 30529767259,
  'totalResults': 8,
  'bestPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 6.55',
  'bestPriceInSurferCurrencyWithCurrencySymbol': 'US$ 6.55',
  'domesticShippingPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 3.99',
  'shippingToDestinationPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 3.99',
  'shippingToDestinationPriceInSurferCurrencyWithCurrencySymbol': 'US$ 3.99',
  'shippingDestinationNameInSurferLanguage': 'U.S.A.',
  'vendorCountryNameInSurferLanguage': 'U.S.A.',
  'vendorId': 71597499,
  'bestPriceInPurchaseCurrencyValueOnly': '6.55',
  'bestShippingToDestinationPriceInPurchaseCurrencyValueOnly': '3.99',
  'listingCurrencySymbol': 'US$',
  'purchaseCurrencySymbol': 'US$',
  'nonPaddedPriceInListingCurrencyValueOnly': '6.55',
  'refinementList': None,
  'internationalEdition': False,
  'bookCondition': 'As New',
  'bookDescription': 'Like brand new book.',
  'freeShipping': False},
 'pricingInfoForBestAllConditions': None,
 'isbn': '9781250297662',
 'totalResults': 24,
 'containerId': 'pricingService-9781250297662',
 'refinementList': [{'name': 'collectibleJacket',
   'label': 'Dust Jacket',
   'count': 2,
   'url': 'dj=on&isbn=9781250297662&sortby=17'},
  {'name': 'freeShipping',
   'label': 'Free US Shipping',
   'count': 9,
   'url': 'isbn=9781250297662&n=100046078&sortby=17'},
  {'name': 'bindingHard',
   'label': 'Hardcover',
   'count': 23,
   'url': 'bi=h&isbn=9781250297662&sortby=17'},
  {'name': 'collectibleFirstEdition',
   'label': 'First Edition',
   'count': 3,
   'url': 'fe=on&isbn=9781250297662&sortby=17'}],
 'bibliographicDetail': {'author': '', 'title': ''}}
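Most of this response is more than I need. As a small sketch of how to pull out just the two prices, the hypothetical helper summarize_prices below reads the bestPriceInPurchaseCurrencyValueOnly field from each block. It assumes that a block is None when no such listing exists, which I have not verified against the service; the sample dictionary is a trimmed copy of the response above.

```python
def summarize_prices(result):
    """Return {'new': float or None, 'used': float or None} from a pricing response."""
    summary = {}
    for label, key in (("new", "pricingInfoForBestNew"),
                       ("used", "pricingInfoForBestUsed")):
        info = result.get(key)
        # Assumption: a missing listing shows up as None (or an absent key).
        value = info.get("bestPriceInPurchaseCurrencyValueOnly") if info else None
        summary[label] = float(value) if value is not None else None
    return summary


# Trimmed copy of the response shown above.
sample = {
    "success": True,
    "pricingInfoForBestNew": {"bestPriceInPurchaseCurrencyValueOnly": "7.26"},
    "pricingInfoForBestUsed": {"bestPriceInPurchaseCurrencyValueOnly": "6.55"},
}
print(summarize_prices(sample))
```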

Sending a GET request

The API also has a GET method for obtaining book recommendations given an ISBN. The url and parameter names are different, but the way we send the request is very similar:

url = "https://www.abebooks.com/servlet/RecommendationsApi"

Parameter	Value
pageId	plp
itemIsbn13	isbn

#- Get book recommendations by ISBN
payload = {'pageId': 'plp',
           'itemIsbn13': 9781250297662}

resp = requests.get(url, params=payload)
print(resp.status_code, resp.reason)
resp.json()

Response:

200 OK


{'widgetResponses': [{'slotName': 'detail-1',
   'title': 'Customers who bought this item also bought',
   'algoName': 'abeBooksBlendedPurchaseSims',
   'ref': 'pd_b_p_1',
   'recommendationItems': [{'attributes': [],
     'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780765384201-us-300.jpg',
     'itemLink': '/products/isbn/9780765384201?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
     'subTitle': None,
     'isbn13': '9780765384201',
     'title': 'Invisible Planets: Contemporary Chinese Science Fiction...',
     'author': 'Liu, Ken'},
    {'attributes': [],
     'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9781250306029-us-300.jpg',
     'itemLink': '/products/isbn/9781250306029?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
     'subTitle': None,
     'isbn13': '9781250306029',
     'title': 'The Redemption of Time: A Three-Body Problem Novel...',
     'author': 'Baoshu'},
    {'attributes': [],
     'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780765389312-us-300.jpg',
     'itemLink': '/products/isbn/9780765389312?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
     'subTitle': None,
     'isbn13': '9780765389312',
     'title': 'Waste Tide',
     'author': 'Qiufan, Chen'},
    {'attributes': [],
     'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780765384195-us-300.jpg',
     'itemLink': '/products/isbn/9780765384195?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
     'subTitle': None,
     'isbn13': '9780765384195',
     'title': 'Invisible Planets: Contemporary Chinese Science Fiction...',
     'author': 'Liu, Ken'},
    {'attributes': [],
     'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9781784978518-us-300.jpg',
     'itemLink': '/products/isbn/9781784978518?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
     'subTitle': None,
     'isbn13': '9781784978518',
     'title': 'The Wandering Earth',
     'author': 'Liu, Cixin'}]},
  {'slotName': 'ext-search-detail-1',
   'title': None,
   'algoName': 'heroWidgetIsbnSims',
   'ref': 'pd_hw_i_1',
   'recommendationItems': [{'attributes': [],
     'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780804172448-us-300.jpg',
     'itemLink': '/products/isbn/9780804172448?cm_sp=rec-_-pd_hw_i_1-_-plp&reftag=pd_hw_i_1',
     'subTitle': 'Best Selling',
     'isbn13': '9780804172448',
     'title': 'Station Eleven',
     'author': 'Mandel, Emily St. John'},
    {'attributes': [],
     'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9781786073495-us-300.jpg',
     'itemLink': '/products/isbn/9781786073495?cm_sp=rec-_-pd_hw_i_1-_-plp&reftag=pd_hw_i_1',
     'subTitle': 'Top Rated',
     'isbn13': '9781786073495',
     'title': 'Zuleikha',
     'author': 'Yakhina, Guzel'}]}]}
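The recommendations are nested two levels deep (widgets, then items), so a small flattening step makes them easier to work with. The helper name list_recommendations is hypothetical, and the sample is a shortened copy of the response above with one item per widget.

```python
def list_recommendations(result):
    """Flatten every widget's items into (isbn13, title, author) tuples."""
    books = []
    for widget in result.get("widgetResponses", []):
        for item in widget.get("recommendationItems", []):
            books.append((item["isbn13"], item["title"], item["author"]))
    return books


# Shortened copy of the response shown above.
sample = {"widgetResponses": [
    {"slotName": "detail-1",
     "recommendationItems": [
         {"isbn13": "9780765389312", "title": "Waste Tide",
          "author": "Qiufan, Chen"}]},
    {"slotName": "ext-search-detail-1",
     "recommendationItems": [
         {"isbn13": "9780804172448", "title": "Station Eleven",
          "author": "Mandel, Emily St. John"}]},
]}

for isbn13, title, author in list_recommendations(sample):
    print(isbn13, title, author)
```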

An Object-Oriented Module

I created a small Python module abebooks.py to encapsulate the requests. The full code is below:

import requests


class AbeBooks:

    def __get_price(self, payload):
        url = "https://www.abebooks.com/servlet/DWRestService/pricingservice"
        resp = requests.post(url, data=payload)
        resp.raise_for_status()
        return resp.json()

    def __get_recomendations(self, payload):
        url = "https://www.abebooks.com/servlet/RecommendationsApi"
        resp = requests.get(url, params=payload)
        resp.raise_for_status()
        return resp.json()

    def getPriceByISBN(self, isbn):
        """
        Parameters
        ----------
        isbn (int) - a book's ISBN code
        """
        payload = {'action': 'getPricingDataByISBN',
                   'isbn': isbn,
                   'container': 'pricingService-{}'.format(isbn)}
        return self.__get_price(payload)

    def getPriceByAuthorTitle(self, author, title):
        """
        Parameters
        ----------
        author (str) - book author
        title (str) - book title
        """
        payload = {'action': 'getPricingDataForAuthorTitleStandardAddToBasket',
                   'an': author,
                   'tn': title,
                   'container': 'oe-search-all'}
        return self.__get_price(payload)

    def getPriceByAuthorTitleBinding(self, author, title, binding):
        """
        Parameters
        ----------
        author (str) - book author
        title (str) - book title
        binding(str) - one of 'hard', or 'soft'
        """
        if binding == "hard":
            container = "priced-from-hard"
        elif binding == "soft":
            container = "priced-from-soft"
        else:
            raise ValueError(
                    'Invalid parameter. Binding must be "hard" or "soft"')
        payload = {'action': 'getPricingDataForAuthorTitleBindingRefinements',
                   'an': author,
                   'tn': title,
                   'container': container}
        return self.__get_price(payload)

    def getRecommendationsByISBN(self, isbn):
        """
        Parameters
        ----------
        isbn (int) - a book's ISBN code
        """
        payload = {'pageId': 'plp',
                   'itemIsbn13': isbn}
        return self.__get_recomendations(payload)

Using the AbeBooks Module

from abebooks import AbeBooks

ab = AbeBooks()
results = ab.getPriceByISBN(9780062941503)
if results['success']:
    best_new = results['pricingInfoForBestNew']
    best_used = results['pricingInfoForBestUsed']

#- Best New Price
print(best_new['bestPriceInPurchaseCurrencyWithCurrencySymbol'])

US$ 21.49

#- Best Used Price
print(best_used['bestPriceInPurchaseCurrencyWithCurrencySymbol'])

US$ 24.42
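Since part of my motivation was to keep track of price changes over time, here is a rough sketch of how each lookup could be logged as a CSV row. Everything in it is an assumption for illustration: the helper price_row, the column order, and the in-memory buffer that stands in for a real file. The results dictionary is a stand-in for the output of ab.getPriceByISBN, filled with the two prices printed above.

```python
import csv
import io
from datetime import datetime, timezone

PRICE_KEY = "bestPriceInPurchaseCurrencyValueOnly"


def price_row(isbn, results, when=None):
    """Build one history row: timestamp, ISBN, best new price, best used price."""
    when = when or datetime.now(timezone.utc)
    new = (results.get("pricingInfoForBestNew") or {}).get(PRICE_KEY)
    used = (results.get("pricingInfoForBestUsed") or {}).get(PRICE_KEY)
    return [when.isoformat(), str(isbn), new, used]


# Stand-in for ab.getPriceByISBN(9780062941503), using the prices printed above.
results = {"success": True,
           "pricingInfoForBestNew": {PRICE_KEY: "21.49"},
           "pricingInfoForBestUsed": {PRICE_KEY: "24.42"}}

# An in-memory buffer stands in for a file opened in append mode.
history = io.StringIO()
csv.writer(history).writerow(
    price_row(9780062941503, results,
              when=datetime(2020, 4, 10, tzinfo=timezone.utc)))
print(history.getvalue())
```

Running something like this from a systemd timer would build up a price history one row at a time.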

Interweaving R and Python with Reticulate

2019-05-04T00:00:00-06:00

Python is my favorite language for data manipulation, but every once in a while, I find an R library that I absolutely need to try out. I wish I could have the best of both worlds. Unfortunately, I had not found a good solution until recently, when I tried out RStudio and the Reticulate R package, and the combination is awesome!

With Reticulate and the new version of RStudio (RStudio 1.2), you can create Python code chunks that share a persistent environment within a single R Markdown document. This turns RStudio into a powerful alternative to the popular Jupyter notebook for Python development.

A simple demonstration:

R code:

# Loading the Reticulate library in RStudio
library(reticulate)

Now some Python:

# Creating a couple of simple arrays to plot
import numpy as np

x = np.array([1, 2, 3, 4, 5, 5])
y = np.exp2(x)


# Displaying a python plot
import matplotlib.pyplot as plt

plt.plot(x, y)
plt.show()

Furthermore, you can access these same Python objects from inside an R code cell, so now, you can finally have the best of both worlds!

# Plotting the same arrays in R! So simple!
plot(py$x, py$y)

I normally do most of my coding in Vim, or Jupyter notebooks, but after discovering this package, I think I will be using RStudio a lot more often for Python + R programming.

Previously, my attempts at combining Python and R code involved using the Python rpy2 library to call R code within Python, but this approach always felt cumbersome at best. By comparison, Reticulate makes the transition feel smooth and natural, effectively marrying the powerful libraries of R and Python.

Machine Learning Methods for LogP Prediction: Pt. 1

2019-03-07T00:00:00-07:00

Reading experimental logP data
Model with simple descriptors
Calculating fingerprints
Comparing fingerprint models

The octanol-water partition coefficient, or logP, is one of the most important properties for determining a compound’s suitability as a drug. Currently, most of the available regression models for in silico logP prediction are trained on the PHYSPROP database of experimental logP values. However, most of the compounds in this database are not highly representative of drug-like chemical space. Unfortunately, there is currently a lack of publicly available experimental logP datasets for biological compounds that can be used to train better prediction tools.

In this small test, I have decided to use the experimental logP data released in the paper: “Large, chemically diverse dataset of logP measurements for benchmarking studies” by Martel et al¹. As this is a preliminary study, we are interested in finding which featurization methods work best for predicting logP.

Most of the popular tools for logP prediction are based on physical descriptors, such as atom type counts, or polar surface area, or on topological descriptors. Here, we will calculate different physical descriptors, as well as structural fingerprints for the molecules, and benchmark their performance using three different regression models: neural network, random forest, and support vector machines.

We first import some libraries including RDKit and scikit-learn tools (The utility script contains custom functions for generating TPATF and TPAPF fingerprints):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Descriptors

from utility import FeatureGenerator

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

The utility script can be found in this Gist.

Reading experimental logP data

The supplementary pdf file from the Martel et al. paper was converted to csv text format using the Linux pdftotext utility from the Poppler library. The experimental data is read as a csv file, and the SMILES strings are converted to RDKit molecules.

data = pd.read_csv("training_data/logp_759_data.csv")
# Copy the filtered rows so that new columns can be added without a SettingWithCopyWarning
data_logp = data[data.Status == "Validated"].copy()
print("Shape:", data_logp.shape)
data_logp.head()

Shape: (707, 7)

	ID	ZINC (2010)	Status	Supplier	SMILES	logPexp	pH_of_analysis
0	1	ZINC00036522	Validated	Specs	Cc1cc2c(cc1C)NC(=O)C[C@H]2c3ccccc3OC	4.17	5.0
1	3	ZINC00185379	Validated	ChemBridge	COc1ccc2c(c1)O[C@@](CC2=O)(C(F)(F)F)O	2.79	5.0
2	4	ZINC12402487	Validated	ChemBridge	CC1(O[C@H]([C@H](O1)C(=O)N)C(=O)N)C(C)(C)C	1.60	6.5
3	5	ZINC00055459	Validated	Specs	CCOc1cc(cc(c1OCC)OCC)c2nnc(o2)c3ccco3	3.96	10.5
4	6	ZINC00056871	Validated	Enamine	CN(C)c1ccc(cc1)C(=C)c2ccc(cc2)N(C)C	5.30	7.3

Convert SMILES to 2D molecules:

molecules = data_logp.SMILES.apply(Chem.MolFromSmiles)

Next, we use RDKit to calculate some physical descriptors:

data_logp.loc[:, 'MolLogP'] = molecules.apply(Descriptors.MolLogP)
data_logp.loc[:, 'HeavyAtomCount'] = molecules.apply(Descriptors.HeavyAtomCount)
data_logp.loc[:, 'HAccept'] = molecules.apply(Descriptors.NumHAcceptors)
data_logp.loc[:, 'Heteroatoms'] = molecules.apply(Descriptors.NumHeteroatoms)
data_logp.loc[:, 'HDonor'] = molecules.apply(Descriptors.NumHDonors)
data_logp.loc[:, 'MolWt'] = molecules.apply(Descriptors.MolWt)
data_logp.loc[:, 'RotableBonds'] = molecules.apply(Descriptors.NumRotatableBonds)
data_logp.loc[:, 'RingCount'] = molecules.apply(Descriptors.RingCount)
data_logp.loc[:, 'Ipc'] = molecules.apply(Descriptors.Ipc)
data_logp.loc[:, 'HallKierAlpha'] = molecules.apply(Descriptors.HallKierAlpha)
data_logp.loc[:, 'NumValenceElectrons'] = molecules.apply(Descriptors.NumValenceElectrons)
data_logp.loc[:, 'SaturatedRings'] = molecules.apply(Descriptors.NumSaturatedRings)
data_logp.loc[:, 'AliphaticRings'] = molecules.apply(Descriptors.NumAliphaticRings)
data_logp.loc[:, 'AromaticRings'] = molecules.apply(Descriptors.NumAromaticRings)

As a baseline, we calculate the performance of RDKit’s calculated MolLogP vs the experimental logP.

r2 = r2_score(data_logp.logPexp, data_logp.MolLogP)
mse = mean_squared_error(data_logp.logPexp, data_logp.MolLogP)
plt.scatter(data_logp.logPexp, data_logp.MolLogP,
            label = "MSE: {:.2f}\nR^2: {:.2f}".format(mse, r2))
plt.legend()
plt.show()

As we can see above, RDKit’s logP predictions have a relatively high mean square error, and a weak coefficient of determination for this dataset. RDKit’s MolLogP implementation is based on atomic contributions. Hence, we will first try to train our own simple logP model using the RDKit physical descriptors that we generated above.
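For readers unfamiliar with the two metrics, here is a plain-Python sketch of what they compute. The function names and the toy numbers are mine, not part of the analysis; in the post itself I use the scikit-learn implementations.

```python
def mse_plain(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_plain(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, only to show the arithmetic.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(mse_plain(y_true, y_pred), r2_plain(y_true, y_pred))
```

An R² of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that.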

Model with simple descriptors

These are the descriptors that we will use for the model:

X = data_logp.iloc[:, 8:]
y = data_logp.logPexp
X.head()

	HeavyAtomCount	HAccept	Heteroatoms	HDonor	MolWt	RotableBonds	RingCount	Ipc	HallKierAlpha	NumValenceElectrons	SaturatedRings	AliphaticRings	AromaticRings
0	21	2	3	1	281.355	2	3	69759.740168	-2.29	108	0	1	2
1	18	4	7	1	262.183	1	2	7977.096898	-1.76	98	0	1	1
2	16	4	6	2	230.264	2	1	2165.098769	-1.14	92	1	1	0
3	25	7	7	0	344.367	8	3	819166.201010	-2.96	132	0	0	3
4	20	2	2	0	266.388	4	2	32168.378171	-2.22	104	0	0	2

For the regression, we will use a Random Forest with the default parameters from scikit-learn, and set aside one third of the data for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

models = {"rf": RandomForestRegressor(n_estimators=100, random_state=42)}

scores = {}
for m in models:
    models[m].fit(X_train, y_train)
    scores[m + "_train"] = models[m].score(X_train, y_train)
    y_pred = models[m].predict(X_test)
    scores[m + "_test"] = r2_score(y_test, y_pred)
    scores[m + "_mse_test"] = mean_squared_error(y_test, y_pred)

The scores of our model are:

scores = pd.Series(scores).T
scores

rf_train 0.909276
rf_test 0.451319
rf_mse_test 0.792195
dtype: float64

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
plt.scatter(y_test, y_pred, label = "MSE: {:.2f}\nR^2: {:.2f}".format(mse, r2))
plt.legend()
plt.show()

As we can see, using these simple descriptors coupled with scikit-learn’s default random forest gives us a better R² and MSE than the RDKit logP predictor. This, however, is likely due to differences between the training set that we used and the one used to develop the RDKit model. It would be interesting to see how much we can improve the performance by tuning the random forest parameters, and then measure the performance on the PHYSPROP dataset.

Calculating fingerprints

Now that we have seen the performance of the simple molecular descriptors, we would like to assess the performance of some of the most popular molecular fingerprints. Among the many available methods, we will test Morgan fingerprints (ECFP4 and ECFP6), RDKFingerprints, and topological pharmacophore fingerprints (TPAPF and TPATF), the scripts for which are available from MayaChemTools.

I created a function for parallelizing a DataFrame’s apply() function. This makes TPATF and TPAPF fingerprint calculation much faster. This function has become one of my most useful snippets of code:

import multiprocessing
from joblib import Parallel, delayed

def applyParallel(df, func):
    """This function splits a pandas Series into n chunks,
    corresponding to the number of available CPUs. Then it
    applies a given function to the dataframe chunks, and 
    finally, returns their concatenated output."""
    n_jobs=multiprocessing.cpu_count()
    groups =  np.array_split(df, n_jobs)
    results = Parallel(n_jobs)(delayed(lambda g: g.apply(func))(group) for group in groups)
    return pd.concat(results)
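The split, apply, and concatenate pattern behind applyParallel can also be illustrated with the standard library alone. Note that this sketch swaps in a different technique: it uses threads from concurrent.futures and plain lists instead of joblib worker processes and pandas objects, so it shows the structure of the idea rather than a speed-up for CPU-bound work. The names split_chunks and apply_parallel are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_chunks(items, n):
    """Split a list into n nearly equal chunks (the first chunks get the extras)."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def apply_parallel(items, func, n_jobs=4):
    """Apply func to every item, one chunk per worker, preserving the order."""
    chunks = split_chunks(items, n_jobs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map() yields results in submission order, so concatenation is safe.
        parts = list(pool.map(lambda chunk: [func(x) for x in chunk], chunks))
    return [y for part in parts for y in part]


print(apply_parallel(list(range(10)), lambda x: x * x, n_jobs=3))
```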

Calculate fingerprints:

fps = {"ECFP4": molecules.apply(lambda m: AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048)),
       "ECFP6": molecules.apply(lambda m: AllChem.GetMorganFingerprintAsBitVect(m, radius=3, nBits=2048)),
       "RDKFP": molecules.apply(lambda m: AllChem.RDKFingerprint(m, fpSize=2048)),
       "TPATF": applyParallel(data_logp.SMILES, lambda m: FeatureGenerator(m).toTPATF()),
       "TPAPF": applyParallel(data_logp.SMILES, lambda m: FeatureGenerator(m).toTPAPF())}

Comparing fingerprint models

Finally, here we apply three different types of regression models to estimate the performance of the different fingerprints.

y = data_logp.logPexp

models = {"rf": RandomForestRegressor(n_estimators=100, random_state=42),
          "nnet": MLPRegressor(random_state=42),
          "svr": SVR(gamma='auto')}

scores = {}

for f in fps:
    scores[f] = {}
    # Convert fps to 2D numpy array
    X = np.array(fps[f].tolist())
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                        random_state=42)
    for m in models:
        models[m].fit(X_train, y_train)
        #scores[f][m + "_r2_train"] = models[m].score(X_train, y_train)
        y_pred = models[m].predict(X_test)
        scores[f][m + "_r2_test"] = r2_score(y_test, y_pred)
        scores[f][m + "_mse_test"] = mean_squared_error(y_test, y_pred)

scores_df = pd.DataFrame(scores).T
scores_df

	nnet_mse_test	nnet_r2_test	rf_mse_test	rf_r2_test	svr_mse_test	svr_r2_test
ECFP4	1.378013	0.045576	1.216157	0.157679	1.359439	0.058440
ECFP6	1.238698	0.142066	1.182595	0.180924	1.340282	0.071709
RDKFP	1.236841	0.143353	1.068570	0.259899	1.069886	0.258988
TPATF	3.357452	-1.325401	0.704787	0.511858	0.970373	0.327911
TPAPF	1.391893	0.035962	0.829020	0.425813	0.830663	0.424675

Overall, the TPATF fingerprint performed the best when paired with the random forest (test R² of 0.51), even outperforming the simple descriptor model (test R² of 0.45). The default random forest had the best performance of all the regression methods for every fingerprint, although it is very possible that this will change after some optimization of the model parameters.
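To double-check the ranking, the test R² values from the table above can be sorted directly. This is only a convenience sketch with the numbers copied by hand into a dictionary; with the real scores_df, a pandas sort would do the same job.

```python
# Test R² values copied from the table above.
r2_test = {
    "ECFP4": {"nnet": 0.045576, "rf": 0.157679, "svr": 0.058440},
    "ECFP6": {"nnet": 0.142066, "rf": 0.180924, "svr": 0.071709},
    "RDKFP": {"nnet": 0.143353, "rf": 0.259899, "svr": 0.258988},
    "TPATF": {"nnet": -1.325401, "rf": 0.511858, "svr": 0.327911},
    "TPAPF": {"nnet": 0.035962, "rf": 0.425813, "svr": 0.424675},
}

# Rank every (fingerprint, model) pair from best to worst test R².
ranked = sorted(
    ((score, fp, model)
     for fp, by_model in r2_test.items()
     for model, score in by_model.items()),
    reverse=True,
)

for score, fp, model in ranked[:3]:
    print(fp, model, score)
```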

In later works, we will further tune models using simple physical descriptors as well as TPATF fingerprints, and compare their performance to existing logP predictors using this dataset, as well as the PHYSPROP set. It would also be interesting to observe the effects of consensus scoring using several models.

Martel, S., Gillerat, F., Carosati, E., Maiarelli, D., Tetko, I. V., Mannhold, R., & Carrupt, P.-A. (2013). Large, chemically diverse dataset of logP measurements for benchmarking studies. European Journal of Pharmaceutical Sciences, 48(1-2), 21–29. doi: 10.1016/j.ejps.2012.10.019

Mining Pharos with MySQL and Python

2019-02-19T00:00:00-07:00

Why MySQL and Python?
Connecting to Pharos
Executing database queries
Importing the tables to Pandas
Filtering by number of actives
Joining Tables
Exporting the data

Why MySQL and Python?

Previously, I demonstrated how to use the SIFTS database to find UniProt-to-PDB mappings for proteins from the Pharos database. To do this, we downloaded csv format files for different receptor classes directly from the Pharos website. However, manually downloading these data files is tedious, and does not allow us to keep our data up to date with future changes in the source database. A much more efficient way is to obtain this data directly through SQL queries.

I must confess that I am not proficient when it comes to complex table joins and filters in SQL, but I can do the job in Python! Additionally, reading SQL tables into Python allows us to use Python’s data visualization libraries on the data with ease.

In this notebook, we use MySQL Connector and Python’s Pandas library to retrieve and manipulate data for Pharos targets. The goal is to obtain a dataset of targets that have at least 15 active compounds, along with information about their different target classes.

All the code in this post is also available as a Jupyter notebook here.

To install mysql-connector, run: pip install mysql-connector-python-rf.

First, we import the necessary libraries:

import mysql.connector as sql
import pandas as pd
import matplotlib.pyplot as plt

Connecting to Pharos

We use Python to create an SQL connection to the Pharos database:

db_connection = sql.connect(host='tcrd.kmc.io', db='tcrd540', user='tcrd')
db_connection

<mysql.connector.connection.MySQLConnection at 0x7f428fca0668>

In order to use the new connection, we need to create a cursor object, which allows us to send instructions to the database:

db_cursor = db_connection.cursor()

Executing database queries

We can use the newly created cursor to execute queries. First, we execute the SHOW TABLES MySQL command to get an idea of the kinds of tables we can collect information from.

The cursor.fetchall() method returns a list, and is equivalent to calling list() on the cursor object.

db_cursor.execute('SHOW TABLES;')
tables = db_cursor.fetchall()
print(tables)

[('alias',), ('cmpd_activity',), ('cmpd_activity_type',), ('compartment',), ('compartment_type',), ('data_type',), ('dataset',), ('dbinfo',), ('disease',), ('disease_type',), ('do',), ('do_parent',), ('drug_activity',), ('dto',), ('expression',), ('expression_type',), ('feature',), ('gene_attribute',), ('gene_attribute_type',), ('generif',), ('goa',), ('hgram_cdf',), ('info_type',), ('kegg_distance',), ('kegg_nearest_tclin',), ('locsig',), ('mlp_assay_info',), ('ortholog',), ('ortholog_disease',), ('p2pc',), ('panther_class',), ('patent_count',), ('pathway',), ('pathway_type',), ('phenotype',), ('phenotype_type',), ('pmscore',), ('ppi',), ('ppi_type',), ('protein',), ('protein2pubmed',), ('provenance',), ('ptscore',), ('pubmed',), ('t2tc',), ('target',), ('tdl_info',), ('tdl_update_log',), ('techdev_contact',), ('techdev_info',), ('tinx_articlerank',), ('tinx_disease',), ('tinx_importance',), ('tinx_novelty',), ('tinx_target',), ('xref',), ('xref_type',)]

Above, we see a list of the tables. We can use the DESCRIBE query to obtain a list of the attributes of a particular table. In this case, we are interested in the protein, target, and cmpd_activity tables.

db_cursor.execute('DESCRIBE protein;')
list(db_cursor)

[('id', 'int(11)', 'NO', 'PRI', None, 'auto_increment'), ('name', 'varchar(255)', 'NO', 'UNI', None, ''), ('description', 'text', 'NO', '', None, ''), ('uniprot', 'varchar(20)', 'NO', 'UNI', None, ''), ('up_version', 'int(11)', 'YES', '', None, ''), ('geneid', 'int(11)', 'YES', '', None, ''), ('sym', 'varchar(20)', 'YES', '', None, ''), ('family', 'varchar(255)', 'YES', '', None, ''), ('chr', 'varchar(255)', 'YES', '', None, ''), ('seq', 'text', 'YES', '', None, ''), ('dtoid', 'varchar(13)', 'YES', '', None, ''), ('stringid', 'varchar(15)', 'YES', '', None, '')]
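The equivalence between fetchall() and list() mentioned earlier can be checked without a network connection. The following is a minimal sketch that uses Python's built-in sqlite3 module as an offline stand-in for the Pharos MySQL server; this substitution is an assumption made only so that the example runs anywhere. Both drivers follow the Python DB-API, so the cursor is used in the same way. The table name demo and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used as an offline stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and rows, for illustration only.
cur.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO demo VALUES (?, ?)", [(1, "alias"), (2, "protein")])

# Option 1: fetchall() returns every remaining row as a list of tuples.
cur.execute("SELECT id, name FROM demo ORDER BY id")
rows_fetchall = cur.fetchall()

# Option 2: iterating over the cursor (here via list()) yields the same rows.
cur.execute("SELECT id, name FROM demo ORDER BY id")
rows_list = list(cur)

print(rows_fetchall == rows_list)
conn.close()
```

Note that a cursor can only be consumed once per query, which is why the SELECT statement is executed again before the second call.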

Importing the tables to Pandas

Compound Activity

Next, we use Pandas to read the data directly from the tables. We start with the cmpd_activity table, which contains information about the binding affinity of compounds to targets in the database:

query = "SELECT id, target_id, cmpd_id_in_src, cmpd_name_in_src, \
         smiles, act_value, act_type \
         FROM cmpd_activity"
cmpd_activity = pd.read_sql(query, con=db_connection)

print(cmpd_activity.shape)
cmpd_activity.head(3)

(382291, 7)

	id	target_id	cmpd_id_in_src	cmpd_name_in_src	smiles	act_value	act_type
0	1	3006	CHEMBL365855	N-(5-Cyclobutyl-thiazol-2-yl)-2-phenyl-acetamide	O=C(Cc1ccccc1)Nc2ncc(s2)C3CCC3	7.60	IC50
1	2	3006	CHEMBL3775677	3-Isopropyl-5-(2,3-dihydroxypropyl)amino-7-[4-...	CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(O)...	7.68	IC50
2	3	3006	CHEMBL3775608	3-Isopropyl-5-(3-amino-2-hydroxypropyl)amino-7...	CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(N)...	7.77	IC50

Protein

We read in the data we want from the protein table:

query = "SELECT id, name, description, uniprot, family, seq \
         FROM protein"
protein = pd.read_sql(query, con=db_connection)

print(protein.shape)
protein.head(3)

(20244, 6)

	id	name	description	uniprot	family	seq
0	1	1433E_HUMAN	14-3-3 protein epsilon	P62258	Belongs to the 14-3-3 family.	MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLS...
1	2	1433F_HUMAN	14-3-3 protein eta	Q04917	Belongs to the 14-3-3 family.	MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLS...
2	3	1433T_HUMAN	14-3-3 protein theta	P27348	Belongs to the 14-3-3 family.	MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSV...

Target

For the target table, we are interested in filtering for targets that are in the Tclin or Tchem development classifications.

query = "SELECT id, name, tdl, fam, famext \
         FROM target \
         WHERE tdl='Tclin' OR tdl='Tchem'"
target = pd.read_sql(query, con=db_connection)

print(target.shape)
target.head(3)

(2211, 5)

	id	name	tdl	fam	famext
0	2	14-3-3 protein eta	Tchem	None	None
1	3	14-3-3 protein theta	Tchem	None	None
2	23	3 beta-hydroxysteroid dehydrogenase/Delta 5-->...	Tchem	Enzyme	3-beta-HSD
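The filter values in a query like the one above can also be passed separately from the SQL text. Below is a minimal sketch of that idea, again using an in-memory sqlite3 database as an offline stand-in (an assumption; the toy target table and its four rows are invented). With sqlite3 the placeholder is ?, whereas MySQL Connector uses %s.

```python
import sqlite3
import pandas as pd

# Offline stand-in for the Pharos database, with a made-up target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT, tdl TEXT)")
conn.executemany(
    "INSERT INTO target VALUES (?, ?, ?)",
    [(1, "A", "Tclin"), (2, "B", "Tchem"), (3, "C", "Tbio"), (4, "D", "Tdark")],
)

# The development levels are supplied as parameters instead of being
# written into the query string.
query = "SELECT id, name, tdl FROM target WHERE tdl IN (?, ?) ORDER BY id"
target_demo = pd.read_sql(query, con=conn, params=("Tclin", "Tchem"))
conn.close()

print(target_demo)
```

Keeping the values out of the query string makes it easier to reuse the same query for a different set of development levels.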

Since we have all the data stored in memory, we no longer need the database connection.

db_connection.close()

Filtering by number of actives

Here, we filter out receptors that have fewer than 15 active molecules.

num_actives = {}
target_ids = cmpd_activity.target_id.unique()
for i in target_ids:
    num_actives[i] = len(cmpd_activity[cmpd_activity.target_id == i])

target['num_actives'] = target.id.apply(lambda x: num_actives.get(x))
target = target[target['num_actives'] >= 15].copy()  # Copy to avoid a SettingWithCopyWarning below
target['num_actives'] = target['num_actives'].astype(int)  # Convert from float to int
target.shape

(1067, 6)

Whereas before we had a total of 2,211 targets in Tclin and Tchem, now we only have 1,067, each with at least 15 experimental activity values.
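The per-target counting can also be done without an explicit loop. The sketch below shows one possible alternative using value_counts() and map() on small made-up tables; the names cmpd_activity_demo, target_demo, and min_actives are hypothetical, and the threshold is lowered to 2 so that the toy data produces a visible result.

```python
import pandas as pd

# Toy activity table (made-up values): one row per measured compound activity.
cmpd_activity_demo = pd.DataFrame({"target_id": [1, 1, 1, 2, 2, 3]})
# Toy target table; target 4 has no activity rows at all.
target_demo = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})

# Count rows per target_id in a single pass instead of looping over the ids.
counts = cmpd_activity_demo["target_id"].value_counts()

# Map the counts onto the target table; targets without activities get 0.
target_demo["num_actives"] = target_demo["id"].map(counts).fillna(0).astype(int)

# Keep targets with at least min_actives activity values (15 in the post).
min_actives = 2
filtered = target_demo[target_demo["num_actives"] >= min_actives].copy()

print(filtered)
```

On a table with several hundred thousand rows, a single counting pass like this is likely to be faster than filtering the full Data Frame once per target.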

Finally, we create a pie chart to visualize the number of targets in each target family:

tchem_tclin_fams = {}
families = [fam for fam in target.fam.unique() if fam is not None]

for f in sorted(families):
    tchem_tclin_fams[f] = len(target[target.fam == f])
tchem_tclin_fams['None'] = len(target[target.fam.isna()])

tchem_tclin_fams

{'Enzyme': 348, 'Epigenetic': 42, 'GPCR': 189, 'IC': 91, 'Kinase': 205, 'NR': 28, 'TF': 6, 'TF; Epigenetic': 5, 'Transporter': 35, 'None': 118}

plt.figure(figsize=(4, 4))
width = .6
explode = [0, 0, 0, 0, 0, .3, .2, .1, 0, 0]
labels = ["{}: {}".format(f, n) for f, n in zip(tchem_tclin_fams.keys(),
          tchem_tclin_fams.values())]
plt.pie(tchem_tclin_fams.values(), labels=labels, radius=2, explode=explode,
        wedgeprops=dict(width=width, edgecolor='w'), autopct='%1.0f%%',
        pctdistance=.8, labeldistance=1.1)

plt.savefig("pharos_targets.svg", bbox_inches = 'tight')
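The percentages shown on the chart come from the autopct format string, which is applied to each wedge's share of the total. As a rough check, the same numbers can be computed directly from the counts reported above. This is a plain-Python sketch, not part of the original notebook.

```python
# Family counts as reported in the output above.
tchem_tclin_fams = {'Enzyme': 348, 'Epigenetic': 42, 'GPCR': 189, 'IC': 91,
                    'Kinase': 205, 'NR': 28, 'TF': 6, 'TF; Epigenetic': 5,
                    'Transporter': 35, 'None': 118}

total = sum(tchem_tclin_fams.values())

# Percentage share of each family, formatted with the same pattern
# that was passed to autopct ('%1.0f%%').
shares = {fam: '%1.0f%%' % (100 * n / total)
          for fam, n in tchem_tclin_fams.items()}

for fam, pct in shares.items():
    print(fam, pct)
```

The total equals the 1,067 targets that remained after filtering, so every target is represented in exactly one wedge.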

From this target data, we could further filter down to receptors that have known protein structures, as shown in the SIFTS database post. In this case, we will simply concatenate the data from the Protein table to the Target table, in order to obtain information about the UniProt ID, protein ontology, and sequence. Finally, we will write the data to csv files for further analysis.

Joining Tables

We need to join the Protein and Target tables by id. The two tables should have the same size:

protein = protein[protein.id.isin(target.id)]
protein.shape

(1067, 6)

Joining the tables:

protein = protein.set_index("id")
target = target.set_index("id")
result = pd.concat([target, protein], axis=1, join='outer')

result.head(3)

	name	tdl	fam	famext	num_actives	name	description	uniprot	family	seq
id
26	5-hydroxytryptamine receptor 2B	Tclin	GPCR	GPCR	777	5HT2B_HUMAN	5-hydroxytryptamine receptor 2B	P41595	Belongs to the G-protein coupled receptor 1 fa...	MALSYRVSELQSTIPEHILQSTFVHVISSNWSGLQTESIPEEMKQI...
27	5-hydroxytryptamine receptor 2C	Tclin	GPCR	GPCR	1612	5HT2C_HUMAN	5-hydroxytryptamine receptor 2C	P28335	Belongs to the G-protein coupled receptor 1 fa...	MVNLRNAVHSFLVHLIGLLVWQCDISVSPVAAIVTDIFNTSDGGRF...
30	5'-nucleotidase	Tchem	Enzyme	None	23	5NTD_HUMAN	5'-nucleotidase	P21589	Belongs to the 5'-nucleotidase family.	MCPRAARAPATLLLALGAVLWPAAGAWELTILHTNDVHSRLEQTSE...
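Note that both tables have a column called name, so the combined table ends up with two columns sharing that label. One way to keep them apart is DataFrame.join with suffixes. The following is a small sketch using two rows taken from the output above; the names target_demo, protein_demo, and result_demo are hypothetical.

```python
import pandas as pd

# Two rows from the tables above, both indexed by id.
target_demo = pd.DataFrame(
    {"id": [26, 27],
     "name": ["5-hydroxytryptamine receptor 2B",
              "5-hydroxytryptamine receptor 2C"],
     "tdl": ["Tclin", "Tclin"]}
).set_index("id")
protein_demo = pd.DataFrame(
    {"id": [26, 27],
     "name": ["5HT2B_HUMAN", "5HT2C_HUMAN"],
     "uniprot": ["P41595", "P28335"]}
).set_index("id")

# Join on the shared index; the suffixes are only applied to
# column names that appear in both tables.
result_demo = target_demo.join(
    protein_demo, how="outer", lsuffix="_target", rsuffix="_protein"
)

print(list(result_demo.columns))
```

With unique column labels, an expression such as result_demo["name_protein"] returns a single column rather than a two-column Data Frame.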

Exporting the data

We separate each target class into different Data Frames, store these in a dictionary, and also save them to separate csv files.

target_dfs = {}
for f in families:
    target_dfs[f] = result[result.fam == f]
    target_dfs[f].to_csv(f + ".csv")
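The same split can be written with groupby, which yields one sub-table per family. Here is a minimal sketch on invented data that writes to a temporary directory rather than the working directory; result_demo, out_dir, and written are hypothetical names. By default, groupby skips rows whose family is missing, which mirrors the loop above, since families excludes None.

```python
import os
import tempfile
import pandas as pd

# Toy result table (made-up rows); target 4 has no family assigned.
result_demo = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "fam": ["GPCR", "Kinase", "GPCR", None]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

out_dir = tempfile.mkdtemp()  # temporary folder, so no files are overwritten
written = {}

# groupby yields (family, sub-table) pairs; rows with a missing family are dropped.
for fam, fam_df in result_demo.groupby("fam"):
    path = os.path.join(out_dir, fam + ".csv")
    fam_df.to_csv(path)
    written[fam] = path

print(sorted(written))
```

If targets without a family should also be exported, they would need to be handled separately, for example by selecting rows where fam is missing.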

Mapping Pharos Targets to PDB Structures

2019-02-14T00:00:00-07:00

Problem Description
Getting the Data
Read SIFTS Mappings
Find PDB IDs
Summarizing the Data
Visualizing the Data

Problem Description

The Structure Integration with Function, Taxonomy and Sequence (SIFTS) database provides mappings between UniProt and PDB, as well as annotations from GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl and other resources. Here, we map all the receptors from the Pharos database to their PDB IDs, using their UniProt accession numbers.

The goal is to obtain a dataset of human targets with available structures and known ligand binding affinities. I also want to get the distribution of these PDB structures across different receptor families, such as Kinases, GPCRs, Ion Channels, Nuclear Receptors, and Transporters.

Getting the Data

First, we read in csv files downloaded from Pharos for targets in the Tclin (targets with approved drugs) and Tchem (targets with known binding affinities) categories, for several receptor classes. The csv files contain UniProt IDs for each receptor. All downloaded data and code are available in my GitHub repository: ravila4/Pharos-to-PDB

import pandas as pd

target_classes = ["GPCRs", "ion-channels", "kinases", "nuclear-receptors", "transporters"]
IDG_data = {}
for tclass in target_classes:
    IDG_data[tclass] = pd.read_csv("data/" + tclass + ".csv", index_col=False)

Read SIFTS Mappings

The mappings were downloaded as a CSV file from the SIFTS FTP site.

uniprot_to_pdb = pd.read_csv("data/uniprot_pdb.csv", skiprows=1)
uniprot_to_pdb.head()

A sample of the SIFTS Data Frame.

	SP_PRIMARY	PDB
0	A0A010	5b00;5b01;5b02;5b03;5b0i;5b0j;5b0k;5b0l;5b0m;5...
1	A0A011	3vk5;3vka;3vkb;3vkc;3vkd
2	A0A014C6J9	6br7
3	A0A016UNP9	2md0
4	A0A023GPI4	2m6j
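The skiprows=1 argument skips the first line of the file before the header is read. The sketch below assumes that this first line is a comment starting with #; that is an assumption about the file layout, and the comment text here is only a placeholder. The data rows are copied from the sample above. Under that assumption, comment="#" gives the same result.

```python
import io
import pandas as pd

# Assumed layout: a leading '#' comment line, then the header and the data rows.
raw = (
    "# placeholder comment line\n"
    "SP_PRIMARY,PDB\n"
    "A0A011,3vk5;3vka;3vkb;3vkc;3vkd\n"
    "A0A014C6J9,6br7\n"
)

# Skip the first line by position...
df_skip = pd.read_csv(io.StringIO(raw), skiprows=1)
# ...or skip any line that starts with the comment character.
df_comment = pd.read_csv(io.StringIO(raw), comment="#")

print(df_skip.equals(df_comment))
```

Because the PDB IDs are separated by semicolons rather than commas, each row is still parsed as exactly two columns.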

Find PDB IDs

Here’s a function for joining the two Data Frames:

def find_pdbs(df):
    """ Input: Data Frame of Pharos data.
        Output: List with one entry per target: a list of PDB IDs,
        or None if no mapping was found. """
    IDS = []
    # Iterate over the values directly, so that this works with any index
    for uniprot_id in df["Uniprot ID"]:
        pdb_ids = None
        mapping = uniprot_to_pdb[uniprot_to_pdb.SP_PRIMARY == uniprot_id]
        if len(mapping) != 0:
            pdb_ids = mapping.PDB.iloc[0].split(';')
        IDS.append(pdb_ids)
    return IDS

Adding PDB IDs to Pharos targets:

for df in IDG_data.values():
    df['PDB_IDS'] = find_pdbs(df)
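For larger tables, the same lookup can be expressed without a Python loop. This is a sketch of one possible alternative using Series.map on made-up frames; uniprot_to_pdb_demo, pharos_demo, and the accession X00000 are hypothetical. Unlike the function above, missing entries come out as NaN rather than None, which isna() treats the same way.

```python
import pandas as pd

# Toy SIFTS mapping, with rows copied from the sample shown earlier.
uniprot_to_pdb_demo = pd.DataFrame(
    {"SP_PRIMARY": ["A0A011", "A0A014C6J9"],
     "PDB": ["3vk5;3vka;3vkb;3vkc;3vkd", "6br7"]}
)
# Toy Pharos table; X00000 is a made-up accession with no mapping.
pharos_demo = pd.DataFrame({"Uniprot ID": ["A0A014C6J9", "X00000", "A0A011"]})

# Build a lookup Series: UniProt accession -> list of PDB IDs.
# drop_duplicates keeps the first row per accession, like .iloc[0] above.
lookup = (
    uniprot_to_pdb_demo.drop_duplicates("SP_PRIMARY")
    .set_index("SP_PRIMARY")["PDB"]
    .str.split(";")
)

# Accessions that are missing from the lookup become NaN.
pharos_demo["PDB_IDS"] = pharos_demo["Uniprot ID"].map(lookup)

print(pharos_demo)
```

The lookup Series is built once and reused for every row, instead of scanning the full mapping table for each accession.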

Summarizing the Data

Number of receptors in each class with at least one structure in the Protein Data Bank:

pdbs_per_class = {}

for IDG_class in IDG_data:
    df = IDG_data[IDG_class]
    num_available = len(df) - sum(df.PDB_IDS.isna())
    pdbs_per_class[IDG_class] = num_available

pdbs_per_class

{'GPCRs': 77, 'ion-channels': 70, 'kinases': 304, 'nuclear-receptors': 41, 'transporters': 15}

Visualizing the Data

Finally, we visualize the results with a pie chart:

import matplotlib.pyplot as plt

width=0.3
labels = ["{}: {}".format(f, n) for f, n in zip(pdbs_per_class.keys(),
          pdbs_per_class.values())]
plt.pie(pdbs_per_class.values(), labels=labels, radius=1,
        wedgeprops=dict(width=width, edgecolor='w'))

plt.show()