<h1>Switching Your GTK Theme Based on Time of Day</h1>
<p>Ricardo Avila, 2020-04-24</p>
<p>A new trend in software UI/UX is the introduction of dark and light options for user interfaces. Windows and macOS have both added support for dark window themes in their latest versions, and many iOS and Android apps offer this feature through their application settings. Many websites also support it through the <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/@media/prefers-color-scheme">prefers-color-scheme</a> CSS media feature, which tells a website the user’s theme preference at the OS level. Linux, naturally, has been able to customize user interface themes for decades; what is new is the focus on interfaces that adapt continuously.</p>
<p>For example, if you are using the GNOME desktop, you may be aware that the default GNOME wallpaper transitions in color and tone throughout the day, providing brighter colors in the morning and gradually shifting to darker tones towards the evening. Furthermore, most operating systems now support “Night light”, a feature that gradually shifts a device’s screen from blue tones to red in order to reduce eye strain at night.</p>
<p>A neat feature missing from most desktop environments is the ability to automatically switch GTK themes, in order to provide a darker environment at night, and improve brightness and contrast during the day. The good news is that in Linux it is not difficult to implement this functionality with systemd timers.</p>
<p><img src="/assets/images/gtk/gtk.png" alt="" title="Dark and light GTK themes in Nautilus" /></p>
<h2 id="cron-jobs-vs-systemd-timers">Cron Jobs vs Systemd Timers</h2>
<p>So, what are systemd timers? Long-time UNIX users will be familiar with cron jobs. Cron is a utility that runs commands on a schedule, typically read from a file stored at: <code class="language-plaintext highlighter-rouge">/etc/crontab</code></p>
<h3 id="an-example-crontab-job">An example crontab job</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
</code></pre></div></div>
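<p>To make the field layout above concrete, here is a minimal Python sketch that splits one job line into its seven fields. The user name, script path, and the <code class="language-plaintext highlighter-rouge">parse_crontab_line</code> helper are hypothetical, and the sketch does not validate ranges or handle comments and environment lines.</p>

```python
from typing import NamedTuple


class CronEntry(NamedTuple):
    """One job line from a system crontab, field by field."""
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    user: str
    command: str


def parse_crontab_line(line: str) -> CronEntry:
    """Split a job line into five time fields, a user name, and a command."""
    # Split on whitespace at most six times so the command keeps its spaces.
    parts = line.split(None, 6)
    if len(parts) != 7:
        raise ValueError("expected 5 time fields, a user name, and a command")
    return CronEntry(*parts)


# Hypothetical job: run a theme-switching script every day at 16:00.
entry = parse_crontab_line("0 16 * * * alice /usr/local/bin/dark-theme.sh --quiet")
print(entry.hour, entry.user, entry.command)
```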
<h3 id="relevant-xkcd">Relevant xkcd</h3>
<p><img src="https://imgs.xkcd.com/comics/cron_mail.png" alt="" title="Cron Mail" /></p>
<blockquote>
<p>“Take THAT, piece of 1980s-era infrastructure I’ve inexplicably maintained on my systems for 15 years despite never really learning how it works.”</p>
</blockquote>
<p>Systemd, on the other hand, is a manager for system processes and services. It has replaced much of the functionality that was previously handled by the UNIX <code class="language-plaintext highlighter-rouge">init</code> daemon, and its timers have several nice features that the cron utility lacks.</p>
<p>The big benefit for our purpose is that with cron, if the computer is powered off at the scheduled time, the job simply does not run. A systemd timer with <code class="language-plaintext highlighter-rouge">Persistent=true</code>, on the other hand, can run a job that it missed the next time the machine powers on.</p>
<p>Other advantages of systemd timers:</p>
<ul>
<li>CPU and memory limits</li>
<li>Randomized scheduling</li>
<li>Jobs can be easily started independently of their timers</li>
<li>Jobs are logged in the systemd journal, which makes debugging easier</li>
</ul>
<h2 id="creating-a-systemd-service">Creating a Systemd Service</h2>
<p>Systemd services have the extension <code class="language-plaintext highlighter-rouge">.service</code>. All user-created systemd units will be stored in <code class="language-plaintext highlighter-rouge">~/.config/systemd/user/</code>. Here are the contents of “dark-theme.service”, our systemd service for switching to the Adwaita-dark GTK theme:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Unit]
Description=Change the GTK theme to dark mode.
After=graphical.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'gsettings set org.gnome.desktop.interface gtk-theme Adwaita-dark'
[Install]
WantedBy=default.target
</code></pre></div></div>
<p>We will need to create a separate “light-theme.service” file to switch to a light GTK theme.</p>
<h2 id="creating-a-systemd-timer">Creating a Systemd Timer</h2>
<p>Now, to create a timer, we make another file in the same directory with the same rootname plus the extension <code class="language-plaintext highlighter-rouge">.timer</code>. In this case, we named our file “dark-theme.timer”. Here are the contents of this file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Unit]
Description=Change the GTK theme daily at a given time.
[Timer]
OnCalendar=*-*-* 16:00:00
Persistent=true
[Install]
WantedBy=timers.target
</code></pre></div></div>
<p>The OnCalendar setting specifies that this particular timer should run every day at 16:00 hrs. You can also create timers that run every other day, or on specific days of the week. For more on this, I have found that the best resource for documentation is the ArchLinux Wiki: <a href="https://wiki.archlinux.org/index.php/Systemd/Timers">https://wiki.archlinux.org/index.php/Systemd/Timers</a>.</p>
<p>We will additionally need to create a “light-theme.timer” to change theme in the morning.</p>
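<p>The light-theme files mirror the dark ones, so they can be generated from the same templates. The following Python sketch does that; the light theme name <code class="language-plaintext highlighter-rouge">Adwaita</code>, the 07:00 switch time, and the <code class="language-plaintext highlighter-rouge">write_theme_units</code> helper are assumptions for illustration. The demo writes to a temporary directory; in practice the target would be <code class="language-plaintext highlighter-rouge">~/.config/systemd/user/</code>.</p>

```python
import tempfile
from pathlib import Path

# Templates copied from the dark-theme unit files shown above.
SERVICE_TEMPLATE = """[Unit]
Description=Change the GTK theme to {mode} mode.
After=graphical.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'gsettings set org.gnome.desktop.interface gtk-theme {theme}'
[Install]
WantedBy=default.target
"""

TIMER_TEMPLATE = """[Unit]
Description=Change the GTK theme daily at a given time.
[Timer]
OnCalendar=*-*-* {time}
Persistent=true
[Install]
WantedBy=timers.target
"""


def write_theme_units(directory, mode, theme, time):
    """Write <mode>-theme.service and <mode>-theme.timer into directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    service = directory / f"{mode}-theme.service"
    timer = directory / f"{mode}-theme.timer"
    service.write_text(SERVICE_TEMPLATE.format(mode=mode, theme=theme))
    timer.write_text(TIMER_TEMPLATE.format(time=time))
    return service, timer


# Demo in a throwaway directory; theme name and time are assumed values.
with tempfile.TemporaryDirectory() as tmp:
    service_path, timer_path = write_theme_units(tmp, "light", "Adwaita", "07:00:00")
    light_service = service_path.read_text()
    light_timer = timer_path.read_text()

print(light_timer)
```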
<h2 id="running-services-as-a-user">Running Services as a User</h2>
<p>When we are done creating the four configuration files (two for each of dark and light), we need to enable the services. It is recommended that we enable them at the user level (since requiring sudo access would be a hassle and a potential security risk).</p>
<p>Enabling the services:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl --user enable dark-theme.service
systemctl --user enable light-theme.service
systemctl --user enable dark-theme.timer
systemctl --user enable light-theme.timer
</code></pre></div></div>
<p>Now the operating system will automatically change the GTK theme at the specified times.
Furthermore, if we wish to manually change the theme, or test that the service works, we may do so using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl --user start light-theme.service
</code></pre></div></div>
<p>or:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl --user start dark-theme.service
</code></pre></div></div>
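<p>The same switch can also be driven from a script, for instance to pick the right theme for the current hour. Below is a minimal Python sketch of that idea. The 07:00 and 16:00 boundaries, the light theme name <code class="language-plaintext highlighter-rouge">Adwaita</code>, and the helper names are assumptions; by default the sketch only builds the <code class="language-plaintext highlighter-rouge">gsettings</code> command and does not run it.</p>

```python
import datetime
import subprocess


def theme_for_hour(hour, day_start=7, night_start=16):
    """Return the light theme during the day and the dark theme otherwise."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return "Adwaita" if day_start <= hour < night_start else "Adwaita-dark"


def gsettings_command(theme):
    """Build the same gsettings call that the service file runs."""
    return ["gsettings", "set", "org.gnome.desktop.interface", "gtk-theme", theme]


def apply_theme(theme, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    command = gsettings_command(theme)
    if not dry_run:
        subprocess.run(command, check=True)
    return command


current_theme = theme_for_hour(datetime.datetime.now().hour)
print(" ".join(apply_theme(current_theme)))
```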
<h2 id="what-can-we-use-this-for">What can we use this for?</h2>
<p>A couple of other applications come to mind:</p>
<ul>
<li>Daily data backups</li>
<li>Daily data ingestion</li>
</ul>
<p>Let me know if you come up with other interesting applications!</p>
<p>Using systemd timers to automatically switch between light/dark GTK themes in a GNOME desktop. This may not be a data science post, but knowing systemd timers can certainly be handy for many applications where we want to run scheduled jobs on a Linux workstation.</p>
<h1>Web-scraping AbeBooks.com (Reverse Engineering a REST API)</h1>
<p>Ricardo Avila, 2020-04-10</p>
<ul>
<li><a href="#motivation">Motivation</a></li>
<li><a href="#rest-apis">REST APIs</a></li>
<li><a href="#exploring-network-packets">Exploring Network Packets</a></li>
<li><a href="#wrapping-the-api-in-python">Wrapping the API in Python</a>
<ul>
<li><a href="#sending-post-requests">Sending POST requests</a></li>
<li><a href="#sending-a-get-request">Sending a GET request</a></li>
</ul>
</li>
<li><a href="#an-object-oriented-module">An Object-Oriented Module</a></li>
<li><a href="#using-the-abebooks-module">Using the AbeBooks Module</a></li>
</ul>
<h2 id="motivation">Motivation</h2>
<p>I have a large collection of electronic books, which I manage using <a href="https://calibre-ebook.com">Calibre</a>. Using Calibre’s “Extract ISBN” plugin, I am able to parse the ISBN identifier from most of my files, which then makes fetching the rest of the metadata very easy. (Below is an example of my library’s metadata.)</p>
<p>Thus, I have access to a very convenient and ever-growing virtual library of books, which I like to use on the go, and for exploratory research. Nevertheless, whenever I find a particularly good book, the thing that I want most, is to own a physical copy.</p>
<p><img src="/assets/images/abebooks/calibre.png" alt="" title="My calibre e-book library" /></p>
<p>Enter AbeBooks.com. Next to Amazon, and occasionally eBay, it is my go-to site for buying cheap used textbooks.
Given that I have stored the ISBN data for most of my electronic books, I would like to be able to automatically fetch pricing information for any book in my virtual library, perhaps even keeping track of changes in price over time.</p>
<p>However, until now, the main problem stopping me from writing a script to do this was that AbeBooks does not have a publicly available API… or at the very least, none that is explicitly documented.</p>
<p><img src="/assets/images/abebooks/abebooks.png" alt="" title="AbeBooks search results page" /></p>
<h2 id="rest-apis">REST APIs</h2>
<p>REST, or Representational State Transfer, is an architectural style for web services, usually implemented over HTTP, that provides interoperability between clients and servers. It is based on a request/response system: a request may carry a “payload”, normally formatted as HTML, XML, or JSON, and the response can be a link to a resource, a data payload in any of the aforementioned formats, or a confirmation that some data was modified on the server.</p>
<p>HTTP defines several request methods: GET, HEAD, POST, PUT, PATCH, DELETE, CONNECT, OPTIONS and TRACE. Among these, the two most common are GET and POST:</p>
<p><strong>GET</strong></p>
<ul>
<li>Used to request data from a server.</li>
<li>Parameter data is stored in the URL of the query as string parameters.</li>
<li>The number of parameters is limited by the length that can fit in the URL.</li>
<li>Not suitable for sensitive information, since parameters are visible in the URL, the browser history, and server logs.</li>
</ul>
<p><strong>POST</strong></p>
<ul>
<li>Used to submit data to a server, and can modify server contents.</li>
<li>Parameters are passed in the message body, rather than the URL.</li>
<li>It has no restrictions on the number of parameters.</li>
<li>Better suited to sensitive information, since parameters stay out of the URL (the connection should still use HTTPS).</li>
</ul>
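<p>The difference in where the parameters travel can be seen without sending anything over the network. The sketch below builds one GET and one POST request with Python’s standard <code class="language-plaintext highlighter-rouge">urllib</code> (used here only so that nothing is actually sent) and prints where the parameters end up. The <code class="language-plaintext highlighter-rouge">example.com</code> URL is a placeholder, not a real service.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"action": "getPricingDataByISBN", "isbn": 9781250297662}
base_url = "https://example.com/api"  # placeholder endpoint, never contacted

# GET: the parameters are appended to the URL as a query string.
get_request = Request(f"{base_url}?{urlencode(params)}", method="GET")

# POST: the same parameters are form-encoded into the message body.
post_request = Request(base_url, data=urlencode(params).encode("ascii"), method="POST")

print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```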
<h2 id="exploring-network-packets">Exploring Network Packets</h2>
<p>I found that inspecting the network packets for an AbeBooks search results page is simple, and yields promising results. If we open Firefox’s developer tools, under the Network tab, we can see a list of all the packets that are loaded. In particular we are interested in those that have a JSON response, highlighted in red below:</p>
<p><img src="/assets/images/abebooks/packets.png" alt="" title="Network packet inspection in Firefox " /></p>
<p>We can see that there are four POST requests to a service called “pricingservice”, and one GET request to a “RecommendationsApi”.</p>
<p>If we look more closely at one of the POST requests, we can see which parameters it takes in:</p>
<p><img src="/assets/images/abebooks/params.png" alt="" title="POST request parameters" /></p>
<p>ISBN! Just what we needed! Furthermore, looking at the response tab, we can see that this request returns the prices for new and used books, among other things:</p>
<p><img src="/assets/images/abebooks/response.png" alt="" title="Example POST response" /></p>
<h2 id="wrapping-the-api-in-python">Wrapping the API in Python</h2>
<p>Now that we know a bit more about how AbeBooks works under the hood, we can start implementing our API wrapper in Python. We will need the requests module:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">requests</span></code></pre></figure>
<h3 id="sending-post-requests">Sending POST requests</h3>
<p>The first REST method that we will implement is the POST method that fetches
prices for a given book. From inspecting the page elements, we know that the URL
for this service is:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">url</span> <span class="o">=</span> <span class="s">"https://www.abebooks.com/servlet/DWRestService/pricingservice"</span></code></pre></figure>
<p>There seem to be three main parameter groups, and we can infer their purpose. (Parameters shown in bold below are to be replaced by user values)</p>
<p>Searching prices by ISBN:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Parameter</th>
<th style="text-align: left">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">action</td>
<td style="text-align: left">getPricingDataByISBN</td>
</tr>
<tr>
<td style="text-align: left">isbn</td>
<td style="text-align: left"><strong>isbn</strong></td>
</tr>
<tr>
<td style="text-align: left">container</td>
<td style="text-align: left">pricingService-<strong>isbn</strong></td>
</tr>
</tbody>
</table>
<p>Searching prices by title and author:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Parameter</th>
<th style="text-align: left">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">action</td>
<td style="text-align: left">getPricingDataForAuthorTitleStandardAddToBasket</td>
</tr>
<tr>
<td style="text-align: left">an</td>
<td style="text-align: left"><strong>author</strong></td>
</tr>
<tr>
<td style="text-align: left">tn</td>
<td style="text-align: left"><strong>title</strong></td>
</tr>
<tr>
<td style="text-align: left">container</td>
<td style="text-align: left">oe-search-all</td>
</tr>
</tbody>
</table>
<p>Searching prices by title, author, and hardcover/softcover binding:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Parameter</th>
<th style="text-align: left">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">action</td>
<td style="text-align: left">getPricingDataForAuthorTitleBindingRefinements</td>
</tr>
<tr>
<td style="text-align: left">isbn</td>
<td style="text-align: left">9781250297662</td>
</tr>
<tr>
<td style="text-align: left">an</td>
<td style="text-align: left"><strong>author</strong></td>
</tr>
<tr>
<td style="text-align: left">tn</td>
<td style="text-align: left"><strong>title</strong></td>
</tr>
<tr>
<td style="text-align: left">container</td>
<td style="text-align: left"><strong>priced-from-soft</strong> OR <strong>priced-from-hard</strong></td>
</tr>
</tbody>
</table>
<p>The parameters can be stored as a dictionary and passed to the
<code class="language-plaintext highlighter-rouge">post</code> function of the requests module. For example:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#- Search prices by ISBN
</span><span class="n">payload1</span> <span class="o">=</span> <span class="p">{</span><span class="s">'action'</span><span class="p">:</span> <span class="s">'getPricingDataByISBN'</span><span class="p">,</span>
<span class="s">'isbn'</span><span class="p">:</span> <span class="mi">9781250297662</span><span class="p">,</span>
<span class="s">'container'</span><span class="p">:</span> <span class="s">'pricingService-9781250297662'</span><span class="p">}</span>
<span class="c1">#- Search prices by author and title
</span><span class="n">payload2</span> <span class="o">=</span> <span class="p">{</span><span class="s">'action'</span><span class="p">:</span> <span class="s">'getPricingDataForAuthorTitleStandardAddToBasket'</span><span class="p">,</span>
<span class="s">'an'</span><span class="p">:</span> <span class="s">'liu ken'</span><span class="p">,</span>
<span class="s">'tn'</span><span class="p">:</span> <span class="s">'broken stars'</span><span class="p">,</span>
<span class="s">'container'</span><span class="p">:</span> <span class="s">'oe-search-all'</span><span class="p">}</span>
<span class="c1">#- Sending a request
</span><span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">payload1</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">resp</span><span class="p">.</span><span class="n">status_code</span><span class="p">,</span> <span class="n">resp</span><span class="p">.</span><span class="n">reason</span><span class="p">)</span>
<span class="n">resp</span><span class="p">.</span><span class="n">json</span><span class="p">()</span></code></pre></figure>
<p>The response is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>200 OK
{'errorTexts': [None],
'errorCodes': [None],
'success': True,
'newExists': True,
'usedExists': True,
'pricingInfoForBestNew': {'bestListingid': 30410510568,
'totalResults': 16,
'bestPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 7.26',
'bestPriceInSurferCurrencyWithCurrencySymbol': 'US$ 7.26',
'domesticShippingPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 4.50',
'shippingToDestinationPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 6.00',
'shippingToDestinationPriceInSurferCurrencyWithCurrencySymbol': 'US$ 6.00',
'shippingDestinationNameInSurferLanguage': 'U.S.A.',
'vendorCountryNameInSurferLanguage': 'Canada',
'vendorId': 71361,
'bestPriceInPurchaseCurrencyValueOnly': '7.26',
'bestShippingToDestinationPriceInPurchaseCurrencyValueOnly': '6.0',
'listingCurrencySymbol': 'US$',
'purchaseCurrencySymbol': 'US$',
'nonPaddedPriceInListingCurrencyValueOnly': '7.26',
'refinementList': None,
'internationalEdition': False,
'bookCondition': 'New',
'bookDescription': 'Hardcover. Publisher overstock,...',
'freeShipping': False},
'pricingInfoForBestUsed': {'bestListingid': 30529767259,
'totalResults': 8,
'bestPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 6.55',
'bestPriceInSurferCurrencyWithCurrencySymbol': 'US$ 6.55',
'domesticShippingPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 3.99',
'shippingToDestinationPriceInPurchaseCurrencyWithCurrencySymbol': 'US$ 3.99',
'shippingToDestinationPriceInSurferCurrencyWithCurrencySymbol': 'US$ 3.99',
'shippingDestinationNameInSurferLanguage': 'U.S.A.',
'vendorCountryNameInSurferLanguage': 'U.S.A.',
'vendorId': 71597499,
'bestPriceInPurchaseCurrencyValueOnly': '6.55',
'bestShippingToDestinationPriceInPurchaseCurrencyValueOnly': '3.99',
'listingCurrencySymbol': 'US$',
'purchaseCurrencySymbol': 'US$',
'nonPaddedPriceInListingCurrencyValueOnly': '6.55',
'refinementList': None,
'internationalEdition': False,
'bookCondition': 'As New',
'bookDescription': 'Like brand new book.',
'freeShipping': False},
'pricingInfoForBestAllConditions': None,
'isbn': '9781250297662',
'totalResults': 24,
'containerId': 'pricingService-9781250297662',
'refinementList': [{'name': 'collectibleJacket',
'label': 'Dust Jacket',
'count': 2,
'url': 'dj=on&isbn=9781250297662&sortby=17'},
{'name': 'freeShipping',
'label': 'Free US Shipping',
'count': 9,
'url': 'isbn=9781250297662&n=100046078&sortby=17'},
{'name': 'bindingHard',
'label': 'Hardcover',
'count': 23,
'url': 'bi=h&isbn=9781250297662&sortby=17'},
{'name': 'collectibleFirstEdition',
'label': 'First Edition',
'count': 3,
'url': 'fe=on&isbn=9781250297662&sortby=17'}],
'bibliographicDetail': {'author': '', 'title': ''}}
</code></pre></div></div>
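<p>Since the value-only fields in this response are strings, a little post-processing is needed before comparing offers. Here is a minimal sketch that adds price and shipping for the best new and best used listings. The <code class="language-plaintext highlighter-rouge">total_cost</code> helper is hypothetical, the sample dictionary is a trimmed copy of the response above, and treating a missing listing as <code class="language-plaintext highlighter-rouge">None</code> is an assumption based on the <code class="language-plaintext highlighter-rouge">None</code> values seen in the output.</p>

```python
from decimal import Decimal


def total_cost(pricing_info):
    """Return price plus shipping for one listing, or None if there is none."""
    if pricing_info is None:
        return None
    price = Decimal(pricing_info["bestPriceInPurchaseCurrencyValueOnly"])
    shipping = Decimal(
        pricing_info["bestShippingToDestinationPriceInPurchaseCurrencyValueOnly"])
    return price + shipping


# Trimmed copy of the response shown above.
sample = {
    "success": True,
    "pricingInfoForBestNew": {
        "bestPriceInPurchaseCurrencyValueOnly": "7.26",
        "bestShippingToDestinationPriceInPurchaseCurrencyValueOnly": "6.0",
    },
    "pricingInfoForBestUsed": {
        "bestPriceInPurchaseCurrencyValueOnly": "6.55",
        "bestShippingToDestinationPriceInPurchaseCurrencyValueOnly": "3.99",
    },
    "pricingInfoForBestAllConditions": None,
}

new_total = total_cost(sample["pricingInfoForBestNew"])
used_total = total_cost(sample["pricingInfoForBestUsed"])
print("new:", new_total, "used:", used_total)
```

<p>Decimal is used instead of float so that the sums stay exact to the cent.</p>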
<h3 id="sending-a-get-request">Sending a GET request</h3>
<p>The API also has a GET method for obtaining book recommendations given an ISBN.
The URL and parameter names are different, but the way we send the request is very similar:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">url</span> <span class="o">=</span> <span class="s">"https://www.abebooks.com/servlet/RecommendationsApi"</span></code></pre></figure>
<table>
<thead>
<tr>
<th style="text-align: left">Parameter</th>
<th style="text-align: left">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">pageId</td>
<td style="text-align: left">plp</td>
</tr>
<tr>
<td style="text-align: left">itemIsbn13</td>
<td style="text-align: left"><strong>isbn</strong></td>
</tr>
</tbody>
</table>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#- Get book recommendations by ISBN
</span><span class="n">payload</span> <span class="o">=</span> <span class="p">{</span><span class="s">'pageId'</span><span class="p">:</span> <span class="s">'plp'</span><span class="p">,</span>
<span class="s">'itemIsbn13'</span><span class="p">:</span> <span class="mi">9781250297662</span><span class="p">}</span>
<span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">resp</span><span class="p">.</span><span class="n">status_code</span><span class="p">,</span> <span class="n">resp</span><span class="p">.</span><span class="n">reason</span><span class="p">)</span>
<span class="n">resp</span><span class="p">.</span><span class="n">json</span><span class="p">()</span></code></pre></figure>
<p>Response:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>200 OK
{'widgetResponses': [{'slotName': 'detail-1',
'title': 'Customers who bought this item also bought',
'algoName': 'abeBooksBlendedPurchaseSims',
'ref': 'pd_b_p_1',
'recommendationItems': [{'attributes': [],
'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780765384201-us-300.jpg',
'itemLink': '/products/isbn/9780765384201?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
'subTitle': None,
'isbn13': '9780765384201',
'title': 'Invisible Planets: Contemporary Chinese Science Fiction...',
'author': 'Liu, Ken'},
{'attributes': [],
'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9781250306029-us-300.jpg',
'itemLink': '/products/isbn/9781250306029?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
'subTitle': None,
'isbn13': '9781250306029',
'title': 'The Redemption of Time: A Three-Body Problem Novel...',
'author': 'Baoshu'},
{'attributes': [],
'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780765389312-us-300.jpg',
'itemLink': '/products/isbn/9780765389312?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
'subTitle': None,
'isbn13': '9780765389312',
'title': 'Waste Tide',
'author': 'Qiufan, Chen'},
{'attributes': [],
'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780765384195-us-300.jpg',
'itemLink': '/products/isbn/9780765384195?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
'subTitle': None,
'isbn13': '9780765384195',
'title': 'Invisible Planets: Contemporary Chinese Science Fiction...',
'author': 'Liu, Ken'},
{'attributes': [],
'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9781784978518-us-300.jpg',
'itemLink': '/products/isbn/9781784978518?cm_sp=rec-_-pd_b_p_1-_-plp&reftag=pd_b_p_1',
'subTitle': None,
'isbn13': '9781784978518',
'title': 'The Wandering Earth',
'author': 'Liu, Cixin'}]},
{'slotName': 'ext-search-detail-1',
'title': None,
'algoName': 'heroWidgetIsbnSims',
'ref': 'pd_hw_i_1',
'recommendationItems': [{'attributes': [],
'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9780804172448-us-300.jpg',
'itemLink': '/products/isbn/9780804172448?cm_sp=rec-_-pd_hw_i_1-_-plp&reftag=pd_hw_i_1',
'subTitle': 'Best Selling',
'isbn13': '9780804172448',
'title': 'Station Eleven',
'author': 'Mandel, Emily St. John'},
{'attributes': [],
'thumbNailImgUrl': 'https://pictures.abebooks.com/isbn/9781786073495-us-300.jpg',
'itemLink': '/products/isbn/9781786073495?cm_sp=rec-_-pd_hw_i_1-_-plp&reftag=pd_hw_i_1',
'subTitle': 'Top Rated',
'isbn13': '9781786073495',
'title': 'Zuleikha',
'author': 'Yakhina, Guzel'}]}]}
</code></pre></div></div>
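<p>The recommendations are nested two levels deep, under <code class="language-plaintext highlighter-rouge">widgetResponses</code> and then <code class="language-plaintext highlighter-rouge">recommendationItems</code>. A short sketch for flattening them into a list follows; the <code class="language-plaintext highlighter-rouge">list_recommendations</code> helper is hypothetical and the sample is a trimmed copy of the response above.</p>

```python
def list_recommendations(response):
    """Flatten the widget structure into (isbn13, title, author) tuples."""
    items = []
    for widget in response.get("widgetResponses", []):
        for item in widget.get("recommendationItems", []):
            items.append((item["isbn13"], item["title"], item["author"]))
    return items


# Trimmed copy of the response shown above.
sample = {
    "widgetResponses": [
        {"slotName": "detail-1",
         "recommendationItems": [
             {"isbn13": "9780765389312", "title": "Waste Tide",
              "author": "Qiufan, Chen"},
             {"isbn13": "9781784978518", "title": "The Wandering Earth",
              "author": "Liu, Cixin"}]},
        {"slotName": "ext-search-detail-1",
         "recommendationItems": [
             {"isbn13": "9780804172448", "title": "Station Eleven",
              "author": "Mandel, Emily St. John"}]},
    ]
}

recommendations = list_recommendations(sample)
for isbn13, title, author in recommendations:
    print(isbn13, title, "by", author)
```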
<h2 id="an-object-oriented-module">An Object-Oriented Module</h2>
<p>I created a small Python module <a href="https://github.com/ravila4/abebooks">abebooks.py</a> to encapsulate the requests. The full code is below:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">requests</span>


<span class="k">class</span> <span class="nc">AbeBooks</span><span class="p">:</span>

    <span class="k">def</span> <span class="nf">__get_price</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">payload</span><span class="p">):</span>
        <span class="n">url</span> <span class="o">=</span> <span class="s">"https://www.abebooks.com/servlet/DWRestService/pricingservice"</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">post</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
        <span class="n">resp</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">resp</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">__get_recomendations</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">payload</span><span class="p">):</span>
        <span class="n">url</span> <span class="o">=</span> <span class="s">"https://www.abebooks.com/servlet/RecommendationsApi"</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">params</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
        <span class="n">resp</span><span class="p">.</span><span class="n">raise_for_status</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">resp</span><span class="p">.</span><span class="n">json</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">getPriceByISBN</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">isbn</span><span class="p">):</span>
        <span class="s">"""
        Parameters
        ----------
        isbn (int) - a book's ISBN code
        """</span>
        <span class="n">payload</span> <span class="o">=</span> <span class="p">{</span><span class="s">'action'</span><span class="p">:</span> <span class="s">'getPricingDataByISBN'</span><span class="p">,</span>
                   <span class="s">'isbn'</span><span class="p">:</span> <span class="n">isbn</span><span class="p">,</span>
                   <span class="s">'container'</span><span class="p">:</span> <span class="s">'pricingService-{}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">isbn</span><span class="p">)}</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">__get_price</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">getPriceByAuthorTitle</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">author</span><span class="p">,</span> <span class="n">title</span><span class="p">):</span>
        <span class="s">"""
        Parameters
        ----------
        author (str) - book author
        title (str) - book title
        """</span>
        <span class="n">payload</span> <span class="o">=</span> <span class="p">{</span><span class="s">'action'</span><span class="p">:</span> <span class="s">'getPricingDataForAuthorTitleStandardAddToBasket'</span><span class="p">,</span>
                   <span class="s">'an'</span><span class="p">:</span> <span class="n">author</span><span class="p">,</span>
                   <span class="s">'tn'</span><span class="p">:</span> <span class="n">title</span><span class="p">,</span>
                   <span class="s">'container'</span><span class="p">:</span> <span class="s">'oe-search-all'</span><span class="p">}</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">__get_price</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">getPriceByAuthorTitleBinding</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">author</span><span class="p">,</span> <span class="n">title</span><span class="p">,</span> <span class="n">binding</span><span class="p">):</span>
        <span class="s">"""
        Parameters
        ----------
        author (str) - book author
        title (str) - book title
        binding(str) - one of 'hard', or 'soft'
        """</span>
        <span class="k">if</span> <span class="n">binding</span> <span class="o">==</span> <span class="s">"hard"</span><span class="p">:</span>
            <span class="n">container</span> <span class="o">=</span> <span class="s">"priced-from-hard"</span>
        <span class="k">elif</span> <span class="n">binding</span> <span class="o">==</span> <span class="s">"soft"</span><span class="p">:</span>
            <span class="n">container</span> <span class="o">=</span> <span class="s">"priced-from-soft"</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span>
                <span class="s">'Invalid parameter. Binding must be "hard" or "soft"'</span><span class="p">)</span>
        <span class="n">payload</span> <span class="o">=</span> <span class="p">{</span><span class="s">'action'</span><span class="p">:</span> <span class="s">'getPricingDataForAuthorTitleBindingRefinements'</span><span class="p">,</span>
                   <span class="s">'an'</span><span class="p">:</span> <span class="n">author</span><span class="p">,</span>
                   <span class="s">'tn'</span><span class="p">:</span> <span class="n">title</span><span class="p">,</span>
                   <span class="s">'container'</span><span class="p">:</span> <span class="n">container</span><span class="p">}</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">__get_price</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">getRecommendationsByISBN</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">isbn</span><span class="p">):</span>
        <span class="s">"""
        Parameters
        ----------
        isbn (int) - a book's ISBN code
        """</span>
        <span class="n">payload</span> <span class="o">=</span> <span class="p">{</span><span class="s">'pageId'</span><span class="p">:</span> <span class="s">'plp'</span><span class="p">,</span>
                   <span class="s">'itemIsbn13'</span><span class="p">:</span> <span class="n">isbn</span><span class="p">}</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">__get_recomendations</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span></code></pre></figure>
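<p>To look up many books in one go, such as every ISBN in a Calibre library, the calls can be wrapped in a loop that pauses between requests and records failures instead of stopping. This is only a sketch: the <code class="language-plaintext highlighter-rouge">fetch_prices</code> helper, the one-second default delay, and the <code class="language-plaintext highlighter-rouge">error</code> key are assumptions, and the demo uses a stand-in function so that no request is sent.</p>

```python
import time


def fetch_prices(isbns, fetch, delay=1.0):
    """Call fetch(isbn) for each ISBN, pausing between calls.

    Failures are recorded per ISBN instead of aborting the whole batch.
    """
    results = {}
    for index, isbn in enumerate(isbns):
        if index > 0:
            time.sleep(delay)  # be gentle with an undocumented service
        try:
            results[isbn] = fetch(isbn)
        except Exception as error:
            results[isbn] = {"success": False, "error": str(error)}
    return results


def fake_fetch(isbn):
    """Stand-in for a real lookup so the demo runs offline."""
    if isbn == 0:
        raise ValueError("invalid ISBN")
    return {"success": True, "isbn": str(isbn)}


prices = fetch_prices([9781250297662, 0], fake_fetch, delay=0)
print(prices)
```

<p>With the module above, the stand-in would be replaced by the <code class="language-plaintext highlighter-rouge">getPriceByISBN</code> method of an <code class="language-plaintext highlighter-rouge">AbeBooks</code> instance.</p>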
<h2 id="using-the-abebooks-module">Using the AbeBooks Module</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">abebooks</span> <span class="kn">import</span> <span class="n">AbeBooks</span>
<span class="n">ab</span> <span class="o">=</span> <span class="n">AbeBooks</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">ab</span><span class="p">.</span><span class="n">getPriceByISBN</span><span class="p">(</span><span class="mi">9780062941503</span><span class="p">)</span>
<span class="k">if</span> <span class="n">results</span><span class="p">[</span><span class="s">'success'</span><span class="p">]:</span>
    <span class="n">best_new</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="s">'pricingInfoForBestNew'</span><span class="p">]</span>
    <span class="n">best_used</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="s">'pricingInfoForBestUsed'</span><span class="p">]</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#- Best New Price
</span><span class="k">print</span><span class="p">(</span><span class="n">best_new</span><span class="p">[</span><span class="s">'bestPriceInPurchaseCurrencyWithCurrencySymbol'</span><span class="p">])</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>US$ 21.49
</code></pre></div></div>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1">#- Best Used Price
</span><span class="k">print</span><span class="p">(</span><span class="n">best_used</span><span class="p">[</span><span class="s">'bestPriceInPurchaseCurrencyWithCurrencySymbol'</span><span class="p">])</span></code></pre></figure>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>US$ 24.42
</code></pre></div></div>Ricardo Avilaravila@protonmail.comMany times we are faced with obtaining data from websites that do not have a documented REST API. In this post, I analyze POST and GET requests from AbeBooks's network packets, and build a Python API wrapper for programmatically obtaining book prices and recommendations.Interweaving R and Python with Reticulate2019-05-04T00:00:00-06:002019-05-04T00:00:00-06:00https://ravilabio.info/2019/05/04/interweaving-r-and-python<p>Python is my favorite language for data manipulation, but every once in a while, I find an R library that I absolutely need to try out.
I wish I could have the best of both worlds. Unfortunately, I had not found a good solution until recently, when I tried out RStudio and the <a href="https://rstudio.github.io/reticulate/">Reticulate R package</a>, and the combination is awesome!</p>
<p>With Reticulate and the new version of RStudio (RStudio 1.2), you can create Python code chunks that have a persistent environment across them within a single Rmarkdown document.
This turns RStudio into a powerful alternative to the popular <a href="https://jupyter.org/">Jupyter</a> notebook for Python development.</p>
<h2 id="a-simple-demonstration">A simple demonstration:</h2>
<p>R code:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Loading the Reticulate library in RStudio</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">reticulate</span><span class="p">)</span></code></pre></figure>
<p>Now some Python:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># Creating a couple of simple arrays to plot
</span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">])</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># Displaying a python plot
</span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/2019-5-4-using-reticulate-to-interweave-r-and-python-01.png" alt="image" /></p>
<p>Furthermore, you can access these same Python objects from inside an R code cell, so now, you can finally have the best of both worlds!</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Plotting the same arrays in R! So simple!</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">py</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">py</span><span class="o">$</span><span class="n">y</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/images/2019-5-4-using-reticulate-to-interweave-r-and-python-02.png" alt="image" /></p>
<p>I normally do most of my coding in Vim, or Jupyter notebooks, but after discovering this package, I think I will be using RStudio a lot more often for Python + R programming.</p>
<p>Previously, my attempts at combining Python and R code involved using the Python <a href="https://pypi.org/project/rpy2/">rpy2</a> library to call R code within Python, but this approach always felt cumbersome at best. By comparison, Reticulate makes the transition feel smooth and natural, effectively marrying the powerful libraries of R and Python.</p>Ricardo Avilaravila@protonmail.comI love Python for general programming and data manipulation... but R has amazing statistical libraries. Do you also wish you could combine both? Here is a small demonstration of how to access Python objects from R using the Reticulate library.Machine Learning Methods for LogP Prediction: Pt. 12019-03-07T00:00:00-07:002019-03-07T00:00:00-07:00https://ravilabio.info/2019/03/07/machine-learning-log-p-prediction-1<ul>
<li><a href="#reading-experimetal-logp-data">Reading experimental logP data</a></li>
<li><a href="#model-with-simple-descriptors">Model with simple descriptors</a></li>
<li><a href="#calculating-fingerprints">Calculating fingerprints</a></li>
<li><a href="#comparing-fingerprint-models">Comparing fingerprint models</a></li>
</ul>
<p>The octanol-water partition coefficient, or logP, is one of the most
important properties for determining a compound’s suitability as a drug.
Currently, most of the available regression models for <em>in silico</em> logP
prediction are trained on the <a href="https://www.srcinc.com/what-we-do/environmental/scientific-databases.html">PHYSPROP</a> database of experimental logP
values. However, most of the compounds in this database are not highly
representative of the drug-like chemical space. Unfortunately, there is
currently a lack of publicly available experimental logP datasets for biological compounds which
can be used to train better prediction tools.</p>
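<p>As a quick reminder of what the number means, logP is the base-10 logarithm of the ratio between a compound's equilibrium concentration in octanol and its concentration in water. The sketch below is only an illustration: the helper name <code class="language-plaintext highlighter-rouge">log_p</code> and the concentrations are hypothetical and are not part of the original analysis.</p>

```python
import math


def log_p(conc_octanol, conc_water):
    """Return log10 of the octanol/water concentration ratio.

    Both concentrations must be positive and expressed in the same units.
    """
    if not (conc_octanol > 0 and conc_water > 0):
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# Hypothetical compound that is 100 times more concentrated in octanol:
# a positive logP indicates a lipophilic compound.
lipophilic = log_p(100.0, 1.0)

# Hypothetical compound that is 10 times more concentrated in water:
# a negative logP indicates a hydrophilic compound.
hydrophilic = log_p(1.0, 10.0)

print(lipophilic, hydrophilic)
```

<p>Because the scale is logarithmic, a prediction error of one logP unit corresponds to a tenfold error in the concentration ratio.</p>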
<p>In this small test, I have decided to use the experimental logP data released
in the paper: “Large, chemically diverse dataset of logP measurements for
benchmarking studies” by Martel et al<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. As this is a preliminary study, we are interested in finding which featurization methods work best for predicting logP.</p>
<p>Most of the popular tools for logP prediction are based on
physical descriptors, such as atom type counts, or polar surface area,
or on topological descriptors. Here, we will calculate different physical
descriptors, as well as structural fingerprints for the molecules, and
benchmark their performance using three different regression models:
neural network, random forest, and support vector machines.</p>
<p>We first import some libraries, including RDKit and scikit-learn tools
(the utility script contains custom functions for generating TPATF and TPAPF
fingerprints):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">from</span> <span class="nn">rdkit</span> <span class="kn">import</span> <span class="n">Chem</span>
<span class="kn">from</span> <span class="nn">rdkit.Chem</span> <span class="kn">import</span> <span class="n">AllChem</span>
<span class="kn">from</span> <span class="nn">rdkit.Chem</span> <span class="kn">import</span> <span class="n">Descriptors</span>
<span class="kn">from</span> <span class="nn">utility</span> <span class="kn">import</span> <span class="n">FeatureGenerator</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">r2_score</span><span class="p">,</span> <span class="n">mean_squared_error</span><span class="p">,</span> <span class="n">mean_absolute_error</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestRegressor</span>
<span class="kn">from</span> <span class="nn">sklearn.neural_network</span> <span class="kn">import</span> <span class="n">MLPRegressor</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVR</span></code></pre></figure>
<p>The utility script can be found in this <a href="https://gist.github.com/ravila4/a18bb9d0e3f54de3c9fb8f75908f992f">Gist</a>.</p>
<h2 id="reading-experimetal-logp-data">Reading experimental logP data</h2>
<p>The supplementary PDF file from the Martel et al. paper was converted to CSV text format using the Linux <a href="https://linux.die.net/man/1/pdftotext">pdftotext</a>
utility from the Poppler library.
The experimental data is read as a CSV file, and the SMILES strings are
converted to RDKit molecules.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"training_data/logp_759_data.csv"</span><span class="p">)</span>
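<span class="c1"># Copy the filtered rows so that columns can be added later without a SettingWithCopyWarning</span>
<span class="n">data_logp</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">.</span><span class="n">Status</span> <span class="o">==</span> <span class="s">"Validated"</span><span class="p">].</span><span class="n">copy</span><span class="p">()</span>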
<span class="k">print</span><span class="p">(</span><span class="s">"Shape:"</span><span class="p">,</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">Shape: (707, 7)</code></p>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>ID</th>
<th>ZINC (2010)</th>
<th>Status</th>
<th>Supplier</th>
<th>SMILES</th>
<th>logPexp</th>
<th>pH_of_analysis</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>ZINC00036522</td>
<td>Validated</td>
<td>Specs</td>
<td>Cc1cc2c(cc1C)NC(=O)C[C@H]2c3ccccc3OC</td>
<td>4.17</td>
<td>5.0</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>ZINC00185379</td>
<td>Validated</td>
<td>ChemBridge</td>
<td>COc1ccc2c(c1)O[C@@](CC2=O)(C(F)(F)F)O</td>
<td>2.79</td>
<td>5.0</td>
</tr>
<tr>
<th>2</th>
<td>4</td>
<td>ZINC12402487</td>
<td>Validated</td>
<td>ChemBridge</td>
<td>CC1(O[C@H]([C@H](O1)C(=O)N)C(=O)N)C(C)(C)C</td>
<td>1.60</td>
<td>6.5</td>
</tr>
<tr>
<th>3</th>
<td>5</td>
<td>ZINC00055459</td>
<td>Validated</td>
<td>Specs</td>
<td>CCOc1cc(cc(c1OCC)OCC)c2nnc(o2)c3ccco3</td>
<td>3.96</td>
<td>10.5</td>
</tr>
<tr>
<th>4</th>
<td>6</td>
<td>ZINC00056871</td>
<td>Validated</td>
<td>Enamine</td>
<td>CN(C)c1ccc(cc1)C(=C)c2ccc(cc2)N(C)C</td>
<td>5.30</td>
<td>7.3</td>
</tr>
</tbody>
</table>
</div>
<p>Convert SMILES to 2D molecules:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">molecules</span> <span class="o">=</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">SMILES</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Chem</span><span class="p">.</span><span class="n">MolFromSmiles</span><span class="p">)</span></code></pre></figure>
<p>Next, we use RDKit to calculate some physical descriptors:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'MolLogP'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">MolLogP</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'HeavyAtomCount'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">HeavyAtomCount</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'HAccept'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumHAcceptors</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'Heteroatoms'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumHeteroatoms</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'HDonor'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumHDonors</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'MolWt'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">MolWt</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'RotableBonds'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumRotatableBonds</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'RingCount'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">RingCount</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'Ipc'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">Ipc</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'HallKierAlpha'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">HallKierAlpha</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'NumValenceElectrons'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumValenceElectrons</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'SaturatedRings'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumSaturatedRings</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'AliphaticRings'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumAliphaticRings</span><span class="p">)</span>
<span class="n">data_logp</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">'AromaticRings'</span><span class="p">]</span> <span class="o">=</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">Descriptors</span><span class="p">.</span><span class="n">NumAromaticRings</span><span class="p">)</span></code></pre></figure>
<p>As a baseline, we calculate the performance of RDKit’s calculated MolLogP vs
the experimental logP.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">r2</span> <span class="o">=</span> <span class="n">r2_score</span><span class="p">(</span><span class="n">data_logp</span><span class="p">.</span><span class="n">logPexp</span><span class="p">,</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">MolLogP</span><span class="p">)</span>
<span class="n">mse</span> <span class="o">=</span> <span class="n">mean_squared_error</span><span class="p">(</span><span class="n">data_logp</span><span class="p">.</span><span class="n">logPexp</span><span class="p">,</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">MolLogP</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">data_logp</span><span class="p">.</span><span class="n">logPexp</span><span class="p">,</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">MolLogP</span><span class="p">,</span>
<span class="n">label</span> <span class="o">=</span> <span class="s">"MSE: {:.2f}</span><span class="se">\n</span><span class="s">R^2: {:.2f}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">mse</span><span class="p">,</span> <span class="n">r2</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/2019-3-7-machine-learning-methods-for-log-p-prediction-1_13_0.svg" alt="svg" /></p>
<p>As we can see above, RDKit’s logP predictions have a relatively high mean
squared error and a weak coefficient of determination for this dataset. RDKit’s MolLogP implementation is based on atomic contributions. As a first alternative, we will
try to train our own simple logP model using the RDKit physical descriptors
that we generated above.</p>
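<p>For readers who want to see what the two metrics measure, here is a minimal pure-Python sketch of the standard definitions of mean squared error and the coefficient of determination. The function names <code class="language-plaintext highlighter-rouge">mse_plain</code> and <code class="language-plaintext highlighter-rouge">r2_plain</code> and the toy values are hypothetical; the analysis itself uses the scikit-learn functions imported earlier.</p>

```python
def mse_plain(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


def r2_plain(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after using the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not taken from the logP dataset
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print("MSE: {:.3f}".format(mse_plain(observed, predicted)))
print("R^2: {:.3f}".format(r2_plain(observed, predicted)))
```

<p>An R<sup>2</sup> of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.</p>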
<h2 id="model-with-simple-descriptors">Model with simple descriptors</h2>
<p>These are the descriptors that we will use for the model:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">X</span> <span class="o">=</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">8</span><span class="p">:]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">logPexp</span>
<span class="n">X</span><span class="p">.</span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>HeavyAtomCount</th>
<th>HAccept</th>
<th>Heteroatoms</th>
<th>HDonor</th>
<th>MolWt</th>
<th>RotableBonds</th>
<th>RingCount</th>
<th>Ipc</th>
<th>HallKierAlpha</th>
<th>NumValenceElectrons</th>
<th>SaturatedRings</th>
<th>AliphaticRings</th>
<th>AromaticRings</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>21</td>
<td>2</td>
<td>3</td>
<td>1</td>
<td>281.355</td>
<td>2</td>
<td>3</td>
<td>69759.740168</td>
<td>-2.29</td>
<td>108</td>
<td>0</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<th>1</th>
<td>18</td>
<td>4</td>
<td>7</td>
<td>1</td>
<td>262.183</td>
<td>1</td>
<td>2</td>
<td>7977.096898</td>
<td>-1.76</td>
<td>98</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<th>2</th>
<td>16</td>
<td>4</td>
<td>6</td>
<td>2</td>
<td>230.264</td>
<td>2</td>
<td>1</td>
<td>2165.098769</td>
<td>-1.14</td>
<td>92</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<th>3</th>
<td>25</td>
<td>7</td>
<td>7</td>
<td>0</td>
<td>344.367</td>
<td>8</td>
<td>3</td>
<td>819166.201010</td>
<td>-2.96</td>
<td>132</td>
<td>0</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<th>4</th>
<td>20</td>
<td>2</td>
<td>2</td>
<td>0</td>
<td>266.388</td>
<td>4</td>
<td>2</td>
<td>32168.378171</td>
<td>-2.22</td>
<td>104</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
</tbody>
</table>
</div>
<p>For the regression, we will use a Random Forest from scikit-learn with 100 trees
and otherwise default parameters, and set aside one third of the data for testing.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.33</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">{</span><span class="s">"rf"</span><span class="p">:</span> <span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)}</span>
<span class="n">scores</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">:</span>
<span class="n">models</span><span class="p">[</span><span class="n">m</span><span class="p">].</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">scores</span><span class="p">[</span><span class="n">m</span> <span class="o">+</span> <span class="s">"_train"</span><span class="p">]</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="n">m</span><span class="p">].</span><span class="n">score</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="n">m</span><span class="p">].</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">scores</span><span class="p">[</span><span class="n">m</span> <span class="o">+</span> <span class="s">"_test"</span><span class="p">]</span> <span class="o">=</span> <span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="n">scores</span><span class="p">[</span><span class="n">m</span> <span class="o">+</span> <span class="s">"_mse_test"</span><span class="p">]</span> <span class="o">=</span> <span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span></code></pre></figure>
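<p>To make the hold-out step concrete, here is a rough standard-library sketch of a seeded split over row indices. The helper <code class="language-plaintext highlighter-rouge">holdout_split</code> is hypothetical, and rounding the test size up is an assumption about how the split is sized. It will not reproduce the exact rows chosen by <code class="language-plaintext highlighter-rouge">train_test_split</code>, because the shuffling algorithm differs.</p>

```python
import math
import random


def holdout_split(n_samples, test_size=0.33, seed=42):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    # Assumption: the number of test rows is rounded up
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# 707 is the number of validated compounds reported above
train_idx, test_idx = holdout_split(707)
print(len(train_idx), len(test_idx))
```

<p>Fixing the seed matters here: with only a few hundred compounds, a different random split can noticeably change the test score.</p>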
<p>The scores of our model are:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">scores</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">(</span><span class="n">scores</span><span class="p">).</span><span class="n">T</span>
<span class="n">scores</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">rf_train 0.909276</code><br />
<code class="language-plaintext highlighter-rouge">rf_test 0.451319</code><br />
<code class="language-plaintext highlighter-rouge">rf_mse_test 0.792195</code><br />
<code class="language-plaintext highlighter-rouge">dtype: float64</code></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">r2</span> <span class="o">=</span> <span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="n">mse</span> <span class="o">=</span> <span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">"MSE: {:.2f}</span><span class="se">\n</span><span class="s">R^2: {:.2f}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">mse</span><span class="p">,</span> <span class="n">r2</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/2019-3-7-machine-learning-methods-for-log-p-prediction-1_22_0.svg" alt="svg" /></p>
<p>As we can see, using these simple descriptors coupled with scikit-learn’s
random forest gets us a higher R<sup>2</sup> and a lower MSE than the RDKit
logP predictor. This, however, is likely due to the differences between the
training set that we used and the one used to develop the RDKit model.
Note also the large gap between the training R<sup>2</sup> (0.91) and the test R<sup>2</sup> (0.45), which suggests that the model overfits the training data.
It would be interesting to see how much we can improve the performance by tuning
the random forest parameters, and then measure the performance on the PHYSPROP dataset.</p>
<h2 id="calculating-fingerprints">Calculating fingerprints</h2>
<p>Now that we have seen the performance of the simple molecular descriptors, we would
like to assess the performance of some of the most popular molecular
fingerprints. Among the many available methods, we will test Morgan
fingerprints (ECFP4 and ECFP6), RDKFingerprints, and topological pharmacophore
fingerprints (TPAPF and TPATF), the scripts for which are available from
<a href="http://www.mayachemtools.org/">MayaChemTools</a>.</p>
<p>I created a function for parallelizing a DataFrame’s apply() function. This
makes TPATF and TPAPF fingerprint calculation much faster. This function has
become one of my most useful snippets of code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">multiprocessing</span>
<span class="kn">from</span> <span class="nn">joblib</span> <span class="kn">import</span> <span class="n">Parallel</span><span class="p">,</span> <span class="n">delayed</span>
<span class="k">def</span> <span class="nf">applyParallel</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">func</span><span class="p">):</span>
<span class="s">"""Split a pandas Series into n chunks, one per available
CPU, apply the given function to each chunk in parallel,
and return the concatenated output, with the original
index preserved."""</span>
<span class="n">n_jobs</span><span class="o">=</span><span class="n">multiprocessing</span><span class="p">.</span><span class="n">cpu_count</span><span class="p">()</span>
<span class="n">groups</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">n_jobs</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">Parallel</span><span class="p">(</span><span class="n">n_jobs</span><span class="p">)(</span><span class="n">delayed</span><span class="p">(</span><span class="k">lambda</span> <span class="n">g</span><span class="p">:</span> <span class="n">g</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">func</span><span class="p">))(</span><span class="n">group</span><span class="p">)</span> <span class="k">for</span> <span class="n">group</span> <span class="ow">in</span> <span class="n">groups</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">results</span><span class="p">)</span></code></pre></figure>
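<p>The data flow inside applyParallel (split, apply per chunk, concatenate in the original order) can be sketched with the standard library alone. This is only an illustration on plain lists: <code class="language-plaintext highlighter-rouge">split_into_chunks</code> and <code class="language-plaintext highlighter-rouge">apply_in_chunks</code> are hypothetical helper names, and a thread pool stands in for joblib's worker processes, so it shows the ordering logic rather than any real speedup for CPU-bound work.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    for i in range(n_chunks):
        # The first `extra` chunks each receive one additional element.
        start = i * base + min(i, extra)
        end = (i + 1) * base + min(i + 1, extra)
        chunks.append(items[start:end])
    return chunks


def apply_in_chunks(items, func, n_chunks=4):
    """Apply func to every item, one chunk per worker, keeping input order."""
    chunks = split_into_chunks(items, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # pool.map yields results in the order the chunks were submitted.
        results = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
        return [y for chunk_result in results for y in chunk_result]


squares = apply_in_chunks(list(range(10)), lambda x: x * x, n_chunks=3)
print(squares)
```

<p>Because the chunks are contiguous and are concatenated in submission order, the output lines up with the input, which is what allows the result to be used as a drop-in replacement for a plain apply().</p>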
<p>Calculate fingerprints:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">fps</span> <span class="o">=</span> <span class="p">{</span><span class="s">"ECFP4"</span><span class="p">:</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">AllChem</span><span class="p">.</span><span class="n">GetMorganFingerprintAsBitVect</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">radius</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">nBits</span><span class="o">=</span><span class="mi">2048</span><span class="p">)),</span>
<span class="s">"ECFP6"</span><span class="p">:</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">AllChem</span><span class="p">.</span><span class="n">GetMorganFingerprintAsBitVect</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">radius</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">nBits</span><span class="o">=</span><span class="mi">2048</span><span class="p">)),</span>
<span class="s">"RDKFP"</span><span class="p">:</span> <span class="n">molecules</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">AllChem</span><span class="p">.</span><span class="n">RDKFingerprint</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">fpSize</span><span class="o">=</span><span class="mi">2048</span><span class="p">)),</span>
<span class="s">"TPATF"</span><span class="p">:</span> <span class="n">applyParallel</span><span class="p">(</span><span class="n">data_logp</span><span class="p">.</span><span class="n">SMILES</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">FeatureGenerator</span><span class="p">(</span><span class="n">m</span><span class="p">).</span><span class="n">toTPATF</span><span class="p">()),</span>
<span class="s">"TPAPF"</span><span class="p">:</span> <span class="n">applyParallel</span><span class="p">(</span><span class="n">data_logp</span><span class="p">.</span><span class="n">SMILES</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">m</span><span class="p">:</span> <span class="n">FeatureGenerator</span><span class="p">(</span><span class="n">m</span><span class="p">).</span><span class="n">toTPAPF</span><span class="p">())}</span></code></pre></figure>
<h2 id="comparing-fingerprint-models">Comparing fingerprint models</h2>
<p>Finally, here we apply three different types of regression models to estimate
the performance of the different fingerprints.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">y</span> <span class="o">=</span> <span class="n">data_logp</span><span class="p">.</span><span class="n">logPexp</span>
<span class="n">models</span> <span class="o">=</span> <span class="p">{</span><span class="s">"rf"</span><span class="p">:</span> <span class="n">RandomForestRegressor</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">),</span>
<span class="s">"nnet"</span><span class="p">:</span> <span class="n">MLPRegressor</span><span class="p">(</span><span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">),</span>
<span class="s">"svr"</span><span class="p">:</span> <span class="n">SVR</span><span class="p">(</span><span class="n">gamma</span><span class="o">=</span><span class="s">'auto'</span><span class="p">)}</span>
<span class="n">scores</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">fps</span><span class="p">:</span>
    <span class="n">scores</span><span class="p">[</span><span class="n">f</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="c1"># Convert fps to 2D numpy array
</span>    <span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">fps</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">tolist</span><span class="p">())</span>
    <span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.33</span><span class="p">,</span>
                                                        <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">models</span><span class="p">:</span>
        <span class="n">models</span><span class="p">[</span><span class="n">m</span><span class="p">].</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
        <span class="c1">#scores[f][m + "_r2_train"] = models[m].score(X_train, y_train)
</span>        <span class="n">y_pred</span> <span class="o">=</span> <span class="n">models</span><span class="p">[</span><span class="n">m</span><span class="p">].</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
        <span class="n">scores</span><span class="p">[</span><span class="n">f</span><span class="p">][</span><span class="n">m</span> <span class="o">+</span> <span class="s">"_r2_test"</span><span class="p">]</span> <span class="o">=</span> <span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
        <span class="n">scores</span><span class="p">[</span><span class="n">f</span><span class="p">][</span><span class="n">m</span> <span class="o">+</span> <span class="s">"_mse_test"</span><span class="p">]</span> <span class="o">=</span> <span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">scores_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">scores</span><span class="p">).</span><span class="n">T</span>
<span class="n">scores_df</span></code></pre></figure>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>nnet_mse_test</th>
<th>nnet_r2_test</th>
<th>rf_mse_test</th>
<th>rf_r2_test</th>
<th>svr_mse_test</th>
<th>svr_r2_test</th>
</tr>
</thead>
<tbody>
<tr>
<th>ECFP4</th>
<td>1.378013</td>
<td>0.045576</td>
<td>1.216157</td>
<td>0.157679</td>
<td>1.359439</td>
<td>0.058440</td>
</tr>
<tr>
<th>ECFP6</th>
<td>1.238698</td>
<td>0.142066</td>
<td>1.182595</td>
<td>0.180924</td>
<td>1.340282</td>
<td>0.071709</td>
</tr>
<tr>
<th>RDKFP</th>
<td>1.236841</td>
<td>0.143353</td>
<td>1.068570</td>
<td>0.259899</td>
<td>1.069886</td>
<td>0.258988</td>
</tr>
<tr>
<th>TPATF</th>
<td>3.357452</td>
<td>-1.325401</td>
<td>0.704787</td>
<td>0.511858</td>
<td>0.970373</td>
<td>0.327911</td>
</tr>
<tr>
<th>TPAPF</th>
<td>1.391893</td>
<td>0.035962</td>
<td>0.829020</td>
<td>0.425813</td>
<td>0.830663</td>
<td>0.424675</td>
</tr>
</tbody>
</table>
</div>
<p>Overall, the TPATF fingerprint performed the best, reaching a test R² of 0.51
with the random forest and even outperforming the simple descriptor model. The
default random forest had the best performance out of all the regression methods
for every fingerprint, although it is very possible that this
will change after some optimization of the model parameters.</p>
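<p>The two metrics in the table can be reproduced by hand, which also explains how a model can score a negative R² (as the neural network does with TPATF): R² compares the model's squared error with that of always predicting the mean of the test values. The sketch below uses made-up numbers, not values from this experiment, and the function names <code class="language-plaintext highlighter-rouge">mse</code> and <code class="language-plaintext highlighter-rouge">r_squared</code> are illustrative.</p>

```python
def mse(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is the error of predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy logP values (made up for illustration)
y_true = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]  # close to the truth
bad = [4.0, 3.0, 2.0, 1.0]   # worse than always predicting the mean

print(mse(y_true, good), r_squared(y_true, good))
print(mse(y_true, bad), r_squared(y_true, bad))
```

<p>A negative R² therefore means the model's test error is larger than that of a constant prediction, which is a sign that the model needs tuning for that input rather than that the fingerprint carries no signal.</p>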
<p>In later works, we will further tune models using simple physical descriptors as
well as TPATF fingerprints, and compare their performance to existing logP
predictors using this dataset, as well as the PHYSPROP set. It would also be
interesting to observe the effects of consensus scoring using several models.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Martel, S., Gillerat, F., Carosati, E., Maiarelli, D., Tetko, I. V., Mannhold, R., & Carrupt, P.-A. (2013). Large, chemically diverse dataset of logP measurements for benchmarking studies. European Journal of Pharmaceutical Sciences, 48(1-2), 21–29. doi: 10.1016/j.ejps.2012.10.019 <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Ricardo Avilaravila@protonmail.comThe octanol-water partition coefficient, or logP, is one of the most important properties for determining a compound’s suitability as a drug. Unfortunately, currently existing models are not as accurate as we would like. This experiment compares the performance of several different molecular fingerprinting methods, paired with three common machine learning algorithms.Mining Pharos with MySQL and Python2019-02-19T00:00:00-07:002019-02-19T00:00:00-07:00https://ravilabio.info/2019/02/19/mining-pharos-with-mysql-and-python<!-- vim-markdown-toc GitLab -->
<ul>
<li><a href="#why-mysql-and-python">Why MySQL and Python?</a></li>
<li><a href="#connecting-to-pharos">Connecting to Pharos</a></li>
<li><a href="#executing-database-queries">Executing database queries</a></li>
<li><a href="#importing-the-tables-to-pandas">Importing the tables to Pandas</a>
<ul>
<li><a href="#compound-activity">Compound Activity</a></li>
<li><a href="#protein">Protein</a></li>
<li><a href="#target">Target</a></li>
</ul>
</li>
<li><a href="#filtering-by-number-of-actives">Filtering by number of actives</a></li>
<li><a href="#joining-tables">Joining Tables</a></li>
<li><a href="#exporting-the-data">Exporting the data</a></li>
</ul>
<!-- vim-markdown-toc -->
<p><img src="/assets/images/2019-2-19-pharos-mysql.png" alt="image" /></p>
<h2 id="why-mysql-and-python">Why MySQL and Python?</h2>
<p>Previously, I demonstrated how to use the SIFTS database to find <a href="/2019/02/14/mapping-pharos-targets-to-pdb.html">UniProt-to-PDB
mappings</a> for proteins from the Pharos database. To do this, we downloaded <strong>csv format</strong>
files for different receptor classes directly from the Pharos website.
However, manually downloading these data files is tedious, and does not allow us to keep our data
up to date with future changes in the source database.
A much more efficient way is to obtain this data directly through SQL queries.</p>
<p>I must confess that I am not proficient when it comes to complex table joins and filters in SQL,
but I can do the job in Python! Additionally, reading SQL tables into Python allows us to use Python’s
data visualization libraries on the data with ease.</p>
<p>In this notebook, we use <strong>MySQL Connector</strong> and Python’s <strong>Pandas</strong> library to
retrieve and manipulate data for Pharos targets.
The goal is to obtain a dataset of targets that contain <strong>at least 15 active
compounds</strong>, along with information about their different target classes.</p>
<p>All the code in this post is also available as a Jupyter notebook <a href="https://gist.github.com/ravila4/ef493e20ff9f35d4e1b83e21a97a7de7">here.</a></p>
<p>To install mysql-connector, run: <code class="language-plaintext highlighter-rouge">pip install mysql-connector-python-rf</code>.</p>
<p>First, we import the necessary libraries:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">mysql.connector</span> <span class="k">as</span> <span class="n">sql</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span></code></pre></figure>
<h2 id="connecting-to-pharos">Connecting to Pharos</h2>
<p>We use Python to create an SQL connection to the Pharos database:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">db_connection</span> <span class="o">=</span> <span class="n">sql</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">host</span><span class="o">=</span><span class="s">'tcrd.kmc.io'</span><span class="p">,</span> <span class="n">db</span><span class="o">=</span><span class="s">'tcrd540'</span><span class="p">,</span> <span class="n">user</span><span class="o">=</span><span class="s">'tcrd'</span><span class="p">)</span>
<span class="n">db_connection</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge"><mysql.connector.connection.MySQLConnection at 0x7f428fca0668></code></p>
<p>In order to use the new connection, we need to create a cursor object, which
allows us to send instructions to the database:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">db_cursor</span> <span class="o">=</span> <span class="n">db_connection</span><span class="p">.</span><span class="n">cursor</span><span class="p">()</span></code></pre></figure>
<h2 id="executing-database-queries">Executing database queries</h2>
<p>We can use the newly created cursor to execute queries.
First we execute the <code class="language-plaintext highlighter-rouge">SHOW TABLES</code> MySQL command, to get an idea of the kind
of tables we can collect information from.</p>
<p>The <code class="language-plaintext highlighter-rouge">cursor.fetchall()</code> method returns a list,
and is equivalent to calling <code class="language-plaintext highlighter-rouge">list()</code> on the cursor object.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">db_cursor</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">'SHOW TABLES;'</span><span class="p">)</span>
<span class="n">tables</span> <span class="o">=</span> <span class="n">db_cursor</span><span class="p">.</span><span class="n">fetchall</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="n">tables</span><span class="p">)</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">[('alias',), ('cmpd_activity',), ('cmpd_activity_type',), ('compartment',), ('compartment_type',), ('data_type',), ('dataset',), ('dbinfo',), ('disease',), ('disease_type',), ('do',), ('do_parent',), ('drug_activity',), ('dto',), ('expression',), ('expression_type',), ('feature',), ('gene_attribute',), ('gene_attribute_type',), ('generif',), ('goa',), ('hgram_cdf',), ('info_type',), ('kegg_distance',), ('kegg_nearest_tclin',), ('locsig',), ('mlp_assay_info',), ('ortholog',), ('ortholog_disease',), ('p2pc',), ('panther_class',), ('patent_count',), ('pathway',), ('pathway_type',), ('phenotype',), ('phenotype_type',), ('pmscore',), ('ppi',), ('ppi_type',), ('protein',), ('protein2pubmed',), ('provenance',), ('ptscore',), ('pubmed',), ('t2tc',), ('target',), ('tdl_info',), ('tdl_update_log',), ('techdev_contact',), ('techdev_info',), ('tinx_articlerank',), ('tinx_disease',), ('tinx_importance',), ('tinx_novelty',), ('tinx_target',), ('xref',), ('xref_type',)]</code></p>
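<p>Iterating over a cursor is a common pattern among Python database drivers, so the equivalence between <code class="language-plaintext highlighter-rouge">fetchall()</code> and <code class="language-plaintext highlighter-rouge">list()</code> can be tried without access to Pharos. The sketch below uses an in-memory SQLite database from the standard library purely as a stand-in (an assumption for illustration; the tables are made up). SQLite has no <code class="language-plaintext highlighter-rouge">SHOW TABLES</code> command, so the example queries <code class="language-plaintext highlighter-rouge">sqlite_master</code> instead.</p>

```python
import sqlite3

# In-memory SQLite database as a stand-in for the remote MySQL server
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT)")

# SQLite's counterpart of SHOW TABLES
query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"

cur.execute(query)
via_fetchall = cur.fetchall()  # list of 1-tuples, like the output above

cur.execute(query)
via_list = list(cur)           # iterating the cursor yields the same rows

print(via_fetchall)
print(via_fetchall == via_list)
conn.close()
```

<p>Note that a cursor is consumed as it is read, which is why the query is executed a second time before calling <code class="language-plaintext highlighter-rouge">list()</code>.</p>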
<p>Above, we see a list of the tables. We can use the <code class="language-plaintext highlighter-rouge">DESCRIBE</code> query to obtain
a list of the attributes of a particular table.
In this case, we are interested in the <strong>protein</strong>, <strong>target</strong>, and <strong>cmpd_activity</strong> tables.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">db_cursor</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">'DESCRIBE protein;'</span><span class="p">)</span>
<span class="nb">list</span><span class="p">(</span><span class="n">db_cursor</span><span class="p">)</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">[('id', 'int(11)', 'NO', 'PRI', None, 'auto_increment'),
('name', 'varchar(255)', 'NO', 'UNI', None, ''),
('description', 'text', 'NO', '', None, ''),
('uniprot', 'varchar(20)', 'NO', 'UNI', None, ''),
('up_version', 'int(11)', 'YES', '', None, ''),
('geneid', 'int(11)', 'YES', '', None, ''),
('sym', 'varchar(20)', 'YES', '', None, ''),
('family', 'varchar(255)', 'YES', '', None, ''),
('chr', 'varchar(255)', 'YES', '', None, ''),
('seq', 'text', 'YES', '', None, ''),
('dtoid', 'varchar(13)', 'YES', '', None, ''),
('stringid', 'varchar(15)', 'YES', '', None, '')]</code></p>
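<p>Each tuple returned by <code class="language-plaintext highlighter-rouge">DESCRIBE</code> follows MySQL's column order: Field, Type, Null, Key, Default, Extra. As a small convenience, the rows can be wrapped in a named tuple to make them easier to inspect. The sketch below hard-codes three rows copied from the output above instead of contacting the database, and <code class="language-plaintext highlighter-rouge">ColumnInfo</code> is a hypothetical name chosen for illustration.</p>

```python
from collections import namedtuple

# Field order of MySQL's DESCRIBE output
ColumnInfo = namedtuple("ColumnInfo", "field type null key default extra")

# Three rows copied from the DESCRIBE protein output
rows = [
    ("id", "int(11)", "NO", "PRI", None, "auto_increment"),
    ("uniprot", "varchar(20)", "NO", "UNI", None, ""),
    ("seq", "text", "YES", "", None, ""),
]
columns = [ColumnInfo(*row) for row in rows]

# Column names, and the columns that may contain NULL values
names = [c.field for c in columns]
nullable = [c.field for c in columns if c.null == "YES"]
print(names)
print(nullable)
```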
<h2 id="importing-the-tables-to-pandas">Importing the tables to Pandas</h2>
<h3 id="compound-activity">Compound Activity</h3>
<p>Next, we use Pandas to read the data directly from the tables.
First the <strong>cmpd_activity</strong> table, which contains information about the binding
affinity of compounds to targets in the database:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">query</span> <span class="o">=</span> <span class="s">"SELECT id, target_id, cmpd_id_in_src, cmpd_name_in_src, </span><span class="se">\
</span><span class="s"> smiles, act_value, act_type </span><span class="se">\
</span><span class="s"> FROM cmpd_activity"</span>
<span class="n">cmpd_activity</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">con</span><span class="o">=</span><span class="n">db_connection</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">cmpd_activity</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">cmpd_activity</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">(382291, 7)</code></p>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: left;">
<th></th>
<th>id</th>
<th>target_id</th>
<th>cmpd_id_in_src</th>
<th>cmpd_name_in_src</th>
<th>smiles</th>
<th>act_value</th>
<th>act_type</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>3006</td>
<td>CHEMBL365855</td>
<td>N-(5-Cyclobutyl-thiazol-2-yl)-2-phenyl-acetamide</td>
<td>O=C(Cc1ccccc1)Nc2ncc(s2)C3CCC3</td>
<td>7.60</td>
<td>IC50</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>3006</td>
<td>CHEMBL3775677</td>
<td>3-Isopropyl-5-(2,3-dihydroxypropyl)amino-7-[4-...</td>
<td>CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(O)...</td>
<td>7.68</td>
<td>IC50</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>3006</td>
<td>CHEMBL3775608</td>
<td>3-Isopropyl-5-(3-amino-2-hydroxypropyl)amino-7...</td>
<td>CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(N)...</td>
<td>7.77</td>
<td>IC50</td>
</tr>
</tbody>
</table>
</div>
<h3 id="protein">Protein</h3>
<p>We read in the data we want from the <strong>protein</strong> table:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">query</span> <span class="o">=</span> <span class="s">"SELECT id, name, description, uniprot, family, seq </span><span class="se">\
</span><span class="s"> FROM protein"</span>
<span class="n">protein</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">con</span><span class="o">=</span><span class="n">db_connection</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">protein</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">protein</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">(20244, 6)</code></p>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: left;">
<th></th>
<th>id</th>
<th>name</th>
<th>description</th>
<th>uniprot</th>
<th>family</th>
<th>seq</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>1433E_HUMAN</td>
<td>14-3-3 protein epsilon</td>
<td>P62258</td>
<td>Belongs to the 14-3-3 family.</td>
<td>MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLS...</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>1433F_HUMAN</td>
<td>14-3-3 protein eta</td>
<td>Q04917</td>
<td>Belongs to the 14-3-3 family.</td>
<td>MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLS...</td>
</tr>
<tr>
<th>2</th>
<td>3</td>
<td>1433T_HUMAN</td>
<td>14-3-3 protein theta</td>
<td>P27348</td>
<td>Belongs to the 14-3-3 family.</td>
<td>MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSV...</td>
</tr>
</tbody>
</table>
</div>
<h3 id="target">Target</h3>
<p>For the <strong>target</strong> table, we are interested in filtering for targets that are
in the <strong>Tclin</strong> or <strong>Tchem</strong> development classifications.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">query</span> <span class="o">=</span> <span class="s">"SELECT id, name, tdl, fam, famext </span><span class="se">\
</span><span class="s"> FROM target </span><span class="se">\
</span><span class="s"> WHERE tdl='Tclin' OR tdl='Tchem'"</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_sql</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">con</span><span class="o">=</span><span class="n">db_connection</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">print</span><span class="p">(</span><span class="n">target</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">target</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">(2211, 5)</code></p>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: left;">
<th></th>
<th>id</th>
<th>name</th>
<th>tdl</th>
<th>fam</th>
<th>famext</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>2</td>
<td>14-3-3 protein eta</td>
<td>Tchem</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>14-3-3 protein theta</td>
<td>Tchem</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<th>2</th>
<td>23</td>
<td>3 beta-hydroxysteroid dehydrogenase/Delta 5-->...</td>
<td>Tchem</td>
<td>Enzyme</td>
<td>3-beta-HSD</td>
</tr>
</tbody>
</table>
</div>
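<p>The <code class="language-plaintext highlighter-rouge">WHERE tdl='Tclin' OR tdl='Tchem'</code> filter can also be written with <code class="language-plaintext highlighter-rouge">IN</code> and bound parameters, which stays readable when the list of classes grows. The sketch below demonstrates the idea on an in-memory SQLite table with made-up rows (a stand-in, not the Pharos schema). SQLite uses <code class="language-plaintext highlighter-rouge">?</code> placeholders, whereas MySQL Connector uses <code class="language-plaintext highlighter-rouge">%s</code>, so the placeholder would need to change for the real database.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, name TEXT, tdl TEXT)")
conn.executemany(
    "INSERT INTO target VALUES (?, ?, ?)",
    [(1, "A", "Tclin"), (2, "B", "Tchem"), (3, "C", "Tbio"), (4, "D", "Tdark")],
)

wanted = ("Tclin", "Tchem")
# One placeholder per wanted class: "?, ?"
placeholders = ", ".join("?" for _ in wanted)
query = f"SELECT id, name, tdl FROM target WHERE tdl IN ({placeholders}) ORDER BY id"

# The values are passed separately, so they are never pasted into the SQL text
rows = conn.execute(query, wanted).fetchall()
print(rows)
conn.close()
```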
<p>Since we have all the data stored in memory, we no longer need the database
connection.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">db_connection</span><span class="p">.</span><span class="n">close</span><span class="p">()</span></code></pre></figure>
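<p>One way to make sure the connection is always closed, even if a query raises an exception halfway through, is to wrap it in <code class="language-plaintext highlighter-rouge">contextlib.closing</code>. The sketch below uses an in-memory SQLite connection as a stand-in; the same wrapper should work with any object that has a <code class="language-plaintext highlighter-rouge">close()</code> method, which includes the MySQL connection used here.</p>

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, with or without errors
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE target (id INTEGER, tdl TEXT)")
    conn.execute("INSERT INTO target VALUES (1, 'Tclin')")
    rows = conn.execute("SELECT id, tdl FROM target").fetchall()

# The fetched rows stay in memory after the connection is gone
print(rows)

# The connection itself can no longer be used
try:
    conn.execute("SELECT 1")
except sqlite3.ProgrammingError as err:
    print("connection closed:", err)
```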
<h2 id="filtering-by-number-of-actives">Filtering by number of actives</h2>
<p>Here, we filter out receptors that have fewer than 15 active molecules.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">num_actives</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">target_ids</span> <span class="o">=</span> <span class="n">cmpd_activity</span><span class="p">.</span><span class="n">target_id</span><span class="p">.</span><span class="n">unique</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">target_ids</span><span class="p">:</span>
    <span class="n">num_actives</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">cmpd_activity</span><span class="p">[</span><span class="n">cmpd_activity</span><span class="p">.</span><span class="n">target_id</span> <span class="o">==</span> <span class="n">i</span><span class="p">])</span>
<span class="n">target</span><span class="p">[</span><span class="s">'num_actives'</span><span class="p">]</span> <span class="o">=</span> <span class="n">target</span><span class="p">.</span><span class="nb">id</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">num_actives</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">target</span><span class="p">[</span><span class="n">target</span><span class="p">[</span><span class="s">'num_actives'</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">15</span><span class="p">]</span>
<span class="n">target</span><span class="p">.</span><span class="n">num_actives</span> <span class="o">=</span> <span class="n">target</span><span class="p">.</span><span class="n">num_actives</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span> <span class="c1"># Convert from float to int
</span><span class="n">target</span><span class="p">.</span><span class="n">shape</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">(1067, 6)</code></p>
<p>Whereas before we had a total of 2,211 targets in Tclin and Tchem, now we only have 1,067 that contain at least 15 experimental activity values.</p>
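<p>The loop above rescans the whole activity table once per target. As an alternative sketch (not the code used for the results in this post), the same counts can be obtained in one pass with <code class="language-plaintext highlighter-rouge">value_counts()</code> and attached with <code class="language-plaintext highlighter-rouge">map()</code>. The toy tables and the threshold of 2 are made up so that the example runs on its own.</p>

```python
import pandas as pd

# Toy stand-ins for the cmpd_activity and target tables
cmpd_activity = pd.DataFrame({"target_id": [1, 1, 1, 2, 3, 3]})
target = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["A", "B", "C", "D"]})

# Number of activity rows per target, computed in a single pass
counts = cmpd_activity["target_id"].value_counts()

# Targets without any activity get NaN from map(); treat them as 0
target["num_actives"] = target["id"].map(counts).fillna(0).astype(int)

# Keep targets with at least 2 activities (the post uses 15)
filtered = target[target["num_actives"].ge(2)]
print(filtered)
```

<p>Filling the missing counts with 0 before filtering also removes the need for the separate float-to-int conversion step.</p>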
<p>Finally, we create a pie chart to visualize the number of targets in each target family:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">tchem_tclin_fams</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">families</span> <span class="o">=</span> <span class="p">[</span><span class="n">fam</span> <span class="k">for</span> <span class="n">fam</span> <span class="ow">in</span> <span class="n">target</span><span class="p">.</span><span class="n">fam</span><span class="p">.</span><span class="n">unique</span><span class="p">()</span> <span class="k">if</span> <span class="n">fam</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">]</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">families</span><span class="p">):</span>
    <span class="n">tchem_tclin_fams</span><span class="p">[</span><span class="n">f</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">target</span><span class="p">[</span><span class="n">target</span><span class="p">.</span><span class="n">fam</span> <span class="o">==</span> <span class="n">f</span><span class="p">])</span>
<span class="n">tchem_tclin_fams</span><span class="p">[</span><span class="s">'None'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">target</span><span class="p">[</span><span class="n">target</span><span class="p">.</span><span class="n">fam</span><span class="p">.</span><span class="n">isna</span><span class="p">()])</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">tchem_tclin_fams</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">{'Enzyme': 348,
'Epigenetic': 42,
'GPCR': 189,
'IC': 91,
'Kinase': 205,
'NR': 28,
'TF': 6,
'TF; Epigenetic': 5,
'Transporter': 35,
'None': 118}</code></p>
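<p>The same kind of tally can be produced with <code class="language-plaintext highlighter-rouge">collections.Counter</code> from the standard library, mapping missing families to the string <code class="language-plaintext highlighter-rouge">'None'</code> as above. The family list below is a short made-up sample, not the Pharos data.</p>

```python
from collections import Counter

# Made-up sample of the `fam` column; None marks targets without a family
fams = ["Kinase", "GPCR", None, "Kinase", "Enzyme", None, "GPCR", "Kinase"]

# Replace missing values with the label 'None' and count occurrences
counts = Counter("None" if fam is None else fam for fam in fams)

# Sort family names alphabetically and put 'None' last, as in the dict above
named = sorted(name for name in counts if name != "None")
family_counts = {name: counts[name] for name in named}
family_counts["None"] = counts["None"]
print(family_counts)
```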
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">4</span><span class="p">))</span>
<span class="n">width</span> <span class="o">=</span> <span class="p">.</span><span class="mi">6</span>
<span class="n">explode</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="p">.</span><span class="mi">3</span><span class="p">,</span> <span class="p">.</span><span class="mi">2</span><span class="p">,</span> <span class="p">.</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">"{}: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="k">for</span> <span class="n">f</span><span class="p">,</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">tchem_tclin_fams</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span>
<span class="n">tchem_tclin_fams</span><span class="p">.</span><span class="n">values</span><span class="p">())]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">pie</span><span class="p">(</span><span class="n">tchem_tclin_fams</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">,</span> <span class="n">radius</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">explode</span><span class="o">=</span><span class="n">explode</span><span class="p">,</span>
<span class="n">wedgeprops</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">width</span><span class="o">=</span><span class="n">width</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'w'</span><span class="p">),</span> <span class="n">autopct</span><span class="o">=</span><span class="s">'%1.0f%%'</span><span class="p">,</span>
<span class="n">pctdistance</span><span class="o">=</span><span class="p">.</span><span class="mi">8</span><span class="p">,</span> <span class="n">labeldistance</span><span class="o">=</span><span class="mf">1.1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"pharos_targets.svg"</span><span class="p">,</span> <span class="n">bbox_inches</span> <span class="o">=</span> <span class="s">'tight'</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/images/2019-2-19-mining-pharos-with-mysql-and-python_32_0.svg" alt="svg" /></p>
<p>From this target data, we could further filter down to receptors that have known
protein structures, as shown in the SIFTS database post. In this case, we will
simply concatenate the data from the Protein table to the Target table, in order
to obtain information about the UniProt ID, protein ontology, and sequence.
Finally, we will write the data to csv files for further analysis.</p>
<h2 id="joining-tables">Joining Tables</h2>
<p>We need to join the Protein and Target tables by id. First, we keep only the proteins whose id appears in the Target table, so that the two tables have the same size:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">protein</span> <span class="o">=</span> <span class="n">protein</span><span class="p">[</span><span class="n">protein</span><span class="p">.</span><span class="nb">id</span><span class="p">.</span><span class="n">isin</span><span class="p">(</span><span class="n">target</span><span class="p">.</span><span class="nb">id</span><span class="p">)]</span>
<span class="n">protein</span><span class="p">.</span><span class="n">shape</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">(1067, 6)</code></p>
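<p>As a quick sanity check before joining, we can confirm that the two tables really cover the same ids. The snippet below is a minimal sketch on toy data: the two small Data Frames and the extra id <code class="language-plaintext highlighter-rouge">31</code> with its placeholder accession are made up for illustration and are not taken from Pharos.</p>

```python
import pandas as pd

# Toy stand-ins for the Target and Protein tables (hypothetical values).
target = pd.DataFrame({"id": [26, 27, 30],
                       "tdl": ["Tclin", "Tclin", "Tchem"]})
protein = pd.DataFrame({"id": [26, 27, 30, 31],
                        "uniprot": ["P41595", "P28335", "P21589", "PLACEHOLDER"]})

# Keep only the proteins whose id appears in the Target table.
protein = protein[protein["id"].isin(target["id"])]

# The tables should now have the same number of rows and the same set of ids.
same_size = len(protein) == len(target)
same_ids = set(protein["id"]) == set(target["id"])
print(same_size, same_ids)
```

<p>If either check fails, the outer join that follows would produce rows with missing values on one side.</p>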
<p>Joining the tables:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">protein</span> <span class="o">=</span> <span class="n">protein</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"id"</span><span class="p">)</span>
<span class="n">target</span> <span class="o">=</span> <span class="n">target</span><span class="p">.</span><span class="n">set_index</span><span class="p">(</span><span class="s">"id"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">target</span><span class="p">,</span> <span class="n">protein</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">join</span><span class="o">=</span><span class="s">'outer'</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">result</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span></code></pre></figure>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: left;">
<th></th>
<th>name</th>
<th>tdl</th>
<th>fam</th>
<th>famext</th>
<th>num_actives</th>
<th>name</th>
<th>description</th>
<th>uniprot</th>
<th>family</th>
<th>seq</th>
</tr>
<tr>
<th>id</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>26</th>
<td>5-hydroxytryptamine receptor 2B</td>
<td>Tclin</td>
<td>GPCR</td>
<td>GPCR</td>
<td>777</td>
<td>5HT2B_HUMAN</td>
<td>5-hydroxytryptamine receptor 2B</td>
<td>P41595</td>
<td>Belongs to the G-protein coupled receptor 1 fa...</td>
<td>MALSYRVSELQSTIPEHILQSTFVHVISSNWSGLQTESIPEEMKQI...</td>
</tr>
<tr>
<th>27</th>
<td>5-hydroxytryptamine receptor 2C</td>
<td>Tclin</td>
<td>GPCR</td>
<td>GPCR</td>
<td>1612</td>
<td>5HT2C_HUMAN</td>
<td>5-hydroxytryptamine receptor 2C</td>
<td>P28335</td>
<td>Belongs to the G-protein coupled receptor 1 fa...</td>
<td>MVNLRNAVHSFLVHLIGLLVWQCDISVSPVAAIVTDIFNTSDGGRF...</td>
</tr>
<tr>
<th>30</th>
<td>5'-nucleotidase</td>
<td>Tchem</td>
<td>Enzyme</td>
<td>None</td>
<td>23</td>
<td>5NTD_HUMAN</td>
<td>5'-nucleotidase</td>
<td>P21589</td>
<td>Belongs to the 5'-nucleotidase family.</td>
<td>MCPRAARAPATLLLALGAVLWPAAGAWELTILHTNDVHSRLEQTSE...</td>
</tr>
</tbody>
</table>
</div>
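<p>Note that the joined table ends up with two columns called <code class="language-plaintext highlighter-rouge">name</code>, one from each source table. One possible way to keep them apart is to join with suffixes. This is only a sketch on toy data: the suffix strings <code class="language-plaintext highlighter-rouge">_target</code> and <code class="language-plaintext highlighter-rouge">_protein</code> are my own choice and the two small Data Frames stand in for the real tables.</p>

```python
import pandas as pd

# Toy stand-ins for the two tables, already indexed by id.
target = pd.DataFrame(
    {"name": ["5-hydroxytryptamine receptor 2B", "5'-nucleotidase"],
     "tdl": ["Tclin", "Tchem"]},
    index=pd.Index([26, 30], name="id"))
protein = pd.DataFrame(
    {"name": ["5HT2B_HUMAN", "5NTD_HUMAN"],
     "uniprot": ["P41595", "P21589"]},
    index=pd.Index([26, 30], name="id"))

# DataFrame.join aligns on the index; the suffixes are only applied to
# column names that appear in both tables (here, "name").
joined = target.join(protein, how="outer",
                     lsuffix="_target", rsuffix="_protein")
print(list(joined.columns))
```

<p>With unique column names, each column can be selected unambiguously, for example <code class="language-plaintext highlighter-rouge">joined["name_protein"]</code>.</p>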
<h2 id="exporting-the-data">Exporting the data</h2>
<p>We split the result into one Data Frame per target family, store these in a
dictionary, and save each one to its own CSV file.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">target_dfs</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">families</span><span class="p">:</span>
    <span class="n">target_dfs</span><span class="p">[</span><span class="n">f</span><span class="p">]</span> <span class="o">=</span> <span class="n">result</span><span class="p">[</span><span class="n">result</span><span class="p">.</span><span class="n">fam</span> <span class="o">==</span> <span class="n">f</span><span class="p">]</span>
    <span class="n">target_dfs</span><span class="p">[</span><span class="n">f</span><span class="p">].</span><span class="n">to_csv</span><span class="p">(</span><span class="n">f</span> <span class="o">+</span> <span class="s">".csv"</span><span class="p">)</span></code></pre></figure>Ricardo Avilaravila@protonmail.comAccessing SQL databases with Python can be useful in many situations. Here, we use MySQL Connector and Python’s Pandas library to retrieve and manipulate data for Pharos targets. The goal is to obtain a dataset of targets that contain more than 15 active compounds, along with information about their different target classes.Mapping Pharos Targets to PDB Structures2019-02-14T00:00:00-07:002019-02-14T00:00:00-07:00https://ravilabio.info/2019/02/14/mapping-pharos-targets-to-pdb<ul>
<li><a href="#problem-description">Problem Description</a></li>
<li><a href="#getting-the-data">Getting the Data</a></li>
<li><a href="#read-sifts-mappings">Read SIFTS Mappings</a></li>
<li><a href="#find-pdb-ids">Find PDB IDs</a></li>
<li><a href="#summarizing-the-data">Summarizing the Data</a></li>
<li><a href="#visualizing-the-data">Visualizing the Data</a></li>
</ul>
<h2 id="problem-description">Problem Description</h2>
<p><img src="/assets/images/SIFTS.png" style="width:500px" /></p>
<p>The Structure Integration with Function, Taxonomy and Sequence (<a href="https://www.ebi.ac.uk/pdbe/docs/sifts/">SIFTS</a>) database
provides mappings between UniProt and PDB, as well as annotations from GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl and other resources.
Here, we map all the receptors from the <a href="https://pharos.nih.gov/idg/index">Pharos</a> database to their PDB IDs,
using their UniProt accession numbers.</p>
<p><strong>The goal is to obtain a dataset of human targets with available structures and known ligand binding affinities.</strong>
I also want to get the distribution of these PDB structures across different receptor families,
such as Kinases, GPCRs, Ion Channels, Nuclear Receptors, and Transporters.</p>
<h2 id="getting-the-data">Getting the Data</h2>
<p>First, we read in CSV files downloaded from Pharos for targets at the <strong>Tclin</strong> (targets with approved drugs)
and <strong>Tchem</strong> (targets with known binding affinities) development levels, for several receptor classes.
The CSV files contain the UniProt ID of each receptor. All downloaded data and code are available in my GitHub repository:
<a href="https://github.com/ravila4/Pharos-to-PDB">ravila4/Pharos-to-PDB</a></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="n">target_classes</span> <span class="o">=</span> <span class="p">[</span><span class="s">"GPCRs"</span><span class="p">,</span> <span class="s">"ion-channels"</span><span class="p">,</span> <span class="s">"kinases"</span><span class="p">,</span> <span class="s">"nuclear-receptors"</span><span class="p">,</span> <span class="s">"transporters"</span><span class="p">]</span>
<span class="n">IDG_data</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">tclass</span> <span class="ow">in</span> <span class="n">target_classes</span><span class="p">:</span>
    <span class="n">IDG_data</span><span class="p">[</span><span class="n">tclass</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"data/"</span> <span class="o">+</span> <span class="n">tclass</span> <span class="o">+</span> <span class="s">".csv"</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span></code></pre></figure>
<h2 id="read-sifts-mappings">Read SIFTS Mappings</h2>
<p>The UniProt to PDB mappings were downloaded as a CSV file from
<a href="ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/flatfiles/csv">the SIFTS FTP site</a>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">uniprot_to_pdb</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"data/uniprot_pdb.csv"</span><span class="p">,</span> <span class="n">skiprows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">uniprot_to_pdb</span><span class="p">.</span><span class="n">head</span><span class="p">()</span></code></pre></figure>
<figcaption>A sample of the SIFTS Data Frame.</figcaption>
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: left;
}
</style>
<table border="0" class="dataframe">
<thead>
<tr style="text-align: left;">
<th></th>
<th>SP_PRIMARY</th>
<th>PDB</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>A0A010</td>
<td>5b00;5b01;5b02;5b03;5b0i;5b0j;5b0k;5b0l;5b0m;5...</td>
</tr>
<tr>
<th>1</th>
<td>A0A011</td>
<td>3vk5;3vka;3vkb;3vkc;3vkd</td>
</tr>
<tr>
<th>2</th>
<td>A0A014C6J9</td>
<td>6br7</td>
</tr>
<tr>
<th>3</th>
<td>A0A016UNP9</td>
<td>2md0</td>
</tr>
<tr>
<th>4</th>
<td>A0A023GPI4</td>
<td>2m6j</td>
</tr>
</tbody>
</table>
</div>
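<p>Each row of the mapping stores all PDB IDs for a UniProt accession as one semicolon-separated string. If a table with one row per (UniProt, PDB) pair is more convenient, the string can be split and expanded. The sketch below uses two rows copied from the sample above; the name <code class="language-plaintext highlighter-rouge">long_form</code> is introduced here for illustration, and <code class="language-plaintext highlighter-rouge">DataFrame.explode</code> assumes pandas 0.25 or later.</p>

```python
import pandas as pd

# Two rows from the SIFTS sample shown above.
uniprot_to_pdb = pd.DataFrame({
    "SP_PRIMARY": ["A0A011", "A0A014C6J9"],
    "PDB": ["3vk5;3vka;3vkb;3vkc;3vkd", "6br7"],
})

# Split the semicolon-separated string into a list, then give each
# list element its own row.
long_form = (
    uniprot_to_pdb
    .assign(PDB=uniprot_to_pdb["PDB"].str.split(";"))
    .explode("PDB")
    .reset_index(drop=True)
)
print(long_form)
```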
<h2 id="find-pdb-ids">Find PDB IDs</h2>
<p>Here’s a function that looks up the PDB IDs for each UniProt ID in a Pharos Data Frame:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">find_pdbs</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
    <span class="s">""" Input: Data Frame of Pharos data.
    Output: List of PDB IDs. """</span>
    <span class="n">IDS</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)):</span>
        <span class="n">pdb_ids</span> <span class="o">=</span> <span class="bp">None</span>
        <span class="n">uniprot_id</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">loc</span><span class="p">[:,</span> <span class="s">"Uniprot ID"</span><span class="p">][</span><span class="n">i</span><span class="p">]</span>
        <span class="n">mapping</span> <span class="o">=</span> <span class="n">uniprot_to_pdb</span><span class="p">[</span><span class="n">uniprot_to_pdb</span><span class="p">.</span><span class="n">SP_PRIMARY</span> <span class="o">==</span> <span class="n">uniprot_id</span><span class="p">]</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">mapping</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
            <span class="n">pdb_ids</span> <span class="o">=</span> <span class="n">mapping</span><span class="p">.</span><span class="n">PDB</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">split</span><span class="p">(</span><span class="s">';'</span><span class="p">)</span>
        <span class="n">IDS</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">pdb_ids</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">IDS</span></code></pre></figure>
<p>Adding PDB IDs to Pharos targets:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">df</span> <span class="ow">in</span> <span class="n">IDG_data</span><span class="p">.</span><span class="n">values</span><span class="p">():</span>
    <span class="n">df</span><span class="p">[</span><span class="s">'PDB_IDS'</span><span class="p">]</span> <span class="o">=</span> <span class="n">find_pdbs</span><span class="p">(</span><span class="n">df</span><span class="p">)</span></code></pre></figure>
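<p>The row-by-row lookup above filters the whole SIFTS table once per target. A possible alternative is to build the lookup once and apply it with <code class="language-plaintext highlighter-rouge">Series.map</code>. This is a sketch on toy data, not the code used for the results in this post: the <code class="language-plaintext highlighter-rouge">pharos</code> Data Frame and the <code class="language-plaintext highlighter-rouge">NOT_MAPPED</code> accession are made up, and unmapped targets come back as <code class="language-plaintext highlighter-rouge">NaN</code> instead of <code class="language-plaintext highlighter-rouge">None</code>.</p>

```python
import pandas as pd

# Two rows from the SIFTS sample, plus a toy Pharos table (hypothetical).
uniprot_to_pdb = pd.DataFrame({
    "SP_PRIMARY": ["A0A011", "A0A014C6J9"],
    "PDB": ["3vk5;3vka;3vkb;3vkc;3vkd", "6br7"],
})
pharos = pd.DataFrame({"Uniprot ID": ["A0A014C6J9", "NOT_MAPPED", "A0A011"]})

# Build a Series indexed by UniProt accession whose values are lists of
# PDB IDs. Keeping the first row per accession mirrors `.iloc[0]` in
# find_pdbs and guarantees the unique index that Series.map needs.
lookup = (
    uniprot_to_pdb
    .drop_duplicates("SP_PRIMARY")
    .set_index("SP_PRIMARY")["PDB"]
    .str.split(";")
)

# Accessions without a mapping become NaN.
pharos["PDB_IDS"] = pharos["Uniprot ID"].map(lookup)
print(pharos)
```

<p>Since <code class="language-plaintext highlighter-rouge">isna()</code> treats both <code class="language-plaintext highlighter-rouge">None</code> and <code class="language-plaintext highlighter-rouge">NaN</code> as missing, the counting step that follows would work with either version.</p>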
<h2 id="summarizing-the-data">Summarizing the Data</h2>
<p>Number of receptors in each class with at least one structure in the Protein
Data Bank:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">pdbs_per_class</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">for</span> <span class="n">IDG_class</span> <span class="ow">in</span> <span class="n">IDG_data</span><span class="p">:</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">IDG_data</span><span class="p">[</span><span class="n">IDG_class</span><span class="p">]</span>
    <span class="n">num_available</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span> <span class="o">-</span> <span class="nb">sum</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">PDB_IDS</span><span class="p">.</span><span class="n">isna</span><span class="p">())</span>
    <span class="n">pdbs_per_class</span><span class="p">[</span><span class="n">IDG_class</span><span class="p">]</span> <span class="o">=</span> <span class="n">num_available</span>
<span class="n">pdbs_per_class</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">{'GPCRs': 77,
'ion-channels': 70,
'kinases': 304,
'nuclear-receptors': 41,
'transporters': 15}</code></p>
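<p>The same count can be written more compactly by counting the non-missing entries directly. The snippet below is a sketch on toy data: the two small Data Frames are made up and do not reproduce the numbers above.</p>

```python
import pandas as pd

# Toy stand-in for IDG_data: None marks a target with no PDB structure.
IDG_data = {
    "GPCRs": pd.DataFrame({"PDB_IDS": [["6br7"], None, ["3vk5", "3vka"]]}),
    "kinases": pd.DataFrame({"PDB_IDS": [None, ["2md0"]]}),
}

# notna() flags targets with at least one structure; sum() counts them.
pdbs_per_class = {
    name: int(df["PDB_IDS"].notna().sum())
    for name, df in IDG_data.items()
}
print(pdbs_per_class)
```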
<h2 id="visualizing-the-data">Visualizing the Data</h2>
<p>Finally, we visualize the results with a pie chart:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="n">width</span><span class="o">=</span><span class="mf">0.3</span>
<span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">"{}: {}"</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="k">for</span> <span class="n">f</span><span class="p">,</span> <span class="n">n</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">pdbs_per_class</span><span class="p">.</span><span class="n">keys</span><span class="p">(),</span>
<span class="n">pdbs_per_class</span><span class="p">.</span><span class="n">values</span><span class="p">())]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">pie</span><span class="p">(</span><span class="n">pdbs_per_class</span><span class="p">.</span><span class="n">values</span><span class="p">(),</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">,</span> <span class="n">radius</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">wedgeprops</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span><span class="n">width</span><span class="o">=</span><span class="n">width</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s">'w'</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span></code></pre></figure>
<p><img src="/assets/images/2019-2-14-mapping-pharos-targets-to-pdb-structures_19_0.svg" alt="svg" /></p>Ricardo Avilaravila@protonmail.comThe SIFTS (Structure Integration with Function, Taxonomy and Sequence) database provides mappings between UniProt genes and PDB structures, among other things. Using these mappings, we count the number of targets from the NIH's Pharos database which have ligands with known binding affinities, and at least one structure in the Protein Data Bank.