Mining Pharos with MySQL and Python



Why MySQL and Python?

Previously, I demonstrated how to use the SIFTS database to find UniProt-to-PDB mappings for proteins from the Pharos database. To do this, we downloaded csv format files for different receptor classes directly from the Pharos website. However, manually downloading these data files is tedious, and does not allow us to keep our data up to date with future changes in the source database. A much more efficient way is to obtain this data directly through SQL queries.

I must confess that I am not proficient when it comes to complex table joins and filters in SQL, but I can do the job in Python! Additionally reading SQL tables into Python allows us to use Python’s data visualization libraries on the data with ease.

In this notebook, we use MySQL Connector and Python’s Pandas library to retrieve and manipulate data for Pharos targets. The goal is to obtain a dataset of targets that contain more than 15 active compounds, along with information about their different target classes.

All the code in this post is also available as a Jupyter notebook here.

To install mysql-connector, run: pip install mysql-connector-python-rf.

First, we import the necessary libraries:

import mysql.connector as sql
import pandas as pd
import matplotlib.pyplot as plt

Connecting to Pharos

We use Python to create an SQL connection to the Pharos database:

db_connection = sql.connect(host='', db='tcrd540', user='tcrd')

<mysql.connector.connection.MySQLConnection at 0x7f428fca0668>

In order to use the new connnection, we need to create a cursor object, which allows us to send instructions to the database:

db_cursor = db_connection.cursor()

Executing database queries

We can use the newly created cursor to execute queries. First we execute the SHOW TABLES MySQL command, to get an idea of the kind of tables we can collect information from.

The cursor.fetchall() method returns a list, and is equivalent to calling list() on the cursor object.

db_cursor.execute('SHOW TABLES;')
tables = db_cursor.fetchall()

[('alias',), ('cmpd_activity',), ('cmpd_activity_type',), ('compartment',), ('compartment_type',), ('data_type',), ('dataset',), ('dbinfo',), ('disease',), ('disease_type',), ('do',), ('do_parent',), ('drug_activity',), ('dto',), ('expression',), ('expression_type',), ('feature',), ('gene_attribute',), ('gene_attribute_type',), ('generif',), ('goa',), ('hgram_cdf',), ('info_type',), ('kegg_distance',), ('kegg_nearest_tclin',), ('locsig',), ('mlp_assay_info',), ('ortholog',), ('ortholog_disease',), ('p2pc',), ('panther_class',), ('patent_count',), ('pathway',), ('pathway_type',), ('phenotype',), ('phenotype_type',), ('pmscore',), ('ppi',), ('ppi_type',), ('protein',), ('protein2pubmed',), ('provenance',), ('ptscore',), ('pubmed',), ('t2tc',), ('target',), ('tdl_info',), ('tdl_update_log',), ('techdev_contact',), ('techdev_info',), ('tinx_articlerank',), ('tinx_disease',), ('tinx_importance',), ('tinx_novelty',), ('tinx_target',), ('xref',), ('xref_type',)]

Above, we see a list of the tables. We can use the DESCRIBE query to obtain a list of the attributes of a particular table. In this case, we are interested in the protein, target, and cmpd_activity tables.

db_cursor.execute('DESCRIBE protein;')

[('id', 'int(11)', 'NO', 'PRI', None, 'auto_increment'), ('name', 'varchar(255)', 'NO', 'UNI', None, ''), ('description', 'text', 'NO', '', None, ''), ('uniprot', 'varchar(20)', 'NO', 'UNI', None, ''), ('up_version', 'int(11)', 'YES', '', None, ''), ('geneid', 'int(11)', 'YES', '', None, ''), ('sym', 'varchar(20)', 'YES', '', None, ''), ('family', 'varchar(255)', 'YES', '', None, ''), ('chr', 'varchar(255)', 'YES', '', None, ''), ('seq', 'text', 'YES', '', None, ''), ('dtoid', 'varchar(13)', 'YES', '', None, ''), ('stringid', 'varchar(15)', 'YES', '', None, '')]

Importing the tables to Pandas

Compound Activity

Next, we use Pandas to read the data directly from the tables. First the cmpd_activity table, which contains information about the binding affinity of compounds to targets in the database:

query = "SELECT id, target_id, cmpd_id_in_src, cmpd_name_in_src, \
         smiles, act_value, act_type \
         FROM cmpd_activity"
cmpd_activity = pd.read_sql(query, con=db_connection)

(382291, 7)

id target_id cmpd_id_in_src cmpd_name_in_src smiles act_value act_type
0 1 3006 CHEMBL365855 N-(5-Cyclobutyl-thiazol-2-yl)-2-phenyl-acetamide O=C(Cc1ccccc1)Nc2ncc(s2)C3CCC3 7.60 IC50
1 2 3006 CHEMBL3775677 3-Isopropyl-5-(2,3-dihydroxypropyl)amino-7-[4-... CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(O)... 7.68 IC50
2 3 3006 CHEMBL3775608 3-Isopropyl-5-(3-amino-2-hydroxypropyl)amino-7... CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(N)... 7.77 IC50


We read in the data we want from the protein table:

query = "SELECT id, name, description, uniprot, family, seq \
         FROM protein"
protein = pd.read_sql(query, con=db_connection)

(20244, 6)

id name description uniprot family seq
0 1 1433E_HUMAN 14-3-3 protein epsilon P62258 Belongs to the 14-3-3 family. MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLS...
1 2 1433F_HUMAN 14-3-3 protein eta Q04917 Belongs to the 14-3-3 family. MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLS...
2 3 1433T_HUMAN 14-3-3 protein theta P27348 Belongs to the 14-3-3 family. MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSV...


For the target table, we are interested in filtering for targets that are in the Tclin or Tchem development classifications.

query = "SELECT id, name, tdl, fam, famext \
         FROM target \
         WHERE tdl='Tclin' OR tdl='Tchem'"
target = pd.read_sql(query, con=db_connection)

(2211, 5)

id name tdl fam famext
0 2 14-3-3 protein eta Tchem None None
1 3 14-3-3 protein theta Tchem None None
2 23 3 beta-hydroxysteroid dehydrogenase/Delta 5-->... Tchem Enzyme 3-beta-HSD

Since we have all the data stored in memory, we no longer need the database connection.


Filtering by number of actives

Here, we filter out receptors that contain less than 15 active molecules.

num_actives = {}
target_ids = cmpd_activity.target_id.unique()
for i in target_ids:
    num_actives[i] = len(cmpd_activity[cmpd_activity.target_id == i])

target['num_actives'] = x: num_actives.get(x))
target = target[target['num_actives'] >= 15]
target.num_actives = target.num_actives.apply(int)  # Convert from float to int

(1067, 6)

Whereas before we had a total of 2,211 targets in Tclin and Tchem, now we only have 1,067 which contain more than 15 experimental activity values.

Finally, we create a pie chart to visualize the number of targets in each target family:

tchem_tclin_fams = {}
families = [fam for fam in target.fam.unique() if fam is not None]

for f in sorted(families):
    tchem_tclin_fams[f] = len(target[target.fam == f])
tchem_tclin_fams['None'] = len(target[target.fam.isna()])

{'Enzyme': 348, 'Epigenetic': 42, 'GPCR': 189, 'IC': 91, 'Kinase': 205, 'NR': 28, 'TF': 6, 'TF; Epigenetic': 5, 'Transporter': 35, 'None': 118}

plt.figure(figsize=(4, 4))
width = .6
explode = [0, 0, 0, 0, 0, .3, .2, .1, 0, 0]
labels = ["{}: {}".format(f, n) for f, n in zip(tchem_tclin_fams.keys(),
plt.pie(tchem_tclin_fams.values(), labels=labels, radius=2, explode=explode,
        wedgeprops=dict(width=width, edgecolor='w'), autopct='%1.0f%%',
        pctdistance=.8, labeldistance=1.1)

plt.savefig("pharos_targets.svg", bbox_inches = 'tight')


From this target data, we could further filter down to receptors that have known protein structures, as shown in the SIFTS database post. In this case, we will simply concatenate the data from the Protein table to the Target table, in order to obtain information about the UniProt ID, protein ontology, and sequence. Finally, we will write the data to csv files for further analysis.

Joining Tables

We need to join the Protein and Target tables by id. The two tables should have the same size:

protein = protein[]

(1067, 6)

Joining the tables:

protein = protein.set_index("id")
target = target.set_index("id")
result = pd.concat([target, protein], axis=1, join='outer')
name tdl fam famext num_actives name description uniprot family seq
26 5-hydroxytryptamine receptor 2B Tclin GPCR GPCR 777 5HT2B_HUMAN 5-hydroxytryptamine receptor 2B P41595 Belongs to the G-protein coupled receptor 1 fa... MALSYRVSELQSTIPEHILQSTFVHVISSNWSGLQTESIPEEMKQI...
27 5-hydroxytryptamine receptor 2C Tclin GPCR GPCR 1612 5HT2C_HUMAN 5-hydroxytryptamine receptor 2C P28335 Belongs to the G-protein coupled receptor 1 fa... MVNLRNAVHSFLVHLIGLLVWQCDISVSPVAAIVTDIFNTSDGGRF...
30 5'-nucleotidase Tchem Enzyme None 23 5NTD_HUMAN 5'-nucleotidase P21589 Belongs to the 5'-nucleotidase family. MCPRAARAPATLLLALGAVLWPAAGAWELTILHTNDVHSRLEQTSE...

Exporting the data

We separate each target class into different Data Frames, store these in a dictionary, and also save them to separate csv files.

target_dfs = {}
for f in families:
    target_dfs[f] = result[result.fam == f]
    target_dfs[f].to_csv(f + ".csv")