Mining Pharos with MySQL and Python

Why MySQL and Python?
Connecting to Pharos
Executing database queries
Importing the tables to Pandas
Filtering by number of actives
Joining Tables
Exporting the data

Why MySQL and Python?

Previously, I demonstrated how to use the SIFTS database to find UniProt-to-PDB mappings for proteins from the Pharos database. To do this, we downloaded csv format files for different receptor classes directly from the Pharos website. However, manually downloading these data files is tedious, and does not allow us to keep our data up to date with future changes in the source database. A much more efficient way is to obtain this data directly through SQL queries.

I must confess that I am not proficient when it comes to complex table joins and filters in SQL, but I can do the job in Python! Additionally reading SQL tables into Python allows us to use Python’s data visualization libraries on the data with ease.

In this notebook, we use MySQL Connector and Python’s Pandas library to retrieve and manipulate data for Pharos targets. The goal is to obtain a dataset of targets that contain more than 15 active compounds, along with information about their different target classes.

All the code in this post is also available as a Jupyter notebook here.

To install mysql-connector, run: pip install mysql-connector-python-rf.

First, we import the necessary libraries:

import mysql.connector as sql
import pandas as pd
import matplotlib.pyplot as plt

Connecting to Pharos

We use Python to create an SQL connection to the Pharos database:

db_connection = sql.connect(host='tcrd.kmc.io', db='tcrd540', user='tcrd')
db_connection

<mysql.connector.connection.MySQLConnection at 0x7f428fca0668>

In order to use the new connnection, we need to create a cursor object, which allows us to send instructions to the database:

db_cursor = db_connection.cursor()

Executing database queries

We can use the newly created cursor to execute queries. First we execute the SHOW TABLES MySQL command, to get an idea of the kind of tables we can collect information from.

The cursor.fetchall() method returns a list, and is equivalent to calling list() on the cursor object.

db_cursor.execute('SHOW TABLES;')
tables = db_cursor.fetchall()
print(tables)

[('alias',), ('cmpd_activity',), ('cmpd_activity_type',), ('compartment',), ('compartment_type',), ('data_type',), ('dataset',), ('dbinfo',), ('disease',), ('disease_type',), ('do',), ('do_parent',), ('drug_activity',), ('dto',), ('expression',), ('expression_type',), ('feature',), ('gene_attribute',), ('gene_attribute_type',), ('generif',), ('goa',), ('hgram_cdf',), ('info_type',), ('kegg_distance',), ('kegg_nearest_tclin',), ('locsig',), ('mlp_assay_info',), ('ortholog',), ('ortholog_disease',), ('p2pc',), ('panther_class',), ('patent_count',), ('pathway',), ('pathway_type',), ('phenotype',), ('phenotype_type',), ('pmscore',), ('ppi',), ('ppi_type',), ('protein',), ('protein2pubmed',), ('provenance',), ('ptscore',), ('pubmed',), ('t2tc',), ('target',), ('tdl_info',), ('tdl_update_log',), ('techdev_contact',), ('techdev_info',), ('tinx_articlerank',), ('tinx_disease',), ('tinx_importance',), ('tinx_novelty',), ('tinx_target',), ('xref',), ('xref_type',)]

Above, we see a list of the tables. We can use the DESCRIBE query to obtain a list of the attributes of a particular table. In this case, we are interested in the protein, target, and cmpd_activity tables.

db_cursor.execute('DESCRIBE protein;')
list(db_cursor)

[('id', 'int(11)', 'NO', 'PRI', None, 'auto_increment'), ('name', 'varchar(255)', 'NO', 'UNI', None, ''), ('description', 'text', 'NO', '', None, ''), ('uniprot', 'varchar(20)', 'NO', 'UNI', None, ''), ('up_version', 'int(11)', 'YES', '', None, ''), ('geneid', 'int(11)', 'YES', '', None, ''), ('sym', 'varchar(20)', 'YES', '', None, ''), ('family', 'varchar(255)', 'YES', '', None, ''), ('chr', 'varchar(255)', 'YES', '', None, ''), ('seq', 'text', 'YES', '', None, ''), ('dtoid', 'varchar(13)', 'YES', '', None, ''), ('stringid', 'varchar(15)', 'YES', '', None, '')]

Importing the tables to Pandas

Compound Activity

Next, we use Pandas to read the data directly from the tables. First the cmpd_activity table, which contains information about the binding affinity of compounds to targets in the database:

query = "SELECT id, target_id, cmpd_id_in_src, cmpd_name_in_src, \
         smiles, act_value, act_type \
         FROM cmpd_activity"
cmpd_activity = pd.read_sql(query, con=db_connection)

print(cmpd_activity.shape)
cmpd_activity.head(3)

(382291, 7)

	id	target_id	cmpd_id_in_src	cmpd_name_in_src	smiles	act_value	act_type
0	1	3006	CHEMBL365855	N-(5-Cyclobutyl-thiazol-2-yl)-2-phenyl-acetamide	O=C(Cc1ccccc1)Nc2ncc(s2)C3CCC3	7.60	IC50
1	2	3006	CHEMBL3775677	3-Isopropyl-5-(2,3-dihydroxypropyl)amino-7-[4-...	CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(O)...	7.68	IC50
2	3	3006	CHEMBL3775608	3-Isopropyl-5-(3-amino-2-hydroxypropyl)amino-7...	CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(N)...	7.77	IC50

Protein

We read in the data we want from the protein table:

query = "SELECT id, name, description, uniprot, family, seq \
         FROM protein"
protein = pd.read_sql(query, con=db_connection)

print(protein.shape)
protein.head(3)

(20244, 6)

	id	name	description	uniprot	family	seq
0	1	1433E_HUMAN	14-3-3 protein epsilon	P62258	Belongs to the 14-3-3 family.	MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLS...
1	2	1433F_HUMAN	14-3-3 protein eta	Q04917	Belongs to the 14-3-3 family.	MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLS...
2	3	1433T_HUMAN	14-3-3 protein theta	P27348	Belongs to the 14-3-3 family.	MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSV...

Target

For the target table, we are interested in filtering for targets that are in the Tclin or Tchem development classifications.

query = "SELECT id, name, tdl, fam, famext \
         FROM target \
         WHERE tdl='Tclin' OR tdl='Tchem'"
target = pd.read_sql(query, con=db_connection)

print(target.shape)
target.head(3)

(2211, 5)

	id	name	tdl	fam	famext
0	2	14-3-3 protein eta	Tchem	None	None
1	3	14-3-3 protein theta	Tchem	None	None
2	23	3 beta-hydroxysteroid dehydrogenase/Delta 5-->...	Tchem	Enzyme	3-beta-HSD

Since we have all the data stored in memory, we no longer need the database connection.

db_connection.close()

Filtering by number of actives

Here, we filter out receptors that contain less than 15 active molecules.

num_actives = {}
target_ids = cmpd_activity.target_id.unique()
for i in target_ids:
    num_actives[i] = len(cmpd_activity[cmpd_activity.target_id == i])

target['num_actives'] = target.id.apply(lambda x: num_actives.get(x))
target = target[target['num_actives'] >= 15]
target.num_actives = target.num_actives.apply(int)  # Convert from float to int
target.shape

(1067, 6)

Whereas before we had a total of 2,211 targets in Tclin and Tchem, now we only have 1,067 which contain more than 15 experimental activity values.

Finally, we create a pie chart to visualize the number of targets in each target family:

tchem_tclin_fams = {}
families = [fam for fam in target.fam.unique() if fam is not None]

for f in sorted(families):
    tchem_tclin_fams[f] = len(target[target.fam == f])
tchem_tclin_fams['None'] = len(target[target.fam.isna()])

tchem_tclin_fams

{'Enzyme': 348, 'Epigenetic': 42, 'GPCR': 189, 'IC': 91, 'Kinase': 205, 'NR': 28, 'TF': 6, 'TF; Epigenetic': 5, 'Transporter': 35, 'None': 118}

plt.figure(figsize=(4, 4))
width = .6
explode = [0, 0, 0, 0, 0, .3, .2, .1, 0, 0]
labels = ["{}: {}".format(f, n) for f, n in zip(tchem_tclin_fams.keys(),
          tchem_tclin_fams.values())]
plt.pie(tchem_tclin_fams.values(), labels=labels, radius=2, explode=explode,
        wedgeprops=dict(width=width, edgecolor='w'), autopct='%1.0f%%',
        pctdistance=.8, labeldistance=1.1)

plt.savefig("pharos_targets.svg", bbox_inches = 'tight')

svg

From this target data, we could further filter down to receptors that have known protein structures, as shown in the SIFTS database post. In this case, we will simply concatenate the data from the Protein table to the Target table, in order to obtain information about the UniProt ID, protein ontology, and sequence. Finally, we will write the data to csv files for further analysis.

Joining Tables

We need to join the Protein and Target tables by id. The two tables should have the same size:

protein = protein[protein.id.isin(target.id)]
protein.shape

(1067, 6)

Joining the tables:

protein = protein.set_index("id")
target = target.set_index("id")
result = pd.concat([target, protein], axis=1, join='outer')

result.head(3)

	name	tdl	fam	famext	num_actives	name	description	uniprot	family	seq
id
26	5-hydroxytryptamine receptor 2B	Tclin	GPCR	GPCR	777	5HT2B_HUMAN	5-hydroxytryptamine receptor 2B	P41595	Belongs to the G-protein coupled receptor 1 fa...	MALSYRVSELQSTIPEHILQSTFVHVISSNWSGLQTESIPEEMKQI...
27	5-hydroxytryptamine receptor 2C	Tclin	GPCR	GPCR	1612	5HT2C_HUMAN	5-hydroxytryptamine receptor 2C	P28335	Belongs to the G-protein coupled receptor 1 fa...	MVNLRNAVHSFLVHLIGLLVWQCDISVSPVAAIVTDIFNTSDGGRF...
30	5'-nucleotidase	Tchem	Enzyme	None	23	5NTD_HUMAN	5'-nucleotidase	P21589	Belongs to the 5'-nucleotidase family.	MCPRAARAPATLLLALGAVLWPAAGAWELTILHTNDVHSRLEQTSE...

Exporting the data

We separate each target class into different Data Frames, store these in a dictionary, and also save them to separate csv files.

target_dfs = {}
for f in families:
    target_dfs[f] = result[result.fam == f]
    target_dfs[f].to_csv(f + ".csv")

PREVIOUSMapping Pharos Targets to PDB Structures

NEXTMachine Learning Methods for LogP Prediction: Pt. 1