- Why MySQL and Python?
- Connecting to Pharos
- Executing database queries
- Importing the tables to Pandas
- Filtering by number of actives
- Joining Tables
- Exporting the data
Why MySQL and Python?
Previously, I demonstrated how to use the SIFTS database to find UniProt-to-PDB mappings for proteins from the Pharos database. To do this, we downloaded csv format files for different receptor classes directly from the Pharos website. However, manually downloading these data files is tedious, and does not allow us to keep our data up to date with future changes in the source database. A much more efficient way is to obtain this data directly through SQL queries.
I must confess that I am not proficient when it comes to complex table joins and filters in SQL, but I can do the job in Python! Additionally reading SQL tables into Python allows us to use Python’s data visualization libraries on the data with ease.
In this notebook, we use MySQL Connector and Python’s Pandas library to retrieve and manipulate data for Pharos targets. The goal is to obtain a dataset of targets that contain more than 15 active compounds, along with information about their different target classes.
All the code in this post is also available as a Jupyter notebook here.
To install mysql-connector, run: pip install mysql-connector-python-rf
.
First, we import the necessary libraries:
Connecting to Pharos
We use Python to create an SQL connection to the Pharos database:
<mysql.connector.connection.MySQLConnection at 0x7f428fca0668>
In order to use the new connnection, we need to create a cursor object, which allows us to send instructions to the database:
Executing database queries
We can use the newly created cursor to execute queries.
First we execute the SHOW TABLES
MySQL command, to get an idea of the kind
of tables we can collect information from.
The cursor.fetchall()
method returns a list,
and is equivalent to calling list()
on the cursor object.
[('alias',), ('cmpd_activity',), ('cmpd_activity_type',), ('compartment',), ('compartment_type',), ('data_type',), ('dataset',), ('dbinfo',), ('disease',), ('disease_type',), ('do',), ('do_parent',), ('drug_activity',), ('dto',), ('expression',), ('expression_type',), ('feature',), ('gene_attribute',), ('gene_attribute_type',), ('generif',), ('goa',), ('hgram_cdf',), ('info_type',), ('kegg_distance',), ('kegg_nearest_tclin',), ('locsig',), ('mlp_assay_info',), ('ortholog',), ('ortholog_disease',), ('p2pc',), ('panther_class',), ('patent_count',), ('pathway',), ('pathway_type',), ('phenotype',), ('phenotype_type',), ('pmscore',), ('ppi',), ('ppi_type',), ('protein',), ('protein2pubmed',), ('provenance',), ('ptscore',), ('pubmed',), ('t2tc',), ('target',), ('tdl_info',), ('tdl_update_log',), ('techdev_contact',), ('techdev_info',), ('tinx_articlerank',), ('tinx_disease',), ('tinx_importance',), ('tinx_novelty',), ('tinx_target',), ('xref',), ('xref_type',)]
Above, we see a list of the tables. We can use the DESCRIBE
query to obtain
a list of the attributes of a particular table.
In this case, we are interested in the protein, target, and cmpd_activity tables.
[('id', 'int(11)', 'NO', 'PRI', None, 'auto_increment'),
('name', 'varchar(255)', 'NO', 'UNI', None, ''),
('description', 'text', 'NO', '', None, ''),
('uniprot', 'varchar(20)', 'NO', 'UNI', None, ''),
('up_version', 'int(11)', 'YES', '', None, ''),
('geneid', 'int(11)', 'YES', '', None, ''),
('sym', 'varchar(20)', 'YES', '', None, ''),
('family', 'varchar(255)', 'YES', '', None, ''),
('chr', 'varchar(255)', 'YES', '', None, ''),
('seq', 'text', 'YES', '', None, ''),
('dtoid', 'varchar(13)', 'YES', '', None, ''),
('stringid', 'varchar(15)', 'YES', '', None, '')]
Importing the tables to Pandas
Compound Activity
Next, we use Pandas to read the data directly from the tables. First the cmpd_activity table, which contains information about the binding affinity of compounds to targets in the database:
(382291, 7)
id | target_id | cmpd_id_in_src | cmpd_name_in_src | smiles | act_value | act_type | |
---|---|---|---|---|---|---|---|
0 | 1 | 3006 | CHEMBL365855 | N-(5-Cyclobutyl-thiazol-2-yl)-2-phenyl-acetamide | O=C(Cc1ccccc1)Nc2ncc(s2)C3CCC3 | 7.60 | IC50 |
1 | 2 | 3006 | CHEMBL3775677 | 3-Isopropyl-5-(2,3-dihydroxypropyl)amino-7-[4-... | CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(O)... | 7.68 | IC50 |
2 | 3 | 3006 | CHEMBL3775608 | 3-Isopropyl-5-(3-amino-2-hydroxypropyl)amino-7... | CC(C)c1n[nH]c2c(NCc3ccc(cc3)c4ccccn4)nc(NCC(N)... | 7.77 | IC50 |
Protein
We read in the data we want from the protein table:
(20244, 6)
id | name | description | uniprot | family | seq | |
---|---|---|---|---|---|---|
0 | 1 | 1433E_HUMAN | 14-3-3 protein epsilon | P62258 | Belongs to the 14-3-3 family. | MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLS... |
1 | 2 | 1433F_HUMAN | 14-3-3 protein eta | Q04917 | Belongs to the 14-3-3 family. | MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLS... |
2 | 3 | 1433T_HUMAN | 14-3-3 protein theta | P27348 | Belongs to the 14-3-3 family. | MEKTELIQKAKLAEQAERYDDMATCMKAVTEQGAELSNEERNLLSV... |
Target
For the target table, we are interested in filtering for targets that are in the Tclin or Tchem development classifications.
(2211, 5)
id | name | tdl | fam | famext | |
---|---|---|---|---|---|
0 | 2 | 14-3-3 protein eta | Tchem | None | None |
1 | 3 | 14-3-3 protein theta | Tchem | None | None |
2 | 23 | 3 beta-hydroxysteroid dehydrogenase/Delta 5-->... | Tchem | Enzyme | 3-beta-HSD |
Since we have all the data stored in memory, we no longer need the database connection.
Filtering by number of actives
Here, we filter out receptors that contain less than 15 active molecules.
(1067, 6)
Whereas before we had a total of 2,211 targets in Tclin and Tchem, now we only have 1,067 which contain more than 15 experimental activity values.
Finally, we create a pie chart to visualize the number of targets in each target family:
{'Enzyme': 348,
'Epigenetic': 42,
'GPCR': 189,
'IC': 91,
'Kinase': 205,
'NR': 28,
'TF': 6,
'TF; Epigenetic': 5,
'Transporter': 35,
'None': 118}
From this target data, we could further filter down to receptors that have known protein structures, as shown in the SIFTS database post. In this case, we will simply concatenate the data from the Protein table to the Target table, in order to obtain information about the UniProt ID, protein ontology, and sequence. Finally, we will write the data to csv files for further analysis.
Joining Tables
We need to join the Protein and Target tables by id. The two tables should have the same size:
(1067, 6)
Joining the tables:
name | tdl | fam | famext | num_actives | name | description | uniprot | family | seq | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
26 | 5-hydroxytryptamine receptor 2B | Tclin | GPCR | GPCR | 777 | 5HT2B_HUMAN | 5-hydroxytryptamine receptor 2B | P41595 | Belongs to the G-protein coupled receptor 1 fa... | MALSYRVSELQSTIPEHILQSTFVHVISSNWSGLQTESIPEEMKQI... |
27 | 5-hydroxytryptamine receptor 2C | Tclin | GPCR | GPCR | 1612 | 5HT2C_HUMAN | 5-hydroxytryptamine receptor 2C | P28335 | Belongs to the G-protein coupled receptor 1 fa... | MVNLRNAVHSFLVHLIGLLVWQCDISVSPVAAIVTDIFNTSDGGRF... |
30 | 5'-nucleotidase | Tchem | Enzyme | None | 23 | 5NTD_HUMAN | 5'-nucleotidase | P21589 | Belongs to the 5'-nucleotidase family. | MCPRAARAPATLLLALGAVLWPAAGAWELTILHTNDVHSRLEQTSE... |
Exporting the data
We separate each target class into different Data Frames, store these in a dictionary, and also save them to separate csv files.