Mapping Pharos Targets to PDB Structures


Problem Description

The Structure Integration with Function, Taxonomy and Sequences (SIFTS) database provides mappings between UniProt and PDB entries, as well as annotations from GO, InterPro, Pfam, CATH, SCOP, PubMed, Ensembl and other resources. Here, we use UniProt accession numbers to map all the receptors from the Pharos database to their PDB IDs.

The goal is to obtain a dataset of human targets with available structures and known ligand binding affinities. I also want to see how these PDB structures are distributed across receptor families, such as kinases, GPCRs, ion channels, nuclear receptors, and transporters.

Getting the Data

First, we read the CSV files downloaded from Pharos for targets at the Tclin (targets with approved drugs) and Tchem (targets with known binding affinities) development levels, for several receptor classes. Each CSV file contains the UniProt ID of every receptor in that class. All downloaded data and code are available in my GitHub repository: ravila4/Pharos-to-PDB

import pandas as pd

target_classes = ["GPCRs", "ion-channels", "kinases", "nuclear-receptors", "transporters"]
IDG_data = {}
for tclass in target_classes:
    IDG_data[tclass] = pd.read_csv("data/" + tclass + ".csv", index_col=False)

Read SIFTS Mappings

The UniProt-to-PDB mappings were downloaded as a CSV file from the SIFTS FTP site.

uniprot_to_pdb = pd.read_csv("data/uniprot_pdb.csv", skiprows=1)

A sample of the SIFTS Data Frame:

   SP_PRIMARY  PDB
0  A0A010      5b00;5b01;5b02;5b03;5b0i;5b0j;5b0k;5b0l;5b0m;5...
1  A0A011      3vk5;3vka;3vkb;3vkc;3vkd
2  A0A014C6J9  6br7
3  A0A016UNP9  2md0
4  A0A023GPI4  2m6j
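
To make the shape of this file concrete, here is a minimal sketch that parses a small in-memory stand-in for the mapping. It assumes the real file starts with a single metadata line (which is why skiprows=1 is used) followed by a SP_PRIMARY,PDB header; the metadata text here is a placeholder, and the two data rows are copied from the sample above.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/uniprot_pdb.csv: one metadata line,
# then a header row, then one row per UniProt accession.
sample_csv = io.StringIO(
    "# metadata line (assumed; skipped by skiprows=1)\n"
    "SP_PRIMARY,PDB\n"
    "A0A011,3vk5;3vka;3vkb;3vkc;3vkd\n"
    "A0A014C6J9,6br7\n"
)
sample = pd.read_csv(sample_csv, skiprows=1)

# Each PDB cell is one semicolon-delimited string; split it into a list.
sample["PDB_LIST"] = sample["PDB"].str.split(";")
print(sample[["SP_PRIMARY", "PDB_LIST"]])
```

The point to notice is that every UniProt accession occupies one row, and all of its PDB IDs are packed into a single string that has to be split before use.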

Find PDB IDs

Here’s a function for joining the two Data Frames:

def find_pdbs(df):
    """ Input: Data Frame of Pharos data.
        Output: List with one entry per row: a list of PDB IDs, or None. """
    IDS = []
    for i in range(len(df)):
        pdb_ids = None
        uniprot_id = df["Uniprot ID"].iloc[i]
        mapping = uniprot_to_pdb[uniprot_to_pdb.SP_PRIMARY == uniprot_id]
        if len(mapping) != 0:
            pdb_ids = mapping.PDB.iloc[0].split(';')
        # Record the result for this row (None when no structure is mapped)
        IDS.append(pdb_ids)
    return IDS

Adding PDB IDs to Pharos targets:

for df in IDG_data.values():
    df['PDB_IDS'] = find_pdbs(df)
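
The same lookup can also be written as a left join, which avoids scanning the SIFTS table once per target. This is a sketch of an alternative, not the method used in this post: the accessions and PDB IDs below are made-up toy values, and it assumes each UniProt accession appears at most once in SP_PRIMARY (a duplicated key would duplicate rows in the result).

```python
import pandas as pd

# Toy stand-ins for one Pharos table and the SIFTS mapping (illustrative values).
pharos = pd.DataFrame({"Uniprot ID": ["P00001", "P00002", "P00003"]})
sifts = pd.DataFrame({
    "SP_PRIMARY": ["P00001", "P00003"],
    "PDB": ["1abc;2def", "3ghi"],
})

# A left join keeps every Pharos target; targets without a mapping get NaN.
merged = pharos.merge(sifts, how="left",
                      left_on="Uniprot ID", right_on="SP_PRIMARY")

# Split the semicolon-delimited string into a list; NaN stays NaN.
merged["PDB_IDS"] = merged["PDB"].str.split(";")
merged = merged.drop(columns=["SP_PRIMARY", "PDB"])
print(merged)
```

Unmapped targets end up as NaN rather than None, so a later isna() check on PDB_IDS treats them the same way.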

Summarizing the Data

Number of receptors in each class with at least one structure in the Protein Data Bank:

pdbs_per_class = {}

for IDG_class in IDG_data:
    df = IDG_data[IDG_class]
    num_available = len(df) - sum(df.PDB_IDS.isna())
    pdbs_per_class[IDG_class] = num_available

print(pdbs_per_class)

{'GPCRs': 77, 'ion-channels': 70, 'kinases': 304, 'nuclear-receptors': 41, 'transporters': 15}
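
These counts are targets with at least one structure, which differs from the number of PDB entries per class, since a single target can map to many structures. The sketch below shows how both figures could be computed side by side. It runs on toy data with hypothetical class names and PDB IDs, and assumes a PDB_IDS column holding either a list of IDs or a missing value.

```python
import pandas as pd

# Toy stand-ins for the per-class tables (class names and IDs are illustrative).
toy = {
    "class-a": pd.DataFrame({"PDB_IDS": [["1abc", "2def"], None, ["3ghi"]]}),
    "class-b": pd.DataFrame({"PDB_IDS": [None, ["4jkl"]]}),
}

summary = {}
for name, df in toy.items():
    has_structure = df["PDB_IDS"].notna()
    summary[name] = {
        "targets": len(df),
        # Targets with at least one mapped structure
        "with_structure": int(has_structure.sum()),
        # Total PDB entries, summed over the mapped targets only
        "total_structures": int(df.loc[has_structure, "PDB_IDS"].map(len).sum()),
    }
print(summary)
```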

Visualizing the Data

Finally, we visualize the results with a pie chart:

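import matplotlib.pyplot as plt

# Ring thickness of the wedges; the original value is not shown,
# so 0.4 is an assumed placeholder.
width = 0.4
labels = ["{}: {}".format(f, n) for f, n in zip(pdbs_per_class.keys(),
                                                 pdbs_per_class.values())]
plt.pie(list(pdbs_per_class.values()), labels=labels, radius=1,
        wedgeprops=dict(width=width, edgecolor='w'))
plt.show()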