Human TM Domain Target Analysis

Target Analysis for the human membrane proteome

The data for the human proteome analysis were taken from all unique sequences in the combined datasets of all known and novel peptides in the ENSEMBL 37.57 release of the human genome. To minimize false positives, only sequences with at least 2 predicted transmembrane helices (as predicted by TMHMM2.0) were considered (5554 unique sequences). For each of these sequences, an approximation of the transmembrane domain (the span from the first to the last predicted transmembrane helix residue, plus 10 residues on either end, if present) was used for the analysis.
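The domain approximation described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function names are hypothetical, and the helices are assumed to be already parsed from the TMHMM output into 1-based, inclusive (start, end) residue pairs.

```python
def tm_domain_bounds(helices, seq_len, flank=10, min_helices=2):
    """Return 1-based inclusive (start, end) of the approximate TM domain.

    helices: list of (start, end) pairs, 1-based inclusive (assumed input format).
    Returns None if the sequence has fewer than min_helices predicted helices.
    """
    if len(helices) < min_helices:
        return None
    first = min(start for start, _ in helices)
    last = max(end for _, end in helices)
    # Add up to `flank` residues on either side, clipped to the sequence ends.
    return max(1, first - flank), min(seq_len, last + flank)


def tm_domain(sequence, helices, flank=10, min_helices=2):
    """Return the TM domain subsequence, or None if the sequence is filtered out."""
    bounds = tm_domain_bounds(helices, len(sequence), flank, min_helices)
    if bounds is None:
        return None
    start, end = bounds
    return sequence[start - 1:end]
```

For a 100-residue sequence with helices at 15-37 and 50-72, the domain would span residues 5 to 82. When fewer than 10 flanking residues exist, the domain simply ends at the sequence terminus.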

According to Fagerberg et al., TMHMM is the most conservative of several TMH prediction methods, and its numbers are the closest to the consensus set of transmembrane sequences in the human genome. The assembly of the human genome used in this analysis is more current than Fagerberg's curated dataset at proteinatlas.org. The dataset is expected to change slightly after a more thorough analysis that uses additional TMH prediction programs and/or PFAM domain definitions to extract the TMH domains. We used the current dataset because we had a fairly recent (2010) dataset of comparative models for all sequences in this particular release of the human genome.

To be revised after the modweb run finishes: 438 sequences from this dataset of human membrane proteins have been modeled with >85% coverage of the TMH domain and a sequence identity >=30% (as of June 2010).

Clustering has been performed using blastclust (30% sequence identity, 50% coverage). The clusters were divided into two groups: clusters including one or more sequence(s) with significant model coverage (as defined above): human_modeled.html, and clusters that don't include a sequence with significant model coverage: human_notmodeled.html.
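The split into the two groups can be illustrated with a short Python sketch. It assumes the clusters are available as lists of sequence identifiers and that each sequence maps to a list of (coverage, identity) pairs for its models, both in percent. These data structures and function names are hypothetical; only the thresholds (>85% TMH domain coverage, >=30% sequence identity) come from the text.

```python
def is_significantly_modeled(coverage, identity, min_cov=85.0, min_id=30.0):
    """True if a model covers >85% of the TMH domain at >=30% sequence identity."""
    return coverage > min_cov and identity >= min_id


def split_clusters(clusters, models):
    """Split clusters into (modeled, not_modeled).

    clusters: list of lists of sequence ids (assumed format).
    models:   dict mapping sequence id -> list of (coverage, identity) pairs.
    A cluster counts as modeled if at least one member has a significant model.
    """
    modeled, not_modeled = [], []
    for cluster in clusters:
        has_model = any(
            is_significantly_modeled(cov, ident)
            for seq_id in cluster
            for cov, ident in models.get(seq_id, [])
        )
        (modeled if has_model else not_modeled).append(cluster)
    return modeled, not_modeled
```

With clusters [["a", "b"], ["c"]] and models {"b": [(90, 35)], "c": [(80, 50)]}, the first cluster lands in the modeled group and the second in the not-modeled group, because the model for "c" covers only 80% of the domain.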

In addition to the human transmembrane domain proteome, we included the transmembrane domains of all sequences from 40 organisms in the clustering procedure. The phylogenetic tree has been constructed using the NCBI Taxonomy Browser.

The clustering results are summarized in the following graphs in the standard structural genomics style (Vitkup et al., 2001). The green line shows how many sequences are covered by a given number of structures if one sequence per cluster is structurally characterized, starting with the largest cluster. The red line shows the same relationship if targets are selected at random. The first graph represents just the human sequences; the second graph also includes the other 40 organisms. Finally, the results are summarized below.
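The two curves can be approximated from the cluster sizes alone. The Python sketch below is a minimal illustration under stated assumptions, not the procedure used to produce the graphs: the function names are hypothetical, and random selection is modeled as repeatedly picking a random sequence that is not yet covered, so a cluster is hit with probability proportional to its size.

```python
import random


def greedy_curve(cluster_sizes):
    """Sequences covered after each structure, largest cluster first (green line)."""
    covered, curve = 0, []
    for size in sorted(cluster_sizes, reverse=True):
        covered += size
        curve.append(covered)
    return curve


def random_curve(cluster_sizes, seed=0):
    """Sequences covered after each structure under random target selection (red line).

    Assumption: each target is a random uncovered sequence, so the chance of
    hitting a cluster is proportional to its size; solving it covers the cluster.
    """
    rng = random.Random(seed)
    remaining = list(cluster_sizes)
    covered, curve = 0, []
    while remaining:
        idx = rng.choices(range(len(remaining)), weights=remaining)[0]
        covered += remaining.pop(idx)
        curve.append(covered)
    return curve
```

For cluster sizes [1, 5, 3], greedy_curve returns [5, 8, 9]. A single random run never exceeds the greedy curve at any point, and averaging over many seeds would give a smoother estimate of the random-selection line.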

Result set:

Human - 30% clustering, 50% coverage:

To be revised after the ModPipe run finishes.

Number of sequences: 5458
Number of unique TM Domains: 4484
Number of sequences plus homolog sequences: 14585
Number of clusters: 1444
Number of sequences in human clusters with a known structure* covering at least 85% of the TM region: 73
Number of sequences in all clusters with a known structure* covering at least 85% of the TM region: 124
Number of clusters minus clusters with a known structure covering at least 85% of the TM region: 1391
Number of human sequences (clusters with a known structure covering 85% removed): 3317
Number of human+homolog sequences (clusters with a known structure covering 85% removed): 9937
Number of clusters with at least 5 sequences in the cluster: 137 (with 1435 human sequences, 4430 sequences total)

* A known structure can be either a structure related to the sequence at 90% sequence identity or higher, or a model based on a template at 90% sequence identity or higher.