Human TM Domain Target Analysis

Target Analysis for the human membrane proteome

The data for the human proteome analysis were taken from all unique sequences in the combined datasets of all known and novel peptides in the ENSEMBL 37.57 release of the human genome. To minimize false positives, only sequences with at least 2 transmembrane helices predicted by TMHMM2.0 were considered (5554 unique sequences). For each of these sequences, an approximation of the transmembrane domain (from the first to the last predicted transmembrane helix residue, plus 10 residues on either end, if present) was used for the analysis.
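The domain approximation described above can be sketched as follows. This is a minimal illustration, not the script used for the analysis: the function names, the `(start, end)` helix representation (1-based, inclusive), and the reading of "if present" as clamping the flanks to the sequence ends are all assumptions.

```python
def tm_domain_bounds(helices, seq_len, flank=10, min_helices=2):
    """Return 1-based inclusive (start, end) of the approximate TM domain.

    helices: list of (start, end) tuples for predicted TM helices
             (hypothetical representation, 1-based inclusive).
    Returns None if fewer than min_helices helices were predicted.
    """
    if len(helices) < min_helices:
        return None
    first = min(start for start, _ in helices)
    last = max(end for _, end in helices)
    # Add the flanking residues, clamped to the sequence ends
    # (assumed meaning of "if present").
    return max(1, first - flank), min(seq_len, last + flank)


def tm_domain(sequence, helices, flank=10):
    """Cut the approximate TM domain out of a sequence string."""
    bounds = tm_domain_bounds(helices, len(sequence), flank)
    if bounds is None:
        return None
    start, end = bounds
    return sequence[start - 1:end]
```

For example, helices at 15-37 and 50-72 in a 100-residue sequence give a domain spanning residues 5-82, while a sequence with a single predicted helix is skipped.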

According to Fagerberg et al., TMHMM is the most conservative of several TMH prediction methods, and its counts are the closest to the consensus set of transmembrane sequences in the human genome. The assembly of the human genome used in this analysis is more recent than Fagerberg's curated dataset at proteinatlas.org. The dataset is expected to change slightly after a more thorough analysis that uses additional TMH prediction programs and/or PFAM domain definitions to extract the TMH domains. We nevertheless used the current dataset because we had a fairly recent (2010) set of comparative models for all sequences in this particular release of the human genome.

To be revised after the ModWeb run finishes: 438 sequences from this dataset of human membrane proteins have been modeled with >85% coverage of the TMH domain and a sequence identity >=30% (as of June 2010).
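The "significant model coverage" criterion can be expressed as a small filter. This is only a sketch under stated assumptions: coverage is taken to be the fraction of TMH-domain residues that fall inside the modeled region, and the model records (dicts with `start`, `end`, and `identity` keys) are a hypothetical format, not the ModWeb output format.

```python
def domain_coverage(model_start, model_end, dom_start, dom_end):
    """Fraction of the TMH domain covered by a model (1-based inclusive ranges)."""
    overlap = min(model_end, dom_end) - max(model_start, dom_start) + 1
    return max(0, overlap) / (dom_end - dom_start + 1)


def has_significant_model(models, dom_start, dom_end,
                          min_coverage=0.85, min_identity=30.0):
    """True if any model has >85% domain coverage and >=30% sequence identity.

    models: iterable of dicts with 'start', 'end', 'identity' (percent);
            this record layout is an assumption made for illustration.
    """
    return any(
        domain_coverage(m["start"], m["end"], dom_start, dom_end) > min_coverage
        and m["identity"] >= min_identity
        for m in models
    )
```

Note that the coverage threshold is strict (>85%) while the identity threshold is inclusive (>=30%), following the wording above.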

Clustering was performed using blastclust (30% sequence identity, 50% coverage). The clusters were divided into two groups: clusters that include at least one sequence with significant model coverage (as defined above), listed in human_modeled.html, and clusters that do not include any sequence with significant model coverage, listed in human_notmodeled.html.
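The split into modeled and not-modeled clusters could look roughly like this. The sketch assumes the blastclust output layout of one cluster per line with whitespace-separated sequence identifiers; the function names and the `modeled_ids` set are hypothetical.

```python
def parse_clusters(text):
    """Parse cluster output: one cluster per line, identifiers separated by spaces."""
    return [line.split() for line in text.splitlines() if line.strip()]


def split_clusters(clusters, modeled_ids):
    """Divide clusters by whether any member has significant model coverage.

    modeled_ids: set of sequence identifiers that passed the coverage filter
                 (hypothetical input, produced elsewhere).
    Returns (modeled_clusters, not_modeled_clusters).
    """
    modeled, not_modeled = [], []
    for cluster in clusters:
        if any(seq_id in modeled_ids for seq_id in cluster):
            modeled.append(cluster)
        else:
            not_modeled.append(cluster)
    return modeled, not_modeled
```

A cluster counts as modeled as soon as one of its members is in `modeled_ids`, which matches the "one or more sequence(s)" wording of the grouping.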

In addition to the human transmembrane domain proteome, we included the transmembrane domains of all sequences from 40 other organisms in the clustering procedure. The phylogenetic tree was constructed using the NCBI Taxonomy Browser.

The clustering results are summarized in the following graphs in the standard structural genomics style (Vitkup et al., 2001). The green line represents the number of structures needed for a given number of sequences covered if one sequence per cluster is structurally characterized, starting with the largest cluster. The red line represents the number of structures needed if the target selection is random. The first graph represents just the human sequences; the other 40 organisms have been added for the second graph. Finally, the results are summarized below.
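The two curves can be reproduced from the cluster sizes alone. The following is a sketch, not the plotting code behind the graphs: it assumes that one structure covers every sequence in its cluster, and that "random" means picking targets uniformly among the sequences not yet covered. The function names and the `seed` parameter are illustrative.

```python
import random
from itertools import accumulate


def greedy_curve(cluster_sizes):
    """Sequences covered after k structures, taking the largest clusters first."""
    return list(accumulate(sorted(cluster_sizes, reverse=True)))


def random_curve(cluster_sizes, seed=0):
    """Sequences covered after k structures for one random selection order.

    Assumption: each target is drawn uniformly from the sequences that are
    not yet covered, and solving it covers its whole cluster.
    """
    rng = random.Random(seed)
    # One entry per sequence, labelled with the index of its cluster.
    sequences = [i for i, size in enumerate(cluster_sizes) for _ in range(size)]
    rng.shuffle(sequences)
    covered, total, curve = set(), 0, []
    for cluster_id in sequences:
        if cluster_id in covered:
            continue  # sequence already covered by an earlier structure
        covered.add(cluster_id)
        total += cluster_sizes[cluster_id]
        curve.append(total)
    return curve
```

Entry k-1 of each list is the number of sequences covered by k structures. A single call to `random_curve` is one realization; averaging over several seeds would give a smoother estimate of the random-selection line.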

 

Result set:

Human - 30% clustering, 50% coverage:

To be revised after the ModPipe run finishes.

* A known structure is either a structure related to the sequence at 90% sequence identity or higher, or a model based on a template at 90% sequence identity or higher.