Methods used in analysis

Stage 1: Finding human α-helical transmembrane domains

The consensus human membrane proteome was previously predicted using several transmembrane α-helix prediction methods3-8, including TMHMM 2.01 (http://proteinatlas.org)2. However, the human genome assembly used in these studies (ENSEMBL 36.52) has since been updated. Because our goal is to provide the basis for a dynamic, up-to-date assessment that reflects new membrane protein structures as they are determined, we used TMHMM 2.0 to identify membrane-spanning proteins in an updated RefSeq version (BUILD 37.2) of the human genome9. For robustness of membrane protein identification, alignment, and clustering, only membrane proteins predicted to have at least two transmembrane α-helices were included in the analysis. For each full-length sequence, only the transmembrane domain, defined as the segment from the first residue of the first predicted transmembrane α-helix to the last residue of the last predicted transmembrane α-helix, was used in all subsequent analyses. To remove highly similar sequences, the resulting domain set was clustered at 98% sequence identity using USEARCH10, and only one representative of each cluster was retained.
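The domain definition above can be illustrated with a minimal sketch. It assumes the predicted helices are already available as 1-based, inclusive (start, end) residue ranges; the function name and input format are hypothetical and are not part of TMHMM.

```python
from typing import List, Optional, Tuple


def transmembrane_domain(
    sequence: str,
    helices: List[Tuple[int, int]],
    min_helices: int = 2,
) -> Optional[str]:
    """Return the segment spanning the first to the last predicted helix.

    `helices` holds 1-based, inclusive (start, end) residue ranges
    (an assumed input format). Proteins with fewer than `min_helices`
    predicted helices are excluded by returning None.
    """
    if len(helices) < min_helices:
        return None
    first = min(start for start, _ in helices)
    last = max(end for _, end in helices)
    # Convert the 1-based inclusive range to a Python slice.
    return sequence[first - 1:last]


if __name__ == "__main__":
    toy_sequence = "ABCDEFGHIJKLMNOPQRST"  # placeholder letters, not a real protein
    print(transmembrane_domain(toy_sequence, [(3, 6), (10, 15)]))
```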

Stage 2: Assessing current modeling coverage

We attempted to model all transmembrane domain sequences using ModPipe11-13, which relies on PSI-BLAST14 and MODELLER15 for its functionality. Sequence-structure matches were established using fold assignment methods, including sequence-sequence16, profile-sequence14,17, and profile-profile17,18 alignments, against a PDB19 template database from 9/30/2011. To increase the probability of finding a template structure, alignments were accepted up to an E-value threshold of 1.0. By default, ten models were calculated for each alignment. A representative model for each alignment was then chosen by ranking the models with the atomic distance-dependent statistical potential DOPE20.
Finally, the fold of each comparative model was evaluated using several quality scores, following an approach originally developed for globular protein structures12. A comparative protein structure model was considered ‘reliable’ if it scored better than the threshold for at least one of the quality criteria (zDOPE20 < 0, MPQS12 > 1.0, GA34121 > 0.7, TSVMod22 native overlap > 0.4, target-template sequence identity > 30%) or if it was generated from a significant sequence-structure alignment (an alignment is considered significant if the corresponding E-value is lower than 0.0001)12; reliable models are generally predicted to have the correct fold. The reliable models were deposited in the ModBase database of modeled structures (salilab.org/modbase/search?dataset=tmh_sequences). They provide a useful resource for investigating a membrane protein whose structure has not yet been determined by experiment.
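The reliability rule can be written as a simple predicate. The sketch below only restates the thresholds given above; the class and field names are hypothetical and do not correspond to the ModPipe or ModBase data model.

```python
from dataclasses import dataclass


@dataclass
class ModelScores:
    """Quality scores for one comparative model (hypothetical container)."""
    zdope: float                  # normalized DOPE score; lower is better
    mpqs: float                   # ModPipe quality score
    ga341: float                  # GA341 score
    tsvmod_native_overlap: float  # TSVMod predicted native overlap
    seq_identity: float           # target-template sequence identity, in percent
    evalue: float                 # E-value of the sequence-structure alignment


def is_reliable(m: ModelScores) -> bool:
    """Reliable if any quality criterion passes or the alignment is significant."""
    passes_quality = (
        m.zdope < 0
        or m.mpqs > 1.0
        or m.ga341 > 0.7
        or m.tsvmod_native_overlap > 0.4
        or m.seq_identity > 30.0
    )
    significant_alignment = m.evalue < 1e-4
    return passes_quality or significant_alignment


if __name__ == "__main__":
    weak = ModelScores(0.5, 0.8, 0.5, 0.2, 20.0, 0.5)
    print(is_reliable(weak))
```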
The ‘modeling coverage’ was calculated for each modeled domain sequence, taking into account all reliable models produced by ModPipe, as follows. Each residue in the sequence was first annotated with the sequence identity between the modeled sequence and its closest template. The modeling coverage was then computed as the percentage of residues whose annotated sequence identity exceeds a given threshold. We also classified the modeling coverage of a sequence as ‘high’ (> 90% of domain residues are modeled), ‘medium’ (60-90%), or ‘low’ (< 60%).
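A minimal sketch of the coverage calculation follows. It makes two assumptions that are not stated in the text: the ‘closest template’ for a residue is taken to be the one with the highest sequence identity among the reliable models covering that residue, and unmodeled residues are assigned an identity of 0. All function names are hypothetical.

```python
from typing import List, Tuple

# One reliable model: (start, end, percent identity), 1-based inclusive range.
Model = Tuple[int, int, float]


def per_residue_identity(length: int, models: List[Model]) -> List[float]:
    """Annotate each residue with the best identity among models covering it."""
    best = [0.0] * length  # assumption: unmodeled residues get 0
    for start, end, identity in models:
        for i in range(start - 1, end):
            best[i] = max(best[i], identity)
    return best


def modeling_coverage(length: int, models: List[Model],
                      min_identity: float = 0.0) -> float:
    """Percentage of residues whose annotated identity exceeds min_identity."""
    best = per_residue_identity(length, models)
    covered = sum(1 for value in best if value > min_identity)
    return 100.0 * covered / length


def classify_coverage(coverage: float) -> str:
    """Map a coverage percentage to the high / medium / low classes."""
    if coverage > 90:
        return "high"
    if coverage >= 60:
        return "medium"
    return "low"


if __name__ == "__main__":
    toy_models = [(1, 50, 35.0), (41, 80, 25.0)]
    coverage = modeling_coverage(100, toy_models)
    print(coverage, classify_coverage(coverage))
```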

Stage 3: Clustering of domain sequences

To assist target selection, we aimed to group the transmembrane domain sequences into the smallest possible number of clusters such that the structure of any member of a cluster allows comparative modeling of the remaining cluster members at specified thresholds on target-template sequence identity and fraction of aligned residues (coverage)23. To obtain a pragmatic solution, we clustered the human domain sequences using BLASTCLUST24 at several sequence identity (20%, 25%, 30%) and coverage (50%, 70%, 90%) thresholds. The BLASTCLUST algorithm includes in a cluster all domains that match at least one cluster domain at the specified sequence identity and coverage thresholds. A domain can be a member of only one cluster. We also tested another algorithm, USEARCH10, but the corresponding clusters were judged to be less suitable for the purposes of structural genomics (data not shown).
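The membership rule described above (a domain joins a cluster if it matches at least one member) is single-linkage clustering. The sketch below illustrates that rule with a union-find structure over precomputed pairwise hits; it is not the BLASTCLUST implementation, and the input format, function name, and inclusive threshold comparisons are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

# One pairwise hit: (domain_a, domain_b, percent identity, percent coverage).
Hit = Tuple[str, str, float, float]


def single_linkage_clusters(
    ids: List[str],
    hits: Iterable[Hit],
    min_identity: float,
    min_coverage: float,
) -> List[List[str]]:
    """Group domains linked, directly or transitively, by qualifying hits."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first; each domain appears in exactly one cluster.
    return sorted(clusters.values(), key=len, reverse=True)


if __name__ == "__main__":
    toy_hits = [("A", "B", 35.0, 80.0), ("B", "C", 28.0, 75.0), ("C", "D", 40.0, 40.0)]
    print(single_linkage_clusters(["A", "B", "C", "D"], toy_hits, 25.0, 70.0))
```

Note that under single linkage two members of a cluster need not match each other directly: in the toy data, A and C share a cluster only through B.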

Stage 4: Assessing target selection strategies

We tested two target selection strategies, ‘guided’ and ‘random.’ These strategies were assessed with respect to how many target structures need to be determined by experiment to achieve varying degrees of structural characterization of the human membrane proteome by comparative modeling. Guided target selection prioritizes experimental structure determination of domains by the size of the clusters that they make accessible to modeling. In contrast, random target selection picks targets randomly from a list of domain sequences without known structures. In addition, the utility of the existing target lists of the nine PSI membrane protein centers (TargetDB25; October 2011) for modeling the human membrane proteome was computed; these targets were matched to the human α-helical transmembrane domains by pairwise sequence comparison using BLASTP14.
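The difference between the two strategies can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the procedure used in the study: clusters are represented only by their sizes, one solved structure is assumed to make its whole cluster modelable, and random picks are assumed to ignore the clustering, so a pick may fall in a cluster that is already covered. The function name and parameters are hypothetical.

```python
import random
from typing import List


def targets_needed(cluster_sizes: List[int], fraction: float,
                   strategy: str = "guided", seed: int = 0) -> int:
    """Count structures needed to make `fraction` of all domains modelable."""
    goal = fraction * sum(cluster_sizes)
    covered = 0
    picked = 0

    if strategy == "guided":
        # Solve one structure per cluster, largest clusters first.
        for size in sorted(cluster_sizes, reverse=True):
            if covered >= goal:
                break
            covered += size
            picked += 1
        return picked

    # Random: draw individual domains without replacement (assumed model).
    rng = random.Random(seed)
    domains = [c for c, size in enumerate(cluster_sizes) for _ in range(size)]
    rng.shuffle(domains)
    solved = set()
    for cluster in domains:
        if covered >= goal:
            break
        picked += 1
        if cluster not in solved:
            # First structure in this cluster covers all of its members.
            solved.add(cluster)
            covered += cluster_sizes[cluster]
    return picked


if __name__ == "__main__":
    sizes = [5, 3, 1, 1]  # toy cluster sizes
    print(targets_needed(sizes, 1.0, "guided"), targets_needed(sizes, 1.0, "random"))
```

Under these assumptions the guided count is a lower bound on the random count for the same coverage goal, because the guided strategy never spends a structure on an already covered cluster.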

Stage 5: Expanding the target set by adding homologous sequences

The cluster analysis was focused on the human sequences only, in the expectation that each human cluster will have a large number of non-human sequences related to at least one human member at more than 30% sequence identity. To identify these non-human homologs, a multiple sequence profile of each human domain sequence was built with the BUILD_PROFILE module of MODELLER18, as implemented in ModPipe (http://salilab.org/modpipe). The profile was first built using the default settings (E-value: 0.1, five iterations) against a version of the UniProt database (release-2011_08) made non-redundant at the 90% sequence identity level. The profiles were then expanded by one additional iteration against all 18.5 million sequences in the full UniProt database. Once the profile was calculated, only homologs with a sequence identity of at least 30% were retained. The set of such homologs for all human domains in a cluster corresponds to the potential structural genomics targets that allow modeling of any cluster member.
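The final filtering and pooling step can be sketched as follows. The sketch assumes the profile search has already been run and its hits are available as (homolog identifier, percent identity) pairs per human domain; the function name, input format, and identifiers are hypothetical, and no MODELLER call is shown.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Profile hits per human domain: list of (homolog id, percent identity).
ProfileHits = Dict[str, List[Tuple[str, float]]]


def cluster_targets(cluster_members: Iterable[str], hits: ProfileHits,
                    min_identity: float = 30.0) -> Set[str]:
    """Pool homologs of all cluster members that meet the identity cutoff."""
    targets: Set[str] = set()
    for member in cluster_members:
        for homolog_id, identity in hits.get(member, []):
            if identity >= min_identity:
                targets.add(homolog_id)
    return targets


if __name__ == "__main__":
    toy_hits = {
        "human_1": [("mouse_1", 85.0), ("yeast_1", 22.0)],
        "human_2": [("mouse_1", 40.0), ("ecoli_1", 31.0)],
    }
    print(sorted(cluster_targets(["human_1", "human_2"], toy_hits)))
```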