Stage 1: Finding human α-helical transmembrane domains
Of the 29,375 unique human protein sequences from the RefSeq-37 database of the human genome9, 7,299 were predicted by the TMHMM 2.01 program to contain at least one transmembrane α-helix; 3,838 were predicted to contain two or more such helices. For each full-length sequence, only the transmembrane domain, spanning the first to the last predicted transmembrane α-helix residue, was used for all subsequent analyses. Only a single representative of each group of sequences sharing more than 98% sequence identity was retained, using the program USEARCH10, yielding the final non-redundant dataset of 2,925 domains (Suppl. Fig. 1a).
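The domain extraction step can be illustrated with a minimal sketch. It assumes the helix predictions are already available as 1-based, inclusive (start, end) residue ranges; the function name, the input format, and the toy sequence are hypothetical and are not taken from TMHMM or from the original pipeline.

```python
from typing import List, Optional, Tuple

# Assumed input format: one (start, end) pair per predicted helix,
# 1-based and inclusive. This is not the native TMHMM output format.
Helix = Tuple[int, int]


def transmembrane_domain(sequence: str, helices: List[Helix]) -> Optional[str]:
    """Return the span from the first to the last predicted helix residue."""
    if not helices:
        return None  # no predicted transmembrane helix, so no domain
    first = min(start for start, _ in helices)
    last = max(end for _, end in helices)
    # Convert the 1-based inclusive range to a Python slice.
    return sequence[first - 1:last]


# Toy sequence and helix positions, for illustration only.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
domain = transmembrane_domain(seq, [(5, 10), (20, 26)])
```

Loops between helices are kept in the domain, which matches the description of using everything between the first and last helix residue.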
Stage 2: Assessing current modeling coverage
To provide context for assessing various target selection lists and strategies, we assessed how well each of the human α-helical transmembrane domain sequences can currently be modeled. Automated comparative modeling using ModPipe12 resulted in “reliable” models (Suppl. Note 1) for 10-100% of residues in 2,683 of the 2,925 non-redundant human α-helical transmembrane domains. The modeling coverage of a domain sequence was calculated as the percentage of its residues that were modeled based on a template structure whose sequence identity to the modeled sequence exceeds a given sequence identity threshold (Suppl. Fig. 1b). The modeling coverage was described as high, medium, and low when >90%, 60-90%, and <60% of the domain residues, respectively, were modeled.
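The coverage definition and its three classes can be written down as a short sketch. The per-residue input format (template sequence identity in percent, or None for an unmodeled residue) and both function names are assumptions made for illustration; they do not describe the ModPipe output.

```python
from typing import List, Optional


def modeling_coverage(template_identity: List[Optional[float]],
                      identity_threshold: float) -> float:
    """Percentage of residues modeled from a template above the threshold.

    template_identity[i] is the sequence identity (%) of the template used
    to model residue i, or None if residue i is not modeled (assumed format).
    """
    if not template_identity:
        raise ValueError("domain has no residues")
    modeled = sum(1 for ident in template_identity
                  if ident is not None and ident > identity_threshold)
    return 100.0 * modeled / len(template_identity)


def coverage_class(coverage: float) -> str:
    """Map a coverage percentage to the high / medium / low classes."""
    if coverage > 90.0:
        return "high"
    if coverage >= 60.0:
        return "medium"
    return "low"


# Toy domain: 7 residues modeled at 35% identity, 3 residues unmodeled.
per_residue = [35.0] * 7 + [None] * 3
coverage = modeling_coverage(per_residue, identity_threshold=30.0)
label = coverage_class(coverage)
```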
Stage 3: Clustering of domain sequences
A cluster suitable for structural genomics target selection needs to satisfy two requirements to ensure that a structure determined for one of its members allows the modeling of most if not all other cluster members: (i) member sequences must be sufficiently similar to each other and (ii) member sequences must be aligned without long gaps. Simultaneously, the number of clusters of a given set of sequences, such as the human membrane proteome, needs to be minimized, to maximize the efficiency of the structural genomics effort.
To find the optimal clustering parameters, the domain sequences were clustered based on threshold levels of the proportion of residues that can be aligned (>50%, >70%, and >90%; “coverage threshold”) and the sequence identity of the aligned segments (>20%, >25%, and >30%; “sequence identity threshold”), using the program BLASTCLUST24 (Suppl. Table 1). For quality control, the sequences in each cluster were aligned using the multiple sequence alignment program MUSCLE26. We quantified the gaps in an alignment by computing its “gap ratio”, defined as the proportion of single residue gap positions in the alignment. A small gap ratio is associated with more accurate alignments, which are preferred for comparative modeling. At the same coverage threshold, the gap ratio is approximately constant for the three sequence identity thresholds. As expected, the gap ratio is highest and lowest at the coverage thresholds of 50% and 90%, respectively. For further analysis, the intermediate coverage threshold of 70% was chosen.
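One possible reading of the gap ratio is sketched below. It assumes that "single residue gap positions" means individual gap characters, so the ratio is the number of "-" cells divided by the total number of cells in the alignment; this interpretation, the function name, and the toy alignment are assumptions and may differ from the calculation actually used.

```python
from typing import List


def gap_ratio(alignment: List[str]) -> float:
    """Fraction of alignment cells that are gap characters ('-').

    Assumed interpretation: every gap character counts as one
    single-residue gap position.
    """
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    total_cells = len(alignment) * lengths.pop()
    if total_cells == 0:
        raise ValueError("alignment is empty")
    gaps = sum(row.count("-") for row in alignment)
    return gaps / total_cells


# Toy alignment of three sequences, 2 gap cells out of 15.
toy_alignment = ["ACD-F", "ACDEF", "A-DEF"]
ratio = gap_ratio(toy_alignment)
```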
Next, we inspected the number of members in a cluster (cluster size; Suppl. Fig. 2a). The distribution of the number of clusters as a function of cluster size (e.g., the red bars in Suppl. Fig. 2a) is similar at the three sequence identity thresholds. Approximately 100 of the clusters contain 4 or more members. One large cluster with approximately 600 members, including the GPCR families, occurs at all three sequence identity thresholds. Finally, we assessed the current average modeling coverage of the sequences in each cluster (Suppl. Fig. 2b). Large clusters that currently have low average modeling coverage are attractive sources of targets for structural genomics, because their representative structures would result in a large increase in reliably modeled sequences.
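The idea of prioritizing large clusters with low average modeling coverage can be sketched as a simple ranking. The input format, the function name, and the 60% cutoff (borrowed from the "low" coverage class) are illustrative assumptions; the cluster values below are toy numbers, not data from the study.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_clusters(clusters: Dict[str, List[float]],
                  max_mean_coverage: float = 60.0) -> List[Tuple[str, int, float]]:
    """Rank clusters with low mean coverage, largest first.

    clusters maps a cluster id to the modeling coverages (%) of its members.
    Returns (cluster id, cluster size, mean coverage) tuples.
    """
    candidates = [
        (cid, len(coverages), mean(coverages))
        for cid, coverages in clusters.items()
        if coverages and mean(coverages) < max_mean_coverage
    ]
    # Larger clusters first; among equal sizes, lower coverage first.
    return sorted(candidates, key=lambda c: (-c[1], c[2]))


toy_clusters = {
    "A": [10.0, 20.0, 30.0, 40.0],  # large, poorly covered
    "B": [95.0, 98.0],              # already well covered, filtered out
    "C": [50.0, 55.0, 20.0],
    "D": [5.0],
}
ranked = rank_clusters(toy_clusters)
ranked_ids = [cid for cid, _, _ in ranked]
```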
Stage 4: Assessing target selection strategies
We assessed ‘guided’ and ‘random’ target selection strategies with respect to the number of target structures that need to be determined by experiment to achieve varying degrees of structural characterization of the human membrane proteome by comparative modeling. Guided target selection prioritizes experimental structure determination of targets by the size of the clusters that they make accessible to modeling. In contrast, random target selection picks targets randomly from a list of domain sequences without known structures.
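The contrast between the two strategies can be illustrated with a small simulation. It makes a strong simplifying assumption that one solved structure makes every member of its cluster modelable, and it uses toy cluster sizes; the function names are hypothetical and the sketch is not the procedure used to produce the reported numbers.

```python
import random
from typing import List


def guided_coverage(cluster_sizes: List[int], n_targets: int) -> int:
    """Sequences covered when the n_targets largest clusters get one structure each."""
    return sum(sorted(cluster_sizes, reverse=True)[:n_targets])


def random_coverage(cluster_sizes: List[int], n_targets: int,
                    rng: random.Random) -> int:
    """Sequences covered when targets are drawn uniformly from all sequences."""
    # One label per sequence, naming the cluster it belongs to.
    labels = [cid for cid, size in enumerate(cluster_sizes) for _ in range(size)]
    picked = rng.sample(labels, min(n_targets, len(labels)))
    # Several picks may land in the same cluster; count each cluster once.
    return sum(cluster_sizes[cid] for cid in set(picked))


sizes = [600, 40, 10, 5, 3, 1, 1, 1]  # toy cluster sizes, not study data
rng = random.Random(0)                # fixed seed for reproducibility
guided = guided_coverage(sizes, 2)
random_mean = sum(random_coverage(sizes, 2, rng) for _ in range(1000)) / 1000
```

Under this simplification, random selection can never exceed guided selection, because any set of picked targets covers at most as many sequences as the same number of largest clusters.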
To compare the guided and random target selection schemes, we computed the number of domain sequences that can be modeled based on the future successful determination of 100 new target structures, with a modeling coverage threshold of 70% at different sequence identity thresholds27. The number 100 was chosen because it is feasible for the nine PSI centers to determine 100 membrane protein structures during two five-year cycles of PSI:Biology.
In the first scenario, we considered all unique human α-helical transmembrane domains, whether or not they can be currently modeled (Suppl. Fig. 3a; Suppl. Table 2). As expected, guided target selection yields a significantly higher number of covered domain sequences than random target selection at the same sequence identity threshold. For guided selection, the number of domains that could be modeled based on 100 target structures is 1,445, 1,488, and 1,514 at the 30%, 25%, and 20% sequence identity thresholds, respectively (at the 70% modeling coverage threshold). In contrast, for random selection, 100 structures would allow comparative modeling of only approximately 900, 950, and 1,000 sequences at the 30%, 25%, and 20% sequence identity thresholds, respectively. This result highlights the superior efficiency of the guided target selection strategy over random choice27. Thus, the number of sequences that can be structurally characterized increases by approximately 50% when using the guided target selection strategy, as proposed here, instead of the random selection strategy.
In the second scenario, only sequences without current models were considered, by first removing from the analysis all clusters with sufficiently close homologs of known structure (Suppl. Fig. 3b). As expected, guided target selection is also superior to random choice in this scenario.
The current target lists of the nine PSI membrane protein centers collectively contain 14,591 unique targets (Suppl. Table 3a). Of these, we identified 464 (3.2%) sequences that match human α-helical transmembrane domain sequences at high sequence identity (90%) and modeling coverage thresholds (90%, Suppl. Table 2). These sequences were then mapped onto the domain clusters, resulting in 57 clusters comprising a total of 1,190 sequences, each of which has at least one cluster member that is already included in the target lists of the nine PSI membrane protein centers. Ten of these clusters include sequences that have already been structurally characterized at the 25% sequence identity level, leaving a total of 47 clusters with 1,075 structurally uncharacterized sequences. Structure determination of at least one member from each of these 47 clusters would thus increase the structural coverage of the human α-helical transmembrane proteome by approximately 1,000 sequences.
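The mapping of target-list matches onto clusters amounts to a few set operations, sketched here on toy data. The function name, the identifiers, and the representation of clusters as sets of sequence ids are assumptions for illustration only.

```python
from typing import Dict, Set, Tuple


def map_targets_to_clusters(
    clusters: Dict[str, Set[str]],
    target_matches: Set[str],
    characterized: Set[str],
) -> Tuple[Set[str], Set[str], int]:
    """Find clusters hit by current targets and those still uncharacterized.

    Returns (clusters with a member on a target list,
             the subset with no structurally characterized member,
             number of sequences in that subset).
    """
    hit = {cid for cid, members in clusters.items() if members & target_matches}
    open_clusters = {cid for cid in hit if not clusters[cid] & characterized}
    n_open_sequences = sum(len(clusters[cid]) for cid in open_clusters)
    return hit, open_clusters, n_open_sequences


# Toy clusters and sequence ids, not data from the study.
toy = {
    "c1": {"s1", "s2", "s3"},
    "c2": {"s4", "s5"},
    "c3": {"s6"},
    "c4": {"s7", "s8"},
}
hit, open_clusters, n_open = map_targets_to_clusters(
    toy, target_matches={"s1", "s4", "s7"}, characterized={"s5"}
)
```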
Stage 5: Expanding the target set by adding homologous sequences
To augment the potential target pool with non-human homologs of the human sequences, we identified approximately 450,000 non-human sequences from approximately 50,000 different organisms that are homologous to the human α-helical transmembrane domains over at least 70% of the residues at the 30% sequence identity threshold. The corresponding expanded alignments can be valuable for target selection.
On average, each domain sequence has approximately 2,000 homologs from other organisms in UniProt. The number of homologs per sequence ranges from approximately 50,000 for the transmembrane domain of cytochrome b to only one for 40 transmembrane domains (Suppl. Fig. 4). For 17 human domain sequences, scanning UniProt did not identify any non-human homologs given the two thresholds used. While the non-human homologs are, as expected, predominantly from eukaryotic organisms (368,401 sequences), we were also able to identify 82,386 sequences from bacteria and 1,985 sequences from archaea (Suppl. Fig. 4).
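As a quick consistency check, the three per-kingdom counts quoted above can be summed and compared with the stated total of approximately 450,000 non-human homologs. The dictionary below simply restates the reported counts; the variable names are illustrative.

```python
# Counts of non-human homologs as reported in the text.
homolog_counts = {
    "eukaryota": 368_401,
    "bacteria": 82_386,
    "archaea": 1_985,
}

total_homologs = sum(homolog_counts.values())

# Share of each group as a percentage of all non-human homologs.
shares = {
    group: 100.0 * count / total_homologs
    for group, count in homolog_counts.items()
}

for group, share in shares.items():
    print(f"{group}: {homolog_counts[group]:,} ({share:.1f}%)")
print(f"total: {total_homologs:,}")
```

The sum, 452,772, agrees with the rounded figure of approximately 450,000 sequences.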
Most of the larger clusters with a significant number of non-eukaryotic homologs have already been structurally characterized (e.g., ATPases and aquaporins). Clusters with bacterial and archaeal homologs that have low modeling coverage include, for example, SLC transporters and lipid phosphate phosphatases. A full list of the sequences with non-eukaryotic homologs, cluster sizes, and annotations can be found at salilab.org/membrane, menu item ‘Non-human Homologs’.