The image shows the final clustering results using BLASTCLUST at three different sequence identity cut-offs and a coverage cut-off of 70%. The clusters of several prominent superfamilies have been marked.
Clustering the human TMD proteome lets us identify the sequences whose structures, once determined, would have the largest impact on the structural characterization of other sequences.
A cluster appropriate for target selection needs to satisfy two requirements to ensure each sequence could be used as a template to model any other sequence in the cluster:
(i) member sequences must be sufficiently similar to each other and
(ii) member sequences must be aligned without long gaps.
At the same time, the number of clusters for a given set of sequences should be as small as possible, to minimize the number of targets and thus maximize the efficiency of the corresponding structural genomics effort.
We chose to explore clusters using the sequence clustering programs USEARCH and BLASTCLUST. Both programs have been tested for clustering sequences at sequence identity levels above 40%. To our knowledge, no clustering programs are available that are considered reliable at lower sequence identity levels. The clusters are therefore less than ideal and should always be inspected manually before making a target selection decision.
We compared the USEARCH and BLASTCLUST programs by their ability to generate sequence clusters useful for target selection. Both algorithms were assessed at several sequence identity thresholds; BLASTCLUST was also assessed at several coverage cutoffs (USEARCH does not have coverage cutoffs).
For all coverage cutoffs, BLASTCLUST produces a larger number of clusters than USEARCH. For example, at 30% sequence identity, BLASTCLUST produces between 1,470 (50% coverage cutoff) and 2,130 (90% coverage cutoff) clusters, whereas USEARCH produces 1,322 clusters.
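Cluster counts like these can be tallied directly from a clustering output file. The following is a minimal sketch, assuming the file lists one cluster per line with whitespace-separated sequence identifiers (the layout BLASTCLUST writes); the function names and the sample identifiers are hypothetical and not part of the original analysis.

```python
from io import StringIO
from typing import Dict, Iterable, List


def parse_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Parse one-cluster-per-line text into lists of sequence identifiers."""
    clusters = []
    for line in lines:
        ids = line.split()
        if ids:  # ignore blank lines
            clusters.append(ids)
    return clusters


def summarize(clusters: List[List[str]]) -> Dict[str, int]:
    """Return basic counts that are useful when comparing clustering runs."""
    sizes = [len(cluster) for cluster in clusters]
    return {
        "n_clusters": len(clusters),
        "n_sequences": sum(sizes),
        "n_singletons": sum(1 for size in sizes if size == 1),
        "largest": max(sizes, default=0),
    }


# Hypothetical example input: three clusters holding six sequences in total.
example = StringIO("P1 P2 P3\nP4 P5\n\nP6\n")
summary = summarize(parse_clusters(example))
print(summary)
```

Running the same summary on the output of each method and cutoff gives the numbers needed for a side-by-side comparison.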
To further determine the more suitable clustering method, we used BLASTP34 to calculate the pairwise sequence identity of every sequence to the longest sequence in its cluster, at each of the three clustering sequence identity thresholds. At 30% and 25% clustering sequence identity, the percentage of sequences that fall below the cutoff sequence identity within the individual clusters is significantly higher for BLASTCLUST (33% for 30% sequence identity, 12% for 25% sequence identity, 1% for 20% sequence identity, 70% coverage cutoff) than for USEARCH (17% for 30% sequence identity, 6% for 25% sequence identity, 1% for 20% sequence identity). For all sequence identity and coverage cutoffs, clustering using USEARCH resulted in cluster alignments with more similar sequences and in a smaller number of clusters than clustering by BLASTCLUST.
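The per-cluster check described above can be sketched as follows. This is an illustration only: it assumes the pairwise identities to the longest sequence have already been computed (for example with BLASTP) and are supplied as a dictionary, and it counts a member with no recorded identity as below the cutoff. All names and values are hypothetical.

```python
from typing import Dict, List, Tuple


def longest_member(cluster: List[str], lengths: Dict[str, int]) -> str:
    """Return the identifier of the longest sequence in the cluster."""
    return max(cluster, key=lambda seq_id: lengths[seq_id])


def fraction_below_cutoff(
    cluster: List[str],
    lengths: Dict[str, int],
    identity: Dict[Tuple[str, str], float],
    cutoff: float,
) -> float:
    """Fraction of members whose identity to the longest member is below cutoff.

    `identity` maps (longest_id, member_id) to percent identity.
    Assumption: a missing pair (no reported hit) counts as below the cutoff.
    """
    reference = longest_member(cluster, lengths)
    others = [seq_id for seq_id in cluster if seq_id != reference]
    if not others:
        return 0.0  # a singleton cluster has nothing to compare
    below = sum(
        1 for seq_id in others
        if identity.get((reference, seq_id), 0.0) < cutoff
    )
    return below / len(others)


# Hypothetical cluster: A is the longest sequence; C falls below a 25% cutoff.
lengths = {"A": 300, "B": 280, "C": 250}
identity = {("A", "B"): 42.0, ("A", "C"): 22.0}
result = fraction_below_cutoff(["A", "B", "C"], lengths, identity, cutoff=25.0)
print(result)
```

Averaging or pooling this fraction over all clusters is one way to obtain a single percentage per method; which of the two the original analysis used is not stated here.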
Another criterion for assessing the quality of the sequence clusters considers the length distribution of the individual sequences in the cluster. The most useful clusters for structural genomics contain sequences of similar length, with a minimum number and size of alignment gaps. On average, the gap to alignment length ratio is 23% for USEARCH (25% sequence identity) and 8% for BLASTCLUST (25% sequence identity, 70% coverage), making BLASTCLUST the preferred method for this criterion.
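To make the gap criterion concrete, the sketch below computes a gap ratio for one cluster alignment. It assumes one plausible definition, namely the number of gap characters divided by the total number of alignment positions across all aligned sequences; the exact definition used for the figures above may differ, and the function name and sample alignment are hypothetical.

```python
from typing import List


def gap_ratio(aligned: List[str], gap_char: str = "-") -> float:
    """Gap characters divided by total alignment positions (rows x columns).

    `aligned` holds the rows of one multiple sequence alignment, so all
    rows must have the same length.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned sequences must have equal length")
    if length == 0:
        return 0.0
    gaps = sum(row.count(gap_char) for row in aligned)
    return gaps / (length * len(aligned))


# Hypothetical alignment: 3 gap characters in 3 rows of 10 columns.
alignment = ["MKTLLV--AG", "MKT-LVIIAG", "MKTLLVIIAG"]
ratio = gap_ratio(alignment)
print(f"{ratio:.0%}")
```

A cluster whose members differ strongly in length will show a high ratio under this definition, because the shorter sequences are padded with gaps.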
Each method performs best on a different criterion; however, the gap ratio is significantly better with BLASTCLUST. We therefore continued our analysis with BLASTCLUST clusters at 25% sequence identity and 70% length coverage. For maximum flexibility, we make all clusters available on this web site (menu item “clustering -> cluster view”).
Clustering quality assessment
Gap ratio for BLASTCLUST clusters
| Sequence identity threshold | 20% | 20% | 20% | 25% | 25% | 25% | 30% | 30% | 30% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Length coverage threshold | 50% | 70% | 90% | 50% | 70% | 90% | 50% | 70% | 90% |
| Number of clusters | 1,059 | 1,186 | 1,394 | 1,074 | 1,201 | 1,406 | 1,106 | 1,236 | 1,435 |
Sequence identities of sequences within clusters
| Method | Sequence identity | Average number of sequences below threshold | Average fraction of sequences below cutoff sequence identity |