All human sequences were obtained from RefSeq (BUILD.37.2, ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE).
This assembly contains 33,610 protein sequences, of which 3,838 are predicted by TMHMM 2.0 to contain at least two transmembrane helices.
These 3,838 sequences were truncated to the span from the first to the last residue predicted to lie in a transmembrane helix (TMH), resulting in 3,418 unique protein sequences. We call this set of sequences the TMH domain proteome.
This is a "cheap" way of defining the transmembrane domain, and for some sequences a significant part of the truncated sequence cannot be considered a transmembrane region. However, visual inspection of the ModBase sketch, which indicates the positions of the predicted transmembrane helices, reveals that most of the sequences that might need to be excluded contain only two predicted transmembrane helices (see figure above for the distribution of the number of predicted TM helices).
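The truncation and de-duplication steps can be sketched as follows. This is a minimal illustration, not the pipeline that produced the set: it assumes each prediction is available as a topology string in the style of the TMHMM short output (for example `o5-9i14-18o`, with 1-based inclusive helix ranges), and the function names are hypothetical.

```python
import re


def parse_topology(topology):
    """Return 1-based inclusive (start, end) helix ranges.

    Assumes a topology string such as 'o5-9i14-18o'.
    """
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]


def truncate_to_tmh_domain(sequence, helices, min_helices=2):
    """Cut the sequence from the first helix residue to the last helix residue.

    Returns None when fewer than min_helices helices are predicted.
    """
    if len(helices) < min_helices:
        return None
    start = min(s for s, _ in helices)
    end = max(e for _, e in helices)
    # Convert the 1-based inclusive range to a Python slice.
    return sequence[start - 1:end]


def tmh_domain_proteome(records):
    """Map each unique truncated sequence to the ids that produced it.

    records: iterable of (seq_id, sequence, topology) tuples.
    """
    unique = {}
    for seq_id, sequence, topology in records:
        domain = truncate_to_tmh_domain(sequence, parse_topology(topology))
        if domain is not None:
            unique.setdefault(domain, []).append(seq_id)
    return unique
```

Because the truncated sequence is used as the dictionary key, distinct proteins that share an identical TMH domain collapse into a single entry, which is one way the count can drop from the number of input sequences to the number of unique ones.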
For some of the subsequent analyses, we reduced the number of sequences further by clustering them at 98% sequence identity and choosing the longest sequence in each cluster as its representative. The resulting set is referred to below as the TMH98 sequences (2,926 sequences).
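The idea of clustering with a longest-sequence representative can be illustrated with a small greedy sketch. The clustering tool and the exact identity definition are not stated here, so both are assumptions in this example: identity is approximated with Python's `difflib` as matched residues divided by the length of the shorter sequence, and sequences are processed longest first so that each cluster's first member is also its longest. The function names are hypothetical, and the quadratic cost makes this suitable for illustration only.

```python
from difflib import SequenceMatcher


def approx_identity(a, b):
    """Approximate identity: matched residues / length of the shorter sequence.

    This is a stand-in for an alignment-based identity, not a true alignment.
    """
    if not a or not b:
        return 0.0
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / min(len(a), len(b))


def cluster_longest_first(sequences, threshold=0.98):
    """Greedy clustering; returns {representative_id: [member_ids]}.

    sequences: dict mapping sequence id to sequence string.
    """
    # Longest first, so the first member of every cluster is its longest.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters = {}
    for seq_id, seq in ordered:
        for rep_id in clusters:
            if approx_identity(sequences[rep_id], seq) >= threshold:
                clusters[rep_id].append(seq_id)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[seq_id] = [seq_id]
    return clusters
```

The keys of the returned dictionary play the role of the representative set; with a threshold of 0.98 they correspond to what the text calls the TMH98 sequences, under the assumptions stated above.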