Membrane proteins in the human genome

During the recent PSI meeting, the number of membrane proteins in the human genome, as determined by this study, was questioned. Here is a short summary with some notes and references.

The fraction of integral helix-bundle membrane proteins in the human genome is believed to be about 26%, according to:

  • Krogh A, Larsson B, von Heijne G, Sonnhammer E. 2001. Predicting transmembrane protein topology
    with a hidden Markov model. Application to complete genomes. J. Mol. Biol. 305:567–80
  • Fagerberg L, Jonasson K, von Heijne G, Uhlén M, Berglund L. 2010. Prediction of the human membrane
    proteome. Proteomics 10:1141–49

In this study, we found the following:

    33,610 protein sequences in the human genome (RefSeq build 37.2).
    29,375 unique protein sequences (no alternative splicing).
    7,299 protein sequences are predicted to have at least one transmembrane helix (25%).
    3,838 protein sequences are predicted to have at least two transmembrane helices (13%).
    3,418 unique sequences after removing all residues before the first and after the last predicted TMH residue.
    2,926 unique sequences after clustering at 98% sequence identity.
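
The de-duplication and trimming steps listed above can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function names, the `(sequence, helices)` record layout, and the 1-based inclusive helix coordinates are all assumptions, and the final clustering at 98% sequence identity is not shown.

```python
def unique_sequences(seqs):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(seqs))


def trim_to_tm_core(seq, helices):
    """Keep only the residues from the first to the last predicted TMH residue.

    `helices` is a hypothetical list of (start, end) pairs, 1-based and
    inclusive, as a topology predictor might report them.
    """
    first = min(start for start, _ in helices)
    last = max(end for _, end in helices)
    return seq[first - 1:last]


def tm_cores(records, min_helices=2):
    """Trim every sequence with enough predicted helices, then de-duplicate."""
    cores = [
        trim_to_tm_core(seq, helices)
        for seq, helices in records
        if len(helices) >= min_helices
    ]
    return unique_sequences(cores)


# Toy records (invented for illustration): two proteins that differ only
# outside the transmembrane core, and one with no predicted helix.
records = [
    ("MAAAALLLLLKKKKLLLLLGG", [(6, 10), (15, 19)]),
    ("MKLLLLLKKKKLLLLLAA", [(3, 7), (12, 16)]),
    ("MAAAA", []),
]
cores = tm_cores(records)
```

In this toy input the first two proteins collapse to a single core once their flanking residues are removed, which is why the count drops between the "at least two helices" step and the trimmed, unique set.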

    Fagerberg et al. (2010) report:

    21,416 unique protein-coding genes
    47,237 protein sequences
    41,285 non-redundant protein sequences (unique at the amino acid level)
    5,508 protein-coding genes with predicted transmembrane regions (consensus, similar to the number using TMHMM).
    The largest fraction of these have one predicted transmembrane helix.
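
The percentages quoted in both lists can be rechecked from the raw counts. The short sketch below only repeats that arithmetic; the `percent` helper and the dictionary keys are illustrative names, and the counts are copied from the lists above.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Counts from this study (per unique protein sequence).
this_study = {"unique": 29375, "tm_ge1": 7299, "tm_ge2": 3838}

# Counts from Fagerberg et al. (per protein-coding gene).
fagerberg = {"genes": 21416, "tm_genes": 5508}

one_or_more = percent(this_study["tm_ge1"], this_study["unique"])
two_or_more = percent(this_study["tm_ge2"], this_study["unique"])
gene_level = percent(fagerberg["tm_genes"], fagerberg["genes"])

print(f"this study, >=1 TMH: {one_or_more}%")
print(f"this study, >=2 TMH: {two_or_more}%")
print(f"Fagerberg et al., genes with TM regions: {gene_level}%")
```

The two studies use different denominators (unique protein sequences versus protein-coding genes), so the resulting percentages are close but not directly interchangeable.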

    The first image shows the distribution of transmembrane helices found using different methods by Fagerberg et al. The second image shows the distribution found in this study. The main difference is that Fagerberg et al. count protein-coding genes, whereas we count protein sequences. We use RefSeq human genome sequences from 2011, which only include coding genes. Fagerberg et al. started from the protein-coding genes in the Ensembl dataset (2009), analysed the corresponding protein sequences, and assigned the predicted transmembrane regions back to the coding sequences.