Technology

Data sources

  • The RefSeq assembly of the human genome (Build 37.2).
  • ModBase models for all predicted transmembrane sequences, calculated using our automated comparative modeling pipeline ModPipe on the QB3 compute cluster.
  • The UniProtKB protein sequence database for homolog sequences.
  • Clustered human transmembrane sequences (using BlastClust and usearch) in external database tables.

Web-site

Motivation

This web-site has several purposes:

  • To inform collaborators of the project's progress (2011-2013).
  • To present the results in a form that allows users to find specific subsets of the data.
  • To rapidly create a mechanism for all collaborators to inspect intermediate results and comment on them.
  • To present the final results to the community.

Now that the results have been published, most of the content of the analysis remains accessible through this site for informational purposes. A web-site for the follow-up project (human membrane production study) has been created at the Protein Structure Initiative site.

Technology
  • The web-site was built using the content management system Drupal 7.
  • The web-site is expected to be low-traffic, so only minimal site optimization has been implemented. In particular, the loading of images is not optimized, even though the ModBase sketch images are critical for a quick visual inspection of the results.
  • Theming: A subtheme of the Adaptive theme was created to resemble the Salilab design (http://salilab.org, http://salilab.org/modbase). The emphasis was on functionality, not on creating a polished site. In particular, the search form sections would need further clean-up for a polished site.
  • Individual records for 3838 human transmembrane domain sequences were assembled from the ModBase database tables (meta-data and model data) and from details of the analysis, then imported as nodes into Drupal tables using a custom node-creation script (drush). In addition, all PSI targets from the membrane centers were collected from Target Track (http://sbkb.org/tt/) and imported as nodes, to enable mapping between existing targets and the human tmh domains. As a result, both the sequence records and the PSI targets are accessible through the Drupal search mechanism.
  • Content (number of nodes):
    • 52,649 cluster members
    • 1,667 cluster nodes
    • 3,841 sequences
    • 14,762 PSI center targets
    • 17,558 matches for PSI center targets and human tm domain sequences
  • Details for approximately 10 million homolog entries from 440,000 unique sequences, stored in external tables, are made accessible to this site via a custom views-extension module that retrieves data directly from the ModBase database tables. This required creating a MySQL view that includes the sequence identifiers and the meta-data, because the Drupal Views API does not allow joins between tables in different databases.
  • ModBase sketches were created externally, and are displayed on the Cluster Details page, to allow a visual inspection of the sequence clusters.
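
The MySQL view mentioned above works around the cross-database limitation by exposing the joined data as a single table-like object. The following is a minimal sketch of how such a view definition could be generated, not the statement actually used for this site. All database, table, and column names (drupal_site, homolog_details, modbase.sequences, modbase_meta.sequence_meta, seq_id, organism) are hypothetical placeholders. The sketch relies only on the fact that MySQL lets a view reference tables in other databases on the same server when the table names are qualified with the database name.

```python
def quote(identifier):
    """Backtick-quote each part of a dotted MySQL identifier (db.table, alias.column)."""
    return ".".join(
        "`" + part.replace("`", "``") + "`" for part in identifier.split(".")
    )


def build_cross_db_view_sql(view_db, view_name, seq_table, meta_table,
                            join_column, columns):
    """Return a CREATE VIEW statement joining two tables from different databases.

    seq_table and meta_table are database-qualified names ("db.table").
    columns are alias-qualified column names, using the aliases "s" and "m".
    All names passed in are illustrative assumptions.
    """
    select_list = ", ".join(quote(col) for col in columns)
    return (
        f"CREATE OR REPLACE VIEW {quote(view_db + '.' + view_name)} AS "
        f"SELECT {select_list} "
        f"FROM {quote(seq_table)} AS s "
        f"JOIN {quote(meta_table)} AS m "
        f"ON s.{quote(join_column)} = m.{quote(join_column)}"
    )


if __name__ == "__main__":
    # Hypothetical names: the real schema is not described on this page.
    print(build_cross_db_view_sql(
        view_db="drupal_site",
        view_name="homolog_details",
        seq_table="modbase.sequences",
        meta_table="modbase_meta.sequence_meta",
        join_column="seq_id",
        columns=["s.seq_id", "m.organism"],
    ))
```

Once a view like this exists, a Views extension module can register it as a single base table, so the join happens inside MySQL and the Drupal side never has to combine tables from two databases itself.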