ModBase

ModBase Documentation

ModBase is a comprehensive database of comparative protein structure models.
ModBase is organized into datasets, which are either available to the public, to the academic community, or to specific users.
To submit new comments/suggestions/bugs, please send email to .

Help Topics
ModBase Search

User Login

There are several kinds of possible user logins that can be managed at the dataset management page:

  • Public: no login necessary
  • Academic: the academic agreement must be accepted
  • ModWeb created User Logins (Username/Password received from ModWeb)
  • Private Datasets for specific User Groups (e.g. NYSGXRC)

For a ModWeb dataset, the user receives a username and password by email after the ModWeb calculation has finished and the data have been stored in ModBase.

During the login process, ModBase sets a cookie for each user login. These cookies are checked on every ModBase query. Once a cookie is set, there is no need for subsequent logins for that user, unless the cookie is deleted.

Datasets

ModBase is organized into Datasets. The comprehensive Dataset (combination of all SP/TR datasets) comprises comparative models of all sequences in the SwissProt and TrEMBL databases that have detectable similarity to an experimental protein structure.
Additionally, there are datasets for specific projects (SNP-HUMAN, Plasmodium falciparum, etc.), datasets for user groups (e.g. nysgxrc), and datasets created by ModWeb calculations.

While some datasets are publicly accessible, the majority of models are only accessible to the academic community.

ModWeb calculations are generally only available to the user who initiated the ModWeb calculation, unless it is specifically requested to make a dataset available to other user groups.

Dataset Selection

To select a specific subset of the datasets available to you, go to the search page and click "Select specific dataset(s)". This opens two boxes: the left box contains all datasets available to you, and the right box contains the subset included in the search. To include a dataset in the search, either click it on the left and then click the arrow, or double-click it on the left. To exclude a dataset from the search, double-click it on the right. If you don't use this menu (Collapse dataset selection), all available datasets are selected by default.

Search Types

Different search options are available in ModBase.

  • Model (Default)
    Search for ModBase models.
  • Sequence Similarity
    A BLAST search is performed against the comprehensive datasets.
  • Ligand Binding
    Under Construction
  • Interacting Proteins
    Under Construction
Display Types

Depending on the chosen search mode, different display options are available in ModBase.

  • Model Detail (graphical)
    Detailed view for models of one sequence. An image of the model with the highest sequence identity to its template is prominently displayed; all other models are available through their thumbprint images.
  • Model Detail (schema)
    Schematic view for models of one sequence. A thumbprint of each model together with a schema of the sequence coverage are displayed.
  • Sequence Overview
    Summary of sequence data for up to 500 models including the sequence coverage sketch. Click on the sketch to go to the Model Details (Graphical) page for this sequence.
  • SNP View
    SNP related properties
  • Interacting Proteins
    Predicted Interacting Proteins
  • Ligand Binding
    A number of ligand binding properties from LigBase, either ligand-based or sequence-based, are displayed.
Search Properties

Search properties are dependent on the chosen search mode. The most common are described here:

  • Database Accession Number
    Accession numbers from several protein sequence databases are understood (and will be translated to an internal sequence id), including UniProt (SwissProt, TrEMBL, e.g. Q12321); NCBI GenPept/RefSeq (e.g. NP_015395.1, plus some older GI numbers such as 584473776); PlasmoDB transcripts (e.g. PF3D7_0321300.1); and some older PIR.
  • Template PDB code
    Search for models calculated using the input template structure.
  • Template or Homolog PDB code
    Search for models calculated using the input template structure or a homolog based on 95% identity.
  • Annotation Keyword
    Search for modeled sequences whose annotation contains the search keyword. This search is currently very computationally expensive, and therefore slow.
  • Internal ID
    Search using our internal sequence/alignment/model id, based on the MD5 digest of the searched entity.
Advanced Search Properties

Search options for model properties are available:
GA341 Score
Evalue
Sequence Identity
Protein Size
Model Size

These properties can be combined using Boolean operators.

Original Sequence

ModPipe, the software that calculates the ModBase models, modifies the original protein sequences to replace non-standard amino acid residues. The Original Sequence is the unmodified version and is only displayed if a modification has taken place.

Input Sequence

Enter a sequence either in FASTA format or as bare amino acid residues. Non-standard residues are ignored.
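
The input handling described above can be sketched in a few lines of Python. This is only an illustration, not the ModBase implementation: the function name parse_input_sequence is hypothetical, and the assumption that "standard" means the 20 common one-letter amino acid codes is ours.

```python
# Assumption: "standard residues" are the 20 common one-letter codes.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_input_sequence(text):
    """Return the residues from FASTA or bare-sequence input.

    FASTA header lines (starting with '>') are skipped, and any
    character that is not a standard residue code is ignored.
    """
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # blank line or FASTA header
        residues.extend(ch for ch in line.upper() if ch in STANDARD_RESIDUES)
    return "".join(residues)
```

Both a FASTA record and a bare string of residues yield the same cleaned sequence, so the caller does not need to know which form the user pasted.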

FASTA Format

Sequence Similarity Search

Search by sequence similarity using BLAST.

Action Pulldown Menus

The Action Pulldown Menus give the users the option to:

  • Jump to different ModBase pages:
    Ligand Binding
    SNP View
    Model Overview or Details pages

  • Retrieve Files for the current or the checked Models:
    Alignment files
    Coordinate Files

  • Visualize Surface Cavities:
    Calculates cavity properties using ConCavity and visualizes them using Chimera. (Minimum cavity score: light blue. Medium cavity score: green. Maximum cavity score: maroon. Additionally, cysteine residues on the surface are colored yellow.)
     
  • Chimera(Structure/Alignment):
    Chimera is an interactive molecular graphics program developed by the Computer Graphics Lab at UCSF. It can be used as a helper application for several types of files linked to web pages. If Chimera is installed on the user's computer, the ModBase-Chimera action will launch Chimera, retrieve template and alignment information about the current model, and display them in Chimera.
    Chimera Installation Instructions
Linking to ModBase Models from external databases

To link from outside pages to specific ModBase sequences/models, please use the following link construction:

Link to specific Database ID (SwissProt,TrEMBL,GI):
https://salilab.org/modbase/searchbyid?databaseID=P04848

Link to the SNP dataset:
All SNPs for the sequence:
https://salilab.org/modbase/searchbyid?databaseID=P43632&displaymode=snpview

Specific dbSnp ID:
https://salilab.org/modbase/searchbyid?databaseID=P43632&dbsnpID=rs1143507

Link to specific dataset:
https://salilab.org/modbase/searchbyDATASET?datasetNAME=nysgxrc_1jd1_06-04f

To incorporate ModBase model images into your own web-pages, use the following link: https://modbase.compbio.ucsf.edu/modbase-cgi/image/modbase.jpg?database_id=P04848&type=modbase
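
For pages that generate many such links, a small helper can assemble the searchbyid URLs shown above. This is a minimal sketch using only the parameter names that appear in the examples (databaseID, displaymode, dbsnpID); the function name sequence_link is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://salilab.org/modbase"


def sequence_link(database_id, displaymode=None, dbsnp_id=None):
    """Build a searchbyid link for one database ID.

    Optional arguments add the SNP-related parameters used in the
    examples above; parameters that are not given are left out.
    """
    params = {"databaseID": database_id}
    if displaymode:
        params["displaymode"] = displaymode
    if dbsnp_id:
        params["dbsnpID"] = dbsnp_id
    return f"{BASE}/searchbyid?{urlencode(params)}"


print(sequence_link("P04848"))
print(sequence_link("P43632", displaymode="snpview"))
```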

If you link to ModBase, please drop me a note (), just for information. Thanks!

Model Details page

This is the default ModBase page for models of one sequence. Sequence information, model/sequence coverage and model information are displayed. Two versions of this page are available: graphical and schematic.

Model Details (Graphical)

The graphical Model Details page gives access to all available information for the models of one sequence: sequence information, model information, and database crosslinks. If there are several models for the current sequence, the model with the highest sequence identity to its template is displayed, and thumbprints of the other models are shown as well. Mouse over a thumbprint to get information about that model.

Model Coverage Sketch

(Image: example model coverage sketch)

This image indicates the modeled area of the query sequence.

  • Top line: a summary sketch representing the whole sequence. The "best" model for each segment is shown.
  • Bottom line: represents the current model
  • Color coding:
    • Grey area: Segment not modeled
    • Red area: Segment modeled only by an unreliable model
    • Yellow area: Segment modeled by a reliable model with <30% sequence identity to a template structure
    • Blue area: Segment modeled by a reliable model with >=30% sequence identity to a template structure
    • Green area: Segment modeled using a template structure with >=98% sequence identity

If one or more specific datasets have been selected, and the query sequence has also been modeled in other available datasets, darker colors indicate the selected dataset(s) and lighter colors the others.
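
The color coding above can be read as a small decision rule. The sketch below is one possible reading, not the code used by ModBase: the function name coverage_color is hypothetical, and the precedence (green is checked before reliability, because the green rule above does not mention reliability) is an assumption.

```python
def coverage_color(modeled, reliable=False, seq_identity=0.0):
    """Map one sequence segment to a coverage-sketch color.

    seq_identity is the percent sequence identity to the template.
    Assumption: the >=98% rule takes precedence over the reliability
    check, since the green rule does not mention reliability.
    """
    if not modeled:
        return "grey"
    if seq_identity >= 98:
        return "green"
    if not reliable:
        return "red"
    if seq_identity >= 30:
        return "blue"
    return "yellow"
```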

Model Image

The model images are created on the fly using MolScript and Raster3D.

Model Details (Schema)

The schematic model details page shows a thumbprint of each model and a sketch of the model/sequence coverage.

Filtered Models

ModBase contains many models for some sequences. This might be because the sequence is very long, or because it has been processed in several datasets. To avoid confusion, a "filtered" subset of the models is displayed on the model details pages, containing only the models from the last calculation. Please click on "all models" if you want to see all available models; a better one might be hiding. On the schematic page, mousing over the thumbprint gives more information about that model.
We are in the process of pre-computing the thumbnail images. Some have erroneously been computed with a black background. This is random, and has no meaning!


Sequence/Model Overview

This is the default page when the search results in more than one sequence. There are two modes: Sequence Overview and Model Overview.

Sequence Overview

The Sequence Overview page summarizes the search results for many sequences. The sequence coverage Sketch indicates the modeled area(s) for the given sequence. Click on the sketch to go to the "Model Details" page.

Model Coverage Sketch

Similar to the Model Details sketch, but smaller and less detailed.

Model Overview

The Model Overview page displays the search results as one line for each model. Details about modeling quality and templates are displayed. Click on the thumbprint on the left to get to the "Model Details" page for that model.

Model/Fold Reliability

  • Green ball: acceptable model and fold assignment (both criteria apply)
  • Red/green ball: acceptable model (GA341 >=0.7, insignificant PSI-BLAST E-value)
  • Green/red ball: acceptable fold assignment (PSI-BLAST E-value from filtered search <=0.0001, insignificant GA341)

Please click on the Ball to go to the Model Details (schema) page for this model/sequence.

Model Thumbnail

Please click on the Thumbnail to go to the Model Details (graphical) page for this model.

Sequence Information
Primary Sequence Database ID

The Sequence Database ID displayed on the "Sequence Information" section is chosen according to the following order of availability:
1. SwissProt ID
2. TrEMBL Accession Number
3. GenPept GI ID
4. The ID provided in the dataset at time of processing
The Annotation line comes with the chosen ID.
The Organism information has been obtained from the NCBI Taxonomy database, and links to it.
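
The order of availability above amounts to a "first available wins" lookup. A minimal sketch follows; the dictionary keys (swissprot, trembl, genpept_gi, dataset) and the function name are hypothetical labels chosen for this example, not ModBase field names.

```python
# Hypothetical source labels, listed in the priority order given above.
ID_PRIORITY = ("swissprot", "trembl", "genpept_gi", "dataset")


def primary_database_id(ids):
    """Return (source, id) for the highest-priority ID that is present.

    ids maps a source label to an ID string; missing or empty entries
    are skipped. Returns None if no ID is available at all.
    """
    for source in ID_PRIORITY:
        value = ids.get(source)
        if value:
            return source, value
    return None
```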

Original Sequence Database ID

The original Sequence Database ID from the FASTA file that was used for the modeling calculation. The prefix "CU" indicates a custom database ID. This ID can be useful to identify sequences from ModWeb calculations.

Sequence Length

Length of the input sequence.

Annotation

The Sequence Annotation is either retrieved from UniProt or GenPept, or, if those are not available, from the input fasta file.

Organism Information (Taxonomy)

ModBase currently contains a number of datasets of complete genomes. These are included in the pull-down menu.

Since ModBase also contains a large number of models based on subsets of sequences in UniProt, virtually all organisms are represented. However, when searching for organisms other than those in the pull-down menu, please restrict your search further, either by choosing a property from the "Search by Properties" pull-down menu and entering an appropriate search term, or by choosing a specific dataset (excluding the comprehensive datasets).

The Organism information has been obtained from the NCBI Taxonomy database, and links to it.

Model Information

ModBase often has several models for one sequence. This can happen if the sequence was processed in different datasets (at different times, for a different project, or with different hit-selection criteria), or if there are models for different domains because the template PDB structures don't cover the full target sequence. Also, ModPipe usually calculates a large number of models for each sequence section and uses a number of quality criteria to select the "best". Since the quality criteria are not always in agreement, ModPipe frequently chooses up to four models for each region.

Alignment Significance

Significance of the alignment between the target and the template as reported by NCBI's PSI-BLAST program (Nucl. Acids Res. 25, 3389-3402, 1997). This is the significance reported during the template (PDB) database search. It is not the significance of the modeling alignment produced by Modeller.

E-Value

ModPipe1.0: Significance of the alignment between the target and the template as reported by NCBI's PSI-BLAST program (Nucl. Acids Res. 25, 3389-3402, 1997). This is the significance reported during the template (PDB) database search. It is not the significance of the modeling alignment produced by Modeller. ModPipe2 and later: Similar significance value, but calculated by Modeller using the Build-Profile routine.

GA341 (Model Score)

Score for the reliability of a model, derived from statistical potentials (F. Melo, R. Sanchez, A. Sali, 2002). A model is predicted to be good when the model score is higher than a pre-specified cutoff (0.7). A reliable model has a probability of the correct fold that is larger than 95%. A fold is correct when at least 30% of its Cα atoms superpose within 3.5 Å of their correct positions.
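
The "correct fold" criterion above can be illustrated with a short calculation. The sketch assumes the model has already been superposed onto the correct structure and that the two Cα coordinate lists are in the same residue order; the function names are hypothetical, and treating a distance of exactly 3.5 Å as "within" the cutoff is our choice.

```python
import math


def fraction_within_cutoff(model_ca, native_ca, cutoff=3.5):
    """Fraction of C-alpha atoms within `cutoff` angstroms of their
    correct positions. Both lists hold (x, y, z) tuples for equivalent
    residues and are assumed to be already superposed."""
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(
        1 for m, n in zip(model_ca, native_ca) if math.dist(m, n) <= cutoff
    )
    return close / len(model_ca)


def is_correct_fold(model_ca, native_ca):
    """Apply the 30% threshold stated above."""
    return fraction_within_cutoff(model_ca, native_ca) >= 0.30
```

GA341 itself is a predicted score derived from statistical potentials; the calculation here only shows what "correct fold" means when the true structure is known.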

Protein Size

Length of the modeled sequences (original sequence, not the modeled part).

Model Size

Length of the model.

Reliable Model / Fold Assignment

A model is considered to be reliable (have a reliable fold assignment) if it is evaluated within the following thresholds by at least one of these model evaluation criteria:

  • MPQS (ModPipe Quality Score): >=1.1
  • TSVMod NO35 (estimated native overlap at 3.5 Å): >=40%
  • GA341: >=0.7
  • E-value: <0.0001
  • zDOPE: <0
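
As a sketch, the "at least one criterion" rule above can be written as a single predicate. The thresholds are copied from the list; the function and argument names are hypothetical, and a score passed as None is treated as "not available" (an assumption of this example).

```python
def is_reliable(mpqs=None, tsvmod_no35=None, ga341=None, evalue=None, zdope=None):
    """Return True if at least one available score meets its threshold.

    tsvmod_no35 is given in percent. Scores left as None are treated
    as unavailable and cannot make the model reliable.
    """
    checks = [
        mpqs is not None and mpqs >= 1.1,
        tsvmod_no35 is not None and tsvmod_no35 >= 40.0,
        ga341 is not None and ga341 >= 0.7,
        evalue is not None and evalue < 0.0001,
        zdope is not None and zdope < 0,
    ]
    return any(checks)
```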
Sequence Identity

Percentage of identical residues in the alignment between the target and the template as reported during the template search.

ModPipe Protein Quality Score

The ModPipe Protein Quality Score is a composite score comprising sequence identity to the template, coverage, and the three individual scores evalue, z-Dope and GA341. We consider a MPQS of >1.1 as reliable.

TSVMod
  • MTALL: Training set is based on the template structure (most reliable method)
  • MSALL: Training set is based on similar secondary structure, all features are used
  • MSRED: For lack of sufficiently similar secondary structures in the training set, a reduced set of z-scores features is used to compute the results.
  • NA: No prediction, due to secondary structure that is entirely missing from the training set.
  • NO35: Predicted Native Overlap (3.5 Å)
  • RMSD: Predicted RMSD

Reference: D. Eramian, N. Eswar, M.Y. Shen, A. Sali. How well can the accuracy of comparative protein structure models be predicted? Protein Sci 17, 1881-1893, 2008.

z-Dope

Using probability theory, we derive an atomic distance-dependent statistical potential from a sample of native structures that does not depend on any adjustable parameters (Discrete Optimized Protein Energy, or DOPE). DOPE is based on an improved reference state that corresponds to noninteracting atoms in a homogeneous sphere with the radius dependent on a sample native structure; it thus accounts for the finite and spherical shape of the native structures.

Target Region

The region of the protein sequence that is modeled.

Protein Length

The length of the original protein sequence.

Template PDB Code

The PDB code of the template the model is based on.

Template Region

The region of the PDB structure that was used as a template.

ModPipe Version

ModPipe is the underlying software pipeline that is used to build all ModBase models. ModPipe1.0 relies on PSI-Blast and Impala for template selection and fold assignment. ModPipe2 is additionally using the Build-Profile method in Modeller. ModPipe2 models are also scored with the MPQS and z-Dope.

ModPipe Date

Modeling date for the current model. Often, a sequence has been modeled in several independent datasets. If your model is older, please check the additional models (by clicking on the thumbprints below the prominent model) for a newer date. If you suspect that a better template has been released after the newest model date, you should submit the sequence to ModWeb to get a current model.

Coordinate (3D) File

Coordinate file for the model in the PDB format. The "fifth column" (which normally contains B-factors or order parameters) contains the Modeller error profile.

Modeller Error Profile

PAP Alignment Format

The 'PAP' format is nicer to look at than the 'PIR' format, but not as computer friendly. The alignment.write() command description in the Modeller manual contains more detailed information about this format.

PIR Alignment Format

The 'PIR' format resembles that of the PIR sequence database. It is described in the Modeller manual and is used for comparative modeling with Modeller because it can contain all the information useful for modeling.

LigBase (as integrated in ModBase)

LigBase is a structural database of ligand binding sites. The LigBase database tables contain all amino acid residues that are within 5 Angstroms from a small molecule ligand in a given PDB file. The current version of LigBase contains ligand binding information from 16629 PDB files.
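
The 5 Å criterion can be illustrated as follows. This is a sketch under stated assumptions, not the LigBase code: the input layout (pairs of a residue identifier and an atom coordinate) and the function name are hypothetical, and a residue is counted as soon as any one of its atoms lies within the cutoff of any ligand atom.

```python
import math


def binding_site_residues(protein_atoms, ligand_atoms, cutoff=5.0):
    """Return residue identifiers with at least one atom within
    `cutoff` angstroms of any ligand atom, in order of first appearance.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs.
    ligand_atoms:  iterable of (x, y, z) tuples.
    """
    ligand_atoms = list(ligand_atoms)
    site = []
    for residue_id, xyz in protein_atoms:
        if residue_id in site:
            continue  # residue already known to be in the site
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            site.append(residue_id)
    return site
```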

ModBase queries the LigBase tables to derive ligand binding information for protein structure models. For additional LigBase functionality, please see https://salilab.org/ligbase.

Ligand binding sites

1. Putative ligand binding sites derived from the template

Putative ligand binding sites of ModBase models are derived from the template on the fly by parsing the ModBase alignment file. The native ligand binding residues of the template (TEMPL) and the derived ligand binding sites of the model (MODEL) are shown. Additionally, the putative binding residues of the model are colored in its image. If a gap is found in the alignment at a ligand binding residue, a "-" is displayed instead of the model residue.

If atoms from different hetero groups are less than 1.8 Å apart, they are considered "one ligand", and all residue codes are shown for that ligand.
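
The merging rule can be sketched as a clustering step: hetero groups are linked whenever any pair of their atoms is closer than 1.8 Å, and linked groups are reported together. The union-find approach, the input layout, and the name group_ligands are choices made for this example; the source does not say how ModBase implements the rule, nor whether merging is transitive as assumed here.

```python
import math
from itertools import combinations


def group_ligands(hetero_groups, cutoff=1.8):
    """Merge hetero groups whose atoms come closer than `cutoff` angstroms.

    hetero_groups maps a residue code to a list of (x, y, z) atoms.
    Returns a sorted list of sorted residue-code lists, one per ligand.
    """
    names = list(hetero_groups)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if any(
            math.dist(p, q) < cutoff
            for p in hetero_groups[a]
            for q in hetero_groups[b]
        ):
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())
```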

2. Putative ligand binding sites inherited from related PDB files

Many PDB files used as templates don't include a ligand, but closely related PDB files might have one or more ligands bound. Using DBALI, a database of structural alignments, and the information in the PDB-LigBase tables, the INHERITOR table of LigBase contains the amino acid residues of PDB files with ligands and the equivalent residues of related PDB files. Additionally, the sequence identity and coverage between those PDB files and between the respective binding sites are stored. Once the INHERITOR information is retrieved, putative ligand binding sites from related PDB files are determined in the same way as the binding sites derived from the templates. The inherited residues are shown (INHER) together with the equivalent model residues (MODEL).

LigBase model coverage sketch

The LigBase coverage sketch displays the same information as the general model coverage sketch. Additionally, it displays the positions of the amino acid residues that are putative ligand binding residues.

ABC Transporter Datasets

The ABC Transporter model dataset includes domains in all 48 human ABC transporters. It also includes models of disease-associated and polymorphic non-synonymous SNPs found in the nucleotide binding domains.

LS-Mut

LS-Mut is a database of somatic mutations found in advanced pancreatic tumor or glioblastoma multiforme from the Karchin lab at Johns Hopkins University. Please refer to the following publications:

  • Parsons, D., Jones, S., Zhang, X., Lin, J., Leary, R., Angenendt, P., Mankoo, P., Carter, H., Su, I., Gallia, G., et al. (2008) An integrated genomic analysis of glioblastoma multiforme. Science 9/4/2008.
  • Jones, S., Zhang, Z., Parsons, D., Lin, J., Leary, R., Angenendt, P., Mankoo, P., Carter, H., Kamiyama, H., Jimeno, A., et al. (2008) Core signaling pathways in human pancreatic cancer revealed by tumor genome analysis. Science 5/4/2008.
ModWeb

When a ModWeb job is finished (with the option of depositing the models into ModBase), the user gets a results page (and optionally an email) including the dataset name, and the username/password for ModBase to access that particular dataset.
Use the following URL pattern to manually access the ModWeb dataset:

https://salilab.org/modbase/search?dataset=ModWebEnterYourDatasetIDHere&username=YourUserName&password=YourPassWord

Alternatively, please do the following:

  1. Go to https://salilab.org/modbase
  2. Click "Current Datasets"
  3. Choose "Individual User" and click "Select Mode"
  4. Enter Username/Password from the ModWeb email or results page and click "submit query"
  5. This should have brought you back to the Search Page. Mouse-over the "Current Logins" field, to see whether the login was successful (cookie has been set).
  6. Use the ModBase interface to find your dataset(s):
    1. To select your ModWeb dataset, click on "Select specific datasets".
    2. Identify your dataset on the left side and double-click it to move it to the right side. Mark the datasets on the right side that you don't want to search right now (all the others) and click "remove select datasets". Then click "Search"; you should get a page with all models from this ModWeb calculation.

If you still have problems, please email .

ModBase Command line Retrieval


Note: This functionality is not completely mature.
The file formats might change in the next few months.
If you have suggestions for different formats, please email .


Retrieval of ModBase models and alignments is possible using wget or curl:
Example:
wget 'https://salilab.org/modbase/retrieve/modbase/?databaseID=P21812'

Non-academic users:
Please add '&dataset=public' to the retrieve URL, for example:
wget 'https://salilab.org/modbase/retrieve/modbase/?databaseID=P21812&dataset=public'

Options:
One or more of the following Parameters:

databaseID=P21812
dataset=SP/TR2004
modelID=064dd62ea7483831c9cfc1f72499630e
seqID=16f39f89e31970dbec2f39c36959c116MVEGDIVS
dataset=ModWeb0-xxx (your ModWeb Job Identifier)
type=model (default, alternatively: alignment)
You can search for databaseID, OR seqID, OR modelID, OR alignID, OR modweb dataset.
You can combine each with dataset and/or type parameter.
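
For scripted downloads, the retrieve URL can be assembled from these parameters. The sketch below encodes the rule stated above (one search key, optionally combined with dataset and type, or a dataset on its own). The function name retrieve_url is hypothetical, and leaving "/" unescaped in values such as SP/TR2004 simply mirrors the examples; whether the server also accepts the escaped form is not known to us.

```python
from urllib.parse import urlencode

RETRIEVE_URL = "https://salilab.org/modbase/retrieve/modbase/"
SEARCH_KEYS = ("databaseID", "seqID", "modelID", "alignID")


def retrieve_url(search_key=None, search_value=None, dataset=None, result_type=None):
    """Build a retrieve URL from one search key and optional filters.

    search_key must be one of SEARCH_KEYS, or None when only a dataset
    (for example a ModWeb dataset) is requested. result_type maps to
    the 'type' parameter ('model' or 'alignment').
    """
    if search_key is not None and search_key not in SEARCH_KEYS:
        raise ValueError(f"unknown search key: {search_key}")
    if search_key is None and dataset is None:
        raise ValueError("give a search key or a dataset")
    params = {}
    if search_key is not None:
        params[search_key] = search_value
    if dataset:
        params["dataset"] = dataset
    if result_type:
        params["type"] = result_type
    # safe="/" keeps dataset names such as SP/TR2004 as written above.
    return RETRIEVE_URL + "?" + urlencode(params, safe="/")


print(retrieve_url("databaseID", "P21812", dataset="public"))
```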

(The internal ModBase seqID can be calculated directly from the sequence, as the MD5 hash of the primary sequence without gaps or line breaks, followed by the first 4 and last 4 residues, so that of Q12321 is 16f39f89e31970dbec2f39c36959c116MVEGDIVS)
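
A minimal sketch of this calculation, following the description above: the function name modbase_seq_id is hypothetical, and converting the sequence to upper case before hashing is an assumption not stated in the text.

```python
import hashlib


def modbase_seq_id(sequence):
    """Compute a ModBase-style seqID: the MD5 hex digest of the
    sequence (whitespace and gap characters removed), followed by its
    first four and last four residues."""
    # Assumption: residues are upper-cased before hashing.
    residues = "".join(sequence.split()).replace("-", "").upper()
    if len(residues) < 4:
        raise ValueError("sequence must contain at least four residues")
    digest = hashlib.md5(residues.encode("ascii")).hexdigest()
    return digest + residues[:4] + residues[-4:]
```

The result is always 40 characters long: 32 hexadecimal digits followed by 8 residue letters.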

By default models are downloaded in PDB format in a single file, with models separated by XML tags. To get files in mmCIF format instead, add "&format=mmcif" to the URL. Each model in an mmCIF file starts with a line "data_model_<modelID>". ModBase mmCIF files contain data fields from the PDBx and ModelCIF dictionaries; see the wwPDB website for more information.
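
Because each model in a downloaded mmCIF file starts with a "data_model_<modelID>" line, a multi-model download can be split with plain string handling. The sketch below relies only on that marker line; the function name is hypothetical, and no attempt is made to parse the mmCIF content itself.

```python
def split_mmcif_models(text):
    """Split a multi-model mmCIF download into {modelID: mmCIF text}.

    Lines before the first 'data_model_' marker are discarded; each
    marker line is kept as the first line of its model block.
    """
    models = {}
    current = None
    for line in text.splitlines():
        if line.startswith("data_model_"):
            current = line.strip()[len("data_model_"):]
            models[current] = []
        if current is not None:
            models[current].append(line)
    return {model_id: "\n".join(lines) + "\n" for model_id, lines in models.items()}
```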

About the cookies.txt file:
In order to set your access permissions when using wget, you have to transfer the ModBase cookies
to a cookies file. Example:
cat cookies.txt
modbase.compbio.ucsf.edu FALSE / FALSE 2116975226 modbase-academic modbase_user&anonymous&modbase_passwd&anonymous
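
The example line has the seven fields of the Netscape cookies.txt format that wget reads (domain, subdomain flag, path, secure flag, expiry time, name, value). The sketch below splits such a line into named fields. Reading the cookie value as "&"-separated key/value pairs is our interpretation of the example, not something stated by ModBase, and both function names are hypothetical.

```python
def parse_cookie_line(line):
    """Split one cookies.txt line into its seven named fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 whitespace-separated fields")
    domain, include_subdomains, path, secure, expires, name, value = fields
    return {
        "domain": domain,
        "include_subdomains": include_subdomains == "TRUE",
        "path": path,
        "secure": secure == "TRUE",
        "expires": int(expires),  # seconds since the Unix epoch
        "name": name,
        "value": value,
    }


def parse_modbase_cookie_value(value):
    """Assumption: the value alternates keys and values separated by '&'."""
    parts = value.split("&")
    return dict(zip(parts[0::2], parts[1::2]))
```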

You can use this link also to retrieve coordinate files directly through a web-browser:
https://salilab.org/modbase/retrieve/modbase?databaseID=P21812

Predicted Protein Complexes

MODBASE contains structure-based predictions of 3,213 binary and 1,234 higher order protein complexes in Saccharomyces cerevisiae involving 750 and 195 proteins, respectively. To generate candidate complexes, comparative models of individual proteins were built and combined together using complexes of known structure as templates. These candidate complexes were then assessed using a statistical potential, derived from binary domain interfaces in PIBASE (https://salilab.org/pibase). A benchmark indicates a false positive rate of 3% and a true positive rate of 97%. Moreover, the predicted complexes are also filtered using functional annotation (http://yeastgenome.org) and sub-cellular localization (http://yeastgfp.ucsf.edu) data.

SNP Stability
P: Destabilizing change in buried position.
C: Charge change in buried position.
H: Proline in alpha-helix (helix breaker).
E: Desolubilizing change in exposed position.
Modeling Leverage Calculations

The modeling leverage of a PDB structure is calculated by:

  1. Extracting the sequence from the PDB structure
  2. Creating a profile for this sequence using UniProt (Modeller)
  3. Collecting all sequences from the profile
  4. Modeling the sequences using ModPipe with the input PDB structure as template
  5. Calculating statistics

All resulting models are deposited in ModBase.