Dave Barkan
Dan Hostetter
Ben Webb
Andrej Sali
Charles Craik
Version main.da9b573
Download the Granzyme B and Caspase predicted cleavage sites generated in the study by Barkan, Hostetter, et al., Bioinformatics, 2010
Please note: This server is undergoing minor modifications to the documentation and layout, in order to match the publication which describes it. Please feel free to contact us with any questions or concerns until we finalize the server. Thank you for your patience!
Use this option to select whether to train a new predictive model on a set of known positives and known negatives that you provide ("Training") or to use an existing model to search a set of proteins to look for high scoring sites of interest ("Application").
If an e-mail address is entered here, it will be used to notify you when the job has completed.
This option allows the user to choose which criterion will be used to select the best-quality comparative model (when one is available) for evaluating the secondary structure and solvent accessibility of a peptide in a protein. Often more than one model is generated for the protein, but only one, chosen according to this criterion, is used to evaluate these structural features. The following options are available:
Model Score: A general score for the reliability of a model, as described in the ModBase documentation.
Model coverage: Fraction of the target protein sequence for which a comparative model was generated.
Predicted Native Overlap: Fraction of alpha carbons in the model that are predicted to be within 3.5 angstroms of the native state.
The algorithm for predicting the native overlap is described in this study. It was shown to outperform the Model Score metric in evaluating comparative model quality, but the two are generally correlated, and users who are familiar with ModBase may wish to use Model Score. Model coverage is useful when a user prefers a model that includes a larger fraction of the target; it avoids, for example, selecting a model that receives a very high score but covers only a small portion of the target.
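As a rough illustration of how a single criterion picks one model when several exist for a protein, the sketch below selects the model with the highest value of the chosen criterion. The record keys, model names, and numbers are hypothetical and are not ModBase field names or real results; the sketch also assumes that a higher value is better for all three criteria.

```python
# Hypothetical model records for one protein; keys and values are invented
# for illustration and do not come from ModBase.
models = [
    {"id": "model_a", "model_score": 0.95, "coverage": 0.20, "native_overlap": 0.80},
    {"id": "model_b", "model_score": 0.85, "coverage": 0.90, "native_overlap": 0.75},
]


def best_model(models, criterion):
    """Return the model with the highest value for the chosen criterion.

    Returns None when no model is available for the protein.
    """
    if not models:
        return None
    return max(models, key=lambda m: m[criterion])


print(best_model(models, "model_score")["id"])  # model_a
print(best_model(models, "coverage")["id"])     # model_b
```

In this made-up case the two criteria disagree: the highest-scoring model covers only a fifth of the target, which is the situation in which Model coverage may be the better choice.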
Error handling allows the user to specify what to do if a UniProt accession is not found in ModBase. There are two options:
Quit: This will inform the user of an accession error immediately and allow them to correct it.
Ignore: This will continue processing the job, and a message will be displayed in the results file for the missing accession.
If searching through a small number of accessions for high scoring peptides, choosing the Quit option is suggested, as the user will not have to wait for the job to complete before discovering there is a problem. However, if a large number of accessions are uploaded, it is probably more beneficial to choose the Ignore option and just receive a list of all accessions that were missed upon job completion.
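The difference between the two options can be sketched as follows. This is only an illustration of the behaviour described above, not the server's code; the `known` set is a hypothetical stand-in for a ModBase lookup, and the second accession is deliberately made up.

```python
class MissingAccessionError(Exception):
    """Raised in 'quit' mode when an accession cannot be found."""


def check_accessions(accessions, known, on_missing="quit"):
    """Split accessions into found and missing.

    known: hypothetical set of accessions available in ModBase.
    on_missing: "quit" stops at the first missing accession;
                "ignore" collects missing accessions and continues.
    """
    if on_missing not in ("quit", "ignore"):
        raise ValueError("on_missing must be 'quit' or 'ignore'")
    found, missing = [], []
    for accession in accessions:
        if accession in known:
            found.append(accession)
        elif on_missing == "quit":
            raise MissingAccessionError(accession)
        else:
            missing.append(accession)
    return found, missing


known = {"O75791"}  # stand-in for the ModBase mapping
found, missing = check_accessions(["O75791", "X00000"], known, on_missing="ignore")
print(found, missing)  # ['O75791'] ['X00000']
```

With `on_missing="quit"` the same call raises `MissingAccessionError` on the second accession, so the problem is reported before any further work is done.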
In Training mode, the uploaded file is a list of peptides found in proteins that are known to be either positive or negative examples in a system involving specificity in protein-peptide recognition. For example, in studying which peptide sequences are recognized by a protease, the file could include a list of peptides known to be cleaved by the protease (the positives) and other peptides known not to be cleaved (the negatives). The file can include multiple peptides from one protein.
Each line in the file should include one peptide. The peptide is specified as such:
Accession Start_Position Sequence Classification
Accession: The UniProt accession of the protein containing the training peptide
Start_Position: The position in the sequence representing the first residue of the peptide (position counting is 1-based)
Sequence: The amino acid residue sequence of the peptide
Classification: This should be set to "Positive" or "Negative", reflecting whether the peptide is a positive or negative example of specificity
Entries on each line are separated by a space or tab. UniProt accessions are mapped to IDs in ModBase, a database of comparative protein structure models. ModBase continually updates this mapping. Occasionally a UniProt accession will not be found in ModBase. This could be because ModBase has not yet processed the sequence in its mapping, or because the accession is outdated or is not the primary accession for the protein. Note that accessions in UniProt contain six characters and are distinct from UniProt entry names, which are up to 11 characters and contain an '_'.
Additionally, all peptides must be the same length. The provided peptide start position and peptide sequence must match the residue sequence for the UniProt accession in ModBase. The number of negatives in the file must be greater than or equal to the number of positives (if this is not the case in your dataset, you can simply reverse your definition of Negative and Positive for each peptide).
An example follows:
O75791 11 SGEDELSF Positive
This represents the peptide SGEDELSF, found in the protein specified by UniProt accession O75791, starting at position 11. The peptide will be considered a known positive in the training set.
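Before uploading, a training file can be checked locally against the format rules above. The following is a minimal sketch, not the server's own validation: it checks the four-field layout, the 1-based start position, the classification label, the equal-length rule, and the requirement that negatives are at least as numerous as positives. It does not check that the sequence matches the protein in ModBase, skipping blank lines is an assumption, and the Negative line in the example is made up so that the file passes the count check.

```python
from collections import Counter


def parse_training_lines(lines):
    """Parse 'Accession Start_Position Sequence Classification' lines."""
    peptides = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on spaces or tabs
        if not fields:
            continue  # assumption: blank lines are skipped
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 fields, got {len(fields)}")
        accession, start, sequence, label = fields
        if not start.isdigit() or int(start) < 1:
            raise ValueError(f"line {number}: start position must be a 1-based integer")
        if label not in ("Positive", "Negative"):
            raise ValueError(f"line {number}: classification must be Positive or Negative")
        peptides.append((accession, int(start), sequence, label))

    # All peptides must be the same length.
    if len({len(p[2]) for p in peptides}) > 1:
        raise ValueError("all peptides must be the same length")

    # Negatives must be at least as numerous as positives.
    counts = Counter(p[3] for p in peptides)
    if counts["Negative"] < counts["Positive"]:
        raise ValueError("the number of negatives must be >= the number of positives")
    return peptides


example = [
    "O75791 11 SGEDELSF Positive",
    "O75791 40 AAAAAAAA Negative",  # made-up line, for illustration only
]
print(parse_training_lines(example))
```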
In Training mode, use this option to select how many Jackknifing iterations to run. Each iteration trains on a portion of the known positives and negatives listed in the Peptide Training File and tests on the remainder (the size of the test portion is set by the option "Jackknife Fraction"). It is suggested that multiple iterations of the Jackknifing procedure are run, so that the results are not biased by the particular peptides chosen for the test set in any one iteration. The maximum number of iterations that can be run is 1000. The results of the procedure are averaged to produce final results.
In Training mode, use this option to select what fraction of positive peptides in the Training file are set aside for testing; the remaining set of positives will be used for training. An equal number of negatives will be included in the training set, and the remaining negatives will be used in testing.
Thus, if you have 100 positives and 200 negatives in your training file, and specify 0.1 as the jackknife fraction, there will be 90 positives and 90 negatives in the training set, 10 positives in the test set, and 110 negatives in the test set.
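The arithmetic in this example can be reproduced with a short function. This is a sketch of the set sizes only; how the server rounds a fractional number of test positives is not stated here, so rounding to the nearest integer is an assumption, as is the function name.

```python
def jackknife_split_sizes(n_pos, n_neg, fraction):
    """Return training and test set sizes for one jackknife iteration."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    if n_neg < n_pos:
        raise ValueError("negatives must be at least as numerous as positives")
    test_pos = round(n_pos * fraction)  # assumption: round to nearest integer
    train_pos = n_pos - test_pos        # remaining positives are used for training
    train_neg = train_pos               # equal number of negatives in training
    test_neg = n_neg - train_neg        # all remaining negatives are used for testing
    return {
        "train_pos": train_pos,
        "train_neg": train_neg,
        "test_pos": test_pos,
        "test_neg": test_neg,
    }


print(jackknife_split_sizes(100, 200, 0.1))
# {'train_pos': 90, 'train_neg': 90, 'test_pos': 10, 'test_neg': 110}
```

Note that the test set is usually unbalanced: with many more negatives than positives in the file, most of the negatives end up in testing.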
In Application mode, the uploaded file is a list of UniProt accessions, one per line. The predictive model will search for high scoring peptides in the protein sequences for these accessions.
These accessions are mapped to IDs in ModBase, a database of comparative protein structure models. ModBase continually updates this mapping. Occasionally a UniProt accession will not be found in ModBase. This could be because ModBase has not yet processed the sequence in its mapping, or because the accession is outdated or is not the primary accession for the protein. Note that accessions in UniProt contain six characters and are distinct from UniProt entry names, which are up to 11 characters and contain an '_'.
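A quick local check can catch entry names that were pasted in place of accessions. The sketch below uses only the two properties stated above (six characters for an accession; up to 11 characters and an underscore for an entry name); it is a heuristic, it does not confirm that an accession exists in UniProt or ModBase, and the function names and the example entry name are illustrative.

```python
def looks_like_accession(token):
    """True if the token has six alphanumeric characters."""
    return len(token) == 6 and token.isalnum()


def looks_like_entry_name(token):
    """True if the token contains '_' and is at most 11 characters long."""
    return "_" in token and len(token) <= 11


for token in ["O75791", "GRAB_HUMAN"]:
    if looks_like_accession(token):
        print(token, "-> accession")
    elif looks_like_entry_name(token):
        print(token, "-> entry name; replace it with the accession")
    else:
        print(token, "-> unrecognized")
```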
Note that only human protein sequences currently have structural features calculated for them in this framework. More organisms may be represented as this service is developed. If you are interested in training or testing a specific species other than human, please contact the administrators.
In Application mode, the uploaded file allows the user to specify restrictions on the peptides they wish to score. These take the form of allowing only certain residues to be present at different positions in the peptide. This reduces the number of peptides scored and output, and allows the user to incorporate knowledge of the system into predictions. A demonstration is given by the GrB and caspase systems, where these proteases recognize peptides containing an aspartic acid at the P1 site. The provided GrB and caspase predictive models were optimized on these peptides, so when applying these models to search for high-scoring peptides, the user should restrict peptides to have the same property.
Each position to restrict is designated with one line in the specifier file. The line should begin with a number representing the position in the peptide. This is followed by a space-separated list of residues that should not be present in that position. If there are no restrictions on a position in the peptide, then there does not need to be a line specifying that position. Position counting is 1-based.
An example follows:
1 E D
4 A C E F G H I K L M N P Q R S T V W Y
8 W
This file will tell the service to only score peptides that don't have an acidic residue in the first position, that only contain Asp in the fourth position, and that don't have Trp in the 8th position.
Note that the highest position restricted in the file should not exceed the length of the peptides on which the predictive model was trained. For the provided GrB and Caspase predictors, this length is 8, representing the peptide from P4 to P4'.
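To make the file format concrete, the sketch below parses a specifier file with one restricted position per line and tests whether a peptide passes the restrictions. It is an illustration of the format as described, not the service's implementation; the function names are hypothetical.

```python
def parse_rules(text):
    """Map each restricted 1-based position to its set of excluded residues."""
    rules = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        rules[int(fields[0])] = set(fields[1:])
    return rules


def is_allowed(peptide, rules):
    """True if no restricted position holds an excluded residue."""
    for position, excluded in rules.items():
        if position > len(peptide):
            raise ValueError(f"position {position} exceeds peptide length {len(peptide)}")
        if peptide[position - 1] in excluded:  # convert 1-based to 0-based
            return False
    return True


RULES = """1 E D
4 A C E F G H I K L M N P Q R S T V W Y
8 W"""

rules = parse_rules(RULES)
print(is_allowed("SGEDELSF", rules))  # True: S at 1, D at 4, F at 8
print(is_allowed("EGEDELSF", rules))  # False: E at position 1 is excluded
print(is_allowed("SGEAELSF", rules))  # False: position 4 is not D
```

Because the file lists excluded residues, requiring a single residue at a position (D at position 4 here) means listing the other 19 amino acids on that line.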
This lets the user select which predictive model to use to score the peptides found in the accessions. Each model has been trained on a different set of known positives and negatives and is optimized for sequence and structure features in that set. Currently the available models include one to predict which peptides will be recognized by Granzyme B and another for Caspase peptides.