Phos3D - Help
Phos3D is a web server for the prediction of phosphorylation sites (P-sites) in proteins, originally designed to investigate the advantages of including spatial information in P-site prediction. The approach is based on Support Vector Machines trained on sequence profiles enhanced by information from the spatial context of experimentally identified P-sites. In addition to serine, threonine, and tyrosine P-sites, Phos3D can predict kinase-specific phosphorylation by the serine kinases PKA, PKC, MAPK, and CKII, as well as by the tyrosine kinase SRC. The quality of the predictions depends strongly on the quality of the submitted protein structures.
Select a protein structure file in valid PDB format and click the submit button. Depending on the size of the structure and the current server load, prediction may take a few minutes. P-site predictions are restricted to amino acids located at the center of a 13-mer peptide (six flanking residues on either side) for which all 13 amino acids are completely present (i.e. without missing atoms) in the structure file. Keep in mind that the predictive quality also depends strongly on the distances to non-sequence-local amino acids, so erroneous or incomplete protein structures may lead to inaccurate predictions.
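The 13-mer restriction described above can be illustrated with a minimal Python sketch. It is not the server's code: the function name `candidate_sites`, the one-letter sequence input, and the per-residue `complete` flags (True where no atoms are missing) are assumptions made for illustration, and positions are 0-based.

```python
FLANK = 6                   # six flanking residues on either side (13-mer)
TARGETS = {"S", "T", "Y"}   # serine, threonine, tyrosine

def candidate_sites(sequence, complete):
    """Return (index, motif) pairs for S/T/Y residues with a fully resolved 13-mer.

    sequence: one-letter amino acid string of one chain
    complete: list of booleans, True where all atoms of that residue are present
    """
    sites = []
    # Residues closer than FLANK positions to either terminus cannot be centered.
    for i in range(FLANK, len(sequence) - FLANK):
        if sequence[i] not in TARGETS:
            continue
        window = slice(i - FLANK, i + FLANK + 1)
        # Every residue of the 13-mer must be complete in the structure.
        if all(complete[window]):
            sites.append((i, sequence[window]))
    return sites
```

In this sketch, a tyrosine at the last position of a chain is skipped because it has no C-terminal flank, and a single incomplete residue anywhere in the window excludes the site.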
Interpretation of Results
Prediction results are provided as eight lists: three for Ser, Thr, and Tyr sites and five for the available kinase-specific predictions. Each list contains the classified amino acids, identified by the name of the submitted structure file and the corresponding chain ID (Sequence), the position in the SEQRES record of the structure file (Residue Index), the amino acid type (AA), and the sequence context (13-mer motif, Context). Amino acids marked with an asterisk (*) correspond to phosphorylated forms of amino acids in the PDB file, as annotated by the SEP, TPO, or PTR residue names in the structure file. The last two columns contain the decision value (Score) and the prediction (P-Site?) for the respective amino acid. A decision value > 0 is a positive prediction, marked with 'Yes' in the last column, whereas a decision value < 0 is a negative prediction (the last column is left blank). Larger absolute values indicate more confident predictions.
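As a small illustration of how the Score and P-Site? columns relate, the following Python sketch maps a decision value to the text shown in the last column. The function name `p_site_label` is hypothetical, and treating a score of exactly 0 as a negative prediction is an assumption, since the text above only covers values above and below zero.

```python
def p_site_label(score):
    """Map an SVM decision value to the content of the 'P-Site?' column."""
    # Positive decision values are reported as 'Yes'; otherwise the cell is blank.
    # Assumption: a score of exactly 0 is treated like a negative prediction.
    return "Yes" if score > 0 else ""

def confidence(score):
    """Larger absolute decision values indicate more confident predictions."""
    return abs(score)
```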
How to cite Phos3D?
Please cite the following reference if you use Phos3D for your work.
The dataset of phosphorylation sites was obtained from the Phospho.ELM database. Serine, threonine, and tyrosine residues that were annotated as phosphorylated were extracted from their native sequence together with six flanking amino acids on either side. If such a 13-mer motif could not be extracted, for instance because of missing amino acids, the respective incomplete motif was discarded. To identify associated protein structures and the actual conformations and locations of the peptide motifs within their three-dimensional context, we screened the Protein Data Bank (PDB) for protein structures containing the 13-mer peptide sequence motifs associated with phosphorylation sites based on exact sequence matches. Considering only structures with complete atomic coordinates for the phosphorylation motif, we obtained a set of 750 non-redundant structural phosphorylation motifs (phos-Set: Ser: 363, Thr: 134, Tyr: 253 structural motifs). For a subset containing 307 motifs (Ser: 164, Thr: 59, Tyr: 84 motifs), information on their respective phosphorylating kinases was available.
We removed the phos-Set motifs from the sequences of their respective protein chains with known 3D structure. From the remaining sequence fragments, we extracted all non-overlapping Ser/Thr/Tyr site motifs. The resulting sets of sites were used as true-negative sets (non-phos-Sets). For training, the non-phos-Sets were scaled down by randomly eliminating sites until the negative sets were no more than twice as large as the positive sets.
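The scaling-down of the negative sets can be sketched in Python as follows. This is only an illustration of random downsampling to a 2:1 ratio; the function name `downsample_negatives`, the `max_ratio` parameter, and the fixed seed are assumptions and do not reflect the original implementation.

```python
import random

def downsample_negatives(positives, negatives, max_ratio=2, seed=0):
    """Randomly drop negative sites until there are at most max_ratio times
    as many negatives as positives."""
    rng = random.Random(seed)          # fixed seed only for reproducibility here
    limit = max_ratio * len(positives)
    if len(negatives) <= limit:
        return list(negatives)         # already small enough, keep everything
    # Sampling without replacement is equivalent to randomly eliminating sites.
    return rng.sample(list(negatives), limit)
```

For the 363 serine motifs of the phos-Set, for example, such a procedure would keep at most 726 negative serine sites.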
For classification, Phos3D uses Support Vector Machines (SVMs) as implemented in the kernlab R package by Karatzoglou et al. For training, the default radial basis function kernel with automated sigma estimation was applied. The feature vector (FV) used for the SVMs combined physicochemical amino acid properties, representing the sequence information, with spatial information based on amino acid distribution patterns in the spatial context of putatively phosphorylated sites. For the amino acid property components of the FV, we used values from a collection of 530 commonly used indices provided by the AAindex database, including hydrophobicity, solvent accessibility preferences, secondary and tertiary structure preferences, polarity, volume, structural disorder indices, and others. This part of the vector consisted of 530 × 12 dimensions, one for every index and position around the central serine, threonine, or tyrosine, whose components were the values from the respective index, plus 530 dimensions for the average index value of the particular sequence motif. The latter dimensions were introduced to cover general properties of the motifs, e.g. negative charge or high flexibility. The dimensionality of the feature vectors was reduced by principal component analysis (PCA) and subsequent replacement of the amino acid properties by the resulting principal components. The PCA was performed independently for the serine, threonine, and tyrosine motifs. The resulting rotation matrices are used to transform unknown sites during classification.
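The layout of the sequence part of the feature vector can be sketched in Python as shown below. This is a sketch under stated assumptions: the function name `sequence_features`, the representation of each index as a dictionary from residue letter to value, the ordering of the components, and averaging over all 13 residues of the motif (rather than only the 12 flanking ones) are illustrative choices not specified in the text. The PCA and SVM steps are not shown.

```python
CENTER = 6  # 0-based position of the central Ser/Thr/Tyr in a 13-mer

def sequence_features(motif, indices):
    """Build the property part of the feature vector for one 13-mer motif.

    motif:   13-letter one-letter amino acid string
    indices: list of dicts, each mapping a residue letter to an index value
             (530 AAindex entries in the original; any number works here)
    """
    if len(motif) != 13:
        raise ValueError("motif must be a 13-mer")
    flanks = motif[:CENTER] + motif[CENTER + 1:]   # 12 positions around the center
    vector = []
    # One component per index and flanking position: len(indices) x 12 values.
    for index in indices:
        vector.extend(index[aa] for aa in flanks)
    # One component per index for the motif average: len(indices) values.
    for index in indices:
        vector.append(sum(index[aa] for aa in motif) / len(motif))
    return vector
```

With 530 indices this yields 530 × 12 + 530 = 6890 components before dimensionality reduction.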
The spatial information component consisted of the normalized distribution ratios according to Eq. 1. The ratios of amino acid residues within the local sequence, outside the local sequence, and irrespective of the position in the protein sequence were used for distances in a range of 2 to 10 Å between the putatively activated oxygen (β-hydrogen) in case of a central serine or threonine, or the γ-carbon in case of tyrosine, and the closest atom of all other amino acid residues, or between the interaction centers proposed by Park et al. The performance of Phos3D was evaluated by the Area Under the Receiver Operating Characteristic curve (AUC) resulting from a 10-fold cross-validation.
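For readers unfamiliar with the evaluation measure, the area under the ROC curve can be computed directly from decision values as the fraction of positive/negative pairs that are ranked correctly. The Python sketch below shows this pairwise formulation; the function name `roc_auc` is hypothetical, and the cross-validation splitting itself is not shown.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve from decision values of positive and negative sites.

    Equals the probability that a randomly chosen positive site receives a
    higher score than a randomly chosen negative site (ties count as 0.5).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

A value of 1.0 corresponds to perfect separation of P-sites from non-P-sites, and 0.5 to random ranking.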
Eq. 1: Calculation of amino acid propensity ratios for the estimation of average depletion or enrichment given a particular motif set. #AA_k is the count of amino acids of residue type k and #AA_s is the count summed over all amino acid residue types; r is the considered radial distance to the central serine/threonine/tyrosine, f is the relative frequency of amino acid k in a particular set, and g is the relative frequency of amino acid k in the reference non-phos set.
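The definitions above suggest a ratio of relative frequencies, f_k(r) / g_k(r), with each relative frequency computed as #AA_k(r) / #AA_s(r). The equation itself is not reproduced in this text, so the following Python sketch is only one plausible reading: whether the original applies a logarithm or an additional normalization is an open assumption, and the function names are hypothetical. The inputs are the residue types observed within distance r of the central residues in the motif set and in the reference non-phos set.

```python
from collections import Counter

def relative_frequencies(residues):
    """Relative frequency of each residue type: #AA_k / #AA_s."""
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def propensity_ratios(set_residues, reference_residues):
    """Ratio f_k / g_k for every residue type k seen in the reference set.

    Values above 1 indicate enrichment in the motif set relative to the
    reference non-phos set, values below 1 indicate depletion.
    """
    f = relative_frequencies(set_residues)
    g = relative_frequencies(reference_residues)
    # Residue types absent from the motif set get a frequency of 0.
    return {aa: f.get(aa, 0.0) / g[aa] for aa in g}
```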
Diella, F., Cameron, S., Gemund, C., Linding, R., Via, A., Kuster, B., Sicheritz-Ponten, T., Blom, N. & Gibson, T. J. (2004). Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics 5, 79.