pFind Studio: a computational solution for mass spectrometry-based proteomics
Introduction
Summary: pScan is a flexible tool that helps biologists preprocess protein sequence databases in proteomics research. Besides commonly used functions, such as sequence pattern-matching, building decoy databases, and converting protein sequence databases to peptide sequence databases, pScan also supports querying and substituting protein entries using regular expressions, creating customized databases, and conducting statistical characterization of the databases. pScan can help biologists improve the design of proteomics experiments and facilitate database search and analysis by making full use of the information contained in the sequence databases.
Database searching is a commonly used method for peptide identification in high-throughput proteomics. Protein sequence databases, such as Swiss-Prot, IPI and NCBI-nr, play a critical role in proteomics. Currently, there are a few database preprocessing toolkits, such as Kangaroo (Betel et al., 2002), DecoyDBB (Reidegeld et al., 2008), and DBToolkit (Martens et al., 2005), which have already helped biologists considerably in protein sequence database processing. Kangaroo is a sequence pattern-matching toolkit, DecoyDBB can build target-decoy databases with three different decoy strategies, and DBToolkit can convert protein sequence databases to peptide sequence databases to enhance protein identification. However, these commonly used functions are implemented separately in different software packages. Moreover, some special database preprocessing functions have not been implemented in the available software but are extremely useful for designing proteomics experiments and for facilitating the database search, such as querying and substituting protein entries using regular expressions, creating customized databases, and conducting statistical characterization of the databases. To address this gap, we have developed an integrated software tool named pScan for protein sequence database preprocessing.
pScan is an easily extensible and user-friendly database preprocessing toolbox. First, pScan allows biologists to edit, query and substitute the accession ID, the description and the sequence of each entry in a FASTA file using various types of regular expressions. Second, pScan can create customized databases, e.g., sub-species databases, N- and C-terminal sequence databases, and target-decoy databases with different decoy strategies, which are very helpful for peptide identification with database search engines such as pFind (Wang et al., 2007), SEQUEST and Mascot. Third, pScan supports statistical characterization of protein sequence databases, for example, the ratio of digested peptides containing a specific amino acid to all peptide sequences, the ratio of digested peptides containing special modification patterns (e.g., ‘NXS/T/C’ in glycosylation and ‘S/T/Y’ in phosphorylation) to all peptide sequences, and the distribution of mass values of all peptides (with or without modifications) obtained from digestion of the proteins. These flexible manipulations help to improve the design of proteomics experiments and to facilitate database search and analysis by making full use of the information contained in the sequence databases.
Functionalities & Applications
pScan can perform various types of preprocessing on protein sequence databases. Here, we present some commonly used applications in pScan.
Display, Query and Substitute Sequences - Besides the commonly used regular-expression-based sequence pattern-matching against the entire database file, pScan can also help biologists to display, query and substitute the accession ID, the description and the sequence of each entry in the sequence databases, collectively or separately. For example, biologists are often interested in the sequence motif ‘NXS/T/C’ in N-glycosylation site analysis, where X may be any amino acid except proline (Bause et al., 1979). pScan has been successfully used to substitute the letter N in this motif with J, which was defined to have the same mass as Asn, for database searching with pFind in the large-scale identification of core fucosylated glycoproteins (see Jia et al., 2009 and Fu et al., 2009).
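The N-to-J substitution described above can be illustrated with a short regular-expression sketch in Python. This is not pScan's own code: the function name `mark_sequons` and the example sequence are hypothetical, and the pattern assumes the motif definition given in the text (N, then any residue except P, then S, T or C).

```python
import re

# N followed by any residue except proline, then S, T or C.
# The lookahead leaves the two following residues unconsumed,
# so adjacent or overlapping motifs are each examined.
SEQUON = re.compile(r"N(?=[^P][STC])")

def mark_sequons(sequence: str, marker: str = "J") -> str:
    """Replace the Asn of every N-X-S/T/C motif (X != Pro) with `marker`."""
    return SEQUON.sub(marker, sequence)

# Hypothetical toy sequence: the N before "PT" is left unchanged.
print(mark_sequons("MKNGSANPTLNKC"))
```

Because only the N is replaced, sequence length and residue positions are preserved, which is what allows J to stand in for Asn during the search.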
Create Customized Databases - pScan can help biologists to extract any protein sub-database from the NCBI taxonomy database, or from any other database, using self-defined regular expressions. For example, the ‘bovin’ protein database can be easily retrieved from the NCBI taxonomy database by entering the regular expression ‘bovin’ into the ‘DE (DEscription)’ query edit box in pScan.
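A minimal sketch of this kind of description-based extraction is shown below. It assumes that the whole FASTA header line serves as the description field and that matching is case-insensitive; both are illustrative assumptions rather than documented pScan behavior. The helper names and the two toy entries are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Tuple

def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_description(entries, pattern: str) -> List[Tuple[str, str]]:
    """Keep entries whose header matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(h, s) for h, s in entries if regex.search(h)]

# Toy entries with made-up identifiers and sequences.
example = """>toy|T0001|TOY1_BOVIN toy protein one
MKWVTFISLL
LLFSSAYS
>toy|T0002|TOY1_HUMAN toy protein two
MKWVTFISLLFLFSSAYS
""".splitlines()

bovine = filter_by_description(read_fasta(example), "bovin")
print(bovine)
```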
pScan can create N- or C-terminal sequence databases, which contain the first n residues counted from the N- or C-terminus of each target sequence, respectively. Compared with shotgun proteomics, terminal proteomics can give biologists higher-fidelity results because of the high information content of terminal sequences (Nakazawa et al., 2008).
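Extracting a terminal fragment amounts to simple string slicing, as the following sketch shows. The function name `terminal_fragment` is hypothetical, and returning the whole sequence when it is shorter than n is an assumption, since the text does not say how short entries are handled.

```python
def terminal_fragment(sequence: str, n: int, end: str = "N") -> str:
    """Return the n residues at the N- or C-terminal end of `sequence`.

    Sequences shorter than n are returned whole (an assumption).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if end == "N":
        return sequence[:n]
    if end == "C":
        return sequence[-n:]
    raise ValueError("end must be 'N' or 'C'")

print(terminal_fragment("MKWVTFISLL", 3, "N"))
print(terminal_fragment("MKWVTFISLL", 3, "C"))
```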
The target-decoy search strategy is a widely used method to control the false discovery rate. Reverse and shuffle strategies have been implemented in pScan to create decoy databases. A reverse database is created simply by reversing each target protein sequence. A shuffle database is built by moving each letter of the target protein sequence to a randomly chosen position in the decoy sequence. pScan can create two types of databases: a composite target-decoy database and a decoy-only database.

Fig. 1. The human IPI database version 3.55 is used to conduct the statistical characterization. (a) The ratio of digested peptides with specific amino acids to all peptide sequences. (b) The ratio of digested peptides with special modification patterns (e.g., ‘NXS/T/C’ in glycosylation, ‘S/T/Y’ in phosphorylation, ‘M’ in oxidation, and ‘C’ in carbamidomethylation) to all peptide sequences. (c) Mass distribution of phosphorylated peptides for nominal mass 950 u and 1050 u. (d) Robust and extensible framework in the core implementation of pScan.
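The reverse and shuffle decoy strategies described in this section can be sketched as follows. This is an illustration under stated assumptions, not pScan's implementation: the `DECOY_` header prefix, the function names and the seeded random generator are all hypothetical choices.

```python
import random
from typing import List, Tuple

def reverse_decoy(sequence: str) -> str:
    """Reverse the target sequence."""
    return sequence[::-1]

def shuffle_decoy(sequence: str, rng: random.Random) -> str:
    """Randomly permute the residues of the target sequence."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

def build_database(entries: List[Tuple[str, str]], strategy: str = "reverse",
                   composite: bool = True, seed: int = 0,
                   prefix: str = "DECOY_") -> List[Tuple[str, str]]:
    """Build a composite target-decoy database or a decoy-only database."""
    if strategy not in ("reverse", "shuffle"):
        raise ValueError("strategy must be 'reverse' or 'shuffle'")
    rng = random.Random(seed)  # fixed seed keeps shuffled decoys reproducible
    out = []
    for header, seq in entries:
        if composite:
            out.append((header, seq))
        decoy = reverse_decoy(seq) if strategy == "reverse" else shuffle_decoy(seq, rng)
        out.append((prefix + header, decoy))
    return out

targets = [("toy1", "PEPTIDEK"), ("toy2", "MKWVTFISLL")]  # toy entries
print(build_database(targets, "reverse"))
```

Both strategies preserve the length and amino acid composition of each protein, so the decoy entries resemble the targets in residue statistics.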
Conduct Statistical Characterization - A powerful protein enzymatic digestion and indexing software package, IndexToolkit (Li et al., 2006), has been integrated into pScan to generate all peptides obtained from digestion of the proteins. Currently, three peptide-level statistical characterization methods have been implemented in pScan to improve the design of experiments. The human IPI database version 3.55 is used here to illustrate the statistical characterization.
First, pScan can calculate the ratio of digested peptides containing a specific amino acid to all peptide sequences (Fig. 1a), which is useful for stable isotope labeling in quantitative proteomics.
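As a rough illustration of this first statistic, the sketch below digests a sequence in silico and computes the fraction of peptides containing a given residue. It does not use IndexToolkit; the digestion rule (trypsin, cleaving after K or R unless followed by P, with no missed cleavages) and the function names are assumptions made for the example.

```python
from typing import List

def tryptic_peptides(sequence: str, min_len: int = 1) -> List[str]:
    """Fully tryptic digestion: cleave after K/R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (aa in "KR" and sequence[i + 1] != "P"):
            peptide = sequence[start:i + 1]
            if len(peptide) >= min_len:
                peptides.append(peptide)
            start = i + 1
    return peptides

def ratio_containing(peptides: List[str], residue: str) -> float:
    """Fraction of peptides that contain `residue` at least once."""
    if not peptides:
        return 0.0
    return sum(residue in p for p in peptides) / len(peptides)

peptides = tryptic_peptides("AKRPCDKEF")  # toy sequence
print(peptides, ratio_containing(peptides, "C"))
```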
Second, pScan can calculate the ratio of digested peptides containing special modification patterns (e.g., ‘NXS/T/C’ in glycosylation, ‘S/T/Y’ in phosphorylation, ‘M’ in oxidation, and ‘C’ in carbamidomethylation) to all peptide sequences (Fig. 1b), which is very helpful for the study of post-translational modifications.
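The pattern-based ratios can be sketched with one regular expression per modification pattern, applied to an already digested peptide list. The pattern dictionary and the function name are hypothetical, and the glycosylation pattern assumes X may not be proline. Motifs that span a cleavage site are not counted in this simplified version.

```python
import re
from typing import Dict, List

# One regular expression per modification pattern named in the text.
PATTERNS: Dict[str, str] = {
    "glycosylation (NXS/T/C)": r"N[^P][STC]",
    "phosphorylation (S/T/Y)": r"[STY]",
    "oxidation (M)": r"M",
    "carbamidomethylation (C)": r"C",
}

def pattern_ratios(peptides: List[str],
                   patterns: Dict[str, str] = PATTERNS) -> Dict[str, float]:
    """Fraction of peptides matching each pattern at least once."""
    total = len(peptides)
    ratios = {}
    for name, pattern in patterns.items():
        regex = re.compile(pattern)
        hits = sum(1 for p in peptides if regex.search(p))
        ratios[name] = hits / total if total else 0.0
    return ratios

toy_peptides = ["ANGSK", "MCPEK", "LLNPSR", "AAGGK"]  # made-up peptides
print(pattern_ratios(toy_peptides))
```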
Third, pScan can calculate the mass distribution of all peptides (with or without modifications) obtained from digestion of the proteins (Fig. 1c).
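A mass distribution of this kind can be approximated by summing monoisotopic residue masses and binning the results. In the sketch below, the residue table uses standard monoisotopic values, while the function names, the handling of phosphorylation as a fixed number of added HPO3 groups, and the binning scheme are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Iterable

# Standard monoisotopic residue masses (u).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # H2O added for the free termini
PHOSPHO = 79.96633  # HPO3 added per phosphorylated S/T/Y

def peptide_mass(peptide: str, n_phospho: int = 0) -> float:
    """Monoisotopic neutral mass of a peptide with n_phospho phosphate groups."""
    sites = sum(peptide.count(aa) for aa in "STY")
    if not 0 <= n_phospho <= sites:
        raise ValueError("n_phospho must be between 0 and the number of S/T/Y")
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + n_phospho * PHOSPHO

def mass_histogram(masses: Iterable[float], bin_width: float = 1.0) -> Counter:
    """Count masses per bin; each key is the lower edge of a bin."""
    return Counter(math.floor(m / bin_width) * bin_width for m in masses)

masses = [peptide_mass(p) for p in ["GA", "AG", "SK"]]  # toy peptides
print(mass_histogram(masses))
```

A smaller `bin_width` (for example 0.01) resolves how masses are distributed within a single nominal mass.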
These statistical characterizations help biologists to design their experiments more carefully and to obtain more reliable identification results.
In sum, pScan can help biologists to improve the design of proteomics experiments and to facilitate database search and analysis by making full use of the information contained in the sequence databases. pScan has been integrated into pFind Studio (http://pfind.ict.ac.cn), a new, efficient and effective software platform for mass spectrometry-based proteomics, and has been successfully applied in numerous experiment design and database search tasks. With the robust and extensible framework in its core implementation (Fig. 1d), new functions can be easily incorporated into pScan as needed in the future.