Computers "Learn" to Find Mass Spec Peptides from Sequence Data
Machine-learning approach may prove more cost-effective and accurate than the current method
Results: Training a computer to recognize acceptable behavior is a lot like training a puppy. You need plenty of repetition, consistency, and patience. Researchers at the Pacific Northwest National Laboratory (PNNL) have used an advanced machine-learning method for classifying data to help a computer program learn to identify observable peptides from mass spectrometry data. They trained their model using a large shared-memory machine in EMSL, a national scientific user facility at PNNL.
The support vector machine (SVM) models they developed can enhance biologists' ability to quickly and economically determine which peptide sequences can be identified. The SVM provides an accurate and robust approach to statistical classification, with the essential ability to perform non-linear mapping and to handle highly noisy data in a straightforward way.
Why it matters: Proteins are the chemical engine of the cell. Protein identification provides new knowledge needed to develop solutions in bioenergy, carbon sequestration, and environmental remediation.
Identifying proteins is one of the major challenges of proteomics studies based on mass spectrometry data. As the amount of data grows, so does the challenge of sorting through it and identifying what job each protein performs. By training the computer to recognize the most likely protein fragments (peptides), the SVM helps identify protein sequences more effectively and efficiently among millions of data points.
The standard approach to identifying proteins, which is based on accurate mass and elution time, compares data profiles to a database of peptides previously identified in tandem mass spectrometry studies.
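The database comparison described above can be sketched roughly as follows. This is a minimal illustration, not the laboratory's pipeline: the `Tag` record, the `match_feature` function, the tolerance values, and every mass and elution-time number are hypothetical and chosen only to show the idea of matching on two measurements at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One previously identified peptide (all values are illustrative)."""
    peptide: str
    mass: float  # mass in daltons
    net: float   # elution time normalized to the range 0..1


def match_feature(mass, net, database, ppm_tol=5.0, net_tol=0.02):
    """Return database entries within both the mass and elution-time tolerances.

    ppm_tol and net_tol are assumed values, not ones reported in the article.
    """
    hits = []
    for tag in database:
        ppm_error = abs(mass - tag.mass) / tag.mass * 1e6
        if ppm_error <= ppm_tol and abs(net - tag.net) <= net_tol:
            hits.append(tag)
    return hits


# Two made-up entries with nearly identical masses but different elution times.
database = [
    Tag("PEPTIDE_A", 1000.5000, 0.30),
    Tag("PEPTIDE_B", 1000.5030, 0.60),
]

observed = match_feature(1000.5020, 0.31, database)
print([tag.peptide for tag in observed])
```

In this toy case both entries fall within the mass tolerance, and only the elution time separates them, which is why the approach relies on both measurements.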
The PNNL-developed SVM method provides an approach for predicting the quantities of peptides that are detectable and thus can significantly reduce both the computational expense of peptide identification and the need for extensive mass spectrometry analyses. The ability to define the peptides can have a cascading effect that will ultimately yield more accurate statistical predictions of peptide and protein quantities.
Methods: Training a computer on large, dense datasets is often challenging. Given 20 amino acids, the number of peptides that can be associated with organisms is extremely large; for example, a chain of 5 amino acids could yield over three million possible peptides (20^5 = 3.2 million). To reduce this dimensionality, the researchers train on previously identified peptides, which still number in the tens of thousands.
Researchers used the large shared-memory Altix machine in EMSL to train the model, completing training for three species within a day. The Altix's large shared memory allowed researchers to train on large-scale datasets using data structures that cannot fit in memory on normal workstations. Without the Altix, training on data at this scale takes much longer or, for some implementations, is impossible. What emerged was an accurate and robust approach to the statistical classification of peptides observable by mass spectrometry, with the ability to perform non-linear mapping and to handle highly noisy data in a straightforward way.
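The arithmetic behind that figure is simple to check. The short sketch below counts the possible linear sequences of a given length over the 20 standard amino acids; the function name `peptide_space` is ours, introduced only for illustration.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptide_space(length: int, alphabet_size: int = len(AMINO_ACIDS)) -> int:
    """Number of distinct linear sequences of the given length."""
    return alphabet_size ** length


# The count grows by a factor of 20 with every added residue.
for n in (5, 10, 20):
    print(n, peptide_space(n))
```

A chain of 5 residues gives 20^5 = 3,200,000 sequences, and a chain of 10 already gives more than ten trillion, which is why training is restricted to peptides that have actually been observed.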
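To give a feel for what "non-linear mapping" means here, the sketch below trains a kernel perceptron, a much simpler relative of the SVM that uses the same kernel idea, on a four-point toy problem that no straight line can separate. It is not the researchers' model: the two numeric features, the labels, the radial basis function kernel, and its `gamma` setting are all assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Radial basis function kernel: similarity decays with squared distance."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def decision(x, points, labels, alphas):
    """Weighted sum of kernel similarities to the training points."""
    return sum(a * y * rbf_kernel(p, x) for a, y, p in zip(alphas, labels, points))


def train_kernel_perceptron(points, labels, epochs=20):
    """Increase a point's weight each time it is misclassified."""
    alphas = [0] * len(points)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            if y * decision(x, points, labels, alphas) <= 0:
                alphas[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alphas


def predict(x, points, labels, alphas):
    return 1 if decision(x, points, labels, alphas) > 0 else -1


# XOR-like toy data: +1 could stand for "observable", -1 for "not observable".
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [-1, 1, 1, -1]

alphas = train_kernel_perceptron(points, labels)
predictions = [predict(p, points, labels, alphas) for p in points]
print(predictions)
```

A linear classifier cannot label these four points correctly, but the kernel version can, because the kernel implicitly maps the points into a space where they become separable. An SVM adds a margin-maximizing objective and a tolerance for mislabeled points on top of this idea, which is what makes it suitable for noisy data.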
What's next: The PNNL research team is currently working on generalizing the SVM models to work on diverse species, ranging from microorganisms to higher eukaryotes. The long-term goal is to apply these models to improve protein identification from community proteomics data associated with problems such as the human microbiome and bioenergy.
Acknowledgments: This work was supported through PNNL Laboratory Directed Research and Development funds; EMSL, a U.S. Department of Energy (DOE) national scientific user facility at PNNL; and the DOE Office of Advanced Scientific Computing Research.
Citation: Webb-Robertson BJ, Cannon WR, Oehmen CS, Shah AR, Gurumoorthi V, Lipton MS, Waters KM. "A support vector machine model for the prediction of proteotypic peptides for accurate mass and time proteomics." Bioinformatics, 2008 Jul 1;24(13):1503-9. Epub 2008