April 12, 2023
Journal Article

Snekmer: A scalable pipeline for protein sequence fingerprinting based on amino acid recoding

Abstract

Motivation: The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families.

Results: Here we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (1) construction of supervised classification models trained on input protein families, or (2) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from UniProt, and demonstrate that our method accurately differentiates these sequences in both operation modes.

Availability and Implementation: Snekmer is written in Python using Snakemake. Code and data used in this paper, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open source BSD-3 license.

Contact: Jason.McDermott@pnnl.gov

Supplementary information: Supplementary data are available at Bioinformatics Advances online.
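The amino acid recoding idea summarized in the abstract can be illustrated with a minimal sketch. This is not Snekmer's implementation or API: the four-letter reduced alphabet in `GROUPS`, the function names `recode` and `kmer_vector`, and the choice of k = 3 are all assumptions made here for illustration only.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet (an assumption for this sketch, not
# necessarily a grouping provided by Snekmer): each of the 20 standard
# amino acids maps to one of four chemical-property classes.
GROUPS = {
    "H": "ACFGILMPVW",  # broadly hydrophobic
    "P": "NQSTY",       # polar, uncharged
    "B": "HKR",         # basic
    "A": "DE",          # acidic
}

# Invert the grouping into a residue -> class lookup table.
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(seq, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(RECODE.get(aa, unknown) for aa in seq.upper())


def kmer_vector(seq, k=3):
    """Count recoded kmers over a fixed, sorted basis of all possible kmers.

    Kmers containing the unknown symbol are not in the basis and are
    therefore dropped from the vector.
    """
    recoded = recode(seq)
    counts = Counter(recoded[i:i + k] for i in range(len(recoded) - k + 1))
    basis = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    return [counts.get(kmer, 0) for kmer in basis]


if __name__ == "__main__":
    print(recode("MKDE"), recode("MRDD"))
    print(kmer_vector("MKDE") == kmer_vector("MRDD"))
```

In this sketch, `MKDE` and `MRDD` share no exact 3-mer at the residue level, yet both recode to the same reduced string and therefore yield identical kmer vectors. That is the sense in which recoding can relate sequences that exact-match kmers would treat as unrelated. The resulting fixed-length vectors could then be passed to a classifier or a clustering routine, as the abstract describes.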

Published: April 12, 2023

Citation

Chang C.H., W.C. Nelson, A.R. Jerger, A.T. Wright, R.G. Egbert, and J.E. McDermott. 2023. Snekmer: A scalable pipeline for protein sequence fingerprinting based on amino acid recoding. Bioinformatics Advances 3, no. 1:Art. No. vbad005. PNNL-SA-169271. doi:10.1093/bioadv/vbad005