April 5, 2024
Journal Article

Similarity Downselection: Finding the n Most Dissimilar Molecular Conformers for Reference-Free Metabolomics

Abstract

Finding the set of the n most dissimilar items from a large population becomes increasingly difficult and computationally expensive as either n or the population size grows. The problem differs from simply sorting an array of numbers because each item has a pairwise relationship with every other item in the population. For instance, one or more items in the most dissimilar set of size n=4 might not appear in the most dissimilar set of size n=5. An exact solution would require an exhaustive search of all possible combinations of size n in the population. We present open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a new heuristic algorithm for finding the set(s) of the n most dissimilar items. We benchmark the algorithm, as instantiated in Python, against an uninformed Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that for the original implementation of SDS, which finds the set of the n most dissimilar conformers, our method is not only orders of magnitude faster but also more accurate than running the Monte Carlo method for 1,000,000 iterations when searching for set sizes n=3–7 in a population of 50,000. We also benchmark SDS against the exact solution for small example populations, showing that SDS produces a solution close to the exact solution in these instances.
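The two baselines named in the abstract, the exhaustive exact search and the uninformed Monte Carlo search, can be illustrated with a minimal Python sketch. This is not the SDS heuristic itself, which the abstract does not describe in detail. The function names, the toy data, and the scoring rule (the sum of pairwise distances within a set) are assumptions made only for illustration; the actual software may score dissimilarity differently.

```python
import itertools
import random


def set_dissimilarity(dist, subset):
    """Score a subset as the sum of its pairwise distances (an assumed rule)."""
    return sum(dist[i][j] for i, j in itertools.combinations(subset, 2))


def exact_most_dissimilar(dist, n):
    """Exhaustively score every size-n combination.

    The number of combinations grows combinatorially, so this is only
    feasible for small populations.
    """
    items = range(len(dist))
    return max(itertools.combinations(items, n),
               key=lambda s: set_dissimilarity(dist, s))


def monte_carlo_most_dissimilar(dist, n, iterations, seed=0):
    """Uninformed baseline: keep the best of many random size-n samples."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = tuple(sorted(rng.sample(range(len(dist)), n)))
        score = set_dissimilarity(dist, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical toy population: six points on a line, with absolute
# differences standing in for a conformer distance matrix.
points = [0.0, 1.0, 2.0, 9.0, 10.0, 4.5]
dist = [[abs(a - b) for b in points] for a in points]

exact_pair = exact_most_dissimilar(dist, 2)
exact_triple = exact_most_dissimilar(dist, 3)
mc_triple = monte_carlo_most_dissimilar(dist, 3, iterations=5)
```

By construction, the Monte Carlo result can never score higher than the exact result; it only approaches it as the number of iterations grows, which is why an informed heuristic becomes attractive for populations in the tens of thousands.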

Published: April 5, 2024

Citation

Nielson F.F., B. Kay, S.J. Young, S.M. Colby, R.S. Renslow, and T.O. Metz. 2023. Similarity Downselection: Finding the n Most Dissimilar Molecular Conformers for Reference-Free Metabolomics. Metabolites 13, no. 1:Art. No. 105. PNNL-SA-157372. doi:10.3390/metabo13010105
