February 15, 2024
Conference Paper

Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources

Abstract

Applications that fuse machine learning and simulation are rarely best served by a single computing resource. Highly parallel simulation codes are best deployed on supercomputers, while the AI tasks used to decide which simulations to perform may be better suited to specialized accelerators. Here we present a Function-as-a-Service (FaaS) system for executing complex, distributed computational campaigns that achieves performance parity with conventional workflow systems without the complexities of secure network connections between compute providers. One innovation enabling high performance is a subsystem that moves task data directly between sites, separate from the cloud-hosted FaaS system used to distribute task instructions. We also introduce a flexible scheduling system that allows us to access factor-of-2 trade-offs between the amounts of resources required at each compute site to solve a problem. We anticipate that this system will upgrade multi-site applications from demonstration projects to routine practice in computational science.
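
The separation of task instructions from task data described in the abstract can be illustrated with a minimal, in-memory sketch. It is not the system presented in the paper: the names `SiteStore`, `CloudBroker`, `run_remote`, and `demo` are hypothetical, and both "channels" are simulated with ordinary Python objects. The point is only that the cloud-hosted service relays a small reference, while the bulky input travels over a separate path.

```python
import json
import pickle
import uuid


class SiteStore:
    """Hypothetical stand-in for a direct site-to-site data channel."""

    def __init__(self):
        self._objects = {}

    def put(self, obj):
        # Store the serialized object and hand back a short reference key.
        key = uuid.uuid4().hex
        self._objects[key] = pickle.dumps(obj)
        return key

    def get(self, key):
        return pickle.loads(self._objects[key])


class CloudBroker:
    """Hypothetical stand-in for a cloud-hosted FaaS service.

    It relays only small instruction messages and counts the bytes it sees.
    """

    def __init__(self):
        self.bytes_relayed = 0
        self._queue = []

    def submit(self, function_name, input_key):
        message = json.dumps({"function": function_name, "input_key": input_key})
        payload = message.encode("utf-8")
        self.bytes_relayed += len(payload)
        self._queue.append(payload)

    def next_task(self):
        return json.loads(self._queue.pop(0).decode("utf-8"))


# Functions assumed to be registered at the remote compute site.
FUNCTIONS = {"total": sum}


def run_remote(broker, store):
    """Remote worker: read the instruction, then fetch the data by reference."""
    task = broker.next_task()
    data = store.get(task["input_key"])
    return FUNCTIONS[task["function"]](data)


def demo():
    store = SiteStore()
    broker = CloudBroker()
    data = list(range(100_000))      # large task input
    key = store.put(data)            # data goes over the data channel
    broker.submit("total", key)      # only the reference goes via the cloud
    result = run_remote(broker, store)
    return result, broker.bytes_relayed, len(pickle.dumps(data))


if __name__ == "__main__":
    result, relayed, raw = demo()
    print(f"result={result}, bytes via cloud={relayed}, bytes of input={raw}")
```

In this toy setup the message relayed through the broker stays the same size regardless of how large the input grows, which is the property the abstract attributes to moving task data outside the FaaS service.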

Published: February 15, 2024

Citation

Ward L., J. Pauloski, V. Hayot-Sasson, R. Chard, Y. Babuji, G. Sivaraman, S. Choudhury, et al. 2023. Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources. In IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW 2023), May 15-19, 2023, St. Petersburg, FL, 32-41. Piscataway, New Jersey: IEEE. PNNL-SA-179856. doi:10.1109/IPDPSW59300.2023.00018
