August 13, 2023
Journal Article

Elastic Resource Management for Deep Learning Applications in a Container Cluster

Abstract

Cloud resource management is a fruitful research field that has produced many advances now used in production, such as Kubernetes and YARN, yet little effort has been invested in further optimizing system performance, especially for deep learning (DL) training jobs in a container cluster. This work introduces FlowCon, a system that monitors the individual evaluation functions of DL jobs at runtime and thus makes resource allocation decisions elastically. We present the detailed design and implementation of FlowCon and conduct extensive experiments over various DL models. The results demonstrate that FlowCon significantly improves DL job completion time and resource utilization efficiency compared to default systems. In the presence of various DL job workloads, FlowCon improves completion time by up to 68.8% while reducing the makespan by 18.0%.

Published: August 13, 2023

Citation

Mao Y., V. Sharma, W. Zheng, L. Cheng, Q. Guan, and A. Li. 2023. Elastic Resource Management for Deep Learning Applications in a Container Cluster. IEEE Transactions on Cloud Computing 11, no. 2:2204-2216. PNNL-SA-166431. doi:10.1109/TCC.2022.3194128