Kubernetes Scheduling with Checkpoint/Restore: Challenges and Open Problems

Authors

SPIŠAKOVÁ Viktória, STOYANOV Radostin, HEJTMÁNEK Lukáš, KLUSÁČEK Dalibor, REBER Adrian, BRUNO Rodrigo

Year of publication 2026
Type Article in Proceedings
Conference Job Scheduling Strategies for Parallel Processing
MU Faculty or unit

Faculty of Informatics

Citation
Doi https://doi.org/10.1007/978-3-032-10507-3_3
Keywords Checkpoint and Restore; Kubernetes; Containers; Resource Management; Scheduling
Description Efficient resource management and scheduling have been persistent challenges since the early days of computing and remain critical to this day. The widespread adoption of containers managed by orchestrators like Kubernetes has introduced new dimensions to this challenge. Despite the lightweight nature and minimal overhead of containers, they still suffer from utilization inefficiencies due to overprovisioning. Existing scheduling techniques are not enough to meet these demands, and there is a growing need for orchestration and scheduling policies that support advanced preemption, migration, and fault tolerance. Well-established container checkpoint/restore (C/R) mechanisms, implemented through tools like CRIU, offer a promising solution for improving resource scheduling efficiency. However, these mechanisms remain only partially integrated with platforms like Kubernetes. In this paper, we explore the use cases for general C/R, examine the current state, and delve into the open problems and challenges associated with native integration into Kubernetes. We propose potential solutions to these challenges, offering a pathway towards more efficient resource management to better meet the needs of today's computational landscape. While scheduling efficiency is considered critical in HPC clusters, serverless and deep learning platforms also benefit directly from these optimizations.