Opportunistic Resource Reclamation in Kubernetes: From Aggressive Resizing to Flash Jobs

Authors

SPIŠAKOVÁ Viktória, STOYANOV Radostin, KLUSÁČEK Dalibor, HEJTMÁNEK Lukáš

Year of publication 2026
Type Article in Proceedings
Conference Job Scheduling Strategies for Parallel Processing
MU Faculty or unit

Faculty of Informatics

Keywords Kubernetes; Resource Management; Resource Utilization; In-place Resizing
Description Modern cloud data centers suffer from chronic resource under-utilization. The gap between static resource allocations and dynamic workload demand creates systemic inefficiency that current orchestration platforms fail to address adequately. In this work, we explore resource reclamation strategies in production Kubernetes clusters using two emerging infrastructure-level primitives: in-place resource resizing and transparent checkpoint/restore (C/R). For CPU resources, we analyze a production workload trace, which we release publicly, and reveal significant allocation-utilization gaps. Through trace-driven simulation, we demonstrate that aggressive in-place resizing substantially increases resource utilization but also increases workload evictions. We find a balanced strategy for in-place resizing and identify C/R as the missing primitive that makes aggressive resizing safe by enabling graceful termination and resumable migrations instead of progress loss. For GPU resources, where dynamic resizing is infeasible, we propose a C/R-enabled sharing strategy that allocates reserved-but-idle GPU memory to secondary workloads (flash jobs) with safety guarantees for reclamation. Our work demonstrates how the same infrastructure primitives address resource reclamation across different resource types, each with distinct technical constraints, validated through real production cluster deployments.
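The trade-off named in the description, higher utilization at the cost of more evictions under aggressive resizing, can be illustrated with a toy trace replay. The sketch below is not the authors' simulator: the function names, the sliding-window peak policy, the `headroom` parameter, and the sample trace are all assumptions made for illustration, and a step where usage exceeds the resized allocation is used only as a rough proxy for an eviction risk.

```python
from typing import List, Tuple


def utilization_gap(request: float, usage: List[float]) -> float:
    """Fraction of the requested CPU that goes unused on average."""
    mean_usage = sum(usage) / len(usage)
    return 1.0 - mean_usage / request


def simulate_resizing(
    request: float, usage: List[float], window: int, headroom: float
) -> Tuple[float, int]:
    """Replay a usage trace with a hypothetical resizing policy.

    Before each step (once `window` samples exist) the allocation is set to
    the peak of the previous `window` samples times (1 + headroom), capped at
    the original request. Returns the mean allocation and the number of steps
    where usage exceeded the allocation.
    """
    allocation = request
    total_allocation = 0.0
    violations = 0
    for t, used in enumerate(usage):
        if t >= window:
            recent_peak = max(usage[t - window:t])
            allocation = min(request, recent_peak * (1.0 + headroom))
        total_allocation += allocation
        if used > allocation:
            violations += 1  # proxy for a throttled or evicted workload
    return total_allocation / len(usage), violations


if __name__ == "__main__":
    # Made-up trace: 4 CPUs requested, mostly 1 CPU used, one short spike.
    trace = [1.0, 1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]
    print("gap:", utilization_gap(4.0, trace))
    print("aggressive:", simulate_resizing(4.0, trace, window=2, headroom=0.1))
    print("conservative:", simulate_resizing(4.0, trace, window=2, headroom=2.5))
```

On this made-up trace, the small headroom lowers the mean allocation considerably but is caught out by the usage spike, while the large headroom avoids the violation and reclaims little. A balanced strategy of the kind the description mentions would sit between these two settings.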