The LSV seminar takes place on Tuesdays at 11:00 AM. The usual location is the conference room at Pavillon des Jardins (venue). If you wish to be informed by e-mail about upcoming seminars, please contact Stéphane Le Roux and Matthias Fuegger.
The seminar is open to the public and does not require any form of registration.
In this talk, we will discuss resilient scheduling on high-performance computing (HPC) platforms. Resilience is (loosely) defined as the ability to survive failures. Failures are usually handled by adding redundancy, either continuously (replication) or at periodic intervals (migration from a faulty node to a spare node, rollback and recovery). However, the amount of replication and/or the frequency of checkpointing must be optimized carefully, and we will discuss how to choose the optimal checkpointing interval. We will also consider moldable jobs, for which the processor allocation can be chosen before execution. The objective here is to minimize the overall completion time of the jobs, or makespan, assuming that jobs are subject to arbitrary failure scenarios. Hence, a job must be re-executed each time it fails, until it completes successfully. This work generalizes the classical framework where jobs are known offline and do not fail. We introduce a list-based algorithm and prove new approximation ratios for three prominent speedup models (roofline, communication, Amdahl). We also introduce a batch-based algorithm, where each job is allowed a restricted number of failures per batch, and prove a new approximation ratio for the arbitrary speedup model. Finally, we will discuss simulation results that compare the algorithms.
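For readers unfamiliar with the three speedup models named in the abstract, the sketch below gives their usual textbook forms as execution-time functions of the processor count. The abstract itself does not define them, so the function names, the parameter names (`w`, `p_max`, `c`, `gamma`) and the exact formulas are assumptions for illustration. The last function is the classical first-order Young/Daly approximation of the checkpointing period, which is not necessarily the rule presented in the talk.

```python
import math


def roofline_time(w, p, p_max):
    """Roofline model (assumed form): linear speedup up to p_max processors."""
    return w / min(p, p_max)


def communication_time(w, p, c):
    """Communication model (assumed form): perfect parallelism plus an
    overhead of c per additional processor."""
    return w / p + (p - 1) * c


def amdahl_time(w, p, gamma):
    """Amdahl model: a fraction gamma of the work w is inherently sequential."""
    return w * (gamma + (1 - gamma) / p)


def young_daly_period(mtbf, checkpoint_cost):
    """Classical first-order approximation of the optimal checkpointing
    period: sqrt(2 * MTBF * C). Shown as background only."""
    return math.sqrt(2 * mtbf * checkpoint_cost)


def best_allocation(time_fn, max_procs):
    """For a moldable job, pick the processor count in 1..max_procs that
    minimizes the failure-free execution time given by time_fn(p)."""
    return min(range(1, max_procs + 1), key=time_fn)


if __name__ == "__main__":
    # Hypothetical job: 100 units of work, overhead of 1 per extra processor.
    p = best_allocation(lambda q: communication_time(100, q, 1), 64)
    print("best allocation:", p)
    print("checkpoint period:", young_daly_period(20000, 100))
```

Under the communication model, adding processors eventually costs more than it saves, which is why the allocation of a moldable job has to be chosen per job rather than set to the platform size.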