GdT SCALE (Scheduling for Computing Architecture and Low Energy)
Jean-Marc Nicod | FEMTO-ST | Besançon | jean-marc.nicod@ens2m.fr
The working group's theme fits squarely within the definition of the OPA axis, with an applicative focus on large computing systems, networks, and data centers or computing centers of various kinds. These architectures come with specific constraints such as memory space, energy consumption, availability, etc. New paradigms are emerging with edge computing and large-scale (exascale) parallel architectures, which leads to a great diversity of problems. Whether the target architectures are small or very large scale, the question of energy consumption arises, again with varied answers in terms of control, resilience, and robustness. The problems addressed involve many uncertainties, which can manifest themselves in different ways: failures, intrinsically intermittent renewable power supply, computing demand, etc.
Workshops organized by the SCALE working group of the OPA axis
SCALE Working Group Day
July 6, 2023
LIG Bâtiment IMAG - 700 avenue Centrale
Domaine universitaire de Saint-Martin-d'Hères
Program
Morning
9:30 Welcome coffee
10:00 Mathilde Jay
An experimental comparison of software-based power meters
The global energy demand for digital activities is constantly growing. Computing nodes and cloud services are at the heart of these activities. Understanding their energy consumption is an important step towards reducing it. On the one hand, physical power meters are very accurate in measuring energy, but they are expensive, difficult to deploy at large scale, and unable to provide measurements at the service level. On the other hand, power models and vendor-specific internal interfaces are already available or can be implemented on existing systems. Many tools, called software-based power meters, have been developed around the concepts of power models and internal interfaces, in order to report the power consumption at levels ranging from the whole computing node to individual applications and services. However, we have found that it can be difficult to choose the right tool for a specific need.
In this work, we qualitatively and experimentally compare several software-based power meters able to deal with CPU or GPU-based infrastructures. For this purpose, we evaluate them against high-precision physical power meters while executing various intensive workloads. We extend this empirical study to highlight the strengths and limitations of each software-based power meter.
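As a minimal illustration of the internal interfaces such tools build on, the sketch below samples node-level energy through Intel's RAPL counters exposed by the Linux powercap sysfs interface; the sampling logic is an assumption for illustration, and a real software-based power meter adds power models and per-service attribution on top of this.

```python
# Minimal sketch: sample CPU-package energy via Linux powercap (Intel RAPL).
# Assumes a Linux host exposing /sys/class/powercap/intel-rapl:0/energy_uj
# (reading it may require elevated permissions on recent kernels).
import time

RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package-0 counter
MAX_RANGE_FILE = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"

def read_uj(path):
    with open(path) as f:
        return int(f.read())

def average_power_watts(interval_s=1.0):
    """Average package power over one sampling interval, handling wrap-around."""
    max_range = read_uj(MAX_RANGE_FILE)
    before = read_uj(RAPL_FILE)
    time.sleep(interval_s)
    after = read_uj(RAPL_FILE)
    delta_uj = (after - before) % max_range  # counter wraps at max_range
    return delta_uj / 1e6 / interval_s

if __name__ == "__main__":
    print(f"~{average_power_watts():.1f} W (CPU package)")
```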
10:30 Louis-Claude Canon
Low-Cost Algorithms for the Restricted Assignment Problem on Intervals of Machines
We consider a special case of the Restricted Assignment problem where the processing sets of jobs consist of contiguous intervals of machines. This problem is known as the Restricted Assignment problem on Intervals (RAI), and arises, for instance, when partitioning multi-get requests in distributed and replicated NoSQL databases. These databases aim at providing high availability and very low latency, so the algorithms used to schedule requests must themselves run fast. In this paper, we propose optimal and approximation algorithms for some variants of the RAI problem that apply to these systems. Building on previous work by Lin et al. (European Journal of Operational Research, 2004), we present an efficient (2 − 1/m)-approximation algorithm, where m is the number of machines. We also provide a generalization of the RAI problem to so-called circular intervals, i.e., intervals that appear when machines are arranged in a circular fashion, and we call this new problem the Restricted Assignment problem on Circular Intervals (RACI). We give a procedure that computes an optimal solution when there are K possible processing times, assuming that one already knows an optimal algorithm for the corresponding non-circular problem. This enables us, for instance, to obtain an optimal algorithm with complexity O(n log n) for the RACI problem with a fixed number of machines and unitary jobs.
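To make the problem structure concrete, here is a minimal greedy baseline for the RAI setting, where each job may only run on a contiguous interval of machines. This is not the (2 − 1/m)-approximation of the talk, only an illustrative sketch; the job data is made up.

```python
# Illustrative greedy for the Restricted Assignment problem on Intervals (RAI):
# each job j may only run on the contiguous machine interval [a_j, b_j].
# NOT the algorithm of the talk; just a baseline showing the constraints.
def greedy_rai(jobs, m):
    """jobs: list of (a, b, p) with machine interval [a, b] and load p."""
    load = [0.0] * m
    # Handling jobs by the right end of their interval keeps wide (flexible)
    # jobs from blocking tightly constrained ones too early.
    for a, b, p in sorted(jobs, key=lambda j: j[1]):
        target = min(range(a, b + 1), key=lambda i: load[i])
        load[target] += p
    return max(load)  # makespan

jobs = [(0, 1, 3.0), (0, 3, 2.0), (2, 3, 4.0), (1, 2, 1.0)]
print(greedy_rai(jobs, m=4))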
11:00 Lucas Pérotin
Scheduling Algorithms for Variable Capacity Resources
The drive to decarbonize the power grid to slow the pace of climate change has caused dramatic variation in the cost, availability, and carbon-intensity of power. This has begun to shape the planning and operation of datacenters. This presentation focuses on the design of scheduling algorithms for independent jobs that are submitted to a platform whose resource capacity varies over time.
11:30 Round table
Open discussion and work in small groups
Afternoon
14:00 Vincent Fagnon (PhD defense)
Prediction and resource augmentation to the rescue of algorithm design
(Amphi F018 - BatF), thesis supervised by Denis Trystram and co-supervised by Giorgio Lucarelli.
The defense will be streamed online at the following link:
https://meet.univ-grenoble-alpes.fr/b/vin-qx6-fby-glc
Abstract: The current trend is toward the pooling of computing resources. This centralization offers several advantages, notably a better handling of workload variations, the avoidance of extreme usage peaks, and a simpler optimization of computations on data already in place. However, the advent of the Internet of Things (IoT) has increased the need to process and analyze data at locations far from data centers. For faster and more efficient processing, it is important to limit data transfers and to process data close to where it is produced. It is therefore essential to find a balance between centralized platforms that pool computing resources and decentralized models that process data locally. This thesis focuses on several mechanisms for designing scheduling algorithms for multiple agents with distinct objectives on a platform of computing resources. First, we investigate the use of an external mechanism (an oracle) to reveal the uncertainties on the execution times of sequential tasks to be run on a computing resource. We propose algorithms that guarantee optimal average performance for the classical sum-flowtime objective and study, through a simulation campaign, their performance, their robustness, and the average-case relevance of the oracle model. We also question the effectiveness of such a model when good worst-case performance must be guaranteed. Next, we study a two-agent problem in which each agent is subject to different constraints and objectives. We propose an algorithm using resource augmentation (speed and rejection) and prove, using the dual-fitting technique, a competitive ratio that depends, among other things, on these augmentation parameters. Finally, we address a more classical problem of scheduling an application of sequential tasks so as to minimize the makespan on a hybrid multicore (CPU/GPU) platform. We propose an algorithm based on the study of the integrality gap of a linear-programming relaxation, for which we prove a performance guarantee in the form of a constant approximation ratio.
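As a point of reference for the oracle model above: on a single resource, the classical Shortest Processing Time rule applied to oracle-revealed durations minimizes the sum flowtime when the revealed durations are exact. The sketch below illustrates this baseline; the job data and the oracle interface are assumptions for illustration, not the thesis's algorithms.

```python
# Minimal sketch: SPT on oracle-revealed durations, single machine.
# With an exact oracle, SPT minimizes the sum of flow times; the thesis
# studies what happens when the oracle is noisy. Data here is made up.
def spt_sum_flowtime(predicted, actual):
    """Schedule jobs by increasing predicted duration; return sum flowtime."""
    order = sorted(range(len(predicted)), key=lambda j: predicted[j])
    t, total = 0.0, 0.0
    for j in order:
        t += actual[j]      # job j completes at time t
        total += t          # all jobs released at 0: flowtime = completion
    return total

actual = [3.0, 1.0, 2.0]
print(spt_sum_flowtime(actual, actual))           # exact oracle: optimal
print(spt_sum_flowtime([1.0, 3.0, 2.0], actual))  # noisy oracle: worse
```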
December 13, 2022
ENS-Lyon, site Monod (46 allée d’Italie)
Room R116 (1st floor, GN Sud)
Program
Morning
10:00 Welcome coffee
10:30 Manal Benaissa
Standalone Data-Center Sizing Combating the Over-provisioning of Both the IT and Electrical Parts
During the last two decades, the sustainability of IT infrastructures such as datacenters has become a major concern for computing giants like Google and Amazon. Datacenters powered by renewable energy have been proposed, but because of the intermittency of these power sources, such platforms remain connected to the classical power grid. IT and electrical constraints have often been addressed separately, leading to an inefficient global system. In this paper, an energy self-sufficient green datacenter is modeled and designed, with an electrically autonomous infrastructure including wind turbines, solar panels, and its own short- and long-term storage devices, mainly based on batteries and a hydrogen system.
Existing sizing approaches limit themselves to a perfect QoS, leading to an overestimation of the needed equipment. In this paper we show how to reduce and combat this over-provisioning by questioning its impact on the QoS and the needed equipment: decreasing the number of computing or storage elements (servers and batteries). For example, decreasing the targeted QoS from 100% to 95% more than halves the needed number of servers, while a 30% decrease in battery capacity has a negligible impact on the electrical infrastructure.
11:00 Eniko Kevi
Machine Learning as a Tool to Improve Algorithms
Since classical algorithms for fundamental online optimization problems have tight (or close to tight) worst-case guarantees, we study whether we can surpass these performance bounds by combining classical algorithms with machine learning predictions. We present such a learning-augmented online algorithm for the bounded allocation problem, which improves on previous performance bounds with reasonable prediction error rates.
11:30 Bertrand Simon
Learning-Augmented Dynamic Power Management with Multiple States via New Ski Rental Bounds
We study the online problem of minimizing power consumption in systems with multiple power-saving states. During idle periods of unknown lengths, an algorithm has to choose between power-saving states of different energy consumption and wake-up costs. We develop a learning-augmented online algorithm that makes decisions based on (potentially inaccurate) predicted lengths of the idle periods. The algorithm's performance is near-optimal when predictions are accurate and degrades gracefully with increasing prediction error, with a worst-case guarantee almost identical to the optimal classical online algorithm for the problem. A key ingredient in our approach is a new algorithm for the online ski rental problem in the learning augmented setting with tight dependence on the prediction error. We support our theoretical findings with experiments.
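The ski rental subproblem mentioned above admits a simple learning-augmented strategy: trust the predicted length to pick a purchase time, with a robustness parameter guarding against bad predictions. The sketch below follows the well-known rent-or-buy scheme in the style of Purohit et al. (NeurIPS 2018); the talk's multi-state algorithm is more involved, and the parameter names here are illustrative.

```python
# Minimal sketch of learning-augmented ski rental (rent 1/day, buy for b).
# lam in (0, 1] trades consistency (good when the prediction is right)
# against robustness (bounded harm when it is wrong); cf. Purohit et al. 2018.
# Multi-state dynamic power management generalizes this rent-or-buy choice.
import math

def buy_day(b, predicted_days, lam):
    """Day on which we buy, given a predicted total number of ski days."""
    if predicted_days >= b:        # prediction says "long": buy early
        return math.ceil(lam * b)
    return math.ceil(b / lam)      # prediction says "short": buy late

def cost(b, actual_days, predicted_days, lam):
    d = buy_day(b, predicted_days, lam)
    return actual_days if actual_days < d else (d - 1) + b

b = 10
print(cost(b, actual_days=100, predicted_days=120, lam=0.5))  # accurate: near b
print(cost(b, actual_days=100, predicted_days=3, lam=0.5))    # wrong but bounded
```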
Afternoon
13:30 Denis Trystram
Sustainable resource management in HPC platforms
The race for performance continues in HPC without the actual needs being questioned. I will first present the dynamics of this evolution and the tools available to face it.
I will then present ongoing reflections on the design of resource management tools for large platforms that take planetary boundaries into account.
14:30 Damien Landré
Assessing power needs to run a workload with quality of service constraint on green datacenters
For a decade, datacenters have been an essential part of the Internet, but their continuous development requires finding sustainable solutions to limit their impact on climate change. The ANR DATAZERO2 project aims to design datacenters running solely on local renewable energy. The energy demand of a datacenter's machines can be optimized to improve the use of energy. In this paper we tackle the problem of computing the minimum power demand needed to process a workload under a quality-of-service constraint. We propose a binary-search algorithm to solve this problem; it requires a machine configuration that maximizes computing power. In a heterogeneous environment, the difficulty of the problem lies in choosing which machines to switch on and their DVFS (dynamic voltage and frequency scaling) states. This can be computed by a MILP (mixed-integer linear program), but at a significant computational cost. Four heuristics are proposed to maximize computing power; they give satisfactory results in reasonable time, with average deviations from the optimal solution of 29.59%, 0.16%, 2.35% and 0.22%. These promising results encourage us to use the heuristics to address the problem of computing the minimum power demand.
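To make the overall structure concrete, the sketch below runs the outer binary search over the power budget, delegating to a feasibility check that decides whether the workload meets its QoS under that budget. Binary search is valid under the (assumed) monotonicity of feasibility in the budget; the check shown is a placeholder, since in the paper it is the hard part (machine and DVFS selection via MILP or heuristics).

```python
# Minimal sketch of the outer binary search for the minimum power demand.
# `fits_qos` stands in for the hard subproblem (choosing switched-on machines
# and DVFS states to maximize computing power under the budget); assumed here
# to be monotone: if a budget is feasible, any larger budget is too.
def min_power_demand(workload, lo, hi, fits_qos, eps=1.0):
    """Smallest power budget in [lo, hi] (watts) meeting the QoS, within eps."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if fits_qos(workload, mid):   # feasible: try a smaller budget
            hi = mid
        else:                         # infeasible: need more power
            lo = mid
    return hi

# Toy stand-in: pretend QoS holds iff the budget covers a fixed demand.
print(min_power_demand("trace", 0, 10_000, lambda w, p: p >= 3_200))
```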
15:00 Lucas Perotin
Online Scheduling of Moldable Task Graphs under Common Speedup Models
The problem of scheduling moldable tasks on multiprocessor systems with the objective of minimizing the overall completion time (or makespan) has been widely studied, in particular when tasks have dependencies (i.e., task graphs), or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online algorithm and derive constant competitive ratios for this problem under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). We also prove, for each model, a lower bound on the competitiveness of our algorithm, which is very close to the constant competitive ratio. Finally, we provide the first lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.
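For reference, the speedup models named above have simple closed forms; the sketch below gives one common parameterization of each (the symbols w, c, s, and pmax are illustrative assumptions, not necessarily the talk's exact notation).

```python
# Common speedup models for a moldable task on p processors (one usual
# parameterization each; notation is illustrative, not the talk's).
def roofline(p, pmax):
    """Linear speedup up to a saturation point pmax."""
    return min(p, pmax)

def communication(p, w, c):
    """Work w shared by p processors plus per-processor overhead c:
    time = w/p + c*p, reported as speedup over the sequential time w."""
    return w / (w / p + c * p)

def amdahl(p, s):
    """Sequential fraction s in [0, 1]: speedup = 1 / (s + (1 - s)/p)."""
    return 1.0 / (s + (1.0 - s) / p)

for p in (1, 4, 16, 64):
    print(p, roofline(p, 8), round(communication(p, 100.0, 0.5), 1),
          round(amdahl(p, 0.05), 1))
```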
15:30 Break / Scientific discussion, then close of the day
April 13, 2022
FEMTO-ST, site TEMIS (15B avenue des Montboucons, Besançon)
Amphithéâtre Jean-Jacques Gagnepain (directions)
Program
Morning
10:00 Welcome coffee
10:30 Redouane Elghazi
Update on the Asymptotic Optimality of LPT
When independent tasks are to be scheduled onto identical processors, the typical goal is to minimize the makespan. A simple and efficient heuristic consists in scheduling the tasks in non-increasing order of processing time (the LPT heuristic), starting each one as soon as possible. While the performance of LPT has already been studied at length, in particular its asymptotic performance, we revisit these results and propose a novel analysis for the case of tasks generated through uniform integer compositions. We also perform extensive simulations to empirically assess the asymptotic performance of LPT, and compare it to four other classical heuristics. The results show that the absolute error rapidly tends to zero for several distributions of task costs, including distributions studied by theoretical models and realistic distributions coming from benchmarks.
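For readers unfamiliar with it, the heuristic itself fits in a few lines: sort tasks by non-increasing processing time and greedily place each one on the currently least-loaded machine. A minimal sketch with made-up task costs:

```python
# LPT (Longest Processing Time first) on m identical machines:
# sort tasks by decreasing cost, always assign to the least-loaded machine.
import heapq

def lpt_makespan(costs, m):
    heap = [(0.0, i) for i in range(m)]   # (current load, machine id)
    heapq.heapify(heap)
    for c in sorted(costs, reverse=True):
        load, i = heapq.heappop(heap)     # least-loaded machine
        heapq.heappush(heap, (load + c, i))
    return max(load for load, _ in heap)  # makespan

print(lpt_makespan([7, 7, 6, 6, 5, 4, 4, 2], m=3))
```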
11:00 Fanny Dufossé
Dimensioning of multi-clouds with follow-the-renewable approaches for environmental impact minimization
Cloud computing has become an essential component of our digital society. Efforts to reduce its environmental impact are being made by academia and industry alike, with commitments from major cloud providers to be fully operated by renewable energy in the future. One strategy to reduce non-renewable energy usage is "follow-the-renewables", in which the workload is migrated so as to be executed in the data centers with the greatest availability of renewable energy.
The objective is to develop a realistic model of clouds supplied by green energy, and a simulation platform to compare scheduling algorithms. We consider the problem of cloud dimensioning, to minimize the ecological impact of data centers in terms of both brown-energy consumption and the manufacturing of IT products.
11:30 Bertrand Simon
Learning-Augmented Online Algorithms
Classical online algorithms with guarantees are usually designed to minimize the competitive ratio, i.e., worst-case performance, and therefore often perform poorly on easy inputs. An orthogonal approach has been developed over the last decade thanks to progress in machine learning, which makes it possible to predict key parameters of the instance. Its downside is that no guarantee can be given on the prediction quality, for instance if the training set does not represent the current instance. This observation is at the origin of the recently emerging field of learning-augmented algorithms. In this framework, an online algorithm has access to a predictor, which can give relevant information about the future instance, though without any guarantee on its quality. The objective is then to design an algorithm that performs well when the predictor turns out to be accurate, while remaining robust to imprecise predictions. This tutorial will introduce this new domain, present some recent works, and discuss current research questions.
Afternoon
13:30 Imed Kacem, Hans Kellerer
Minimizing the Maximum Lateness for Scheduling with Release Times and Job Rejection
Abstract coming soon
14:00 Denis Trystram
An ecological catastrophe, an opportunity for research in Computer Science
I will start by giving an overview of the development of connected objects and the associated economic stakes. I will focus in particular on embedded learning and the orchestration of services in edge and extreme-edge digital systems. The associated research questions are stimulating, but not sustainable in an ever more digital world; I will conclude by putting up for debate new questions that rely on simple, well-mastered, environmentally respectful tools.
14:30 Anthony Dugois
A Scheduling Framework for Distributed Key-Value Stores and Application to Tail Latency Minimization
Distributed key-value stores employ replication for high availability. Yet, they do not always efficiently take advantage of the availability of multiple replicas for each value, and read operations often exhibit high tail latencies. Various replica selection strategies have been proposed to address this problem, together with local request scheduling policies. It is difficult, however, to determine the absolute performance gain each of these strategies can achieve. We present a formal framework allowing the systematic study of request scheduling strategies in key-value stores. We contribute a definition of the optimization problem related to reducing tail latency in a replicated key-value store as a minimization problem with respect to the maximum weighted flow criterion. By using scheduling theory, we show the difficulty of this problem, and therefore the need to develop performance guarantees. We also study the behavior of heuristic methods using simulations, which highlight which properties are useful for limiting tail latency: for instance, the EFT strategy—which uses the earliest available time of servers—exhibits a tail latency that is less than half that of state-of-the-art strategies, often matching the lower bound. Our study also emphasizes the importance of metrics such as the stretch to properly evaluate replica selection and local execution policies.
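As an illustration of the EFT strategy mentioned above, the sketch below routes each read to the replica server whose queue frees up earliest; the replication layout and request fields are assumptions for illustration, not the paper's framework.

```python
# Minimal sketch of EFT replica selection in a replicated key-value store:
# among the servers holding a copy of the key, send the read to the one
# that becomes available earliest. Layout and requests are made up.
def eft_schedule(requests, replicas, m):
    """requests: list of (key, duration); replicas: key -> list of servers."""
    free_at = [0.0] * m                 # earliest available time per server
    completions = []
    for key, duration in requests:
        s = min(replicas[key], key=lambda i: free_at[i])
        start = free_at[s]              # assuming all requests are ready now
        free_at[s] = start + duration
        completions.append((key, s, start + duration))
    return completions

replicas = {"a": [0, 1], "b": [1, 2], "c": [2, 0]}  # 3 servers, 2 copies each
reqs = [("a", 2.0), ("b", 1.0), ("a", 2.0), ("c", 3.0)]
print(eft_schedule(reqs, replicas, m=3))
```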
15:00 Frederic Vivien
Scheduling Strategies for Overloaded Real-Time Systems
This work introduces and assesses novel strategies to schedule firm real-time jobs on an overloaded server. The jobs are released periodically and have the same relative deadline. Job execution times obey an arbitrary probability distribution and can take unbounded values (no WCET). We introduce three control parameters to decide when to start or interrupt a job. We couple this dynamic scheduling with several admission policies and investigate several optimization criteria, the most prominent being the Deadline Miss Ratio (DMR). Then we derive a Markov model and use its stationary distribution to determine the best value of each control parameter. Finally we conduct an extensive simulation campaign with 14 different probability distributions; the results nicely demonstrate how the new control parameters help improve system performance compared with traditional approaches. In particular, we show that (i) the best admission policy is to admit all jobs; (ii) the key control parameter is to upper bound the start time of each job; (iii) the best scheduling strategy decreases the DMR by up to 0.35 over traditional competitors.
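To make the role of the start-time bound concrete, the toy simulation below models a single overloaded server with periodic releases and a cutoff on how late a job may start; the period, deadline, and execution-time law are made up, and the actual study couples several such control parameters with admission policies and a Markov model.

```python
# Toy simulation of one control parameter from the talk: never start a job
# later than `max_start` after its release (it would likely miss its deadline
# anyway). Period, deadline, and the execution-time law are made up; dropped
# jobs count as missed, as do jobs finishing past their deadline.
import random

def simulate_dmr(n, period, deadline, max_start, seed=0):
    rng = random.Random(seed)
    t, missed = 0.0, 0
    for j in range(n):
        release = j * period
        t = max(t, release)
        if t - release > max_start:      # too late: drop without executing
            missed += 1
            continue
        t += rng.expovariate(1.0 / period) * 1.5  # overloaded: mean > period
        if t > release + deadline:
            missed += 1
    return missed / n

for ms in (0.0, 2.0, 5.0, float("inf")):
    print(f"max_start={ms}: DMR ~ {simulate_dmr(10_000, 1.0, 4.0, ms):.3f}")
```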
15:30 Ilyes Kadri
Algorithms on variable-size rectangular interfaces
The purpose of this study is to develop approximation algorithms capable of generating two-dimensional interfaces that contain a given set of services adapted to the user's context. The proposed methods give good solutions that minimize the used surface, by taking into consideration the placement and positioning of the tiles. The interface obtained by packing all the tiles into a variable-size bin should be well adapted and very simple to use. In order to increase the usability of these solutions (configurations), we develop a mathematical model incorporating various Human-Computer Interaction constraints for an exact resolution. The placement is performed on the basis of the importance and the category of the service proposed in each tile. Moreover, we suggest two approximate approaches: a heuristic based on a shelf strategy and a memetic algorithm. The exact resolution of the proposed multi-objective mathematical model gives an optimal solution only for small and medium instances. For larger instances, it is difficult to reach the optimal value in a short running time, which shows the practical interest of the proposed approaches.
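As background for the shelf strategy mentioned above, the sketch below packs rectangles into horizontal shelves with a first-fit rule. This is a generic shelf heuristic, not the paper's HCI-aware variant, and the tile sizes are made up.

```python
# Generic shelf-packing sketch: sort tiles by decreasing height, open
# horizontal shelves of fixed bin width, place each tile on the first shelf
# with enough remaining width. Not the paper's HCI-aware heuristic.
def shelf_pack(tiles, bin_width):
    """tiles: list of (w, h); returns total used height of the packing."""
    shelves = []  # each shelf: [remaining width, shelf height]
    for w, h in sorted(tiles, key=lambda t: t[1], reverse=True):
        for shelf in shelves:
            if shelf[0] >= w:                   # first fit on existing shelves
                shelf[0] -= w
                break
        else:
            shelves.append([bin_width - w, h])  # open a new shelf
    return sum(h for _, h in shelves)

tiles = [(3, 2), (2, 2), (4, 1), (2, 1), (1, 1)]
print(shelf_pack(tiles, bin_width=6))
```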
16:00 Break / Scientific discussion, then close of the day
December 3, 2021
ENS-Lyon, site Monod (46 allée d’Italie)
Salle des conseils (2nd floor, center)
Program
Morning
10:00 Welcome coffee
10:30 Redouane Elghazi
Shelf schedules for independent moldable tasks to minimize the energy consumption
Scheduling independent tasks on a parallel platform is a widely studied problem, in particular when the goal is to minimize the total execution time, or makespan (the P||Cmax problem in Graham's notation). Also, many applications do not consist of sequential tasks, but rather of parallel moldable tasks that can decide their degree of parallelism at execution (i.e., on how many processors they are executed). Furthermore, since the energy consumption of data centers is a growing concern, both from an environmental and an economic point of view, minimizing the energy consumption of a schedule is a main challenge to be addressed. One can then decide, for each task, on how many processors it is executed and at which speed the processors are operated, with the goal of minimizing the total energy consumption. We further focus on co-schedules, where tasks are partitioned into shelves, and we prove that the problem of minimizing the energy consumption remains NP-complete when static energy is consumed during the whole duration of the application. We are however able to provide an optimal algorithm for the schedule within one shelf, i.e., for a set of tasks that start at the same time. Several approximation results are derived, and simulations are performed to show the performance of the proposed algorithms.
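For intuition on the trade-off involved, a standard DVFS energy model (an assumption here, commonly used in such analyses) charges dynamic power s^α per busy processor at speed s, plus static power while the machine is on. The sketch below computes the energy of one task under this model.

```python
# Standard (assumed) DVFS energy model: a task of work w on p processors at
# speed s runs for w/(p*s) time; each busy processor draws dynamic power
# s**alpha, and static power p_stat is paid for as long as it stays on.
def task_energy(w, p, s, alpha=3.0, p_stat=0.2):
    time = w / (p * s)
    dynamic = p * (s ** alpha) * time   # = w * s**(alpha - 1)
    static = p * p_stat * time
    return time, dynamic + static

# Slowing down saves dynamic energy but pays static energy for longer,
# which is why the shelf structure (synchronized start times) matters.
for s in (0.5, 1.0, 2.0):
    t, e = task_energy(w=8.0, p=2, s=s)
    print(f"s={s}: time={t:.1f}, energy={e:.2f}")
```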
11:00 Lucas Perotin
Resilient Scheduling of Moldable Parallel Jobs to Cope with Silent Errors
This presentation features resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time, or makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Our work introduces two resilient scheduling algorithms that generalize the classical scheduling framework for failure-free jobs and cope with silent errors.
11:30 Vincent Fagnon
Scheduling with an oracle
Afternoon
14:00 Chifaa Dahik
Robust discrete optimization under ellipsoidal uncertainty
We study the robust version of binary linear programming problems with a correlated ellipsoidal uncertainty set. Since this problem is NP-hard, we propose a heuristic approach, called DFW, based on the Frank-Wolfe algorithm; in this approach, we examine the exploration power of the binary inner iterations of the method. For small instances, the method is able to recover the optimal solution provided by CPLEX after a few hundred iterations. Moreover, unlike the exact method, our approach also applies to large instances. The numerical experiments focus on the robust shortest path problem. Another objective is to propose a positive semidefinite (SDP) relaxation of the robust shortest path problem that yields a lower bound for validating approaches such as the DFW algorithm. The relaxed problem results from a bidualization of the original problem, and is then solved using a sparse version of a decomposition method in a product space. This validation method scales to large instances. Finally, another adaptation of the Frank-Wolfe algorithm is developed for the k-median problem, together with a rounding algorithm that satisfies the constraints.
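For context, a generic Frank-Wolfe iteration only needs a gradient and a linear minimization oracle over the feasible set. The sketch below runs it on the probability simplex for a toy quadratic; this is the plain textbook algorithm, not the DFW variant of the talk.

```python
# Plain Frank-Wolfe on the probability simplex for a toy quadratic
# f(x) = ||x - target||^2. The linear minimization oracle over the simplex
# reduces to picking one coordinate. NOT the DFW heuristic of the talk.
def frank_wolfe(grad, n, iters=200):
    x = [1.0 / n] * n                         # start at the simplex center
    for k in range(iters):
        g = grad(x)
        i = min(range(n), key=lambda j: g[j])  # LMO: vertex minimizing <g, s>
        gamma = 2.0 / (k + 2)                  # classical step size
        x = [(1 - gamma) * xj for xj in x]
        x[i] += gamma
    return x

target = [0.7, 0.2, 0.1]                       # lies in the simplex
grad = lambda x: [2 * (xj - tj) for xj, tj in zip(x, target)]
print([round(v, 3) for v in frank_wolfe(grad, 3)])
```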
14:30 Anthony Dugois
Bounding the Flow Time in Online Scheduling with Structured Processing Sets
Replication in distributed key-value stores makes scheduling more challenging, as it introduces processing-set restrictions, which limit the set of machines that can process a given task. We focus on the online minimization of the maximum response time in such systems, that is, we aim at bounding the latency of each task. When processing sets have no structure, Anand et al. (Algorithmica, 2017) derive a strong lower bound on the competitiveness of the problem: no online scheduling algorithm can have a competitive ratio smaller than Ω(m), where m is the number of machines. In practice, data replication schemes are regular, and structured processing sets may make the problem easier to solve. We derive new lower bounds for various common structures, including inclusive, nested, and interval structures. In particular, we consider fixed-size intervals of machines, which mimic the standard replication strategy of key-value stores. We prove that EFT scheduling is (3 − 2/k)-competitive when optimizing max-flow on disjoint intervals of size k. However, we show that the competitive ratio of EFT is at least m − k + 1 when these intervals overlap, even for unit tasks. We compare these two replication strategies in simulations and assess their efficiency when popularity biases are introduced, i.e., when some machines are accessed more frequently than others because they hold popular data.
15:00 Andrei Da Silva
Scheduling of Functions on Serverless Platforms
15:30 Manal Benaissa
Standalone Data-Center Sizing Combating the Over-provisioning of Both the IT and Electrical Parts
During the last two decades, the sustainability of IT infrastructures such as data centers has become a major concern for computing giants like Google and Amazon. Data centers powered by renewable energy have been proposed, but because of the intermittency of these power sources, such platforms remain connected to the classical power grid. IT and electrical constraints have often been addressed separately, leading to an inefficient global system. In this paper, an energy self-sufficient green data center is modeled and designed, with an electrically autonomous infrastructure including wind turbines, solar panels, and its own short- and long-term storage devices, mainly based on batteries and a hydrogen system. The sizing of such an infrastructure is one of the main issues of the DATAZERO2 project. Based on previous sizing suggestions, the paper shows how to reduce and combat the over-provisioning by questioning its impact on the QoS when decreasing the number of computing or storage elements (servers and batteries).
16:00 Break / Scientific discussion, then close of the day