SmartOClock: Workload- and Risk-Aware Overclocking in the Cloud

Operating server components beyond their voltage and power design limit (i.e., overclocking) enables improving performance and lowering cost for cloud workloads. However, overclocking can significantly degrade component lifetime, increase power draw, and cause power capping events, eventually dimini...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) s. 437 - 451
Hlavní autoři: Stojkovic, Jovan, Misra, Pulkit A., Goiri, Inigo, Whitlock, Sam, Choukse, Esha, Das, Mayukh, Bansal, Chetan, Lee, Jason, Sun, Zoey, Qiu, Haoran, Zimmermann, Reed, Samal, Savyasachi, Warrier, Brijesh, Raniwala, Ashish, Bianchini, Ricardo
Médium: Konferenční příspěvek
Jazyk:angličtina
Vydáno: IEEE 29.06.2024
Témata:
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:Operating server components beyond their voltage and power design limit (i.e., overclocking) enables improving performance and lowering cost for cloud workloads. However, overclocking can significantly degrade component lifetime, increase power draw, and cause power capping events, eventually diminishing the performance benefits. In this paper, we characterize the impact of overclocking on cloud workloads by studying their profiles from production deployments. Based on the characterization insights, we propose SmartOClock, the first distributed overclocking management platform specifically designed for cloud environments. SmartOClock is a workload-aware scheme that relies on power predictions to heterogeneously distribute the power budgets across its servers based on their needs and then enforce budget compliance locally, per-server, in a decentralized manner. SmartOClock reduces the tail latency by 9%, application cost by 30% and total energy consumption by 10% for latencysensitive microservices on a 36-server deployment. Simulation analysis using production traces show that SmartOClock reduces the number of power capping events by up to 95% while increasing the overclocking success rate by up to 62%. We also describe lessons from building a first-of-its-kind overclockable cluster in Microsoft Azure for production experiments.
DOI:10.1109/ISCA59077.2024.00040