SmartOClock: Workload- and Risk-Aware Overclocking in the Cloud

Bibliographic Details
Published in: 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 437-451
Main Authors: Stojkovic, Jovan; Misra, Pulkit A.; Goiri, Inigo; Whitlock, Sam; Choukse, Esha; Das, Mayukh; Bansal, Chetan; Lee, Jason; Sun, Zoey; Qiu, Haoran; Zimmermann, Reed; Samal, Savyasachi; Warrier, Brijesh; Raniwala, Ashish; Bianchini, Ricardo
Format: Conference Proceeding
Language: English
Published: IEEE 29.06.2024
Description
Summary: Operating server components beyond their voltage and power design limit (i.e., overclocking) enables improving performance and lowering cost for cloud workloads. However, overclocking can significantly degrade component lifetime, increase power draw, and cause power capping events, eventually diminishing the performance benefits. In this paper, we characterize the impact of overclocking on cloud workloads by studying their profiles from production deployments. Based on the characterization insights, we propose SmartOClock, the first distributed overclocking management platform specifically designed for cloud environments. SmartOClock is a workload-aware scheme that relies on power predictions to heterogeneously distribute the power budgets across its servers based on their needs and then enforce budget compliance locally, per-server, in a decentralized manner. SmartOClock reduces the tail latency by 9%, application cost by 30%, and total energy consumption by 10% for latency-sensitive microservices on a 36-server deployment. Simulation analysis using production traces shows that SmartOClock reduces the number of power capping events by up to 95% while increasing the overclocking success rate by up to 62%. We also describe lessons from building a first-of-its-kind overclockable cluster in Microsoft Azure for production experiments.
DOI: 10.1109/ISCA59077.2024.00040