UDP: Utility-Driven Fetch Directed Instruction Prefetching

Bibliographic Details
Published in:	2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1188-1201
Main Authors:	Oh, Surim; Xu, Mingsheng; Khan, Tanvir Ahmed; Kasikci, Baris; Litz, Heiner
Format:	Conference paper
Language:	English
Published:	IEEE, 29 June 2024
Subjects:	Computer architecture; Costs; data center; Data centers; frontend stalls; Instruction prefetching; Particle measurements; Prefetching; Program processors; Rendering (computer graphics)

Description
Summary:	Datacenter applications exhibit large instruction footprints causing significant instruction cache misses and, as a result, frontend stalls. To address this issue, instruction prefetching mechanisms have been proposed, including state-of-the-art techniques such as fetch-directed instruction prefetching (FDIP). However, our study shows that existing implementations still fall far short of an ideal system with a perfect instruction cache. In particular, up to 588.47% of potential IPC speedup of existing processors remains hidden by frontend stalls, and these frontend stalls are due to inaccurate and untimely instruction prefetches. We quantify the impact of these individual effects, observing that applications exhibit different characteristics that call for adaptive application-specific optimizations. Based on these insights, we propose two novel mechanisms, UDP and UFTQ, to improve the accuracy of FDIP without negatively affecting timeliness while leveraging prefetches on the wrong path. We evaluate our technique on 10 data center workloads showing a maximal IPC improvement of 16.1% and an average IPC improvement of 3.6%. Our techniques only introduce moderate hardware modifications and a storage cost of 8 KB.
DOI:	10.1109/ISCA59077.2024.00089