BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models
Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable ef...
Uloženo v:
| Vydáno v: | 2015 International Conference on Parallel Architecture and Compilation (PACT) s. 241 - 252 |
|---|---|
| Hlavní autoři: | , , |
| Médium: | Konferenční příspěvek |
| Jazyk: | angličtina |
| Vydáno: |
IEEE
01.10.2015
|
| Témata: | |
| ISSN: | 1089-795X |
| On-line přístup: | Získat plný text |
| Tagy: |
Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
|
| Shrnutí: | Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous parallelism, inefficient utilization of relaxed consistency models within a single node have caused parallel implementations to have low execution efficiency. The long latency and serialization caused by atomic operations have a significant impact on performance. The data communication is not overlapped with the main computation, which reduces execution efficiency. The inefficiency comes from the data movement between where they are stored and where they are processed. In this work, we propose Bounded Staled Sync (BSSync), a hardware support for the bounded staleness consistency model, which accompanies simple logic layers in the memory hierarchy. BSSync overlaps the long latency atomic operation with the main computation, targeting iterative convergent machine learning workloads. Compared to previous work that allows staleness for read operations, BSSync utilizes staleness for write operations, allowing stale-writes. We demonstrate the benefit of the proposed scheme for representative machine learning workloads. On average, our approach outperforms the baseline asynchronous parallel implementation by 1.33x times. |
|---|---|
| ISSN: | 1089-795X |
| DOI: | 10.1109/PACT.2015.42 |