Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

Detailed bibliography
Published in:	Scientific Programming, Volume 2015, Issue 2015, pp. 1-16
Main authors:	Muddukrishna, Ananya; Brorsson, Mats; Jonsson, Peter A.
Medium:	Journal Article
Language:	English
Published:	Cairo, Egypt: Hindawi Publishing Corporation, 01.01.2015; John Wiley & Sons, Inc.
Topics:	Application programming interfaces (API); Architectural design; Architectural knowledge; Benchmarking; Computer architecture; Data distribution; Distributing; Improve performance; Many-core processors; Microprocessors; Multiprocessing systems; Multitasking; Network management; Non uniform data; Performance degradation; Performance enhancement; Policies; Processor architectures; Processors; Scheduling; Scheduling algorithms; Scheduling techniques; Software architecture; Task scheduling
ISSN:	1058-9244, 1875-919X

Abstract	Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load-balancing at the expense of locality and ignore NUMA node/manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer of thinking about NUMA system/manycore processor architecture details by delegating data distribution to the runtime system, and it uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and we find that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies while providing an architecture-oblivious approach for programmers.
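The abstract's central idea, scheduling each task near the data named in its dependences once the runtime has distributed that data, can be illustrated with a minimal Python sketch. It is not the authors' runtime; every detail is an assumption made for illustration: pages of `PAGE_SIZE` bytes are distributed round-robin across NUMA nodes, each hypothetical `Task` lists its dependences as (start address, length) pairs, and the hypothetical `LocalityAwareScheduler` enqueues a task on the node that holds the most of those bytes, letting an idle node steal from the longest other queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PAGE_SIZE = 4096  # assumed distribution granularity (one OS page)


def home_node(address: int, num_nodes: int) -> int:
    """Assumed data distribution: page i is placed on node i % num_nodes."""
    return (address // PAGE_SIZE) % num_nodes


def bytes_per_node(deps: List[Tuple[int, int]], num_nodes: int) -> List[int]:
    """Count how many dependence bytes of a task reside on each node."""
    counts = [0] * num_nodes
    for start, length in deps:
        addr, end = start, start + length
        while addr < end:
            # Split the range at page boundaries; each piece has one home node.
            page_end = (addr // PAGE_SIZE + 1) * PAGE_SIZE
            chunk = min(end, page_end) - addr
            counts[home_node(addr, num_nodes)] += chunk
            addr += chunk
    return counts


@dataclass
class Task:
    name: str
    deps: List[Tuple[int, int]] = field(default_factory=list)  # (start, length) pairs


class LocalityAwareScheduler:
    """Hypothetical per-node task queues driven by dependence locality."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.queues = [deque() for _ in range(num_nodes)]

    def pick_node(self, task: Task) -> int:
        counts = bytes_per_node(task.deps, self.num_nodes)
        # max() returns the first maximum, so ties (and tasks without
        # dependences) go to the lowest-numbered node.
        return max(range(self.num_nodes), key=lambda n: counts[n])

    def submit(self, task: Task) -> int:
        node = self.pick_node(task)
        self.queues[node].append(task)
        return node

    def next_task(self, node: int) -> Optional[Task]:
        """Run local work first; otherwise steal from the longest other queue."""
        if self.queues[node]:
            return self.queues[node].popleft()
        victims = [n for n in range(self.num_nodes) if self.queues[n]]
        if not victims:
            return None
        victim = max(victims, key=lambda n: len(self.queues[n]))
        # The thief takes from the opposite end of the queue to the owner.
        return self.queues[victim].pop()


if __name__ == "__main__":
    sched = LocalityAwareScheduler(num_nodes=2)
    for t in (Task("a", [(0, 4096)]), Task("b", [(4096, 4096)]), Task("c", [(3072, 4096)])):
        print(t.name, "-> node", sched.submit(t))
```

With two nodes, a task depending on bytes 3072-7167 has 1024 bytes on node 0 and 3072 bytes on node 1, so the sketch queues it on node 1. Ordering or restricting steals by NUMA distance would be a natural refinement; stealing from the longest queue is used here only to keep the example short.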
Author	Brorsson, Mats; Jonsson, Peter A.; Muddukrishna, Ananya
BackLink	https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166580 (View record from Swedish Publication Index, Kungliga Tekniska Högskolan); https://urn.kb.se/resolve?urn=urn:nbn:se:ri:diva-41881 (View record from Swedish Publication Index)
CitedBy_id	crossref_primary_10_1002_cpe_6887 crossref_primary_10_1145_3293448
Cites_doi	10.1007/978-3-642-02303-3_7 10.1145/1555815.1555779 10.1007/978-3-642-36949-0_39 10.3233/SPR-2010-0307 10.1007/978-3-642-21487-5_6 10.1177/1094342011434065 10.1109/tc.2010.199 10.1007/978-3-642-30961-8_14 10.1145/2426642.2259000 10.1007/978-3-642-40698-0_12 10.1109/mm.2010.31 10.1109/tpds.2012.322 10.1016/j.procs.2013.05.201 10.1007/978-3-642-19328-6_27 10.1007/978-3-642-24650-0_15
ContentType	Journal Article
Copyright	Copyright © 2015 Ananya Muddukrishna et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
DOI	10.1155/2015/981759
DatabaseName	الدوريات العلمية والإحصائية - e-Marefa Academic and Statistical Periodicals; معرفة - المحتوى العربي الأكاديمي المتكامل - e-Marefa Academic Complete; Hindawi Publishing Complete; Hindawi Publishing Subscription Journals; Hindawi Publishing Open Access; CrossRef; Computer and Information Systems Abstracts; Electronics & Communications Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional; SwePub; SwePub Articles; SWEPUB Kungliga Tekniska Högskolan; SWEPUB Freely available online; SwePub Articles full text
Discipline	Computer Science
EISSN	1875-919X
Editor	Chandrasekaran, Sunita
EndPage	16
ISICitedReferencesCount	12
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	2015
Language	English
License	This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. http://creativecommons.org/licenses/by/3.0
OpenAccessLink	https://dx.doi.org/10.1155/2015/981759
PageCount	16
PublicationCentury	2000
PublicationDate	2015-01-01
PublicationDecade	2010
PublicationPlace	Cairo, Egypt; New York
PublicationTitle	Scientific programming
PublicationYear	2015
Publisher	Hindawi Publishing Corporation; John Wiley & Sons, Inc.
StartPage	1
URI	https://search.emarefa.net/detail/BIM-1076564 https://dx.doi.org/10.1155/2015/981759 https://www.proquest.com/docview/2008026179 https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-166580 https://urn.kb.se/resolve?urn=urn:nbn:se:ri:diva-41881
Volume	2015