Efficient execution of parallel computer programs

Bibliographic Details
Title: Efficient execution of parallel computer programs
Patent Number: 9,542,231
Publication Date: January 10, 2017
Appl. No: 13/086132
Application Filed: April 13, 2011
Abstract: The present invention, known as runspace, relates to the field of computing system management, data processing and data communications, and specifically to synergistic methods and systems which provide resource-efficient computation, especially for decomposable many-component tasks executable on multiple processing elements, by using a metric space representation of code and data locality to direct allocation and migration of code and data, by performing analysis to mark code areas that provide opportunities for runtime improvement, and by providing a low-power, local, secure memory management system suitable for distributed invocation of compact sections of code accessing local memory. Runspace provides mechanisms supporting hierarchical allocation, optimization, monitoring and control, and supporting resilient, energy efficient large-scale computing.
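The abstract's "metric space representation of code and data locality" can be illustrated with a minimal sketch: assign each processing element and each piece of data a coordinate, and allocate a codelet to the element nearest its data. This is an illustrative simplification, not the patented method; the names `nearest_element`, `pe0`, and `pe1` are hypothetical.

```python
import math

def nearest_element(data_coord, elements):
    """Return the name of the processing element whose coordinate in the
    metric space is closest to the data's coordinate (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(elements, key=lambda name: dist(data_coord, elements[name]))

# Two hypothetical processing elements at different points in the space.
elements = {"pe0": (0.0, 0.0), "pe1": (4.0, 4.0)}
print(nearest_element((1.0, 1.0), elements))  # pe0 is nearer to this data
```

In a real system the "distance" would reflect communication cost between memory and execution resources rather than geometric position, but the allocation decision has the same shape: minimize a distance metric over candidate placements.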
Inventors: Khan, Rishi L. (Wilmington, DE, US); Orozco, Daniel (Newark, DE, US); Gao, Guang R. (Newark, DE, US)
Assignees: ET International, Inc. (Newark, DE, US)
Claim: 1. A method of executing one or more programs on a dataflow processing system, the method comprising: a) obtaining a group of codelets configured to accomplish at least one task of a program, wherein the respective codelets are blocks of instructions arranged to execute non-preemptively to completion, once respective input conditions, including availability of all required inputs and any other requirements for execution, of the respective codelets are satisfied; b) awaiting completion of data percolation in the dataflow processing system, which occurs when data, code or both involved in codelet execution reside locally with respect to the codelet; c) verifying, after data percolation, that the input conditions for a given codelet are met, and placing the given codelet in an event pool, to indicate that the given codelet is enabled for execution; and d) scheduling, from the event pool, the given codelet for execution on a set of data processing system resources of the dataflow processing system.
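Claim 1's steps can be sketched in a few lines of Python: a codelet is a block of work with declared input requirements; after data percolation has placed the required data locally, codelets whose input conditions are met enter an event pool and are scheduled to run non-preemptively to completion. This is a single-threaded toy model of the claim, not the patented implementation; the class and function names are hypothetical.

```python
from collections import deque

class Codelet:
    """Toy codelet: runs non-preemptively to completion once all inputs exist."""
    def __init__(self, name, required_inputs, body):
        self.name = name
        self.required_inputs = set(required_inputs)  # names of inputs needed
        self.body = body                             # callable run to completion

    def inputs_satisfied(self, local_data):
        # Input conditions hold when every required input resides locally.
        return self.required_inputs <= set(local_data)

def run_codelets(codelets, local_data):
    """Steps (c) and (d) of Claim 1: after percolation, verify input
    conditions, place enabled codelets in an event pool, then schedule."""
    event_pool = deque()
    for c in codelets:                     # (c) verify and enable
        if c.inputs_satisfied(local_data):
            event_pool.append(c)
    results = []
    while event_pool:                      # (d) schedule enabled codelets
        c = event_pool.popleft()
        results.append(c.body(local_data))  # non-preemptive, runs to completion
    return results

data = {"a": 2, "b": 3}  # stands in for percolated local data
cs = [Codelet("sum", ["a", "b"], lambda d: d["a"] + d["b"]),
      Codelet("blocked", ["c"], lambda d: d["c"])]
print(run_codelets(cs, data))  # only "sum" is enabled → [5]
```

The essential property the claim describes is visible even in this toy: "blocked" never enters the event pool because its input `c` has not percolated, while "sum" runs exactly once, to completion, without preemption.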
Claim: 2. The method of claim 1, further comprising: dynamically mapping a given codelet within the group of codelets to a set of data processing system resources for execution of the given codelet, based at least in part on dependencies among the codelets and on an availability of data processing system resources.
Claim: 3. The method of claim 2, wherein the dynamically mapping comprises at least one function selected from a group consisting of: placing, locating, re-locating, moving and migrating.
Claim: 4. The method of claim 2, wherein the mapping comprises at least one function selected from a group consisting of: determining a start time for execution of the given codelet, and determining a place for the execution of the given codelet.
Claim: 5. The method of claim 2, wherein the mapping comprises performing mapping based on at least one criterion selected from a group consisting of: 1) improving a performance metric of an application program, 2) improving utilization of the data processing system resources, and 3) maximizing a performance metric of an application program while complying with a given set of resource consumption targets.
Claim: 6. The method of claim 2, wherein the mapping is performed in order to optimize a performance of the program with respect to at least one measure selected from a group consisting of: a total runtime of the program, a runtime of the program within a particular section, a maximum delay before an execution of a particular instruction, a quantity of processing units used, a quantity of memory used, a usage of register files, a usage of cache memory, a usage of level 1 cache, a usage of level 2 cache, a usage of level 3 cache, a usage of level N cache wherein N is a positive number, a usage of static RAM memory, a usage of dynamic RAM memory, a usage of global memory, a usage of virtual memory, a quantity of processors available for uses other than executing the program, a quantity of memory available for uses other than executing the program, energy consumption, a peak energy consumption, a longevity cost to a computing system, a volume of register updates needed, a volume of memory clearing needed, an efficacy of security enforcement and a cost of security enforcement.
Claim: 7. The method of claim 2, wherein the mapping is performed in order to operate within a resource constraint that is selected from the group consisting of: a total runtime of the program, a runtime of the program within a particular section, a maximum delay before an execution of a particular instruction, a quantity of processing units used, a quantity of memory used, a usage of register files, a usage of cache memory, a usage of level 1 cache, a usage of level 2 cache, a usage of level 3 cache, a usage of level N cache wherein N is a positive number, a usage of static RAM memory, a usage of dynamic RAM memory, a usage of global memory, a usage of virtual memory, a quantity of processors available for uses other than executing the program, a quantity of memory available for uses other than executing the program, energy consumption, a peak energy consumption, a longevity cost to a computing system, a volume of register updates needed, a volume of memory clearing needed, an efficacy of security enforcement and a cost of security enforcement.
Claim: 8. The method of claim 2, wherein the mapping is performed in order to pursue a time-variable mixture of goals, wherein said mixture changes over time due to a factor selected from the group consisting of: pre-specified change and dynamically emerging changes.
Claim: 9. The method of claim 2, further comprising applying a set of compile-time directives, to aid in carrying out one or more of the obtaining or the dynamically mapping.
Claim: 10. The method of claim 9, wherein the compile-time directives are selected from the group consisting of: a floating point unit desired, a floating point accuracy desired, a frequency of access, a locality of access, a stalled access, a read-only data type, an initially read-only data type, a finally read-only data type, and a conditionally read-only data type.
Claim: 11. A dataflow processing system including multiple run-time system cores, a respective run-time system core comprising: a set of multiple system management agents, said set including: a data percolation manager, a codelet scheduler, a codelet migration manager, a load balancer, a thread scheduler, and a power regulator or a performance manager, wherein a respective codelet is a block of instructions arranged to execute non-preemptively to completion, once respective input conditions, including availability of all required inputs and any other requirements for execution, of the respective codelets are satisfied, wherein the set of system management agents are configured to interact in a synergistic manner to optimize program execution in the multiple cores, wherein the data percolation manager is configured to percolate data, code or both to reside locally with respect to a codelet, wherein the codelet scheduler is configured to verify, once the data percolation manager has completed data percolation, that the input conditions for a given codelet are met, and to place the given codelet in an event pool, to indicate that the given codelet is enabled for execution, and wherein the thread scheduler is configured to take enabled codelets from the event pool and place them in one or more work queues associated with one or more execution cores.
Claim: 12. The dataflow processing system of claim 11, wherein the load balancer is configured to analyze the event pool, the work queues, or both, to determine if work should be performed locally, with respect to the respective run-time system, or if work should be migrated elsewhere.
Claim: 13. The dataflow processing system of claim 12, wherein the respective run-time system core further comprises a thread migrator configured to migrate work to at least one other run-time system core.
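The run-time core of Claims 11 through 13 can be sketched as a small data structure: an event pool of enabled codelets, per-execution-core work queues filled by a thread scheduler, and a load balancer that migrates work exceeding local capacity. This is a simplified single-process model under assumed semantics (round-robin thread scheduling, a fixed per-queue capacity); the class name `RuntimeCore` and its methods are hypothetical, and the percolation manager, power regulator, and other claimed agents are omitted for brevity.

```python
from collections import deque

class RuntimeCore:
    """Toy model of a run-time system core: event pool, work queues,
    and a load balancer that migrates excess work elsewhere."""
    def __init__(self, num_exec_cores, local_capacity):
        self.event_pool = deque()
        self.work_queues = [deque() for _ in range(num_exec_cores)]
        self.local_capacity = local_capacity  # max codelets kept per queue
        self.migrated = []                    # work handed to other cores

    def enable(self, codelet):
        # Codelet scheduler: place an enabled codelet in the event pool.
        self.event_pool.append(codelet)

    def schedule_threads(self):
        # Thread scheduler: move enabled codelets from the event pool
        # into the work queues of the execution cores, round-robin.
        i = 0
        while self.event_pool:
            self.work_queues[i % len(self.work_queues)].append(
                self.event_pool.popleft())
            i += 1

    def balance_load(self):
        # Load balancer (Claims 12-13): analyze the work queues and
        # migrate any work beyond local capacity to another core.
        for q in self.work_queues:
            while len(q) > self.local_capacity:
                self.migrated.append(q.pop())

core = RuntimeCore(num_exec_cores=2, local_capacity=1)
for name in ["a", "b", "c", "d"]:
    core.enable(name)
core.schedule_threads()   # a,c -> queue 0; b,d -> queue 1
core.balance_load()       # c and d exceed capacity and are migrated
print([list(q) for q in core.work_queues], core.migrated)
```

The division of labor mirrors the claim language: the scheduler only moves enabled codelets from pool to queue, while the balancer independently decides, by inspecting the queues, whether work stays local or migrates to another run-time system core.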
Claim: 14. A computer program product, in a computer system comprising multiple nodes, the computer program product comprising a storage medium having stored thereon computer program code designed for execution on at least one of said nodes, which, upon execution, results in the implementation of operations according to the method of claim 1.
Claim: 15. A computer usable storage medium or network accessible storage medium having executable program code stored thereon, wherein at least a portion of said program code, upon execution, results in the implementation of operations according to the method of claim 1.
Patent References Cited: 6061709 May 2000 Bronte
2003/0191927 October 2003 Joy et al.
2005/0188177 August 2005 Gao et al.
2008/0189522 August 2008 Meil et al.
2009/0164399 June 2009 Bell, Jr. et al.
2009/0259713 October 2009 Blumrich et al.
2009/0288086 November 2009 Ringseth et al.
101533417 September 2009
03102758 December 2003
Other References: International Preliminary Report on Patentability issued in PCT/US2011/032316, mailed Jun. 17, 2011. cited by applicant
International Search Report and Written Opinion, PCT/US11/32316, issued Jun. 17, 2011. cited by applicant
Extended European Search Report issued Nov. 17, 2014 in EP Application No. 11769525.4. cited by applicant
Augonnet et al., "Automatic Calibration of Performance Models on Heterogeneous Multicore Architectures," Parallel Processing Workshops, The Netherlands, pp. 56-65 (Aug. 25, 2009). cited by applicant
Hormati et al., "Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures," 18th International Conference on Parallel Architectures and Compilation Techniques, Piscataway, NJ, pp. 214-223 (Sep. 12, 2009). cited by applicant
Maheswaran et al., "A dynamic matching and scheduling algorithm for heterogeneous computing systems," Proceedings of the Seventh Heterogeneous Computing Workshop, Orlando, FL, pp. 57-69 (Mar. 30, 1998). cited by applicant
Pllana et al., "Towards an Intelligent Environment for Programming Multi-core Computing Systems," Euro-Par 2008 Workshops—Parallel Processing, Berlin, DE, pp. 141-151 (Aug. 25, 2008). cited by applicant
Dolbeau et al., "HMPP: A Hybrid Multi-core Parallel Programming Environment," Proceedings of the Workshop on General Purpose Processing on Graphics Processing Units, Boston, MA, pp. 1-5 (Oct. 4, 2007). cited by applicant
English translation of an Office Action and Search Report issued Aug. 12, 2015 in CN Application No. 201180019116.4. cited by applicant
Veen, “Dataflow Machine Architecture”, ACM Computing Surveys, vol. 18, No. 4, pp. 365-396, Dec. 1986. cited by applicant
Culler, “Dataflow Architectures” Laboratory for Computer Science, pp. 1-34, Feb. 12, 1986. cited by applicant
Primary Examiner: Kessler, Gregory A
Attorney, Agent or Firm: Panitch Schwarze Belisario & Nadel LLP
Accession Number: edspgr.09542231
Database: USPTO Patent Grants