Automated job flow cancellation for multiple task routine instance errors in many task computing

Detailed Bibliography
Title: Automated job flow cancellation for multiple task routine instance errors in many task computing
Patent Number: 11,748,159
Publication Date: September 05, 2023
Appl. No: 18/091691
Application Filed: December 30, 2022
Abstract: An apparatus including a processor to: within a kill container, in response to a set of error messages indicative of errors in executing multiple instances of a task routine to perform a task of a job flow with multiple data object blocks of a data object, and in response to the quantity of error messages reaching a threshold, output a kill tasks request message that identifies the job flow; within a task container, in response to the kill tasks request message, cease execution of the task routine and output a task cancelation message that identifies the task and the job flow; and within a performance container, in response to the task cancelation message, output a job cancelation message to cause the transmission of an indication of cancelation of the job flow, via a network, and to a requesting device that requested the performance of the job flow.
Inventors: SAS Institute Inc. (Cary, NC, US)
Assignees: SAS INSTITUTE INC. (Cary, NC, US)
Claim: 1. An apparatus comprising at least one processor and a storage to store instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising: within a kill container, the at least one processor is caused to perform operations comprising: monitor a task kill queue for error messages that each indicate an occurrence of an error in executing a task routine to perform a task of a set of tasks of a job flow, and for messages that each indicate a successful execution of a task routine to perform a task of the set of tasks; in response to output, onto the task kill queue, of a first set of error messages indicative of errors in executing multiple instances of a first task routine to perform a first task of the set of tasks with multiple data object blocks of a data object, compare a quantity of error messages within the first set of error messages to a first predetermined threshold quantity; in response to a lack of receipt, via the task kill queue, of a message that indicates a successful execution of an instance of the first task routine, and in response to the quantity of error messages within the first set of error messages reaching the first predetermined threshold quantity, output a kill tasks request message that identifies the job flow onto the task kill queue; and in response to output, onto the task kill queue, of at least one message that indicates a successful execution of an instance of the first task routine, increase the first predetermined threshold quantity or refrain from outputting the kill tasks request message; within at least one task container of a set of task containers, and in response to the output of the kill tasks request message onto the task kill queue, the at least one processor is caused to perform operations comprising: cease execution of the first task routine to cancel the performance of the first task; and output, onto a task queue, a task cancelation message indicative of cessation of execution of the first task routine, and that identifies the first task and the job flow; and within a performance container, and in response to the output of the task cancelation message onto the task queue, the at least one processor is caused to perform operations comprising: output a job cancelation message indicative of cancelation of the job flow onto a job queue to cause a transmission of an indication of cancelation of the job flow, via a network, and to a requesting device that requested the performance of the job flow.
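The message flow recited in claim 1 can be illustrated with a minimal Python sketch. Everything named here (KillMonitor, raise_by, the dictionary message layout, the "kind" values) is a hypothetical stand-in and not taken from the patent; plain deques stand in for the task kill queue, task queue and job queue, and for brevity the error and success messages are handed to the monitor directly rather than read back off the task kill queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KillMonitor:
    """Hypothetical stand-in for the logic run within the kill container."""
    threshold: int        # the "first predetermined threshold quantity"
    raise_by: int = 0     # assumed policy: 0 = refrain after a success, >0 = raise the threshold
    errors: dict = field(default_factory=dict)    # (job_flow, task) -> error count
    succeeded: set = field(default_factory=set)   # (job_flow, task) pairs with a success
    killed: set = field(default_factory=set)      # job flows already asked to stop

    def on_message(self, msg, task_kill_queue):
        if msg["kind"] not in ("error", "success"):
            return  # ignore other traffic, e.g. the kill request itself
        key = (msg["job_flow"], msg["task"])
        if msg["kind"] == "success":
            self.succeeded.add(key)
            return
        self.errors[key] = self.errors.get(key, 0) + 1
        limit = self.threshold
        if key in self.succeeded:
            if self.raise_by == 0:
                return              # refrain from outputting the kill request
            limit += self.raise_by  # tolerate more errors once an instance has succeeded
        if self.errors[key] >= limit and msg["job_flow"] not in self.killed:
            self.killed.add(msg["job_flow"])
            task_kill_queue.append({"kind": "kill_tasks", "job_flow": msg["job_flow"]})


def task_container_step(msg, running, task_queue):
    """Within a task container: cease execution and report the cancelation."""
    if msg["kind"] == "kill_tasks" and running["job_flow"] == msg["job_flow"] and running["active"]:
        running["active"] = False
        task_queue.append({"kind": "task_canceled", "task": running["task"],
                           "job_flow": msg["job_flow"]})


def performance_container_step(msg, job_queue):
    """Within the performance container: turn a task cancelation into a job cancelation."""
    if msg["kind"] == "task_canceled":
        job_queue.append({"kind": "job_canceled", "job_flow": msg["job_flow"]})


def run_demo(messages, threshold, raise_by=0):
    """Push messages through the three stages and return what lands on the job queue."""
    task_kill_queue, task_queue, job_queue = deque(), deque(), deque()
    monitor = KillMonitor(threshold=threshold, raise_by=raise_by)
    running = {"job_flow": "jf-1", "task": "t-1", "active": True}
    for msg in messages:
        monitor.on_message(msg, task_kill_queue)
    for msg in list(task_kill_queue):
        task_container_step(msg, running, task_queue)
    for msg in list(task_queue):
        performance_container_step(msg, job_queue)
    return list(job_queue)
```

In this sketch, three error messages against a threshold of 3 yield one job cancelation message, while a success message received beforehand either suppresses the request (raise_by=0) or raises the limit by raise_by.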
Claim: 2. The apparatus of claim 1, wherein: within the kill container, the at least one processor is caused to perform operations comprising: in response to output, onto the task kill queue, of a second set of error messages indicative of errors in executing a second task routine to perform a second task of the set of tasks with just one data object block of the data object or with the entirety of the data object, compare a quantity of the second set of error messages to a second predetermined threshold quantity, and in response to the quantity of error messages within the second set of error messages reaching the second predetermined threshold quantity, output the kill tasks request message that identifies the job flow onto the task kill queue; and within at least one task container in which the second task routine is being executed, and in response to the kill tasks request message within the task kill queue, the at least one processor is caused to perform operations comprising: cease execution of the second task routine to cease performance of the second task; and output a task cancelation message indicative of cancelation of execution of the second task routine, and that identifies the job flow, onto the task queue.
Claim: 3. The apparatus of claim 1, wherein: each error message of the first set of error messages specifies a type of error; the kill tasks request message includes an indication of a type of error derived from the type of error specified in each error message of the first set of error messages; and the derived type of error is relayed through the task cancelation message, the job cancelation message, and the indication of cancelation transmitted to the requesting device.
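Claim 3 does not say how the type of error is derived from the individual error messages. The sketch below assumes, purely for illustration, that the most frequently reported type is chosen, and then shows that value being copied unchanged through each downstream message. The function names and message fields are hypothetical.

```python
from collections import Counter


def derive_error_type(error_messages):
    """Assumed rule (not from the patent): pick the most frequently specified type.

    Expects a non-empty list, as is the case once the threshold has been reached.
    """
    counts = Counter(m["error_type"] for m in error_messages)
    return counts.most_common(1)[0][0]


def kill_tasks_request(job_flow, error_messages):
    """Kill container: attach the derived type to the kill tasks request."""
    return {"kind": "kill_tasks", "job_flow": job_flow,
            "error_type": derive_error_type(error_messages)}


def task_cancelation(kill_request, task):
    """Task container: relay the derived type in the task cancelation message."""
    return {"kind": "task_canceled", "task": task,
            "job_flow": kill_request["job_flow"],
            "error_type": kill_request["error_type"]}


def job_cancelation(task_message):
    """Performance container: relay the derived type in the job cancelation message."""
    return {"kind": "job_canceled", "job_flow": task_message["job_flow"],
            "error_type": task_message["error_type"]}
```

Other derivation rules (first reported type, a severity ranking) would fit the same relay structure; only derive_error_type would change.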
Claim: 4. The apparatus of claim 1, wherein within each task container of the set of task containers, and in response to each occurrence of an error in executing the first task routine, the at least one processor is caused to perform operations comprising: output onto the task kill queue an error message of the first set of error messages; and uninstantiate the task container.
Claim: 5. The apparatus of claim 1, wherein: the error specified as occurring in each error message comprises at least one of an instance of failure of execution, or an instance of a level of a parameter of execution exceeding a threshold limit level during execution; and the parameter of execution of the first task routine comprises at least one of: a level of consumption of a processing resource of the at least one processor by the execution of the first task routine; a level of consumption of a storage resource by the execution of the first task routine; and an amount of time elapsing since commencement of the execution of the first task routine.
Claim: 6. The apparatus of claim 5, wherein the first set of error messages includes status messages that convey an indication of a level of a parameter of execution of the first task routine that are determined to exceed a threshold limit level.
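Claims 5 and 6 treat both an outright execution failure and a status message reporting an over-limit parameter as errors. A minimal sketch of that classification follows; the Limits fields, the status dictionary keys and the returned labels are all assumed names, and the units are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical threshold limit levels for the three parameters of execution."""
    cpu_fraction: float      # processing-resource consumption
    storage_bytes: int       # storage-resource consumption
    elapsed_seconds: float   # time since execution commenced


def classify_status(status: dict, limits: Limits) -> Optional[str]:
    """Return an error label if the status counts as an error, else None."""
    if status.get("failed", False):
        return "execution_failure"
    if status.get("cpu_fraction", 0.0) > limits.cpu_fraction:
        return "cpu_limit_exceeded"
    if status.get("storage_bytes", 0) > limits.storage_bytes:
        return "storage_limit_exceeded"
    if status.get("elapsed_seconds", 0.0) > limits.elapsed_seconds:
        return "time_limit_exceeded"
    return None
```

A status message for which classify_status returns a label would be counted toward the threshold in the same way as an explicit failure message.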
Claim: 7. The apparatus of claim 1, wherein: each task container of the set of task containers is of a first type that supports executions of multiple instances of task routines at least partially in parallel; the at least one processor executes instructions of a resource allocation routine to cause the at least one processor to dynamically allocate multiple containers based on availability of at least one of processing resources and storage resources; and within the performance container, and in response to the output of the task cancelation message onto the task queue, the at least one processor is caused to provide, to the resource allocation routine, an indication that fewer task containers of the first type are needed to enable reallocation of resources to other task containers of a second type that supports executions of single instances of task routines.
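The reallocation hint in claim 7 can be pictured with a small sketch. It assumes, only for illustration, that each container occupies one interchangeable slot, so a first-type (multi-instance) container that is no longer needed frees exactly one slot for a second-type (single-instance) container. ResourceAllocator and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourceAllocator:
    """Hypothetical resource allocation routine tracking container counts by type."""
    multi_instance: int        # first type: runs multiple task routine instances in parallel
    single_instance: int = 0   # second type: runs single instances

    def fewer_multi_instance_needed(self, count: int) -> int:
        """Release up to `count` first-type containers and reuse their slots."""
        moved = max(0, min(count, self.multi_instance))
        self.multi_instance -= moved
        self.single_instance += moved   # assumed one-for-one reuse of freed resources
        return moved


def on_task_cancelation(allocator: ResourceAllocator, containers_freed: int) -> int:
    """Performance container: signal the allocator after a task cancelation message."""
    return allocator.fewer_multi_instance_needed(containers_freed)
```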
Claim: 8. The apparatus of claim 1, wherein: the task queue comprises a group sub-queue to which access is shared by the set of task containers, and a set of individual sub-queues; and each individual sub-queue of the set of individual sub-queues is accessible to a different task container of the set of task containers to provide each task container of the set of task containers with a path of communication to exchange messages with the performance container that is not shared with any other task container.
Claim: 9. The apparatus of claim 8, wherein: the group sub-queue is maintained throughout at least the performance of the job flow; each individual sub-queue of the set of individual sub-queues is newly instantiated each time the corresponding task container accedes to executing a task routine that is requested in a task routine execution request message that is output onto the group sub-queue; and within each task container of the set of task containers, the at least one processor is caused, in response to receiving the task cancelation message, to uninstantiate the corresponding individual sub-queue.
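The sub-queue arrangement of claims 8 and 9 can be sketched as one long-lived shared queue plus per-container queues that are created on acceptance of a request and removed on cancelation. The TaskQueue class, its method names and the container identifiers below are hypothetical, and in-process deques stand in for whatever message broker an implementation would use.

```python
from collections import deque


class TaskQueue:
    """Hypothetical task queue with a shared group sub-queue and private sub-queues."""

    def __init__(self):
        self.group = deque()    # shared by all task containers, kept for the whole job flow
        self.individual = {}    # container id -> private deque to the performance container

    def accept(self, container_id):
        """A task container takes the next execution request off the group sub-queue."""
        request = self.group.popleft()
        self.individual[container_id] = deque()   # newly instantiated on every acceptance
        return request

    def send(self, container_id, message):
        """Exchange a message on the path private to one task container."""
        self.individual[container_id].append(message)

    def release(self, container_id):
        """Uninstantiate the private sub-queue, e.g. after a task cancelation message."""
        self.individual.pop(container_id, None)
```

Keeping the group sub-queue separate from the individual ones means a cancelation only tears down the private path of the affected container; requests still waiting on the group sub-queue are untouched.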
Claim: 10. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, the computer-program product including instructions operable to cause at least one processor to perform operations comprising: within a kill container, the at least one processor is caused to perform operations comprising: monitor a task kill queue for error messages that each indicate an occurrence of an error in executing a task routine to perform a task of a set of tasks of a job flow, and for messages that each indicate a successful execution of a task routine to perform a task of the set of tasks; in response to output, onto the task kill queue, of a first set of error messages indicative of errors in executing multiple instances of a first task routine to perform a first task of the set of tasks with multiple data object blocks of a data object, compare a quantity of error messages within the first set of error messages to a first predetermined threshold quantity; in response to a lack of receipt, via the task kill queue, of a message that indicates a successful execution of an instance of the first task routine, and in response to the quantity of error messages within the first set of error messages reaching the first predetermined threshold quantity, output a kill tasks request message that identifies the job flow onto the task kill queue; and in response to output, onto the task kill queue, of at least one message that indicates a successful execution of an instance of the first task routine, increase the first predetermined threshold quantity or refrain from outputting the kill tasks request message; within at least one task container of a set of task containers, and in response to the output of the kill tasks request message onto the task kill queue, the at least one processor is caused to perform operations comprising: cease execution of the first task routine to cancel the performance of the first task; and output, onto a task queue, a task cancelation message indicative of cessation of execution of the first task routine, and that identifies the first task and the job flow; and within a performance container, and in response to the output of the task cancelation message onto the task queue, the at least one processor is caused to perform operations comprising: output a job cancelation message indicative of cancelation of the job flow onto a job queue to cause a transmission of an indication of cancelation of the job flow, via a network, and to a requesting device that requested the performance of the job flow.
Claim: 11. The computer-program product of claim 10, wherein: within the kill container, the at least one processor is caused to perform operations comprising: in response to output, onto the task kill queue, of a second set of error messages indicative of errors in executing a second task routine to perform a second task of the set of tasks with just one data object block of the data object or with the entirety of the data object, compare a quantity of the second set of error messages to a second predetermined threshold quantity, and in response to the quantity of error messages within the second set of error messages reaching the second predetermined threshold quantity, output the kill tasks request message that identifies the job flow onto the task kill queue; and within at least one task container in which the second task routine is being executed, and in response to the kill tasks request message within the task kill queue, the at least one processor is caused to perform operations comprising: cease execution of the second task routine to cease performance of the second task; and output a task cancelation message indicative of cancelation of execution of the second task routine, and that identifies the job flow, onto the task queue.
Claim: 12. The computer-program product of claim 10, wherein: each error message of the first set of error messages specifies a type of error; the kill tasks request message includes an indication of a type of error derived from the type of error specified in each error message of the first set of error messages; and the derived type of error is relayed through the task cancelation message, the job cancelation message, and the indication of cancelation transmitted to the requesting device.
Claim: 13. The computer-program product of claim 10, wherein within each task container of the set of task containers, and in response to each occurrence of an error in executing the first task routine, the at least one processor is caused to perform operations comprising: output onto the task kill queue an error message of the first set of error messages; and uninstantiate the task container.
Claim: 14. The computer-program product of claim 10, wherein: the error specified as occurring in each error message comprises at least one of an instance of failure of execution, or an instance of a level of a parameter of execution exceeding a threshold limit level during execution; and the parameter of execution of the first task routine comprises at least one of: a level of consumption of a processing resource of the at least one processor by the execution of the first task routine; a level of consumption of a storage resource by the execution of the first task routine; and an amount of time elapsing since commencement of the execution of the first task routine.
Claim: 15. The computer-program product of claim 14, wherein the first set of error messages includes status messages that convey an indication of a level of a parameter of execution of the first task routine that are determined to exceed a threshold limit level.
Claim: 16. The computer-program product of claim 10, wherein: each task container of the set of task containers is of a first type that supports executions of multiple instances of task routines at least partially in parallel; the at least one processor executes instructions of a resource allocation routine to cause the at least one processor to dynamically allocate multiple containers based on availability of at least one of processing resources and storage resources; and within the performance container, and in response to the output of the task cancelation message onto the task queue, the at least one processor is caused to provide, to the resource allocation routine, an indication that fewer task containers of the first type are needed to enable reallocation of resources to other task containers of a second type that supports executions of single instances of task routines.
Claim: 17. The computer-program product of claim 10, wherein: the task queue comprises a group sub-queue to which access is shared by the set of task containers, and a set of individual sub-queues; and each individual sub-queue of the set of individual sub-queues is accessible to a different task container of the set of task containers to provide each task container of the set of task containers with a path of communication to exchange messages with the performance container that is not shared with any other task container.
Claim: 18. The computer-program product of claim 17, wherein: the group sub-queue is maintained throughout at least the performance of the job flow; each individual sub-queue of the set of individual sub-queues is newly instantiated each time the corresponding task container accedes to executing a task routine that is requested in a task routine execution request message that is output onto the group sub-queue; and within each task container of the set of task containers, the at least one processor is caused, in response to receiving the task cancelation message, to uninstantiate the corresponding individual sub-queue.
Claim: 19. A computer-implemented method comprising: within a kill container, performing operations comprising: monitoring a task kill queue for error messages that each indicate an occurrence of an error in executing a task routine to perform a task of a set of tasks of a job flow, and for messages that each indicate a successful execution of a task routine to perform a task of the set of tasks; in response to output, onto the task kill queue, of a first set of error messages indicative of errors in executing multiple instances of a first task routine to perform a first task of the set of tasks with multiple data object blocks of a data object, comparing a quantity of error messages within the first set of error messages to a first predetermined threshold quantity; in response to a lack of receipt, via the task kill queue, of a message that indicates a successful execution of an instance of the first task routine, and in response to the quantity of error messages within the first set of error messages reaching the first predetermined threshold quantity, outputting a kill tasks request message that identifies the job flow onto the task kill queue; or in response to output, onto the task kill queue, of at least one message that indicates a successful execution of an instance of the first task routine, increasing the first predetermined threshold quantity or refraining from outputting the kill tasks request message; within at least one task container of a set of task containers, and in response to the output of the kill tasks request message onto the task kill queue, performing operations comprising: ceasing execution, by at least one processor, of the first task routine to cancel the performance of the first task; and outputting, onto a task queue, a task cancelation message indicative of cessation of execution of the first task routine, and that identifies the first task and the job flow; and within a performance container, and in response to the output of the task cancelation message onto the task queue, performing operations comprising: outputting a job cancelation message indicative of cancelation of the job flow onto a job queue to cause a transmission of an indication of cancelation of the job flow, via a network, and to a requesting device that requested the performance of the job flow.
Claim: 20. The computer-implemented method of claim 19, comprising: within the kill container, performing operations comprising: in response to output, onto the task kill queue, of a second set of error messages indicative of errors in executing a second task routine to perform a second task of the set of tasks with just one data object block of the data object or with the entirety of the data object, comparing a quantity of the second set of error messages to a second predetermined threshold quantity, and in response to the quantity of error messages within the second set of error messages reaching the second predetermined threshold quantity, outputting the kill tasks request message that identifies the job flow onto the task kill queue; and within at least one task container in which the second task routine is being executed by the at least one processor, and in response to the kill tasks request message within the task kill queue, performing operations comprising: ceasing execution, by the at least one processor, of the second task routine to cease performance of the second task; and outputting a task cancelation message indicative of cancelation of execution of the second task routine, and that identifies the job flow, onto the task queue.
Claim: 21. The computer-implemented method of claim 19, wherein: each error message of the first set of error messages specifies a type of error; the kill tasks request message includes an indication of a type of error derived from the type of error specified in each error message of the first set of error messages; and the derived type of error is relayed through the task cancelation message, the job cancelation message, and the indication of cancelation transmitted to the requesting device.
Claim: 22. The computer-implemented method of claim 19, comprising, within each task container of the set of task containers, and in response to each occurrence of an error in executing, by the at least one processor, the first task routine, performing operations comprising: outputting onto the task kill queue an error message of the first set of error messages; and uninstantiating the task container.
Claim: 23. The computer-implemented method of claim 19, wherein: the error specified as occurring in each error message comprises at least one of an instance of failure of execution, or an instance of a level of a parameter of execution exceeding a threshold limit level during execution; and the parameter of execution of the first task routine comprises at least one of: a level of consumption of a processing resource of the at least one processor by the execution of the first task routine; a level of consumption of a storage resource by the execution of the first task routine; and an amount of time elapsing since commencement of the execution of the first task routine.
Claim: 24. The computer-implemented method of claim 23, wherein the first set of error messages includes status messages that convey an indication of a level of a parameter of execution, by the at least one processor, of the first task routine that are determined, by the at least one processor, to exceed a threshold limit level.
Claim: 25. The computer-implemented method of claim 19, wherein: each task container of the set of task containers is of a first type that supports executions, by the at least one processor, of multiple instances of task routines at least partially in parallel; the at least one processor executes instructions of a resource allocation routine to cause the at least one processor to dynamically allocate multiple containers based on availability of at least one of processing resources and storage resources; and the method comprises, within the performance container, and in response to the output of the task cancelation message onto the task queue, providing, to the resource allocation routine, an indication that fewer task containers of the first type are needed to enable reallocation of resources to other task containers of a second type that supports executions of single instances of task routines.
Claim: 26. The computer-implemented method of claim 19, wherein: the task queue comprises a group sub-queue to which access is shared by the set of task containers, and a set of individual sub-queues; and each individual sub-queue of the set of individual sub-queues is accessible to a different task container of the set of task containers to provide each task container of the set of task containers with a path of communication to exchange messages with the performance container that is not shared with any other task container.
Claim: 27. The computer-implemented method of claim 26, wherein: the group sub-queue is maintained throughout at least the performance of the job flow; each individual sub-queue of the set of individual sub-queues is newly instantiated each time the corresponding task container accedes to executing a task routine that is requested in a task routine execution request message that is output onto the group sub-queue; and the method comprises, within each task container of the set of task containers, in response to receiving the task cancelation message, uninstantiating the corresponding individual sub-queue.
Patent References Cited: 8549536 October 2013 Vasil
8793691 July 2014 Devadhar
9313133 April 2016 Yeddanapudi
9454323 September 2016 Dausner
9577972 February 2017 Word
9760376 September 2017 Bequet
9946719 April 2018 Bowman
9984004 May 2018 Little
9998418 June 2018 Clark
10042886 August 2018 Saadat-Panah
10169121 January 2019 Vibhor
10360053 July 2019 Christensen
10361919 July 2019 Yang
10437689 October 2019 Taubler
10635642 April 2020 Haggerty
10691501 June 2020 Hussain
10846204 November 2020 Vaishnav
11086607 August 2021 Bequet
11086608 August 2021 Bequet
11086671 August 2021 Bequet
11144363 October 2021 Francis Conde
11481245 October 2022 Oliver
20060029068 February 2006 Frank
20130290979 October 2013 Kawano
20130332612 December 2013 Cai
20140040905 February 2014 Tsunoda
20150149745 May 2015 Eble
20150205633 July 2015 Kaptur
20160371122 December 2016 Nair
20200133728 April 2020 Nataraj
Other References: Yildiz et al.; “Fault-Tolerance in Dataflow-based Scientific Workflow Management”; 2010 IEEE 6th World Congress on Services; (Yildiz_2010.pdf; pp. 336-343) (Year: 2010). cited by examiner
Primary Examiner: Patel, Hiren P
Attorney, Agent or Firm: KDW Firm PLLC
Accession Number: edspgr.11748159
Database: USPTO Patent Grants