Building Fault-Tolerant Automation Systems: A Case Study in Enterprise IT Resilience.

Bibliographic Details
Title: Building Fault-Tolerant Automation Systems: A Case Study in Enterprise IT Resilience.
Authors: Pagidipalli, Peda Venkata Rao
Source: International Journal of Computational & Experimental Science & Engineering (IJCESEN); 2025, Vol. 11 Issue 4, p9959-9967, 9p
Subject Terms: FAULT-tolerant computing, AUTOMATION, COMPUTER network architectures, HETEROGENEOUS computing, INFORMATION technology
Abstract: Enterprise-grade fault-tolerant automation architectures are essential for preserving operational continuity in mission-critical IT support operations. The revenue impact of downtime is especially evident in industries such as financial services and supply chain ecosystems. This article reviews a completed real-world deployment of a highly available workload orchestration platform built on BMC Control-M, Tidal Enterprise Scheduler, and Kubernetes-based container orchestration. The implementation used microservices decomposition patterns, a Zero Trust API gateway architecture, and real-time telemetry pipelines (via Splunk and AppDynamics). Technical challenges included: (1) stateful workload migration management; (2) active-active failover topologies; and (3) orchestration of Unix agent deployment across heterogeneous compute platforms. The architecture incorporates Oracle RAC configurations, a message queue persistence layer, and circuit breaker design patterns to reduce the risk of cascading failure. Operational metrics show significant improvements in throughput capacity, mean time between failures, and security incident response time. The implementation validates that combining a modern container orchestration platform with an established enterprise scheduling platform provides a resilient, fault-tolerant automation infrastructure. Environmental benefits came from dynamic resource provisioning algorithms that reduced idle compute overhead. Economic gains stemmed from eliminating manual intervention costs and improving service level agreement adherence. These advances further support broader digital infrastructure modernization efforts as part of the Federal Resilience Framework. [ABSTRACT FROM AUTHOR]
Copyright of International Journal of Computational & Experimental Science & Engineering (IJCESEN) is the property of Journal of Computational Experimental Science, Engineering Experimental Science & Engineering and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
Description
ISSN: 2149-9144
DOI: 10.22399/ijcesen.4585