Adaptive fault diagnosis and resolution system for enterprise data replication system using deep reinforcement learning
PhD Thesis
Title | Adaptive fault diagnosis and resolution system for enterprise data replication system using deep reinforcement learning |
---|---|
Type | PhD Thesis |
Authors | |
Author | Wee, Chee Keong |
Supervisor | |
1. First | Prof Xujuan Zhou |
2. Second | Prof Raj Gururajan |
2. Second | Prof Xiaohui Tao |
Institution of Origin | University of Southern Queensland |
Qualification Name | Doctor of Philosophy |
Number of Pages | 177 |
Year | 2022 |
Publisher | University of Southern Queensland |
Place of Publication | Australia |
Digital Object Identifier (DOI) | https://doi.org/10.26192/q7486 |
Abstract | Modern business IT systems in large organisations have high levels of collaboration and interoperability to support various business functions. In heterogeneous IT systems, data is one of the most important entities that are constantly exchanged. Data exchange among these collaborating IT systems can occur in near real time or in batches, and the systems are arranged in either hierarchical or mesh relationships. There are several ways of conducting these data transfers, and one of them is to use data replication software. Maintaining both the business IT systems and the data replication services is always a challenge for IT administrators. With mission-critical systems that demand 24x7 uptime, the data replication services are expected to deliver a high operational standard to the organisation with minimum downtime. The job of the IT administrator is to maintain and support all the IT systems and infrastructure to meet the expected service level agreement (SLA). This includes monitoring the IT systems and data replication for anomalies or defects and rectifying them as soon as possible to minimise downtime. However, humans need rest, are prone to fatigue, and are unable to scale their operational work effectively. Therefore, an alternative is needed to overcome these limits. The goal of this thesis is to meet this challenge by developing a novel autonomous and adaptive system for monitoring and proactively rectifying technical problems encountered in the data replication environment. This approach draws on research in deep learning and reinforcement learning to build agents that take appropriate actions to rectify faults encountered in the data replication environment so as to maximise cumulative reward. 
The proposed system will go through a series of learning cycles. It starts by learning through trial and error while interacting with the data replication environment, then gradually learns to predict the course of fault resolution actions and their associated success scores. It will refine and build up its knowledge base incrementally. For any fault that it cannot resolve, it will need an IT administrator to help, which enriches its knowledge base at the same time. The approach is novel, as there is no precedent for the use of reinforcement learning in software fault diagnosis and resolution for near real-time data replication. The result will be an autonomous fault diagnosis and rectification system that can function at a level close to a human IT administrator's troubleshooting skills to support the data replication environment. It is evaluated by comparing the cost-function results of the intelligent agents' fault diagnosis and resolution against those of guiding software routines that perform similar activities. The contributions of this thesis fall into two main groups: adaptive intelligent fault diagnosis and adaptive intelligent fault resolution. The first group develops an adaptive self-learning approach that can learn to diagnose service outages across the multitude of software services, which cannot be ascertained by manual IT system administration. This has significant benefits because it departs from traditional rule-based diagnostic procedures, which are limited to the set of pre-assigned rules they are strictly designed for. It has the flexibility to scale and augment its coverage adaptively. In the second group, the self-learning approach is used to adaptively resolve the specific software faults discovered in the diagnosis phase. This gives it an edge over rule-based fault resolution procedures, which depend on predefined rules and conditions to act and are limited in scalability and adaptiveness. 
Given the complexity of a large enterprise data replication setup, with tens of thousands of software configuration settings and parameters and a high volume of statistics and log output, this thesis can contribute significant value to the IT support and management community by helping to automate their operations intelligently. |
Keywords | reinforcement learning, enterprise data replication, SharePlex, database management, data warehousing |
ANZSRC Field of Research 2020 | 460201. Artificial life and complex adaptive systems |
Public Notes | File reproduced in accordance with the copyright policy of the publisher/author. |
Byline Affiliations | School of Business |
https://research.usq.edu.au/item/q7486/adaptive-fault-diagnosis-and-resolution-system-for-enterprise-data-replication-system-using-deep-reinforcement-learning
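
The abstract describes an agent that first learns by trial and error against the replication environment and is rewarded for actions that resolve faults. As a rough illustration of that idea only, the sketch below uses plain tabular Q-learning rather than the deep reinforcement learning developed in the thesis. The fault names, action names, reward values, and the `step` environment are all hypothetical and are not taken from the thesis or from any replication product.

```python
import random

# Hypothetical fault types and remediation actions, invented for illustration.
FAULTS = ["process_down", "queue_backlog", "disk_full"]
ACTIONS = ["restart_process", "flush_queue", "free_disk_space"]

# Assumed ground truth, used only by the toy environment below.
CORRECT_ACTION = {
    "process_down": "restart_process",
    "queue_backlog": "flush_queue",
    "disk_full": "free_disk_space",
}


def step(fault, action):
    """Toy environment: +1 and resolved if the action fits the fault, else -1."""
    if CORRECT_ACTION[fault] == action:
        return 1.0, True
    return -1.0, False


def train(episodes=600, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=10, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {(f, a): 0.0 for f in FAULTS for a in ACTIONS}
    for episode in range(episodes):
        # Cycle through the fault types so each one is seen equally often.
        fault = FAULTS[episode % len(FAULTS)]
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(fault, a)])  # exploit
            reward, resolved = step(fault, action)
            # An unresolved fault persists, so the next state is the same fault.
            future = 0.0 if resolved else max(q[(fault, a)] for a in ACTIONS)
            target = reward + gamma * future
            q[(fault, action)] += alpha * (target - q[(fault, action)])
            if resolved:
                break
    return q


def greedy_policy(q):
    """Pick the highest-valued action for each fault."""
    return {f: max(ACTIONS, key=lambda a: q[(f, a)]) for f in FAULTS}


if __name__ == "__main__":
    for fault, action in greedy_policy(train()).items():
        print(f"{fault}: {action}")
```

In this toy setting the learned greedy policy maps each fault to the action that resolves it. A real system of the kind the abstract describes would replace the lookup table with a learned function over replication statistics and logs, and would hand unresolved faults to an administrator.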