Adaptive fault diagnosis and resolution system for enterprise data replication system using deep reinforcement learning

PhD Thesis


Wee, Chee Keong. 2022. Adaptive fault diagnosis and resolution system for enterprise data replication system using deep reinforcement learning. PhD Thesis Doctor of Philosophy. University of Southern Queensland. https://doi.org/10.26192/q7486
Type: PhD Thesis
Author: Wee, Chee Keong
Supervisors:
1. First: Prof Xujuan Zhou
2. Second: Prof Raj Gururajan
2. Second: Prof Xiaohui Tao
Institution of Origin: University of Southern Queensland
Qualification Name: Doctor of Philosophy
Number of Pages: 177
Year: 2022
Publisher: University of Southern Queensland
Place of Publication: Australia
Digital Object Identifier (DOI): https://doi.org/10.26192/q7486
Abstract

Modern business IT systems in large organisations collaborate and interoperate closely to support various business functions. In heterogeneous IT systems, data is one of the most important entities and is constantly exchanged. Data exchange among these collaborating IT systems can occur in near real-time or in batches, and the systems are arranged in either hierarchical or mesh relationships. There are several ways of conducting these data transfers, one of which is to use data replication software. Maintaining both the business IT systems and the data replication services is a constant challenge for IT administrators, and for mission-critical systems that demand 24x7 uptime, the data replication services are expected to operate to a high standard with minimal downtime.

The job of the IT administrator is to maintain and support all the IT systems and infrastructure to meet the expected service level agreement (SLA). This includes monitoring the IT systems and data replications for anomalies or defects and rectifying them as soon as possible to minimize downtime. However, humans need rest, are prone to fatigue, and are unable to scale their operational work effectively. Therefore, an alternative is needed to overcome these limits.

The goal of this thesis is to meet this challenge by developing a novel autonomous and adaptive system for monitoring and proactively rectifying technical problems encountered in the data replication environment. The approach draws on research in deep learning and reinforcement learning to build agents that take appropriate actions to rectify faults in the data replication environment so as to maximize cumulative reward. The proposed system goes through a series of learning cycles: it starts by learning through trial and error while interacting with the data replication environment, then gradually learns to predict suitable fault-resolution actions and their associated success scores. It refines and builds up its knowledge base incrementally, and for any fault that it cannot resolve, it calls on an IT administrator for help, which enriches its knowledge base at the same time. The approach is novel because there is no precedent for the use of reinforcement learning in software fault diagnosis and resolution for near real-time data replication. The result will be an autonomous fault diagnosis and rectification system that can function at close to the troubleshooting level of a human IT administrator in supporting the data replication environment. It is evaluated by comparing the cost-function results of the intelligent agents' fault diagnosis and resolution against those of guiding software routines that perform similar activities.
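
The learning cycle described above (trial and error, then acting on learned success scores, with escalation to an administrator for unresolved faults) can be illustrated with a minimal sketch. This is not the thesis's implementation: it swaps the deep reinforcement learning agent for a one-step tabular value update, and the fault names, action names, reward values, and the `escalate_to_admin` fallback are all hypothetical placeholders chosen for illustration.

```python
import random

# Hypothetical faults and the repair action that fixes each one.
# None of these names come from the thesis or from any real product.
FIXES = {
    "queue_stalled": "flush_queue",
    "process_down": "restart_process",
    "config_mismatch": "reload_config",
}
ACTIONS = sorted(set(FIXES.values()))


def train(episodes=2000, alpha=0.1, epsilon=0.2, seed=0):
    """Learn a success score for every (fault, action) pair by trial and error."""
    rng = random.Random(seed)
    q = {fault: {a: 0.0 for a in ACTIONS} for fault in FIXES}
    faults = sorted(FIXES)
    for _ in range(episodes):
        fault = rng.choice(faults)  # the simulated environment raises a fault
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore: try a random action
        else:
            action = max(ACTIONS, key=lambda a: q[fault][a])  # exploit best score
        # Assumed reward scheme: +1 if the action fixes the fault, -1 otherwise.
        reward = 1.0 if FIXES[fault] == action else -1.0
        # One-step update: move the score towards the observed reward.
        q[fault][action] += alpha * (reward - q[fault][action])
    return q


def resolve(q, fault, threshold=0.5):
    """Return the learned action, or escalate when no action is trusted enough."""
    if fault not in q:
        return "escalate_to_admin"
    best = max(q[fault], key=q[fault].get)
    return best if q[fault][best] >= threshold else "escalate_to_admin"
```

Calling `resolve(train(), "process_down")` returns the action the agent has learned for that fault, while a fault it has never seen, or one whose best score is still below the threshold, is handed to the administrator. In a fuller system that administrator's fix would then be fed back into the score table.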

The contributions of this thesis fall into two main groups: adaptively intelligent fault diagnosis and adaptively intelligent fault resolution. The first is an adaptive self-learning approach that learns to diagnose service outages across a multitude of software services, which cannot be achieved through manual IT system administration. This has significant benefits because it overcomes the limits of traditional rule-based diagnostic procedures, which are restricted to the set of pre-assigned rules they were designed for. It has the flexibility to scale and extend its coverage adaptively. In the second, the self-learning approach is used to resolve the specific software faults discovered in the diagnosis phase. This gives it an edge over rule-based fault resolution procedures, which depend on predefined rules and conditions to act and are limited in scalability and adaptiveness. Given the complexity of a large enterprise data replication setup, with tens of thousands of software configurations and parameters and a high volume of statistics and log output, this thesis can contribute significant value to the IT support and management community by helping it automate operations intelligently.

Keywords: reinforcement learning, enterprise data replication, shareplex, database management, data warehousing
ANZSRC Field of Research 2020: 460201. Artificial life and complex adaptive systems
Public Notes

File reproduced in accordance with the copyright policy of the publisher/author.

Byline Affiliations: School of Business
Permalink: https://research.usq.edu.au/item/q7486/adaptive-fault-diagnosis-and-resolution-system-for-enterprise-data-replication-system-using-deep-reinforcement-learning

Download files


Published Version
Chee Keong Wee_Redacted.pdf
License: CC BY 4.0
File access level: Anyone


Related outputs

Adaptive Fault Resolution for Database Replication Systems
Wee, Chee Keong, Zhou, Xujuan, Gururajan, Raj, Tao, Xiaohui and Wee, Nathan. 2022. "Adaptive Fault Resolution for Database Replication Systems." Li, Bohan, Yue, Lin, Jiang, Jing, Chen, Weitong, Li, Xue, Long, Guodong, Fang, Fei and Yu, Han (eds.) 17th International Conference on Advanced Data Mining and Applications (ADMA 2021). Sydney, Australia, 02-04 Feb 2022. Berlin: Springer. https://doi.org/10.1007/978-3-030-95405-5_26
Adaptive fault diagnosis for data replication systems
Wee, Chee Keong and Wee, Nathan. 2021. "Adaptive fault diagnosis for data replication systems." Qiao, Miao, Vossen, Gottfried, Wang, Sen and Li, Lei (eds.) 32nd Australasian Database Conference: Database Theory and Applications (ADC 2021). Dunedin, New Zealand, 29 Jan - 05 Feb 2021. Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-030-69377-0_11