Automating Operations via Closed-Loop Remediation


November 13, 2020

It is hard enough to run an operations center in the best of times, especially in large, complex environments supporting myriad applications. Some of the many challenges are:

  • Very tight windows for Level 1 support to resolve an issue before it has to be escalated to Level 2
  • Insufficient information for triage compounded by out-of-date documentation
  • Many disparate teams with limited collaboration across disparate systems, such as operating systems and element management systems
  • Proliferation of applications that need to be supported
  • Static or declining budgets

Now throw in the current set of challenges with personnel being remote, and the problems are compounded. The ability to "tap a shoulder" or hold a "conference room huddle," while not always the most efficient approach to begin with, is no longer an option. The lack of in-person coordination often leads to:

  • Greater volumes of tickets, in the form of incidents and service requests/change requests (SRs/CRs)
  • Degradations in overall performance
  • Negative impacts on availability, response times and mean time to resolution

What's an operations leader to do?

Incident management, like firefighting, requires instruments of detection (e.g., a smoke detector) and remediation (e.g., a fire extinguisher). Prior to the pandemic, this was achieved through a mix of manual processes and automated root-cause analysis, processes and runbooks. With a crisis taking hold and conditions exacerbated, automation needs to go further. For the detection phase of the incident management life cycle, this means collecting all types of performance data and building real-time IT service models that train machine learning algorithms to predict and eliminate outages. Zenoss Cloud is the first SaaS-based intelligent IT operations management platform that streams and normalizes all machine data, uniquely enabling the emergence of context for preventing service disruptions.

To ready the issue for remediation, the incident generator should be integrated with an IT service management (ITSM) system where the issue is registered in the form of a ticket. An API-based integration with an intelligent data collection platform dramatically reduces alert noise and allows IT Ops and ITSM teams to focus on up-to-date, accurate and actionable information, available at all times, to initiate the resolution process quickly and minimize the negative impacts on the business.
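As a rough illustration of how an integration might cut alert noise before a ticket is registered, the sketch below collapses repeated alerts for the same device and check, then builds one ticket payload per remaining alert. The `Alert` type, the severity scale and the payload field names are assumptions made for this example; they are not the Zenoss or ITSM vendor APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    device: str
    check: str
    severity: int  # assumed scale: higher means more severe
    message: str


def deduplicate(alerts):
    """Collapse repeated alerts per (device, check), keeping the most severe."""
    worst = {}
    for alert in alerts:
        key = (alert.device, alert.check)
        if key not in worst or alert.severity > worst[key].severity:
            worst[key] = alert
    return list(worst.values())


def to_ticket(alert):
    """Build a hypothetical ticket payload; field names are illustrative only."""
    return {
        "short_description": f"{alert.check} on {alert.device}",
        "description": alert.message,
        "urgency": "high" if alert.severity >= 4 else "normal",
    }


alerts = [
    Alert("db01", "disk_usage", 3, "Disk 85% full"),
    Alert("db01", "disk_usage", 5, "Disk 97% full"),
    Alert("web02", "http_latency", 2, "p95 latency 900 ms"),
]
tickets = [to_ticket(a) for a in deduplicate(alerts)]
```

Three raw alerts become two tickets here, with the more severe disk alert kept. A real integration would send each payload to the ITSM system's own API.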

From there, it would typically be picked up by an operator who would:

  • Analyze the data to recognize the problem
  • Determine the root cause and separate it from the symptoms
  • Identify a remedial course of action
  • Execute that procedure
  • Validate success
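The operator steps above can be sketched as a single loop. This is a minimal sketch under stated assumptions: the symptom names, the cause-to-runbook mapping and the in-memory `state` dictionary are all hypothetical stand-ins for real monitoring data and real systems.

```python
# Simulated environment so the example runs on its own (assumption).
state = {"web02": "down"}


def diagnose(symptoms):
    """Toy root-cause step: map observed symptoms to one underlying cause."""
    if "process_missing" in symptoms:
        return "service_crashed"
    return None


def execute(runbook, target):
    """Pretend to run a runbook against a target."""
    if runbook == "restart_service":
        state[target] = "up"


def is_healthy(target):
    """Validation check after remediation."""
    return state.get(target) == "up"


def remediate(incident, runbooks):
    """Analyze, find the cause, pick an action, execute it, validate."""
    cause = diagnose(incident["symptoms"])
    runbook = runbooks.get(cause)
    if runbook is None:
        return "escalate"  # no known remedy, hand over to a human
    execute(runbook, incident["target"])
    return "resolved" if is_healthy(incident["target"]) else "escalate"


runbooks = {"service_crashed": "restart_service"}
outcome = remediate(
    {"target": "web02", "symptoms": ["http_5xx", "process_missing"]}, runbooks
)
unknown = remediate({"target": "db01", "symptoms": ["slow_queries"]}, runbooks)
```

The validation step is what closes the loop: an incident counts as resolved only if the health check passes, and anything else falls back to escalation.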

Initially, this procedure was incorporated into runbooks, and entire scripts could be executed in an automated fashion, giving rise to the term "runbook automation." The many runbook actions were condensed into a single execution step. In this paradigm, however, the remaining steps (analysis, root-cause determination, choice of remedy and validation) were still manual and required operator intelligence. Automating those steps as well is referred to as intelligent automation.
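To make "condensed into a single execution step" concrete, here is a small sketch in which a runbook is an ordered list of named steps and one call runs them all. The step names and the disk cleanup scenario are invented for illustration.

```python
def run_runbook(steps, context):
    """Run each (name, action) step in order; stop at the first failure."""
    log = []
    for name, action in steps:
        ok = action(context)
        log.append((name, ok))
        if not ok:
            break
    return log


def check_disk(ctx):
    # Proceed only if the disk really is over the threshold.
    return ctx["disk_used_pct"] > 90


def purge_tmp(ctx):
    # Simulated cleanup: frees 20 percentage points of disk.
    ctx["disk_used_pct"] -= 20
    return True


def verify(ctx):
    return ctx["disk_used_pct"] <= 90


context = {"disk_used_pct": 95}
log = run_runbook(
    [("check_disk", check_disk), ("purge_tmp", purge_tmp), ("verify", verify)],
    context,
)
```

Note that someone still has to decide which runbook to run and when, which is the manual part the paragraph above describes.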

Intelligent automation matters most for SRs/CRs, which are entered by humans in free form, because handling them requires natural language processing (NLP). The key is the capacity to understand the intent of the request and the associated entities, something that has traditionally required a subject matter expert (SME), even if, for the more mundane requests, that SME is a Level 1 support technician. These capabilities are typically beyond the scope of traditional robotic process automation/runbook automation tools, which have a static, rules-based approach, standardized inputs and structures, and sequential control flow. Enter artificial intelligence (AI) and NLP, which power the engine underlying DRYiCE's intelligent automation solution, DRYiCE iAutomate.
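To show what "intent" and "entities" mean for a free-form request, the sketch below extracts both from a sample sentence. It deliberately uses keyword matching and regular expressions, which is exactly the kind of static, rules-based approach described above as insufficient; a real system would use trained NLP models. The intent names and patterns are assumptions for this example and say nothing about how iAutomate works internally.

```python
import re

# Hypothetical intents, each triggered when all of its keywords appear.
INTENT_KEYWORDS = {
    "reset_password": ("reset", "password"),
    "extend_disk": ("disk", "space"),
}


def parse_request(text):
    """Return the detected intent and entities for a free-form request."""
    lowered = text.lower()
    intent = None
    for name, words in INTENT_KEYWORDS.items():
        if all(word in lowered for word in words):
            intent = name
            break
    entities = {
        # Assumed host naming scheme: letters followed by digits, e.g. web02.
        "hosts": re.findall(r"\b[a-z]+\d+\b", lowered),
        # Assumed phrasing: the word "user" followed by an account name.
        "users": re.findall(r"\buser\s+(\w+)", lowered),
    }
    return {"intent": intent, "entities": entities}


parsed = parse_request("Please reset the password for user jsmith on web02")
```

A request phrased differently, such as "I can't log in anymore," would yield no intent here, which illustrates why fixed rules struggle with human-written tickets.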

iAutomate provides incident remediation and task automation using AI, proprietary NLP algorithms, and knowledge analysis across the infrastructure and applications landscape. It comes with a repository of more than 1,500 configurable and reusable runbooks that enable rapid implementation and faster time to value. Like Zenoss, it integrates with widely used ITSM solutions such as ServiceNow and BMC Remedy. As a first step, it evaluates ticket data to identify potential automation candidates and configures or customizes runbooks for them. Human agents can then rely on iAutomate to resolve user issues in tandem with the existing ITSM platform. Over time, the solution learns from the operator's actions and can be elevated to a fully automatic mode in which it resolves user issues without human intervention.
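One simple way to think about identifying automation candidates from ticket data is to rank ticket categories by volume. The sketch below does only that; the `category` field, the sample data and the frequency threshold are assumptions, and the product's actual evaluation logic is not described in this article.

```python
from collections import Counter


def automation_candidates(tickets, min_count=2):
    """Rank ticket categories by volume, keeping those seen at least min_count times."""
    counts = Counter(ticket["category"] for ticket in tickets)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]


# Hypothetical ticket history.
history = [
    {"category": "disk_full"},
    {"category": "password_reset"},
    {"category": "disk_full"},
    {"category": "password_reset"},
    {"category": "disk_full"},
    {"category": "vpn_issue"},
]
candidates = automation_candidates(history)
```

High-volume, repetitive categories are natural first targets, since each automated runbook then removes the most manual work.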


Raj Jathar

Raj Jathar, Global Head - Technical Sales, Solution Engineering and Customer Success, DRYiCE Software

Raj has over 25 years of experience in technology and IT, in both pre- and post-sales leadership roles in a variety of capacities, including Technical Sales, Customer Success, Professional Services, and Product Management. Prior to DRYiCE Software, Raj was with CloudHealth Tech, SevOne, and CA Tech.