Introduction
Firms are increasingly virtualizing manual processes (Overby, 2008) as automated services. In this context, service quality, defined by Xu et al. (2013) as “a customer’s global, subjective assessment of the quality of an interaction with a vendor, including the degree to which specific service needs have been met,” is an important concern. To date, much of the literature in service quality considers the challenges inherent in ensuring functional correctness when converting manual processes to automated services (Linton, 2003), improving service quality for physical processes (Mukherjee et al., 1998; Soteriou and Chase, 2000), or leveraging information technology to improve provisioning of a firm’s customer service in general (Ray et al., 2005; Karimi et al., 2001).
As enterprise service delivery platforms, such as SAP and PeopleSoft, grow in scale and are assigned increasingly large sets of responsibilities for service processing, service interruptions have a proportionally larger impact on a firm’s ability to function. When a relatively small set of servers is tasked with handling critical internal and external service tasks, any issue that affects the service delivery platform has the potential to bring entire departments, perhaps even an entire firm, to a standstill (Pang and Whitt, 2009).
Creating effective service quality maintenance processes is an active area of work (Trienekens et al., 2004) and a billion-dollar industry (Oracle, 2008). Quality of Service (QoS) models do not mandate that problems must not occur at all. In fact, modern enterprise technology is complex enough that it is generally accepted that issues will occur. QoS models, typically codified in Service Level Agreements (SLAs), take this as a given and focus on guarantees of recovery, i.e., how quickly issues are resolved and how much downtime a service will suffer. Of primary importance is the speed of resolution, usually quantified by the time to resolution (TTR) metric (Hiles, 2002). Organizations prefer TTRs on the order of minutes, not hours. However, for application support staff, TTRs can be on the order of multiple hours or longer when particularly tricky issues arise. In this work, we collaborated with an application development and support team for a major US media company. In the experience of the media company’s enterprise applications manager, TTR for some production issues on their PeopleSoft platform could run 5 to 8 hours. The resulting service downtime has a significant impact on the organization’s bottom line in terms of lost productivity and lost sales opportunities.
Problem resolution generally follows a four-step workflow (Johnson, 2002), as depicted in Figure 1. In the first step, the issue is logged as a new trouble ticket, and alerts are sent to the appropriate support staff. In the second step, the root cause analysis (RCA) step, the application support staff member assigned to the ticket gathers information aimed at determining why the problem occurred. In the third step, the support staff cleans up any partially completed processes and restarts them to move the impacted business service(s) forward. For example, a partially completed payroll calculation will need to be restarted with its original inputs so that the next scheduled steps can take place as designed. In the fourth step, the application support analyst develops recommendations for preventing the issue’s recurrence in the future, and documents the problem characteristics and resolution process for future use.
Figure 1. Problem resolution workflow
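To make the relationship between the four-step workflow and the TTR metric concrete, the following minimal Python sketch models a trouble ticket moving through the steps. The names `Step`, `Ticket`, `advance`, and `ttr` are hypothetical and introduced only for illustration. The sketch also assumes that TTR is measured from the moment the ticket is logged to the moment the recovery step (step three) completes; the text does not specify where the measurement ends, so this boundary is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Step(Enum):
    """The four workflow steps, in order (names are illustrative)."""
    LOGGED = 1       # ticket opened, support staff alerted
    ROOT_CAUSE = 2   # root cause analysis (RCA)
    RECOVERY = 3     # clean up and restart partially completed processes
    PREVENTION = 4   # recommendations and documentation


@dataclass
class Ticket:
    opened_at: datetime
    step: Step = Step.LOGGED
    resolved_at: Optional[datetime] = None

    def advance(self, now: datetime) -> None:
        """Complete the current step at time `now` and move to the next one."""
        if self.step is Step.PREVENTION:
            raise ValueError("ticket is already in the final step")
        if self.step is Step.RECOVERY:
            # Assumption: service counts as restored once recovery completes.
            self.resolved_at = now
        self.step = Step(self.step.value + 1)

    def ttr(self) -> Optional[timedelta]:
        """Time to resolution, or None while the service is still impacted."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.opened_at
```

Under this assumption, the time spent in the RCA and recovery steps dominates the TTR, while the fourth step happens after service is restored and does not add to downtime.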