1. Introduction
It is somewhat ironic that users and organizations hesitate to apply patches, whose stated purpose is to support availability or reliability, precisely because the process of doing so can lead to downtime (both from the patching process itself and from unanticipated issues with the patch). Periodic reboots in desktop systems, irrespective of the vendor, are at best annoying. Reboots in enterprise environments (e.g., trading, e-commerce, core network systems), even for a few minutes, imply large revenue loss or require an extensive backup and failover infrastructure with rolling updates to mitigate such loss.
We ask whether this de facto acceptance of significant downtime and redundant infrastructure should be abandoned in favor of a reliable hot patching process.
Software, the product of an inherently human process, remains a flawed and incomplete artifact. This reality leads to the uncomfortable inevitability of future fixes, upgrades, and enhancements. Given the way such fixes are currently applied (i.e., patch and reboot), developers accept downtime as a foregone conclusion even as the software is released — and deployers who resist downtime resist the patches.
While patches themselves are a necessity, we believe that the process of applying them remains rather crude. First, the target process is terminated; the new binary and corresponding libraries (if any) are then written over the older versions; the system is restarted if necessary; and finally the upgraded application begins execution. Besides the appreciable loss in uptime, all context held by the application is also lost, unless the application has saved its state to persistent storage (Candea & Fox, 2003; Brown & Patterson, 2002) and later restores it (which is expensive to design for, implement, and execute). In the case of mission-critical services, even after a major flaw is unveiled and a patch subsequently created, administrators must choose between security (applying the patch) and availability (keeping the service running). This conundrum serves as our motivation for hot patching: applying a patch without restarting the program, and thus without losing state or time. We focus on systems, such as those found in the cyber infrastructure for the power grid, that require high availability and store significant state (which would be lost on a restart).