
[Bernstein09] Chapter 7. System Recovery

Causes of System Failure

A Model for System Recovery

Introduction to Database Recovery

The System Model

Database Recovery Manager

Shadow-paging Algorithm

Log-based Database Recovery Algorithms

Optimizing Restart in Log-based Algorithms

Media Recovery

Summary

7.1. Causes of System Failure

A critical requirement for most TP systems is that they be up all the time; in other words, highly available. Such systems often are called “24 by 7” (or 24 × 7), since they are intended to run 24 hours per day, 7 days per week. Defining this concept more carefully, we say that a system is available if it is running correctly and yielding the expected results. The availability of a system is defined as the fraction of time that the system is available. Thus, a highly available system is one that, most of the time, is running correctly and yielding expected results.

Availability is reduced by two factors. One is the rate at which the system fails. By fails, we mean the system gives the wrong answer or no answer. Other things being equal, if it fails frequently, it is less available. The second factor is recovery time. Other things being equal, the longer it takes to fix the system after it fails, the less available it is. These concepts are captured in two technical terms: mean time between failures and mean time to repair. The mean time between failures, or MTBF, is the average time the system runs before it fails. MTBF is a measure of system reliability. The mean time to repair, or MTTR, is the average time it takes to fix the system after it does fail. Using these two measures, we can define availability precisely as MTBF/(MTBF + MTTR), which is the fraction of time the system is running. Thus, availability improves when reliability (MTBF) increases and when repair time (MTTR) decreases.
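To make the arithmetic concrete, the following is a minimal Python sketch of the availability formula. The MTBF and MTTR values are hypothetical, chosen only to show that shrinking MTTR raises availability.

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Fraction of time the system is running: MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # A system that runs 1000 hours between failures and takes 1 hour to repair:
    print(availability(1000.0, 1.0))    # ~0.999, roughly "three nines"

    # Halving the repair time improves availability without touching reliability:
    print(availability(1000.0, 0.5))    # ~0.9995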

In many practical settings, the system is designed to meet a service level agreement (SLA), which is typically a combination of availability, response time, and throughput. That is, it is not enough that the system is available. It must also have satisfactory performance. Of course, poor performance may arise from many sources, such as the database system, network, or operating system. Performance problems are sometimes TP-specific, such as the cases of locking performance discussed in Chapter 6. More often, they are specific to other component technologies. These problems are important, but since they are not specific to the TP aspects of the system, we will not consider them here. Instead, we focus entirely on failures and how to recover from them.

Failures come from a variety of sources. We can categorize them as follows:

  • The environment: Effects on the physical environment that surrounds the computer system, such as power, communication, air conditioning, fire, and flood.

  • System management: What people do to manage the system, including vendors doing preventative maintenance and system operators taking care of the system.

  • Hardware: All hardware devices including processors, memory, I/O controllers, storage devices, etc.

  • Software: The operating system, communication systems, database systems, transactional middleware, other system software, and application software.

Let’s look at each category of failures and see how we can reduce their frequency.

Hardening the Environment

One part of the environment is communications systems that are not under the control of the people building the computer system, such as long distance communication provided by a telecommunications company. As a customer of communication services, sometimes one can improve communications reliability by paying more to buy more reliable lines. Otherwise, about all one can do is lease more communication lines than are needed to meet functional and performance goals. For example, if one communication line is needed, lease two independent lines instead, so if one fails, the other one will probably still be operating.

A second aspect of the environment is power. Given its failure rate, it’s often appropriate to have battery backup for the computer system. In the event of power failure, battery backup can at least keep main memory alive, so the system can restart immediately after power is restored without rebooting the operating system, thereby reducing MTTR. Batteries may be able to run the system for a short period, either to provide useful service (thereby increasing MTBF) or to hibernate the system by saving main memory to a persistent storage device (which can improve availability if recovering from hibernation is faster than rebooting). To keep running during longer outages, an uninterruptible power supply (UPS) is needed. A full UPS generally includes a gas or diesel powered generator, which can run the system much longer than batteries. Batteries are still used to keep the system running for a few minutes until the generator can take over.

A third environmental issue is air conditioning. An air conditioning failure can bring down the computer system, so when a computer system requires an air conditioned environment, a redundant air conditioning system is often advisable.

Systems can fail due to natural disasters, such as fire, flood, and earthquake, or due to other extraordinary external events, such as war and vandalism. There are things one can do to defend against some of these events: build buildings that are less susceptible to fire, that are able to withstand strong earthquakes, and that are secured against unauthorized entry. How far one goes depends on the cost of the defense, the benefit to availability, and the cost of downtime to the enterprise. When the system is truly “mission critical,” as in certain military, financial, and transportation applications, an enterprise will go to extraordinary lengths to reduce the probability of such failures. One airline system is housed in an underground bunker.

After hardening the environment, the next step is to replicate the system, ideally in a geographically distant location whose environmental disasters are unlikely to be correlated to those at other replicas. For example, many years ago one California bank built an extra computer facility east of the San Andreas Fault, so they could still operate if their Los Angeles or San Francisco facility were destroyed by an earthquake. More recently, geographical replication has become common practice for large-scale Internet sites. Since a system replica is useful only if it has the data necessary to take over processing for a failed system, data replication is an important enabling technology. Data replication is the subject of Chapter 9.

System Management

System management is another cause of failures. People are part of the system. Everybody has an off day or an occasional lapse of attention. It’s only a matter of time before even the best system operator does something that causes the system to fail.

There are several ways to mitigate the problem. One is simply to design the system so that it doesn’t require maintenance, such as using automated procedures for functions that normally would require operator intervention. Even preventative maintenance, which is done to increase availability by avoiding failures later on, may be a source of downtime. Such procedures should be designed to be done while the system is operating.

Simplifying maintenance procedures also helps, if maintenance can’t be eliminated entirely. So does building redundancy into maintenance procedures, so an operator has to make at least two mistakes to cause the system to malfunction. Training is another factor. This is especially important for maintenance procedures that are needed infrequently. It’s like having a fire drill, where people train for rare events, so when the events do happen, people know what actions to take.

Software installation is often a source of planned failures. The installation of many software products requires rebooting the operating system. Developing installation procedures that don’t require rebooting is a way to improve system reliability.

Many operation errors involve reconfiguring the system. Sometimes adding new machines to a rack or changing the tuning parameters on a database system causes the system to malfunction. Even if it only degrades performance, rather than causing the system to crash, the effect may be the same from the end user's perspective. One can avoid unpleasant surprises by using configuration management tools that simulate a new configuration and demonstrate that it will behave as predicted, or by running test procedures on a test system that show a changed configuration will perform as expected. Moreover, it is valuable to have reconfiguration procedures that can be quickly undone, so that when a mistake is made, one can revert to the previous working configuration quickly.

If a system is not required to be 24 × 7, then scheduled downtime can be used to handle many of these problems, such as preventative maintenance, installing software that requires a reboot, or reconfiguring a system. However, from a vendor’s viewpoint, offering products that require such scheduled downtime limits their market only to customers that don’t need 24 × 7.

Hardware

The third cause of failures is hardware problems. To discuss hardware failures precisely, we need a few technical terms. A fault is an event inside the system that is believed to have caused a failure. A fault can be either transient or permanent. A transient fault is one that does not reoccur if you retry the operation. A permanent fault is not transient; it is repeatable.

The vast majority of hardware faults are transient. If the hardware fails, simply retry the operation; there’s a very good chance it will succeed. For this reason, operating systems have many built-in recovery procedures to handle transient hardware faults. For example, if the operating system issues an I/O operation to a disk or a communications device and gets an error signal back, it normally retries that operation many times before it actually reports an error back to the caller.
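As an illustration of this retry behavior, here is a minimal Python sketch of retrying an operation that may suffer transient faults. The function name, retry count, and delay are hypothetical and do not correspond to any particular operating system interface.

    import time

    def retry_io(operation, max_attempts: int = 5, delay_seconds: float = 0.1):
        """Retry a callable a few times before reporting the error to the caller."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()         # most transient faults vanish on retry
            except IOError:
                if attempt == max_attempts:
                    raise                  # looks permanent: report it to the caller
                time.sleep(delay_seconds)  # brief pause, then try again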

Of course, some hardware faults are permanent. The most serious ones cause the operating system to fail, making the whole system unavailable. In this case, rebooting the operating system may get the system back into a working state. The reboot procedure will detect malfunctioning hardware and try to reconfigure around it. If the reboot fails or the system fails shortly after reboot, then the next step is usually to reimage the disk with a fresh copy of the software, in case it became corrupted. If that doesn’t fix the problem, then repairing the hardware is usually the only option.

Software

This brings us to software failures. The most serious type of software failure is an operating system crash, since it stops the entire computer system. Since many software problems are transient, a reboot often repairs the problem. This involves rebooting the operating system, running software that repairs disk state that might have become inconsistent due to the failure, recovering communications sessions with other systems in a distributed system, and restarting all the application programs. These steps all increase the MTTR and therefore reduce availability, so they should be made as fast as possible. The requirement for faster recovery inspired operating system vendors in the 1990s to incorporate fast file system recovery procedures, since file system recovery was a major component of operating system boot time. Some operating systems are carefully engineered for fast boot. For example, highly available communication systems have operating systems that reboot in under a minute, worst case. Taking this goal to the extreme, if the repair time were zero, then failures wouldn't matter, since the system would recover instantaneously and the user would never know the difference. Clearly, reducing repair time can have a big impact on availability.

Some software failures merely degrade a system's capabilities rather than causing it to fail entirely. For example, consider an application that offers functions that require access to a remote service. When the remote service is unavailable, those functions stop working. However, through careful application design, other application functions can still be operational. That is, the system degrades gracefully when parts of it stop working. A real example we know of is an application that used a TP database and a data warehouse, where the latter was nice to have but not mission-critical. The application was not designed to degrade gracefully, so when the data warehouse failed, the entire application became unavailable, which caused a large and unnecessary loss of revenue.
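The graceful-degradation pattern can be sketched as follows in Python. The tp_db and warehouse objects are hypothetical stand-ins for the mission-critical TP database and the nice-to-have data warehouse described above; the point is simply that a failure of the non-critical dependency is caught so the critical path keeps working.

    def place_order(tp_db, warehouse, order):
        tp_db.insert_order(order)                  # mission-critical path must succeed
        try:
            warehouse.record_for_analytics(order)  # nice-to-have analytics path
        except ConnectionError:
            # Warehouse is down: note it and continue rather than failing the order.
            print("warehouse unavailable; analytics update skipped")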

When an application process or database system does fail, the failure must be detected and the application or database system process must be recovered. This is where TP-specific techniques become relevant.
