Mean time between failures is the average amount of time that passes between catastrophic failures of a computer system. To calculate it, a system needs to operate and then fail. The system is then repaired and put back into operation, where it will eventually fail again. The time between these two failures is the first value in the mean. As the system accumulates more failures, the mean becomes more accurate.
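The calculation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names and the failure times are hypothetical, times are given in operating hours, and repair time is ignored so that each interval runs from one failure to the next.

```python
from statistics import mean


def intervals_between_failures(failure_times):
    """Return the gaps between consecutive failure times (same unit as the input)."""
    ordered = sorted(failure_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def mean_time_between_failures(failure_times):
    """Arithmetic mean of the gaps; at least two failures are required."""
    gaps = intervals_between_failures(failure_times)
    if not gaps:
        raise ValueError("at least two failures are needed to measure an interval")
    return mean(gaps)


# Hypothetical failure times, in operating hours since installation.
failure_hours = [1000, 2100, 3500, 4300]

print(intervals_between_failures(failure_hours))  # [1100, 1400, 800]
print(mean_time_between_failures(failure_hours))  # 1100
```

With only two failures the result rests on a single interval. Each additional failure adds another interval to the average.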
Two basic terms make up this concept: mean time and failure. The mean in question is the arithmetic mean, better known as an average. As with all averages, the more values that go into the calculation, the more reliable the result. Since individual computer systems fail only rarely, the mean time is generally averaged over a large number of different systems that are built and used in a similar manner.
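As a rough sketch of how intervals from many similar systems might be combined, the Python below pools every observed interval into one arithmetic mean. The system names, the interval values, and the `pooled_mean_time` helper are all made up for illustration.

```python
def pooled_mean_time(fleet_intervals):
    """Average every observed failure interval across all systems in the fleet."""
    all_gaps = [gap for gaps in fleet_intervals.values() for gap in gaps]
    if not all_gaps:
        raise ValueError("no failure intervals recorded")
    return sum(all_gaps) / len(all_gaps)


# Hypothetical intervals between failures, in operating hours, per system.
fleet = {
    "server-a": [1100, 1400],
    "server-b": [900],
    "server-c": [1300, 1000, 1500],
}

print(pooled_mean_time(fleet))  # 1200.0
```

No single system here has more than three intervals, but together they supply six values for the average.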
The other big part of the term is failure. In computer terms, there are many different types of failure. In this case, the failure is a total system shutdown: the system is broken beyond its ability to continue operating and must be repaired before it can go back into service. If a single part of the computer fails, such as one memory module, it is not counted as a failure when calculating mean time between failures. In addition, scheduled downtime, such as maintenance, is not a failure.
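The rules above amount to a filter over a system's event log. The following Python is a minimal sketch of that filter. The `DowntimeEvent` record and its fields are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEvent:
    hour: float         # when the event happened, in operating hours
    system_down: bool   # True if the whole system stopped running
    scheduled: bool     # True for planned downtime such as maintenance


def counts_as_failure(event):
    """Only unplanned, total shutdowns count toward mean time between failures."""
    return event.system_down and not event.scheduled


# Hypothetical event log for one system.
events = [
    DowntimeEvent(hour=1000, system_down=True, scheduled=False),   # crash
    DowntimeEvent(hour=1500, system_down=False, scheduled=False),  # one memory module failed
    DowntimeEvent(hour=2000, system_down=True, scheduled=True),    # planned maintenance
    DowntimeEvent(hour=2100, system_down=True, scheduled=False),   # crash
]

failure_hours = [event.hour for event in events if counts_as_failure(event)]
print(failure_hours)  # [1000, 2100]
```

Only the two crashes remain, so this log contributes a single interval of 1,100 hours to the mean.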
Mean time between failures values are often used as an early warning sign of undiagnosed hardware issues. If a system’s mean time between failures is very low, there is likely a problem somewhere in the system. Computer designers also look at what caused each failure, not just the length of time between failures. This gives a clearer indication of where the problem may lie and what needs to be done to fix it.
Maintenance personnel use the mean time between failures to plan their system maintenance schedules. If one system has been running for nearly its mean time since its last failure while another is months away from that point, it is easier to determine which system to work on first. A full overhaul and check-up won’t technically reset a system’s mean time, but it should create longer intervals between failures, effectively pushing the mean time higher.
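One way to turn this into a maintenance queue is to rank systems by how many hours remain before they reach their mean time between failures. The Python below is a sketch under that assumption. The system names, the figures, and the `maintenance_order` helper are hypothetical, and the mean time is treated as a rough guide, not a prediction of when a failure will occur.

```python
def maintenance_order(systems):
    """Sort system names by hours remaining until each reaches its mean time."""
    def hours_remaining(name):
        mean_time_hours, hours_since_last_failure = systems[name]
        return mean_time_hours - hours_since_last_failure

    return sorted(systems, key=hours_remaining)


# Hypothetical data: (mean time between failures, hours since last failure).
systems = {
    "server-a": (1100, 1050),  # 50 hours short of its mean time
    "server-b": (1200, 300),   # 900 hours short of its mean time
    "server-c": (900, 950),    # already 50 hours past its mean time
}

print(maintenance_order(systems))  # ['server-c', 'server-a', 'server-b']
```

Systems that have already run past their mean time get a negative value and sort to the front of the queue.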
Mean time between failures is just one of many values used in the computer and manufacturing industries to describe system failures. Other common measures include mean time to failure, which is how long a system takes to fail catastrophically for the first time, and mean time between critical failures, which covers failures that are important but do not take the system offline. There is also mean time between unit replacement, which measures the average time before one system needs to be replaced by another.