What is Fault Tolerance?

Fault tolerance is the ability of a system to continue operating without interruption when one or more of its components fail. By designing robust systems that can handle unexpected issues, we ensure reliability and continuous service. How does this resilience shape the technology you rely on every day? Join us as we examine the impact of fault tolerance on your digital world.
Troy Holmes
Troy Holmes

Most important computer applications require a design that includes several redundant components. This fault tolerant design typically includes hardware, software, power backup, and network fail-safe measures. Fault tolerance is a design that ensures a computer application will remain functioning in the event of catastrophic failure. Most banks, governments, and utility companies use this type design for their critical applications.

Power fault tolerance is an engineering design that provides multiple power inlets to computer equipment. Some examples of power redundancy include multiple power circuits, power inlet providers, or battery backup systems. This system will automatically turn on back up power if electrical service is lost.

Man holding computer
Man holding computer

Back up power plans designed to preserve computer systems typically include fuel-powered generators and large battery units. When a data center loses electrical power, the generator system automatically becomes active. These buildings can typically maintain power for several weeks without impacting overall performance.

Hardware fault tolerance is a design that distributes business processes over multiple computer components. This enables an application to remain functional when a piece of equipment fails due to mechanical problems. A clustered database is an example of a use of fault tolerant hardware. In this design, a physical database is distributed and replicated over multiple hardware devices. If any equipment fails within the cluster, the database remains active because it is distributed across multiple hardware units.

Network fault tolerance is another example of redundancy in a computer system. Most data center operations include network fault tolerant configurations. This requires the use of multiple telecommunication vendors and phone lines into a building. In the event of a complete failure by one vendor, the other network providers would automatically replace it. This type of configuration typically requires two active telecommunication lines within one physical building.

Many large organization and government agencies require fault tolerance to support their physical infrastructure. This ensures that catastrophic events to include power damage and network destruction do not impact the business operations of these organizations. While fault tolerance does not guarantee applications won't fail, it does reduce the likelihood of complete system failure due to computer issues.

The most critical government institutions include fault tolerance for entire business units. This typically includes relocation of personnel, equipment, and resources that can sustain natural disasters for extended periods of time. This type of fault tolerant solution is typically located deep underground, where natural disasters have little impact on physical operations.

You might also Like

Discuss this Article

Post your comments
Forgot password?
    • Man holding computer
      Man holding computer