
What is Fault Tolerance?
Achieve Fault Tolerance for Critical Systems
Commonly referred to as "FT," fault tolerance is the ability of a computing system or component to continue normal operation even when hardware or software failures have occurred. Fault tolerance is normally sought after in high availability or critical systems.
A system is considered fault tolerant when it is fully redundant and has maximum up-time. Should a component of the network fail for any reason, the network will continue processing requests and clients will not even be aware of the failure. Since a fault tolerant system has multiple critical components such as CPUs, memory, disks and power supplies in the same computer, it's important to ensure that if one component fails, another one will seamlessly take over.
Some basic characteristics of fault tolerance include:
- No single point of repair
- Fault isolation to the failing component
- Availability of reversion modes
- Fault containment to prevent failure
Fault tolerant systems are also characterized in terms of both planned and unplanned service outages. These systems are typically based on the concept of redundancy. Usually fault tolerance is achieved through redundant elements that quickly and automatically replicate activities of failed elements. Many fault tolerant computer systems mirror all operations. In other words, every operation is performed on two or more duplicate systems, so if one fails the other will take over. In the case of an individual system, fault tolerance is achieved by anticipating exceptional conditions and building the system to cope with them.
Learn more about how Marathon Technologies can provide fault tolerance to your applications.