Hardware fault tolerance is a well-understood problem. It is achieved by including two or more identical copies of every working part, arranged so that a backup can take over from any part that breaks. It also requires that broken parts can be swapped out for new ones while the system is still operating.
A system of this kind built with a single backup for each part is known as single-point tolerant, and such systems make up the vast majority of fault-tolerant systems. In them, the mean time between failures is long enough that the operators have more than enough time to repair the broken device before its backup can fail as well. It helps if the time between failures is as long as possible, but this is not strictly required.
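The trade-off between repair time and time between failures can be made concrete with a small calculation. The sketch below is illustrative only: it assumes failures arrive at a constant rate (an exponential model, which the text does not specify), and the function name and the sample figures of a 10,000-hour mean time between failures and a 4-hour repair are hypothetical.

```python
import math


def prob_backup_fails_during_repair(mtbf_hours: float, repair_hours: float) -> float:
    """Chance that the surviving unit fails before the repair is finished.

    Assumption (not from the text): failure times are exponentially
    distributed with a constant failure rate of 1 / mtbf_hours.
    """
    if mtbf_hours <= 0 or repair_hours < 0:
        raise ValueError("mtbf_hours must be positive and repair_hours non-negative")
    return 1.0 - math.exp(-repair_hours / mtbf_hours)


# Illustrative numbers only: a 10,000-hour MTBF and a 4-hour repair window.
risk = prob_backup_fails_during_repair(mtbf_hours=10_000, repair_hours=4)
print(f"Risk of a second failure during repair: {risk:.6f}")
```

Under this model the risk depends only on the ratio of repair time to mean time between failures, which is why a fast repair can compensate for parts that are only moderately reliable.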
There is a difference between fault-tolerant systems, which keep working even when a fault occurs, and systems that simply have problems rarely. For instance, the Western Electric crossbar systems had failure rates equivalent to two hours of downtime per forty years, and were therefore highly fault resistant. But when a fault did occur they still stopped operating completely, and so they were not truly fault tolerant.
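The two-hours-per-forty-years figure can be restated as an availability fraction. The sketch below assumes that the figure refers to total downtime and uses an average year of 365.25 days; the function and constant names are hypothetical.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length, leap days included


def availability(downtime_hours: float, period_years: float) -> float:
    """Fraction of the period during which the system was in service."""
    total_hours = period_years * HOURS_PER_YEAR
    if total_hours <= 0 or not 0 <= downtime_hours <= total_hours:
        raise ValueError("downtime must lie between 0 and the length of the period")
    return 1.0 - downtime_hours / total_hours


# Two hours of downtime spread over forty years of service.
crossbar = availability(downtime_hours=2, period_years=40)
print(f"Availability: {crossbar:.5%}")
```

On these assumptions the result is roughly 99.9994 percent. A number like this measures how rarely the system is down, not whether it keeps running through a fault, which is the distinction the paragraph above draws.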
Fault tolerance has been notably successful in computer applications. Tandem Computers built its entire business on such machines, using single-point tolerance to create its NonStop systems, with uptimes measured in decades.