November 14, 2008

Soft Failures

Modern design has made products much more robust and adaptable to failure and error conditions. This includes soft failure modes, which maintain some form of operation even after a component or process has failed, along with various error handling and correction methods. This is all very good, but an insidious downside has emerged. Because these products are so robust that they keep functioning even when broken, detecting a broken device becomes much more difficult. It often seems to me that some of these devices are significantly broken for long periods of time before being repaired. This can certainly lead to much more catastrophic failures and widespread performance degradation.

The big issue in my mind is that products don't include sufficient monitoring or alerting utilities to let people know they are experiencing problems.

Some examples of products that exhibit these traits: disk storage products, RAID arrays, and traffic control systems. Modern hard disks correct media errors and communication errors automatically on their own. In the vast majority of cases, however, no error report is generated at all. RAID arrays offer enhanced data security through redundant disks, but when a disk fails, the failure is frequently not communicated effectively. Locally I have noticed several traffic light control systems with failed vehicle detection loops. The loops appear to fail in an ON state, and the error is not detected at all. On top of that, many of the local traffic lights appear to still operate on the old daylight saving time rules.
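To make the monitoring gap concrete, here is a minimal Python sketch of the kind of alerting that could sit alongside automatic error correction. Everything in it is hypothetical and for illustration only: the `SoftFailureMonitor` class, its `threshold` value, and the `alert` hook are assumptions, not part of any real disk, RAID, or traffic controller product.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SoftFailureMonitor:
    """Counts corrected (masked) errors and alerts once they reach a threshold."""
    name: str
    threshold: int                  # hypothetical: corrected errors tolerated before alerting
    alert: Callable[[str], None]    # hypothetical notification hook (log, email, pager)
    corrected: int = 0
    alerted: bool = False

    def record_corrected_error(self, detail: str) -> None:
        # The device has already recovered; the point is not to stay silent about it.
        self.corrected += 1
        if self.corrected >= self.threshold and not self.alerted:
            self.alerted = True     # alert only once so the operator is not flooded
            self.alert(f"{self.name}: {self.corrected} corrected errors, latest: {detail}")


# Usage: collect alerts in a list as a stand-in for a real notification channel.
messages: List[str] = []
monitor = SoftFailureMonitor(name="disk0", threshold=3, alert=messages.append)

for sector in (1024, 2048, 4096, 8192):
    monitor.record_corrected_error(f"media error corrected at sector {sector}")

print(messages)
```

The device keeps working through every corrected error, but once the count reaches the threshold someone is told about it, rather than finding out only when the redundancy finally runs out.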


Created By: Steven Nikkel (steven_nikkel@ertyu.org)
This webpage and other materials are Copyright © 1997-2016 Steven Nikkel, All Rights Reserved