In computer operations, a fault is an unforeseen outage or loss of service within an application. Fault monitoring is the process of watching all hardware, software, and network configurations for deviations from normal operating conditions. It typically tracks both major and minor changes in the expected bandwidth, performance, and utilization of the established computer environment.
Successful implementation of computer software requires significant hardware, software, and network infrastructure. The complex integration of these interoperable components creates many opportunities for faults within the application environment. To reduce downtime, proactive fault monitoring provides quick notification of errors in the computing environment so that they can be mitigated promptly.
The level of proactive monitoring for a computer environment should be based on the importance of the infrastructure. Advanced fault monitoring processes can become expensive and time-consuming. Care should be taken to design a level of monitoring that matches the quality of service required for the application suite.
A simple monitoring process could include reviewing the errors recorded in application or operating system log files. This type of monitoring can be automated to provide real-time notification when errors occur. Once the errors are reported, administrators can quickly implement mitigation strategies to resolve the identified issue.
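The automated log review described above can be sketched in a few lines of Python. This is a minimal illustration, not a production monitor: the severity keywords, the sample log lines, and the `find_errors` and `notify` names are all assumptions made for the example, and `notify` simply prints where a real system would page or email an administrator.

```python
import re

# Assumption: each log line carries one of these severity keywords.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL|FATAL)\b")

def find_errors(lines):
    """Return (line_number, text) pairs for lines that report an error."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits

def notify(hits):
    """Stand-in for a real alert channel (email, pager); here it only prints."""
    for number, text in hits:
        print(f"ALERT line {number}: {text}")

# Made-up sample data standing in for an application log file.
sample_log = [
    "2024-01-01 10:00:00 INFO service started\n",
    "2024-01-01 10:00:05 ERROR database connection refused\n",
    "2024-01-01 10:00:06 INFO retrying\n",
    "2024-01-01 10:00:09 CRITICAL disk /var full\n",
]

notify(find_errors(sample_log))
```

For real-time notification, the same function could be applied to lines as they are appended to the log rather than to a fixed list.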
Enterprise application environments typically implement advanced fault monitoring, which covers all levels of monitoring. These environments are critical to the business because system downtime affects revenue. This type of monitoring typically involves an enterprise data center with advanced introspection of all facets of the enterprise configuration.
With advanced fault monitoring configurations, any deviations from normal are quickly identified and mitigation strategies are implemented. One example is the ability to recognize abnormal spikes in network traffic. Once a spike is identified, traffic can be proactively routed to additional servers and network paths so that the quality of service is maintained.
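One simple way to recognize an abnormal traffic spike is to compare each new measurement with a baseline built from recent history. The sketch below is one possible approach, not a method prescribed by any particular monitoring product: the `find_spikes` name, the window of five samples, the threshold of three standard deviations, the minimum spread of 1.0, and the sample readings are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def find_spikes(samples, window=5, threshold=3.0):
    """Return indexes of samples that exceed the trailing mean by
    `threshold` standard deviations of the preceding `window` samples."""
    spikes = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        # Assumption: floor the spread at 1.0 so a perfectly flat
        # baseline does not flag every tiny fluctuation.
        limit = baseline + threshold * max(spread, 1.0)
        if samples[i] > limit:
            spikes.append(i)
    return spikes

# Made-up throughput readings; index 6 is a sudden surge.
traffic_mbps = [100, 102, 98, 101, 99, 100, 350, 101, 99]

for index in find_spikes(traffic_mbps):
    print(f"Spike at sample {index}: {traffic_mbps[index]} Mbps")
```

In a full deployment, each detected index would trigger the mitigation step, such as directing traffic to additional servers; that part depends on the environment and is left out here.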
Computer applications rely on hardware and networks, which over time will inevitably suffer a hard failure or defect. The mean time between failures (MTBF) is a reliability measure used to predict the time between hard failures for the current configuration. Fault monitoring is a technique for identifying errors and quickly enacting countermeasures when an inevitable failure does occur.
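The mean time between failures is commonly estimated by dividing the total operating time by the number of failures observed in that period. A minimal sketch of that arithmetic follows; the function name and the figures (one year of operation with four failures) are illustrative assumptions, not measurements from a real system.

```python
def mean_time_between_failures(total_operating_hours, failure_count):
    """Estimate MTBF as operating time divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_operating_hours / failure_count

# Illustrative figures: 8760 operating hours (one year) and 4 hard failures.
mtbf_hours = mean_time_between_failures(8760, 4)
print(f"Estimated MTBF: {mtbf_hours} hours")  # prints "Estimated MTBF: 2190.0 hours"
```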