Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

📜 Abstract

This paper presents an analysis of 198 randomly selected, user-reported failures from five large-scale distributed data-intensive systems: Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis. Our study reveals that simple testing could have prevented or detected the majority of the failures without any knowledge of the source code or the developers’ expertise. Specifically, we found that almost all catastrophic failures (48 in total) are the result of incorrect handling of non-fatal errors explicitly signaled in software. Moreover, about three quarters (74%) of the failures are deterministic in that they consistently recur when given the same input and execution environment. Based on these observations, we propose simple testing techniques that focus on error-handling code and evaluate them using failure data from these systems. Our results show that simple testing can significantly improve the systems’ reliability and enhance their failure detection capabilities.

✨ Summary

The paper “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems” by Ding Yuan et al. was published at OSDI in 2014. It analyzes failure patterns in distributed systems such as Cassandra, HBase, and HDFS, and argues that straightforward testing could prevent many critical failures. Its data indicate that most catastrophic failures stem from incorrect handling of non-fatal errors, and that most failures are deterministic.

The research shows that system reliability can be improved through targeted tests of error-handling code, which need neither deep knowledge of the code base nor specialized developer expertise. This has implications for dependability and failure detection in complex distributed environments.
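To make the idea of testing error-handling code concrete, here is a minimal sketch in Python. It is not the paper's tooling: `FlakyStore`, `flush`, and the retry policy are hypothetical names invented for illustration. The test injects a non-fatal `IOError` and checks that the handler either recovers from it or reports it, rather than silently swallowing it.

```python
import logging


class FlakyStore:
    """Hypothetical storage backend that fails its first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.data = []

    def write(self, record):
        if self.failures > 0:
            self.failures -= 1
            raise IOError("injected non-fatal write error")
        self.data.append(record)


def flush(records, store, max_retries=2):
    """Write every record, retrying on IOError.

    Returns the list of records that could not be written, so the caller
    can decide what to do instead of the error being ignored.
    """
    failed = []
    for record in records:
        for attempt in range(max_retries + 1):
            try:
                store.write(record)
                break  # written successfully, move to the next record
            except IOError as exc:
                logging.warning(
                    "write of %r failed (attempt %d): %s", record, attempt + 1, exc
                )
        else:
            # All attempts failed: report the record rather than drop it.
            failed.append(record)
    return failed


def test_flush_recovers_from_transient_error():
    store = FlakyStore(failures=1)
    assert flush(["a", "b"], store) == []
    assert store.data == ["a", "b"]


def test_flush_reports_persistent_error():
    store = FlakyStore(failures=10)
    assert flush(["a", "b"], store) == ["a", "b"]
    assert store.data == []
```

Both tests exercise the `except` branch directly, which is the code path the paper identifies as the most common source of catastrophic failures. A test runner such as pytest would discover the two `test_` functions automatically.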

The paper is frequently cited in subsequent research on fault tolerance in distributed systems, and it has influenced work on testing tools and frameworks for system dependability. Its findings have also been referenced in evaluations of the reliability and robustness of cloud computing infrastructures.

Gan, X., et al. (2019). “An Empirical Study on Catastrophic System Failures in Cloud Data Centers.”
Foo, J., et al. (2020). “Understanding System Behaviors and Failures in Cloud and Distributed Systems.”

This influence underscores the paper's contribution to distributed systems reliability and the practical value of its recommendations for software testing.