Papers tagged fault tolerance
- A History of the Virtual Synchrony Replication Model
- A Response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication
- Building Secure and Reliable Network Applications
- Byzantine Chain Replication
- Ceph: A Scalable, High-Performance Distributed File System
- Chain Replication for Supporting High Throughput and Availability
- Commodifying Replicated State Machines with OpenReplica
- Consensus algorithms for parallel machines
- Consensus in the Presence of Partial Synchrony
- Consistency Tradeoffs in Modern Distributed Database System Design: CAP is only part of the story
- Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms
- Design and Implementation of a Non-Disruptive Upgrade Infrastructure for Scalable Servers
- Distributed Snapshots: Determining Global States of Distributed Systems
- End-To-End Arguments in System Design
- EPaxos: Consensus on values, not ballots
- Epidemic Algorithms for Replicated Database Maintenance
- Epidemic Broadcast Trees
- Gossip-based Broadcast
- Harvest, Yield and Scalable Tolerant Systems
- Herbivore: A Scalable and Efficient Protocol for Anonymous Communication and Broadcasting
- Hints for Computer System Design
- HotOS Jeremiad
- HyperDex: A Distributed, Searchable Key-Value Store
- Implementing the Omega Failure Detector in the Crash-Recovery Failure Model
- Impossibility of Distributed Consensus with One Faulty Process
- In Search of an Understandable Consensus Algorithm
- Kafka: a Distributed Messaging System for Log Processing
- Life Beyond Distributed Transactions: an Apostate's Opinion
- Making reliable distributed systems in the presence of software errors
- MapReduce: Simplified Data Processing on Large Clusters
- MDCC: Multi-Data Center Consistency
- Microreboot – A Technique for Cheap Recovery
- Microreboots: A Technique for Cheap Recovery
- Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems
- Paxos Made Moderately Complex
- Paxos Made Simple
- Practical Byzantine Fault Tolerance and Proactive Recovery
- Readings in Distributed Systems
- Recurring Virtual Machine for Reliable Systems
- Self-Stabilizing Systems in Spite of Distributed Control
- SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control
- Simplified Local Recovery for Wide Area File Systems
- Sinfonia: A New Paradigm for Building Scalable Distributed Systems
- Spartan: A Distributed Array Framework with Smart Tiling
- The Akamai Network: A Fast and Reliable Software System for Serving the World's Web Sites
- The Byzantine Generals Problem
- The Google File System
- The Hadoop Distributed File System
- Towards Practical Default-On Multi-Core Record/Replay
- Understanding the Limitations of Causally and Totally Ordered Communication
- Why Research in Distributed Systems is Hard
- Zab: High-performance broadcast for primary-backup systems
- Zab: High-performance broadcasting using a non-blocking voting scheme