Papers tagged fault tolerance

A History of the Virtual Synchrony Replication Model
A Response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication
Building Secure and Reliable Network Applications
Byzantine Chain Replication
Ceph: A Scalable, High-Performance Distributed File System
Chain Replication for Supporting High Throughput and Availability
Commodifying Replicated State Machines with OpenReplica
Consensus algorithms for parallel machines
Consensus in the Presence of Partial Synchrony
Consistency Tradeoffs in Modern Distributed Database System Design: CAP is only part of the story
Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms
Design and Implementation of a Non-Disruptive Upgrade Infrastructure for Scalable Servers
Distributed Snapshots: Determining Global States of Distributed Systems
End-To-End Arguments in System Design
EPaxos: Consensus on values, not ballots
Epidemic Algorithms for Replicated Database Maintenance
Epidemic Broadcast Trees
Gossip-based Broadcast
Harvest, Yield and Scalable Tolerant Systems
Herbivore: A Scalable and Efficient Protocol for Anonymous Communication and Broadcasting
Hints for Computer System Design
HotOS Jeremiad
HyperDex: A Distributed, Searchable Key-Value Store
Implementing the Omega Failure Detector in the Crash-Recovery Failure Model
Impossibility of Distributed Consensus with One Faulty Process
In Search of an Understandable Consensus Algorithm
Kafka: a Distributed Messaging System for Log Processing
Life Beyond Distributed Transactions: an Apostate's Opinion
Making reliable distributed systems in the presence of software errors
MapReduce: Simplified Data Processing on Large Clusters
MDCC: Multi-Data Center Consistency
Microreboot – A Technique for Cheap Recovery
Microreboots: A Technique for Cheap Recovery
Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems
Paxos Made Moderately Complex
Paxos Made Simple
Practical Byzantine Fault Tolerance and Proactive Recovery
Readings in Distributed Systems
Recurring Virtual Machine for Reliable Systems
Self-Stabilizing Systems in Spite of Distributed Control
SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control
Simplified Local Recovery for Wide Area File Systems
Sinfonia: A New Paradigm for Building Scalable Distributed Systems
Spartan: A Distributed Array Framework with Smart Tiling
The Akamai Network: A Fast and Reliable Software System for Serving the World's Web Sites
The Byzantine Generals Problem
The Google File System
The Hadoop Distributed File System
Towards Practical Default-On Multi-Core Record/Replay
Understanding the Limitations of Causally and Totally Ordered Communication
Why Research in Distributed Systems is Hard
Zab: High-performance broadcast for primary-backup systems
Zab: High-performance broadcasting using a non-blocking voting scheme

© 2026 Papers We Love^SM, all rights reserved