Sparrow: Distributed, Low Latency Scheduling

Abstract

This paper presents Sparrow, a scheduling system designed to provide low latency, scalable, and fault-tolerant scheduling for parallel jobs. Sparrow uses a combination of randomized sampling and late binding to achieve low latency and high throughput. Experiments show that Sparrow can make scheduling decisions in milliseconds and achieves higher throughput on large clusters than existing approaches. Sparrow’s decentralized design provides fault tolerance by allowing schedulers to operate independently, making Sparrow suitable for both private clusters and public clouds.

Description

✨ Summary

Sparrow is a scheduling system that targets low-latency, high-throughput scheduling of parallel jobs on clusters. The paper was presented at the 2013 ACM Symposium on Operating Systems Principles (SOSP). It highlights Sparrow’s use of randomized sampling and late binding, which lets the system make scheduling decisions within milliseconds, faster than traditional methods. Its decentralized design also improves fault tolerance, making it suitable for both private clusters and public cloud infrastructures.
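To make the two ideas concrete, here is a minimal Python sketch, not Sparrow's actual code. It assumes the common reading of the terms: randomized sampling means probing a few random workers per task and preferring the least loaded ones, and late binding means a task is assigned to a worker only when that worker is ready to run it. The names `batch_sample_place`, `LateBindingJob`, and `probe_ratio` are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


def batch_sample_place(queue_lengths, num_tasks, probe_ratio=2, rng=None):
    """Randomized sampling (illustrative): probe probe_ratio * num_tasks
    random workers and return the num_tasks probed workers with the
    shortest queues. queue_lengths[w] is the queue length of worker w."""
    rng = rng or random.Random()
    num_probes = min(len(queue_lengths), probe_ratio * num_tasks)
    probed = rng.sample(range(len(queue_lengths)), num_probes)
    # Keep the least-loaded probed workers, one per task.
    return sorted(probed, key=lambda w: queue_lengths[w])[:num_tasks]


class LateBindingJob:
    """Late binding (illustrative): workers hold reservations rather than
    tasks, and a task is bound to a worker only when that worker asks."""

    def __init__(self, tasks):
        self.pending = deque(tasks)

    def request_task(self, worker_id):
        # Called when this job's reservation reaches the front of a
        # worker's queue. Returns None once every task has been handed
        # out, so leftover reservations become no-ops.
        if self.pending:
            return self.pending.popleft()
        return None


# Usage: four workers, a two-task job, and probes sent to all four.
queues = [5, 0, 3, 1]
chosen = batch_sample_place(queues, num_tasks=2, rng=random.Random(0))

job = LateBindingJob(["task-0", "task-1"])
first = job.request_task(worker_id=1)   # a lightly loaded worker asks first
second = job.request_task(worker_id=3)
late = job.request_task(worker_id=0)    # a busy worker asks after both are gone
```

In this sketch the sampling step relies on queue lengths reported at probe time, while the late-binding step sidesteps stale queue estimates by letting whichever probed workers free up first claim the tasks.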

Sparrow influenced later research on low-latency task scheduling and on scheduling frameworks for real-time analytics in distributed systems. For instance, it was referenced by subsequent work on refining scheduling models (e.g., “A survey of decentralized online task allocation among different agents in the era of IoT”, Springer Link) and on improving scheduling performance in big data frameworks.

Despite this influence, a brief search did not turn up specific industry applications or deployments of Sparrow, which suggests that its impact has been primarily in academic research on distributed systems and scheduling algorithms.