Papers We Love - QCon NYC Edition | Gwen Shapira on Realtime Data Processing at Facebook

Meetup: http://bit.ly/2uSFBlB
Paper: http://bit.ly/2u2Nwj2
Slides: http://bit.ly/2w0VBSI
Audio: http://bit.ly/2wgaqju

----------------------------------------------------------------------------------------
Sponsored by Two Sigma (@twosigma) and QCon NYC
----------------------------------------------------------------------------------------

Description
------------------
Realtime data processing powers many use cases at Facebook, including realtime reporting of the aggregated, anonymized voice of Facebook users, analytics for mobile applications, and insights for Facebook page administrators. Many companies have developed their own systems; we have a realtime data processing ecosystem at Facebook that handles hundreds of gigabytes per second across hundreds of data pipelines.

Many decisions must be made while designing a realtime stream processing system. In this paper, we identify five important design decisions that affect their ease of use, performance, fault tolerance, scalability, and correctness. We compare the alternative choices for each decision and contrast what we built at Facebook to other published systems.

Our main decision was targeting seconds of latency, not milliseconds. Seconds is fast enough for all of the use cases we support and it allows us to use a persistent message bus for data transport. This data transport mechanism then paved the way for fault tolerance, scalability, and multiple options for correctness in our stream processing systems Puma, Swift, and Stylus...

Bio
-----
Gwen Shapira is a product manager at Confluent. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen is the author of Kafka: The Definitive Guide and Hadoop Application Architectures, and a frequent presenter at industry conferences. Gwen is a PMC member of the Apache Kafka project and a committer on Apache Sqoop. When Gwen isn't building data pipelines or thinking up new features, you can find her pedaling her bike, exploring the roads and trails of California and beyond.