Ben Linsay on HyperLogLog [PWL NYC]

2018-06-06

Meetup: https://bit.ly/2sPHrDU
Paper: https://bit.ly/1QlcaxD
Slides: https://bit.ly/2JMeza6
Audio: https://bit.ly/2t5EwqL

-----------------------------------------------------------------------------------
Sponsored and hosted by Two Sigma (@twosigma)
-----------------------------------------------------------------------------------

Description
------------------

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HyperLogLog, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
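To make the abstract concrete, here is a minimal Python sketch of the idea, not the paper's reference implementation and not the speaker's code. The class name `HyperLogLog`, the parameter `p` (so that m = 2^p registers), the method names, and the choice of a 64-bit hash taken from SHA-1 are all assumptions made for illustration. The paper itself works with 32-bit hashes and adds a large-range correction that this sketch leaves out.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with m = 2**p registers (hypothetical API)."""

    HASH_BITS = 64  # assumption: 64-bit hashes, so no large-range correction

    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form is intended for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Hash the item to 64 bits (first 8 bytes of SHA-1; any well-mixed hash works).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = self.HASH_BITS - self.p
        idx = h >> width                  # first p bits select a register
        rest = h & ((1 << width) - 1)     # remaining bits feed the rank
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise max yields the sketch of the union of both inputs.
        if other.m != self.m:
            raise ValueError("sketches must use the same number of registers")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        # Normalized harmonic mean of 2**register over all registers.
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw

    def standard_error(self):
        # Typical relative accuracy quoted in the abstract: about 1.04 / sqrt(m).
        return 1.04 / math.sqrt(self.m)


if __name__ == "__main__":
    hll = HyperLogLog(p=11)  # m = 2048 registers
    for i in range(100_000):
        hll.add(f"user-{i}")
    print(f"estimate: {hll.estimate():.0f}")
    print(f"expected relative error: {hll.standard_error():.2%}")
```

With p = 11 the sketch keeps 2048 small registers, and 1.04/√2048 works out to roughly 2.3%, in line with the "typical accuracy of 2%" quoted above. The `merge` method is what makes the algorithm easy to parallelize: each worker builds its own sketch, and taking the register-wise maximum gives the same registers as a single pass over all the data.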

Bio
-----

Ben Linsay (http://blinsay.com/, @blinsay: https://twitter.com/blinsay) is somehow still a software engineer. He's worked on distributed data processing pipelines in adtech, built and maintained APIs for small startups, and has accidentally been a DBA twice. Ben has written a couple of HyperLogLog implementations in his spare time and doesn't really want to show them to anyone.