Ben Linsay on HyperLogLog

New York - May 14th, 2018


Description

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HyperLogLog, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.

Bio

Ben Linsay (http://blinsay.com/) (@blinsay (https://twitter.com/blinsay)) is somehow still a software engineer. He's worked on distributed data processing pipelines in adtech, built and maintained APIs for small startups, and has accidentally been a DBA twice. Ben has written a couple HyperLogLog implementations in his spare time and doesn't really want to show them to anyone.

Audio

Slides


TwoSigma The New York Chapter would like to thank TwoSigma for helping to make this meetup possible.