login

{
"title": "Aggregate Bloom Filter",
"related": []
}

Recall that a Bloom filter^{1} is a bit array comprising $m$ bits, with $n$ entries. An element is inserted by executing $k$ hash functions (each modulo $m$) and setting those indexes to 1.

A Compact Client Side Filter^{2} (aka Neutrino filter) is a sorted list of hash values. It inserts an element by hashing using SipHash distributed over a range of $n/fpr$, where $fpr$ is the desired false positive rate. It sorts all resulting values, and stores the resulting list using lossless compression. This filter is therefore equivalent to a Bloom filter comprising $m=n/fpr$ bits, $k=1$ hash function, and $n$ entries.

Given a desired false-positive rate and number of entries, equations deriving the optimal size and number of hash functions are known^{4} for Bloom filters. Less analyzed is the behavior of Bloom filters with sub-optimal settings. In particular, let us consider the compressability of a Bloom filter with low $n$.

To maintain the best false-positive rate Bloom filter hash functions must distribute their result randomly across $m$ such that every result is equally probable. It is therefore a safe assumption that the hash functions approximate a random oracle. This allows the maximum compressibility of a bloom filter to be calculated, since each bit in the Bloom filter can be modeled as a discrete random variable with 2 outcomes.

Recall that Shannon entropy^{3} is defined as:

$H(X) = -\sum_{i=1}^n P(x_i)log(P(x_i))$

where $X$ is a discrete random variable with outcomes $x_i$, and $P(x_i)$ is the probability of outcome $i$.

Each bit in a Bloom filter has 2 states, 1 or 0. We can calculate the probability of a bit being set – it is the total number of bits set divided by the total number of bits ($m$). An approximation of the total number of bits set is $k*n$ (number of hash functions * number of elements inserted), but this imagines that $m$ is so large that no collisions happen. Using a recursive strategy, a more accurate approximation can be derived. The number of bits set given n random insertions into an m bit space is that of n-1 random insertions + the probability that the nth bit lands in an empty slot or:

$BitsSet(n, m) =\begin{cases} 1, \text{if n is 1} \\ BitsSet(n-1, m) + (1-BitsSet(n-1, m)/m) \end{cases}$

Note, this is a well-known variant of the Birthday problem and a non-recursive solution exists^{[1]}.

And so the probability of a bit being cleared is simply the inverted probability or

$1 - BitsSet(n,m)/M$.

The Shannon Entropy of a Bloom filter bit is therefore roughly:

$H_b(X) = nk/m*log(nk/m) + ( 1-nk/m)*log(1-nk/m)$

Or more accurately:

$H_b(X) = BitsSet(n*k,m)/m*log(BitsSet(n*k,m)/m) + ( 1-BitsSet(n*k,m)/m)*log(1-BitsSet(n*k,m)/m)$

Using base 2 logarithms, this give us the minimum number of bits required to encode the information contained within a Bloom filter.

Figure 1 plots the Shannon Entropy (theoretical minimum size) for filters based on various algorithms.

Bloom filters perform worse than Neutrino filters for low numbers of insertions. This can be understood intuitively because Bloom filters are setting k bits (in this case 10) per insertion, whereas Neutrino filters set 1 bit. As the number of insertions approach the optimal for this Bloom filter size, Bloom filter insertions are more likely to “land” on existing bits as compared to Neutrino filters,

Figure 2 shows the false positive rate (FPR) for constant-size Bloom and Neutrino filters, optimized at a FPR of 0.001 with 1 million insertions. Bloom filters show a very low false positive rate for low N. This can be understood as an effect of the probability intersection of k hashes.

Given a constant space, it is possible to efficiently calculate the union of 2 Bloom or 2 Neutrino filters. For Bloom filters, construct the bit-wise OR of all bit arrays. For Neutrino filters, merge the 2 lists, sort, and drop any duplicates.

Let us propose a filter that aggregates a range of blocks $R = 2^D$. For block height H, let us mask off the bottom D bits idx=H&(R-1). Let us specify that idx == R-1 contains a filter that aggregates all blocks in R.

A light client can access aggregate filters to minimize the amount of data passed.

Let us define the **false positive overhead** to be the number of unnecessary bytes transferred due to false positive errors. TODO: SPV proof or block?

Let us explore the relationship between the false positive overhead and the total required storage.

[1] with items = n, and array size = k, the number of bits set S is:

$S = k * (1 - (1-1/k)^n)$

- Bloom, Burton H. (1970), “Space/Time Trade-offs in Hash Coding with Allowable Errors”, Communications of the ACM 13 (7): 422–426, CiteSeerX 10.1.1.641.9096, doi:10.1145/362686.362692, S2CID: 7931252
- Osuntokun, Olaoluwa (laolu32@gmail.com), Akselrod, Alex (alex@akselrod.org), Compact Client Side Filtering For Light Clients, Github
- Shannon, C.E. (1948), A Mathematical Theory of Communication,
*Bell System Technical Journal*, 27, pp. 379–423 & 623–656, July & October, 1948. - Wikipedia, Bloom Filter Summary