
How Bloom Filters Work


Big Data

2024-12-18

A Bloom filter is a probabilistic data structure for testing whether an item is a member of a set.

Use Scenarios

The main use of a Bloom filter is to quickly determine whether an element belongs to a set. It is typically used to filter requests against massive data sets and to speed up lookups. For example, when a user watches a video on Douyin, deciding whether that user has already favorited the video is exactly a set-membership question, and a Bloom filter makes the check efficient.

Basic Principle

A Bloom filter is a space-efficient probabilistic data structure. Its core is a large bit array plus a set of hash functions, used to answer whether an element exists in a set. It can misjudge: an element that is not in the set may be reported as present (a false positive), but an element that is in the set is never reported as absent.

  1. Allocate a bit array of m bits, initially all 0.
  2. Choose k hash functions, each mapping an element to a position in [0, m-1]. To insert an element, set each of its k positions to 1.
  3. To query an element, use the same k hash functions to compute its positions $a_1, a_2, \dots, a_k$.
  4. If any of the bits at $a_1, a_2, \dots, a_k$ is 0, the element is definitely not in the set.
  5. If the bits at $a_1, a_2, \dots, a_k$ are all 1, the element may be in the set.
Basic Principles of Bloom Filter
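
The five steps above can be sketched directly in Go. The following is a minimal illustration (the type `miniBloom` and its methods are our own toy names, not a library API), using the standard library's FNV hash with an index salt to simulate k independent hash functions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// miniBloom: an m-bit array plus k simulated hash functions.
type miniBloom struct {
	bits []bool
	k    int
}

func newMiniBloom(m, k int) *miniBloom {
	return &miniBloom{bits: make([]bool, m), k: k}
}

// pos derives the i-th bit position for s; the "i:" prefix salts
// one FNV hash into k different ones (step 2 above).
func (b *miniBloom) pos(s string, i int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, s)
	return int(h.Sum64() % uint64(len(b.bits)))
}

// Add sets all k positions of s to 1.
func (b *miniBloom) Add(s string) {
	for i := 0; i < b.k; i++ {
		b.bits[b.pos(s, i)] = true
	}
}

// Test: false means definitely absent (step 4);
// true means possibly present (step 5).
func (b *miniBloom) Test(s string) bool {
	for i := 0; i < b.k; i++ {
		if !b.bits[b.pos(s, i)] {
			return false
		}
	}
	return true
}

func main() {
	f := newMiniBloom(1024, 3)
	f.Add("douyin-video-42")
	fmt.Println(f.Test("douyin-video-42")) // always true: no false negatives
	fmt.Println(f.Test("douyin-video-99")) // usually false; true here would be a false positive
}
```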

Parameters

From the above, it can be seen that a Bloom filter should have at least the following parameters:

  1. The size of the bit array (the hash space), denoted m. In the example above, m = 20 bits.
  2. The number of elements in the set, denoted n. In the example above, n = 2.
  3. The number of hash functions, denoted k. In the example above, k = 2.
  4. Because a BF allows errors, an element that is not in the set may be mistakenly judged to be in it. The probability of this mistake is called the false positive rate, denoted $\varepsilon$.

These relationships come from minimizing the false positive rate $\varepsilon \approx (1 - e^{-kn/m})^k$ over k. When $\varepsilon$ is minimized, the parameters relate as follows:

$$ k = \frac{m}{n} \ln 2 $$

$$ m = -\frac{n \ln \varepsilon}{(\ln 2)^2} $$

$$ \frac{m}{n} = -\frac{\log_2 \varepsilon}{\ln 2} \approx -1.44 \log_2 \varepsilon $$
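
As a worked example of these formulas (the helper names `optimalM` and `optimalK` are ours): for n = 1,000,000 elements at $\varepsilon$ = 0.01, they give m ≈ 9.59 million bits (about 1.14 MB) and k ≈ 6.6, i.e. 7 hash functions in practice:

```go
package main

import (
	"fmt"
	"math"
)

// optimalM returns the bit-array size for n elements at false
// positive rate eps: m = -n*ln(eps)/(ln 2)^2.
func optimalM(n, eps float64) float64 {
	return -n * math.Log(eps) / (math.Ln2 * math.Ln2)
}

// optimalK returns the optimal hash count: k = (m/n)*ln 2.
func optimalK(m, n float64) float64 {
	return m / n * math.Ln2
}

func main() {
	n, eps := 1_000_000.0, 0.01
	m := optimalM(n, eps)
	k := optimalK(m, n)
	fmt.Printf("n=%.0f, eps=%g -> m ≈ %.0f bits (%.2f MB), k ≈ %.1f\n",
		n, eps, m, m/8/1024/1024, k)
}
```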

How To Choose A Hash Function?

For both accuracy and speed, the hash functions should meet the following requirements:

  1. Independent and uniformly distributed.
  2. Fast calculation speed.

We recommend learning about the MurmurHash algorithm.
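
A practical note, not from the text above: you do not need k truly independent hash functions. Kirsch and Mitzenmacher showed that two suffice, combined by double hashing as $g_i(x) = h_1(x) + i \cdot h_2(x) \bmod m$. A sketch with the standard library's FNV hashes standing in for MurmurHash:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// indexes derives k bit positions from just two hash values,
// using double hashing: g_i = h1 + i*h2 (mod m).
func indexes(s string, k int, m uint64) []uint64 {
	a := fnv.New64a()
	a.Write([]byte(s))
	h1 := a.Sum64()

	b := fnv.New64()
	b.Write([]byte(s))
	h2 := b.Sum64() | 1 // force an odd stride so positions don't degenerate

	out := make([]uint64, k)
	for i := 0; i < k; i++ {
		out[i] = (h1 + uint64(i)*h2) % m
	}
	return out
}

func main() {
	// Four positions in a 2^20-bit array, from only two hash computations.
	fmt.Println(indexes("douyin-video-42", 4, 1<<20))
}
```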

Advantages And Disadvantages

Advantages

  • High memory efficiency.
  • Fast query speed.
  • Hash computations and queries are easy to parallelize.

Disadvantages

  • There is a false positive rate, determined mainly by the number of hash functions and the size of the bit array. A larger bit array lowers the false positive rate but increases memory consumption, so a trade-off is needed.
  • Hash collisions can map different elements to the same bits.
  • Deletion is not supported: clearing a bit could also remove other elements that share it.
  • The original elements cannot be recovered from the filter.
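
The deletion limitation has a common workaround worth knowing (an addition, not covered above): a counting Bloom filter, which replaces each bit with a small counter so that a removal decrements rather than clears. A minimal sketch with hypothetical names:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// countingBloom replaces each bit with a counter, so an element's
// positions can be decremented on delete instead of zeroed.
type countingBloom struct {
	counts []uint8
	k      int
}

func newCountingBloom(m, k int) *countingBloom {
	return &countingBloom{counts: make([]uint8, m), k: k}
}

func (c *countingBloom) pos(s string, i int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, s) // i acts as a per-function salt
	return int(h.Sum64() % uint64(len(c.counts)))
}

// Add increments the element's k counters.
func (c *countingBloom) Add(s string) {
	for i := 0; i < c.k; i++ {
		c.counts[c.pos(s, i)]++
	}
}

// Remove decrements the same k counters.
func (c *countingBloom) Remove(s string) {
	for i := 0; i < c.k; i++ {
		if p := c.pos(s, i); c.counts[p] > 0 {
			c.counts[p]--
		}
	}
}

// Test reports possible membership, exactly as in a plain filter.
func (c *countingBloom) Test(s string) bool {
	for i := 0; i < c.k; i++ {
		if c.counts[c.pos(s, i)] == 0 {
			return false
		}
	}
	return true
}

func main() {
	c := newCountingBloom(1024, 3)
	c.Add("video-1")
	fmt.Println(c.Test("video-1")) // true
	c.Remove("video-1")
	fmt.Println(c.Test("video-1")) // false: counters are back to zero
}
```

Removal is only safe for elements that were actually added, and the 8-bit counters here can overflow under heavy duplication; real implementations guard against both.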

Application

  • Preventing database penetration: use a Bloom filter to skip disk lookups for rows or keys that do not exist. Avoiding costly disk searches greatly improves the performance of database queries.
  • Determining whether a user has already seen a video or article, as in Douyin or Toutiao.
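
The penetration-prevention flow can be sketched as a guard in front of the expensive store. The `filter` type below is a deliberately tiny stand-in (salted FNV hashes over a bit slice; in production you would use a real library); the point is the lookup path, where "definitely absent" short-circuits the disk/DB access:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// filter is a toy 3-hash Bloom filter used only to show the guard pattern.
type filter struct{ bits []bool }

func (f filter) idx(s string, i int) int {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d:%s", i, s)
	return int(h.Sum64() % uint64(len(f.bits)))
}

func (f filter) Add(s string) {
	for i := 0; i < 3; i++ {
		f.bits[f.idx(s, i)] = true
	}
}

func (f filter) MayContain(s string) bool {
	for i := 0; i < 3; i++ {
		if !f.bits[f.idx(s, i)] {
			return false
		}
	}
	return true
}

// lookup guards the expensive store: a "definitely absent" answer
// from the filter skips the disk/DB hit entirely.
func lookup(f filter, db map[string]string, key string) (string, bool) {
	if !f.MayContain(key) {
		return "", false // no disk access at all
	}
	v, ok := db[key] // only possibly-present keys reach the store
	return v, ok
}

func main() {
	db := map[string]string{"user:1": "Alice", "user:2": "Bob"}
	f := filter{bits: make([]bool, 1024)}
	for k := range db {
		f.Add(k)
	}
	fmt.Println(lookup(f, db, "user:1"))   // prints: Alice true
	fmt.Println(lookup(f, db, "user:999")) // the filter rejects most missing keys up front
}
```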

Demo

In Go, the github.com/bits-and-blooms/bloom package makes it easy to use a Bloom filter.

main.go
package main

import (
	"fmt"

	"github.com/bits-and-blooms/bloom/v3" // the module path includes /v3
)

func main() {
	data := []string{"user:1", "user:2", "user:3"}
	// Size the filter for len(data) elements at a 0.1% false positive rate.
	m, k := bloom.EstimateParameters(uint(len(data)), 0.001)
	filter := bloom.New(m, k)
	for _, d := range data {
		if len(d) == 0 {
			continue
		}
		filter.Add([]byte(d))
	}
	if filter.Test([]byte("user:1")) {
		fmt.Println("data already exists!")
	}
}

Performance Comparison

| Input size (ε = 0.01) | Bloom peak memory | Map peak memory | Memory savings |
| --- | --- | --- | --- |
| 10k | 0.8MB | 1.18MB | 32.5% |
| 50k | 1.5MB | 3.3MB | 54.5% |
| 100k | 1.37MB | 3.66MB | 62% |
| 500k | 2.24MB | 23.2MB | 90% |
| 1M | 2.7MB | 46.1MB | 94% |
| 5M | 9.3MB | 191.4MB | 95% |
| 10M | 17.6MB | 382.5MB | 95% |
| 50M | 61.7MB | 1705.2MB | 96% |

Memory usage: the Bloom filter cuts memory consumption by roughly 60%–90% compared with a map.

| Input size (ε = 0.01) | Bloom query time | Map query time | Bloom/Map time |
| --- | --- | --- | --- |
| 10k | 1+1=2ms | 508µs+508µs≈1ms | 200% |
| 50k | 5.6+4.8=10.5ms | 3.2+3.0=6.3ms | 166% |
| 100k | 12+9.6=21.8ms | 9+6=15ms | 145% |
| 500k | 61.1+52.1=113.2ms | 51.6+47.6=99.1ms | 114% |
| 1M | 125.9+109.4=235.3ms | 136.5+121.5=258ms | 91% |
| 5M | 665.5+592=1.26s | 723.5+711.8=1.4s | 90% |
| 10M | 1.87+1.5=3.9s | 1.48+1.4=2.9s | 134% |
| 30M | 16.5s | 9.8s | 168% |
| 50M | 15+13=28s | 7.6+7.6=15.2s | 184% |

Times are recorded as full insert + full query.



