Click on the top ofzhisheng” to follow, star or pin to grow together

Flink goes from beginner to proficient Series

original: www.cnblogs.com/CodeBear/p/10911177.html

this article is to discuss the Bloom filter from the perspective of Xiaobai, if you are from a science class, or are smart, or really want to fully understand the Bloom filter, you can move.

I don’t know when it began, the originally unknown Bloom filter suddenly became famous, as if it was on the Internet, doing development, everyone knows, everyone knows, even small partners who are not very concerned about technology have heard its name.

I also spent a lot of

time studying Bloom filters, read a lot of blogs, but I was not from a science class, and I didn’t have such a smart mind, and I was lazy… After the infinite reincarnation of “give up, pick up, give up, pick up”, it should be regarded as understanding the core idea of Bloom’s filter, so I want to share it with you.

Application of

Bloom filters

Let’s first look at the application scenarios of Bloom filters to let everyone know what the magical Bloom filter can do.

Cache penetration

We often put some data in Redis and other caches, such as product details. In this way, when a query request comes in, we can directly go to the cache to fetch data according to the product ID without reading the database, which is the simplest, most common, and most effective way to improve performance. Interview FAQs, cache three major questions and solutions!

The general query request flow is like this: first check the cache, return directly if there is a cache, if there is no cache, then go to the database query, and then put the data taken out of the database into the cache, everything looks good.

But what happens if there are a lot of requests coming in right now, all requesting a product ID that doesn’t exist? Since the product IDs do not exist, then there must be no cache, no cache, so a large number of requests are intimidated to the database, the pressure of the database suddenly rises, and it is possible to kill the database.

Although there are many ways to solve this problem, our main character is the “Bloom filter”, and yes, the “Bloom filter” can solve (mitigate) the cache penetration problem. As for why it is “mitigation”, you will understand it by reading it.

A large amount of data, judge whether a given has

a large amount of data in it, and the size of this data has far exceeded the memory of the server, now give you another data, how to judge that the data given to you is not in it

.

If the memory of the server is

large enough, then using HashMap is a good solution, the theoretical time complexity can reach O(1), but now the size of the data has far exceeded the memory of the server, so it is not possible to use HashMap, at this time you can use the “Bloom filter” to solve this problem. But again, there will be a certain “false positive rate”.

What is a Bloom filter

The Bloom filter was proposed by a person named “Bloom”, which itself is a very long binary vector, since it is a binary vector, then it is obvious that it is stored either 0 or 1.

Now let’s create a new bloom filter with a length of 16, and the default value is 0, like this:

now you need to add a data:

We calculate Hash1 (data) = 5 by some calculation method, such as Hash1, and we change the grid with subscript 5 to 1, like the following:

We calculated Hash2 (data) = 9 by some calculation method, such as Hash2, and we changed the grid with subscript 9 to 1, like the following:

Or through some calculation method, such as Hash3, we calculate Hash3 (data) = 2, we change the grid with subscript 2 to 1, like the following:

> so , the data just added occupies the three cells of the Bloom filter “5”, “9”, and “2”.

It can be seen that from the bloom filter itself, there is no complete data stored at all, but a series of random mapping functions are used to calculate the position, and then fill the binary vector.

What’s the use of that? For example, now give you another data, you want to judge whether this data is duplicated, how do you do it?

You only need to use the above three fixed calculation methods to calculate which cells this data occupies, and then see if these cells are placed in 1, if there is a grid that is not 1, then it means that this number is not in it.

This is very easy to understand, for example, now you have given you the data you just added, you through three fixed calculation methods, the calculated result must be exactly the same as the above, but also occupy the bloom filter “5”, “9”, “2” three grids.

However, there is a problem to note that if the 1s placed in these grids do not necessarily mean that the given data must be repeated, perhaps the results of other data are the same after three fixed calculations. This is also easy to understand, for example, we need to judge whether objects are equal, not just whether their hashes are equal.

That is to say, the Bloom filter can only determine whether the data must not exist, but cannot determine whether the data must exist.

Logically speaking, after introducing the process of adding and querying, it is necessary to introduce the process of deletion, but unfortunately it is difficult for Bloom filter to delete data, why? Think about it, for example, if you want to delete the data just given to you, you change the three grids of “5”, “9” and “2” to 0, but maybe other data is also mapped to the three grids of “5”, “9”, “2”, isn’t this chaotic?

I believe that after my introduction, everyone should have a shallow understanding of the Bloom filter, at least you should be clear about the advantages and disadvantages of the Bloom filter

:

  • advantages: because the stored data is not complete, it occupies very little memory, and it is added, and the query speed is fast enough;

  • Disadvantages: As the data increases, the false positive rate increases; Inability to delete data; You can only judge whether the data must not exist, but you cannot judge whether the data must exist.

As you can see, the advantages and disadvantages of Bloom filters are just as obvious.

In the above, I gave an example binary vector length of 16, by three random mapping functions to calculate the position, in actual development, if you want to add a large amount of data, only 16 bits is not enough, in order to reduce the false positive rate, we can also use more random mapping functions, longer binary vectors to calculate positions.

Guava implements the Bloom filter

Now I believe you should have a more emotional understanding of the Bloom filter, the core idea of the Bloom filter is actually not difficult, the difficulty lies in how to design the random mapping function, in the end how many times to map, the length of the binary vector is set to how much is better, which may not be the general development can control.

Fortunately, Google guys provide us with out-of-the-box components to help us implement bloom filters, so let’s see how Google guys give us a “gift”.

The first to introduce “gifts” in pom:

<dependency>
>com.google.guava groupId>
  <artifactId>guavaartifactId>
  < version>19.0version>
dependency>

and then you can test:

private static int size = 1000000; How much data is expected to be

insertedPrivate static double FPP = 0.01; Expected false positive rate

private static BloomFilter bloomFilter = BloomFilter.create(Funnels.integerFunnel(), size, fpp);

public static void main(String[] args) {


insert data
for  (int i = 0; i < 1000000; i++) {    bloomFilter.put(i);  }

  int count = 0;


  for (int i = 1000000; i < 2000000; i++) {
    if (bloomFilter.mightContain(i)) {      count++;

      System. out.println(i + mispositive);

    }  }

  System. out.println (total number of false positives: + count);


Simple

code analysis:

We define a bloom filter with two important parameters, which are how much data we expect to insert, the false positive rate we expect, and the false positive rate cannot be 0.

I inserted 0-1000000

into the bloom filter and then tested the false positive rate with 1000000-2000000.

Running result:

1999501 misjudged 1999567 misjudged 1999640 misjudged




1999697 misjudged
1999827 misjudged
1999942 misjudged
the total number of mispositives: 10314

Now a total of 1 million data does not exist, 10314 false positives, let’s calculate the false positive rate:

That’s not much different from our definition of an expected false positive rate of 0.01.

Redis implements the Bloom filter

The above use guava to implement the bloom filter is to put the data in local memory, can not achieve the sharing of the bloom filter, we can also put the data in redis, use redis to implement the bloom filter, we want to use the data structure is bitmap, you may have questions, redis supports five data structures: String, List, Hash, Set, ZSet, no bitmap. That’s right, in fact, the essence of bitmap is String.

Some friends may say, Nani, the bloom filter has not been introduced, how to come out with a bitmap, it’s okay, you can understand bitmap as a binary vector.

To implement the bloom filter with redis, we need to design our own mapping function and measure the length of the binary vector ourselves, which is undoubtedly an impossible task for me, and can only use the search engine to release the code directly.

public class RedisMain {
    static final int expectedInsertions = 100; How much data to insert
static final double fpp = 0.01; Expected false positive rate

bit array length


private static long numBits;

Number of hash functions


private static int numHashFunctions;

    static {

        numBits = optimalNumOfBits(expectedInsertions, fpp);        numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);    }

    public static void main(String[] args) {


        Jedis jedis = new Jedis( 192.168.0.1096379);
        for (int i = 0; i < 100; i++) {
            long[] indexs = getIndexs(String.valueOf(i));
            for (long index : indexs) {
                jedis.setbit(codebear:bloom, index, true);            }        }

        for (int i = 0; i < 100; i++) {


            long[] indexs = getIndexs(String.valueOf(i));
            for (long index : indexs) {
                Boolean isContain = jedis.getbit(codebear:bloom, index);
if (!isContain) {
System.out.println(i + certainly not duplicated);                } }

System.out.println(i + may repeat);

        }    }
    private static long[] getIndexs(String key) {
        long hash1 = hash(key);
        long hash2 = hash1 >>> 16;
        long[] result = new long[numHashFunctions];
        for (int i = 0; i < numHashFunctions; i++) {
            long combinedHash = hash1 + i * hash2;
            if (combinedHash < 0) {                combinedHash = ~combinedHash;            }            result[i] = combinedHash % numBits;        }

        return result;

    }

    private static long hash(String key) {


        Charset charset = Charset.forName( UTF-8);
        return Hashing.murmur3_128().hashObject(key, Funnels.stringFunnel(charset)).asLong();    }

Calculate the number of hash functions


private static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2 )));    }

Calculate the length of the bit array


private static long optimalNumOfBits(long n, double  p) {
        if (p == 0) {            p = Double.MIN_VALUE;        }

        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));

}

}Run

result:

88 may repeat 89 may repeat 90 may repeat 91 may repeat 92 may repeat 93 may repeat 94 may repeat 95 may repeat 96 may repeat 97 may repeat 98 may repeat

99 may be repeated




public number (zhisheng) reply to Face, ClickHouse, ES, Flink, Spring, Java, Kafka, Monitoring < keywords such as span class="js_darkmode__80"> to view more articles corresponding to keywords.

Like + watch, less bugs 👇

Buy Me A Coffee