Bloom Filters vs. Cuckoo Filters vs. MinHash vs. HyperLogLog: A Comprehensive Comparison

ADARSH KUMAR
10 min read · May 22, 2024


Are you tired of the limitations of traditional methods for handling large datasets? Let’s take a deep dive into the world of string comparison and explore cutting-edge, memory-optimized solutions. As software engineers, we’ve all had those moments — whether it’s the classic == vs. equals mix-up in Java or the == vs. === conundrum in JavaScript 🔥

In this article, we’ll delve into the fascinating realm of comparing arrays of strings, finding intersections, and counting elements. Plus, we’ll uncover some innovative probabilistic techniques that can revolutionize your data processing. Get ready to optimize your approach and boost your efficiency with these advanced methods!

Problem Statement

How do you compare two arrays of strings and find their intersection, especially when one of them contains millions of rows? This challenge not only tests your algorithm’s efficiency but also its scalability.

Before diving into probabilistic data structures, we’ll explore traditional methods and evaluate their effectiveness in managing large datasets. By the end, you’ll understand both conventional and advanced approaches to optimize your data processing tasks.

Solution 1: Brute force (I mean, why not? 😆)

As a first step, we will simply iterate over both arrays to find the intersection.

// Function to find the intersection of two arrays using brute force
func bruteForceIntersection(arr1, arr2 []string) []string {
    var intersection []string
    for _, item1 := range arr1 {
        for _, item2 := range arr2 {
            if item1 == item2 {
                intersection = append(intersection, item1)
                break
            }
        }
    }
    return intersection
}

Efficiency

  • Time Complexity: O(m*n), where m and n are the lengths of the two arrays.
  • Space Complexity: O(1) additional space, not counting the output slice.
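A quick usage sketch (the sample fruit arrays are the ones reused for the probabilistic examples later in this article):

```go
package main

import "fmt"

// bruteForceIntersection compares every pair of elements (O(m*n)).
func bruteForceIntersection(arr1, arr2 []string) []string {
	var intersection []string
	for _, item1 := range arr1 {
		for _, item2 := range arr2 {
			if item1 == item2 {
				intersection = append(intersection, item1)
				break // stop after the first match so duplicates in arr2 are not double-counted
			}
		}
	}
	return intersection
}

func main() {
	arr1 := []string{"apple", "banana", "cherry", "date", "elderberry"}
	arr2 := []string{"fig", "grape", "cherry", "apple", "banana"}
	// The result preserves arr1's order.
	fmt.Println(bruteForceIntersection(arr1, arr2)) // [apple banana cherry]
}
```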

Solution 2: Using Hashmap

// Function to find the intersection of two arrays using a hashmap
func hashMapIntersection(arr1, arr2 []string) []string {
    // Determine the smaller and larger array
    var smallArr, largeArr []string
    if len(arr1) <= len(arr2) {
        smallArr = arr1
        largeArr = arr2
    } else {
        smallArr = arr2
        largeArr = arr1
    }

    // Create a map to store the elements of the smaller array
    elementsMap := make(map[string]struct{})
    for _, item := range smallArr {
        elementsMap[item] = struct{}{}
    }

    // Create a slice to store the intersection
    var intersection []string
    for _, item := range largeArr {
        if _, found := elementsMap[item]; found {
            intersection = append(intersection, item)
            // Remove the item from the map to avoid duplicates in the result
            delete(elementsMap, item)
        }
    }

    return intersection
}

Efficiency

  • Time Complexity: O(n + m), where n is the length of the smaller array and m is the length of the larger array.
  • Space Complexity: O(min(n, m)) for storing the elements of the smaller array in the hashmap.
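A usage sketch with deliberately duplicated input, to show why the `delete` call matters: each matched key is removed from the map, so repeats in the larger array appear only once in the result.

```go
package main

import "fmt"

// hashMapIntersection indexes the smaller array in a set, then scans the larger one.
func hashMapIntersection(arr1, arr2 []string) []string {
	smallArr, largeArr := arr1, arr2
	if len(arr2) < len(arr1) {
		smallArr, largeArr = arr2, arr1
	}
	elementsMap := make(map[string]struct{}, len(smallArr))
	for _, item := range smallArr {
		elementsMap[item] = struct{}{}
	}
	var intersection []string
	for _, item := range largeArr {
		if _, found := elementsMap[item]; found {
			intersection = append(intersection, item)
			delete(elementsMap, item) // each match is reported once
		}
	}
	return intersection
}

func main() {
	arr1 := []string{"cherry", "apple"}
	arr2 := []string{"apple", "banana", "apple", "cherry"}
	// "apple" appears twice in arr2 but only once in the result.
	fmt.Println(hashMapIntersection(arr1, arr2)) // [apple cherry]
}
```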

Solution 3: Two pointer

package main

import (
    "sort"
)

// Function to find the intersection of two arrays using the two-pointer approach
func twoPointerIntersection(arr1, arr2 []string) []string {

    // Sort both arrays (note: sort.Strings sorts in place, mutating the inputs)
    sort.Strings(arr1)
    sort.Strings(arr2)

    i, j := 0, 0
    var intersection []string

    for i < len(arr1) && j < len(arr2) {
        if arr1[i] == arr2[j] {
            intersection = append(intersection, arr1[i])
            i++
            j++
        } else if arr1[i] < arr2[j] {
            i++
        } else {
            j++
        }
    }

    return intersection
}

Efficiency

  • Time Complexity: O(n log n + m log m) for sorting the arrays, and O(n + m) for finding the intersection, where n and m are the lengths of the two arrays. If the arrays are already sorted, the time complexity is O(n + m).
  • Space Complexity: O(1) additional space, not including the space required for the input and output arrays.
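A usage sketch highlighting two behaviors of this approach that differ from the previous ones: the result comes back in sorted order rather than input order, and the caller's slices are reordered because `sort.Strings` sorts in place.

```go
package main

import (
	"fmt"
	"sort"
)

// twoPointerIntersection sorts both arrays, then walks them in lockstep.
func twoPointerIntersection(arr1, arr2 []string) []string {
	// sort.Strings sorts in place, so the caller's slices are mutated.
	sort.Strings(arr1)
	sort.Strings(arr2)
	i, j := 0, 0
	var intersection []string
	for i < len(arr1) && j < len(arr2) {
		switch {
		case arr1[i] == arr2[j]:
			intersection = append(intersection, arr1[i])
			i++
			j++
		case arr1[i] < arr2[j]:
			i++
		default:
			j++
		}
	}
	return intersection
}

func main() {
	arr1 := []string{"cherry", "apple", "elderberry"}
	arr2 := []string{"elderberry", "cherry", "grape"}
	// Unlike the previous approaches, the result is in sorted order.
	fmt.Println(twoPointerIntersection(arr1, arr2)) // [cherry elderberry]
}
```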

Comparison

All three approaches compute the exact intersection, which becomes expensive when the arrays contain millions of items. In the rest of this article we will look at probabilistic approaches that are far more memory-efficient, at the cost of some accuracy.

Solution 4: Bloom filter

A Bloom filter is a probabilistic data structure that efficiently tests whether an element is a member of a set. It provides a fast and memory-efficient way to handle large datasets, with the trade-off of allowing false positives (i.e., it might incorrectly indicate that an element is in the set) but never false negatives. Bloom filters are ideal for applications like cache filtering, network packet analysis, and database query optimization, where memory efficiency and speed are crucial, and occasional false positives are acceptable.
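Before reaching for a library, the mechanics can be sketched in a few dozen lines using only the standard library. This toy version (all names illustrative; real implementations pack bits into machine words rather than using a `[]bool`) derives k bit positions per item via double hashing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloomFilter is a toy Bloom filter: a bit array plus k derived hash functions.
type bloomFilter struct {
	bits []bool
	k    int
}

func newBloomFilter(mBits, k int) *bloomFilter {
	return &bloomFilter{bits: make([]bool, mBits), k: k}
}

// indexes derives k bit positions from two FNV-based hash values using
// double hashing: index_i = (h1 + i*h2) mod m.
func (b *bloomFilter) indexes(item string) []int {
	h := fnv.New64a()
	h.Write([]byte(item))
	h1 := h.Sum64()
	h.Write([]byte(item)) // hashing the item again yields a second value
	h2 := h.Sum64()
	idx := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		idx[i] = int((h1 + uint64(i)*h2) % uint64(len(b.bits)))
	}
	return idx
}

// add sets all k bits for the item.
func (b *bloomFilter) add(item string) {
	for _, i := range b.indexes(item) {
		b.bits[i] = true
	}
}

// test reports possible membership: false means definitely absent;
// true means present, or a false positive.
func (b *bloomFilter) test(item string) bool {
	for _, i := range b.indexes(item) {
		if !b.bits[i] {
			return false
		}
	}
	return true
}

func main() {
	bf := newBloomFilter(1024, 3)
	bf.add("apple")
	bf.add("banana")
	fmt.Println(bf.test("apple"))  // true: added items are always found
	fmt.Println(bf.test("durian")) // almost certainly false with a filter this empty
}
```

Note the asymmetry: a `false` from `test` is always correct, while a `true` might not be. In practice you would use a production library like the one below instead.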

package main

import (
    "fmt"

    "github.com/willf/bloom" // now maintained as github.com/bits-and-blooms/bloom
)

func main() {
    // Define two arrays of strings to find their intersection
    arr1 := []string{"apple", "banana", "cherry", "date", "elderberry"}
    arr2 := []string{"fig", "grape", "cherry", "apple", "banana"}

    // Create a Bloom filter with an estimated capacity of 10,000 elements and a false positive rate of 0.01
    bf := createBloomFilter(10000, 0.01)

    // Add elements from the first array to the Bloom filter
    addElementsToBloomFilter(bf, arr1)

    // Find the (approximate) intersection of the two arrays using the Bloom filter
    intersection := bloomFilterIntersection(bf, arr2)

    // Print the intersection
    fmt.Println("Intersection:", intersection)
}

// createBloomFilter initializes a Bloom filter with the specified capacity and false positive rate.
func createBloomFilter(capacity uint, falsePositiveRate float64) *bloom.BloomFilter {
    // NewWithEstimates sizes the filter for the given capacity and false positive rate.
    return bloom.NewWithEstimates(capacity, falsePositiveRate)
}

// addElementsToBloomFilter adds the elements from the provided array to the Bloom filter.
func addElementsToBloomFilter(bf *bloom.BloomFilter, elements []string) {
    for _, item := range elements {
        // Add the element to the Bloom filter after converting it to a byte slice
        bf.Add([]byte(item))
    }
}

// bloomFilterIntersection returns the elements of the second array that the Bloom filter
// reports as members. Because of false positives, the result may occasionally include an
// element that is not actually in the first array.
func bloomFilterIntersection(bf *bloom.BloomFilter, elements []string) []string {
    var intersection []string
    for _, item := range elements {
        // Test reports whether the element is (probably) in the filter
        if bf.Test([]byte(item)) {
            intersection = append(intersection, item)
        }
    }
    return intersection
}

Efficiency

  • Bloom Filter Efficiency: Bloom filters are highly efficient in terms of space and time for membership tests and insertions, making them suitable for large datasets.
  • Overall Time Complexity: O(n + m), where n is the size of the first array and m is the size of the second array.
  • Overall Space Complexity: Proportional to the Bloom filter’s size, which depends on the specified capacity and false positive rate.

Solution 4': Cuckoo filter (a Bloom filter alternative that supports deletions)

Cuckoo filters provide an efficient way to perform approximate set membership tests similar to Bloom filters but with additional benefits like support for deletions.

package main

import (
    "fmt"

    cuckoo "github.com/seiflotfy/cuckoofilter" // the package name is cuckoo
)

func main() {
    // Define two arrays of strings to find their intersection
    arr1 := []string{"apple", "banana", "cherry", "date", "elderberry"}
    arr2 := []string{"fig", "grape", "cherry", "apple", "banana"}

    // Create a Cuckoo filter with an estimated capacity of 10,000 elements
    cf := createCuckooFilter(10000)

    // Add elements from the first array to the Cuckoo filter
    addElementsToCuckooFilter(cf, arr1)

    // Find the (approximate) intersection of the two arrays using the Cuckoo filter
    intersection := cuckooFilterIntersection(cf, arr2)

    // Print the intersection
    fmt.Println("Intersection:", intersection)
}

// createCuckooFilter initializes a Cuckoo filter with the specified capacity.
// Efficiency: the size of the filter is proportional to the capacity.
func createCuckooFilter(capacity uint) *cuckoo.Filter {
    // NewFilter creates a Cuckoo filter with the given capacity.
    // This sets up internal data structures but does not process elements yet.
    return cuckoo.NewFilter(capacity)
}

// addElementsToCuckooFilter adds the elements from the provided array to the Cuckoo filter.
// Efficiency: O(n) where n is the number of elements in the array; each insertion
// is O(1) on average.
func addElementsToCuckooFilter(cf *cuckoo.Filter, elements []string) {
    for _, item := range elements {
        // InsertUnique skips the element if its fingerprint is already present
        cf.InsertUnique([]byte(item))
    }
}

// cuckooFilterIntersection returns the elements of the second array that the Cuckoo filter
// reports as members. Efficiency: O(m) where m is the number of elements in the second
// array, with each membership test O(1) on average; building the filter and finding the
// intersection is O(n + m) overall.
func cuckooFilterIntersection(cf *cuckoo.Filter, elements []string) []string {
    var intersection []string
    for _, item := range elements {
        // Lookup reports whether the element is (probably) in the filter
        if cf.Lookup([]byte(item)) {
            intersection = append(intersection, item)
        }
    }
    return intersection
}

Efficiency

  • Cuckoo Filter Efficiency: Cuckoo filters are highly efficient in terms of space and time for membership tests and insertions, with the added benefit of supporting deletions.
  • Overall Time Complexity: O(n + m), where n is the size of the first array and m is the size of the second array.
  • Overall Space Complexity: Proportional to the Cuckoo filter’s size, which depends on the specified capacity.
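To make the deletion support and the eviction mechanism that gives the filter its name concrete, here is a from-scratch toy sketch (all names illustrative, not the cuckoofilter library's API). Each item is stored as a small fingerprint in one of two candidate buckets; when both are full, an existing fingerprint is kicked to its alternate bucket:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const bucketSize = 4
const maxKicks = 500

// cuckooFilter is a toy cuckoo filter: 1-byte fingerprints in small buckets,
// two candidate buckets per item (partial-key cuckoo hashing).
// The bucket count must be a power of two so index arithmetic can use masking.
type cuckooFilter struct {
	buckets [][]byte
	mask    uint64
}

func newCuckooFilter(nBuckets int) *cuckooFilter {
	return &cuckooFilter{buckets: make([][]byte, nBuckets), mask: uint64(nBuckets - 1)}
}

func hash64(b []byte) uint64 {
	h := fnv.New64a()
	h.Write(b)
	return h.Sum64()
}

// params returns an item's fingerprint and its two candidate bucket indexes.
func (c *cuckooFilter) params(item string) (fp byte, i1, i2 uint64) {
	h := hash64([]byte(item))
	fp = byte(h%255) + 1 // non-zero fingerprint
	i1 = h & c.mask
	i2 = (i1 ^ hash64([]byte{fp})) & c.mask
	return
}

// altIndex maps (bucket, fingerprint) to the other candidate bucket.
func (c *cuckooFilter) altIndex(i uint64, fp byte) uint64 {
	return (i ^ hash64([]byte{fp})) & c.mask
}

func (c *cuckooFilter) insert(item string) bool {
	fp, i1, i2 := c.params(item)
	for _, i := range []uint64{i1, i2} {
		if len(c.buckets[i]) < bucketSize {
			c.buckets[i] = append(c.buckets[i], fp)
			return true
		}
	}
	// Both buckets full: evict a random fingerprint and relocate it.
	i := i1
	for k := 0; k < maxKicks; k++ {
		j := rand.Intn(len(c.buckets[i]))
		fp, c.buckets[i][j] = c.buckets[i][j], fp
		i = c.altIndex(i, fp)
		if len(c.buckets[i]) < bucketSize {
			c.buckets[i] = append(c.buckets[i], fp)
			return true
		}
	}
	return false // filter is too full
}

func (c *cuckooFilter) find(i uint64, fp byte) int {
	for j, f := range c.buckets[i] {
		if f == fp {
			return j
		}
	}
	return -1
}

func (c *cuckooFilter) lookup(item string) bool {
	fp, i1, i2 := c.params(item)
	return c.find(i1, fp) >= 0 || c.find(i2, fp) >= 0
}

// delete removes one copy of the fingerprint — the operation Bloom filters lack.
func (c *cuckooFilter) delete(item string) bool {
	fp, i1, i2 := c.params(item)
	for _, i := range []uint64{i1, i2} {
		if j := c.find(i, fp); j >= 0 {
			c.buckets[i] = append(c.buckets[i][:j], c.buckets[i][j+1:]...)
			return true
		}
	}
	return false
}

func main() {
	cf := newCuckooFilter(1024)
	cf.insert("cherry")
	fmt.Println(cf.lookup("cherry")) // true
	cf.delete("cherry")
	fmt.Println(cf.lookup("cherry")) // false: deletions actually work
}
```

Because fingerprints are stored explicitly, an item can be removed by deleting one matching fingerprint, which is impossible in a Bloom filter where bits are shared between items.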

Solution 5: MinHash

MinHash is memory-efficient and provides approximate results for large datasets. It is typically used to estimate set similarity (the Jaccard index) rather than to find exact intersections; the implementation below uses MinHash signatures to approximate the intersection.

package main

import (
    "fmt"
    "hash/fnv"

    "github.com/dgryski/go-minhash"
)

// fnvHash64a and fnvHash64 adapt the stdlib FNV hashes to the
// func([]byte) uint64 signature that minhash.NewMinWise expects.
func fnvHash64a(b []byte) uint64 {
    h := fnv.New64a()
    h.Write(b)
    return h.Sum64()
}

func fnvHash64(b []byte) uint64 {
    h := fnv.New64()
    h.Write(b)
    return h.Sum64()
}

func main() {
    // Define two arrays of strings to compare
    arr1 := []string{"apple", "banana", "cherry", "date", "elderberry"}
    arr2 := []string{"fig", "grape", "cherry", "apple", "banana"}

    // Number of hash functions (signature size) to use for MinHash
    numHashes := 128

    // Create MinHash signatures for both arrays
    mh1 := createMinHashSignature(numHashes, arr1)
    mh2 := createMinHashSignature(numHashes, arr2)

    // MinHash estimates the Jaccard similarity J = |A∩B| / |A∪B|.
    similarity := mh1.Similarity(mh2)

    // From J, the intersection size can be estimated via
    // |A∩B| ≈ J * (|A| + |B|) / (1 + J).
    estimate := similarity * float64(len(arr1)+len(arr2)) / (1 + similarity)

    fmt.Printf("Estimated Jaccard similarity: %.2f\n", similarity)
    fmt.Printf("Estimated intersection size: %.1f\n", estimate)
}

// createMinHashSignature creates a MinHash signature for the given array of elements.
// Efficiency: O(n * k), where n is the number of elements in the array
// and k is the number of hash functions used.
func createMinHashSignature(numHashes int, elements []string) *minhash.MinWise {
    // NewMinWise takes two base hash functions and the signature size
    mh := minhash.NewMinWise(fnvHash64a, fnvHash64, numHashes)
    for _, element := range elements {
        mh.Push([]byte(element))
    }
    return mh
}

Efficiency

  • Overall Time Complexity: O((n + m) * k), where n and m are the sizes of the two arrays and k is the number of hash functions.
  • Overall Space Complexity: O(k) per signature, on top of the input arrays themselves.
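To see what a MinHash signature is doing internally, here is a from-scratch sketch using only the standard library (all names illustrative). The k "independent" hash functions are simulated by remixing one FNV-1a hash with k constants, a common shortcut:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// minHashSignature keeps, per simulated hash function, the minimum hash value seen.
func minHashSignature(items []string, k int) []uint64 {
	sig := make([]uint64, k)
	for i := range sig {
		sig[i] = ^uint64(0) // start at max uint64
	}
	for _, item := range items {
		h := fnv.New64a()
		h.Write([]byte(item))
		base := h.Sum64()
		for i := 0; i < k; i++ {
			// Derive the i-th hash by finalizer-style remixing of (base, i).
			v := base ^ (uint64(i)*0x9e3779b97f4a7c15 + 0xbf58476d1ce4e5b9)
			v ^= v >> 33
			v *= 0xff51afd7ed558ccd
			v ^= v >> 33
			if v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

// estimateJaccard: under a random hash, the probability that two sets share the
// same minimum equals their Jaccard similarity, so the fraction of matching
// signature slots is an estimator for it.
func estimateJaccard(a, b []uint64) float64 {
	match := 0
	for i := range a {
		if a[i] == b[i] {
			match++
		}
	}
	return float64(match) / float64(len(a))
}

func main() {
	arr1 := []string{"apple", "banana", "cherry", "date", "elderberry"}
	arr2 := []string{"fig", "grape", "cherry", "apple", "banana"}
	j := estimateJaccard(minHashSignature(arr1, 256), minHashSignature(arr2, 256))
	fmt.Printf("estimated Jaccard: %.2f (true value: 3/7 ≈ 0.43)\n", j)
	// Estimated intersection size, derived from J = |A∩B| / (|A| + |B| - |A∩B|):
	fmt.Printf("estimated intersection size: %.1f\n", j*float64(len(arr1)+len(arr2))/(1+j))
}
```

Note the key property: only the k-value signatures need to be compared, never the original sets, which is what makes MinHash attractive for millions of rows.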

Solution 6: HyperLogLog

HyperLogLog (HLL) is a probabilistic algorithm used primarily for cardinality estimation (counting unique elements) in large datasets. While it isn’t typically used for finding exact intersections, it can provide an estimate of the intersection size using the principle of inclusion-exclusion.

package main

import (
    "fmt"
    "log"

    "github.com/axiomhq/hyperloglog"
)

func main() {
    // Define two arrays of strings to estimate their intersection size
    arr1 := []string{"apple", "banana", "cherry", "date", "elderberry"}
    arr2 := []string{"fig", "grape", "cherry", "apple", "banana"}

    // Create HyperLogLog sketches for both arrays
    hll1 := createHyperLogLog(arr1)
    hll2 := createHyperLogLog(arr2)

    // Estimate the cardinality (number of unique elements) of each set
    count1 := hll1.Estimate()
    count2 := hll2.Estimate()

    // Merge the two sketches to estimate the union's cardinality
    hllUnion := createHyperLogLog(nil)
    if err := hllUnion.Merge(hll1); err != nil {
        log.Fatalf("Failed to merge sketch: %v", err)
    }
    if err := hllUnion.Merge(hll2); err != nil {
        log.Fatalf("Failed to merge sketch: %v", err)
    }
    countUnion := hllUnion.Estimate()

    // Estimate the intersection size using the principle of inclusion-exclusion:
    // |A∩B| = |A| + |B| - |A∪B| (clamped at zero, since the estimates are noisy)
    var estimateIntersection uint64
    if count1+count2 > countUnion {
        estimateIntersection = count1 + count2 - countUnion
    }

    // Print the estimated intersection size
    fmt.Printf("Estimated Intersection Size: %d\n", estimateIntersection)
}

// createHyperLogLog creates a HyperLogLog sketch from an array of elements.
// Efficiency: this operation is O(n) where n is the number of elements in the array.
func createHyperLogLog(elements []string) *hyperloglog.Sketch {
    hll := hyperloglog.New16() // sketch with precision 16

    for _, element := range elements {
        hll.Insert([]byte(element))
    }

    return hll
}

Efficiency

  • HyperLogLog Efficiency: HyperLogLog is highly efficient for cardinality estimation with a fixed memory footprint, making it suitable for large datasets.
  • Overall Time Complexity: O(n + m), where n is the size of the first array and m is the size of the second array.
  • Overall Space Complexity: O(1) additional space, as the HyperLogLog structure uses a fixed amount of memory based on the precision parameter.

Comparison

  • Solutions 1–3 (brute force, hashmap, two pointers) return the exact intersection, with memory and time that grow with the input.
  • Bloom filter: fast, compact membership tests; false positives possible, no deletions.
  • Cuckoo filter: similar guarantees to a Bloom filter, with support for deletions.
  • MinHash: estimates set similarity (Jaccard) from small signatures, from which an approximate intersection size can be derived.
  • HyperLogLog: estimates cardinalities in fixed memory; the intersection size follows from inclusion-exclusion, though the elements themselves are not recoverable.

Thanks for reading till the end. Hope you learnt something new today 🥁
