Commit 6a72711

Merge pull request #69 from github/sc-20250718-geofilter-customize-hasher
Geofilter customize hasher
2 parents a514c1a + b9b0bae commit 6a72711

12 files changed

Lines changed: 317 additions & 59 deletions

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# Choosing a hash function

## Reproducibility

This library uses hash functions to assign values to buckets deterministically. The same item
will hash to the same value, and modify the same bit in the geofilter.

When comparing geofilters it is important that the same hash functions, using the same seed
values, have been used for *both* filters. Attempting to compare geofilters which have been
produced using different hash functions or the same hash function with different seeds will
produce nonsensical results.

Similar to the Rust standard library, this crate uses the `BuildHasher` trait and creates
a new `Hasher` for every item processed.

To help prevent mistakes caused by mismatching hash functions or seeds we introduce a trait
`ReproducibleBuildHasher` which you must implement if you wish to use a custom hashing function.
By marking a `BuildHasher` with this trait you're asserting that `Hasher`s produced using
`Default::default` will hash identical items to the same `u64` value across multiple calls
to `BuildHasher::hash_one`.

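As a rough illustration of the property described above, the following sketch uses only the standard library (not this crate) to compare independently constructed build hashers. The helper `defaults_agree` is a hypothetical name introduced here for illustration.

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher, RandomState};

/// Hashes `item` with two independently default-constructed build hashers
/// and reports whether both produced the same `u64`.
fn defaults_agree<H: BuildHasher + Default>(item: u64) -> bool {
    H::default().hash_one(item) == H::default().hash_one(item)
}

fn main() {
    // `BuildHasherDefault<DefaultHasher>` always starts from the same state,
    // so separate instances agree on every item.
    assert!((0..100u64).all(defaults_agree::<BuildHasherDefault<DefaultHasher>>));

    // Each `RandomState` carries its own random keys, so separate instances
    // are expected to disagree (a collision is possible but extremely unlikely).
    let agree = defaults_agree::<RandomState>(42);
    println!("RandomState instances agree on 42: {agree}");
}
```

Only a type for which the first check holds is a sensible candidate for `ReproducibleBuildHasher`.
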
The following is an example of some incorrect code which produces nonsense results:

```rust
use std::hash::RandomState;

// Implement our marker trait for `RandomState`.
// You should _NOT_ do this as `RandomState::default` does not produce
// reproducible hashers.
impl ReproducibleBuildHasher for RandomState {}

#[test]
fn test_different_hash_functions() {
    // The last parameter in this FixedConfig means we're using RandomState as the BuildHasher
    pub type FixedConfigRandom = FixedConfig<Diff, u16, 7, 112, 12, RandomState>;

    let mut a = GeoDiffCount::new(FixedConfigRandom::default());
    let mut b = GeoDiffCount::new(FixedConfigRandom::default());

    // Add our values
    for n in 0..100 {
        a.push(n);
        b.push(n);
    }

    // We have inserted the same items into both filters so we'd expect the
    // symmetric difference to be zero if all is well.
    let diff_size = a.size_with_sketch(&b);

    // But all is not well. This assertion fails!
    assert_eq!(diff_size, 0.0);
}
```

The actual value returned in this example is ~200. This makes sense because the two filters hashed
the same items differently, so the geofilter sees 100 unrelated values in each filter and approximates
the difference as ~200. If we were to rerun the above example with a genuinely reproducible `BuildHasher`
then the resulting diff size would be `0`.

In debug builds, as part of the config's `eq` implementation, our library will assert that the `BuildHasher`s
produce the same `u64` value when given the same input, but this check is not enabled in release builds.

## Stability

Following from this, it might be important that your hash functions and seed values are stable, meaning
that they won't change from one release to another.

The default function provided in this library is *NOT* stable as it is based on the Rust standard library's
`DefaultHasher`, which does not have a specified algorithm and may change across releases of Rust.

Stability is especially important to consider if you are using serialized geofilters which may have
been created with a previous version of the Rust standard library.

This library provides an implementation of `ReproducibleBuildHasher` for the `FnvBuildHasher` provided
by the `fnv` crate version `1.0`. This is a _stable_ hash function in that it won't change unexpectedly
but it doesn't have good diffusion properties. This means if your input items have low entropy (for
example numbers from `0..10000`) you will find that the geofilter is not able to produce accurate estimations.

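To make the trade-off between stability and diffusion more concrete, here is a minimal sketch of 64-bit FNV-1a written out by hand. It is an illustrative stand-in and not the `fnv` crate's own type; the names `Fnv1a`, `fnv1a` and `low_bit_parity` are assumptions introduced for this example. Which hash bits the geofilter actually consumes is not shown here, so treat the low-bit check as one symptom of weak diffusion rather than a statement about this library's accuracy.

```rust
use std::hash::{BuildHasher, BuildHasherDefault, Hasher};

/// Hand-rolled 64-bit FNV-1a, written out only to make the algorithm visible.
/// This is an illustrative stand-in, not the `fnv` crate's own type.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.0 ^= u64::from(byte);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Hashes raw bytes with a fresh `Fnv1a` hasher.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hasher = BuildHasherDefault::<Fnv1a>::default().build_hasher();
    hasher.write(bytes);
    hasher.finish()
}

/// XOR of the lowest bit of every input byte.
fn low_bit_parity(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0, |acc, &b| acc ^ u64::from(b & 1))
}

fn main() {
    // Stable: the algorithm is fixed, so this value never changes.
    assert_eq!(fnv1a(b"a"), 0xaf63_dc4c_8601_ec8c);

    // Weak diffusion: XOR and multiplication by an odd constant never carry
    // information downwards, so the lowest output bit is just the parity of
    // the lowest input bits (flipped by the odd offset basis).
    for n in 0u64..10_000 {
        let bytes = n.to_le_bytes();
        assert_eq!(fnv1a(&bytes) & 1, 1 ^ low_bit_parity(&bytes));
    }
}
```

The second check shows that higher input bits never influence the lowest output bit, which is the kind of structure a hash with stronger diffusion would remove.
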
## Uniformity and Diffusion

In order to produce accurate estimations it is important that your hash function is able to produce evenly
distributed outputs for your input items.

This property must be balanced against the performance requirements of your system as stronger hashing
algorithms are often slower.

Depending on your input data, different functions may be more or less appropriate. For example, if your input
items have high entropy (e.g. SHA256 values) then the diffusion of your hash function might matter less.

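One informal way to check uniformity on your own input data is to count how hashed items spread over a handful of buckets. The sketch below is standard-library only. `IdentityHasher` and `bucket_counts` are hypothetical names, and bucketing by the top four bits is an arbitrary choice for illustration, not how this crate maps hashes to buckets.

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher, Hasher};

/// A deliberately poor hasher that returns the last `u64` written unchanged.
/// Purely illustrative; not part of this crate.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Fallback for non-`u64` inputs: fold the bytes in without mixing.
        for &byte in bytes {
            self.0 = (self.0 << 8) | u64::from(byte);
        }
    }

    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Counts how many of the items `0..n` land in each of 16 buckets selected
/// by the top four bits of the hash.
fn bucket_counts<H: BuildHasher + Default>(n: u64) -> [usize; 16] {
    let build_hasher = H::default();
    let mut counts = [0usize; 16];
    for item in 0..n {
        let bucket = (build_hasher.hash_one(item) >> 60) as usize;
        counts[bucket] += 1;
    }
    counts
}

fn main() {
    // Small integers never set the top bits, so everything piles into bucket 0.
    let identity = bucket_counts::<BuildHasherDefault<IdentityHasher>>(16_000);
    println!("identity: {identity:?}");

    // A mixing hasher spreads the same inputs across all 16 buckets.
    let mixed = bucket_counts::<BuildHasherDefault<DefaultHasher>>(16_000);
    println!("default:  {mixed:?}");
}
```

Running the same count over a sample of real input items gives a quick, if crude, signal of whether a candidate hash function distributes them evenly.
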
## Implementing your own `ReproducibleBuildHasher` type

If you are using a hash function that you have not implemented yourself you will not be able to implement
`ReproducibleBuildHasher` on that type directly due to Rust's orphan rules. The easiest way to get around this
is to create a newtype which proxies the underlying `BuildHasher`.

In addition to `BuildHasher`, `ReproducibleBuildHasher` requires `Default` and `Clone`, which are usually implemented
on `BuildHasher`s, so you can probably just `#[derive(...)]` those. If your `BuildHasher` doesn't have those
traits then you may need to implement them manually.

Here is an example of how to use a newtype to mark your `BuildHasher` as reproducible:

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher};

use geo_filters::build_hasher::ReproducibleBuildHasher;

// Newtype wrapper: a local type, so the orphan rules allow the impl below.
#[derive(Clone, Default)]
pub struct MyBuildHasher(BuildHasherDefault<DefaultHasher>);

impl BuildHasher for MyBuildHasher {
    type Hasher = DefaultHasher;

    fn build_hasher(&self) -> Self::Hasher {
        self.0.build_hasher()
    }
}

impl ReproducibleBuildHasher for MyBuildHasher {}
```
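
The newtype pattern can also be tried out without this crate. The sketch below declares a local stand-in for `ReproducibleBuildHasher`, reduced to its supertraits so that it compiles on its own; `hash_with_default` is a hypothetical helper introduced only to show the trait being used as a bound.

```rust
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher};

// Local stand-in for the crate's marker trait, reduced to its supertraits
// so the sketch compiles on its own.
pub trait ReproducibleBuildHasher: BuildHasher + Default + Clone {}

#[derive(Clone, Default)]
pub struct MyBuildHasher(BuildHasherDefault<DefaultHasher>);

impl BuildHasher for MyBuildHasher {
    type Hasher = DefaultHasher;

    fn build_hasher(&self) -> Self::Hasher {
        self.0.build_hasher()
    }
}

impl ReproducibleBuildHasher for MyBuildHasher {}

/// Hypothetical helper: hashes `item` with a fresh default `H`.
fn hash_with_default<H: ReproducibleBuildHasher>(item: &str) -> u64 {
    H::default().hash_one(item)
}

fn main() {
    // The wrapper forwards to the inner build hasher, so it hashes exactly
    // like the type it wraps.
    let inner = BuildHasherDefault::<DefaultHasher>::default();
    assert_eq!(
        hash_with_default::<MyBuildHasher>("geofilter"),
        inner.hash_one("geofilter"),
    );

    // Separately constructed wrappers agree with each other as well.
    assert_eq!(
        hash_with_default::<MyBuildHasher>("geofilter"),
        hash_with_default::<MyBuildHasher>("geofilter"),
    );
}
```

Because the wrapper only forwards to the inner build hasher, it inherits both the reproducibility and the lack of cross-release stability of `DefaultHasher`.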

crates/geo_filters/evaluation/accuracy.rs

Lines changed: 15 additions & 8 deletions
@@ -2,6 +2,7 @@ use std::fs::File;
 use std::path::PathBuf;

 use clap::Parser;
+use geo_filters::build_hasher::UnstableDefaultBuildHasher;
 use geo_filters::config::VariableConfig;
 use itertools::Itertools;
 use once_cell::sync::Lazy;
@@ -156,19 +157,22 @@ static SIMULATION_CONFIG_FROM_STR: Lazy<Vec<SimulationConfigParser>> = Lazy::new
             let [b, bytes, msb] = capture_usizes(&c, [2, 3, 4]);
             match t {
                 BucketType::U8 => {
-                    let c = VariableConfig::<_, u8>::new(b, bytes, msb);
+                    let c = VariableConfig::<_, u8, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
                 }
                 BucketType::U16 => {
-                    let c = VariableConfig::<_, u16>::new(b, bytes, msb);
+                    let c =
+                        VariableConfig::<_, u16, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
                 }
                 BucketType::U32 => {
-                    let c = VariableConfig::<_, u32>::new(b, bytes, msb);
+                    let c =
+                        VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
                 }
                 BucketType::U64 => {
-                    let c = VariableConfig::<_, u64>::new(b, bytes, msb);
+                    let c =
+                        VariableConfig::<_, u64, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDiffCount::new(c.clone())))
                 }
             }
@@ -185,19 +189,22 @@ static SIMULATION_CONFIG_FROM_STR: Lazy<Vec<SimulationConfigParser>> = Lazy::new

             match t {
                 BucketType::U8 => {
-                    let c = VariableConfig::<_, u8>::new(b, bytes, msb);
+                    let c = VariableConfig::<_, u8, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
                 }
                 BucketType::U16 => {
-                    let c = VariableConfig::<_, u16>::new(b, bytes, msb);
+                    let c =
+                        VariableConfig::<_, u16, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
                 }
                 BucketType::U32 => {
-                    let c = VariableConfig::<_, u32>::new(b, bytes, msb);
+                    let c =
+                        VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
                 }
                 BucketType::U64 => {
-                    let c = VariableConfig::<_, u64>::new(b, bytes, msb);
+                    let c =
+                        VariableConfig::<_, u64, UnstableDefaultBuildHasher>::new(b, bytes, msb);
                     Box::new(move || Box::new(GeoDistinctCount::new(c.clone())))
                 }
             }

crates/geo_filters/evaluation/performance.rs

Lines changed: 4 additions & 3 deletions
@@ -1,4 +1,5 @@
 use criterion::{black_box, criterion_group, criterion_main, Criterion};
+use geo_filters::build_hasher::UnstableDefaultBuildHasher;
 use geo_filters::config::VariableConfig;
 use geo_filters::diff_count::{GeoDiffCount, GeoDiffCount13};
 use geo_filters::distinct_count::GeoDistinctCount13;
@@ -20,7 +21,7 @@ fn criterion_benchmark(c: &mut Criterion) {
             })
         });
         group.bench_function("geo_diff_count_var_13", |b| {
-            let c = VariableConfig::<_, u32>::new(13, 7680, 256);
+            let c = VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(13, 7680, 256);
             b.iter(move || {
                 let mut gc = GeoDiffCount::new(c.clone());
                 for i in 0..*size {
@@ -59,7 +60,7 @@ fn criterion_benchmark(c: &mut Criterion) {
             })
         });
         group.bench_function("geo_diff_count_var_13", |b| {
-            let c = VariableConfig::<_, u32>::new(13, 7680, 256);
+            let c = VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(13, 7680, 256);
             b.iter(move || {
                 let mut gc = GeoDiffCount::new(c.clone());
                 for i in 0..*size {
@@ -104,7 +105,7 @@ fn criterion_benchmark(c: &mut Criterion) {
             })
         });
         group.bench_function("geo_diff_count_var_13", |b| {
-            let c = VariableConfig::<_, u32>::new(13, 7680, 256);
+            let c = VariableConfig::<_, u32, UnstableDefaultBuildHasher>::new(13, 7680, 256);
             b.iter(move || {
                 let mut gc1 = GeoDiffCount::new(c.clone());
                 let mut gc2 = GeoDiffCount::new(c.clone());
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
use std::hash::{BuildHasher, BuildHasherDefault, DefaultHasher, Hasher as _};

use fnv::FnvBuildHasher;

/// Trait for a hasher factory that can be used to produce hashers
/// for use with geometric filters.
///
/// It is a superset of [`BuildHasher`], enforcing additional requirements
/// on the hasher builder that are required for the geometric filters and
/// surrounding code.
///
/// When performing operations across two different geometric filters,
/// the hashers must be equal, i.e. they must produce the same hash for the
/// same input.
pub trait ReproducibleBuildHasher: BuildHasher + Default + Clone {
    #[inline]
    fn debug_assert_hashers_eq() {
        // In debug builds we check that two default-constructed hashers
        // produce the same output. The library user should only have implemented
        // our build hasher trait if this is already true, but we check
        // here in case they have implemented the trait in error.
        debug_assert_eq!(
            Self::default().build_hasher().finish(),
            Self::default().build_hasher().finish(),
            "Hashers produced by ReproducibleBuildHasher do not produce the same output with the same input"
        );
    }
}

/// Note that this `BuildHasher` has a consistent implementation of `Default`
/// but is NOT stable across releases of Rust. It is therefore dangerous
/// to use if you plan on serializing the geofilters and reusing them due
/// to the fact that you can serialize a filter made with one version and
/// deserialize with another version of the hasher factory.
pub type UnstableDefaultBuildHasher = BuildHasherDefault<DefaultHasher>;

impl ReproducibleBuildHasher for UnstableDefaultBuildHasher {}
impl ReproducibleBuildHasher for FnvBuildHasher {}

crates/geo_filters/src/config.rs

Lines changed: 60 additions & 13 deletions
@@ -2,7 +2,7 @@

 use std::{marker::PhantomData, sync::Arc};

-use crate::Method;
+use crate::{build_hasher::ReproducibleBuildHasher, Method};

 mod bitchunks;
 mod buckets;
@@ -30,8 +30,9 @@ use once_cell::sync::Lazy;
 /// Those conversions can be shared across multiple geo filter instances. This way, the
 /// conversions can also be optimized via e.g. lookup tables without paying the cost with every
 /// new geo filter instance again and again.
-pub trait GeoConfig<M: Method>: Clone + Eq + Sized + Send + Sync {
+pub trait GeoConfig<M: Method>: Clone + Eq + Sized {
     type BucketType: IsBucketType + 'static;
+    type BuildHasher: ReproducibleBuildHasher;

     /// The number of most-significant bits that are stored sparsely as positions.
     fn max_msb_len(&self) -> usize;
@@ -79,9 +80,16 @@ pub trait GeoConfig<M: Method>: Clone + Eq + Sized + Send + Sync {
 /// Instantiating this type may panic if `T` is too small to hold the maximum possible
 /// bucket id determined by `B`, or `B` is larger than the largest statically defined
 /// lookup table.
-#[derive(Clone, Eq, PartialEq)]
-pub struct FixedConfig<M: Method, T, const B: usize, const BYTES: usize, const MSB: usize> {
-    _phantom: PhantomData<(M, T)>,
+#[derive(Clone)]
+pub struct FixedConfig<
+    M: Method,
+    T,
+    const B: usize,
+    const BYTES: usize,
+    const MSB: usize,
+    H: ReproducibleBuildHasher,
+> {
+    _phantom: PhantomData<(M, T, H)>,
 }

 impl<
@@ -90,9 +98,11 @@ impl<
         const B: usize,
         const BYTES: usize,
         const MSB: usize,
-    > GeoConfig<M> for FixedConfig<M, T, B, BYTES, MSB>
+        H: ReproducibleBuildHasher,
+    > GeoConfig<M> for FixedConfig<M, T, B, BYTES, MSB, H>
 {
     type BucketType = T;
+    type BuildHasher = H;

     #[inline]
     fn max_msb_len(&self) -> usize {
@@ -148,42 +158,76 @@ impl<
         const B: usize,
         const BYTES: usize,
         const MSB: usize,
-    > Default for FixedConfig<M, T, B, BYTES, MSB>
+        H: ReproducibleBuildHasher,
+    > Default for FixedConfig<M, T, B, BYTES, MSB, H>
 {
     fn default() -> Self {
         assert_bucket_type_large_enough::<T>(B);
         assert_buckets_within_estimation_bound(B, BYTES * BITS_PER_BYTE);
+
         assert!(
             B < M::get_lookups().len(),
             "B = {} is not available for fixed config, requires B < {}",
             B,
             M::get_lookups().len()
         );
+
         Self {
             _phantom: PhantomData,
         }
     }
 }

+impl<
+        M: Method + Lookups,
+        T: IsBucketType,
+        const B: usize,
+        const BYTES: usize,
+        const MSB: usize,
+        H: ReproducibleBuildHasher,
+    > PartialEq for FixedConfig<M, T, B, BYTES, MSB, H>
+{
+    fn eq(&self, _other: &Self) -> bool {
+        H::debug_assert_hashers_eq();
+
+        // The values of the fixed config are provided at compile time
+        // so no runtime computation is required
+        true
+    }
+}
+
+impl<
+        M: Method + Lookups,
+        T: IsBucketType,
+        const B: usize,
+        const BYTES: usize,
+        const MSB: usize,
+        H: ReproducibleBuildHasher,
+    > Eq for FixedConfig<M, T, B, BYTES, MSB, H>
+{
+}
+
 /// Geometric filter configuration using dynamic lookup tables.
 #[derive(Clone)]
-pub struct VariableConfig<M: Method, T> {
+pub struct VariableConfig<M: Method, T, H: ReproducibleBuildHasher> {
     b: usize,
     bytes: usize,
     msb: usize,
-    _phantom: PhantomData<(M, T)>,
+    _phantom: PhantomData<(M, T, H)>,
     lookup: Arc<Lookup>,
 }

-impl<M: Method, T> Eq for VariableConfig<M, T> {}
+impl<M: Method, T, H: ReproducibleBuildHasher> Eq for VariableConfig<M, T, H> {}

-impl<M: Method, T> PartialEq for VariableConfig<M, T> {
+impl<M: Method, T, H: ReproducibleBuildHasher> PartialEq for VariableConfig<M, T, H> {
     fn eq(&self, other: &Self) -> bool {
+        H::debug_assert_hashers_eq();
+
         self.b == other.b && self.bytes == other.bytes && self.msb == other.msb
     }
 }

-impl<M: Method + Lookups, T: IsBucketType> VariableConfig<M, T> {
+impl<M: Method + Lookups, T: IsBucketType, H: ReproducibleBuildHasher> VariableConfig<M, T, H> {
     /// Returns a new configuration value. See [`FixedConfig`] for the meaning
     /// of the parameters. This functions computes a new lookup table every time
     /// it is invoked, so make sure to share the resulting value as much as possible.
@@ -205,8 +249,11 @@ impl<M: Method + Lookups, T: IsBucketType> VariableConfig<M, T> {
     }
 }

-impl<M: Method, T: IsBucketType + 'static> GeoConfig<M> for VariableConfig<M, T> {
+impl<M: Method, T: IsBucketType + 'static, H: ReproducibleBuildHasher> GeoConfig<M>
+    for VariableConfig<M, T, H>
+{
     type BucketType = T;
+    type BuildHasher = H;

     #[inline]
     fn max_msb_len(&self) -> usize {