Introduce an optimized extend_by_hashes function#131
Open
aneubeck wants to merge 3 commits into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR adds a batched insertion API to GeoDiffCount to speed up filter construction/insertion by avoiding per-hash rebalancing of the dense/sparse split, along with supporting bit-chunk utilities and benchmarks to quantify the improvement.
Changes:
- Added
GeoDiffCount::extend_by_hashesplusestimate_split_bucketto batch-xor many hashes while restoring invariants once per batch. - Added
BitVec::toggler/BitTogglerto efficiently toggle many bits without repeatedly resolving the internalCow. - Added
parity_bit_positionsto produce parity-awareBitChunkstreams, plus new correctness tests and Criterion benchmarks comparingpush_hashvs batched extend.
Show a summary per file
| File | Description |
|---|---|
crates/geo_filters/src/diff_count/bitvec.rs |
Adds a toggler helper to reduce overhead when flipping many bits in a hot loop. |
crates/geo_filters/src/diff_count.rs |
Implements extend_by_hashes, split estimation, and adds tests validating equivalence to per-hash insertion. |
crates/geo_filters/src/config/lookup.rs |
Optimizes lookup table storage and lookup path (includes new unsafe indexing). |
crates/geo_filters/src/config/bitchunks.rs |
Adds parity_bit_positions and expands test coverage for bit-chunk iterators. |
crates/geo_filters/evaluation/performance.rs |
Adds Criterion benchmarks for batch construction and batch insertion. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 5/5 changed files
- Comments generated: 2
Comment on lines
+223
to
+228
| #[inline] | ||
| pub fn toggle(&mut self, index: usize) { | ||
| debug_assert!(index < self.num_bits); | ||
| let (block_idx, bit_idx) = index.into_index_and_bit(); | ||
| self.blocks[block_idx] ^= bit_idx.into_block(); | ||
| } |
Comment on lines
42
to
+47
| let idx = hash >> (32 - self.b - 1); | ||
| let offset = (hash < self.buckets[idx].1) as usize; | ||
| offset + self.buckets[idx].0 + (1 << self.b) * levels | ||
| // SAFETY: `hash` was masked to 32 bits, so `idx = hash >> (31 - b)` holds at most `b + 1` | ||
| // significant bits and is therefore always `< 2^(b+1) == 2 << b == self.buckets.len()`. | ||
| debug_assert!(idx < self.buckets.len()); | ||
| let (base, threshold) = *unsafe { self.buckets.get_unchecked(idx) }; | ||
| let offset = (hash < threshold as usize) as usize; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is need to speed up construction of geofilters during indexing.
Speed up is at about 2.4 times (depending on the exact usage pattern of course)