Perf: cache primitive sort key in SortPreservingMerge to drop per-comparison bounds checks#23162
Conversation
…parison bounds checks The primary cost of `SortPreservingMergeStream` (the loser-tree k-way merge behind `SortPreservingMergeExec` and `SortExec`'s streaming/spill merge) is the per-row comparison loop, not the output `interleave`. Profiling a single primitive sort column shows ~60% of time in the loop + cursor comparison and only ~7% in `interleave`. Two issues on that hot path: 1. The single-column primitive/array cursor comparison was not being inlined (it showed up as a separate ~21% symbol), and every comparison bounds-checked the underlying `ScalarBuffer` twice. 2. `maybe_poll_stream` was called on every row even though it is a no-op whenever the winner's cursor is still live. Changes: - Inline the lightweight primitive/array cursor comparisons. `RowValues` is left out-of-line on purpose: its variable-length byte comparison is heavy and force-inlining it regresses the multi-column path. - Skip the per-row `maybe_poll_stream` call unless the winner's cursor is actually exhausted and needs a new `RecordBatch`. - Cache the current (and previous) value of a primitive cursor, refreshed once per `advance()` via a new `CursorValues::set_offset` hook (default no-op; `ArrayValues` forwards to the inner cursor). The hot `compare`/`eq_to_previous` then read cached fields instead of indexing the buffer, so the per-comparison bounds checks disappear. The single remaining buffer access in `set_offset` is dominated by the `offset < len` guard in `advance`, so the optimizer elides its bounds check: the length is checked once per row, not per comparison. No `unsafe` is used. Benchmarks (`sort_preserving_merge`, 3 partitions x 1M rows): - single u64 column: ~-15% - multiple u64 columns: ~-6% - single large string col: ~-5% Covered by the existing `sorts/` tests (merge ordering, nulls, stable sort, round-robin tie-breaker), which also exercise the new cache invariant via a `debug_assert!` when run with debug assertions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
run benchmarks |
|
run benchmark sort_tpch |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/spm-compare-cache (35eb7f9) to 5d24591 (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/spm-compare-cache (35eb7f9) to 5d24591 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/spm-compare-cache (35eb7f9) to 5d24591 (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/spm-compare-cache (35eb7f9) to 5d24591 (merge-base) diff using: sort_tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagesort_tpch — base (merge-base)
sort_tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
@zhuqi-lucas perhaps you could take a look here as well? Perf results seems pretty good! |
alamb
left a comment
There was a problem hiding this comment.
Looks like a good change to me -- thank you @Dandandan
I also did some test coverage analysis and all the new code looks covered to me
| self.rows.num_rows() | ||
| } | ||
|
|
||
| // NOTE: `eq`/`eq_to_previous`/`compare` are deliberately NOT `#[inline]`. |
There was a problem hiding this comment.
is't inline advisory? If you don't want this inlined, perhaps we could make it "inline(never)"
zhuqi-lucas
left a comment
There was a problem hiding this comment.
LGTM, cool improvement, thanks @Dandandan ! Curious — was the dominant win the bounds-check reduction, or the cache locality from caching the current/previous values into the struct?
The bounds-check reduction comes from remembering the previous version.
Yes - I think it would probably need modifying/replacing |
… RowValues Reduce the verbosity of the comments added in this PR. No functional change. For `RowValues`, replace the "deliberately not #[inline]" note with the measured rationale: leaving the inline decision to the compiler is fastest — forcing either `#[inline]` or `#[inline(never)]` regresses the multi-column merge path (~+12% on the `multiple_u64_columns` bench for `#[inline(never)]`). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for the reviews @alamb @zhuqi-lucas! 🙏 |
Which issue does this PR close?
Rationale for this change
Two inefficiencies in the hot path:
(in profiles it appeared as a separate ~21% self-time symbol), and every comparison
bounds-checked the underlying
ScalarBuffertwice.maybe_poll_streamwas called on every output row, even though it is ano-op whenever the winner's cursor is still live (the common case).
What changes are included in this PR?
maybe_poll_streamcall unless the winner's cursor isactually exhausted and needs a fresh
RecordBatch.per
advance()via a newCursorValues::set_offsethook (default no-op;ArrayValuesforwards to the inner cursor).Are these changes tested?
Existing tests
Are there any user-facing changes?
No.
🤖 Generated with Claude Code