GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597
GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597peter-toth wants to merge 2 commits into
RowRanges.Builder for incremental construction from selected row indices#3597Conversation
…m selected row indices ### Rationale for this change Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a RowRanges incrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time. ### What changes are included in this PR? Adds a Builder to RowRanges that takes a strictly-increasing sequence of selected row indices via addSelected(long) and coalesces consecutive indices into Range entries. Out-of-order or duplicate calls throw IllegalArgumentException. ### Are these changes tested? Yes. TestRowRanges covers single/multiple/coalesced ranges, the empty builder case, and the out-of-order/duplicate rejection paths. ### Are there any user-facing changes? No. Closes apache#3596 Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>
e11f3d9 to
58857df
Compare
|
@wgtmac, could you please take a look at this PR? |
| * @return the constructed {@link RowRanges}, or {@link RowRanges#EMPTY} when no rows were | ||
| * selected. | ||
| */ | ||
| public RowRanges build() { |
There was a problem hiding this comment.
build() should return a snapshot or become terminal. As written, the returned RowRanges shares the builder’s mutable ranges list, so later calls on the same builder can mutate a previously built result and even break the sorted-ranges invariant.
| * @param blockRow the row index to mark selected (must be {@code >} the last value passed) | ||
| * @return this builder for chaining | ||
| */ | ||
| public Builder addSelected(long blockRow) { |
There was a problem hiding this comment.
addSelected should reject negative row indexes explicitly. -1 also collides with the builder’s sentinel state, so invalid input can be silently swallowed or corrupt the active run.
| return ranges.toString(); | ||
| } | ||
|
|
||
| /** |
There was a problem hiding this comment.
This needs a clearer API commitment. The motivation is external readers/Spark, but the type lives under internal and the PR says there are no user-facing changes. If this is meant to be supported externally, the package/API contract should say so; otherwise downstream users may depend on an unsupported API.
There was a problem hiding this comment.
Yeah, at the time, it was yet another approach to tighten the amount of publicly available API in parquet-java. It quickly turned out that 3rd parties want to use it. For the next major release it is a good candidate to be moved to a more public package. Maybe deprecate it here, and also create at another location?
There was a problem hiding this comment.
@wgtmac, @gszadovszky, I wonder which package could be the right place for RowRanges and the new builder? Shall I try to move/deprecate them in this PR or separate one?
There was a problem hiding this comment.
A proper target package would be the same as now just skip the internal part. I am fine moving/deprecating in this PR. Theoretically, we can simply move it without deprecation since it is "internal" currently, so we have no guarantees for backward compatibility. WDYT, @wgtmac?
| * RowRanges ranges = builder.build(); | ||
| * }</pre> | ||
| */ | ||
| public static class Builder { |
There was a problem hiding this comment.
Consider making Builder final and adding an explicit constructor. Otherwise subclassing and new RowRanges.Builder() become part of the public API surface by accident.
| * @param blockRow the row index to mark selected (must be {@code >} the last value passed) | ||
| * @return this builder for chaining | ||
| */ | ||
| public Builder addSelected(long blockRow) { |
There was a problem hiding this comment.
The naming could be clearer before this API ships. addSelected(long blockRow) is ambiguous; something like addSelectedRow(long rowIndex) would better communicate that the value is a 0-based row index within the current row group.
|
Do you have any advice on this new API? @gszadovszky |
Rationale for this change
This PR is based on @mbutrovich's previous work.
Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a RowRanges incrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time.
What changes are included in this PR?
Adds a Builder to RowRanges that takes a strictly-increasing sequence of selected row indices via addSelected(long) and coalesces consecutive indices into Range entries. Out-of-order or duplicate calls throw IllegalArgumentException.
Are these changes tested?
Yes. TestRowRanges covers single/multiple/coalesced ranges, the empty builder case, and the out-of-order/duplicate rejection paths.
Are there any user-facing changes?
No.
Closes #3596