Skip to content

GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597

Open
peter-toth wants to merge 2 commits into
apache:masterfrom
peter-toth:GH-3596-row-ranges-builder
Open

GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597
peter-toth wants to merge 2 commits into
apache:masterfrom
peter-toth:GH-3596-row-ranges-builder

Conversation

@peter-toth

Copy link
Copy Markdown
Contributor

Rationale for this change

This PR is based on @mbutrovich's previous work.

Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a RowRanges incrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time.

What changes are included in this PR?

Adds a Builder to RowRanges that takes a strictly-increasing sequence of selected row indices via addSelected(long) and coalesces consecutive indices into Range entries. Out-of-order or duplicate calls throw IllegalArgumentException.

Are these changes tested?

Yes. TestRowRanges covers single/multiple/coalesced ranges, the empty builder case, and the out-of-order/duplicate rejection paths.

Are there any user-facing changes?

No.

Closes #3596

…m selected row indices

### Rationale for this change

Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a RowRanges incrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time.

### What changes are included in this PR?

Adds a Builder to RowRanges that takes a strictly-increasing sequence of selected row indices via addSelected(long) and coalesces consecutive indices into Range entries. Out-of-order or duplicate calls throw IllegalArgumentException.

### Are these changes tested?

Yes. TestRowRanges covers single/multiple/coalesced ranges, the empty builder case, and the out-of-order/duplicate rejection paths.

### Are there any user-facing changes?

No.

Closes apache#3596

Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>
@peter-toth peter-toth force-pushed the GH-3596-row-ranges-builder branch from e11f3d9 to 58857df Compare June 5, 2026 13:56
@peter-toth

Copy link
Copy Markdown
Contributor Author

@wgtmac, could you please take a look at this PR?

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codex review, focused only on API design.

* @return the constructed {@link RowRanges}, or {@link RowRanges#EMPTY} when no rows were
* selected.
*/
public RowRanges build() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build() should return a snapshot or become terminal. As written, the returned RowRanges shares the builder’s mutable ranges list, so later calls on the same builder can mutate a previously built result and even break the sorted-ranges invariant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 17cce69.

* @param blockRow the row index to mark selected (must be {@code >} the last value passed)
* @return this builder for chaining
*/
public Builder addSelected(long blockRow) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addSelected should reject negative row indexes explicitly. -1 also collides with the builder’s sentinel state, so invalid input can be silently swallowed or corrupt the active run.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check in 17cce69.

return ranges.toString();
}

/**

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a clearer API commitment. The motivation is external readers/Spark, but the type lives under internal and the PR says there are no user-facing changes. If this is meant to be supported externally, the package/API contract should say so; otherwise downstream users may depend on an unsupported API.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, at the time, it was yet another approach to tighten the amount of publicly available API in parquet-java. It quickly turned out that 3rd parties want to use it. For the next major release it is a good candidate to be moved to a more public package. Maybe deprecate it here, and also create at another location?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, @gszadovszky, I wonder which package could be the right place for RowRanges and the new builder? Shall I try to move/deprecate them in this PR or separate one?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A proper target package would be the same as now just skip the internal part. I am fine moving/deprecating in this PR. Theoretically, we can simply move it without deprecation since it is "internal" currently, so we have no guarantees for backward compatibility. WDYT, @wgtmac?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

* RowRanges ranges = builder.build();
* }</pre>
*/
public static class Builder {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making Builder final and adding an explicit constructor. Otherwise subclassing and new RowRanges.Builder() become part of the public API surface by accident.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, fixed in 17cce69.

* @param blockRow the row index to mark selected (must be {@code >} the last value passed)
* @return this builder for chaining
*/
public Builder addSelected(long blockRow) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming could be clearer before this API ships. addSelected(long blockRow) is ambiguous; something like addSelectedRow(long rowIndex) would better communicate that the value is a 0-based row index within the current row group.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in 17cce69.

@wgtmac

wgtmac commented Jun 30, 2026

Copy link
Copy Markdown
Member

Do you have any advice on this new API? @gszadovszky

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a RowRanges.Builder for incremental construction from selected row indices

3 participants