-
Notifications
You must be signed in to change notification settings - Fork 1.5k
GH-3596: Add RowRanges.Builder for incremental construction from selected row indices
#3597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -16,7 +16,7 @@ | |
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
| package org.apache.parquet.internal.filter2.columnindex; | ||
| package org.apache.parquet.filter2.columnindex; | ||
|
|
||
| import java.util.ArrayList; | ||
| import java.util.Collections; | ||
|
|
@@ -33,7 +33,8 @@ | |
| * filtering. To be used iterate over the matching row indexes to be read from a row-group, retrieve the count of the | ||
| * matching rows or check overlapping of a row index range. | ||
| * | ||
| * @see ColumnIndexFilter#calculateRowRanges(Filter, ColumnIndexStore, Set, long) | ||
| * @see org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter#calculateRowRanges(Filter, | ||
| * org.apache.parquet.internal.filter2.columnindex.ColumnIndexStore, Set, long) | ||
| */ | ||
| public class RowRanges { | ||
| // Make it public because some uppler layer application need to access it | ||
|
|
@@ -316,4 +317,81 @@ public List<Range> getRanges() { | |
| public String toString() { | ||
| return ranges.toString(); | ||
| } | ||
|
|
||
| /** | ||
| * @return a new {@link Builder} for constructing a {@link RowRanges} from a sequence of | ||
| * selected row indices. | ||
| */ | ||
| public static Builder builder() { | ||
| return new Builder(); | ||
| } | ||
|
|
||
| /** | ||
| * Constructs a {@link RowRanges} by appending selected row indices in strictly increasing | ||
| * order. Consecutive indices are coalesced into a single {@link Range}; gaps close the | ||
| * current run and start a new one. | ||
| * | ||
| * <p>Usage: | ||
| * <pre>{@code | ||
| * RowRanges.Builder builder = RowRanges.builder(); | ||
| * for (long row : selectedRowsInOrder) { | ||
| * builder.addSelectedRow(row); | ||
| * } | ||
| * RowRanges ranges = builder.build(); | ||
| * }</pre> | ||
| */ | ||
| public static final class Builder { | ||
| private final List<Range> ranges = new ArrayList<>(); | ||
| private long runStart = -1; // -1 = no active run | ||
| private long runEnd = -1; // valid iff runStart >= 0 | ||
|
|
||
| private Builder() {} | ||
|
|
||
| /** | ||
| * Marks {@code rowIndex} as selected. The value is a 0-based row index within the current row | ||
| * group. Must be called in strictly increasing order; calling with a value less than or equal | ||
| * to the previous call's value throws {@link IllegalArgumentException}. | ||
| * | ||
| * @param rowIndex the 0-based row index to mark selected (must be {@code >} the last value | ||
| * passed and non-negative) | ||
| * @return this builder for chaining | ||
| */ | ||
| public Builder addSelectedRow(long rowIndex) { | ||
| if (rowIndex < 0) { | ||
| throw new IllegalArgumentException("addSelectedRow requires a non-negative row index; got " + rowIndex); | ||
| } | ||
| if (runStart < 0) { | ||
| runStart = rowIndex; | ||
| runEnd = rowIndex; | ||
| } else if (rowIndex == runEnd + 1) { | ||
| runEnd = rowIndex; | ||
| } else if (rowIndex > runEnd + 1) { | ||
| ranges.add(new Range(runStart, runEnd)); | ||
| runStart = rowIndex; | ||
| runEnd = rowIndex; | ||
| } else { | ||
| throw new IllegalArgumentException("addSelectedRow requires strictly increasing row indices; got " | ||
| + rowIndex + " after " + runEnd); | ||
| } | ||
| return this; | ||
| } | ||
|
|
||
| /** | ||
| * Returns a snapshot of the rows selected so far. The returned {@link RowRanges} is independent | ||
| * of this builder, so the builder may continue to be used afterwards without affecting it. | ||
| * | ||
| * @return the constructed {@link RowRanges}, or {@link RowRanges#EMPTY} when no rows were | ||
| * selected. | ||
| */ | ||
| public RowRanges build() { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed in 17cce69. |
||
| List<Range> snapshot = new ArrayList<>(ranges); | ||
| if (runStart >= 0) { | ||
| snapshot.add(new Range(runStart, runEnd)); | ||
| } | ||
| if (snapshot.isEmpty()) { | ||
| return RowRanges.EMPTY; | ||
| } | ||
| return new RowRanges(snapshot); | ||
| } | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a clearer API commitment. The motivation is external readers/Spark, but the type lives under
internaland the PR says there are no user-facing changes. If this is meant to be supported externally, the package/API contract should say so; otherwise downstream users may depend on an unsupported API.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, at the time, it was yet another approach to tighten the amount of publicly available API in parquet-java. It quickly turned out that 3rd parties want to use it. For the next major release it is a good candidate to be moved to a more public package. Maybe deprecate it here, and also create at another location?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac, @gszadovszky, I wonder which package could be the right place for
RowRangesand the new builder? Shall I try to move/deprecate them in this PR or separate one?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A proper target package would be the same as now just skip the
internalpart. I am fine moving/deprecating in this PR. Theoretically, we can simply move it without deprecation since it is "internal" currently, so we have no guarantees for backward compatibility. WDYT, @wgtmac?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, moved in 8d8e2dd.
One thing to flag: the relocation changes a public signature
ParquetFileReader.readFilteredRowGroup(int, RowRanges)now takes the new publicRowRangesinstead of the internal one. japicmp reports this asMETHOD_REMOVED(and the inherited form onCompressionConverter$TransParquetFileReader), so I added excludes for:org.apache.parquet.internal.filter2.columnindex.RowRanges(+$Range)org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter(return-type changes)org.apache.parquet.hadoop.ParquetFileReader#readFilteredRowGroup(int, …internal…RowRanges)(+ theTransParquetFileReaderinherited form)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I haven't noticed that, sorry. Then, we probably need to get back to the deprecated/move pattern. 😞
Let's move to the new place, extend the class as required at the old place and deprecate. Also deprecate the public API that references the old place and create new methods for the new one. Mark 2.0 as the removal version. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the released (1.17.0) surface: the only genuinely-public, cross-package method exposing
RowRangesisParquetFileReader.readFilteredRowGroup(int, RowRanges).ParquetFileReader.getRowRanges(int), that returnsRowRangeswas only made public onmaster(1.18.0).Everything else is under
org.apache.parquet.internal.*.Is it ok if we preserve compatibility for just
readFilteredRowGroup(int, RowRanges)and treat theinternal.*relocations as internal changes with no compat guarantee?