Skip to content

feat(parquet): apply column default values when reading missing fields (2/4)#792

Open
huan233usc wants to merge 1 commit into
apache:mainfrom
huan233usc:feat/default-values-read-parquet
Open

feat(parquet): apply column default values when reading missing fields (2/4)#792
huan233usc wants to merge 1 commit into
apache:mainfrom
huan233usc:feat/default-values-read-parquet

Conversation

@huan233usc

Copy link
Copy Markdown
Contributor

What

Part 2 of 4 of Iceberg v3 column default-value support (POC #731), built on the
schema-layer support merged in #746.

When a column is present in the read (table) schema but absent from a Parquet
data file — because the column was added after those rows were written — fill it
with the column's v3 initial-default instead of null.

Changes

  • arrow/literal_util (new): a shared Arrow materializer that converts a
    Literal to an Arrow scalar (ToArrowScalar), to a constant-filled array
    (MakeDefaultArray), or appends it once to a builder
    (AppendDefaultToBuilder). Covers all primitive types, including
    decimal/uuid/fixed and the four timestamp variants.
  • Parquet projection (parquet_schema_util.cc): when a field is missing
    from the file and carries an initial-default, project it as
    FieldProjection::Kind::kDefault, mirroring the generic schema_util.cc path
    already merged in feat(schema): represent, serialize and validate v3 column default values (1/4) #746.
  • Parquet data (parquet_data_util.cc): materialize the kDefault branch
    via MakeDefaultArray.

Tests

  • literal_util_test: scalar/array/builder conversion across primitive types,
    null and narrowing-sentinel handling, and casting to the target Arrow type.
  • parquet_data_test: ProjectRecordBatch fills missing required and optional
    default fields, including a nested struct.
  • parquet_test: end-to-end — write a file with an old schema, then read it
    through ReaderFactoryRegistry with an evolved schema carrying defaults
    (ReadMissingFieldsWithDefaults).

Stack

  1. feat(schema): represent, serialize and validate v3 column default values (1/4) #746 — schema: represent / serialize / validate (merged)
  2. this PR — read path: Parquet
  3. read path: Avro (follows)
  4. schema evolution: addColumn / updateColumnDefault (follows)

@huan233usc huan233usc marked this pull request as draft June 29, 2026 21:51
@huan233usc huan233usc force-pushed the feat/default-values-read-parquet branch from 925d8cd to 378150c Compare June 29, 2026 22:47
@huan233usc huan233usc closed this Jun 29, 2026
@huan233usc huan233usc reopened this Jun 29, 2026
@huan233usc huan233usc marked this pull request as ready for review June 30, 2026 02:12
…s (2/4)

When a column is present in the read schema but missing from a Parquet data
file (written before the column existed), fill it with the column's v3
initial-default instead of null. Adds a shared Arrow materializer
(arrow/literal_util) that turns a Literal into an Arrow scalar/array, and a
kDefault projection branch in the Parquet schema/data projection paths.

The materializer wraps the storage array in the extension type for extension
types such as `arrow.uuid` (compute::Cast has no storage->extension kernel).

Part 2 of the v3 column-default-values work (POC apache#731), built on the schema
support merged in apache#746.
@huan233usc huan233usc force-pushed the feat/default-values-read-parquet branch from 6b08bed to 8db3067 Compare June 30, 2026 02:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant