feat: add strictness metadata for scalar UDF null propagation and use it in outer join elimination#23148

Open
lyne7-sc wants to merge 4 commits into
apache:mainfrom
lyne7-sc:feat/udf_is_strict

@lyne7-sc

Contributor

Which issue does this PR close?

Rationale for this change

Outer join elimination relies on proving that a filter rejects the NULL-padded rows introduced by an outer join. DataFusion already handles many built-in NULL-propagating expressions, but scalar UDFs currently do not expose whether they have the same property.

SELECT t1.a
FROM t1 LEFT JOIN t2 ON t1.a = t2.x
WHERE abs(t2.y) > 5;

Rows of t1 that have no match in t2 are padded with NULLs by the left join, so t2.y is NULL for them. Since abs(NULL) is also NULL, the predicate abs(t2.y) > 5 evaluates to NULL rather than true for those rows, and the filter discards them. That means the left join can be safely rewritten as an inner join. Without function-level null propagation metadata, the optimizer has to treat scalar functions conservatively and misses this rewrite.
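To make the argument concrete, here is a minimal sketch of the three-valued logic involved. It models SQL NULL as `None` and is not DataFusion code; the helper names (`gt`, `abs_strict`, `filter_keeps`, `row_passes`) are made up for illustration.

```rust
/// SQL-style `>`: if either side is NULL (None), the result is NULL.
fn gt(lhs: Option<i64>, rhs: Option<i64>) -> Option<bool> {
    match (lhs, rhs) {
        (Some(l), Some(r)) => Some(l > r),
        _ => None,
    }
}

/// A strict `abs`: NULL in, NULL out. (Hypothetical helper, not the real UDF.)
fn abs_strict(v: Option<i64>) -> Option<i64> {
    v.map(i64::abs)
}

/// A WHERE clause keeps a row only when the predicate is exactly TRUE;
/// both FALSE and NULL drop the row.
fn filter_keeps(predicate: Option<bool>) -> bool {
    predicate == Some(true)
}

/// Evaluates `abs(t2.y) > 5` for one joined row, given that row's `t2.y`.
fn row_passes(t2_y: Option<i64>) -> bool {
    filter_keeps(gt(abs_strict(t2_y), Some(5)))
}
```

A NULL-padded row corresponds to `row_passes(None)`, which is never kept, so under these assumptions the filter removes exactly the rows that distinguish the left join from an inner join.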

This PR adds ScalarUDFImpl::is_strict() to let scalar UDF implementations declare that they always return NULL when any argument is NULL. The default is false so existing UDFs remain conservative. Optimizer rules can then use this metadata when reasoning about expression nullability and null-rejecting predicates.
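The opt-in shape described above can be sketched with a stand-in trait. `ScalarFn`, `Abs`, and `Coalesce` below are hypothetical and only model the strictness hook; the real `ScalarUDFImpl` trait has more required methods, and its exact signatures are not shown in this description.

```rust
/// Hypothetical stand-in for the UDF trait; only the strictness hook is modeled.
trait ScalarFn {
    fn name(&self) -> &str;

    /// Returns true if the function yields NULL whenever any argument is NULL.
    /// Defaults to false so unannotated functions are treated conservatively.
    fn is_strict(&self) -> bool {
        false
    }
}

/// `abs(NULL)` is NULL, so this implementation opts in.
struct Abs;

impl ScalarFn for Abs {
    fn name(&self) -> &str {
        "abs"
    }

    fn is_strict(&self) -> bool {
        true
    }
}

/// `coalesce(NULL, 1)` is 1, so this implementation keeps the default.
struct Coalesce;

impl ScalarFn for Coalesce {
    fn name(&self) -> &str {
        "coalesce"
    }
}
```

Because the default is `false`, an implementation that never mentions strictness behaves exactly as before; only functions that explicitly opt in become eligible for the extra reasoning.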

This design follows a pattern used by other query engines. PostgreSQL exposes STRICT / RETURNS NULL ON NULL INPUT on functions, and documents that such functions are not executed when any argument is NULL; a NULL result is assumed automatically. DuckDB similarly has function null-handling metadata, with default NULL-in/NULL-out behavior and special handling for functions that do not follow that rule.

References:

What changes are included in this PR?

  • Adds ScalarUDFImpl::is_strict(), defaulting to false.
  • Adds ScalarUDF::is_strict() as the public forwarding API.
  • Marks abs as strict.
  • Uses strict scalar functions in predicate/nullability reasoning and outer join elimination.
  • Propagates is_strict through datafusion-ffi::FFI_ScalarUDF.
  • Adds unit and slt coverage for strict and non-strict scalar UDF behavior.
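As a rough illustration of how strictness can feed null-rejection reasoning, the sketch below walks a toy expression tree. The `Expr` enum and the functions `is_null_when`, `rejects_null_padding`, and `example_predicate` are assumptions made for this example; they are not DataFusion's `Expr` type or the optimizer code in this PR, and the check is deliberately conservative (it only recognizes predicates that become NULL).

```rust
use std::collections::HashSet;

/// Toy expression tree for illustration only.
enum Expr {
    Column(String),
    Literal(i64),
    /// Scalar function call; `strict` mirrors what `is_strict()` would report.
    Call { strict: bool, args: Vec<Expr> },
    /// `lhs > rhs`; NULL on either side yields NULL.
    Gt(Box<Expr>, Box<Expr>),
    /// `expr IS NULL` never returns NULL, so it does not propagate NULL.
    IsNull(Box<Expr>),
}

/// Returns true when `expr` is guaranteed to be NULL if all of `null_cols` are NULL.
fn is_null_when(expr: &Expr, null_cols: &HashSet<&str>) -> bool {
    match expr {
        Expr::Column(c) => null_cols.contains(c.as_str()),
        Expr::Literal(_) => false,
        // Only a strict function lets NULL flow from an argument to the result.
        Expr::Call { strict, args } => {
            *strict && args.iter().any(|a| is_null_when(a, null_cols))
        }
        Expr::Gt(l, r) => is_null_when(l, null_cols) || is_null_when(r, null_cols),
        Expr::IsNull(_) => false,
    }
}

/// A filter rejects NULL-padded rows if it is guaranteed to be NULL (not TRUE) on them.
fn rejects_null_padding(pred: &Expr, null_cols: &HashSet<&str>) -> bool {
    is_null_when(pred, null_cols)
}

/// Builds `f(t2.y) > 5`, where `f` is strict or not depending on the flag.
fn example_predicate(strict: bool) -> Expr {
    let call = Expr::Call {
        strict,
        args: vec![Expr::Column("t2.y".to_string())],
    };
    Expr::Gt(Box::new(call), Box::new(Expr::Literal(5)))
}
```

With `t2.y` in the NULL-padded set, `example_predicate(true)` is reported as null-rejecting while `example_predicate(false)` is not, which matches the conservative default described in this PR.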

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes. Scalar UDF authors can now override ScalarUDFImpl::is_strict() to tell the optimizer that their function always returns NULL when any argument is NULL.

This PR also changes the FFI_ScalarUDF layout to propagate strictness across the FFI boundary, so it should carry the api change label.
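The reason a new field matters for FFI can be shown with two simplified `repr(C)` structs. `UdfV1` and `UdfV2` are hypothetical and use placeholder field types; they are not the real `FFI_ScalarUDF` definition. The sketch assumes a toolchain where `std::mem::offset_of!` is available (Rust 1.77 or later).

```rust
use std::mem::offset_of;

/// Hypothetical "before" layout; field names and types are illustrative only.
#[repr(C)]
pub struct UdfV1 {
    pub name: *const u8,
    pub coerce_types: usize,
    pub release: usize,
}

/// Hypothetical "after" layout with a flag inserted in the middle.
#[repr(C)]
pub struct UdfV2 {
    pub name: *const u8,
    pub is_strict: bool,
    pub coerce_types: usize,
    pub release: usize,
}

/// Byte offsets of `coerce_types` in the old and new layouts.
pub fn coerce_types_offsets() -> (usize, usize) {
    (
        offset_of!(UdfV1, coerce_types),
        offset_of!(UdfV2, coerce_types),
    )
}
```

In a `repr(C)` struct, fields are laid out in declaration order, so every field after the inserted one moves to a later offset. A library compiled against the old layout would read `coerce_types` from the bytes where the new layout stores `is_strict`, which is why both sides of the boundary need to be rebuilt together.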

Future work

  • Mark more built-in scalar functions as strict where the behavior is clearly NULL-in/NULL-out.
  • Reuse is_strict() in other optimizer rules to infer arg IS NOT NULL from predicates like strict_func(arg) > 0.
  • Use inferred non-null predicates to simplify redundant IS NOT NULL checks and push filters closer to scans.
  • Use strictness in statistics/selectivity estimation by deriving tighter nullability information for function outputs.

@github-actions github-actions Bot added the logical-expr, optimizer, sqllogictest, functions, and ffi labels Jun 24, 2026
@github-actions

github-actions Bot commented Jun 24, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-expr v54.0.0 (current)
       Built [  29.111s] (current)
     Parsing datafusion-expr v54.0.0 (current)
      Parsed [   0.070s] (current)
    Building datafusion-expr v54.0.0 (baseline)
       Built [  24.849s] (baseline)
     Parsing datafusion-expr v54.0.0 (baseline)
      Parsed [   0.070s] (baseline)
    Checking datafusion-expr v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   1.541s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  56.732s] datafusion-expr
    Building datafusion-ffi v54.0.0 (current)
       Built [  55.911s] (current)
     Parsing datafusion-ffi v54.0.0 (current)
      Parsed [   0.058s] (current)
    Building datafusion-ffi v54.0.0 (baseline)
       Built [  56.765s] (baseline)
     Parsing datafusion-ffi v54.0.0 (baseline)
      Parsed [   0.058s] (baseline)
    Checking datafusion-ffi v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.288s] 223 checks: 221 pass, 1 fail, 1 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field FFI_ScalarUDF.is_strict in /home/runner/work/datafusion/datafusion/datafusion/ffi/src/udf/mod.rs:87

--- warning repr_c_plain_struct_fields_reordered: struct fields reordered in repr(C) struct ---

Description:
A public repr(C) struct had its fields reordered. This can change the struct's memory layout, possibly breaking FFI use cases that depend on field position and order.
        ref: https://doc.rust-lang.org/reference/type-layout.html#reprc-structs
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/repr_c_plain_struct_fields_reordered.ron

Failed in:
  FFI_ScalarUDF.coerce_types moved from position 7 to 8, in /home/runner/work/datafusion/datafusion/datafusion/ffi/src/udf/mod.rs:93
  FFI_ScalarUDF.placement moved from position 8 to 9, in /home/runner/work/datafusion/datafusion/datafusion/ffi/src/udf/mod.rs:101
  FFI_ScalarUDF.clone moved from position 9 to 10, in /home/runner/work/datafusion/datafusion/datafusion/ffi/src/udf/mod.rs:108
  FFI_ScalarUDF.release moved from position 10 to 11, in /home/runner/work/datafusion/datafusion/datafusion/ffi/src/udf/mod.rs:111
  FFI_ScalarUDF.private_data moved from position 11 to 12, in /home/runner/work/datafusion/datafusion/datafusion/ffi/src/udf/mod.rs:115
  FFI_ScalarUDF.library_marker_id moved from position 12 to 13, in /home/runner/work/datafusion/datafusion/datafusion/ffi/src/udf/mod.rs:120

     Summary semver requires new major version: 1 major and 0 minor checks failed
     Warning produced 1 major and 0 minor level warnings
    Finished [ 114.604s] datafusion-ffi
    Building datafusion-functions v54.0.0 (current)
       Built [  28.153s] (current)
     Parsing datafusion-functions v54.0.0 (current)
      Parsed [   0.081s] (current)
    Building datafusion-functions v54.0.0 (baseline)
       Built [  28.111s] (baseline)
     Parsing datafusion-functions v54.0.0 (baseline)
      Parsed [   0.082s] (baseline)
    Checking datafusion-functions v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.449s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  57.939s] datafusion-functions
    Building datafusion-optimizer v54.0.0 (current)
       Built [  25.545s] (current)
     Parsing datafusion-optimizer v54.0.0 (current)
      Parsed [   0.028s] (current)
    Building datafusion-optimizer v54.0.0 (baseline)
       Built [  25.418s] (baseline)
     Parsing datafusion-optimizer v54.0.0 (baseline)
      Parsed [   0.030s] (baseline)
    Checking datafusion-optimizer v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.185s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  51.915s] datafusion-optimizer
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 166.496s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.020s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 168.256s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.099s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 337.216s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change label Jun 24, 2026

Labels

auto detected api change, ffi, functions, logical-expr, optimizer, sqllogictest