Skip to content

fix: coalesce the merged key of RIGHT/FULL USING/NATURAL joins#22998

Open
Phoenix500526 wants to merge 6 commits into
apache:mainfrom
Phoenix500526:issue/22881
Open

fix: coalesce the merged key of RIGHT/FULL USING/NATURAL joins#22998
Phoenix500526 wants to merge 6 commits into
apache:mainfrom
Phoenix500526:issue/22881

Conversation

@Phoenix500526

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

A USING / NATURAL join exposes its join key as a single merged column whose value, per the SQL standard, is COALESCE(left.key, right.key). DataFusion resolved an unqualified reference to that merged key to the left column unconditionally.

For RIGHT and FULL joins the left key is NULL-padded on rows that exist only on the right, so the merged key came out NULL instead of the value that is actually present. The wrong NULL is silent and propagates into GROUP BY, ORDER BY (changing row order) and SELECT *, corrupting results; WHERE on the merged key additionally failed with an "ambiguous reference" error. INNER / LEFT are unaffected, since their left key is never NULL-padded.

create table a(k int, x int) as values (1, 10), (2, 20), (3, 30);
create table b(k int, y int) as values (2, 200), (3, 300), (4, 400);

select k from a right join b using (k) order by k nulls last;
-- before: 2, 3, NULL      after: 2, 3, 4

What changes are included in this PR?

The unqualified merged key of a RIGHT / FULL`` USING / NATURAL join now resolves to COALESCE(left, right) everywhere it can be referenced:

  • column normalization (SELECT, ORDER BY, GROUP BY, HAVING, QUALIFY);
  • WHERE predicates — previously rejected as ambiguous; they now resolve against the join's real USING columns;
  • wildcard expansion (SELECT *).

Mechanics:

  • a new LogicalPlan::outer_using_key_pairs() returns the (left, right) key pairs of RIGHT / FULL USING / NATURAL joins;
  • the merged key is materialized as CASE WHEN left IS NOT NULL THEN left ELSE right END — the exact form coalesce is simplified to, so it can be built in datafusion-expr without depending on the functions crate — aliased to the key name so the output column keeps its name.
    INNER / LEFT joins and qualified access (a.k / b.k) are left unchanged.

Are these changes tested?

Yes. A new sqllogictest file join_using_merged_key.slt covers the merged key for RIGHT / FULL / NATURAL joins across SELECT, SELECT *, WHERE and ORDER BY, with INNER / LEFT, qualified a.k / b.k access, and the explicit coalesce(a.k, b.k) ... ON form as regression guards.

The full sqllogictest suite and the datafusion-common /datafusion-expr / datafusion-sql unit tests pass with no regressions.

Are there any user-facing changes?

Yes. For RIGHT / FULL USING / NATURAL joins, the merged join key now returns COALESCE(left, right) (the value from whichever side is present) instead of NULL for rows that exist only on the right, and referencing the merged key in WHERE no longer raises an ambiguous-reference error. Queries that do not use RIGHT / FULL USING / NATURAL joins are unaffected.

No breaking public API changes (the change only adds new public items).

@neilconway neilconway left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sending this PR! Overall I agree that the current behavior is wrong and should be fixed.

I think the current approach produces an error for queries like SELECT k FROM a FULL JOIN b USING(k) ORDER BY a.k, because the sort introduces a hidden projection column.

Comment thread datafusion/expr/src/expr_rewriter/mod.rs Outdated
/// For these join types the left key is NULL-padded on rows that exist only
/// on the right, so an unqualified reference to the merged key must resolve
/// to `COALESCE(left, right)` rather than to the left column alone.
pub fn outer_using_key_pairs(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function name could be improved, since we handle some outer joins (right, full) but not others (left).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename outer_using_key_pairs to right_or_full_using_key_pairs

/// Like [`normalize_col_with_schemas_and_ambiguity_check`], but additionally
/// resolves the merged key of a `RIGHT` / `FULL` `USING` / `NATURAL` join (given
/// as `(left, right)` column pairs) to `COALESCE(left, right)`.
pub fn normalize_col_with_schemas_ambiguity_and_outer_using(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unfortunate to leak an implementation detail ("USING needs special handling for outer joins") into the public API surface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I drop this function and resolve the merged key in the SQL planner instead, so the USING-outer-join detail no longer leaks into the generic expr public API.

@Phoenix500526

Copy link
Copy Markdown
Contributor Author

Thanks for sending this PR! Overall I agree that the current behavior is wrong and should be fixed.

I think the current approach produces an error for queries like SELECT k FROM a FULL JOIN b USING(k) ORDER BY a.k, because the sort introduces a hidden projection column.

Hi, @neilconway I dug into this one. The failure is a schema clash: ORDER BY a.k drags a.k into the same projection as the merged key k, and for a FULL join k is COALESCE(a.k, b.k) (unqualified). DFSchema won't allow an unqualified k next to a qualified a.k — that's the "would be ambiguous" error.

I found two ways out, neither perfectly clean.

One is to give the merged key a fake qualifier, e.g. COALESCE(a.k, b.k) AS __join_virtual_table__.k, so it's qualified and stops clashing. That actually fixes both ORDER BY a.k and SELECT k, a.k. The catch is the fake table leaks: you see it in EXPLAIN, and unparse falls over because the qualifier points at nothing real — plan_to_sql emits SQL referencing a table that doesn't exist and it won't re-plan:

SELECT __join_virtual_table__.order_id FROM (
  SELECT CASE WHEN o1.order_id IS NOT NULL THEN o1.order_id ELSE o2.order_id END AS order_id, o1.order_id
  FROM orders o1 FULL JOIN orders o2 USING(order_id) ORDER BY o1.order_id)
-- re-plan: qualified field name o1.order_id and unqualified field name order_id would be ambiguous

The other is to push the Sort below the merged-key projection so it reads a.k straight off the join:

Projection: COALESCE(a.k, b.k) AS k
  Sort: a.k
    Full Join: Using a.k = b.k

Clean plan, fixes the ORDER BY case. But now the Sort sits directly on the join with no projection, so unparse falls back to SELECT *, * (one star per side), which expands the merged key twice and again won't re-plan:

SELECT ... FROM (SELECT *, * FROM orders o1 FULL JOIN orders o2 USING(order_id) ORDER BY o1.order_id)
-- re-plan: Projections require unique expression names ... order_id ... order_id

So both fix execution but both trip up unparse — a FULL USING merged key is a COALESCE the unparser doesn't turn back into USING. Neither feels clean enough to me — do you have a better approach in mind for this case?

@neilconway

Copy link
Copy Markdown
Contributor

@Phoenix500526 Thanks for revising this! The FULL case needs some more consideration -- I'll dig into it and try to respond on Monday.

@Phoenix500526

Phoenix500526 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

@Phoenix500526 Thanks for revising this! The FULL case needs some more consideration -- I'll dig into it and try to respond on Monday.

@neilconway Thanks so much for offering to dig into the FULL case! Good news though: I kept poking at it and found a third approach that fixes FULL and survives the unparse round-trip, so you're off the hook 😄 I'd still love to hear what you think, of course.

Quick rundown:

Idea: rename the merged key to a reserved internal name, not a fake qualifier.

Both of my earlier attempts died at unparse because a FULL USING merged key is a COALESCE the unparser can't turn back into USING. So instead of dodging the { k, a.k } clash with a phantom qualifier, I dodge it with a plain reserved name.

During sort push-down, when a qualified sort column (a.k) is about to be folded into a projection that already exposes the unqualified merged key k, I rename that merged key to __datafusion_merged_key_k first, then restore the original name in a wrapper projection:

Projection: __datafusion_merged_key_k AS k                    -- wrapper restores `k`
  Sort: a.k DESC
    Projection: CASE WHEN a.k IS NOT NULL THEN a.k ELSE b.k END AS __datafusion_merged_key_k, a.k
      Full Join: Using a.k = b.k

The folded schema is now { __datafusion_merged_key_k, a.k }— distinct names, no ambiguity. If the sort itself also references the merged key (ORDER BY a.k, k), I rewrite that reference to the same internal name so the multi-key sort keeps resolving. (ORDER BY a.k, k was legal before the merged-key work, so this keeps it from regressing.)

I've added round-trip tests (USING and NATURAL FULL) plus execution coverage for ORDER BY a.k and ORDER BY a.k, k.

One loose end I should flag: the internal name leaks into EXPLAIN. The result column is k and the unparsed SQL is clean, but the intermediate plan nodes still show __datafusion_merged_key_k:

logical_plan
01)Projection: __datafusion_merged_key_k AS k
02)--Sort: a.k DESC NULLS LAST
03)----Projection: CASE WHEN a.k IS NOT NULL THEN a.k ELSE b.k END AS __datafusion_merged_key_k, a.k

It's purely cosmetic — no effect on results or round-trip SQL — but it's not the prettiest. Hiding it completely would mean touching the generic plan Display, which felt like overkill, so I left it for now. If you spot a cleaner way, I'm all ears.

Add coverage that the merged key of a USING / NATURAL join is
COALESCE(left, right) across SELECT, WHERE, ORDER BY and wildcard
expansion, including RIGHT / FULL joins where the left key is NULL-padded.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A USING / NATURAL join exposes its join key as a single merged column
whose value, per the SQL standard, is COALESCE(left, right). DataFusion
resolved an unqualified reference to that merged key to the left column
unconditionally.

For RIGHT and FULL joins the left key is NULL-padded on rows that exist
only on the right, so the merged key came out NULL instead of the value
that is actually present. That wrong NULL is silent and propagates into
ORDER BY and other uses of the key, changing query results.

Resolve the unqualified merged key of RIGHT / FULL joins to
COALESCE(left, right) so it carries the value from whichever side is
present, matching the explicit `coalesce(a.k, b.k) ... ON a.k = b.k`
form. INNER / LEFT are unaffected, since their left key is never
NULL-padded.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A USING / NATURAL merged key can be referenced unqualified anywhere a
column can, including in WHERE. But WHERE resolved unqualified references
against only the columns the predicate itself mentions, not the join's
USING columns, so an unqualified merged-key reference matched both the
left and right key with no USING context and was rejected as ambiguous.

Resolve WHERE predicates the same way as the SELECT list -- against the
join's real USING columns -- so the merged key is recognized and, for
RIGHT / FULL joins, takes the value from whichever side is present
(COALESCE(left, right)) instead of erroring.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A wildcard (`SELECT *`) over a USING / NATURAL join collapses the
duplicated join key to a single merged-key column. That surviving column
was the left key, which is NULL-padded on rows present only on the right
of a RIGHT / FULL join, so `SELECT *` showed NULL for the merged key on
those rows.

Expand the merged key in a wildcard as COALESCE(left, right) for
RIGHT / FULL joins so it shows the value from whichever side is present,
consistent with how an explicit reference to the key resolves and with
the SQL standard.

Closes apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Resolve an unqualified reference to the merged key of a USING / NATURAL
join to the side that is never NULL-padded, addressing review feedback
that always coalescing handled LEFT and RIGHT asymmetrically: the right
key for RIGHT, COALESCE(left, right) for FULL, and the left key
(unchanged) for INNER / LEFT.

For RIGHT this yields a plain column instead of a computed COALESCE, so
the merged key no longer collides with a qualified key in the same
projection schema -- queries like `ORDER BY a.k` or `SELECT k, a.k` over
a RIGHT join now plan cleanly. FULL still needs the computed COALESCE and
keeps that limitation.

Also drop the public normalize_col_with_schemas_ambiguity_and_outer_using
and resolve the merged key in the SQL planner instead, so the
USING-outer-join detail no longer leaks into the generic expr public API.
Rename outer_using_key_pairs to right_or_full_using_key_pairs (it never
covered LEFT) and carry the join type. Correct two stale integration
snapshots to the right-side merged key.

Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A FULL USING / NATURAL join exposes its merged key as an unqualified
COALESCE(left, right) column. Ordering by a qualified key of the same
join (`ORDER BY a.k`) folds that qualified `a.k` into the projection that
already carries the unqualified merged `k`, building an illegal
`{ k, a.k }` schema, so such queries errored at plan time.

During sort push-down, rename a colliding merged key to a reserved
internal name before folding the qualified sort column, and restore the
original name in the wrapper projection. References to the merged key in
the sort itself (`ORDER BY a.k, k`) are rewritten to the same internal
name so multi-key sorts keep resolving; that shape was legal before the
merged-key work, so this avoids a regression.

Teach the unparser to flatten a wrapper projection that purely renames
an inner column over a Sort, so the rewritten plan still round-trips
through plan -> SQL -> plan.

Refs apache#22881

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RIGHT/FULL/NATURAL JOIN ... USING(k) does not coalesce the join key (returns NULL for right-only rows)

2 participants