fix: coalesce the merged key of RIGHT/FULL USING/NATURAL joins#22998
fix: coalesce the merged key of RIGHT/FULL USING/NATURAL joins#22998Phoenix500526 wants to merge 6 commits into
Conversation
neilconway
left a comment
There was a problem hiding this comment.
Thanks for sending this PR! Overall I agree that the current behavior is wrong and should be fixed.
I think the current approach produces an error for queries like SELECT k FROM a FULL JOIN b USING(k) ORDER BY a.k, because the sort introduces a hidden projection column.
| /// For these join types the left key is NULL-padded on rows that exist only | ||
| /// on the right, so an unqualified reference to the merged key must resolve | ||
| /// to `COALESCE(left, right)` rather than to the left column alone. | ||
| pub fn outer_using_key_pairs( |
There was a problem hiding this comment.
The function name could be improved, since we handle some outer joins (right, full) but not others (left).
There was a problem hiding this comment.
Rename outer_using_key_pairs to right_or_full_using_key_pairs
| /// Like [`normalize_col_with_schemas_and_ambiguity_check`], but additionally | ||
| /// resolves the merged key of a `RIGHT` / `FULL` `USING` / `NATURAL` join (given | ||
| /// as `(left, right)` column pairs) to `COALESCE(left, right)`. | ||
| pub fn normalize_col_with_schemas_ambiguity_and_outer_using( |
There was a problem hiding this comment.
Seems unfortunate to leak an implementation detail ("USING needs special handling for outer joins") into the public API surface.
There was a problem hiding this comment.
I drop this function and resolve the merged key in the SQL planner instead, so the USING-outer-join detail no longer leaks into the generic expr public API.
Hi, @neilconway I dug into this one. The failure is a schema clash: I found two ways out, neither perfectly clean. One is to give the merged key a fake qualifier, e.g. SELECT __join_virtual_table__.order_id FROM (
SELECT CASE WHEN o1.order_id IS NOT NULL THEN o1.order_id ELSE o2.order_id END AS order_id, o1.order_id
FROM orders o1 FULL JOIN orders o2 USING(order_id) ORDER BY o1.order_id)
-- re-plan: qualified field name o1.order_id and unqualified field name order_id would be ambiguousThe other is to push the Sort below the merged-key projection so it reads a.k straight off the join: Clean plan, fixes the SELECT ... FROM (SELECT *, * FROM orders o1 FULL JOIN orders o2 USING(order_id) ORDER BY o1.order_id)
-- re-plan: Projections require unique expression names ... order_id ... order_idSo both fix execution but both trip up unparse — a FULL USING merged key is a COALESCE the unparser doesn't turn back into USING. Neither feels clean enough to me — do you have a better approach in mind for this case? |
|
@Phoenix500526 Thanks for revising this! The |
c1cbe60 to
a7cb6cd
Compare
@neilconway Thanks so much for offering to dig into the Quick rundown: Idea: rename the merged key to a reserved internal name, not a fake qualifier. Both of my earlier attempts died at unparse because a During sort push-down, when a qualified sort column ( The folded schema is now I've added round-trip tests ( One loose end I should flag: the internal name leaks into EXPLAIN. The result column is k and the unparsed SQL is clean, but the intermediate plan nodes still show It's purely cosmetic — no effect on results or round-trip SQL — but it's not the prettiest. Hiding it completely would mean touching the generic plan Display, which felt like overkill, so I left it for now. If you spot a cleaner way, I'm all ears. |
Add coverage that the merged key of a USING / NATURAL join is COALESCE(left, right) across SELECT, WHERE, ORDER BY and wildcard expansion, including RIGHT / FULL joins where the left key is NULL-padded. Refs apache#22881 Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A USING / NATURAL join exposes its join key as a single merged column whose value, per the SQL standard, is COALESCE(left, right). DataFusion resolved an unqualified reference to that merged key to the left column unconditionally. For RIGHT and FULL joins the left key is NULL-padded on rows that exist only on the right, so the merged key came out NULL instead of the value that is actually present. That wrong NULL is silent and propagates into ORDER BY and other uses of the key, changing query results. Resolve the unqualified merged key of RIGHT / FULL joins to COALESCE(left, right) so it carries the value from whichever side is present, matching the explicit `coalesce(a.k, b.k) ... ON a.k = b.k` form. INNER / LEFT are unaffected, since their left key is never NULL-padded. Refs apache#22881 Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A USING / NATURAL merged key can be referenced unqualified anywhere a column can, including in WHERE. But WHERE resolved unqualified references against only the columns the predicate itself mentions, not the join's USING columns, so an unqualified merged-key reference matched both the left and right key with no USING context and was rejected as ambiguous. Resolve WHERE predicates the same way as the SELECT list -- against the join's real USING columns -- so the merged key is recognized and, for RIGHT / FULL joins, takes the value from whichever side is present (COALESCE(left, right)) instead of erroring. Refs apache#22881 Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A wildcard (`SELECT *`) over a USING / NATURAL join collapses the duplicated join key to a single merged-key column. That surviving column was the left key, which is NULL-padded on rows present only on the right of a RIGHT / FULL join, so `SELECT *` showed NULL for the merged key on those rows. Expand the merged key in a wildcard as COALESCE(left, right) for RIGHT / FULL joins so it shows the value from whichever side is present, consistent with how an explicit reference to the key resolves and with the SQL standard. Closes apache#22881 Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Resolve an unqualified reference to the merged key of a USING / NATURAL join to the side that is never NULL-padded, addressing review feedback that always coalescing handled LEFT and RIGHT asymmetrically: the right key for RIGHT, COALESCE(left, right) for FULL, and the left key (unchanged) for INNER / LEFT. For RIGHT this yields a plain column instead of a computed COALESCE, so the merged key no longer collides with a qualified key in the same projection schema -- queries like `ORDER BY a.k` or `SELECT k, a.k` over a RIGHT join now plan cleanly. FULL still needs the computed COALESCE and keeps that limitation. Also drop the public normalize_col_with_schemas_ambiguity_and_outer_using and resolve the merged key in the SQL planner instead, so the USING-outer-join detail no longer leaks into the generic expr public API. Rename outer_using_key_pairs to right_or_full_using_key_pairs (it never covered LEFT) and carry the join type. Correct two stale integration snapshots to the right-side merged key. Refs apache#22881 Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
A FULL USING / NATURAL join exposes its merged key as an unqualified
COALESCE(left, right) column. Ordering by a qualified key of the same
join (`ORDER BY a.k`) folds that qualified `a.k` into the projection that
already carries the unqualified merged `k`, building an illegal
`{ k, a.k }` schema, so such queries errored at plan time.
During sort push-down, rename a colliding merged key to a reserved
internal name before folding the qualified sort column, and restore the
original name in the wrapper projection. References to the merged key in
the sort itself (`ORDER BY a.k, k`) are rewritten to the same internal
name so multi-key sorts keep resolving; that shape was legal before the
merged-key work, so this avoids a regression.
Teach the unparser to flatten a wrapper projection that purely renames
an inner column over a Sort, so the rewritten plan still round-trips
through plan -> SQL -> plan.
Refs apache#22881
Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
b5c263f to
b89ffe3
Compare
Which issue does this PR close?
Rationale for this change
A
USING/NATURALjoin exposes its join key as a single merged column whose value, per the SQL standard, isCOALESCE(left.key, right.key). DataFusion resolved an unqualified reference to that merged key to the left column unconditionally.For
RIGHTandFULLjoins the left key is NULL-padded on rows that exist only on the right, so the merged key came out NULL instead of the value that is actually present. The wrong NULL is silent and propagates intoGROUP BY,ORDER BY(changing row order) andSELECT *, corrupting results;WHEREon the merged key additionally failed with an "ambiguous reference" error.INNER/LEFTare unaffected, since their left key is never NULL-padded.What changes are included in this PR?
The unqualified merged key of a
RIGHT/FULL`` USING/NATURALjoin now resolves toCOALESCE(left, right)everywhere it can be referenced:SELECT,ORDER BY,GROUP BY,HAVING,QUALIFY);WHEREpredicates — previously rejected as ambiguous; they now resolve against the join's real USING columns;SELECT *).Mechanics:
LogicalPlan::outer_using_key_pairs()returns the(left, right)key pairs ofRIGHT/FULLUSING/NATURALjoins;CASE WHEN left IS NOT NULL THEN left ELSE right END— the exact form coalesce is simplified to, so it can be built indatafusion-exprwithout depending on the functions crate — aliased to the key name so the output column keeps its name.INNER/LEFTjoins and qualified access (a.k / b.k) are left unchanged.Are these changes tested?
Yes. A new sqllogictest file
join_using_merged_key.sltcovers the merged key forRIGHT/FULL/NATURALjoins acrossSELECT,SELECT *,WHEREandORDER BY, withINNER/LEFT, qualifieda.k/b.kaccess, and the explicitcoalesce(a.k, b.k) ... ONform as regression guards.The full sqllogictest suite and the
datafusion-common/datafusion-expr/datafusion-sqlunit tests pass with no regressions.Are there any user-facing changes?
Yes. For
RIGHT/FULLUSING/NATURALjoins, the merged join key now returnsCOALESCE(left, right)(the value from whichever side is present) instead ofNULLfor rows that exist only on the right, and referencing the merged key inWHEREno longer raises an ambiguous-reference error. Queries that do not useRIGHT/FULLUSING/NATURALjoins are unaffected.No breaking public API changes (the change only adds new public items).