feat: reduce duplicate fields on join#1184
Conversation
kosiew
left a comment
There was a problem hiding this comment.
Left a few suggestions.
Also, it would be good to update docs/source/.../joins.rst to mention keep_duplicate_keys
| @@ -400,7 +400,6 @@ def test_unnest_without_nulls(nested_df): | |||
| assert result.column(1) == pa.array([7, 8, 8, 9, 9, 9]) | |||
|
|
|||
|
|
|||
There was a problem hiding this comment.
It would be good if new tests for keep_duplicate_keys=False or True were added. To ensure coverage, add tests verifying that passing keep_duplicate_keys=True preserves both columns.
python/datafusion/dataframe.py
Outdated
| left_on: str | Sequence[str] | None = None, | ||
| right_on: str | Sequence[str] | None = None, | ||
| join_keys: tuple[list[str], list[str]] | None = None, | ||
| keep_duplicate_keys: bool = False, |
There was a problem hiding this comment.
The name keep_duplicate_keys is somewhat confusing: it drops the right-side keys when False.
A more direct name like drop_duplicate_keys: bool = False (default) or deduplicate: bool = False may better express intent.
Additionally, in many DataFrame libraries (Pandas, PySpark), the term suffixes or indicator is used for duplicate‐column handling. Consider whether a suffix‐based approach (with default ('', '_right')) could be more familiar to users than a boolean drop flag.
8e1ed67 to
1e7f874
Compare
Co-authored-by: kosiew <kosiew@gmail.com>
Which issue does this PR close?
Closes #1173
Rationale for this change
In the current version of the code when you do a join and there is a common
oncolumn name, then you end up with two columns in the output dataframe with ambiguous names. This is an annoyance for users where they have to work around by renaming the column to join on. With this change, it makes the interface more user friendly.What changes are included in this PR?
keep_duplicate_keysif the user does not want to drop duplicate column namesonand notjoin_onAre there any user-facing changes?
Yes.
DataFrame.join()by default will now only return a single column for duplicateonkeys. The user can revert to the previous version by settingkeep_duplicate_keystoTrue.