Skip to content

add natural join support, closes #629#3571

Merged
mattdowle merged 5 commits intomasterfrom
natural-join
May 22, 2019
Merged

add natural join support, closes #629#3571
mattdowle merged 5 commits intomasterfrom
natural-join

Conversation

@jangorecki
Copy link
Copy Markdown
Member

closes #629

@codecov
Copy link
Copy Markdown

codecov bot commented May 18, 2019

Codecov Report

Merging #3571 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3571      +/-   ##
==========================================
+ Coverage   97.58%   97.58%   +<.01%     
==========================================
  Files          66       66              
  Lines       12682    12695      +13     
==========================================
+ Hits        12376    12389      +13     
  Misses        306      306
Impacted Files Coverage Δ
R/test.data.table.R 100% <ø> (ø) ⬆️
R/data.table.R 97.67% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6790cd6...93d9436. Read the comment docs.

@codecov
Copy link
Copy Markdown

codecov bot commented May 18, 2019

Codecov Report

Merging #3571 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3571      +/-   ##
==========================================
+ Coverage   97.55%   97.55%   +<.01%     
==========================================
  Files          66       66              
  Lines       12672    12675       +3     
==========================================
+ Hits        12362    12365       +3     
  Misses        310      310
Impacted Files Coverage Δ
R/test.data.table.R 100% <ø> (ø) ⬆️
R/data.table.R 97.71% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bc17d8...de61845. Read the comment docs.

Copy link
Copy Markdown
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simple enough!

Comment thread R/data.table.R Outdated
Comment thread R/data.table.R Outdated
@MichaelChirico
Copy link
Copy Markdown
Member

are we dropping the .NATURAL shorthand? or I should re-file separately for that

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented May 18, 2019

Using .NATURAL doesn't really gives that much benefit over intersect(names(d1), names(d2)), less keystrokes fine, but in terms of functionality, only when you are chaining with [ so you cannot easily refer to d1, d2 then yes it would be useful.
I decided to implement it that way because we are not sacrificing anything. We are gaining even more simple API for join, when possible. Base Rmerge does that by default, dplyr join too, SQL do need NATURAL keyword. We want to have concise API thus better align in this way than the SQL way.

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented May 18, 2019

Great idea to get this one done.
I think I see @franknarf1's point here #629 (comment): explicit on=.COMMON should be needed. Otherwise couldn't there be user bugs when they think they are joining to key columns but their bug is that there is no key. In that case they get an error now correctly, but after this PR it would join on common names silently. The current "x has no key" error could suggest on=.COMMON.

@MichaelChirico
Copy link
Copy Markdown
Member

only when you are chaining with [ so you cannot easily refer to d1, d2 then yes it would be useful.

Exactly what I had in mind.

The API here is good, I had .NATURAL in mind not as required but as an extra way to accomplish the same thing. But as you said this is a simple fix that accomplishes a lot, better to save the more involved feature for a future PR.

@MichaelChirico
Copy link
Copy Markdown
Member

@mattdowle is correct about ambiguity... in isolation, x[y] could be either a keyed join or a natural join.

I'm not sure about making it required, but allowing .NATURAL (or .COMMON or whatever) empowers users to make their code more readable.

@MichaelChirico
Copy link
Copy Markdown
Member

@jangorecki my comments added. Standard internal approach for if the verbose message might get huge -- truncate at 10 elements. In this case, I added a third approach, on the mental model that there are essentially three cases and ordered by usage frequency: (1) join on one/two/three columns >> (2) join on all columns, e.g. self-join >>>>> (3) join on many but not all columns

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented May 19, 2019

Otherwise couldn't there be user bugs when they think they are joining to key columns but their bug is that there is no key.

Yes, there could be user bugs, but old code will behave consistently, because we optionally replace an error, so bugs will only manifest in newly written code (due to user's mistake obviously). Which is safe. Issue which this PR is addressing was asking precisely for this kind of feature #629, and having in mind that data.table aims for concise syntax I believe on=.COMMON is redundant.

@mattdowle
Copy link
Copy Markdown
Member

mattdowle commented May 19, 2019

old code will behave consistently, because we optionally replace an error

Old code could now start to return results, whereas it would error (correctly) before if x didn't have a key. Code and data can change over time and it's feasible that a key gets dropped incorrectly. When a key gets incorrectly dropped, my guess is that users would expect that x[y] (an op that has always relied on x having a key) should still fail rather than start to join on common names. In other words, isn't the error something that's relied on in production?

@jangorecki
Copy link
Copy Markdown
Member Author

jangorecki commented May 20, 2019

OK, changes amended. Natural join requires now on=.NATURAL, in such a case it also ignores the fact if x is keyed. Similarly as user would provide on by hand.
Optionally there is also new option datatable.naturaljoin, when set to TRUE then defaults on=.NATURAL, but only when x is not keyed. Thus no breaking changes should happen using such option, unless someone expects an error.

@mattdowle mattdowle added this to the 1.12.4 milestone May 22, 2019
@mattdowle mattdowle merged commit 6484781 into master May 22, 2019
@mattdowle mattdowle deleted the natural-join branch May 22, 2019 02:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[R-Forge #2175] Add natural joins i.e. X[Y] when X has no key could use common column names

3 participants