In an update join, `x[i, v := i.v]`, if multiple rows of `i` match a single row of `x`, the assignment seems to take the value from the last matching row (?). It would be nice to get an error, or at least a warning, when this behavior is triggered.
```r
library(data.table)
a <- data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), x = 11:15)
b <- data.table(id = 1:2, y = -(1:2))
b[a, on=.(id), x := i.x, verbose = TRUE]
# Calculated ad hoc index in 0 secs
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: x,i.x
# Assigning to 3 row subset of 2 rows
```
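To make the suspected "last match wins" behavior concrete, here is a minimal sketch that reruns the same update join on fresh copies of the tables and inspects the result. It only uses the objects from the example above; nothing here is a new API.

```r
library(data.table)

a <- data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), x = 11:15)
b <- data.table(id = 1:2, y = -(1:2))

# Rows 1 and 2 of a (x = 11 and x = 12) both match the row of b with id == 1,
# so two candidate values compete for a single cell of b$x.
b[a, on = .(id), x := i.x]

print(b)
```

In my understanding, `b$x` ends up as `12, 13`: the value from the second matching row of `a` silently overwrites the first, with no message unless `verbose = TRUE` is set.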
I'm not sure if the condition in the title (n > m) is necessary and sufficient for this behavior, though.
My workaround for now would involve looking at the opposite join:
```r
a[b, on=.(id), .N, by=.EACHI][, range(N)]
# [1] 1 2
```
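The opposite-join check can be wrapped in a small guard so it is less cumbersome to repeat. This is only a sketch: `has_multi_match()` is a hypothetical helper name, not part of data.table, and it assumes the join columns are passed as a character vector.

```r
library(data.table)

a <- data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), x = 11:15)
b <- data.table(id = 1:2, y = -(1:2))

# Hypothetical helper (not part of data.table): TRUE if any row of x
# is matched by more than one row of i on the columns named in cols.
has_multi_match <- function(x, i, cols) {
  # Opposite join: for each row of x, count the matching rows of i.
  n_matches <- i[x, on = cols, .N, by = .EACHI]$N
  any(n_matches > 1L)
}

if (has_multi_match(b, a, cols = "id")) {
  message("some rows of b are matched by several rows of a; update join is ambiguous")
} else {
  b[a, on = "id", x := i.x]
}
```

With the example tables the guard fires and the assignment is skipped. Swapping `message()` for `stop()` or `warning()` would give the error or warning requested above, at the cost of running the join twice.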
That seems pretty cumbersome. Maybe there's some way for me to capture and grep the verbose output (but then again, maybe not).
Just an idea: A more general approach could involve returning an object containing diagnostics from the join and assignment. Of course, the object cannot be the return value of `[.data.table`, but maybe it could be dropped in some locked-binding global, `.datatable.diagnostic`, similar to `.Last.value`. Alternatively, maybe that sort of object would fit well into @jangorecki's dtq package.
I'm thinking along these lines as I write tutorial materials to convert Stata users to R. In Stata, all joins `cat` a nice-ish table to the console.
SO post from a Stata user interested in uniqueness of matching of each row of i in x etc: https://stackoverflow.com/questions/49541330/r-data-table-merge-vs-stata-merge
Update: Re the verbose message text ("Assigning to n row subset of m rows"), the n is recorded thanks to #3460 and the m is just the number of rows in the table. I guess I didn't realize that at the time I posted this; I thought it was instead `m = uniqueN(irows, na.rm = TRUE)`, which unfortunately is not computed, so there is no way to detect from the message whether the update join was 1:1, etc., per the SO link above.
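For what it's worth, the quantity I had in mind can be computed by hand with `which = TRUE`, which returns the matched row numbers of `x` instead of doing the assignment. A sketch, where `irows`, `n` and `m` are just my own variable names and not anything data.table exposes:

```r
library(data.table)

a <- data.table(id = c(1L, 1L, 2L, 3L, NA_integer_), x = 11:15)
b <- data.table(id = 1:2, y = -(1:2))

# For each row of a (the i table): the row number of b (the x table)
# that it matches, or NA when there is no match.
irows <- b[a, on = .(id), which = TRUE]

n <- sum(!is.na(irows))            # assignments the update join would attempt
m <- uniqueN(irows, na.rm = TRUE)  # distinct rows of b actually targeted

n > m  # TRUE when some row of b would be assigned more than once
```

Here `irows` is `1 1 2 NA NA`, so `n` is 3 and `m` is 2. Defined this way, `n > m` holds exactly when some row of `b` is targeted by more than one row of `a`, which is the condition I was after.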
So anyway, I'll leave this open since it seems to highlight a point of difficulty (judging by emoji-votes) even if my suggestion does not fix it.