unique(DT) when there are no dups could be much faster #2013

@mattdowle

(The new default of using all columns brings this to the fore.)

DT = data.table(A=1:3, B=4:6)
DT
   A B
1: 1 4
2: 2 5
3: 3 6
debug(duplicated.data.table)
debug(unique.data.table)
unique(DT)

/duplicated.R#22:
Browse[3]> o
integer(0)
attr(,"starts")
[1] 1 2 3
attr(,"maxgrpn")
[1] 1

At this point maxgrpn is 1, meaning every group holds a single row, so it already knows that DT is unique and could return it (or a shallow copy) straight away. But it doesn't. It carries on, turns the all-FALSE duplicated result into 1:nrow, and then subsets every column by that 1:nrow.
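The early return described above might look roughly like the following. This is only a sketch: `unique_sketch` is a hypothetical helper, not data.table's actual code, and it calls the internal `forderv` through `:::`, whose arguments and attributes may differ between versions.

```r
library(data.table)

# Hypothetical helper for illustration; not part of data.table.
unique_sketch <- function(DT) {
  # forderv is internal (hence :::); its interface may change between versions.
  o <- data.table:::forderv(DT, by = names(DT), sort = FALSE, retGrp = TRUE)
  # maxgrpn is the size of the largest group; 1 means no row is repeated.
  if (isTRUE(attr(o, "maxgrpn") == 1L)) {
    return(DT)  # a real implementation might return a shallow copy instead
  }
  unique(DT)    # otherwise fall back to the regular path
}

DT  <- data.table(A = 1:3, B = 4:6)                      # no duplicates
DT2 <- data.table(A = c(1L, 1L, 2L), B = c(4L, 4L, 5L))  # one repeated row
nrow(unique_sketch(DT))
nrow(unique_sketch(DT2))
```

Returning DT itself avoids the per-column subset entirely; whether callers can safely receive the same object rather than a copy is a design question the sketch does not settle.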

We should also time forderv to make sure it short-circuits correctly once the first few columns have resolved all ties. In this example forderv should not touch B at all, because A alone is enough to reach uniqueness.
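One way to check this could be to compare the time for ordering by A alone against ordering by A and B on a larger table where A is already unique. The table size and column contents below are assumptions chosen for illustration, and `forderv` is again reached through `:::` as an internal function.

```r
library(data.table)

N  <- 1e6L
# A is a permutation of 1..N, so A alone already identifies every row.
DT <- data.table(A = sample(N), B = sample(N))

# forderv is internal (hence :::); its arguments may differ between versions.
t_A  <- system.time(oA  <- data.table:::forderv(DT, by = "A", retGrp = TRUE))
t_AB <- system.time(oAB <- data.table:::forderv(DT, by = c("A", "B"), retGrp = TRUE))

# If the short-circuit works, adding B should cost little extra time.
print(c(A_only = t_A[["elapsed"]], A_and_B = t_AB[["elapsed"]]))
```

A single run is noisy, so repeating the timings several times would give a more trustworthy comparison.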
