Keyed tables are already known to be sorted, so finding unique values is much easier than in the general case.
Compare:
library(data.table)

NN = 1e8
set.seed(13013)
# about 400 MB, if you're RAM-conscious
DT = data.table(V1 = sample(1e5, NN, TRUE), key = 'V1')
system.time(unique(DT$V1))
# user system elapsed
# 1.354 0.415 1.798
system.time(DT[ , unique(V1)])
# user system elapsed
# 1.266 0.414 1.681
system.time(DT[ , TRUE, keyby = V1])
# user system elapsed
# 0.375 0.000 0.375
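For intuition on why the sorted case is so much cheaper, here is a minimal sketch: on a sorted vector, the unique values are exactly the positions where the value changes, so a single linear pass suffices (sorted_unique is a hypothetical helper for illustration, not part of data.table):

# hypothetical helper: unique values of an already-sorted vector in one pass
sorted_unique = function(x) {
  if (!length(x)) return(x)
  # keep the first element, plus every position where the value changes
  x[c(TRUE, x[-1L] != x[-length(x)])]
}
system.time(sorted_unique(DT$V1))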
It seems to me the second call to unique (i.e. the one within []) should be able to match (or beat) the final, keyby timing.
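In the meantime, the keyby trick can stand in for unique() directly, since grouping by the key yields one row per distinct key value, already in sorted order (extracting the column with [[ is just one way to pull out the vector; it picks up the group column, which comes first in the result):

# one row per distinct key value, already in key (sorted) order
u = DT[ , TRUE, keyby = V1][['V1']]
# the key column is sorted, so unique() preserves that order and the two should agree
identical(u, unique(DT$V1))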
If we were willing to do something like add a dt_primary_key class to the primary key, we could also achieve this speed in the first approach by writing a unique.dt_primary_key method (a sketch follows), but I'm not sure how well that extends to multiple key columns (S4?).
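A minimal sketch of that idea, assuming a single integer key; dt_primary_key and unique.dt_primary_key are hypothetical names, not part of data.table, and the class() assignment below copies the column, which an in-package implementation would avoid:

# hypothetical S3 method: exploit the known sortedness of the key column
unique.dt_primary_key = function(x, incomparables = FALSE, ...) {
  x = unclass(x)  # drop the marker class, work on the plain vector
  if (!length(x)) return(x)
  # the key guarantees sortedness, so one pass over the change points suffices
  x[c(TRUE, x[-1L] != x[-length(x)])]
}

pk = DT$V1
class(pk) = c('dt_primary_key', class(pk))  # note: this copies the column
system.time(unique(pk))  # dispatches to the sorted-aware method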