I'm trying to summarize some data which I have stored (in wide form) as repeated cross-sections; the way to go would appear to be to summarize and melt, but there's some crucial information lost in the process -- variable names are tossed & it's not clear we can predict the resulting order.
An example:
set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
paste0("meas", rep(1:3, c(3, 3, 2)),
rep(c("_jan", "_feb", "_mar"))))
DT
# meas1_jan meas1_feb meas1_mar meas2_jan
# 1: 0.6600937557 0.3694227 0.04429776 0.02691871
# 2: 0.5562533571 0.2260477 0.99935621 0.62940965
# 3: 0.7800425424 0.9500860 0.05847563 0.90400149
# 4: 0.3372507696 0.1853700 0.96434815 0.23819722
# 5: 0.0008951684 0.2743278 0.33468493 0.55100025
# 6: 0.4064611420 0.6773607 0.76984110 0.03732329
# 7: 0.5057564366 0.6481860 0.46362786 0.10610388
# 8: 0.2232209216 0.3236526 0.94694898 0.32251033
# 9: 0.9359388519 0.1219411 0.79383274 0.52573046
# 10: 0.1367496471 0.0121765 0.80466101 0.86377631
# meas2_feb meas2_mar meas3_jan meas3_feb
# 1: 0.438328012 0.48073709 0.15475894 0.6031853
# 2: 0.951975449 0.38903743 0.41983518 0.1806789
# 3: 0.006685105 0.53661894 0.01102345 0.6490597
# 4: 0.719372826 0.49205557 0.24464973 0.9969787
# 5: 0.318251195 0.98845165 0.75659034 0.3054178
# 6: 0.117685706 0.10667707 0.60810269 0.1472457
# 7: 0.210446290 0.76481919 0.70049390 0.4345142
# 8: 0.900549510 0.68583183 0.93117619 0.3676016
# 9: 0.628168668 0.02526911 0.55742879 0.7729294
# 10: 0.498467443 0.17884387 0.97037215 0.1924867
We want to melt grouping each month's data, so patterns is quite helpful:
melt(DT[ , lapply(.SD, mean)],
measure.vars = patterns("_jan", "_feb", "_mar"))
Looks beautiful, but the output less so:
# variable value1 value2 value3
#1: 1 0.4542663 0.3788571 0.6180074
#2: 2 0.4204972 0.4789930 0.4648342
#3: 3 0.5354431 0.4650098 NA
We've lost a lot of info. Does value1 correspond to jan, feb, or mar? Does variable == 1 mean meas1, meas2, or meas3?
I hope that value1 means jan, value2 means feb, and value3 means mar, but this is unclear; ditto that valuek corresponds to variable == k. This is especially true if the column order of the input is potentially unknown.
The value.name and variable.name fields are pretty useless to help us here.
It seems the robust way to deal with this as of now is to specify the measure.vars in a list, like so:
melt(DT[ , lapply(.SD, mean)],
measure.vars = list(paste0("meas", 1:3, "_jan"),
paste0("meas", 1:3, "_feb"),
paste0("meas", 1:2, "_mar")))
With this, confidence is restored in the order of output, so we can at least rename the output without worry of mis-naming something. But this seems verbose and ugly, and to make patterns seem a lot less helpful as a function.
I'm trying to summarize some data which I have stored (in wide form) as repeated cross-sections; the way to go would appear to be to summarize and melt, but there's some crucial information lost in the process -- variable names are tossed & it's not clear we can predict the resulting order.
An example:
We want to
meltgrouping each month's data, sopatternsis quite helpful:Looks beautiful, but the output less so:
We've lost a lot of info. Does
value1correspond tojan,feb, ormar? Doesvariable == 1meanmeas1,meas2, ormeas3?I hope that
value1meansjan,value2meansfeb, andvalue3meansmar, but this is unclear; ditto thatvaluekcorresponds tovariable == k. This is especially true if the column order of the input is potentially unknown.The
value.nameandvariable.namefields are pretty useless to help us here.It seems the robust way to deal with this as of now is to specify the
measure.varsin alist, like so:With this, confidence is restored in the order of output, so we can at least rename the output without worry of mis-naming something. But this seems verbose and ugly, and to make
patternsseem a lot less helpful as a function.