Skip to content

FR: expansion of melt functionality for handling names of output #1547

@MichaelChirico

Description

@MichaelChirico

I'm trying to summarize some data which I have stored (in wide form) as repeated cross-sections; the way to go would appear to be to summarize and melt, but there's some crucial information lost in the process -- variable names are tossed & it's not clear we can predict the resulting order.

An example:

set.seed(10239)
DT <- setnames(as.data.table(replicate(8, runif(10))),
               paste0("meas", rep(1:3, c(3, 3, 2)), 
                      rep(c("_jan", "_feb", "_mar"))))
DT
#        meas1_jan meas1_feb  meas1_mar  meas2_jan
#  1: 0.6600937557 0.3694227 0.04429776 0.02691871
#  2: 0.5562533571 0.2260477 0.99935621 0.62940965
#  3: 0.7800425424 0.9500860 0.05847563 0.90400149
#  4: 0.3372507696 0.1853700 0.96434815 0.23819722
#  5: 0.0008951684 0.2743278 0.33468493 0.55100025
#  6: 0.4064611420 0.6773607 0.76984110 0.03732329
#  7: 0.5057564366 0.6481860 0.46362786 0.10610388
#  8: 0.2232209216 0.3236526 0.94694898 0.32251033
#  9: 0.9359388519 0.1219411 0.79383274 0.52573046
# 10: 0.1367496471 0.0121765 0.80466101 0.86377631
#       meas2_feb  meas2_mar  meas3_jan meas3_feb
#  1: 0.438328012 0.48073709 0.15475894 0.6031853
#  2: 0.951975449 0.38903743 0.41983518 0.1806789
#  3: 0.006685105 0.53661894 0.01102345 0.6490597
#  4: 0.719372826 0.49205557 0.24464973 0.9969787
#  5: 0.318251195 0.98845165 0.75659034 0.3054178
#  6: 0.117685706 0.10667707 0.60810269 0.1472457
#  7: 0.210446290 0.76481919 0.70049390 0.4345142
#  8: 0.900549510 0.68583183 0.93117619 0.3676016
#  9: 0.628168668 0.02526911 0.55742879 0.7729294
# 10: 0.498467443 0.17884387 0.97037215 0.1924867

We want to melt grouping each month's data, so patterns is quite helpful:

melt(DT[ , lapply(.SD, mean)],
     measure.vars = patterns("_jan", "_feb", "_mar"))

Looks beautiful, but the output less so:

#    variable    value1    value2    value3
#1:        1 0.4542663 0.3788571 0.6180074
#2:        2 0.4204972 0.4789930 0.4648342
#3:        3 0.5354431 0.4650098        NA

We've lost a lot of info. Does value1 correspond to jan, feb, or mar? Does variable == 1 mean meas1, meas2, or meas3?

I hope that value1 means jan, value2 means feb, and value3 means mar, but this is unclear; ditto that valuek corresponds to variable == k. This is especially true if the column order of the input is potentially unknown.

The value.name and variable.name fields are pretty useless to help us here.

It seems the robust way to deal with this as of now is to specify the measure.vars in a list, like so:

melt(DT[ , lapply(.SD, mean)],
     measure.vars = list(paste0("meas", 1:3, "_jan"),
                         paste0("meas", 1:3, "_feb"),
                         paste0("meas", 1:2, "_mar")))

With this, confidence is restored in the order of output, so we can at least rename the output without worry of mis-naming something. But this seems verbose and ugly, and to make patterns seem a lot less helpful as a function.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions