
Need a way to protect data from shallow copy #2277

@renkun-ken

Description

I'm working on a production system in which a basic data.table (nearly 200 columns and 6M rows) is generated at the beginning, then tens of scripts operate on this data.table, computing derivative variables in place and producing a final column of values. To avoid deep-copying the data.table, each time I use dt[TRUE] to make a shallow copy and then add new derived columns to it. As documented, this does not prevent in-place modification of existing columns in the original dt. Therefore I'm wondering if there's a way to protect the existing columns from modification.

A basic workflow looks like this:

dt <- generate_data() # a big data.table

run("script-1.R")
run("script-2.R")
# ...
run("script-100.R")

where run() uses sys.source() to evaluate a given script file in a sandbox environment.
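For concreteness, a minimal sketch of such a run() helper, assuming each script is evaluated in a fresh sandbox environment whose parent can see dt; the function body and its arguments are illustrative, not from the original post:

```r
# Hypothetical sketch of run(): evaluate one script file inside a
# fresh environment so its temporary variables do not leak, while
# objects in the global environment (such as dt) remain visible
# through the parent chain.
run <- function(script) {
  sandbox <- new.env(parent = globalenv())
  sys.source(script, envir = sandbox)
  invisible(sandbox)  # return the sandbox so its results can be inspected
}
```

A script sourced this way can read dt from the global environment, but its own assignments stay inside the returned sandbox environment.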

In each script-n.R, the code looks like this:

ft <- dt[TRUE]
ft[, x1 := ..., by = col1]
ft[, x2 := ..., by = col2]
# ...
ft[, x := x1 + x2 * abs(x3 - x4)]

where none of the columns modified in ft are supposed to exist in dt, so that they are added without modifying any of the pre-existing columns in dt.

I know the safest approach is to copy() dt every time, but that is simply too time-consuming since the production run is also time-critical. So the question is: is there a way to protect all columns in dt so that ft[, x1 := ...] only allows new columns to be added and prevents changing any column of dt?
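One defensive option in the meantime (purely illustrative, not a data.table feature) is to validate the column names a script intends to create against the protected columns of dt before any := assignment, failing fast on a collision:

```r
# Hypothetical guard, not part of data.table: stop if any column a
# script wants to create already exists among the protected columns,
# so a `:=` cannot silently overwrite data shared with dt.
assert_new_cols <- function(new_cols, protected_cols) {
  clash <- intersect(new_cols, protected_cols)
  if (length(clash) > 0) {
    stop("refusing to modify protected column(s): ",
         paste(clash, collapse = ", "))
  }
  invisible(TRUE)
}
```

Each script would call, say, assert_new_cols(c("x1", "x2"), names(dt)) before assigning. This catches name collisions up front, but it cannot detect in-place writes made through other references to the same columns.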

Metadata

Assignees

No one assigned

Labels

by-reference (Issues related to by-reference/copying behavior)
