Description
We need to replace the dplyr-based code for grouping and mutation with a faster data.table version.
The goal is to replicate the behavior of the following dplyr pipeline:
temp_eventdata_grouped <- temp_eventdata %>%
  group_by(start_cor_id) %>%
  mutate(start_cor_group_id = cur_group_id()) %>%
  mutate(start_cor_group_count = n()) %>%
  ungroup() %>%
  group_by(end_cor_id) %>%
  mutate(end_cor_group_id = cur_group_id()) %>%
  mutate(end_cor_group_count = n()) %>%
  as.data.table()
Additional functionality:
- Add optional support for grouping by DNA strand (strand column) alongside start_cor_id and end_cor_id when specified by the user.
- Ensure that if the strand information is used, the groupings respect both ID and strand combinations.
Requirements
- Rewrite the above pipeline entirely using data.table for better performance.
- Keep the output structure compatible with downstream code (i.e., a data.table with the same new columns).
- Add an argument like use_strand = TRUE/FALSE to toggle whether the DNA strand should be included in grouping keys.
- Preserve the same column names:
  - start_cor_group_id
  - start_cor_group_count
  - end_cor_group_id
  - end_cor_group_count
- Avoid unnecessary copying of the data.
- When use_strand = TRUE but the strand column is missing, the behavior must be one of:
  - Graceful fallback (ignore strand), or
  - Explicit error (clear messaging).
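The requirements above could be met along the following lines. This is only a sketch under stated assumptions: the function name is borrowed from the deliverable file name, the signature is hypothetical, and it takes the explicit-error route for a missing strand column (a warning plus fallback would be the other acceptable option). Note that .GRP numbers groups in order of first appearance, whereas dplyr numbers them by sorted key, so the id values may differ from the dplyr output even though the groups themselves are the same.

```r
library(data.table)

# Hypothetical helper: name and arguments are illustrative, not existing code.
group_eventdata_datatable <- function(dt, use_strand = FALSE) {
  # as.data.table() copies, so only non-data.table inputs pay that cost;
  # a data.table input is modified by reference.
  if (!is.data.table(dt)) dt <- as.data.table(dt)

  strand_key <- character(0)
  if (isTRUE(use_strand)) {
    if (!"strand" %in% names(dt)) {
      stop("use_strand = TRUE requires a 'strand' column in the input.",
           call. = FALSE)
    }
    strand_key <- "strand"
  }

  start_keys <- c("start_cor_id", strand_key)
  end_keys   <- c("end_cor_id", strand_key)

  # One grouped pass per key set; := adds both columns without copying dt.
  dt[, c("start_cor_group_id", "start_cor_group_count") := .(.GRP, .N),
     by = start_keys]
  dt[, c("end_cor_group_id", "end_cor_group_count") := .(.GRP, .N),
     by = end_keys]
  dt[]
}
```

Because a data.table input is updated in place, callers that need the original untouched would pass copy(temp_eventdata) instead.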
Deliverables
- group_eventdata_datatable.R (main function or utility script)
- Updated documentation/comments explaining the difference between strand-aware and strand-agnostic grouping
- Unit tests to check correctness of both use_strand = TRUE and use_strand = FALSE modes
- Benchmarks comparing runtime between dplyr and data.table versions (optional)
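For the unit-test deliverable, one possible check is that the data.table ids split the rows into the same groups as an independent reference, without requiring the id numbers themselves to match. The sketch below uses a hypothetical helper, same_partition, and toy data invented for illustration; it relies only on base R and data.table.

```r
library(data.table)

# Hypothetical test helper: TRUE when two id vectors induce the same grouping,
# regardless of which integer each group happens to receive.
same_partition <- function(a, b) {
  identical(match(a, unique(a)), match(b, unique(b)))
}

# Toy data (illustrative only).
ev <- data.table(
  start_cor_id = c("s2", "s1", "s2", "s1"),
  strand       = c("+", "+", "-", "+")
)

# Strand-aware ids via data.table.
ev[, id_dt := .GRP, by = c("start_cor_id", "strand")]

# Independent base-R reference: one factor level per id/strand combination.
ev[, id_ref := as.integer(factor(paste(start_cor_id, strand, sep = "|")))]

stopifnot(same_partition(ev$id_dt, ev$id_ref))
```

Comparing partitions rather than raw ids keeps the test valid whichever numbering convention the final implementation adopts.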
Notes
- When use_strand = TRUE, grouping keys will be:
  - start_cor_id + strand for start_cor_group_id
  - end_cor_id + strand for end_cor_group_id
- When use_strand = FALSE, grouping will be by start_cor_id alone for the start columns and by end_cor_id alone for the end columns.
- data.table usage should leverage .GRP and .N efficiently:
  - .GRP gives the current group number (with by, groups are numbered in order of first appearance)
  - .N gives the size of the current group
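A small illustration of .GRP and .N on toy data (the values are invented for the example). It also shows one possible way to obtain ids that follow sorted key order, as dplyr's group numbering does, using a dense rank; whether downstream code depends on that numbering is an open assumption.

```r
library(data.table)

dt <- data.table(start_cor_id = c("b", "a", "b", "c"))

# .GRP: group counter in order of first appearance -> 1, 2, 1, 3
dt[, grp_first_seen := .GRP, by = start_cor_id]

# .N: number of rows in the current group -> 2, 1, 2, 1
dt[, grp_count := .N, by = start_cor_id]

# Dense rank of the key: ids follow sorted key order -> 2, 1, 2, 3
dt[, grp_sorted := frank(start_cor_id, ties.method = "dense")]

print(dt)
```

For strand-aware grouping the same calls would take by = c("start_cor_id", "strand").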
Priorities
- High: Correctness of grouping logic, especially with optional strand handling.
- High: Full replacement of dplyr with data.table to improve speed.
- Medium: Defensive coding for missing or incorrectly typed strand column.
- Low: Benchmarks and performance comparison.