Replace dplyr grouping with fast data.table implementation, with optional strand grouping #5

@Arshammik

Description

We need to replace the dplyr-based code for grouping and mutation with a faster data.table version.
The goal is to replicate the behavior of the following dplyr pipeline:

library(dplyr)
library(data.table)

# Reference behavior: per-group ID and size for start_cor_id, then for end_cor_id
temp_eventdata_grouped <- temp_eventdata %>%
  group_by(start_cor_id) %>%
  mutate(start_cor_group_id = cur_group_id()) %>%
  mutate(start_cor_group_count = n()) %>%
  ungroup() %>%
  group_by(end_cor_id) %>%
  mutate(end_cor_group_id = cur_group_id()) %>%
  mutate(end_cor_group_count = n()) %>%
  as.data.table()
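
One detail worth checking when porting this pipeline: dplyr's cur_group_id() numbers groups in sorted key order, while data.table's .GRP numbers them in order of first appearance under by. The sketch below uses invented toy values and two illustrative column names (start_grp_first_seen, start_grp_sorted) to show the difference, and uses a dense rank as one possible way to reproduce the sorted numbering. Whether downstream code depends on the exact ID values is not stated in this issue.

```r
library(data.table)

# Toy input; the values are invented for illustration only.
temp_eventdata <- data.table(
  start_cor_id = c("s2", "s1", "s2", "s3"),
  end_cor_id   = c("e1", "e1", "e2", "e1")
)

dt <- copy(temp_eventdata)  # copy only so the toy input stays untouched

# .GRP follows first appearance: s2 -> 1, s1 -> 2, s3 -> 3
dt[, start_grp_first_seen := .GRP, by = start_cor_id]

# A dense rank follows sorted key order: s1 -> 1, s2 -> 2, s3 -> 3
dt[, start_grp_sorted := frank(start_cor_id, ties.method = "dense")]

print(dt)
```

If only the partition into groups matters (and not the specific integer assigned to each group), the plain .GRP form is sufficient.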

Additional functionality:

  • Add optional support for grouping by DNA strand (strand column) alongside start_cor_id and end_cor_id when specified by the user.
  • Ensure that if the strand information is used, the groupings respect both ID and strand combinations.

Requirements

  • Rewrite the above pipeline entirely using data.table for better performance.
  • Keep the output structure compatible with downstream code (i.e., a data.table with the same new columns).
  • Add an argument like use_strand = TRUE/FALSE to toggle whether the DNA strand should be included in grouping keys.
  • Preserve the same column names:
    • start_cor_group_id
    • start_cor_group_count
    • end_cor_group_id
    • end_cor_group_count
  • Avoid unnecessary copying of the data.
  • When use_strand = TRUE but the strand column is missing, the behavior must be one of:
    • Graceful fallback (ignore strand), or
    • Explicit error (clear messaging).
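
Both options for a missing strand column could be offered behind a switch. The following is a minimal sketch, not a settled design: the helper name resolve_group_cols and the argument on_missing_strand are hypothetical and do not appear elsewhere in this issue.

```r
library(data.table)

# Hypothetical helper: returns the grouping columns for one ID column.
# on_missing_strand is an assumed argument name, not part of the issue.
resolve_group_cols <- function(dt, id_col, use_strand = FALSE,
                               on_missing_strand = c("error", "ignore")) {
  on_missing_strand <- match.arg(on_missing_strand)
  if (!id_col %in% names(dt)) {
    stop("Column '", id_col, "' not found in input.", call. = FALSE)
  }
  # Strand-agnostic mode: group by the ID column alone
  if (!use_strand) return(id_col)
  # Strand-aware mode: group by ID and strand together
  if ("strand" %in% names(dt)) return(c(id_col, "strand"))
  # Strand requested but absent: explicit error or warned fallback
  if (on_missing_strand == "error") {
    stop("use_strand = TRUE but the input has no 'strand' column.", call. = FALSE)
  }
  warning("use_strand = TRUE but no 'strand' column found; grouping by '",
          id_col, "' only.", call. = FALSE)
  id_col
}
```

Defaulting to the explicit error is arguably the safer choice, since a silent fallback would change the grouping without the caller noticing.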

Deliverables

  • group_eventdata_datatable.R (main function or utility script)
  • Updated documentation/comments explaining the difference between strand-aware and strand-agnostic grouping
  • Unit tests to check correctness of both use_strand = TRUE and use_strand = FALSE modes
  • Benchmarks comparing runtime between dplyr and data.table versions (optional)

Notes

  • When use_strand = TRUE, grouping keys will be:
    • start_cor_id + strand for start_cor_group_id
    • end_cor_id + strand for end_cor_group_id
  • When use_strand = FALSE, grouping uses only start_cor_id (for the start_cor_* columns) or only end_cor_id (for the end_cor_* columns).
  • data.table usage should leverage .GRP and .N efficiently:
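    • .GRP gives the current group number (with by, groups are numbered in order of first appearance, whereas cur_group_id() numbers them in sorted key order)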
    • .N gives the size of the current group
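
Putting the notes above together, the function could look roughly like the sketch below. The name group_eventdata_dt is hypothetical, the example values are invented, and the choice to raise an error on a missing strand column is only one of the two options listed under Requirements. Columns are added by reference with :=, so the input is not copied.

```r
library(data.table)

# Hypothetical function name; adds the four columns by reference (no copy).
group_eventdata_dt <- function(dt, use_strand = FALSE) {
  stopifnot(is.data.table(dt))
  if (use_strand && !"strand" %in% names(dt)) {
    stop("use_strand = TRUE but the input has no 'strand' column.", call. = FALSE)
  }
  # Grouping keys: the ID column alone, or ID + strand
  strand_col <- if (use_strand) "strand" else character(0)
  start_by <- c("start_cor_id", strand_col)
  end_by   <- c("end_cor_id", strand_col)

  # .GRP = group number (first-appearance order), .N = group size
  dt[, c("start_cor_group_id", "start_cor_group_count") := list(.GRP, .N),
     by = start_by]
  dt[, c("end_cor_group_id", "end_cor_group_count") := list(.GRP, .N),
     by = end_by]
  dt[]
}

# Toy data, invented for illustration
events <- data.table(
  start_cor_id = c("s1", "s1", "s2", "s1"),
  end_cor_id   = c("e1", "e2", "e2", "e1"),
  strand       = c("+", "-", "+", "+")
)

# copy() here only keeps the toy input reusable for both calls
agnostic <- group_eventdata_dt(copy(events), use_strand = FALSE)
aware    <- group_eventdata_dt(copy(events), use_strand = TRUE)
```

In the strand-agnostic result, the three s1 rows share one start group of size 3. In the strand-aware result, the s1 row on the "-" strand is split into its own group, so the s1/"+" group has size 2.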

Priorities

  • High: Correctness of grouping logic, especially with optional strand handling.
  • High: Full replacement of dplyr with data.table to improve speed.
  • Medium: Defensive coding for missing or incorrectly typed strand column.
  • Low: Benchmarks and performance comparison.

Labels

enhancement