subsetArchRProject fails during barcode collisions between samples #2288

@dgodovich

Hello,

When I subset an ArchR project by cells with subsetArchRProject, the resulting ArrowFiles contained more cells than I expected. After investigating, I found that the cause is 10X barcodes that are repeated across samples. These cells have unique names of the form [sample]#[barcode], but the sample prefix is stripped when the ArrowFiles are copied over (see this line of code).

Current behavior: ArrowFiles are subset to every cell whose barcode matches, even when the match comes from a different sample.
Expected behavior: ArrowFiles are subset to exactly the cells specified.
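
To make the mechanism concrete, here is a minimal base-R sketch with made-up cell names. It is not ArchR's code; it only assumes, as described above, that the sample prefix is dropped before cells are matched. The sample names, barcodes, and variable names are all hypothetical.

```r
# Toy cell names in the [sample]#[barcode] form (made-up values).
all_cells <- c(
  "S1#AAAC", "S1#CCGT", "S1#GGTA",
  "S2#AAAC", "S2#TTAG",
  "S3#CCGT", "S3#AAAC"
)

# Cells the user actually asked for.
wanted <- c("S1#AAAC", "S3#CCGT")

# Bare barcode of every cell (everything after the '#').
barcodes <- sub("^.*#", "", all_cells)

# Barcode-only matching: the sample prefix is dropped from the request,
# so any cell in any sample with a requested barcode is kept.
wanted_barcodes <- sub("^.*#", "", wanted)
kept_by_barcode <- all_cells[barcodes %in% wanted_barcodes]

# Full-name matching: only the requested cells are kept.
kept_by_name <- all_cells[all_cells %in% wanted]

print(kept_by_barcode)  # 5 cells: the 2 requested plus 3 collisions
print(kept_by_name)     # 2 cells: exactly the request
```

Matching on the full [sample]#[barcode] name is what the expected behavior above amounts to.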

See below for a reproducible example with the tutorial dataset.

library(ArchR)
library(parallel)
inputFiles <- getTutorialData("Hematopoiesis")
addArchRGenome("hg19")
addArchRThreads(threads = 16) 

ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  minTSS = 4,
  minFrags = 1000, 
  addTileMat = FALSE,
  addGeneScoreMat = FALSE
)

projHeme1 <- ArchRProject(
  ArrowFiles = ArrowFiles, 
  outputDirectory = "HemeTutorial",
  copyArrows = TRUE
)

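# Bare barcode (the part after '#') of every cell
barcodes <- sapply(strsplit(getCellNames(projHeme1), '#'), '[', 2)
# Keep cells whose barcode occurs again later in the cell list
cell_subset <- getCellNames(projHeme1)[duplicated(barcodes, fromLast = TRUE)]
print(length(cell_subset)) # 43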

projSubset <- subsetArchRProject(
    ArchRProj = projHeme1,
    cells = cell_subset,
    outputDirectory = "ArchRSubset",
    dropCells = TRUE,
    force = TRUE
)

print(nCells(projSubset)) # has the expected 43 cells

print(lapply(getArrowFiles(projSubset), nCells) %>% unlist %>% sum) # 70
# ArrowFiles of this project have 70 cells

test_subset <- ArchRProject(getArrowFiles(projSubset), outputDirectory = 'test_proj/')
print(nCells(test_subset))
# also has 70 cells
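
As a possible pre-check before subsetting, the sketch below lists every cell whose bare barcode occurs in more than one sample. The helper `find_cross_sample_collisions` is hypothetical and not part of ArchR; it works on a plain character vector of [sample]#[barcode] names and is shown here on toy data.

```r
# Hypothetical helper (not part of ArchR): return the cell names whose
# bare barcode occurs in more than one sample.
find_cross_sample_collisions <- function(cell_names) {
  samples  <- sub("#.*$", "", cell_names)      # text before the first '#'
  barcodes <- sub("^[^#]*#", "", cell_names)   # text after the first '#'
  # Number of distinct samples in which each barcode occurs.
  n_samples <- tapply(samples, barcodes, function(s) length(unique(s)))
  shared <- names(n_samples)[n_samples > 1]
  cell_names[barcodes %in% shared]
}

# Toy input (made-up names); AAAC and CCGT each appear in two samples.
toy <- c("S1#AAAC", "S1#CCGT", "S2#AAAC", "S2#TTAG", "S3#CCGT")
print(find_cross_sample_collisions(toy))
```

With a real project, the same function could presumably be called on `getCellNames(projHeme1)` to see which cells are at risk of being pulled in by a barcode-only match.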

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /lila/home/godovid/miniconda3/envs/workshop_2024/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 
 
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.70.2                  
 [3] rtracklayer_1.62.0                BiocIO_1.12.0                    
 [5] Biostrings_2.70.3                 XVector_0.42.0                   
 [7] rhdf5_2.46.1                      SummarizedExperiment_1.32.0      
 [9] Biobase_2.62.0                    RcppArmadillo_0.12.8.3.0         
[11] Rcpp_1.0.13                       Matrix_1.6-5                     
[13] GenomicRanges_1.54.1              GenomeInfoDb_1.38.8              
[15] IRanges_2.36.0                    S4Vectors_0.40.2                 
[17] BiocGenerics_0.48.1               sparseMatrixStats_1.14.0         
[19] MatrixGenerics_1.14.0             matrixStats_1.3.0                
[21] data.table_1.15.4                 stringr_1.5.1                    
[23] plyr_1.8.9                        magrittr_2.0.3                   
[25] ggplot2_3.5.1                     gtable_0.3.5                     
[27] gtools_3.9.5                      gridExtra_2.3                    
[29] devtools_2.4.5                    usethis_2.2.2                    
[31] ArchR_1.0.3                      

loaded via a namespace (and not attached):
 [1] bitops_1.0-7             remotes_2.4.2.1          rlang_1.1.4             
 [4] compiler_4.3.2           callr_3.7.3              vctrs_0.6.5             
 [7] profvis_0.3.8            pkgconfig_2.0.3          crayon_1.5.2            
[10] fastmap_1.2.0            ellipsis_0.3.2           utf8_1.2.4              
[13] Rsamtools_2.18.0         promises_1.3.0           sessioninfo_1.2.2       
[16] ps_1.7.6                 purrr_1.0.2              zlibbioc_1.48.2         
[19] cachem_1.0.8             jsonlite_1.8.8           later_1.3.2             
[22] rhdf5filters_1.14.1      DelayedArray_0.28.0      BiocParallel_1.36.0     
[25] uuid_1.2-0               Rhdf5lib_1.24.2          prettyunits_1.2.0       
[28] R6_2.5.1                 stringi_1.8.4            pkgload_1.3.4           
[31] IRkernel_1.3.2           base64enc_0.1-3          httpuv_1.6.15           
[34] tidyselect_1.2.1         yaml_2.3.8               abind_1.4-5             
[37] codetools_0.2-19         miniUI_0.1.1.1           processx_3.8.3          
[40] pkgbuild_1.4.2           lattice_0.22-5           tibble_3.2.1            
[43] shiny_1.8.1.1            withr_3.0.1              evaluate_0.23           
[46] urlchecker_1.0.1         pillar_1.9.0             generics_0.1.3          
[49] RCurl_1.98-1.14          IRdisplay_1.1            munsell_0.5.1           
[52] scales_1.3.0             xtable_1.8-4             glue_1.7.0              
[55] tools_4.3.2              GenomicAlignments_1.38.2 pbdZMQ_0.3-11           
[58] fs_1.6.4                 XML_3.99-0.16.1          colorspace_2.1-1        
[61] GenomeInfoDbData_1.2.11  repr_1.1.7               restfulr_0.0.15         
[64] cli_3.6.3                fansi_1.0.6              S4Arrays_1.2.1          
[67] dplyr_1.1.4              digest_0.6.35            SparseArray_1.2.4       
[70] rjson_0.2.21             htmlwidgets_1.6.4        memoise_2.0.1           
[73] htmltools_0.5.8.1        lifecycle_1.0.4          mime_0.12               

Labels: bug