Handle duplicate fastq filenames in ezMethodFastQC via sample-name resolution#308

Draft
Copilot wants to merge 5 commits into master from copilot/improve-duplicate-detection

Copilot AI commented Feb 11, 2026

ezMethodFastQC fails immediately when samples from different run directories share fastq filenames (e.g., multiple reads.fastq.gz). This adds automatic resolution using sample names when possible.

Changes

Duplicate detection logic (lines 96-117):

  • Check if reportDirs (based on fastq basenames) contain duplicates
  • If duplicated, attempt resolution by checking if sample-name-based reportDirs are unique
  • Set needsRenaming flag if sample names resolve the conflict
  • Only error if sample names are also duplicated, with clear message showing which directories conflict

Post-FastQC renaming (lines 144-189):

  • When needsRenaming=TRUE, rename _fastqc directories, .zip, and .html files to sample-based names
  • Track renamed directories to avoid double-rename when multiple files map to same basename
  • Check target existence and rename success, fail with informative errors
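The checked renaming described above could look roughly like the following. This is a minimal sketch, not the code from the PR: `renameChecked` and `renameFastqcOutputs` are hypothetical helper names, and it assumes FastQC wrote a `<name>_fastqc` directory with sibling `.zip` and `.html` files in the same location.

```r
## Hypothetical helper (not taken from the PR): rename one path and
## stop with an informative message if the rename cannot be done safely.
renameChecked <- function(from, to) {
  if (!file.exists(from)) {
    return(invisible(FALSE))  # nothing to rename, e.g. no .html was written
  }
  if (file.exists(to)) {
    stop("Cannot rename '", from, "' to '", to, "': target already exists")
  }
  if (!file.rename(from, to)) {
    stop("Failed to rename '", from, "' to '", to, "'")
  }
  invisible(TRUE)
}

## Rename the report directory and its sibling .zip and .html outputs.
renameFastqcOutputs <- function(oldDir, newDir) {
  for (suffix in c("", ".zip", ".html")) {
    renameChecked(paste0(oldDir, suffix), paste0(newDir, suffix))
  }
}

## Demo in a temporary directory with stand-in outputs (no FastQC run needed)
tmp <- tempfile("fastqc_demo_")
dir.create(tmp)
dir.create(file.path(tmp, "reads_fastqc"))
file.create(file.path(tmp, "reads_fastqc.zip"))
renameFastqcOutputs(file.path(tmp, "reads_fastqc"),
                    file.path(tmp, "sample1_R1_fastqc"))
```

Checking the target before calling `file.rename` matters because, on some platforms, renaming onto an existing file silently replaces it.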

Example scenario:

# Two samples with identical filenames from different runs
files <- c(
  "sample1_R1" = "run1/reads.fastq.gz",
  "sample2_R1" = "run2/reads.fastq.gz"
)

# Before: immediate failure on duplicated "reads_fastqc"
# After: creates "sample1_R1_fastqc" and "sample2_R1_fastqc"

Tests: Unit tests cover no-duplicates, resolvable duplicates, unresolvable duplicates, and file extension handling.
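The four cases listed above can be illustrated with base R assertions. This is a sketch under stated assumptions: `resolveReportDirs` is a hypothetical stand-alone helper that mirrors the described logic, not the function tested in the PR, and the PR's tests may use a testing framework instead of `stopifnot`.

```r
## Hypothetical helper mirroring the described logic: returns the report
## directory names to use and whether a post-FastQC rename is required.
resolveReportDirs <- function(files) {
  stopifnot(!is.null(names(files)))
  reportDirs <- unname(sub("\\.(fastq|fq|bam)(\\.gz)?$", "_fastqc", basename(files)))
  if (!any(duplicated(reportDirs))) {
    return(list(reportDirs = reportDirs, needsRenaming = FALSE))
  }
  renamed <- paste0(names(files), "_fastqc")
  if (any(duplicated(renamed))) {
    stop("Duplicated report directories: ",
         paste(unique(renamed[duplicated(renamed)]), collapse = ", "))
  }
  list(reportDirs = renamed, needsRenaming = TRUE)
}

## 1. No duplicates: basenames are used as-is
res <- resolveReportDirs(c(s1 = "run1/a.fastq.gz", s2 = "run1/b.fastq.gz"))
stopifnot(!res$needsRenaming,
          identical(res$reportDirs, c("a_fastqc", "b_fastqc")))

## 2. Resolvable duplicates: sample names disambiguate
res <- resolveReportDirs(c(sample1_R1 = "run1/reads.fastq.gz",
                           sample2_R1 = "run2/reads.fastq.gz"))
stopifnot(res$needsRenaming,
          identical(res$reportDirs, c("sample1_R1_fastqc", "sample2_R1_fastqc")))

## 3. Unresolvable duplicates: sample names collide as well
err <- tryCatch(
  resolveReportDirs(c(s = "run1/reads.fastq.gz", s = "run2/reads.fastq.gz")),
  error = function(e) e
)
stopifnot(inherits(err, "error"))

## 4. File extension handling: .fq, .bam, and gzipped variants
res <- resolveReportDirs(c(a = "x.fq", b = "y.bam", c = "z.fq.gz"))
stopifnot(identical(res$reportDirs, c("x_fastqc", "y_fastqc", "z_fastqc")))
```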

Original prompt

Problem

The ezMethodFastQC function can fail when processing samples with duplicate fastq file names, which can occur when merging samples from different run directories. The current code checks for duplicate report directories using stopifnot(!any(duplicated(reportDirs))) but doesn't attempt to resolve the issue before failing.

Solution

Implement a smarter duplicate detection and resolution strategy:

  1. Check for duplicates in the predicted reportDirs (based on fastq file basenames)
  2. Attempt resolution by checking if renaming to sample names would eliminate duplicates
  3. Rename directories after FastQC runs if the sample-name-based approach resolves the duplication
  4. Only throw an error if duplication persists even after attempting to use sample names

Implementation Details

In the ezMethodFastQC function in R/app-fastQC.R:

Before FastQC execution (around line 103):

  • Remove the immediate stopifnot(!any(duplicated(reportDirs))) check
  • Add logic to detect duplicates and determine if sample-name-based renaming would resolve them
  • Set a needsRenaming flag if renaming would help
  • Only throw an error if sample-name-based renaming still results in duplicates

After FastQC execution (around line 134):

  • Add a renaming block that executes if needsRenaming is TRUE
  • Rename the _fastqc directories, .zip files, and .html files to use sample names
  • Update reportDirs to reflect the new directory names

Error message improvement:

If duplication cannot be resolved, provide a clear error message explaining:

  • What went wrong (duplicated fastq file names)
  • Why it couldn't be resolved (sample names also duplicate)
  • Which directories are duplicated
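One way to cover all three points in a single message is sketched below. `buildDuplicateMessage` is a hypothetical helper and the wording is illustrative only. It lists each conflicting report directory together with the input files that map to it.

```r
## Hypothetical helper: build an error message naming every conflicting
## report directory and the input files that collapse onto it.
buildDuplicateMessage <- function(files, reportDirs) {
  dupDirs <- unique(reportDirs[duplicated(reportDirs)])
  details <- vapply(dupDirs, function(d) {
    paste0(d, " <- ", paste(files[reportDirs == d], collapse = ", "))
  }, character(1))
  paste0(
    "Duplicated fastq file names detected that cannot be resolved by ",
    "renaming to sample names (the sample names are duplicated as well). ",
    "Conflicting report directories:\n",
    paste0("  ", details, collapse = "\n")
  )
}

files <- c(s = "run1/reads.fastq.gz", s = "run2/reads.fastq.gz")
reportDirs <- sub("\\.(fastq|fq|bam)(\\.gz)?$", "_fastqc", basename(files))
msg <- buildDuplicateMessage(files, reportDirs)
cat(msg, "\n")
```

Passing the result to `stop(msg)` would raise the error. Keeping message construction separate makes it easy to assert on the text in a unit test.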

Code Changes

Replace lines 103-110 with:

## guess the names of the report directories that will be created by fastqc
reportDirs <- sub("\\.(fastq|fq|bam)(\\.gz)*$", "_fastqc", basename(files))

## Check for duplicates and determine if renaming to sample names resolves it
needsRenaming <- FALSE
reportDirsRenamed <- NULL
if (any(duplicated(reportDirs))) {
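  ## Try using sample names instead. Sample names usually carry no file
  ## extension, so strip one only if present and always append "_fastqc".
  reportDirsRenamed <- paste0(sub("\\.(fastq|fq|bam)(\\.gz)*$", "", names(files)), "_fastqc")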
  
  if (!any(duplicated(reportDirsRenamed))) {
    ## Renaming to sample names resolves the duplication
    needsRenaming <- TRUE
  } else {
    ## Renaming doesn't resolve duplication - throw error
    stop("Duplicated fastq file names detected that cannot be resolved by renaming to sample names. ",
         "This typically occurs when merging samples from different run directories with identical file names. ",
         "Duplicated report directories: ", 
         paste(unique(reportDirs[duplicated(reportDirs)]), collapse = ", "))
  }
}

stopifnot(!any(duplicated(reportDirs)) || needsRenaming)

Add after the FastQC execution block (after line 134, after the gc() call):

## Rename directories if needed to resolve duplicates
if (needsRenaming) {
  for (i in seq_along(files)) {
    if (reportDirs[i] != reportDirsRenamed[i] && file.exists(reportDirs[i])) {
      file.rename(reportDirs[i], reportDirsRenamed[i])
      ## Also rename the zip file if it exists
      zipOld <- paste0(reportDirs[i], ".zip")
      zipNew <- paste0(reportDirsRenamed[i], ".zip")
      if (file.exists(zipOld)) {
        file.rename(zipOld, zipNew)
      }
      ## Also rename the html file if it exists
      htmlOld <- paste0(reportDirs[i], ".html")
      htmlNew <- paste0(reportDirsRenamed[i], ".html")
      if (file.exists(htmlOld)) {
        file.rename(htmlOld, htmlNew)
      }
    }
  }
  ## Update reportDirs to the renamed versions
  reportDirs <- reportDirsRenamed
}

Benefits

  • Prevents unnecessary failures: Automatically resolves duplicate fastq filenames when sample names are unique
  • Better error messages: If resolution isn't possible, users get clear information about what went wrong
  • No upstream changes needed: Works with existing file structures and naming conventions
  • Maintains compatibility: All downstream code continues to work using the reportDirs variable

This pull request was created from Copilot chat.



Copilot AI and others added 4 commits February 11, 2026 08:24
Co-authored-by: CodingKaiser <16783087+CodingKaiser@users.noreply.github.com>
Co-authored-by: CodingKaiser <16783087+CodingKaiser@users.noreply.github.com>
…mprove renaming logic

Co-authored-by: CodingKaiser <16783087+CodingKaiser@users.noreply.github.com>
Co-authored-by: CodingKaiser <16783087+CodingKaiser@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement smarter duplicate detection and resolution in ezMethodFastQC" to "Handle duplicate fastq filenames in ezMethodFastQC via sample-name resolution" Feb 11, 2026
Copilot AI requested a review from CodingKaiser February 11, 2026 08:29