Implementing and Testing DIANN converter for MSstatsBIG.#9
Conversation
📝 Walkthrough

Adds chunked DIANN-to-MSstats conversion.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant bigDIANN as bigDIANNtoMSstatsFormat
    participant reduce as reduceBigDIANN
    participant chunk as cleanDIANNChunk
    participant preprocess as MSstatsPreprocessBig
    participant storage as Arrow/Sparklyr
    User->>bigDIANN: call with DIANN file & params
    bigDIANN->>reduce: start chunked reduction
    reduce->>reduce: detect delimiter
    rect rgba(200,220,255,0.5)
        Note over reduce,chunk: Chunked loop (1e6 rows/chunk)
        reduce->>chunk: process chunk 1
        chunk->>chunk: select columns (MBR-aware)
        chunk->>chunk: un-nest semicolon fields
        chunk->>chunk: parse FragmentIon & ProductCharge
        chunk->>chunk: apply Q-value and fragment filters
        chunk-->>reduce: return processed chunk
        reduce->>reduce: append to CSV output
        reduce->>chunk: process chunk 2...N
    end
    reduce-->>bigDIANN: reduced CSV ready
    bigDIANN->>preprocess: MSstatsPreprocessBig (feature selection, optional anomaly scoring)
    preprocess-->>storage: produce Arrow/Sparklyr table
    bigDIANN-->>User: return processed data
```
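The chunked loop in the diagram can be sketched in R. This is an illustrative simplification only; the function and argument names (`reduce_in_chunks`, `clean_chunk`) are assumptions, not the package's actual API.

```r
library(readr)

# Hypothetical sketch of the chunked reduction: read the DIANN report
# in fixed-size chunks, clean each chunk, and append it to one output CSV.
reduce_in_chunks <- function(input_file, output_file, delim = ",",
                             chunk_size = 1e6, clean_chunk = identity) {
  callback <- SideEffectChunkCallback$new(function(chunk, pos) {
    cleaned <- clean_chunk(chunk)       # column selection, filtering, etc.
    write_csv(cleaned, output_file,
              append = pos != 1)        # write the header only for chunk 1
  })
  read_delim_chunked(input_file, callback, delim = delim,
                     chunk_size = chunk_size)
  invisible(output_file)
}
```

Because only one chunk is held in memory at a time, peak memory stays bounded regardless of the input file's size.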
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 5
Fix all issues with AI Agents 🤖
In @DESCRIPTION:
- Around line 26-27: Move the test-only packages out of Imports and into
Suggests in DESCRIPTION: remove "testthat" and "mockery" from the Imports list
and add them under Suggests so they are only required for development/testing;
update the DESCRIPTION sections accordingly so Imports contains runtime
dependencies only and Suggests lists testthat and mockery.
In @man/bigDIANNtoMSstatsFormat.Rd:
- Around line 33-34: The documentation and default parameter use incorrect DIANN
column names without dots; update the man page entry for quantificationColumn to
use "Fragment.Quant.Corrected" (DIANN 1.8.x) and "Fragment.Quant.Raw" (DIANN
1.9.x) instead of the dotted-less variants, and update the default value in the
reduceBigDIANN function (R/clean_DIANN.R, function reduceBigDIANN) to
"Fragment.Quant.Corrected" so the code and docs match DIANN's actual output
column names.
In @R/clean_DIANN.R:
- Around line 49-58: The current sub('\\^\\.\\*', '', .data$Fragment.Info) in
the dplyr::mutate (assigning FragmentIon) uses an incorrect literal regex and
does nothing; either remove the sub and set FragmentIon = .data$Fragment.Info
directly, or replace the pattern with one that strips the charge/suffix (e.g.
sub("\\^.*$", "", .data$Fragment.Info)) so "y7^1/1" -> "y7" and "b3-H2O^1/1" ->
"b3-H2O"; update the FragmentIon assignment accordingly in the mutate call.
In @R/converters.R:
- Around line 162-163: The roxygen docs for the quantificationColumn parameter
are inconsistent with the actual default; update the documentation text for
quantificationColumn to match the real default value used in the function
(change the displayed 'FragmentQuantCorrected' to 'Fragment.Quant.Corrected')
and likewise ensure the alternative value text matches the actual option name
used in code (e.g., use 'Fragment.Quant.Raw' if that is the real token), so the
docstring and the default parameter value for quantificationColumn are
identical.
🧹 Nitpick comments (6)
R/clean_DIANN.R (2)
91-91: Address the TODO comment regarding annotation columns.

The TODO indicates uncertainty about whether `Condition` and `BioReplicate` columns are needed. This should be confirmed with Tony (or through documentation review) to ensure the converter produces the correct MSstats format.
2-21: Consider adding error handling for I/O operations.

Both `reduceBigDIANN` and `cleanDIANNChunk` lack error handling for file operations. Consider wrapping I/O operations in `tryCatch` blocks to provide informative error messages when:
- Input files are missing, corrupted, or have incorrect format
- Output path is invalid or write-protected
- Disk space is exhausted during chunked writes

This would improve the user experience when processing large files that might fail midway through.
Also applies to: 24-113
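A minimal sketch of the suggested wrapping follows; the wrapper name `safe_reduce` is hypothetical, and `reduceBigDIANN`'s exact signature may differ.

```r
# Illustrative only: fail fast on bad inputs and convert low-level I/O
# errors into actionable messages for the user.
safe_reduce <- function(input_file, output_path, ...) {
  if (!file.exists(input_file)) {
    stop("Input file not found: ", input_file)
  }
  if (!dir.exists(dirname(output_path))) {
    stop("Output directory does not exist: ", dirname(output_path))
  }
  tryCatch(
    reduceBigDIANN(input_file, output_path, ...),
    error = function(e) {
      stop("Chunked reduction failed (check file format and disk space): ",
           conditionMessage(e))
    }
  )
}
```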
tests/testthat/test-converters.R (4)
31-32: Use `tempfile()` for output file to avoid test pollution.

The input file uses `tempfile()` but the output file is hardcoded as `"preprocess_output.csv"`. This could cause issues with parallel test execution or leave artifacts in the working directory if the test fails before cleanup.

🔎 Proposed fix

```diff
 input_file <- tempfile(fileext = ".csv")
-output_file <- "preprocess_output.csv"
+output_file <- tempfile(fileext = ".csv")
```
57-60: Use `on.exit()` for guaranteed cleanup.

If an assertion fails before reaching cleanup, temporary files may persist. Using `on.exit()` ensures cleanup runs regardless of test outcome.

🔎 Proposed fix

```diff
 input_file <- tempfile(fileext = ".csv")
-output_file <- "preprocess_output.csv"
+output_file <- tempfile(fileext = ".csv")
+on.exit({
+  if (file.exists(input_file)) file.remove(input_file)
+  if (file.exists(output_file)) file.remove(output_file)
+}, add = TRUE)
 # ... test code ...
-# Cleanup
-file.remove(input_file)
-if (file.exists(output_file)) file.remove(output_file)
 })
```
64-64: Unused `mock_reduce` variable.

`mock_reduce` is created but never used. The `stub()` call on line 66 is what actually replaces `reduceBigSpectronaut`. This line can be removed.

🔎 Proposed fix

```diff
 test_that("bigSpectronauttoMSstatsFormat works correctly", {
-  # Mock reduceBigSpectronaut as its source is not provided
-  mock_reduce <- mock(NULL)
   stub(bigSpectronauttoMSstatsFormat, "reduceBigSpectronaut", function(input_file, output_path, ...) {
```
76-77: Use `tempfile()` for test files.

Similar to the previous test, hardcoded filenames can cause test pollution. While `input_file` isn't actually read (due to mocking), using `tempfile()` for `output_file` ensures proper isolation.

🔎 Proposed fix

```diff
-input_file <- "dummy_spectro_input.csv"
-output_file <- "spectro_output.csv"
+input_file <- "dummy_spectro_input.csv"  # Not actually read due to mock
+output_file <- tempfile(fileext = ".csv")
+on.exit({
+  if (file.exists(output_file)) file.remove(output_file)
+  reduce_file <- paste0("reduce_output_", basename(output_file))
+  if (file.exists(reduce_file)) file.remove(reduce_file)
+}, add = TRUE)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `tests/testthat/topN_preprocess_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_spectro_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_test_diann_output.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (7)
- DESCRIPTION
- NAMESPACE
- R/clean_DIANN.R
- R/converters.R
- man/bigDIANNtoMSstatsFormat.Rd
- tests/testthat/test-converters.R
- tests/testthat/test-diann_converter.R
🔇 Additional comments (6)
NAMESPACE (1)

5-5: LGTM!

The new export for `bigDIANNtoMSstatsFormat` is correctly added and aligns with the new converter functionality.

tests/testthat/test-diann_converter.R (1)
6-109: LGTM! Comprehensive test coverage.

The test suite provides excellent coverage of the DIANN converter functionality:
- cleanDIANNChunk test: Validates chunk processing, fragment filtering (H2O removal), and column mapping
- reduceBigDIANN test: Tests file-based chunked processing with multiple proteins and charge extraction
- bigDIANNtoMSstatsFormat test: End-to-end validation of feature selection logic with arrow backend
The tests correctly exercise key edge cases and validate expected outputs.
tests/testthat/test-converters.R (3)

1-4: LGTM on test setup.

The imports and context setup are appropriate for testing converter functions.

6-28: LGTM!

Good coverage of the annotation merge functionality. The test validates that `Run` and `Intensity` are preserved while `Condition` and `BioReplicate` are correctly joined.

92-95: Cleanup may miss the intermediate file with the `tempfile()` approach.

If you adopt `tempfile()` for `output_file`, the `paste0("reduce_output_", output_file)` pattern will break because `output_file` would be a full path like `/tmp/RtmpXXX/file123.csv`. Consider adjusting how the reduce output path is constructed, or use `basename()`. Also, consider wrapping cleanup in `on.exit()` as suggested for the previous test.

R/converters.R (1)
170-199: Verify `reduceBigDIANN` exists and parameter alignment.

The function implementation follows the established pattern from `bigSpectronauttoMSstatsFormat`. The function `reduceBigDIANN` is correctly defined in R/clean_DIANN.R and properly invoked.

Looking at the `MSstatsPreprocessBig` signature (R/converters.R:44-54), the parameter order is: `input_file, output_file_name, backend, max_feature_count, filter_unique_peptides, aggregate_psms, filter_few_obs, remove_annotation, calculateAnomalyScores, anomalyModelFeatures, connection`.

For `bigSpectronauttoMSstatsFormat` (R/converters.R:147-151):

```r
MSstatsPreprocessBig(
  paste0("reduce_output_", output_file_name), output_file_name, backend,
  max_feature_count, aggregate_psms, filter_few_obs, remove_annotation,
  calculateAnomalyScores, anomalyModelFeatures, connection)
```

This is missing `filter_unique_peptides` entirely, a bug in the existing function.

Your new function correctly includes all parameters in the proper order:

```r
MSstatsPreprocessBig(
  paste0("reduce_output_", output_file_name), output_file_name, backend,
  max_feature_count, filter_unique_peptides, aggregate_psms, filter_few_obs,
  remove_annotation, calculateAnomalyScores, anomalyModelFeatures, connection)
```
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@R/clean_DIANN.R`:
- Around line 2-3: The default quantificationColumn in reduceBigDIANN is
incorrect ("FragmentQuantCorrected") and should match bigDIANNtoMSstatsFormat's
column name ("Fragment.Quant.Corrected") so the column can be found; update the
default value of the quantificationColumn parameter in reduceBigDIANN to
"Fragment.Quant.Corrected" (or otherwise normalize dotted vs. non-dotted names
inside reduceBigDIANN to accept both), referencing the reduceBigDIANN function
and the bigDIANNtoMSstatsFormat expectation when making the change.
- Around line 35-58: The code unconditionally uses columns "Fragment.Info" and
the variable quantificationColumn after only using intersect() earlier, so add
guards to skip downstream processing when those columns are missing: check if
quantificationColumn exists in colnames(input) before calling
tidyr::separate_rows on it or converting input[[quantificationColumn]] <-
as.numeric(...), and check for "Fragment.Info" in colnames(input) before
separating rows or any parsing of fragment data; use the existing split_cols or
explicit logical checks (e.g., if (quantificationColumn %in% colnames(input)) {
... } and if ("Fragment.Info" %in% colnames(input)) { ... }) around the
tidyr::separate_rows, fragment processing, and numeric conversion to avoid
crashes when columns are absent.
- Around line 64-70: ProductCharge currently becomes NA when Fragment.Info
contains "/" but no digits follow because as.integer(stringr::str_extract(...))
yields NA; modify the ProductCharge computation (the dplyr::if_else branch that
uses stringr::str_extract on .data$Fragment.Info) to wrap the extracted value
with dplyr::coalesce (or equivalent) so that if str_extract returns NA it falls
back to "1" before coercion, ensuring the final value is 1L when parsing fails.
In `@R/converters.R`:
- Around line 184-193: The intermediate path construction using
paste0("reduce_output_", output_file_name) is unsafe when output_file_name
contains directories; update calls that create the reduced intermediate filename
(in reduceBigDIANN -> MSstatsPreprocessBig sequence and the analogous locations
in bigSpectronauttoMSstatsFormat) to build the path with
file.path(dirname(output_file_name), paste0("reduce_output_",
basename(output_file_name))) so the reduce_output_* file is created alongside
the intended output file regardless of directory components.
In `@tests/testthat/test-diann_converter.R`:
- Around line 78-108: The test uses a static output filename
("test_diann_output.csv") which can cause collisions; change the output_file
assignment to use tempfile(fileext = ".csv") and update all subsequent
references (the call to bigDIANNtoMSstatsFormat, the dplyr::collect on
converted, and the cleanup file.remove calls that use output_file and
paste0("reduce_output_", output_file)) to use that tempfile variable so the test
writes/cleans unique temp files; relevant symbols: output_file,
bigDIANNtoMSstatsFormat, converted, and the cleanup file.remove/paste0 calls.
♻️ Duplicate comments (2)
R/clean_DIANN.R (2)
7-13: Stop on unsupported extensions; avoid treating .xlsx as TSV.

`grepl("tsv|xls")` matches `.xlsx`, which is binary. The `";"` fallback can silently mis-parse files. Prefer anchored, case-insensitive extension checks and error for unsupported formats.

🐛 Proposed fix

```diff
-  if (grepl("csv", input_file)) {
+  if (grepl("\\.csv$", input_file, ignore.case = TRUE)) {
     delim = ","
-  } else if (grepl("tsv|xls", input_file)) {
+  } else if (grepl("\\.(tsv|txt)$", input_file, ignore.case = TRUE)) {
     delim = "\t"
   } else {
-    delim <- ";"
+    stop("Unsupported file extension. Please provide a .csv or .tsv/.txt file.")
   }
```
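The difference between substring and anchored matching is easy to demonstrate with an illustrative snippet:

```r
# Substring match: ".xlsx" contains "xls", so a binary Excel file
# would be (mis)classified as tab-separated text.
grepl("tsv|xls", "report.xlsx")                             # TRUE

# Anchored, case-insensitive extension check avoids that.
grepl("\\.(tsv|txt)$", "report.xlsx", ignore.case = TRUE)   # FALSE
grepl("\\.(tsv|txt)$", "report.TSV",  ignore.case = TRUE)   # TRUE
```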
60-63: Remove the ineffective FragmentIon regex.

`sub('\\^\\.\\*', ...)` matches the literal string `"^.*"` and is effectively a no-op. If no transformation is intended, assign the value directly.

🔧 Suggested change

```diff
-    FragmentIon = sub('\\^\\.\\*', '', .data$Fragment.Info),
+    FragmentIon = .data$Fragment.Info,
```
```r
# Extract product charge
ProductCharge = dplyr::if_else(
  grepl("/", .data$Fragment.Info),
  # Extract charge (number right after "/" in string), default to 1 if parsing fails
  as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
  1L
)
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "clean_DIANN.R" -type f
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 81
🏁 Script executed:
```shell
sed -n '64,70p' ./R/clean_DIANN.R
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 348
🏁 Script executed:
```shell
python3 << 'EOF'
# Let's verify the behavior of stringr::str_extract and as.integer in R
# by examining what happens when the regex pattern doesn't match
# The behavior should be:
# stringr::str_extract("something/", "(?<=/)[0-9]+") -> NA (no digits after /)
# as.integer(NA) -> NA (not 1L)
# This is a R behavior question, let me create a test to understand it better
print("Testing R string extraction and type conversion behavior:")
print("Case 1: Fragment.Info with / but no digits after it")
print("  stringr::str_extract returns: NA (pattern doesn't match)")
print("  as.integer(NA) returns: NA (not 1L)")
print("")
print("Case 2: Fragment.Info with / and digits after it")
print("  stringr::str_extract returns: the digits")
print("  as.integer(digits) returns: integer value")
print("")
print("Case 3: Fragment.Info without /")
print("  Current code returns: 1L (via the else branch of if_else)")
print("")
print("So the bug is in Case 1 - when / is present but no digits follow")
print("The function would return NA instead of 1L")
EOF
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 150
🏁 Script executed:
```shell
grep -r "ProductCharge" ./R --include="*.R" -B 2 -A 2
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 4903
🏁 Script executed:
```shell
sed -n '50,80p' ./R/clean_DIANN.R
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 1088
Ensure ProductCharge defaults to 1 when parsing fails.
The current code returns NA when a / is present in Fragment.Info but no digits follow (e.g., "y1/"). The regex pattern "(?<=/)[0-9]+" fails to match, causing stringr::str_extract() to return NA, and subsequently as.integer(NA) returns NA instead of the intended default of 1L. Use dplyr::coalesce() to handle this edge case:
🔧 Suggested change

```diff
-ProductCharge = dplyr::if_else(
-  grepl("/", .data$Fragment.Info),
-  # Extract charge (number right after "/" in string), default to 1 if parsing fails
-  as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
-  1L
-)
+ProductCharge = dplyr::if_else(
+  grepl("/", .data$Fragment.Info),
+  dplyr::coalesce(
+    as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
+    1L
+  ),
+  1L
+)
```

🤖 Prompt for AI Agents
In `@R/clean_DIANN.R` around lines 64 - 70, ProductCharge currently becomes NA
when Fragment.Info contains "/" but no digits follow because
as.integer(stringr::str_extract(...)) yields NA; modify the ProductCharge
computation (the dplyr::if_else branch that uses stringr::str_extract on
.data$Fragment.Info) to wrap the extracted value with dplyr::coalesce (or
equivalent) so that if str_extract returns NA it falls back to "1" before
coercion, ensuring the final value is 1L when parsing fails.
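The edge case described above can be reproduced directly in an interactive session (illustrative snippet):

```r
library(stringr)
library(dplyr)

# Digits after the slash: extraction succeeds.
str_extract("y7^1/2", "(?<=/)[0-9]+")   # "2"

# Slash present but no digits follow: extraction returns NA.
str_extract("y1/", "(?<=/)[0-9]+")      # NA

# coalesce() supplies the intended 1L default when extraction fails.
coalesce(as.integer(str_extract("y1/", "(?<=/)[0-9]+")), 1L)   # 1
```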
```r
test_that("bigDIANNtoMSstatsFormat works with arrow backend", {
  input_file <- tempfile(fileext = ".csv")
  output_file <- "test_diann_output.csv"

  # 4 features for one protein. Feature selection should pick the top 2.
  diann_data <- rbind(
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(1000, 1100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y1", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(500, 600), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y2", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(100, 100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y3", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(2000, 2100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y4", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001)
  )
  write.csv(diann_data, input_file, row.names = FALSE)

  converted <- bigDIANNtoMSstatsFormat(
    input_file = input_file,
    output_file_name = output_file,
    backend = "arrow",
    max_feature_count = 2
  )
  result <- dplyr::collect(converted)

  # Avg intensities: y1=1050, y2=550, y3=100, y4=2050.
  # Top 2 features are y4 and y1.
  expect_equal(nrow(result), 4) # 2 features * 2 runs
  expect_true(all(c("y1", "y4") %in% unique(result$FragmentIon)))
  expect_false(any(c("y2", "y3") %in% unique(result$FragmentIon)))

  # Cleanup
  file.remove(input_file)
  if (file.exists(output_file)) file.remove(output_file)
  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
```
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the actual test file to confirm the current state
```shell
cat -n tests/testthat/test-diann_converter.R | sed -n '70,115p'
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 2688
🏁 Script executed:
# Check if there's any test configuration for parallel execution
```shell
git ls-files | grep -E "(testthat|\.yml|\.yaml|\.json)" | head -20
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 257
🏁 Script executed:
# Look for similar test output file patterns in other test files
```shell
fd '\.R$' tests/testthat/ | xargs grep -l "tempfile\|file.remove" | head -10
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 134
🏁 Script executed:
# Check for test configuration files (especially for parallel execution)
```shell
fd -e yml -e yaml -e dcf | head -20
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 46
🏁 Script executed:
# Look at other tests to see if they use static filenames too
```shell
grep -n "output_file.*<-" tests/testthat/*.R
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 465
🏁 Script executed:
# Check the git history to see if these leftover files should have been cleaned
```shell
git ls-files tests/testthat/*.csv
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 186
🏁 Script executed:
# Look for how other tests in the project handle temporary output files
```shell
grep -B 2 -A 2 "tempfile.*output\|output.*tempfile" tests/testthat/*.R | head -40
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 803
Use `tempfile()` instead of a static filename to prevent test collisions.

The other tests in this file (lines 7 and 45) correctly use `tempfile(fileext = ".csv")` for output files. This test should follow the same pattern to avoid potential collisions in parallel test runs or from previous test failures. Evidence of incomplete cleanup is visible in the repository with leftover CSV files (e.g., `topN_test_diann_output.csv`).
🔧 Suggested change

```diff
-output_file <- "test_diann_output.csv"
+output_file <- basename(tempfile(fileext = ".csv"))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```r
test_that("bigDIANNtoMSstatsFormat works with arrow backend", {
  input_file <- tempfile(fileext = ".csv")
  output_file <- basename(tempfile(fileext = ".csv"))
  # 4 features for one protein. Feature selection should pick the top 2.
  diann_data <- rbind(
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(1000, 1100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y1", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(500, 600), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y2", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(100, 100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y3", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(2000, 2100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y4", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001)
  )
  write.csv(diann_data, input_file, row.names = FALSE)
  converted <- bigDIANNtoMSstatsFormat(
    input_file = input_file,
    output_file_name = output_file,
    backend = "arrow",
    max_feature_count = 2
  )
  result <- dplyr::collect(converted)
  # Avg intensities: y1=1050, y2=550, y3=100, y4=2050.
  # Top 2 features are y4 and y1.
  expect_equal(nrow(result), 4) # 2 features * 2 runs
  expect_true(all(c("y1", "y4") %in% unique(result$FragmentIon)))
  expect_false(any(c("y2", "y3") %in% unique(result$FragmentIon)))
  # Cleanup
  file.remove(input_file)
  if (file.exists(output_file)) file.remove(output_file)
  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
})
```
🤖 Prompt for AI Agents
In `@tests/testthat/test-diann_converter.R` around lines 78 - 108, The test uses a
static output filename ("test_diann_output.csv") which can cause collisions;
change the output_file assignment to use tempfile(fileext = ".csv") and update
all subsequent references (the call to bigDIANNtoMSstatsFormat, the
dplyr::collect on converted, and the cleanup file.remove calls that use
output_file and paste0("reduce_output_", output_file)) to use that tempfile
variable so the test writes/cleans unique temp files; relevant symbols:
output_file, bigDIANNtoMSstatsFormat, converted, and the cleanup
file.remove/paste0 calls.
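The pattern the prompt describes can be sketched for a generic testthat test; names here are placeholders, and the `reduce_output_` path construction is the repository's existing convention:

```r
test_that("converter output uses isolated temp files", {
  input_file  <- tempfile(fileext = ".csv")
  output_file <- tempfile(fileext = ".csv")
  # Guaranteed cleanup even if an expectation fails mid-test.
  on.exit(unlink(c(input_file, output_file,
                   file.path(dirname(output_file),
                             paste0("reduce_output_", basename(output_file))))),
          add = TRUE)
  # ... write input, call the converter, assert on the collected result ...
})
```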
Think this concern is valid
@tonywu1999 Thank you for confirming! Using tempfile() will ensure the tests run cleanly without potential file collisions, especially in parallel test scenarios or CI environments.
…t.ion is NA, ProductCharge is set to 1
…mn is not present
```r
if (!"Precursor.Mz" %in% colnames(input)) {
  input <- dplyr::mutate(input, Precursor.Mz = NA)
}
if (!"Fragment.Info" %in% colnames(input)) {
```
https://github.com/Vitek-Lab/MSstatsConvert/blob/devel/R/clean_DIANN.R#L115-L118 : Instead, it should be like Frag1, Frag2, etc.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@R/clean_DIANN.R`:
- Around line 241-249: The .filterDIANNByQValues function fails if expected
q-value columns are missing; update it to defensive-check for required columns
before filtering: inside .filterDIANNByQValues, detect whether MBR is TRUE and
then verify presence of "LibQValue" and "LibPGQValue", or if MBR is FALSE verify
"GlobalQValue" and "GlobalPGQValue"; if missing, either create those columns
filled with NA (so downstream filtering behaves predictably) or return a clear
stop()/warning indicating the missing column names and skip the MBR-specific
filter, then apply the DetectionQValue filter as currently done. Use the
function name .filterDIANNByQValues and the parameter MBR and the column names
LibQValue, LibPGQValue, GlobalQValue, GlobalPGQValue in your checks to locate
where to add this logic.
```r
.filterDIANNByQValues <- function(input, MBR, global_qvalue_cutoff, qvalue_cutoff, pg_qvalue_cutoff) {
  input <- dplyr::filter(input, DetectionQValue < global_qvalue_cutoff)

  if (MBR) {
    dplyr::filter(input, LibPGQValue < pg_qvalue_cutoff & LibQValue < qvalue_cutoff)
  } else {
    dplyr::filter(input, GlobalPGQValue < pg_qvalue_cutoff & GlobalQValue < qvalue_cutoff)
  }
}
```
Missing Q-value columns will cause runtime errors.
When `MBR=TRUE`, the function expects `LibQValue` and `LibPGQValue` columns (line 245). When `MBR=FALSE`, it expects `GlobalQValue` and `GlobalPGQValue` (line 247). If these columns are absent from the input (e.g., older DIANN versions or different output settings), the function will fail with an unclear error.
🛡️ Suggested defensive check

```diff
 .filterDIANNByQValues <- function(input, MBR, global_qvalue_cutoff, qvalue_cutoff, pg_qvalue_cutoff) {
   input <- dplyr::filter(input, DetectionQValue < global_qvalue_cutoff)
   if (MBR) {
+    if (!all(c("LibPGQValue", "LibQValue") %in% colnames(input))) {
+      warning("MBR Q-value columns not found, skipping MBR filtering")
+      return(input)
+    }
     dplyr::filter(input, LibPGQValue < pg_qvalue_cutoff & LibQValue < qvalue_cutoff)
   } else {
+    if (!all(c("GlobalPGQValue", "GlobalQValue") %in% colnames(input))) {
+      warning("Global Q-value columns not found, skipping global filtering")
+      return(input)
+    }
     dplyr::filter(input, GlobalPGQValue < pg_qvalue_cutoff & GlobalQValue < qvalue_cutoff)
   }
 }
```

🤖 Prompt for AI Agents
In `@R/clean_DIANN.R` around lines 241 - 249, The .filterDIANNByQValues function
fails if expected q-value columns are missing; update it to defensive-check for
required columns before filtering: inside .filterDIANNByQValues, detect whether
MBR is TRUE and then verify presence of "LibQValue" and "LibPGQValue", or if MBR
is FALSE verify "GlobalQValue" and "GlobalPGQValue"; if missing, either create
those columns filled with NA (so downstream filtering behaves predictably) or
return a clear stop()/warning indicating the missing column names and skip the
MBR-specific filter, then apply the DetectionQValue filter as currently done.
Use the function name .filterDIANNByQValues and the parameter MBR and the column
names LibQValue, LibPGQValue, GlobalQValue, GlobalPGQValue in your checks to
locate where to add this logic.
WIP: Added clean_DIANN, the DIANN converter, and some basic tests (more comprehensive tests are still needed)
Summary by CodeRabbit
- New Features
- Tests
- Documentation
- Chores