Implementing and Testing DIANN converter for MSstatsBIG.#9
Conversation
📝 Walkthrough

Adds chunked DIANN-to-MSstats conversion.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant bigDIANN as bigDIANNtoMSstatsFormat
    participant reduce as reduceBigDIANN
    participant chunk as cleanDIANNChunk
    participant preprocess as MSstatsPreprocessBig
    participant storage as Arrow/Sparklyr
    User->>bigDIANN: call with DIANN file & params
    bigDIANN->>reduce: start chunked reduction
    reduce->>reduce: detect delimiter
    rect rgba(200,220,255,0.5)
        Note over reduce,chunk: Chunked loop (1e6 rows/chunk)
        reduce->>chunk: process chunk 1
        chunk->>chunk: select columns (MBR-aware)
        chunk->>chunk: un-nest semicolon fields
        chunk->>chunk: parse FragmentIon & ProductCharge
        chunk->>chunk: apply Q-value and fragment filters
        chunk-->>reduce: return processed chunk
        reduce->>reduce: append to CSV output
        reduce->>chunk: process chunk 2...N
    end
    reduce-->>bigDIANN: reduced CSV ready
    bigDIANN->>preprocess: MSstatsPreprocessBig (feature selection, optional anomaly scoring)
    preprocess-->>storage: produce Arrow/Sparklyr table
    bigDIANN-->>User: return processed data
```
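The chunked loop in the diagram can be sketched in R. This is an illustrative simplification only; the function and argument names (`reduce_in_chunks`, `clean_chunk`) are assumptions, not the package's actual API.

```r
library(readr)

# Hypothetical sketch of the chunked reduction: read the DIANN report
# in fixed-size chunks, clean each chunk, and append it to one output CSV.
reduce_in_chunks <- function(input_file, output_file, delim = ",",
                             chunk_size = 1e6, clean_chunk = identity) {
  callback <- SideEffectChunkCallback$new(function(chunk, pos) {
    cleaned <- clean_chunk(chunk)       # column selection, filtering, etc.
    write_csv(cleaned, output_file,
              append = pos != 1)        # write the header only for chunk 1
  })
  read_delim_chunked(input_file, callback, delim = delim,
                     chunk_size = chunk_size)
  invisible(output_file)
}
```

Because only one chunk is held in memory at a time, peak memory stays bounded regardless of the input file's size.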
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 5
Fix all issues with AI Agents 🤖
In @DESCRIPTION:
- Around line 26-27: Move the test-only packages out of Imports and into
Suggests in DESCRIPTION: remove "testthat" and "mockery" from the Imports list
and add them under Suggests so they are only required for development/testing;
update the DESCRIPTION sections accordingly so Imports contains runtime
dependencies only and Suggests lists testthat and mockery.
In @man/bigDIANNtoMSstatsFormat.Rd:
- Around line 33-34: The documentation and default parameter use incorrect DIANN
column names without dots; update the man page entry for quantificationColumn to
use "Fragment.Quant.Corrected" (DIANN 1.8.x) and "Fragment.Quant.Raw" (DIANN
1.9.x) instead of the dotted-less variants, and update the default value in the
reduceBigDIANN function (R/clean_DIANN.R, function reduceBigDIANN) to
"Fragment.Quant.Corrected" so the code and docs match DIANN's actual output
column names.
In @R/clean_DIANN.R:
- Around line 49-58: The current sub('\\^\\.\\*', '', .data$Fragment.Info) in
the dplyr::mutate (assigning FragmentIon) uses an incorrect literal regex and
does nothing; either remove the sub and set FragmentIon = .data$Fragment.Info
directly, or replace the pattern with one that strips the charge/suffix (e.g.
sub("\\^.*$", "", .data$Fragment.Info)) so "y7^1/1" -> "y7" and "b3-H2O^1/1" ->
"b3-H2O"; update the FragmentIon assignment accordingly in the mutate call.
In @R/converters.R:
- Around line 162-163: The roxygen docs for the quantificationColumn parameter
are inconsistent with the actual default; update the documentation text for
quantificationColumn to match the real default value used in the function
(change the displayed 'FragmentQuantCorrected' to 'Fragment.Quant.Corrected')
and likewise ensure the alternative value text matches the actual option name
used in code (e.g., use 'Fragment.Quant.Raw' if that is the real token), so the
docstring and the default parameter value for quantificationColumn are
identical.
🧹 Nitpick comments (6)
R/clean_DIANN.R (2)
91-91: Address the TODO comment regarding annotation columns.

The TODO indicates uncertainty about whether `Condition` and `BioReplicate` columns are needed. This should be confirmed with Tony (or through documentation review) to ensure the converter produces the correct MSstats format.
2-21: Consider adding error handling for I/O operations.

Both `reduceBigDIANN` and `cleanDIANNChunk` lack error handling for file operations. Consider wrapping I/O operations in `tryCatch` blocks to provide informative error messages when:
- Input files are missing, corrupted, or have incorrect format
- Output path is invalid or write-protected
- Disk space is exhausted during chunked writes

This would improve the user experience when processing large files that might fail midway through.
Also applies to: 24-113
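A minimal sketch of the suggested wrapping follows; the wrapper name `safe_reduce` is hypothetical, and `reduceBigDIANN`'s exact signature may differ.

```r
# Illustrative only: fail fast on bad inputs and convert low-level I/O
# errors into actionable messages for the user.
safe_reduce <- function(input_file, output_path, ...) {
  if (!file.exists(input_file)) {
    stop("Input file not found: ", input_file)
  }
  if (!dir.exists(dirname(output_path))) {
    stop("Output directory does not exist: ", dirname(output_path))
  }
  tryCatch(
    reduceBigDIANN(input_file, output_path, ...),
    error = function(e) {
      stop("Chunked reduction failed (check file format and disk space): ",
           conditionMessage(e))
    }
  )
}
```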
tests/testthat/test-converters.R (4)
31-32: Use `tempfile()` for output file to avoid test pollution.

The input file uses `tempfile()` but the output file is hardcoded as `"preprocess_output.csv"`. This could cause issues with parallel test execution or leave artifacts in the working directory if the test fails before cleanup.

🔎 Proposed fix

```diff
 input_file <- tempfile(fileext = ".csv")
-output_file <- "preprocess_output.csv"
+output_file <- tempfile(fileext = ".csv")
```
57-60: Use `on.exit()` for guaranteed cleanup.

If an assertion fails before reaching cleanup, temporary files may persist. Using `on.exit()` ensures cleanup runs regardless of test outcome.

🔎 Proposed fix

```diff
 input_file <- tempfile(fileext = ".csv")
-output_file <- "preprocess_output.csv"
+output_file <- tempfile(fileext = ".csv")
+on.exit({
+  if (file.exists(input_file)) file.remove(input_file)
+  if (file.exists(output_file)) file.remove(output_file)
+}, add = TRUE)
 # ... test code ...
-# Cleanup
-file.remove(input_file)
-if (file.exists(output_file)) file.remove(output_file)
 })
```
64-64: Unused `mock_reduce` variable.

`mock_reduce` is created but never used. The `stub()` call on line 66 is what actually replaces `reduceBigSpectronaut`. This line can be removed.

🔎 Proposed fix

```diff
 test_that("bigSpectronauttoMSstatsFormat works correctly", {
-  # Mock reduceBigSpectronaut as its source is not provided
-  mock_reduce <- mock(NULL)
   stub(bigSpectronauttoMSstatsFormat, "reduceBigSpectronaut", function(input_file, output_path, ...) {
```
76-77: Use `tempfile()` for test files.

Similar to the previous test, hardcoded filenames can cause test pollution. While `input_file` isn't actually read (due to mocking), using `tempfile()` for `output_file` ensures proper isolation.

🔎 Proposed fix

```diff
-input_file <- "dummy_spectro_input.csv"
-output_file <- "spectro_output.csv"
+input_file <- "dummy_spectro_input.csv"  # Not actually read due to mock
+output_file <- tempfile(fileext = ".csv")
+on.exit({
+  if (file.exists(output_file)) file.remove(output_file)
+  reduce_file <- paste0("reduce_output_", basename(output_file))
+  if (file.exists(reduce_file)) file.remove(reduce_file)
+}, add = TRUE)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `tests/testthat/topN_preprocess_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_spectro_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_test_diann_output.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (7)
- DESCRIPTION
- NAMESPACE
- R/clean_DIANN.R
- R/converters.R
- man/bigDIANNtoMSstatsFormat.Rd
- tests/testthat/test-converters.R
- tests/testthat/test-diann_converter.R
🔇 Additional comments (6)
NAMESPACE (1)

5-5: LGTM!

The new export for `bigDIANNtoMSstatsFormat` is correctly added and aligns with the new converter functionality.

tests/testthat/test-diann_converter.R (1)
6-109: LGTM! Comprehensive test coverage.

The test suite provides excellent coverage of the DIANN converter functionality:
- cleanDIANNChunk test: Validates chunk processing, fragment filtering (H2O removal), and column mapping
- reduceBigDIANN test: Tests file-based chunked processing with multiple proteins and charge extraction
- bigDIANNtoMSstatsFormat test: End-to-end validation of feature selection logic with arrow backend
The tests correctly exercise key edge cases and validate expected outputs.
tests/testthat/test-converters.R (3)

1-4: LGTM on test setup.

The imports and context setup are appropriate for testing converter functions.

6-28: LGTM!

Good coverage of the annotation merge functionality. The test validates that `Run` and `Intensity` are preserved while `Condition` and `BioReplicate` are correctly joined.

92-95: Cleanup may miss the intermediate file with the `tempfile()` approach.

If you adopt `tempfile()` for `output_file`, the `paste0("reduce_output_", output_file)` pattern will break because `output_file` would be a full path like `/tmp/RtmpXXX/file123.csv`. Consider adjusting how the reduce output path is constructed, or use `basename()`. Also, consider wrapping cleanup in `on.exit()` as suggested for the previous test.

R/converters.R (1)
170-199: Verify `reduceBigDIANN` exists and parameter alignment.

The function implementation follows the established pattern from `bigSpectronauttoMSstatsFormat`. The function `reduceBigDIANN` is correctly defined in R/clean_DIANN.R and properly invoked.

Looking at the `MSstatsPreprocessBig` signature (R/converters.R:44-54), the parameter order is: `input_file, output_file_name, backend, max_feature_count, filter_unique_peptides, aggregate_psms, filter_few_obs, remove_annotation, calculateAnomalyScores, anomalyModelFeatures, connection`.

For `bigSpectronauttoMSstatsFormat` (R/converters.R:147-151):

```r
MSstatsPreprocessBig(
  paste0("reduce_output_", output_file_name), output_file_name, backend,
  max_feature_count, aggregate_psms, filter_few_obs, remove_annotation,
  calculateAnomalyScores, anomalyModelFeatures, connection)
```

This is missing `filter_unique_peptides` entirely, a bug in the existing function.

Your new function correctly includes all parameters in the proper order:

```r
MSstatsPreprocessBig(
  paste0("reduce_output_", output_file_name), output_file_name, backend,
  max_feature_count, filter_unique_peptides, aggregate_psms, filter_few_obs,
  remove_annotation, calculateAnomalyScores, anomalyModelFeatures, connection)
```
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@R/clean_DIANN.R`:
- Around line 2-3: The default quantificationColumn in reduceBigDIANN is
incorrect ("FragmentQuantCorrected") and should match bigDIANNtoMSstatsFormat's
column name ("Fragment.Quant.Corrected") so the column can be found; update the
default value of the quantificationColumn parameter in reduceBigDIANN to
"Fragment.Quant.Corrected" (or otherwise normalize dotted vs. non-dotted names
inside reduceBigDIANN to accept both), referencing the reduceBigDIANN function
and the bigDIANNtoMSstatsFormat expectation when making the change.
- Around line 35-58: The code unconditionally uses columns "Fragment.Info" and
the variable quantificationColumn after only using intersect() earlier, so add
guards to skip downstream processing when those columns are missing: check if
quantificationColumn exists in colnames(input) before calling
tidyr::separate_rows on it or converting input[[quantificationColumn]] <-
as.numeric(...), and check for "Fragment.Info" in colnames(input) before
separating rows or any parsing of fragment data; use the existing split_cols or
explicit logical checks (e.g., if (quantificationColumn %in% colnames(input)) {
... } and if ("Fragment.Info" %in% colnames(input)) { ... }) around the
tidyr::separate_rows, fragment processing, and numeric conversion to avoid
crashes when columns are absent.
- Around line 64-70: ProductCharge currently becomes NA when Fragment.Info
contains "/" but no digits follow because as.integer(stringr::str_extract(...))
yields NA; modify the ProductCharge computation (the dplyr::if_else branch that
uses stringr::str_extract on .data$Fragment.Info) to wrap the extracted value
with dplyr::coalesce (or equivalent) so that if str_extract returns NA it falls
back to "1" before coercion, ensuring the final value is 1L when parsing fails.
In `@R/converters.R`:
- Around line 184-193: The intermediate path construction using
paste0("reduce_output_", output_file_name) is unsafe when output_file_name
contains directories; update calls that create the reduced intermediate filename
(in reduceBigDIANN -> MSstatsPreprocessBig sequence and the analogous locations
in bigSpectronauttoMSstatsFormat) to build the path with
file.path(dirname(output_file_name), paste0("reduce_output_",
basename(output_file_name))) so the reduce_output_* file is created alongside
the intended output file regardless of directory components.
In `@tests/testthat/test-diann_converter.R`:
- Around line 78-108: The test uses a static output filename
("test_diann_output.csv") which can cause collisions; change the output_file
assignment to use tempfile(fileext = ".csv") and update all subsequent
references (the call to bigDIANNtoMSstatsFormat, the dplyr::collect on
converted, and the cleanup file.remove calls that use output_file and
paste0("reduce_output_", output_file)) to use that tempfile variable so the test
writes/cleans unique temp files; relevant symbols: output_file,
bigDIANNtoMSstatsFormat, converted, and the cleanup file.remove/paste0 calls.
♻️ Duplicate comments (2)
R/clean_DIANN.R (2)
7-13: Stop on unsupported extensions; avoid treating .xlsx as TSV.

`grepl("tsv|xls")` matches `.xlsx`, which is binary. The `";"` fallback can silently mis-parse files. Prefer anchored, case-insensitive extension checks and error for unsupported formats.

🐛 Proposed fix

```diff
-  if (grepl("csv", input_file)) {
+  if (grepl("\\.csv$", input_file, ignore.case = TRUE)) {
     delim = ","
-  } else if (grepl("tsv|xls", input_file)) {
+  } else if (grepl("\\.(tsv|txt)$", input_file, ignore.case = TRUE)) {
     delim = "\t"
   } else {
-    delim <- ";"
+    stop("Unsupported file extension. Please provide a .csv or .tsv/.txt file.")
   }
```
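The difference between substring and anchored matching is easy to demonstrate with an illustrative snippet:

```r
# Substring match: ".xlsx" contains "xls", so a binary Excel file
# would be (mis)classified as tab-separated text.
grepl("tsv|xls", "report.xlsx")                             # TRUE

# Anchored, case-insensitive extension check avoids that.
grepl("\\.(tsv|txt)$", "report.xlsx", ignore.case = TRUE)   # FALSE
grepl("\\.(tsv|txt)$", "report.TSV",  ignore.case = TRUE)   # TRUE
```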
60-63: Remove the ineffective FragmentIon regex.

`sub('\\^\\.\\*', ...)` matches the literal string `"^.*"` and is effectively a no-op. If no transformation is intended, assign the value directly.

🔧 Suggested change

```diff
-    FragmentIon = sub('\\^\\.\\*', '', .data$Fragment.Info),
+    FragmentIon = .data$Fragment.Info,
```
```r
# Extract product charge
ProductCharge = dplyr::if_else(
  grepl("/", .data$Fragment.Info),
  # Extract charge (number right after "/" in string), default to 1 if parsing fails
  as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
  1L
)
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -name "clean_DIANN.R" -type f
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 81
🏁 Script executed:
```shell
sed -n '64,70p' ./R/clean_DIANN.R
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 348
🏁 Script executed:
```shell
python3 << 'EOF'
# Let's verify the behavior of stringr::str_extract and as.integer in R
# by examining what happens when the regex pattern doesn't match
# The behavior should be:
# stringr::str_extract("something/", "(?<=/)[0-9]+") -> NA (no digits after /)
# as.integer(NA) -> NA (not 1L)
# This is a R behavior question, let me create a test to understand it better
print("Testing R string extraction and type conversion behavior:")
print("Case 1: Fragment.Info with / but no digits after it")
print("  stringr::str_extract returns: NA (pattern doesn't match)")
print("  as.integer(NA) returns: NA (not 1L)")
print("")
print("Case 2: Fragment.Info with / and digits after it")
print("  stringr::str_extract returns: the digits")
print("  as.integer(digits) returns: integer value")
print("")
print("Case 3: Fragment.Info without /")
print("  Current code returns: 1L (via the else branch of if_else)")
print("")
print("So the bug is in Case 1 - when / is present but no digits follow")
print("The function would return NA instead of 1L")
EOF
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 150
🏁 Script executed:
```shell
grep -r "ProductCharge" ./R --include="*.R" -B 2 -A 2
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 4903
🏁 Script executed:
```shell
sed -n '50,80p' ./R/clean_DIANN.R
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 1088
Ensure ProductCharge defaults to 1 when parsing fails.
The current code returns NA when a / is present in Fragment.Info but no digits follow (e.g., "y1/"). The regex pattern "(?<=/)[0-9]+" fails to match, causing stringr::str_extract() to return NA, and subsequently as.integer(NA) returns NA instead of the intended default of 1L. Use dplyr::coalesce() to handle this edge case:
🔧 Suggested change

```diff
-ProductCharge = dplyr::if_else(
-  grepl("/", .data$Fragment.Info),
-  # Extract charge (number right after "/" in string), default to 1 if parsing fails
-  as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
-  1L
-)
+ProductCharge = dplyr::if_else(
+  grepl("/", .data$Fragment.Info),
+  dplyr::coalesce(
+    as.integer(stringr::str_extract(.data$Fragment.Info, "(?<=/)[0-9]+")),
+    1L
+  ),
+  1L
+)
```

🤖 Prompt for AI Agents
In `@R/clean_DIANN.R` around lines 64 - 70, ProductCharge currently becomes NA
when Fragment.Info contains "/" but no digits follow because
as.integer(stringr::str_extract(...)) yields NA; modify the ProductCharge
computation (the dplyr::if_else branch that uses stringr::str_extract on
.data$Fragment.Info) to wrap the extracted value with dplyr::coalesce (or
equivalent) so that if str_extract returns NA it falls back to "1" before
coercion, ensuring the final value is 1L when parsing fails.
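The edge case described above can be reproduced directly in an interactive session (illustrative snippet):

```r
library(stringr)
library(dplyr)

# Digits after the slash: extraction succeeds.
str_extract("y7^1/2", "(?<=/)[0-9]+")   # "2"

# Slash present but no digits follow: extraction returns NA.
str_extract("y1/", "(?<=/)[0-9]+")      # NA

# coalesce() supplies the intended 1L default when extraction fails.
coalesce(as.integer(str_extract("y1/", "(?<=/)[0-9]+")), 1L)   # 1
```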
```r
test_that("bigDIANNtoMSstatsFormat works with arrow backend", {
  input_file <- tempfile(fileext = ".csv")
  output_file <- "test_diann_output.csv"

  # 4 features for one protein. Feature selection should pick the top 2.
  diann_data <- rbind(
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(1000, 1100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y1", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(500, 600), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y2", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(100, 100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y3", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(2000, 2100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y4", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001)
  )
  write.csv(diann_data, input_file, row.names = FALSE)

  converted <- bigDIANNtoMSstatsFormat(
    input_file = input_file,
    output_file_name = output_file,
    backend = "arrow",
    max_feature_count = 2
  )
  result <- dplyr::collect(converted)

  # Avg intensities: y1=1050, y2=550, y3=100, y4=2050.
  # Top 2 features are y4 and y1.
  expect_equal(nrow(result), 4) # 2 features * 2 runs
  expect_true(all(c("y1", "y4") %in% unique(result$FragmentIon)))
  expect_false(any(c("y2", "y3") %in% unique(result$FragmentIon)))

  # Cleanup
  file.remove(input_file)
  if (file.exists(output_file)) file.remove(output_file)
  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
```
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the actual test file to confirm the current state
```shell
cat -n tests/testthat/test-diann_converter.R | sed -n '70,115p'
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 2688
🏁 Script executed:
# Check if there's any test configuration for parallel execution
```shell
git ls-files | grep -E "(testthat|\.yml|\.yaml|\.json)" | head -20
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 257
🏁 Script executed:
# Look for similar test output file patterns in other test files
```shell
fd '\.R$' tests/testthat/ | xargs grep -l "tempfile\|file.remove" | head -10
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 134
🏁 Script executed:
# Check for test configuration files (especially for parallel execution)
```shell
fd -e yml -e yaml -e dcf | head -20
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 46
🏁 Script executed:
# Look at other tests to see if they use static filenames too
```shell
grep -n "output_file.*<-" tests/testthat/*.R
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 465
🏁 Script executed:
# Check the git history to see if these leftover files should have been cleaned
```shell
git ls-files tests/testthat/*.csv
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 186
🏁 Script executed:
# Look for how other tests in the project handle temporary output files
```shell
grep -B 2 -A 2 "tempfile.*output\|output.*tempfile" tests/testthat/*.R | head -40
```

Repository: Vitek-Lab/MSstatsBig
Length of output: 803
Use `tempfile()` instead of a static filename to prevent test collisions.

The other tests in this file (lines 7 and 45) correctly use `tempfile(fileext = ".csv")` for output files. This test should follow the same pattern to avoid potential collisions in parallel test runs or from previous test failures. Evidence of incomplete cleanup is visible in the repository with leftover CSV files (e.g., `topN_test_diann_output.csv`).
🔧 Suggested change

```diff
-output_file <- "test_diann_output.csv"
+output_file <- basename(tempfile(fileext = ".csv"))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```r
test_that("bigDIANNtoMSstatsFormat works with arrow backend", {
  input_file <- tempfile(fileext = ".csv")
  output_file <- basename(tempfile(fileext = ".csv"))
  # 4 features for one protein. Feature selection should pick the top 2.
  diann_data <- rbind(
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(1000, 1100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y1", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(500, 600), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y2", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(100, 100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y3", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001),
    data.frame(Run = c("r1", "r2"), Protein.Names = "P1", Stripped.Sequence = "PEPTIDE", Modified.Sequence = "PEPTIDE", Precursor.Charge = 2, Fragment.Quant.Corrected = c(2000, 2100), Q.Value = 0.001, Precursor.Mz = 500, Fragment.Info = "y4", Lib.Q.Value = 0.001, Lib.PG.Q.Value = 0.001)
  )
  write.csv(diann_data, input_file, row.names = FALSE)
  converted <- bigDIANNtoMSstatsFormat(
    input_file = input_file,
    output_file_name = output_file,
    backend = "arrow",
    max_feature_count = 2
  )
  result <- dplyr::collect(converted)
  # Avg intensities: y1=1050, y2=550, y3=100, y4=2050.
  # Top 2 features are y4 and y1.
  expect_equal(nrow(result), 4) # 2 features * 2 runs
  expect_true(all(c("y1", "y4") %in% unique(result$FragmentIon)))
  expect_false(any(c("y2", "y3") %in% unique(result$FragmentIon)))
  # Cleanup
  file.remove(input_file)
  if (file.exists(output_file)) file.remove(output_file)
  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
})
```
🤖 Prompt for AI Agents
In `@tests/testthat/test-diann_converter.R` around lines 78 - 108, The test uses a
static output filename ("test_diann_output.csv") which can cause collisions;
change the output_file assignment to use tempfile(fileext = ".csv") and update
all subsequent references (the call to bigDIANNtoMSstatsFormat, the
dplyr::collect on converted, and the cleanup file.remove calls that use
output_file and paste0("reduce_output_", output_file)) to use that tempfile
variable so the test writes/cleans unique temp files; relevant symbols:
output_file, bigDIANNtoMSstatsFormat, converted, and the cleanup
file.remove/paste0 calls.
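The pattern the prompt describes can be sketched for a generic testthat test; names here are placeholders, and the `reduce_output_` path construction is the repository's existing convention:

```r
test_that("converter output uses isolated temp files", {
  input_file  <- tempfile(fileext = ".csv")
  output_file <- tempfile(fileext = ".csv")
  # Guaranteed cleanup even if an expectation fails mid-test.
  on.exit(unlink(c(input_file, output_file,
                   file.path(dirname(output_file),
                             paste0("reduce_output_", basename(output_file))))),
          add = TRUE)
  # ... write input, call the converter, assert on the collected result ...
})
```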
Think this concern is valid
@tonywu1999 Thank you for confirming! Using tempfile() will ensure the tests run cleanly without potential file collisions, especially in parallel test scenarios or CI environments.
…t.ion is NA, ProductCharge is set to 1
…mn is not present
```r
if (!"Precursor.Mz" %in% colnames(input)) {
  input <- dplyr::mutate(input, Precursor.Mz = NA)
}
if (!"Fragment.Info" %in% colnames(input)) {
```
https://github.com/Vitek-Lab/MSstatsConvert/blob/devel/R/clean_DIANN.R#L115-L118 : Instead, it should be like Frag1, Frag2, etc.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@R/clean_DIANN.R`:
- Around line 241-249: The .filterDIANNByQValues function fails if expected
q-value columns are missing; update it to defensive-check for required columns
before filtering: inside .filterDIANNByQValues, detect whether MBR is TRUE and
then verify presence of "LibQValue" and "LibPGQValue", or if MBR is FALSE verify
"GlobalQValue" and "GlobalPGQValue"; if missing, either create those columns
filled with NA (so downstream filtering behaves predictably) or return a clear
stop()/warning indicating the missing column names and skip the MBR-specific
filter, then apply the DetectionQValue filter as currently done. Use the
function name .filterDIANNByQValues and the parameter MBR and the column names
LibQValue, LibPGQValue, GlobalQValue, GlobalPGQValue in your checks to locate
where to add this logic.
```r
.filterDIANNByQValues <- function(input, MBR, global_qvalue_cutoff, qvalue_cutoff, pg_qvalue_cutoff) {
  input <- dplyr::filter(input, DetectionQValue < global_qvalue_cutoff)

  if (MBR) {
    dplyr::filter(input, LibPGQValue < pg_qvalue_cutoff & LibQValue < qvalue_cutoff)
  } else {
    dplyr::filter(input, GlobalPGQValue < pg_qvalue_cutoff & GlobalQValue < qvalue_cutoff)
  }
}
```
Missing Q-value columns will cause runtime errors.
When `MBR=TRUE`, the function expects `LibQValue` and `LibPGQValue` columns (line 245). When `MBR=FALSE`, it expects `GlobalQValue` and `GlobalPGQValue` (line 247). If these columns are absent from the input (e.g., older DIANN versions or different output settings), the function will fail with an unclear error.
🛡️ Suggested defensive check

```diff
 .filterDIANNByQValues <- function(input, MBR, global_qvalue_cutoff, qvalue_cutoff, pg_qvalue_cutoff) {
   input <- dplyr::filter(input, DetectionQValue < global_qvalue_cutoff)
   if (MBR) {
+    if (!all(c("LibPGQValue", "LibQValue") %in% colnames(input))) {
+      warning("MBR Q-value columns not found, skipping MBR filtering")
+      return(input)
+    }
     dplyr::filter(input, LibPGQValue < pg_qvalue_cutoff & LibQValue < qvalue_cutoff)
   } else {
+    if (!all(c("GlobalPGQValue", "GlobalQValue") %in% colnames(input))) {
+      warning("Global Q-value columns not found, skipping global filtering")
+      return(input)
+    }
     dplyr::filter(input, GlobalPGQValue < pg_qvalue_cutoff & GlobalQValue < qvalue_cutoff)
   }
 }
```

🤖 Prompt for AI Agents
In `@R/clean_DIANN.R` around lines 241 - 249, The .filterDIANNByQValues function
fails if expected q-value columns are missing; update it to defensive-check for
required columns before filtering: inside .filterDIANNByQValues, detect whether
MBR is TRUE and then verify presence of "LibQValue" and "LibPGQValue", or if MBR
is FALSE verify "GlobalQValue" and "GlobalPGQValue"; if missing, either create
those columns filled with NA (so downstream filtering behaves predictably) or
return a clear stop()/warning indicating the missing column names and skip the
MBR-specific filter, then apply the DetectionQValue filter as currently done.
Use the function name .filterDIANNByQValues and the parameter MBR and the column
names LibQValue, LibPGQValue, GlobalQValue, GlobalPGQValue in your checks to
locate where to add this logic.
WIP: Added clean_DIANN, the DIANN converter, and some basic tests (more comprehensive tests are still needed)
Summary by CodeRabbit
- New Features
- Tests
- Documentation
- Chores