Diann turnover#128
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 13 minutes and 12 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThis PR adds SILAC-based isotope labeling detection across DIANN and Spectronaut converters. It introduces a new Changes
Sequence DiagramsequenceDiagram
participant DIANNConverter as DIANNtoMSstatsFormat()
participant CleanDIANN as .cleanRawDIANN()
participant AssignLabel as .assignDIANNIsotopeLabelType()
participant Classify as .classifyIsotopeLabelType()
DIANNConverter->>CleanDIANN: msstats_object + labeledAminoAcids
CleanDIANN->>CleanDIANN: Detect has_channel (Channel exists + labeledAminoAcids non-null)
CleanDIANN->>AssignLabel: dn_input + labeledAminoAcids + has_channel
alt has_channel = TRUE
AssignLabel->>AssignLabel: Build regex from Channel values (H/L/NA)
AssignLabel->>Classify: PeptideSequence + channel_regex
else has_channel = FALSE & labeledAminoAcids non-NULL
AssignLabel->>AssignLabel: Build heavy/light regex from labeledAminoAcids
AssignLabel->>Classify: PeptideSequence + heavy_regex + light_regex
else labeledAminoAcids = NULL
AssignLabel->>AssignLabel: Return unchanged
end
Classify->>Classify: Match regex patterns against PeptideSequence
Classify->>Classify: Assign IsotopeLabelType (H/L/NA)
Classify-->>AssignLabel: Updated data with IsotopeLabelType
AssignLabel->>AssignLabel: Remove Channel column (if applicable)
AssignLabel-->>CleanDIANN: Cleaned data with IsotopeLabelType
CleanDIANN-->>DIANNConverter: Processed DIANN input
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (5)
R/clean_DIANN.R (1)
262-265: Defensively regex-escapelabeledAminoAcids.
paste0(labeledAminoAcids, collapse = "|")is embedded directly into three regexes. Today these are single-letter amino-acid codes so it's safe, but a user passing something like"K "or a typo with regex metacharacters (e.g..,() would silently produce a malformed pattern. A quick defensive step: validate inputs are[A-Z]single characters, or run them through a regex escaper.🛡️ Proposed validation
if (has_channel) { ... } else { + if (!all(grepl("^[A-Z]$", labeledAminoAcids))) { + stop("labeledAminoAcids must be a character vector of single uppercase ", + "amino-acid letters (A-Z), e.g. c(\"K\") or c(\"K\", \"R\").") + } aa_pattern <- paste0(labeledAminoAcids, collapse = "|")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/clean_DIANN.R` around lines 262 - 265, Validate or escape labeledAminoAcids before building aa_pattern and the derived regexes (aa_pattern, heavy_regex, light_regex, strip_regex): either enforce that each element of labeledAminoAcids is a single uppercase letter matching /^[A-Z]$/ (throw an informative error otherwise) or run each element through a regex-escaping routine and then paste0(..., collapse="|"); then rebuild aa_pattern and the three regexes using the safe values so malformed patterns cannot be produced.R/converters_DIANNtoMSstatsFormat.R (1)
122-125: Minor: deduplicate repeated column check.
"IsotopeLabelType" %in% colnames(input)is evaluated twice. Cache it once for readability and to guarantee both branches stay in sync if the check is ever adjusted.♻️ Proposed refactor
- preprocess_feature_columns = if ("IsotopeLabelType" %in% colnames(input)) - c(feature_columns, "IsotopeLabelType") else feature_columns - fill_isotope_label_type = if ("IsotopeLabelType" %in% colnames(input)) - list() else list("IsotopeLabelType" = "Light") + has_isotope_label = "IsotopeLabelType" %in% colnames(input) + preprocess_feature_columns = if (has_isotope_label) + c(feature_columns, "IsotopeLabelType") else feature_columns + fill_isotope_label_type = if (has_isotope_label) + list() else list("IsotopeLabelType" = "Light")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/converters_DIANNtoMSstatsFormat.R` around lines 122 - 125, Compute a single boolean flag (e.g., has_isotope_label_type) by evaluating "IsotopeLabelType" %in% colnames(input) once, then use that flag in both preprocess_feature_columns and fill_isotope_label_type so the check isn't duplicated and both branches remain consistent; update references to use the new flag in the assignments to preprocess_feature_columns and fill_isotope_label_type.R/utils_clean_features.R (1)
339-366: LGTM — shared classifier is a clean abstraction.Three-mode dispatch is easy to read and the Spectronaut-mode cleanup of the temporary
StrippedSequencecolumn is properly handled. Nit: the.classifyIsotopeLabelTypecontract overwrites any pre-existingIsotopeLabelTypecolumn ondt; worth noting in the docstring if that ever matters to callers.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/utils_clean_features.R` around lines 339 - 366, The .classifyIsotopeLabelType function unconditionally assigns/overwrites the IsotopeLabelType column on the provided data.table dt; update the function's docstring (for .classifyIsotopeLabelType) to explicitly state that it will create or overwrite an existing IsotopeLabelType column on dt and that callers should copy dt first if they need to preserve the original column; reference the dt parameter and the IsotopeLabelType column in the docstring so callers are aware of the side effect.inst/tinytest/test_clean_DIANN.R (1)
58-119: Good coverage of the new helpers.The three modes of
.classifyIsotopeLabelType(H+L, H-only, and multi-AA) plus the Channel / NULL paths of.assignDIANNIsotopeLabelTypeare well exercised. Consider also adding a test for the non-Channel SILAC-suffix path of.assignDIANNIsotopeLabelType(i.e.has_channel = FALSE,labeledAminoAcids = c("K"), peptide sequences containing(SILAC-K-H)/(SILAC-K-L)) since that branch is currently only covered end-to-end.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inst/tinytest/test_clean_DIANN.R` around lines 58 - 119, Add a unit test exercising the non-Channel code path of .assignDIANNIsotopeLabelType: create a data.table with PeptideSequence values containing SILAC suffixes like "PEPTIDEK(SILAC-K-H)", "PEPTIDEK(SILAC-K-L)", and a non-labeled peptide; call MSstatsConvert:::.assignDIANNIsotopeLabelType(dt, labeledAminoAcids = c("K"), has_channel = FALSE) and assert that IsotopeLabelType is c("H","L",NA_character_) (and that "Channel" is not present in colnames). This targets the branch where labeling is determined by suffixes rather than a Channel column.inst/tinytest/test_converters_DIANNtoMSstatsFormat.R (1)
1-179: Solid end-to-end coverage; a couple of brittleness concerns.The exact-count assertions (e.g.
nrow == 1020,H == 350,L == 350,NA == 320,nrow == 20400, etc.) pin the whole preprocessing chain including q-value filtering, feature filtering, and balanced-design fill. If any upstream default changes (q-value cutoffs,removeFewMeasurementssemantics, balanced-design behavior) these tests will fail en masse without actually indicating a SILAC regression. Consider complementing with looser invariant checks (e.g.expect_true(label_counts["H"] > 0 && label_counts["L"] > 0), ratios, or per-peptide spot checks) alongside the exact counts, so failures are easier to triage.Also,
expect_true("0 day" %in% output_pxd$Condition)etc. depend on the annotation fixture — fine, but a brief comment noting which rows in the annotation CSV produce these conditions would help future maintenance.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@inst/tinytest/test_converters_DIANNtoMSstatsFormat.R` around lines 1 - 179, The tests use brittle exact-count assertions (e.g. expect_equal(nrow(output_silac), 1020), exact label_counts checks for "H"/"L"/NA, and expect_equal(nrow(output_pxd), 20400)) and a bare expect_true for conditions (expect_true("0 day" %in% output_pxd$Condition)); change these to more robust assertions: replace exact-counts with existence/positivity or ratio checks (e.g. expect_true(label_counts["H"] > 0 && label_counts["L"] > 0), expect_gt(nrow(output_silac), 0)), add a small set of deterministic per-peptide spot checks (keep the current heavy/light fragment checks like heavy_y8/light_y5) to validate correctness, and keep but document the condition checks by adding a short comment linking the annotation CSV rows that produce "0 day"/"8 days"/"32 days" before the expect_true lines so future maintainers know which fixture rows drive those expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/clean_DIANN.R`:
- Around line 247-272: The function .assignDIANNIsotopeLabelType currently can
produce an all-NA IsotopeLabelType silently; add a guard after the
branch-specific assignment to detect if all(IsotopeLabelType %in%
c(NA_character_)) (or all(is.na(IsotopeLabelType))) and emit a descriptive
warning (or stop) that includes the provided labeledAminoAcids and whether
has_channel was used; ensure you place this check after the Channel branch
assignment block and after the call to .classifyIsotopeLabelType (and the
PeptideSequence gsub) so it covers both paths, and reference
IsotopeLabelType/Channel/labeledAminoAcids in the message to help users debug
mis-specified labels.
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Around line 22-32: The docstring claims labeledAminoAcids is only used for
documentation/validation when a Channel column exists, but the code uses
labeledAminoAcids as an opt-in flag (has_channel <- !is.null(labeledAminoAcids)
&& "Channel" %in% colnames(dn_input)) and performs no validation; either
implement validation or update docs—pick one: (A) Implement validation: in the
Channel branch (where has_channel is true, e.g., in R/clean_DIANN.R and
converters_DIANNtoMSstatsFormat.R) iterate rows per Channel/H/L groups and
verify that ModifiedSequence for rows labeled H contain at least one of the
single-letter codes in labeledAminoAcids and rows labeled L do not contain those
labels (or match expected pattern), raising a warning/error or annotating
mismatches; or (B) Update the roxygen comment in
converters_DIANNtoMSstatsFormat.R to state that labeledAminoAcids only enables
the Channel branch and is not used to validate ModifiedSequence when Channel is
present, so users are not misled. Ensure references to labeledAminoAcids,
Channel, ModifiedSequence, IsotopeLabelType, and has_channel are consistent.
---
Nitpick comments:
In `@inst/tinytest/test_clean_DIANN.R`:
- Around line 58-119: Add a unit test exercising the non-Channel code path of
.assignDIANNIsotopeLabelType: create a data.table with PeptideSequence values
containing SILAC suffixes like "PEPTIDEK(SILAC-K-H)", "PEPTIDEK(SILAC-K-L)", and
a non-labeled peptide; call MSstatsConvert:::.assignDIANNIsotopeLabelType(dt,
labeledAminoAcids = c("K"), has_channel = FALSE) and assert that
IsotopeLabelType is c("H","L",NA_character_) (and that "Channel" is not present
in colnames). This targets the branch where labeling is determined by suffixes
rather than a Channel column.
In `@inst/tinytest/test_converters_DIANNtoMSstatsFormat.R`:
- Around line 1-179: The tests use brittle exact-count assertions (e.g.
expect_equal(nrow(output_silac), 1020), exact label_counts checks for
"H"/"L"/NA, and expect_equal(nrow(output_pxd), 20400)) and a bare expect_true
for conditions (expect_true("0 day" %in% output_pxd$Condition)); change these to
more robust assertions: replace exact-counts with existence/positivity or ratio
checks (e.g. expect_true(label_counts["H"] > 0 && label_counts["L"] > 0),
expect_gt(nrow(output_silac), 0)), add a small set of deterministic per-peptide
spot checks (keep the current heavy/light fragment checks like
heavy_y8/light_y5) to validate correctness, and keep but document the condition
checks by adding a short comment linking the annotation CSV rows that produce "0
day"/"8 days"/"32 days" before the expect_true lines so future maintainers know
which fixture rows drive those expectations.
In `@R/clean_DIANN.R`:
- Around line 262-265: Validate or escape labeledAminoAcids before building
aa_pattern and the derived regexes (aa_pattern, heavy_regex, light_regex,
strip_regex): either enforce that each element of labeledAminoAcids is a single
uppercase letter matching /^[A-Z]$/ (throw an informative error otherwise) or
run each element through a regex-escaping routine and then paste0(...,
collapse="|"); then rebuild aa_pattern and the three regexes using the safe
values so malformed patterns cannot be produced.
In `@R/converters_DIANNtoMSstatsFormat.R`:
- Around line 122-125: Compute a single boolean flag (e.g.,
has_isotope_label_type) by evaluating "IsotopeLabelType" %in% colnames(input)
once, then use that flag in both preprocess_feature_columns and
fill_isotope_label_type so the check isn't duplicated and both branches remain
consistent; update references to use the new flag in the assignments to
preprocess_feature_columns and fill_isotope_label_type.
In `@R/utils_clean_features.R`:
- Around line 339-366: The .classifyIsotopeLabelType function unconditionally
assigns/overwrites the IsotopeLabelType column on the provided data.table dt;
update the function's docstring (for .classifyIsotopeLabelType) to explicitly
state that it will create or overwrite an existing IsotopeLabelType column on dt
and that callers should copy dt first if they need to preserve the original
column; reference the dt parameter and the IsotopeLabelType column in the
docstring so callers are aware of the side effect.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7adaa02d-202a-474a-bf18-2c1651c98113
⛔ Files ignored due to path filters (4)
inst/tinytest/raw_data/DIANN/PXD055942.parquetis excluded by!**/*.parquetinst/tinytest/raw_data/DIANN/annotation_PXD055942.csvis excluded by!**/*.csvinst/tinytest/raw_data/DIANN/annotation_silac.csvis excluded by!**/*.csvinst/tinytest/raw_data/DIANN/diann_input_silac.csvis excluded by!**/*.csv
📒 Files selected for processing (6)
R/clean_DIANN.RR/clean_Spectronaut.RR/converters_DIANNtoMSstatsFormat.RR/utils_clean_features.Rinst/tinytest/test_clean_DIANN.Rinst/tinytest/test_converters_DIANNtoMSstatsFormat.R
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Motivation and Context
This PR implements protein turnover tracking support for the DIANN converter by adding labeled amino acid (SILAC) isotope classification functionality. The PR introduces infrastructure to detect and classify peptides as "Heavy" or "Light" based on either explicit channel information (in DIANN 2.0+) or by identifying SILAC-tagged amino acids in peptide sequences. This enables MSstatsConvert to process time-course proteomics experiments where protein turnover is measured through stable isotope labeling.
Changes
Core Functions Modified:
.cleanRawDIANN()- Extended with new parameterlabeledAminoAcids = NULLto enable isotope label classification during raw data cleaninghas_channelflag based on whether theChannelcolumn exists in input andlabeledAminoAcidsis non-NULLhas_channelto column selection logic.assignDIANNIsotopeLabelType()after column renaming.cleanDIANNSelectRequiredColumns()- Updated to accept newhas_channelparameter (defaultFALSE) and conditionally include theChannelcolumn in required columnsDIANNtoMSstatsFormat()- Added new exported parameterlabeledAminoAcids = NULLand passes it through toMSstatsClean()IsotopeLabelTypeonly when it already exists in imported inputcolumns_to_fillsoIsotopeLabelTypeis force-filled with"Light"only when input lacks the columnNew Internal Helper Functions:
.assignDIANNIsotopeLabelType(dn_input, labeledAminoAcids, has_channel)- Implements isotope label detection logic for DIANN datalabeledAminoAcidsis NULLhas_channel = TRUE: derivesIsotopeLabelTypefromChannelvalues ("H"→"H","L"→"L", otherwiseNA), then removes theChannelcolumnhas_channel = FALSE: derivesIsotopeLabelTypeby classifying SILAC-tagged entries based on regexes built fromlabeledAminoAcids, then strips the SILAC suffix fromPeptideSequence.classifyIsotopeLabelType(dt, heavy_regex, light_regex = NULL, labeled_aa_regex = NULL)- Unified regex-based isotope classification helper supporting three modes:labeled_aa_regexnon-NULL): Creates temporaryStrippedSequenceby removing bracketed substrings, assigns"H"whenheavy_regexmatches,"L"whenlabeled_aa_regexmatchesStrippedSequence, elseNAlight_regexnon-NULL,labeled_aa_regexNULL): Assigns"H"whenheavy_regexmatches,"L"whenlight_regexmatches, elseNA"H"whenheavy_regexmatches, elseNAOther Changes:
.assignSpectronautIsotopeLabelType()- Refactored to delegate isotope classification to.classifyIsotopeLabelType(), removing local temporary assignments and improving code modularityUnit Tests
New test coverage added in
inst/tinytest/test_clean_DIANN.R:Tests for
.classifyIsotopeLabelType():"H"/"L"and non-matching sequences toNAKandR) for correct heavy vs light classificationlight_regexis not provided, ensuring sequences matching heavy are"H"and corresponding light sequences returnNATests for
.assignDIANNIsotopeLabelType():has_channel = TRUE, including correct"H"/"L"/NAmapping andChannelcolumn removallabeledAminoAcids = NULL, noIsotopeLabelTypecolumn is added and row count remains unchangedNew end-to-end test coverage in
inst/tinytest/test_converters_DIANNtoMSstatsFormat.R:SILAC CSV input test using
labeledAminoAcids = c("K"):IsotopeLabelTypeclassification counts (H/L/NA with no"Light")NAlabeling for unlabeled peptidesFragmentIonNADIANN 2.0 Channel-based SILAC parquet test (PXD055942):
quantificationColumn = "auto"behavior via Channel-driven H/L/NA classification(SILAC)text inPeptideSequenceNAlabeling for unlabeled peptides and presence of specific timepointConditionvaluesExtended existing DIANN tests to assert
IsotopeLabelTypeis filled with"Light"whenlabeledAminoAcidsis default/NULLCoding Guidelines
No significant coding guideline violations detected. The changes follow the existing codebase patterns:
..assignSpectronautIsotopeLabelType()improves modularity without breaking changes