Feature anomaly #6
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough
Added optional anomaly-scoring parameters and anomaly feature propagation across the Spectronaut and MSstats preprocessing paths, increased the default max_feature_count from 20 to 100 in several converters, switched some filtering to non-destructive masking, and adjusted PSM aggregation/grouping logic.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
participant User
participant Converter as bigSpectronauttoMSstatsFormat
participant Cleaner as reduceBigSpectronaut / cleanSpectronautChunk
participant Preproc as MSstatsPreprocessBigArrow
note over Converter,Cleaner #DDEEFF: New params: intensity, calculateAnomalyScores, anomalyModelFeatures
User->>Converter: call(..., intensity, calculateAnomalyScores, anomalyModelFeatures)
Converter->>Cleaner: reduceBigSpectronaut(..., intensity, calculateAnomalyScores, anomalyModelFeatures)
Cleaner->>Cleaner: read cols (include intensity + anomalyModelFeatures if requested)
Cleaner->>Preproc: write cleaned chunk file
Preproc->>Preproc: MSstatsPreprocessBigArrow(..., calculateAnomalyScores, anomalyModelFeatures)
Preproc-->>User: msstats_data (with anomaly feature columns carried through when enabled)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Pre-merge checks (2 passed, 1 inconclusive)

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 8
🧹 Nitpick comments (3)
R/backends.R (2)
131-143: Consider using the dplyr:: namespace prefix consistently

The code inconsistently uses namespace prefixes. Lines 135-143 use unprefixed functions like group_by, summarise, and inner_join, while other parts of the file use dplyr:: prefixes.

-  max_per_group <- input %>%
-    group_by(across(all_of(group_cols))) %>%
-    summarise(max_intensity = max(Intensity, na.rm = TRUE),
-              .groups = "drop")
-
-  input <- input %>%
-    inner_join(max_per_group, by = group_cols) %>%
-    filter(Intensity == max_intensity) %>%
-    select(-max_intensity)
+  max_per_group <- input %>%
+    dplyr::group_by(dplyr::across(dplyr::all_of(group_cols))) %>%
+    dplyr::summarise(max_intensity = max(Intensity, na.rm = TRUE),
+                     .groups = "drop")
+
+  input <- input %>%
+    dplyr::inner_join(max_per_group, by = group_cols) %>%
+    dplyr::filter(Intensity == max_intensity) %>%
+    dplyr::select(-max_intensity)
147-158: Feature creation duplicates code from lines 99-100

The Feature column creation is duplicated. Consider extracting it into a helper function to follow DRY principles.

+createFeatureColumn <- function(input) {
+  dplyr::mutate(input,
+                Feature = paste(PeptideSequence, PrecursorCharge,
+                                FragmentIon, ProductCharge, sep = "_"))
+}
+
 MSstatsPreprocessBigArrow <- function(input_file, output_file_name,
                                       max_feature_count = 20,
                                       filter_unique_peptides = FALSE,
                                       aggregate_psms = FALSE,
                                       filter_few_obs = FALSE,
                                       calculateAnomalyScores = FALSE,
                                       anomalyModelFeatures = c()) {
   input <- arrow::open_dataset(input_file, format = "csv")
-  input <- dplyr::mutate(input,
-                         Feature = paste(PeptideSequence, PrecursorCharge,
-                                         FragmentIon, ProductCharge, sep = "_"))
+  input <- createFeatureColumn(input)
   # ... rest of the code ...
   if (filter_few_obs) {
-    input <- dplyr::mutate(input,
-                           Feature = paste(PeptideSequence, PrecursorCharge,
-                                           FragmentIon, ProductCharge, sep = "_"))
+    input <- createFeatureColumn(input)

R/clean_spectronaut.R (1)
105-106: Inconsistent assignment operator usage

Line 106 uses := for assignment within mutate, which is uncommon in dplyr. The standard approach is to use =.

-  input <- dplyr::mutate(input, IsotopeLabelType := dplyr::if_else(IsLabeled, "H", "L"))
+  input <- dplyr::mutate(input, IsotopeLabelType = dplyr::if_else(IsLabeled, "H", "L"))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
R/backends.R (3 hunks)
R/clean_spectronaut.R (5 hunks)
R/converters.R (3 hunks)
🔇 Additional comments (3)
R/clean_spectronaut.R (2)
49-51: Verify standardized column name uniqueness
In R/clean_spectronaut.R (lines 49–51 and 62–64), anomalyModelFeatures are appended to all_cols and then passed through MSstatsConvert:::.standardizeColnames(), which is an internal helper in the MSstatsConvert package. Confirm that this function never produces duplicate or conflicting names; if it can, wrap its output in make.unique() or add an explicit collision check to prevent downstream errors.
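A minimal sketch of the suggested collision guard. The standardize helper below is a hypothetical stand-in for MSstatsConvert:::.standardizeColnames() (whose real behavior is not shown in this review), and the column names are made up:

```r
# Hypothetical stand-in for the internal standardizer: lowercase, strip dots/spaces.
standardize <- function(cols) tolower(gsub("[. ]", "", cols))

all_cols <- c("F.PeakWidth", "F PeakWidth", "ProteinName")  # made-up inputs
std_cols <- standardize(all_cols)  # two inputs collapse to the same name

# Explicit collision check before passing the names downstream.
if (anyDuplicated(std_cols) > 0) {
  warning("Standardized column names collide; applying make.unique()")
  std_cols <- make.unique(std_cols)
}
```

After the guard, the colliding pair is disambiguated (e.g. "fpeakwidth" and "fpeakwidth.1"), so downstream joins and selects cannot silently match the wrong column.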
84-86: Ensure downstream Intensity computations handle NA values
Only two aggregations, max(Intensity, na.rm = TRUE) in R/backends.R (lines 104 and 137), explicitly handle NA; verify that all other downstream uses of Intensity (e.g. plots, statistical summaries) correctly account for NA or include na.rm to avoid unintended behavior.
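A toy illustration of the NA propagation this comment warns about (the intensity values are invented):

```r
# NA poisons base summaries unless na.rm = TRUE is passed explicitly.
intensity <- c(1200.5, NA, 980.3)

max_naive <- max(intensity)                # NA: one missing value hides the maximum
max_safe  <- max(intensity, na.rm = TRUE)  # 1200.5
mean_safe <- mean(intensity, na.rm = TRUE)
```

Any downstream summary of Intensity that omits na.rm returns NA as soon as a single masked value is present, which is easy to miss in plots and tables.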
144-150: Good practice: Explicit return statement addedThe function now explicitly returns
msstats_data instead of relying on an implicit return, which improves code clarity.
calculateAnomalyScores = FALSE,
anomalyModelFeatures = c()) {
💡 Verification agent
🧩 Analysis chain
New anomaly parameters are accepted but never used
The function accepts calculateAnomalyScores and anomalyModelFeatures parameters but doesn't utilize them anywhere in the implementation. These parameters should either be used to modify the processing logic or passed through to downstream functions.
Also applies to: 162-162
🏁 Script executed:
#!/bin/bash
# Check if these parameters are supposed to be used elsewhere in the pipeline
rg -n "calculateAnomalyScores|anomalyModelFeatures" R/ --type r

Length of output: 1921
Forward anomaly parameters or remove them
The parameters calculateAnomalyScores and anomalyModelFeatures declared in R/backends.R (lines 94–95) are never used or passed to downstream conversion/cleaning functions. Either propagate them into the appropriate function calls (e.g. to clean_spectronaut() or converters) or remove them.
🤖 Prompt for AI Agents
In R/backends.R around lines 94 to 95, the parameters calculateAnomalyScores and
anomalyModelFeatures are declared but never used or forwarded; either propagate
them into downstream calls that perform conversion/cleaning (e.g., pass them as
additional arguments into clean_spectronaut() and any converter functions that
perform anomaly calculations) or remove them from the function signature if
anomaly processing is not supported. If propagating, add the parameters to the
call sites (update function signatures and calls for clean_spectronaut() and
relevant converters to accept calculateAnomalyScores and anomalyModelFeatures,
and forward the values) and ensure any internal functions that need them
implement the logic; if removing, delete both parameters from this function
signature and remove any related references or documentation.
💡 Verification agent
❓ Verification inconclusive
Parameter documentation missing for new anomaly-related parameters
The new parameters calculateAnomalyScores and anomalyModelFeatures have been added to the function signature but lack documentation. Since this function inherits documentation from MSstatsPreprocessBig, ensure that the parent function's documentation is updated to describe these parameters.
🏁 Script executed:
#!/bin/bash
# Check if MSstatsPreprocessBig documentation includes the new parameters
rg -A10 -B5 "^#' @param calculateAnomalyScores" R/
rg -A10 -B5 "^#' @param anomalyModelFeatures" R/

Length of output: 0
Add missing documentation for anomaly parameters
Add roxygen @param entries for calculateAnomalyScores and anomalyModelFeatures to the MSstatsPreprocessBig documentation to describe their types, defaults, and purpose.
🤖 Prompt for AI Agents
In R/backends.R around lines 94 to 95, the function signature exposes parameters
calculateAnomalyScores and anomalyModelFeatures but the roxygen documentation
for MSstatsPreprocessBig is missing @param entries for these; add two @param
tags: one for calculateAnomalyScores describing it as logical, default FALSE,
and controlling whether anomaly scores should be computed, and one for
anomalyModelFeatures describing it as a character vector (or list) of feature
names used by the anomaly model with default c(), including a brief note on
expected values and effect when left empty; place these tags in the function's
roxygen block so they appear in generated help.
@@ -1,9 +1,12 @@
 #' @keywords internal
 reduceBigSpectronaut <- function(input_file, output_path,
                                  intensity="F.NormalizedPeakArea",
🛠️ Refactor suggestion
Consider validating intensity column parameter
The intensity parameter allows specifying any column name, but there's no validation to ensure the column exists or contains numeric data. Consider adding validation early in the function.
reduceBigSpectronaut <- function(input_file, output_path,
intensity="F.NormalizedPeakArea",
filter_by_excluded = FALSE,
filter_by_identified = FALSE,
filter_by_qvalue = TRUE,
qvalue_cutoff = 0.01,
calculateAnomalyScores=FALSE,
anomalyModelFeatures=c()) {
+ # Validate parameters
+ if (!is.character(intensity) || length(intensity) != 1) {
+ stop("intensity must be a single character string")
+ }
+ if (calculateAnomalyScores && length(anomalyModelFeatures) == 0) {
+ warning("calculateAnomalyScores is TRUE but no anomalyModelFeatures provided")
+ }
+
  if (grepl("csv", input_file)) {

Also applies to: 8-9
🤖 Prompt for AI Agents
In R/clean_spectronaut.R around lines 3 and also lines 8-9, the intensity
parameter is accepted as an arbitrary column name but not validated; add an
early check that the specified intensity column exists in the input data (stop
with a clear message if missing), confirm the column is of numeric type (or
coerce to numeric and fail with a descriptive error if coercion yields all NAs),
and ensure NA handling is defined (e.g., remove or warn about non-numeric
entries). Implement these validations at the start of the function before any
downstream operations so callers get immediate, informative errors when they
pass an invalid column name or non-numeric data.
input <- dplyr::mutate(
  input,
  Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))
input <- dplyr::mutate(
  input,
  Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))
Potential logic issue with double filtering by q-values

The q-value filtering applies two separate mutate calls that each set Intensity to NA. Applied sequentially, the net effect is that Intensity survives only when both q-values pass the cutoff, but expressing this as two chained masks obscures the intent and risks divergence if one call is later edited in isolation. This should be combined into a single condition.
if (filter_by_qvalue) {
input <- dplyr::mutate(
input,
- Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff, Intensity, NA_real_))
- input <- dplyr::mutate(
- input,
- Intensity = dplyr::if_else(PGQvalue < qvalue_cutoff, Intensity, NA_real_))
+ Intensity = dplyr::if_else(EGQvalue < qvalue_cutoff & PGQvalue < qvalue_cutoff,
+ Intensity, NA_real_))
}
🤖 Prompt for AI Agents
In R/clean_spectronaut.R around lines 95 to 100, two sequential mutate calls
separately filter Intensity by EGQvalue and then by PGQvalue which allows the
second call to overwrite the first; replace them with a single mutate that
preserves Intensity only when both q-value checks pass (e.g. combine conditions
with a logical AND) and otherwise set NA_real_, keeping the same dplyr::if_else
pattern and NA_real_ type.
calculateAnomalyScores = FALSE,
anomalyModelFeatures = c(),
Missing documentation for new parameters
The new parameters calculateAnomalyScores and anomalyModelFeatures lack documentation. Add proper roxygen2 documentation for these parameters.
Add the following documentation before the function:
#' @param calculateAnomalyScores logical; if TRUE, retains peak quality metrics for anomaly detection. Default is FALSE.
#' @param anomalyModelFeatures character vector; column names of features to be used in the anomaly detection model. Default is an empty vector.

🤖 Prompt for AI Agents
In R/converters.R around lines 50 to 51, the new parameters
calculateAnomalyScores and anomalyModelFeatures are missing roxygen2
documentation; add two @param entries immediately before the function
definition: one documenting calculateAnomalyScores as a logical with default
FALSE and behavior to retain peak quality metrics for anomaly detection, and
another documenting anomalyModelFeatures as a character vector of column names
used by the anomaly detection model with default empty vector; ensure wording
follows existing roxygen style and punctuation and include default values in the
description.
                       calculateAnomalyScores,
                       anomalyModelFeatures)
  } else if (backend == "sparklyr") {
    MSstatsPreprocessBigSparklyr(connection, input, output_file_name,
Wrong variable in sparklyr backend call
The sparklyr backend call uses input instead of input_file as the second argument.
} else if (backend == "sparklyr") {
- MSstatsPreprocessBigSparklyr(connection, input, output_file_name,
+ MSstatsPreprocessBigSparklyr(connection, input_file, output_file_name,
max_feature_count, filter_unique_peptides,
aggregate_psms, filter_few_obs,
remove_annotation)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| MSstatsPreprocessBigSparklyr(connection, input, output_file_name, | |
| } else if (backend == "sparklyr") { | |
| MSstatsPreprocessBigSparklyr(connection, input_file, output_file_name, | |
| max_feature_count, filter_unique_peptides, | |
| aggregate_psms, filter_few_obs, | |
| remove_annotation) | |
| } |
🤖 Prompt for AI Agents
In R/converters.R around line 63, the sparklyr backend call incorrectly passes
the variable input as the second argument; change that argument to input_file so
the function call reads MSstatsPreprocessBigSparklyr(connection, input_file,
output_file_name, ...) ensuring the correct input file variable is forwarded to
the sparklyr backend.
filter_by_qvalue = FALSE,
qvalue_cutoff = 0.01,
max_feature_count = 20,
intensity = "F.NormalizedPeakArea",
💡 Verification agent
🧩 Analysis chain
Significant change in default max_feature_count from 20 to 100
The default max_feature_count has been increased from 20 to 100 in bigSpectronauttoMSstatsFormat. This is a significant change that could impact performance and memory usage.
Also applies to: 132-132
🌐 Web query:
What is the recommended max_feature_count value for MSstats analysis with Spectronaut data?
💡 Result:
Recommended value: max_feature_count = 20. [1][2]
Sources:
[1] MSstatsBig manual (bigSpectronauttoMSstatsFormat default).
[2] MSstatsBig workflow/vignette (documents converters for Spectronaut and default settings).
Revert default max_feature_count to 20
Align with MSstatsBig’s recommendation of max_feature_count = 20 for Spectronaut data—raising it to 100 may incur unnecessary performance and memory overhead. Update the default back in bigSpectronauttoMSstatsFormat (R/converters.R lines 127, 132).
🤖 Prompt for AI Agents
In R/converters.R around lines 127 and 132, the default for max_feature_count in
bigSpectronauttoMSstatsFormat was changed to 100; revert it back to 20 to match
MSstatsBig’s recommendation for Spectronaut data. Update both occurrences in the
function signature and any default argument or parameter handling to set
max_feature_count = 20, and run a quick unit/example to ensure behavior remains
consistent.
calculateAnomalyScores=FALSE,
anomalyModelFeatures=c(),
🛠️ Refactor suggestion
Inconsistent parameter passing between functions
In bigSpectronauttoMSstatsFormat, the anomaly parameters are passed to both reduceBigSpectronaut and MSstatsPreprocessBig. However, in bigFragPipetoMSstatsFormat (line 100), these parameters are not passed at all, which could cause issues if users expect consistent behavior across converters.
bigFragPipetoMSstatsFormat <- function(input_file, output_file_name,
backend,
max_feature_count = 20,
filter_unique_peptides = FALSE,
aggregate_psms = FALSE,
filter_few_obs = FALSE,
remove_annotation = FALSE,
+ calculateAnomalyScores = FALSE,
+ anomalyModelFeatures = c(),
connection = NULL) {
MSstatsPreprocessBig(input_file, output_file_name,
backend, max_feature_count, filter_unique_peptides,
aggregate_psms, filter_few_obs, remove_annotation,
+ calculateAnomalyScores, anomalyModelFeatures,
connection = connection)
}

Also applies to: 147-148
🤖 Prompt for AI Agents
In R/converters.R around lines 137-138 and 147-148, the anomaly parameters
calculateAnomalyScores and anomalyModelFeatures are not being forwarded in
bigFragPipetoMSstatsFormat (unlike bigSpectronauttoMSstatsFormat); update
bigFragPipetoMSstatsFormat to accept these parameters (or reuse the existing
args) and pass them through to both reduceBigSpectronaut and
MSstatsPreprocessBig so behavior is consistent across converters.
Make MSstatsBig compatible with MSstats+. This involved allowing users to retain peak quality metrics through the process. The anomaly detection model should then be applied to the output of MSstatsBig. Details on how to do this are in the MSstats+ vignette in the MSstats package.
Summary by CodeRabbit
New Features
Refactor
Documentation