feat(protein turnover): per-label statistics and multi-label summarization#193

Merged
tonywu1999 merged 18 commits into devel from
feat-turnover-4
Apr 14, 2026

Contributor

@tonywu1999 tonywu1999 commented Apr 13, 2026

PR Type

Bug fix, Enhancement, Tests, Documentation


Description

  • Summarize turnover data per PROTEIN/LABEL

  • Preserve LABEL in Tukey outputs

  • Fix per-label counting and missingness stats

  • Add regression tests and docs


Diagram Walkthrough

flowchart LR
  A["Labeled protein input"]
  B["Split by PROTEIN + LABEL"]
  C["Per-label linear/TMP summaries"]
  D["LABEL-aware output metrics"]
  E["Tests and documentation"]

  A -- "non-reference data" --> B
  B -- "drives" --> C
  C -- "propagates `LABEL` to" --> D
  D -- "validated by" --> E

File Walkthrough

Relevant files

  • Enhancement (2 files)
    • dataProcess.R: Split summaries by label and keep labels (+40/-39)
    • utils_summarization.R: Return multi-label Tukey summaries consistently (+20/-19)
  • Bug fix (2 files)
    • utils_output.R: Make output merging and stats label-aware (+26/-29)
    • utils_summarization_prepare.R: Group observation counts within each label (+13/-13)
  • Tests (1 file)
    • test_pr4_per_label.R: Add tests for per-label summarization behavior (+167/-0)
  • Documentation (2 files)
    • dot-fitTukey.Rd: Document new `is_labeled_reference` Tukey parameter (+7/-1)
    • dot-runTukey.Rd: Update Tukey docs for multi-label behavior (+6/-2)

Motivation and context / Short solution summary

Turnover (multi-label, e.g., H/L) summarization previously relied on single-label assumptions (many code paths filtered LABEL == "L"), causing incorrect aggregation, imputation, and outputs for turnover experiments. This PR implements per-label summarization and makes label-aware fixes across summarization, outputs, tests, and documentation. Behavior is now controlled by a clarified flag is_labeled_reference: TRUE = SRM-style (H treated as normalization reference, return L-normalized output), FALSE = turnover-style (summarize each LABEL independently). LABEL is preserved through Linear/TMP/Tukey pipelines, counts/missingness are computed per PROTEIN+LABEL, and output merging is made robust to mismatched columns.
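As a rough illustration of the two modes, the flag can be read as choosing the grouping keys for the protein split. This is a sketch on toy data, not the package's code: `split_keys` and the example table are hypothetical, and only the PROTEIN vs. PROTEIN+LABEL rule comes from the description above.

```r
library(data.table)

# Hypothetical helper: SRM-style data is split by PROTEIN only,
# turnover-style data by PROTEIN + LABEL.
split_keys = function(is_labeled_reference) {
    if (is_labeled_reference) "PROTEIN" else c("PROTEIN", "LABEL")
}

# Toy input: one protein, two labels, two runs (values are made up).
input = data.table(
    PROTEIN = "P1",
    LABEL = c("H", "L", "H", "L"),
    RUN = c(1, 1, 2, 2),
    newABUNDANCE = c(20, 18, 21, 19)
)

# SRM-style: H and L rows stay together in one group per protein.
groups_srm = split(input, by = split_keys(TRUE))
# Turnover-style: each label becomes its own group.
groups_turnover = split(input, by = split_keys(FALSE))
```

In the SRM-style split the single group still holds both H and L rows, which is why the later summarization steps have to decide explicitly which label the output carries.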

Detailed changes (by file / area)

  • R/dataProcess.R

    • Protein-splitting now uses PROTEIN+LABEL unless is_labeled_ref indicates SRM (then split by PROTEIN only).
    • AFT imputation fit_data uses rows with is_labeled_ref == FALSE when present (instead of LABEL == "L").
    • predicted and newABUNDANCE assignments gate on is_labeled_ref == FALSE when available (else on censored).
    • survival table column selection tightened with intersect(...) and LABEL included only when present.
    • Single-feature and multi-feature Linear/TMP results now include LABEL (set to "L" when is_labeled_reference TRUE; otherwise preserved from data).
    • .runTukey called with is_labeled_reference where applicable.
  • R/utils_output.R

    • Use data.table::rbindlist(..., fill = TRUE) when combining list outputs (summarized, predicted_survival).
    • Protein-level merges and group counts become label-aware: TotalGroupMeasurements grouped by PROTEIN, GROUP, LABEL; merge keys include LABEL.
    • lab derivation no longer filters on LABEL == "L"; GROUP handling preserved appropriately.
    • Added is_labeled_ref to retained feature-level columns.
    • NumMeasuredFeature and NumImputedFeature aggregated by PROTEIN, RUN, LABEL.
    • nonmissing_orig redefined to depend on censoring/INTENSITY rather than LABEL gating.
  • R/utils_summarization.R

    • Renamed parameter is_labeled → is_labeled_reference in .runTukey/.fitTukey and updated semantics.
    • Multi-feature path calls .fitTukey(input, is_labeled_reference).
    • Single-feature path:
      • is_labeled_reference = TRUE: apply .adjustLRuns(...) and return L-normalized output (LABEL forced to "L").
      • is_labeled_reference = FALSE: return results for all labels (keep LABEL in outputs).
    • .getNonMissingFilterStats simplified: nonmissing derived from !is.na(newABUNDANCE) & !censored (or !is.na(INTENSITY) when newABUNDANCE absent); removed LABEL== "L" special-casing.
  • R/utils_summarization_prepare.R / MSstatsPrepareForSummarization

    • Detects add_ref_covariate from presence of is_labeled_ref and passes is_labeled_reference into .prepareSummary.
    • .prepareSummary signature changed to accept is_labeled_reference; label_by = character(0) when is_labeled_reference TRUE, else "LABEL".
    • Grouping keys for counts/missingness made label-aware when is_labeled_reference FALSE:
      • n_obs: by PROTEIN, FEATURE, LABEL
      • n_obs_run: by PROTEIN, RUN, LABEL
      • total_features: by PROTEIN, LABEL
      • prop_features: by PROTEIN, RUN, LABEL
    • Nonmissing filtering applied per PROTEIN/FEATURE/(LABEL) depending on is_labeled_reference.
    • Consolidated preparation helpers (removed separate .prepareTMP/.prepareLinear helpers in this diff).
  • R/utils_imputation.R / R/utils_censored.R / others

    • Imputation and censored handling updated to use is_labeled_ref where appropriate (e.g., survival fitting uses rows where is_labeled_ref == FALSE).
  • Tests

    • Added/updated tinytests and unit tests across inst/tinytest and tests:
      • inst/tinytest/test_utils_summarization.R: covers .fitTukey/.runTukey behavior for both is_labeled_reference = FALSE and TRUE; asserts presence of LABEL/LogIntensities and SRM normalization behavior.
      • inst/tinytest/test_utils_summarization_prepare.R: adds make_two_label_input and verifies n_obs and total_features computed per PROTEIN+FEATURE+LABEL (when is_labeled_reference FALSE) and expected behavior when TRUE.
      • inst/tinytest/test_dataProcess.R: updated regression checks to expect LABEL and added SRM imputation assertions (censored H rows should keep predicted = NA; censored L rows can be imputed).
      • inst/tinytest/test_utils_censored.R, test_utils_imputation.R, and normalization tests updated/covered for is_labeled_ref usage.
      • New tests file tests/test_pr4_per_label.R added for per-label summarization regression/unit coverage.
  • Documentation (man/*.Rd)

    • man/dot-fitTukey.Rd, man/dot-runTukey.Rd, man/dot-prepareSummary.Rd updated to document the new is_labeled_reference parameter and SRM vs turnover semantics.
    • man/dot-prepareLinear.Rd and man/dot-prepareTMP.Rd removed (internal doc changes reflecting helper consolidation).
    • Some HTML docs and other man pages updated where applicable.
  • Miscellaneous

    • Made output merging label-aware and resilient to column mismatches (rbindlist fill).
    • Removed guards/assumptions that filtered to LABEL == "L" so non-reference labels are handled correctly.
    • Internal function signatures changed; exported/public API surface unchanged.
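The `rbindlist(..., fill = TRUE)` change mentioned above can be sketched with two toy result tables whose columns differ. The table contents are invented for illustration; only the use of `fill = TRUE` is taken from the PR.

```r
library(data.table)

# Toy per-protein results with mismatched schemas (made-up values):
# the first table has no LABEL column, the second does.
result_without_label = data.table(
    Protein = "P1", RUN = 1:2, LogIntensities = c(18.2, 18.9)
)
result_with_label = data.table(
    Protein = "P2", RUN = 1:2, LABEL = "H", LogIntensities = c(20.1, 20.4)
)

# fill = TRUE matches columns by name and pads missing ones with NA
# instead of failing on the differing column counts.
combined = rbindlist(
    list(result_without_label, result_with_label),
    fill = TRUE
)
```

Rows from the table without LABEL end up with `NA` in that column, so any downstream merge keyed on LABEL still needs to handle missing labels.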

Unit tests added or modified (summary)

  • inst/tinytest/test_utils_summarization.R

    • Tests for .fitTukey and .runTukey with is_labeled_reference = FALSE (turnover) and TRUE (SRM).
    • Tests assert LABEL presence (when turnover) and L-only outputs (when SRM), plus LogIntensities/newABUNDANCE presence.
    • Tests for .getNonMissingFilterStats nonmissing selection across labels.
  • inst/tinytest/test_utils_summarization_prepare.R

    • make_two_label_input fixture and assertions that:
      • n_obs counted per PROTEIN+FEATURE+LABEL when is_labeled_reference = FALSE.
      • total_features counted per PROTEIN+LABEL.
      • is_labeled_reference = TRUE yields H rows sharing counts with L rows (no zeroed counts).
  • inst/tinytest/test_dataProcess.R

    • Regression checks updated to allow column differences via intersection-of-columns and fsetequal checks.
    • New SRM imputation assertions verifying predicted NA behavior for H (is_labeled_ref=TRUE) vs imputed L (is_labeled_ref=FALSE).
  • Additional tinytests

    • Updates across imputation, censored, normalization tests to validate is_labeled_ref handling.
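The per-label counting assertions listed above could look roughly like the following. This is a hedged sketch: the fixture body and `count_obs` are hypothetical stand-ins (the real `make_two_label_input` and `.prepareSummary` are not shown in this PR page), and only the grouping keys follow the description.

```r
library(data.table)

# Hypothetical stand-in for the two-label fixture used in the tests.
make_two_label_input = function() {
    data.table(
        PROTEIN = "P1",
        FEATURE = rep(c("F1", "F2"), each = 4),
        LABEL = rep(c("H", "L"), times = 4),
        RUN = rep(rep(1:2, each = 2), times = 2),
        newABUNDANCE = c(20, 18, 21, NA, 22, 17, 23, 16)
    )
}

# Hypothetical counting helper: LABEL joins the grouping keys only
# in turnover mode (is_labeled_reference = FALSE).
count_obs = function(input, is_labeled_reference) {
    label_by = if (is_labeled_reference) character(0) else "LABEL"
    out = copy(input)
    out[, n_obs := sum(!is.na(newABUNDANCE)),
        by = c("PROTEIN", "FEATURE", label_by)]
    out[, total_features := uniqueN(FEATURE),
        by = c("PROTEIN", label_by)]
    out[]
}

turnover = count_obs(make_two_label_input(), is_labeled_reference = FALSE)
srm = count_obs(make_two_label_input(), is_labeled_reference = TRUE)
```

In the toy data the light channel of F1 has one missing value, so the turnover counts differ between H and L for that feature, while the SRM-style counts are shared by both labels.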

Coding guidelines / issues observed

  • Inconsistent naming and documentation state:
    • R code and man pages use is_labeled_reference, but several C++ sources and generated Rcpp exports still use is_labeled / is_reference (src/linear_summary.cpp, src/RcppExports.cpp, R/RcppExports.R, docs HTML). This creates potential mismatch between R-level parameter naming/semantics and C++ interfaces and documentation artifacts (docs reference is_labeled in some HTML/man files). Recommend aligning names and updating C++/Rcpp interfaces and generated docs to avoid confusion.
  • Remaining doc inconsistencies:
    • Some documentation/html files still show older parameter names (is_labeled) or usage signatures (docs/reference/dot-runTukey.html, docs/reference/dot-fitLinearModel.html). Ensure documentation rebuild to reflect new parameter names/semantics.
  • Removal of man pages:
    • man/dot-prepareLinear.Rd and man/dot-prepareTMP.Rd were removed; ensure this was intentional and that internal helpers are sufficiently documented elsewhere if needed.
  • No changes to exported/public R interfaces were made in this diff, but internal signature changes and C++ naming differences may warrant explicit changelog notes.

tonywu1999 and others added 4 commits April 13, 2026 13:18
- dataProcess.R: split protein_indices by PROTEIN+LABEL (not just PROTEIN)
  when not using labeled reference, so each label is summarized separately;
  remove LABEL == "L" filters from Linear/TMP survival imputation and result
  aggregation; propagate LABEL column through all result tables
- utils_summarization.R: rename is_labeled → is_labeled_reference in
  .runTukey/.fitTukey; return LABEL in non-reference results; remove
  LABEL == "L" guard from .getNonMissingFilterStats
- utils_output.R: use rbindlist(fill=TRUE) for mixed-schema result lists;
  add LABEL to TotalGroupMeasurements/NumMeasuredFeature/NumImputedFeature
  grouping keys; merge summarized+lab on LABEL; include ref in output cols;
  remove LABEL == "L" guards from nonmissing tracking
- man/: update .fitTukey and .runTukey Rd docs for renamed parameter

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
n_obs, n_obs_run, total_features, and prop_features must all be computed
within each PROTEIN+LABEL combination so that H and L features are counted
independently — a fixup for the per-label statistics commit.
Also switch .fitTukey roxygen to @inheritParams .runTukey.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests for PR 4 covering:
- .prepareLinear n_obs grouped by PROTEIN+FEATURE+LABEL (not pooled)
- .runTukey(is_labeled_reference=FALSE) returns LABEL column for both H and L
- .fitTukey(is_labeled_reference=FALSE) returns LABEL column
- .getNonMissingFilterStats applies to all rows (no LABEL=="L" guard)
- Regression: SRMRawData still summarizes correctly after per-label changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 13, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 651cd7ea-9bf0-4e26-ae47-40761752063e

📥 Commits

Reviewing files that changed from the base of the PR and between b9e28eb and eca8c92.

📒 Files selected for processing (1)
  • inst/tinytest/test_dataProcess.R
📝 Walkthrough

Walkthrough

Label-reference handling was added: an is_labeled_ref / is_labeled_reference flag now controls whether grouping and summarization exclude LABEL; Tukey fitting, non-missing filtering, survival/linear imputation, and output aggregation were made conditional on that flag. Tests and docs updated to reflect the new behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Summarization control & preparation (R/utils_summarization_prepare.R, R/dataProcess.R) | Introduce and propagate is_labeled_reference / is_labeled_ref; .prepareSummary signature changed to accept it; grouping keys and nonmissing logic become label-aware or label-agnostic depending on the flag; TMP/Linear preparation consolidated into .prepareSummary. |
| Tukey & non-missing logic (R/utils_summarization.R, man/dot-fitTukey.Rd, man/dot-runTukey.Rd) | Rename parameter to is_labeled_reference; .fitTukey() signature updated and branching added to return only L when reference-mode is TRUE and return all labels otherwise; .runTukey() updated accordingly; .getNonMissingFilterStats() simplified to rely on newABUNDANCE/censored or INTENSITY. |
| Model fitting / imputation (R/dataProcess.R) | Survival/linear/TMP imputation now selects fitting rows based on is_labeled_ref when present (falling back to prior LABEL-based logic); predicted/newABUNDANCE assignment gates on is_labeled_ref where available; survival columns chosen via intersect(...) of actual cols. |
| Output aggregation & binding (R/utils_output.R) | Use data.table::rbindlist(..., fill=TRUE) for heterogeneous result lists; include is_labeled_ref in feature output columns; make protein/run-level metrics and merges label-aware (grouping/joins include LABEL). |
| Tests (inst/tinytest/test_utils_summarization.R, inst/tinytest/test_utils_summarization_prepare.R, inst/tinytest/test_dataProcess.R) | Add/extend tests for .fitTukey(..., is_labeled_reference=FALSE/TRUE), .runTukey, .getNonMissingFilterStats, label-aware .prepareSummary behavior, and SRM/TMP imputation expectations; adjust dataProcess regression comparisons to compare intersecting columns. |
| Documentation (man/dot-fitTukey.Rd, man/dot-runTukey.Rd, man/dot-prepareSummary.Rd, man/dot-prepareLinear.Rd (removed), man/dot-prepareTMP.Rd (removed)) | Document new is_labeled_reference parameter and semantics; update signatures for .fitTukey and .runTukey; remove outdated .prepareLinear/.prepareTMP Rd pages and update .prepareSummary doc to include is_labeled_reference. |
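The simplified non-missing rule summarized in the table can be sketched in base R. `nonmissing_flag` is a hypothetical stand-in, not the body of `.getNonMissingFilterStats`, and the toy data frames are invented; the rule itself (use newABUNDANCE and censored when available, otherwise INTENSITY, with no LABEL special-casing) is the one described in this PR.

```r
# Hypothetical stand-in for the nonmissing rule: no LABEL == "L" guard,
# every row is classified the same way regardless of its label.
nonmissing_flag = function(input) {
    if ("newABUNDANCE" %in% colnames(input)) {
        !is.na(input$newABUNDANCE) & !input$censored
    } else {
        !is.na(input$INTENSITY)
    }
}

# Toy inputs (made-up values): one with processed abundances,
# one with raw intensities only.
with_abundance = data.frame(
    LABEL = c("H", "L", "H", "L"),
    newABUNDANCE = c(20.1, NA, 19.7, 18.4),
    censored = c(FALSE, TRUE, FALSE, TRUE)
)
raw_only = data.frame(LABEL = c("H", "L"), INTENSITY = c(1e6, NA))

flags_abundance = nonmissing_flag(with_abundance)
flags_raw = nonmissing_flag(raw_only)
```

Note that the last row of `with_abundance` has a value but is flagged censored, so it does not count as non-missing under this rule.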

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

Review effort 3/5

Suggested reviewers

  • mstaniak

Poem

🐇 I hop through rows both L and H,

A tiny flag decides their way.
I bind with fills and group by care,
Impute the missing, tidy the pair.
Hops and tests—release hooray!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately reflects the main objective: implementing per-label statistics and multi-label summarization for protein turnover data analysis. |
| Description check | ✅ Passed | The PR description provides clear motivation, a diagram walkthrough, and file-level details with specific change summaries, but lacks an explicit testing section and gives incomplete motivation context. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Wrong averaging

In the single-feature linear path, the new summary now averages all rows in a run instead of only the light channel. For labeled-reference/SRM data, this function still receives both H and L rows together, so the returned protein abundance becomes the mean of the reference and measured channels rather than the light-channel summary. Any single-feature labeled-reference protein will therefore get a shifted LogIntensities value.

result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN]
result[, Protein := unique(single_protein$PROTEIN)]
result[, LABEL := unique(single_protein$LABEL)]
result[, Variance := NA_real_]
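The shift described in this observation is easy to reproduce on toy data. The sketch below uses invented abundances and compares the pooled per-run mean with a light-channel-only mean; it is an illustration of the concern, not the package's fix.

```r
library(data.table)

# Toy single-feature protein with both channels in every run
# (values are made up; H is the reference channel).
single_protein = data.table(
    PROTEIN = "P1",
    RUN = rep(1:2, each = 2),
    LABEL = rep(c("H", "L"), times = 2),
    newABUNDANCE = c(24, 20, 26, 22)
)

# Pooled mean: mixes H and L rows within each run.
pooled = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN]

# Light-channel mean: what an SRM-style summary is expected to report.
light_only = single_protein[LABEL == "L",
                            .(LogIntensities = mean(newABUNDANCE)),
                            by = RUN]
```

With these numbers the pooled summary sits two units above the light-channel summary in each run, which is the kind of shift the reviewer flags.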
Label assignment

LABEL is populated with unique(single_protein$LABEL) even when the input group contains both H and L rows. In labeled-reference data that split is still done only by PROTEIN, so this assignment is not length-1. With more than two runs it can raise a data.table assignment error, and with exactly two runs it silently assigns alternating labels to run-level summaries. The downstream merge by LABEL will then attach incorrect or missing metadata.

result = unique(single_protein[, .(Protein = PROTEIN, RUN = RUN)])
extracted_values = get_linear_summary(single_protein, cf,
                                      counts, label, cov_mat)
result = cbind(result, extracted_values)
result[, LABEL := unique(single_protein$LABEL)]
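A small sketch of the two failure modes, assuming a current data.table release in which `:=` does not recycle a right-hand side of length greater than one. The tables are toy stand-ins for the run-level `result`.

```r
library(data.table)

labels = c("H", "L")  # what unique(single_protein$LABEL) returns for SRM data

# Three runs: a length-2 value cannot be assigned to three rows,
# so the assignment is rejected.
three_runs = data.table(Protein = "P1", RUN = 1:3,
                        LogIntensities = c(22, 24, 23))
outcome = tryCatch({
    three_runs[, LABEL := labels]
    "assigned"
}, error = function(e) "error")

# Two runs: the lengths happen to match, so each run silently
# receives a different label.
two_runs = data.table(Protein = "P1", RUN = 1:2,
                      LogIntensities = c(22, 24))
two_runs[, LABEL := labels]
```

The two-run case is the more dangerous one because nothing fails: the run-level rows simply carry labels that do not describe them.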
Over-imputation

The survival fit and censoring replacement now run on all labels, but labeled-reference workflows previously limited this to the light channel. When a heavy/reference row is censored, this change imputes the heavy value as well, which can alter the reference-based normalization and change the summarized light-channel abundance. This affects any labeled-reference dataset with censored heavy observations.

survival_fit = .fitSurvival(
  single_protein[, cols, with = FALSE],
  aft_iterations
)
sigma2 = survival_fit$scale^2

single_protein[, c("predicted", "imputation_var") := {
    pred = predict(survival_fit, newdata = .SD, se.fit = TRUE)
    list(pred$fit, pred$se.fit^2 + sigma2)
}]

single_protein[, predicted := ifelse(censored, predicted, NA)]
single_protein[, newABUNDANCE := ifelse(censored, predicted, newABUNDANCE)]

survival = single_protein[, intersect(c(cols, "LABEL", "predicted"), colnames(single_protein)), with = FALSE]
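One way to keep reference rows out of the replacement step is to gate on the `is_labeled_ref` column, in line with the PR description's statement that censored H rows should keep `predicted = NA`. The sketch below is an assumption-laden illustration: the `predicted` values are invented rather than produced by `.fitSurvival`, and `impute_row` is a hypothetical helper column.

```r
library(data.table)

# Toy SRM-style protein: H rows are the labeled reference.
# The predicted values are placeholders for survival-model output.
single_protein = data.table(
    LABEL = c("H", "L", "H", "L"),
    is_labeled_ref = c(TRUE, FALSE, TRUE, FALSE),
    censored = c(TRUE, TRUE, FALSE, FALSE),
    newABUNDANCE = c(NA, NA, 25, 21),
    predicted = c(24.5, 19.8, 25.1, 21.2)
)

# Only censored, non-reference rows are eligible for imputation.
single_protein[, impute_row := censored & !is_labeled_ref]
single_protein[, predicted := ifelse(impute_row, predicted, NA_real_)]
single_protein[, newABUNDANCE := ifelse(impute_row, predicted, newABUNDANCE)]
```

The censored heavy row keeps a missing abundance, so the reference channel used for normalization is not altered by the model.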

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Avoid mixing label channels

This now averages H and L together whenever single_protein contains both labels,
which changes the summarized abundance for labeled-reference workflows. Summarize a
label-homogeneous subset instead, and carry the corresponding single LABEL into the
result so downstream merges stay correct.

R/dataProcess.R [437-441]

-result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN]
-result[, Protein := unique(single_protein$PROTEIN)]
-result[, LABEL := unique(single_protein$LABEL)]
+summary_input = if (data.table::uniqueN(single_protein$LABEL) > 1L) {
+    single_protein[LABEL == "L"]
+} else {
+    single_protein
+}
+result = summary_input[, .(LogIntensities = mean(newABUNDANCE)), by = RUN]
+result[, Protein := unique(summary_input$PROTEIN)]
+result[, LABEL := unique(summary_input$LABEL)]
 result[, Variance := NA_real_]
-setcolorder(result, c("Protein", "RUN", "LogIntensities", "Variance"))
+setcolorder(result, c("Protein", "RUN", "LABEL", "LogIntensities", "Variance"))
Suggestion importance[1-10]: 8

__

Why: This correctly identifies a regression in the single-feature linear path: when single_protein contains both H and L, averaging all newABUNDANCE values by RUN changes the labeled-reference summary. Restricting the summary to the light-channel subset in the multi-label case preserves the prior behavior and keeps downstream LABEL handling consistent.

Medium
Prevent recycled label assignment

unique(single_protein$LABEL) can return both L and H. Assigning that length-2 vector
with := either raises a data.table length error or, when result has exactly two rows,
stamps one label on each row and silently mislabels the summaries. Assign a single
output label explicitly when multiple labels are present.

R/dataProcess.R [468-472]

 result = unique(single_protein[, .(Protein = PROTEIN, RUN = RUN)])
 extracted_values = get_linear_summary(single_protein, cf,
                                       counts, label, cov_mat)
 result = cbind(result, extracted_values)
-result[, LABEL := unique(single_protein$LABEL)]
+result[, LABEL := if (data.table::uniqueN(single_protein$LABEL) == 1L) {
+    unique(single_protein$LABEL)
+} else {
+    "L"
+}]
Suggestion importance[1-10]: 7

__

Why: This is a valid correctness issue because unique(single_protein$LABEL) can contain more than one value, which can misassign or recycle labels in result. Setting a single explicit LABEL for multi-label summaries avoids bad merges later in MSstatsSummarizationOutput.

Medium

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/dataProcess.R`:
- Around line 437-442: The assignment LABEL := unique(single_protein$LABEL)
fails when single_protein contains both H and L (labeled-reference mode) because
unique(...) returns a length-2 vector; update the summarization so that when using
the linear SRM summarization (the block computing result from single_protein and
assigning LogIntensities/Protein/LABEL/Variance) you first filter single_protein
to only the L (light, non-reference) rows before aggregating, or explicitly select
the single LABEL value per RUN (e.g., take LABEL[which.min(...)] or LABEL[1]
after filtering) so that LABEL is scalar per run; apply the same change to the
corresponding block around lines 468–472 to ensure run-level result tables
contain only L rows and a single LABEL value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7f130e84-bde1-4f2b-a228-88e07cd4798f

📥 Commits

Reviewing files that changed from the base of the PR and between 5b5042c and 654ac41.

📒 Files selected for processing (8)
  • R/dataProcess.R
  • R/utils_output.R
  • R/utils_summarization.R
  • R/utils_summarization_prepare.R
  • inst/tinytest/test_utils_summarization.R
  • inst/tinytest/test_utils_summarization_prepare.R
  • man/dot-fitTukey.Rd
  • man/dot-runTukey.Rd

Comment thread R/dataProcess.R

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
R/dataProcess.R (1)

446-450: ⚠️ Potential issue | 🔴 Critical

Use a scalar output label in linear labeled-reference mode.

When is_labeled_reference is TRUE, single_protein still contains both H and L rows, so unique(single_protein$LABEL) is not scalar. Lines 449 and 482 can therefore fail at assignment time or stamp the wrong label onto the run-level result. Derive the output label from the non-reference rows once, or filter to those rows before building result.

Proposed fix
+    output_label = if (is_labeled_reference) {
+        unique(single_protein[!is_labeled_ref, LABEL])
+    } else {
+        unique(single_protein$LABEL)
+    }
+
     if (is_single_feature) {
         result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN]
         result[, Protein := unique(single_protein$PROTEIN)]
-        result[, LABEL := unique(single_protein$LABEL)]
+        result[, LABEL := output_label]
         result[, Variance := NA_real_]
         setcolorder(result, c("Protein", "RUN", "LogIntensities", "Variance"))
@@
             result = unique(single_protein[, .(Protein = PROTEIN, RUN = RUN)])
             extracted_values = get_linear_summary(single_protein, cf,
                                                   counts, label, cov_mat)
             result = cbind(result, extracted_values)
-            result[, LABEL := unique(single_protein$LABEL)]
+            result[, LABEL := output_label]
         }

Also applies to: 478-482
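The idea behind the proposed fix can be sketched as a small helper that always returns one label. `pick_output_label` is a hypothetical name used only for illustration, and the toy tables are invented; the rule (take the label from the non-reference rows in labeled-reference mode) follows the proposal above.

```r
library(data.table)

# Hypothetical helper: derive a single output label for a protein group.
pick_output_label = function(single_protein, is_labeled_reference) {
    has_ref_flag = "is_labeled_ref" %in% colnames(single_protein)
    labels = if (is_labeled_reference && has_ref_flag) {
        # SRM-style: the group holds H and L, report the non-reference label.
        unique(single_protein[is_labeled_ref == FALSE, LABEL])
    } else {
        # Turnover-style: the group was split by PROTEIN + LABEL already.
        unique(single_protein$LABEL)
    }
    # Fail loudly instead of assigning a multi-value label downstream.
    stopifnot(length(labels) == 1L)
    labels
}

srm_group = data.table(
    LABEL = c("H", "L", "H", "L"),
    is_labeled_ref = c(TRUE, FALSE, TRUE, FALSE)
)
turnover_group = data.table(LABEL = c("H", "H"))

srm_label = pick_output_label(srm_group, is_labeled_reference = TRUE)
turnover_label = pick_output_label(turnover_group, is_labeled_reference = FALSE)
```

Checking the length inside the helper turns a silent mislabel into an immediate failure if a mixed-label group ever reaches the turnover branch.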

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/dataProcess.R` around lines 446 - 450, When is_labeled_reference is TRUE,
unique(single_protein$LABEL) can return multiple values because single_protein
contains both H and L rows; derive the scalar LABEL from only the non-reference
rows (e.g., filter single_protein where REF flag is false or where LABEL !=
reference label) before assigning to result (used in the block under
is_single_feature and the similar block around lines 478-482). Locate the
assignments to result[, LABEL := unique(single_protein$LABEL)] and replace them
with a scalar computed from the filtered rows (e.g., selected_label <-
unique(single_protein[non_reference_rows]$LABEL); then assign result[, LABEL :=
selected_label]) so the LABEL is always a single value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/dataProcess.R`:
- Around line 583-585: The code is hard-coding "L" into the LABEL column when
result lacks LABEL; change this to use the source protein's label instead: set
LABEL from single_protein$LABEL (e.g., result[, LABEL := single_protein$LABEL])
so that .runTukey()'s unlabeled outputs inherit the correct label; only fall
back to the hard-coded "L" when you are explicitly in the labeled-reference mode
(check whatever flag or parameter your pipeline uses for labeled-reference and
branch there).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e9cbb4f1-66f7-4cef-93b9-ffaa3514f5d5

📥 Commits

Reviewing files that changed from the base of the PR and between 654ac41 and d861d0d.

📒 Files selected for processing (5)
  • R/dataProcess.R
  • R/utils_summarization_prepare.R
  • man/dot-prepareLinear.Rd
  • man/dot-prepareSummary.Rd
  • man/dot-prepareTMP.Rd
✅ Files skipped from review due to trivial changes (2)
  • man/dot-prepareSummary.Rd
  • man/dot-prepareLinear.Rd

Comment thread R/dataProcess.R Outdated
Comment thread R/dataProcess.R Outdated
Comment thread R/dataProcess.R Outdated

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
R/dataProcess.R (1)

447-452: ⚠️ Potential issue | 🟠 Major

Fix LABEL assignment in linear summarization when is_labeled_reference=TRUE.

When is_labeled_reference=TRUE, data is split by PROTEIN only (lines 303-304), so single_protein contains both H and L rows. At line 449, unique(single_protein$LABEL) returns a length-2 vector c("H", "L"), which causes a data.table assignment error whenever the run-level result does not have exactly two rows, because LABEL is expected to be a single value.

The same issue exists at line 482 for the multi-feature case.

Proposed fix
+    output_label = if (data.table::uniqueN(single_protein$LABEL) > 1L) "L" else unique(single_protein$LABEL)
+
     if (is_single_feature) {
         result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN]
         result[, Protein := unique(single_protein$PROTEIN)]
-        result[, LABEL := unique(single_protein$LABEL)]
+        result[, LABEL := output_label]
         result[, Variance := NA_real_]

And similarly for line 482:

             result = cbind(result, extracted_values)
-            result[, LABEL := unique(single_protein$LABEL)]
+            result[, LABEL := output_label]
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/dataProcess.R` around lines 447 - 452, The LABEL assignment fails when
is_labeled_reference=TRUE because single_protein contains both "H" and "L" so
unique(single_protein$LABEL) returns length>1; update the LABEL assignment in
the linear summarization block (after result = single_protein[, .(LogIntensities
= mean(newABUNDANCE)), by = RUN]) to guard against multiple labels by selecting
a single value (e.g., LABEL := unique(single_protein$LABEL)[1]) or otherwise
handling the multi-value case (e.g., set NA_character_ or collapse values) and
apply the same defensive change to the analogous multi-feature summarization
block (the code around the second LABEL assignment).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87533bf5-a158-4d4f-93ff-f7c5f33e7caf

📥 Commits

Reviewing files that changed from the base of the PR and between d861d0d and f59c26a.

📒 Files selected for processing (4)
  • R/dataProcess.R
  • R/utils_summarization_prepare.R
  • inst/tinytest/test_dataProcess.R
  • inst/tinytest/test_utils_summarization_prepare.R
🚧 Files skipped from review as they are similar to previous changes (1)
  • inst/tinytest/test_utils_summarization_prepare.R

@tonywu1999 tonywu1999 changed the title from "Feat turnover 4" to "feat(protein turnover): per-label statistics and multi-label summarization" Apr 14, 2026
@tonywu1999 tonywu1999 merged commit 941aa84 into devel Apr 14, 2026
2 checks passed
@tonywu1999 tonywu1999 deleted the feat-turnover-4 branch April 14, 2026 19:24