feat(protein turnover): per-label statistics and multi-label summarization#193
feat(protein turnover): per-label statistics and multi-label summarization#193tonywu1999 merged 18 commits intodevelfrom
Conversation
- dataProcess.R: split protein_indices by PROTEIN+LABEL (not just PROTEIN) when not using labeled reference, so each label is summarized separately; remove LABEL == "L" filters from Linear/TMP survival imputation and result aggregation; propagate LABEL column through all result tables - utils_summarization.R: rename is_labeled → is_labeled_reference in .runTukey/.fitTukey; return LABEL in non-reference results; remove LABEL == "L" guard from .getNonMissingFilterStats - utils_output.R: use rbindlist(fill=TRUE) for mixed-schema result lists; add LABEL to TotalGroupMeasurements/NumMeasuredFeature/NumImputedFeature grouping keys; merge summarized+lab on LABEL; include ref in output cols; remove LABEL == "L" guards from nonmissing tracking - man/: update .fitTukey and .runTukey Rd docs for renamed parameter Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
n_obs, n_obs_run, total_features, and prop_features must all be computed within each PROTEIN+LABEL combination so that H and L features are counted independently — a fixup for the per-label statistics commit. Also switch .fitTukey roxygen to @inheritParams .runTukey. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests for PR 4 covering: - .prepareLinear n_obs grouped by PROTEIN+FEATURE+LABEL (not pooled) - .runTukey(is_labeled_reference=FALSE) returns LABEL column for both H and L - .fitTukey(is_labeled_reference=FALSE) returns LABEL column - .getNonMissingFilterStats applies to all rows (no LABEL=="L" guard) - Regression: SRMRawData still summarizes correctly after per-label changes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 33 minutes and 47 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughLabel-reference handling was added: an Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/dataProcess.R`:
- Around line 437-442: The assignment LABEL := unique(single_protein$LABEL)
fails when single_protein contains both H and L (labeled-reference mode) because
unique(...) returns length-2; update the summarization so that when using the
linear SRM summarization (the block computing result from single_protein and
assigning LogIntensities/Protein/LABEL/Variance) you first filter single_protein
to only the L (light/reference) rows before aggregating, or explicitly select
the single LABEL value per RUN (e.g., take LABEL[which.min(...) or LABEL[1]
after filtering]) so that LABEL is scalar per run; apply the same change to the
corresponding block around lines 468–472 to ensure run-level result tables
contain only L rows and a single LABEL value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7f130e84-bde1-4f2b-a228-88e07cd4798f
📒 Files selected for processing (8)
R/dataProcess.RR/utils_output.RR/utils_summarization.RR/utils_summarization_prepare.Rinst/tinytest/test_utils_summarization.Rinst/tinytest/test_utils_summarization_prepare.Rman/dot-fitTukey.Rdman/dot-runTukey.Rd
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
R/dataProcess.R (1)
446-450:⚠️ Potential issue | 🔴 CriticalUse a scalar output label in linear labeled-reference mode.
When
is_labeled_referenceisTRUE,single_proteinstill contains both H and L rows, sounique(single_protein$LABEL)is not scalar. Lines 449 and 482 can therefore fail at assignment time or stamp the wrong label onto the run-level result. Derive the output label from the non-reference rows once, or filter to those rows before buildingresult.Proposed fix
+ output_label = if (is_labeled_reference) { + unique(single_protein[!is_labeled_ref, LABEL]) + } else { + unique(single_protein$LABEL) + } + if (is_single_feature) { result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN] result[, Protein := unique(single_protein$PROTEIN)] - result[, LABEL := unique(single_protein$LABEL)] + result[, LABEL := output_label] result[, Variance := NA_real_] setcolorder(result, c("Protein", "RUN", "LogIntensities", "Variance")) @@ result = unique(single_protein[, .(Protein = PROTEIN, RUN = RUN)]) extracted_values = get_linear_summary(single_protein, cf, counts, label, cov_mat) result = cbind(result, extracted_values) - result[, LABEL := unique(single_protein$LABEL)] + result[, LABEL := output_label] }Also applies to: 478-482
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/dataProcess.R` around lines 446 - 450, When is_labeled_reference is TRUE, unique(single_protein$LABEL) can return multiple values because single_protein contains both H and L rows; derive the scalar LABEL from only the non-reference rows (e.g., filter single_protein where REF flag is false or where LABEL != reference label) before assigning to result (used in the block under is_single_feature and the similar block around lines 478-482). Locate the assignments to result[, LABEL := unique(single_protein$LABEL)] and replace them with a scalar computed from the filtered rows (e.g., selected_label <- unique(single_protein[non_reference_rows]$LABEL); then assign result[, LABEL := selected_label]) so the LABEL is always a single value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/dataProcess.R`:
- Around line 583-585: The code is hard-coding "L" into the LABEL column when
result lacks LABEL; change this to use the source protein's label instead: set
LABEL from single_protein$LABEL (e.g., result[, LABEL := single_protein$LABEL])
so that .runTukey()'s unlabeled outputs inherit the correct label; only fall
back to the hard-coded "L" when you are explicitly in the labeled-reference mode
(check whatever flag or parameter your pipeline uses for labeled-reference and
branch there).
---
Duplicate comments:
In `@R/dataProcess.R`:
- Around line 446-450: When is_labeled_reference is TRUE,
unique(single_protein$LABEL) can return multiple values because single_protein
contains both H and L rows; derive the scalar LABEL from only the non-reference
rows (e.g., filter single_protein where REF flag is false or where LABEL !=
reference label) before assigning to result (used in the block under
is_single_feature and the similar block around lines 478-482). Locate the
assignments to result[, LABEL := unique(single_protein$LABEL)] and replace them
with a scalar computed from the filtered rows (e.g., selected_label <-
unique(single_protein[non_reference_rows]$LABEL); then assign result[, LABEL :=
selected_label]) so the LABEL is always a single value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e9cbb4f1-66f7-4cef-93b9-ffaa3514f5d5
📒 Files selected for processing (5)
R/dataProcess.RR/utils_summarization_prepare.Rman/dot-prepareLinear.Rdman/dot-prepareSummary.Rdman/dot-prepareTMP.Rd
✅ Files skipped from review due to trivial changes (2)
- man/dot-prepareSummary.Rd
- man/dot-prepareLinear.Rd
There was a problem hiding this comment.
♻️ Duplicate comments (1)
R/dataProcess.R (1)
447-452:⚠️ Potential issue | 🟠 MajorFix LABEL assignment in linear summarization when is_labeled_reference=TRUE.
When
is_labeled_reference=TRUE, data is split byPROTEINonly (line 303-304), sosingle_proteincontains both H and L rows. At line 449,unique(single_protein$LABEL)returns a length-2 vectorc("H", "L"), which will cause a data.table assignment error when assigned to the scalarLABELcolumn.The same issue exists at line 482 for the multi-feature case.
Proposed fix
+ output_label = if (data.table::uniqueN(single_protein$LABEL) > 1L) "L" else unique(single_protein$LABEL) + if (is_single_feature) { result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN] result[, Protein := unique(single_protein$PROTEIN)] - result[, LABEL := unique(single_protein$LABEL)] + result[, LABEL := output_label] result[, Variance := NA_real_]And similarly for line 482:
result = cbind(result, extracted_values) - result[, LABEL := unique(single_protein$LABEL)] + result[, LABEL := output_label] }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/dataProcess.R` around lines 447 - 452, The LABEL assignment fails when is_labeled_reference=TRUE because single_protein contains both "H" and "L" so unique(single_protein$LABEL) returns length>1; update the LABEL assignment in the linear summarization block (after result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by = RUN]) to guard against multiple labels by selecting a single value (e.g., LABEL := unique(single_protein$LABEL)[1]) or otherwise handling the multi-value case (e.g., set NA_character_ or collapse values) and apply the same defensive change to the analogous multi-feature summarization block (the code around the second LABEL assignment).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@R/dataProcess.R`:
- Around line 447-452: The LABEL assignment fails when is_labeled_reference=TRUE
because single_protein contains both "H" and "L" so unique(single_protein$LABEL)
returns length>1; update the LABEL assignment in the linear summarization block
(after result = single_protein[, .(LogIntensities = mean(newABUNDANCE)), by =
RUN]) to guard against multiple labels by selecting a single value (e.g., LABEL
:= unique(single_protein$LABEL)[1]) or otherwise handling the multi-value case
(e.g., set NA_character_ or collapse values) and apply the same defensive change
to the analogous multi-feature summarization block (the code around the second
LABEL assignment).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 87533bf5-a158-4d4f-93ff-f7c5f33e7caf
📒 Files selected for processing (4)
R/dataProcess.RR/utils_summarization_prepare.Rinst/tinytest/test_dataProcess.Rinst/tinytest/test_utils_summarization_prepare.R
🚧 Files skipped from review as they are similar to previous changes (1)
- inst/tinytest/test_utils_summarization_prepare.R
PR Type
Bug fix, Enhancement, Tests, Documentation
Description
Summarize turnover data per
PROTEIN/LABELPreserve
LABELin Tukey outputsFix per-label counting and missingness stats
Add regression tests and docs
Diagram Walkthrough
File Walkthrough
2 files
Split summaries by label and keep labelsReturn multi-label Tukey summaries consistently2 files
Make output merging and stats label-awareGroup observation counts within each label1 files
Add tests for per-label summarization behavior2 files
Document new `is_labeled_reference` Tukey parameterUpdate Tukey docs for multi-label behaviorMotivation and context / Short solution summary
Turnover (multi-label, e.g., H/L) summarization previously relied on single-label assumptions (many code paths filtered LABEL == "L"), causing incorrect aggregation, imputation, and outputs for turnover experiments. This PR implements per-label summarization and makes label-aware fixes across summarization, outputs, tests, and documentation. Behavior is now controlled by a clarified flag is_labeled_reference: TRUE = SRM-style (H treated as normalization reference, return L-normalized output), FALSE = turnover-style (summarize each LABEL independently). LABEL is preserved through Linear/TMP/Tukey pipelines, counts/missingness are computed per PROTEIN+LABEL, and output merging is made robust to mismatched columns.
Detailed changes (by file / area)
R/dataProcess.R
R/utils_output.R
R/utils_summarization.R
R/utils_summarization_prepare.R / MSstatsPrepareForSummarization
R/utils_imputation.R / R/utils_censored.R / others
Tests
Documentation (man/*.Rd)
Miscellaneous
Unit tests added or modified (summary)
inst/tinytest/test_utils_summarization.R
inst/tinytest/test_utils_summarization_prepare.R
inst/tinytest/test_dataProcess.R
Additional tinytests
Coding guidelines / issues observed