
Conversation

@t0mdavid-m (Member) commented May 6, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a new state management system for tracking UI state and versioning.
    • Added advanced data compression and downsampling for heatmap visualizations.
    • Implemented modular component initialization and rendering for improved UI flexibility.
    • Enabled efficient data filtering and caching for interactive visual components.
  • Refactor

    • Unified and simplified rendering logic across viewers, delegating data processing and visualization to centralized functions.
    • Streamlined data parsing and storage workflows for deconvolution and quantification results.
    • Updated internal naming conventions for spectrum components.
    • Replaced manual data fetching and component assembly with centralized rendering calls.
    • Refactored parsing functions to store results via file manager and removed redundant data returns.
    • Reorganized component declarations and constructors for better consistency and state handling.
  • Bug Fixes

    • Improved handling of optional data files and ensured robust data normalization and transformation.
  • Chores

    • Updated configuration to allow larger file uploads.
    • Added scipy dependency for statistical computations and visualization support.
    • Upgraded dependencies for frontend visualization components.
    • Updated Windows build workflow to new app version naming.
  • Documentation

    • Enhanced code clarity and maintainability through improved structure and removal of deprecated decorators.

coderabbitai bot (Contributor) commented May 6, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This update introduces a major refactor of the data rendering and visualization pipeline. It replaces inline data processing and component rendering logic in viewer scripts with a unified, modular approach centered on new render_grid and render_component functions. Data parsing functions are updated for explicit file management and richer intermediate storage. Several new utility modules are introduced for state tracking, compression, initialization, and data updating.
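For orientation, here is a minimal, hypothetical sketch of how a viewer script might now delegate to render_grid; the argument names below are inferred from the sequence diagram and change summaries, not from the verified function signature.

import streamlit as st
from src.render.render import render_grid

# Hypothetical viewer snippet: the layout, selected experiment, and file manager
# come from the surrounding page; the names here are assumptions for illustration.
layout = st.session_state.get('saved_layout_setting', DEFAULT_LAYOUT)
render_grid(
    selected_data=selected_experiment,  # dataset chosen by the user
    layout=layout,                      # grid of component names to render
    file_manager=file_manager,          # central store for parsed results
    tool='FLASHDeconv',
    identifier='main_grid',             # unique key per grid instance
)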

Changes

  • .streamlit/config.toml: Added maxUploadSize = 1000 to configure Streamlit's maximum upload size.
  • content/FLASHDeconv/FLASHDeconvViewer.py, content/FLASHQuant/FLASHQuantViewer.py, content/FLASHTnT/FLASHTnTViewer.py: Refactored to remove manual data processing and component instantiation; now use render_grid for rendering. Removed sendDataToJS and related helpers. Simplified layout logic.
  • content/FLASHTnT/FLASHTnTLayoutManager.py: Changed spectrum component name from 'deconv_spectrum' to 'combined_spectrum' in COMPONENT_NAMES.
  • openms-streamlit-vue-component: Updated subproject commit reference.
  • src/Workflow.py: Changed parsing logic to pass file manager and logger explicitly; unified result fetching and storage. Adjusted imports and removed redundant result storing.
  • src/parse/deconv.py: Refactored parseDeconv to accept file manager, dataset ID, and logger; now stores all outputs via file manager and computes additional derived data. Added ECDF calculation helper.
  • src/parse/masstable.py: Added optional logger to parseFLASHDeconvOutput; removed Streamlit caching decorators from parsing functions.
  • src/parse/quant.py: Fixed imports to use src.parse.flashquant; no logic changes.
  • src/parse/tnt.py: Refactored parseTnT to use file manager for storage, added normalization and data enrichment, and removed return value. Now accepts file manager, dataset ID, and logger.
  • src/render/StateTracker.py: New module introducing StateTracker for versioned state management.
  • src/render/components.py: Refactored component declaration to use session state caching; unified constructors to accept titles; removed flash_viewer_grid_component function.
  • src/render/compression.py: New module with compute_compression_levels and downsample_heatmap for data compression and downsampling.
  • src/render/initialize.py: New module with initialize_data to prepare data and components based on component name and tool.
  • src/render/render.py: New module with render_component and render_grid to manage stateful rendering and grid layouts in Streamlit.
  • src/render/update.py: New module with functions for data updating, filtering, and caching for visualization components.
  • src/render/util.py: New module adding hash_complex for hashing complex objects using pickle and SHA-256.
  • environment.yml, requirements.txt: Added scipy>=1.15 dependency to environment and requirements.
  • .github/workflows/build-windows-executable-app.yaml: Updated APP_NAME environment variable from "FLASHApp-0.50" to "FLASHApp-0.70".

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant StreamlitApp
    participant render_grid
    participant initialize_data
    participant FileManager
    participant StateTracker

    User->>StreamlitApp: Interacts/selects experiment/layout
    StreamlitApp->>render_grid: Calls with selected_data, layout, file_manager, tool, identifier
    render_grid->>initialize_data: For each component, prepare data and component
    initialize_data->>FileManager: Fetches required datasets
    initialize_data-->>render_grid: Returns data, components, additional_data
    render_grid->>StateTracker: Retrieve or update state for component
    render_grid->>StreamlitApp: Renders component in grid layout
    Note over StreamlitApp,User: User interacts, state and data update via StateTracker

Poem

In the warren of code, a new grid appears,
Where data and views are managed by peers.
With stateful new trackers and compression so neat,
Our rabbit devs hop to a modular beat.
No more tangled carrots—just streamlined delight,
As components and data now render just right!
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 529e872 and 9301a8e.

📒 Files selected for processing (1)
  • .github/workflows/build-windows-executable-app.yaml (1 hunks)


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 13

🔭 Outside diff range comments (2)
content/FLASHTnT/FLASHTnTViewer.py (1)

18-25: ⚠️ Potential issue

Guard against missing keys in st.session_state

st.session_state[...] raises a KeyError if the key is absent.
During the first page load (or after a browser refresh) the dynamic keys
selected_experiment_dropdown_{idx}_tagger are not populated yet, so the
loop can crash before the user even interacts with the UI.

-        for exp_index in range(1, len(st.session_state["saved_layout_setting_tagger"])):
-            if st.session_state[f'selected_experiment_dropdown_{exp_index}_tagger'] is None:
+        for exp_index in range(1, len(st.session_state["saved_layout_setting_tagger"])):
+            dropdown_key = f'selected_experiment_dropdown_{exp_index}_tagger'
+            if st.session_state.get(dropdown_key) is None:
                 continue
-            st.session_state[f"selected_experiment{exp_index}_tagger"] = st.session_state[f'selected_experiment_dropdown_{exp_index}_tagger']
+            st.session_state[f"selected_experiment{exp_index}_tagger"] = st.session_state[dropdown_key]

Using .get() (or an in check) keeps the page responsive and avoids hard-to-trace
state crashes.

src/render/components.py (1)

39-46: ⚠️ Potential issue

Attribute naming mismatch breaks front-end expectations

PlotlyHeatmap defines showLegend (camelCase) at class level but
initialises self.show_legend (snake_case) in __init__. If the
frontend expects showLegend, the prop will be missing.

-        self.show_legend = show_legend
+        self.showLegend = show_legend

Aligning the attribute name prevents silent UI glitches.

🧹 Nitpick comments (18)
.streamlit/config.toml (1)

6-6: Clarify the unit for maxUploadSize.
The new maxUploadSize setting isn’t explicit about whether the value is in megabytes or bytes. Please verify the correct unit in the Streamlit server config docs and consider adding an inline comment, for example:

maxUploadSize = 1000  # in MB
src/render/util.py (1)

1-6: New utility function for object hashing.

The hash_complex function provides a way to generate consistent hash keys for complex data structures, which is essential for state tracking and efficient UI updates in the new rendering system.

However, be aware that using pickle for serialization can be a security concern if you ever process untrusted data.

For improved security, consider using a safer serialization method like JSON for data that could potentially come from untrusted sources:

-import pickle
+import json
 import hashlib

 def hash_complex(d):
-    serialized = pickle.dumps(d)
+    try:
+        serialized = json.dumps(d, sort_keys=True).encode('utf-8')
+    except (TypeError, ValueError):
+        # Fallback to pickle only for trusted internal data
+        import pickle
+        serialized = pickle.dumps(d)
     return hashlib.sha256(serialized).hexdigest()
src/parse/masstable.py (2)

8-8: logger parameter is unused – decide whether to wire it in or drop it

parseFLASHDeconvOutput now accepts a logger but never references it. Keeping dead-code parameters is confusing and invites drift between the call-site and implementation. Either:

  1. Integrate the logger (e.g. logger.debug(...) for the heavy loop below), or
  2. Remove the parameter until it is genuinely required.

A quick fix, if you do not need logging yet:

-def parseFLASHDeconvOutput(annotated, deconvolved, logger=None):
+def parseFLASHDeconvOutput(annotated, deconvolved):

32-32: Unused loop index i – remove to prevent shadowing and lint noise

enumerate introduces i, but the variable is never used. Using a plain zip(...) is cleaner and avoids accidentally shadowing the inner-loop variable named i a few lines later.

-    for i, (spec, aspec) in enumerate(zip(deconvolved_exp, annotated_exp)):
+    for spec, aspec in zip(deconvolved_exp, annotated_exp):
src/render/StateTracker.py (2)

30-33: Minor: key in dict is faster than key in dict.keys()

Static-analysis hint SIM118 is valid here.

-                k: newState[k] for k in newState.keys() 
+                k: working_state[k] for k in working_state
🧰 Tools
🪛 Ruff (0.8.2)

30-30: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


44-47: Return the boolean directly

You can collapse the final if/else, as flagged by SIM103:

-        if modified:
-            return True
-        else:
-            return False
+        return bool(modified)
🧰 Tools
🪛 Ruff (0.8.2)

44-47: Return the condition bool(modified) directly

Replace with return bool(modified)

(SIM103)

content/FLASHDeconv/FLASHDeconvViewer.py (1)

11-12: Appending to DEFAULT_LAYOUT mutates the list between reruns

Because Streamlit re-executes the script on every interaction, the line

DEFAULT_LAYOUT = DEFAULT_LAYOUT + [['sequence_view']]

creates a new list each run, but if any other module imports this symbol it may already hold the original list without sequence_view. If you intend the layout to be dynamic, rename the variable (e.g. default_layout) and build it fresh each time; otherwise define two constants and pick one.
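A minimal sketch of the two-constants variant, assuming the layout entries used elsewhere in this PR (component names are illustrative only):

BASE_LAYOUT = [['scan_table'], ['combined_spectrum']]
LAYOUT_WITH_SEQUENCE = BASE_LAYOUT + [['sequence_view']]

def current_layout(input_sequence):
    # Recomputed on every Streamlit rerun; neither module-level list is mutated.
    return LAYOUT_WITH_SEQUENCE if input_sequence else BASE_LAYOUT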

src/parse/tnt.py (3)

38-52: Inefficient & verbose dictionary access + loop variable shadowing

  1. new_tag_df.keys() is called repeatedly – use direct iteration (for col in new_tag_df:).
  2. The inner loop re-uses the outer loop variable i, hiding it and triggering Ruff B007.
-for i, row in tag_df.iterrows():
+for row_idx, row in tag_df.iterrows():
     ...
-    for i in range(len(sequence)):
+    for pos in range(len(sequence)):
         ...

This removes the shadowing and the unnecessary .keys() calls.

🧰 Tools
🪛 Ruff (0.8.2)

38-38: Loop control variable i not used within loop body

(B007)


44-44: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


50-50: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


124-131: Rename ambiguous variable l (E741) for clarity

Using single-letter l is hard to read and was flagged by Ruff.

-            } for s, e, m, l in zip(mod_starts, mod_ends, mod_masses, mod_labels)
+            } for s, e, m, label in zip(mod_starts, mod_ends, mod_masses, mod_labels)

Also update the key in the dict comprehension accordingly ('labels': label).

🧰 Tools
🪛 Ruff (0.8.2)

130-130: Ambiguous variable name: l

(E741)


75-88: Coverage computation is O(N²); consider vectorisation for large proteins

Iterating over every residue for every protein will become a hotspot for long sequences and large datasets.

A simple vectorised alternative:

mask = (tag_df['ProteinIndex'] == pid)
starts = tag_df.loc[mask, 'StartPos'].to_numpy()
ends   = tag_df.loc[mask, 'EndPos'].to_numpy()

positions = np.arange(len(sequence))[:, None]
coverage = (
    (positions >= starts) & (positions <= ends)
).sum(axis=1).astype(float)

Not blocking, but worth addressing for scalability.

src/render/components.py (1)

12-15: _component_func global is never used – remove to avoid confusion

The function stores the component in st.session_state but keeps an
unused module-level _component_func. Dropping the global eliminates
dead state and clarifies the single source of truth.

src/render/initialize.py (2)

13-18: Component name / dataset key spelling inconsistency

comp_name == 'ms1_deconv_heat_map' (extra underscore) retrieves the
dataset 'ms1_deconv_heatmap' (no underscore).
The mismatch is harmless now but easily leads to copy-paste errors and
UI/readability confusion. Pick one spelling and use it consistently
across the codebase and frontend.


8-125: Monster initialize_data – consider splitting by component type

The 120-line if/elif chain is hard to read, fragile to edit, and
impossible to unit-test granularly. A registry/dictionary approach keeps
logic flat and extendable:

COMPONENT_BUILDERS = {
    'scan_table': (
        lambda fm, ds, tool: {'per_scan_data': fm.get_results(ds, ['scan_table'])['scan_table']},
        lambda: Tabulator('ScanTable')
    ),
    ...
}

def initialize_data(comp_name, selected_data, file_manager, tool):
    data_fn, comp_fn = COMPONENT_BUILDERS[comp_name]
    data_to_send = {}
    additional_data = {'dataset': selected_data}
    data_to_send.update(data_fn(file_manager, selected_data, tool))
    components = [[FlashViewerComponent(comp_fn())]]
    return data_to_send, components, additional_data

This adheres to the Open/Closed principle and makes adding a new
component a one-liner.

src/render/compression.py (1)

1-2: Remove unused pandas import.

pandas is imported but never referenced. Apart from the wasted import time, Ruff already flags this as F401.
Simply drop the line – downsample_heatmap receives a pd.DataFrame, but the function never calls pd.* APIs.

-import pandas as pd
🧰 Tools
🪛 Ruff (0.8.2)

2-2: pandas imported but unused

Remove unused import: pandas

(F401)

src/render/update.py (3)

11-18: Unused parameters and divergent return type.

render_heatmap(…, dataset_name, component_name) never consumes
dataset_name or component_name. Readers will assume these matter for the
computation or Streamlit cache key (they do not), and the extra arguments get
propagated throughout the call stack.

-def render_heatmap(full_data, selection, dataset_name, component_name):
+def render_heatmap(full_data, selection):

Remember to update the two call-sites below.
This also shortens the cache key (Streamlit includes arg values).


20-33: relevant_data can remain uninitialised if the for-loop never executes.

Although full_data is expected to be non-empty, guarding for corrupt input
prevents UnboundLocalError.

if not full_data:
    return pd.DataFrame(columns=['rt', 'mass', 'intensity'])

A lightweight check above the for dataset in full_data: line improves
defensive robustness.


80-88: Flatten nested if for readability and to satisfy Ruff SIM102.

-    elif (component == 'Deconvolved MS1 Heatmap'):
-        if 'heatmap_deconv' in selection_store:
+    elif component == 'Deconvolved MS1 Heatmap' and 'heatmap_deconv' in selection_store:

Same for the raw heatmap branch below. Behaviour is identical; intent is
clearer.

🧰 Tools
🪛 Ruff (0.8.2)

87-88: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

src/render/render.py (1)

63-78: Variable leakage: default points to the last loop value.

Inside the post-initialisation block you reuse the name default
from the preceding for … in zip() loop. After that loop finishes,
default holds default_state, not default_data – relying on this
implicit behaviour is brittle and obscures intent.

-        st.session_state['state_tracker'][tool][identifier] = default
+        st.session_state['state_tracker'][tool][identifier] = StateTracker()

Likewise, consider assigning default_plot_data and
default_tracker variables to avoid shadowing.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f38be28 and 494611a.

⛔ Files ignored due to path filters (3)
  • js-component/dist/assets/index-422c6bea.css is excluded by !**/dist/**
  • js-component/dist/assets/index-c2f12ebe.js is excluded by !**/dist/**
  • js-component/dist/index.html is excluded by !**/dist/**
📒 Files selected for processing (18)
  • .streamlit/config.toml (1 hunks)
  • content/FLASHDeconv/FLASHDeconvViewer.py (3 hunks)
  • content/FLASHQuant/FLASHQuantViewer.py (2 hunks)
  • content/FLASHTnT/FLASHTnTLayoutManager.py (1 hunks)
  • content/FLASHTnT/FLASHTnTViewer.py (4 hunks)
  • openms-streamlit-vue-component (1 hunks)
  • src/Workflow.py (5 hunks)
  • src/parse/deconv.py (1 hunks)
  • src/parse/masstable.py (2 hunks)
  • src/parse/quant.py (1 hunks)
  • src/parse/tnt.py (1 hunks)
  • src/render/StateTracker.py (1 hunks)
  • src/render/components.py (3 hunks)
  • src/render/compression.py (1 hunks)
  • src/render/initialize.py (1 hunks)
  • src/render/render.py (1 hunks)
  • src/render/update.py (1 hunks)
  • src/render/util.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
src/parse/quant.py (1)
src/parse/flashquant.py (2)
  • parseFLASHQuantOutput (7-32)
  • connectTraceWithResult (36-49)
content/FLASHDeconv/FLASHDeconvViewer.py (1)
src/render/render.py (1)
  • render_grid (56-107)
src/Workflow.py (3)
src/parse/tnt.py (1)
  • parseTnT (15-157)
src/parse/deconv.py (1)
  • parseDeconv (8-110)
src/workflow/FileManager.py (2)
  • get_results (403-445)
  • result_exists (447-465)
src/render/render.py (5)
src/render/util.py (1)
  • hash_complex (4-6)
src/render/StateTracker.py (3)
  • StateTracker (3-55)
  • getState (49-55)
  • updateState (10-47)
src/render/initialize.py (1)
  • initialize_data (8-125)
src/render/update.py (2)
  • update_data (46-60)
  • filter_data (63-118)
src/render/components.py (1)
  • get_component_function (13-28)
src/render/update.py (2)
src/render/compression.py (1)
  • downsample_heatmap (18-56)
src/render/sequence.py (2)
  • getFragmentDataFromSeq (87-145)
  • getInternalFragmentDataFromSeq (258-272)
🪛 Ruff (0.8.2)
src/render/StateTracker.py

30-30: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


44-47: Return the condition bool(modified) directly

Replace with return bool(modified)

(SIM103)

src/render/update.py

87-88: Use a single if statement instead of nested if statements

Combine if statements using and

(SIM102)

src/render/compression.py

2-2: pandas imported but unused

Remove unused import: pandas

(F401)

src/parse/tnt.py

38-38: Loop control variable i not used within loop body

(B007)


44-44: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


50-50: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


130-130: Ambiguous variable name: l

(E741)

🔇 Additional comments (8)
openms-streamlit-vue-component (1)

1-1: Verify updated submodule commit
The submodule pointer has been bumped to 2185551476438398ca3ce8e33d3af018aeebb1f9. Please confirm that this commit exists in the openms-streamlit-vue-component repo and includes the Vue component changes required by the new rendering framework. Also ensure CI workflows properly initialize and update this submodule.

src/parse/quant.py (1)

1-2: Import statements updated to align with project restructuring.

The imports have been moved from src.flashquant to src.parse.flashquant, which aligns with the broader code reorganization happening in the parsing modules. This change improves the code organization by keeping parsing-related functionality in the dedicated parse package.

content/FLASHQuant/FLASHQuantViewer.py (2)

7-8: Updated imports for the new rendering system.

The change from specific component imports to the centralized render_grid function aligns with the new modular rendering approach being implemented.


37-40: Simplified UI rendering with centralized render_grid function.

The implementation now uses the centralized render_grid function instead of explicit component initialization and rendering. This approach is more maintainable and consistent with the new architecture.

src/render/StateTracker.py (1)

49-55: Do not expose internal _id in public state

getState currently leaks the tracker's private identifier. External code should not be able to spoof another tracker so easily. Consider omitting _id or renaming it to something clearly internal (e.g. tracker_id) and documenting the expected usage contract.
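A minimal sketch of a getState that keeps the identifier internal (attribute names are assumptions about the class layout):

def getState(self):
    # Return a copy of the tracked values only; the private tracker id
    # stays inside the object instead of being handed to callers.
    return dict(self.currentState)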

src/Workflow.py (1)

234-261: Ensure files remain accessible after rmtree

parseDeconv / parseTnT read the paths returned by file_manager.get_results.
If file_manager.store_file() merely records the original path (and does not copy the file into the cache), the subsequent rmtree(folder_path) (line 264) will remove those files, breaking later retrieval/download functionality.

Please verify that store_file copies or hard-links the file into a persistent cache directory; otherwise copy the files manually before deleting the temp folder.
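If store_file turns out to only record the original path, one defensive option is to copy the result files into a persistent location before the temporary folder is removed. A rough sketch (function and path names are hypothetical):

import shutil
from pathlib import Path

def persist_results(result_paths, cache_dir, temp_folder):
    # Copy each referenced file into a persistent cache directory,
    # then delete the temporary workflow folder.
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    persisted = {}
    for key, src in result_paths.items():
        dst = cache_dir / Path(src).name
        shutil.copy2(src, dst)
        persisted[key] = dst
    shutil.rmtree(temp_folder)
    return persisted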

content/FLASHTnT/FLASHTnTViewer.py (1)

48-53: Index lookup may crash when a previously-selected experiment disappears

name_to_index[st.session_state.selected_experiment0_tagger]
will raise a KeyError when the experiment that was in the user’s
session state is no longer present in results
(e.g. file deleted or workspace switched).
Consider a safer default:

-    index=name_to_index[st.session_state.selected_experiment0_tagger] if 'selected_experiment0_tagger' in st.session_state else None,
+    index=name_to_index.get(
+        st.session_state.get('selected_experiment0_tagger', ''),
+        0  # fallback to first element
+    ),

This prevents an unhandled exception and keeps the UI usable.

src/render/update.py (1)

9-18: hash_funcs forces cache collisions – confirm that this is intentional.

Returning a constant (lambda _: 1) makes st.cache_data treat all
DataFrames as identical → different datasets can be mis-served from the same
cache entry. If the goal is “ignore raw data size but keep per-call
uniqueness”, consider hashing cheap metadata instead:

hash_funcs = {pd.DataFrame: lambda df: (df.shape, int(df.memory_usage().sum()))}

Otherwise callers might receive stale heatmaps after switching files.

 'internal_fragment_map',
 'tag_table',
-'deconv_spectrum'
+'combined_spectrum'

💡 Verification agent

🧩 Analysis chain

Component name updated from 'deconv_spectrum' to 'combined_spectrum'.

This naming change aligns with the unified data rendering approach being implemented across the visualization modules. Ensure that all references to this component in other parts of the codebase (such as templates, component initialization logic, and documentation) have been updated consistently.


🏁 Script executed:

#!/bin/bash
# Verify all references to the spectrum component have been updated consistently
echo "Searching for any remaining references to 'deconv_spectrum'..."
rg "deconv_spectrum" --type py

Length of output: 1091


Missing references to rename ‘deconv_spectrum’ to ‘combined_spectrum’

The search uncovered several remaining usages of the old component name that must be updated:

• src/render/initialize.py
elif comp_name == 'deconv_spectrum':
file_manager.get_results(..., ['deconv_spectrum'])
data_to_send['per_scan_data'] = data['deconv_spectrum']
• src/parse/deconv.py
file_manager.store_data(dataset_id, 'deconv_spectrum', deconv_spectrum)
– References in concatenation, renaming, and list literals
• content/FLASHDeconv/FLASHDeconvViewer.py
– Initialization lists including ‘deconv_spectrum’
• content/FLASHDeconv/FLASHDeconvLayoutManager.py
– LayoutManager component list entry

Please update all of these to use 'combined_spectrum' (including import/initialization logic, data-store keys, and template references) so that the component name is consistent across the codebase.

Comment on lines +42 to +46
# # Get data
# quant_df = file_manager.get_results(selected_exp0, 'quant_dfs')['quant_dfs']

component = [[FlashViewerComponent(FLASHQuant())]]
flash_viewer_grid_component(components=component, data={'quant_data': quant_df, 'dataset': selected_exp0}, component_key='flash_viewer_grid')
# component = [[FlashViewerComponent(FLASHQuant())]]
# flash_viewer_grid_component(components=component, data={'quant_data': quant_df, 'dataset': selected_exp0}, component_key='flash_viewer_grid')

🛠️ Refactor suggestion

Remove commented-out code.

Since the refactoring to use the new rendering system is now complete, the commented-out code for manual data retrieval and component initialization should be removed.

-# # Get data
-# quant_df = file_manager.get_results(selected_exp0, 'quant_dfs')['quant_dfs']

-# component = [[FlashViewerComponent(FLASHQuant())]]
-# flash_viewer_grid_component(components=component, data={'quant_data': quant_df, 'dataset': selected_exp0}, component_key='flash_viewer_grid')

Comment on lines +18 to +27
# Extract counter
counter = newState.pop('counter')

# We always take previously undefined keys
for k, v in newState.items():
if k not in self.currentState:
self.currentState[k] = v
modified = True

# We only accept conflicts for new states

🛠️ Refactor suggestion

Avoid mutating caller’s dict & filter out meta-fields

pop('counter') mutates the input dict in-place. If the caller reuses the same object, their view of the data silently changes. Copy the dict first and strip meta-fields explicitly:

-        counter = newState.pop('counter')
+        # Work on a shallow copy so that the caller’s object is untouched
+        working_state = newState.copy()
+        counter = working_state.pop('counter', -1)
+        # Remove meta keys we never persist
+        working_state.pop('id', None)

Then use working_state below.
This also stops the internal state from being polluted with the id field.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +12 to +14
if newState['id'] != self.id:
return False


🛠️ Refactor suggestion

Gracefully handle missing metadata & avoid shadowing built-ins

newState['id'] will raise a KeyError if the caller forgets to include the field.
Also, naming the attribute id shadows Python’s built-in id() function and can be confusing in debuggers.

-        if newState['id'] != self.id:
+        tracker_id = newState.get('id')
+        if tracker_id is None or tracker_id != self._id:
             return False

Don’t forget to rename the instance attribute in __init__:

-        self.id = np.random.random()
+        self._id = np.random.random()

Comment on lines +112 to +125

def ecdf(df):
target_qscores = df[df['TargetDecoyType'] == 0]['Qscore']
decoy_qscores = df[df['TargetDecoyType'] > 0]['Qscore']

ecdf_target = pd.DataFrame({
'x' : np.sort(target_qscores),
'y' : np.arange(1, len(target_qscores) + 1) / len(target_qscores)
})
ecdf_decoy = pd.DataFrame({
'x' : np.sort(decoy_qscores),
'y' : np.arange(1, len(decoy_qscores) + 1) / len(decoy_qscores)
})
return ecdf_target, ecdf_decoy

⚠️ Potential issue

ecdf can divide by zero when no target/decoy spectra are present

When either target_qscores or decoy_qscores is empty, the denominator becomes 0, yielding NaN / Inf.

-def ecdf(df):
+def ecdf(df):
     ...
-    ecdf_target = pd.DataFrame({
-        'x' : np.sort(target_qscores),
-        'y' : np.arange(1, len(target_qscores) + 1) / len(target_qscores)
-    })
+    if len(target_qscores) == 0:
+        ecdf_target = pd.DataFrame({'x': [], 'y': []})
+    else:
+        ecdf_target = pd.DataFrame({
+            'x': np.sort(target_qscores),
+            'y': np.arange(1, len(target_qscores) + 1) / len(target_qscores)
+        })
     ...

Apply the same guard for decoy_qscores.


Comment on lines +21 to +28
cached_compression_levels = []
for size in compute_compression_levels(20000, len(data_full)):
cached_compression_levels.append(
file_manager.get_results(
selected_data, [f'ms1_deconv_heatmap_{size}']
)[f'ms1_deconv_heatmap_{size}']
)
cached_compression_levels.append(data_full)

🛠️ Refactor suggestion

I/O-heavy loop – fetch compressed heatmaps in a single call

Calling file_manager.get_results inside the loop performs N separate
disk/IO operations. Most FileManager back-ends support batched
fetches; requesting all keys at once is noticeably faster for large
projects.

-        cached_compression_levels = []
-        for size in compute_compression_levels(20000, len(data_full)):
-            cached_compression_levels.append(
-                file_manager.get_results(
-                    selected_data,  [f'ms1_deconv_heatmap_{size}']
-                )[f'ms1_deconv_heatmap_{size}']
-            )
+        keys = [f"ms1_deconv_heatmap_{s}" for s in compute_compression_levels(20000, len(data_full))]
+        cached = file_manager.get_results(selected_data, keys)
+        cached_compression_levels = [cached[k] for k in keys]

This reduces disk seeks / S3 requests and makes page load more
responsive.


Comment on lines +35 to +52
# Compute maximum amount of peaks per bin that does not exceed limit
counted_peaks = 0
max_peaks_per_bin = -1
new_count = 0
while ((counted_peaks + new_count) < max_datapoints):
# commit prev result
max_peaks_per_bin += 1
counted_peaks += new_count
# compute count for next value
new_count = np.sum(count.flatten() >= (max_peaks_per_bin + 1))

if counted_peaks >= len(data):
break


data = data.groupby(
['mass_bin', 'rt_bin'], group_keys=False, sort=False
).head(max_peaks_per_bin).reset_index(drop=True)

⚠️ Potential issue

Off-by-one: downsample_heatmap can return an empty DataFrame.

When every bin already exceeds max_datapoints, the first while iteration sets
max_peaks_per_bin → 0, so the final groupby(...).head(0) yields nothing.
The practical expectation is to keep at least one peak per bin capped by
max_datapoints.

-    max_peaks_per_bin = -1
+    max_peaks_per_bin = 0          # keep at least one per bin
 ...
-    while ((counted_peaks + new_count) < max_datapoints):
+    while counted_peaks < max_datapoints:
         # commit prev result
-        max_peaks_per_bin += 1
+        if (counted_peaks + new_count) >= max_datapoints:
+            break
+        max_peaks_per_bin += 1

Also consider short-circuiting by checking len(data) <= max_datapoints at
the top – it avoids all binning work for small datasets.


Comment on lines +7 to +15
def compute_compression_levels(minimal_size, total_amount, logger=None):
if total_amount <= minimal_size:
return []
levels = np.logspace(
int(np.log10(minimal_size)), int(np.log10(total_amount)),
int(np.log10(total_amount)) - int(np.log10(minimal_size)) + 1,
dtype='int'
)*int(10**(np.log10(minimal_size)%1))
return levels[levels<total_amount]

🛠️ Refactor suggestion

Validate input and simplify compute_compression_levels.

  1. Calling np.log10 on minimal_size <= 0 will raise a runtime warning / nan; guard early.
  2. The arithmetic mixing logspace with an additional *10**(fract_log) step is hard to reason about and can silently create duplicated levels for powers of ten (e.g. 100, 1000).
  3. You can obtain strictly-increasing integer levels with a clearer, branch-free expression.
 def compute_compression_levels(minimal_size, total_amount, logger=None):
-    if total_amount <= minimal_size:
-        return []
+    if minimal_size <= 0:
+        raise ValueError("`minimal_size` must be > 0.")
+    if total_amount <= minimal_size:
+        return []
 
-    levels = np.logspace(
-        int(np.log10(minimal_size)), int(np.log10(total_amount)), 
-        int(np.log10(total_amount)) - int(np.log10(minimal_size)) + 1,
-        dtype='int'
-    )*int(10**(np.log10(minimal_size)%1))
-    return levels[levels<total_amount]
+    exponents = np.arange(
+        np.floor(np.log10(minimal_size)),
+        np.ceil(np.log10(total_amount)),
+        dtype=int,
+    )
+    levels = (10 ** exponents).astype(int)
+    return levels[levels < total_amount]

Comment on lines +52 to +53
if updated:
st.rerun(scope='app')

⚠️ Potential issue

st.rerun() does not accept a scope argument.

Current Streamlit (≥ 1.30) defines st.rerun() with no parameters.
Passing scope='app' raises TypeError, preventing any state update
from propagating.

-            st.rerun(scope='app')
+            st.rerun()

Comment on lines +37 to +39
# Hash updated. filtered data
data['hash'] = hash_complex(data)


🛠️ Refactor suggestion

Hashing full DataFrames is extremely expensive.

hash_complex pickles the whole data dict → 20k-row heatmaps or large spectra easily reach multiple MB, causing noticeable latency on every render.
Instead, hash lightweight, immutable identifiers:

-    data['hash'] = hash_complex(data)
+    data['hash'] = hash_complex({
+        k: v for k, v in data.items()
+        if not isinstance(v, pd.DataFrame)
+    })

or compute and store data_version counters during updates.
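The version-counter alternative could look roughly like this; the helper and key names are hypothetical and only illustrate comparing a cheap integer instead of re-hashing DataFrames:

def bump_version(data, new_df):
    # Swap in the new payload and advance a lightweight counter that
    # downstream consumers can compare instead of hashing the DataFrame.
    data['filtered'] = new_df
    data['data_version'] = data.get('data_version', 0) + 1
    data['hash'] = data['data_version']
    return data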


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/parse/tnt.py (2)

40-55: Fix static analysis issues in the tag dataframe linearization code.

There are a few minor issues in this loop:

  1. Unused loop control variable i in line 41
  2. Unnecessarily using .keys() in the dictionary membership tests
- new_tag_df = {c : [] for c in tag_df.columns}
- for i, row in tag_df.iterrows():
+ new_tag_df = {c: [] for c in tag_df.columns}
+ for _, row in tag_df.iterrows():
    # No splitting if it is not recognized as string
    if pd.isna(row['ProteoformIndex']):
        row['ProteoformIndex'] = -1
    if isinstance(row['ProteoformIndex'], str) and (';' in row['ProteoformIndex']):
        no_items = row['ProteoformIndex'].count(';') + 1
-       for c in new_tag_df.keys():
+       for c in new_tag_df:
            if (isinstance(row[c], str)) and (';' in row[c]):
                new_tag_df[c] += row[c].split(';')
            else:
                new_tag_df[c] += [row[c]]*no_items
    else:
-       for c in new_tag_df.keys():
+       for c in new_tag_df:
            new_tag_df[c].append(row[c])
🧰 Tools
🪛 Ruff (0.8.2)

41-41: Loop control variable i not used within loop body

(B007)


47-47: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


53-53: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


133-133: Avoid using ambiguous variable names.

The single-letter variable l is ambiguous and could be confused with the number 1 or capital I.

- } for s, e, m, l in zip(mod_starts, mod_ends, mod_masses, mod_labels)
+ } for s, e, m, label in zip(mod_starts, mod_ends, mod_masses, mod_labels)
🧰 Tools
🪛 Ruff (0.8.2)

133-133: Ambiguous variable name: l

(E741)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 494611a and 25769f9.

📒 Files selected for processing (5)
  • content/FLASHDeconv/FLASHDeconvViewer.py (5 hunks)
  • content/FLASHTnT/FLASHTnTLayoutManager.py (1 hunks)
  • content/FLASHTnT/FLASHTnTViewer.py (6 hunks)
  • src/parse/tnt.py (1 hunks)
  • src/render/components.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • content/FLASHTnT/FLASHTnTLayoutManager.py
  • src/render/components.py
🧰 Additional context used
🪛 Ruff (0.8.2)
src/parse/tnt.py

41-41: Loop control variable i not used within loop body

(B007)


47-47: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


53-53: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


133-133: Ambiguous variable name: l

(E741)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-full-app
  • GitHub Check: build-openms
🔇 Additional comments (13)
content/FLASHDeconv/FLASHDeconvViewer.py (6)

7-7: Good architectural change with simplified import structure.

Replacing multiple specific imports with a single render_grid import significantly simplifies this file by delegating rendering and data preparation logic to a centralized function.


11-12: LGTM! Dynamic layout adjustment based on availability of sequence data.

The conditional logic to add 'sequence_view' to DEFAULT_LAYOUT when input_sequence is available is a good enhancement that makes the UI more responsive to the available data.


60-63: LGTM! Simplified rendering logic with proper parameter passing.

Delegating the complex data processing and rendering to the render_grid function improves separation of concerns and code maintainability.


73-77: LGTM! Consistent use of render_grid for the second view.

The side-by-side view implementation correctly passes appropriate parameters to distinguish it from the first view.


90-93: LGTM! Consistent implementation for single experiment view.

The rendering approach is consistent with the other views, maintaining a clean separation of concerns.


111-116: LGTM! Appropriate parameterization for multiple experiment layouts.

The multi-experiment rendering correctly identifies each grid with unique identifiers.

src/parse/tnt.py (1)

15-15: Good function signature update aligns with architectural changes.

Adding file_manager and dataset_id parameters shifts the responsibility of data storage to the caller and makes the data flow more explicit, which is a solid architectural improvement.

content/FLASHTnT/FLASHTnTViewer.py (6)

7-7: Good architectural change with simplified import structure.

Replacing multiple specific imports with a single render_grid import significantly simplifies this file by delegating rendering and data preparation logic to a centralized function.


14-14: LGTM! Appropriate component replacement in the default layout.

Replacing 'deconv_spectrum' with 'combined_spectrum' suggests an improved or more integrated visualization component.


67-67: LGTM! Simplified rendering with proper parameter passing.

Delegating the complex data processing and rendering to the render_grid function improves separation of concerns and code maintainability.


76-76: LGTM! Consistent use of render_grid for the second view with proper grid identifier.

The side-by-side view implementation correctly passes appropriate parameters to distinguish it from the first view.


89-89: LGTM! Consistent implementation for single experiment view.

The rendering approach is consistent with the other views, maintaining a clean separation of concerns.


108-108: LGTM! Appropriate parameterization for multiple experiment layouts.

The multi-experiment rendering correctly identifies each grid with unique identifiers and handles experiment keys properly.

Comment on lines +15 to +160
def parseTnT(file_manager, dataset_id, deconv_mzML, anno_mzML, tag_tsv, protein_tsv, logger=None):

deconv_df, _, tolerance, _, _, = parseFLASHDeconvOutput(
anno_mzML, deconv_mzML
)
tag_df, protein_df = parseFLASHTaggerOutput(tag_tsv, protein_tsv)

# Add tolerance as a new columnn, wasteful but gets the job done..
deconv_df['tol'] = tolerance

return {
'anno_dfs' : anno_df,
'deconv_dfs' : deconv_df,
'tag_dfs' : tag_df,
'protein_dfs' : protein_df
}

# protein_table
protein_df['length'] = protein_df['DatabaseSequence'].apply(lambda x : len(x))
protein_df = protein_df.rename(
columns={
'ProteoformIndex' : 'index',
'ProteinAccession' : 'accession',
'ProteinDescription' : 'description',
'DatabaseSequence' : 'sequence'
}
)
protein_df['description'] = protein_df['description'].apply(
lambda x: x[:50] + '...' if len(x) > 50 else x
)
file_manager.store_data(dataset_id, 'protein_dfs', protein_df)

# tag_table

# Process tag df into a linear data format
new_tag_df = {c : [] for c in tag_df.columns}
for i, row in tag_df.iterrows():
# No splitting if it is not recognized as string
if pd.isna(row['ProteoformIndex']):
row['ProteoformIndex'] = -1
if isinstance(row['ProteoformIndex'], str) and (';' in row['ProteoformIndex']):
no_items = row['ProteoformIndex'].count(';') + 1
for c in new_tag_df.keys():
if (isinstance(row[c], str)) and (';' in row[c]):
new_tag_df[c] += row[c].split(';')
else:
new_tag_df[c] += [row[c]]*no_items
else:
for c in new_tag_df.keys():
new_tag_df[c].append(row[c])
tag_df = pd.DataFrame(new_tag_df)

tsv_buffer = StringIO()
tag_df.to_csv(tsv_buffer, sep='\t', index=False)
tsv_buffer.seek(0)
tag_df = pd.read_csv(tsv_buffer, sep='\t')

# Complete df
tag_df['StartPosition'] = tag_df['StartPosition'] - 1
tag_df['EndPos'] = tag_df['StartPosition'] + tag_df['Length'] - 1
tag_df = tag_df.rename(
columns={
'ProteoformIndex' : 'ProteinIndex',
'DeNovoScore' : 'Score',
'Masses' : 'mzs',
'StartPosition' : 'StartPos'
}
)
file_manager.store_data(dataset_id, 'tag_dfs', tag_df)
# sequence_view & internal_fragment_map
sequence_data = {}
internal_fragment_data = {}
# Compute coverage
for i, row in protein_df.iterrows():
pid = row['index']
sequence = row['sequence']
coverage = np.zeros(len(sequence), dtype='float')
for i in range(len(sequence)):
coverage[i] = np.sum(
(tag_df['ProteinIndex'] == pid) &
(tag_df['StartPos'] <= i) &
(tag_df['EndPos'] >= i)
)
p_cov = np.zeros(len(coverage))
if np.max(coverage) > 0:
p_cov = coverage/np.max(coverage)

proteoform_start = row['StartPosition']
proteoform_end = row['EndPosition']
start_index = 0 if proteoform_start <= 0 else proteoform_start - 1
end_index = len(sequence) - 1 if proteoform_end <= 0 else proteoform_end - 1


if row['ModCount'] > 0:
mod_masses = [float(m) for m in str(row['ModMass']).split(';')]
mod_starts = [int(float(s)) for s in str(row['ModStart']).split(';')]
mod_ends = [int(float(s)) for s in str(row['ModEnd']).split(';')]
if pd.isna(row['ModID']):
mod_labels = [''] * row['ModCount']
else:
mod_labels = [s[:-1].replace(',', '; ') for s in str(row['ModID']).split(';')]
else:
mod_masses = []
mod_starts = []
mod_ends = []
mod_labels = []
modifications = []
for s, e, m in zip(mod_starts, mod_ends, mod_masses):
modifications.append((s-start_index, e-start_index, m))

sequence = str(sequence)
sequence_data[pid] = getFragmentDataFromSeq(
str(sequence)[start_index:end_index+1], p_cov, np.max(coverage),
modifications
)

sequence_data[pid]['sequence'] = list(sequence)
sequence_data[pid]['proteoform_start'] = proteoform_start - 1
sequence_data[pid]['proteoform_end'] = proteoform_end - 1
sequence_data[pid]['computed_mass'] = row['ProteoformMass']
sequence_data[pid]['theoretical_mass'] = remove_ambigious(AASequence.fromString(sequence)).getMonoWeight()
sequence_data[pid]['modifications'] = [
{
# Modfications are zero based
'start' : s - 1,
'end' : e - 1,
'mass_diff' : m,
'labels' : l
} for s, e, m, l in zip(mod_starts, mod_ends, mod_masses, mod_labels)
]

internal_fragment_data[pid] = getInternalFragmentDataFromSeq(
str(sequence)[start_index:end_index+1], modifications
)

file_manager.store_data(dataset_id, 'sequence_data', sequence_data)
file_manager.store_data(
dataset_id, 'internal_fragment_data', internal_fragment_data
)

fragments = ['b', 'y']
if file_manager.result_exists(dataset_id, 'FTnT_parameters_json'):
tnt_settings_file = file_manager.get_results(
dataset_id, ['FTnT_parameters_json']
)['FTnT_parameters_json']
with open(tnt_settings_file, 'r') as f:
tnt_settings = json.load(f)
if 'ion_type' in tnt_settings:
fragments = tnt_settings['ion_type'].split('\n')
settings = {
'tolerance' : tolerance,
'ion_types' : fragments
}
file_manager.store_data(
dataset_id, 'settings', settings
)

🛠️ Refactor suggestion

Consider refactoring the large parseTnT function for better maintainability.

This function handles multiple responsibilities (parsing data, processing tags, computing coverage, generating fragments, reading settings). Consider breaking it down into smaller, focused functions for improved maintainability and testability.

For example, you could extract these logical sections into separate functions:

  1. transform_protein_dataframe(protein_df)
  2. linearize_tag_dataframe(tag_df)
  3. compute_sequence_coverage(protein_df, tag_df)
  4. generate_fragment_data(protein_row, modifications)
  5. load_ion_type_settings(file_manager, dataset_id)

This would make the main function much cleaner and more maintainable:

def parseTnT(file_manager, dataset_id, deconv_mzML, anno_mzML, tag_tsv, protein_tsv, logger=None):
    # Parse input data
    deconv_df, _, tolerance, _, _,  = parseFLASHDeconvOutput(anno_mzML, deconv_mzML)
    tag_df, protein_df = parseFLASHTaggerOutput(tag_tsv, protein_tsv)
    
    # Transform and store protein data
    protein_df = transform_protein_dataframe(protein_df)
    file_manager.store_data(dataset_id, 'protein_dfs', protein_df)
    
    # Transform and store tag data
    tag_df = linearize_tag_dataframe(tag_df)
    file_manager.store_data(dataset_id, 'tag_dfs', tag_df)
    
    # Generate and store sequence and fragment data
    sequence_data, internal_fragment_data = generate_sequence_and_fragment_data(protein_df, tag_df)
    file_manager.store_data(dataset_id, 'sequence_data', sequence_data)
    file_manager.store_data(dataset_id, 'internal_fragment_data', internal_fragment_data)
    
    # Load and store settings
    settings = load_ion_type_settings(file_manager, dataset_id, tolerance)
    file_manager.store_data(dataset_id, 'settings', settings)

This improves readability, testability, and maintainability.

🧰 Tools
🪛 Ruff (0.8.2)

41-41: Loop control variable i not used within loop body

(B007)


47-47: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


53-53: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


133-133: Ambiguous variable name: l

(E741)

@t0mdavid-m t0mdavid-m merged commit 962c7b0 into develop May 6, 2025
0 of 5 checks passed
@t0mdavid-m t0mdavid-m deleted the introduce_fragments_finished branch May 7, 2025 10:20
@coderabbitai coderabbitai bot mentioned this pull request May 25, 2025
This was referenced Sep 15, 2025