add fragments #7
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

This update introduces a major refactor of the data rendering and visualization pipeline. It replaces inline data processing and component rendering logic in viewer scripts with a unified, modular approach centered on new rendering modules under src/render/ (render_grid, initialize_data, StateTracker, and related helpers).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant StreamlitApp
    participant render_grid
    participant initialize_data
    participant FileManager
    participant StateTracker
    User->>StreamlitApp: Interacts/selects experiment/layout
    StreamlitApp->>render_grid: Calls with selected_data, layout, file_manager, tool, identifier
    render_grid->>initialize_data: For each component, prepare data and component
    initialize_data->>FileManager: Fetches required datasets
    initialize_data-->>render_grid: Returns data, components, additional_data
    render_grid->>StateTracker: Retrieve or update state for component
    render_grid->>StreamlitApp: Renders component in grid layout
    Note over StreamlitApp,User: User interacts, state and data update via StateTracker
```
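For orientation, the call below sketches how a viewer script would hand control to the new entry point. The keyword names follow the sequence diagram above; the surrounding variables (selected_experiment, DEFAULT_LAYOUT, file_manager) are stand-ins for what each viewer already defines, not verbatim project code.

```python
# Hypothetical viewer-side call; parameter names mirror the sequence diagram.
from src.render.render import render_grid

render_grid(
    selected_data=selected_experiment,  # dataset chosen in the experiment dropdown
    layout=DEFAULT_LAYOUT,              # grid of component names to render
    file_manager=file_manager,          # workspace FileManager used to fetch results
    tool='tagger',                      # tool namespace for StateTracker bookkeeping
    identifier=0,                       # distinguishes multiple grids on one page
)
```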
Actionable comments posted: 13
🔭 Outside diff range comments (2)
content/FLASHTnT/FLASHTnTViewer.py (1)
18-25: ⚠️ Potential issue — Guard against missing keys in st.session_state

st.session_state[...] raises a KeyError if the key is absent. During the first page load (or after a browser refresh) the dynamic keys selected_experiment_dropdown_{idx}_tagger are not populated yet, so the loop can crash before the user even interacts with the UI.

```diff
- for exp_index in range(1, len(st.session_state["saved_layout_setting_tagger"])):
-     if st.session_state[f'selected_experiment_dropdown_{exp_index}_tagger'] is None:
+ for exp_index in range(1, len(st.session_state["saved_layout_setting_tagger"])):
+     dropdown_key = f'selected_experiment_dropdown_{exp_index}_tagger'
+     if st.session_state.get(dropdown_key) is None:
          continue
-     st.session_state[f"selected_experiment{exp_index}_tagger"] = st.session_state[f'selected_experiment_dropdown_{exp_index}_tagger']
+     st.session_state[f"selected_experiment{exp_index}_tagger"] = st.session_state[dropdown_key]
```

Using .get() (or an in check) keeps the page responsive and avoids hard-to-trace state crashes.

src/render/components.py (1)
39-46: ⚠️ Potential issue — Attribute naming mismatch breaks front-end expectations

PlotlyHeatmap defines showLegend (camelCase) at class level but initialises self.show_legend (snake_case) in __init__. If the frontend expects showLegend, the prop will be missing.

```diff
- self.show_legend = show_legend
+ self.showLegend = show_legend
```

Aligning the attribute name prevents silent UI glitches.
🧹 Nitpick comments (18)
.streamlit/config.toml (1)
6-6: Clarify the unit for maxUploadSize.

The new maxUploadSize setting isn't explicit about whether the value is in megabytes or bytes. Please verify the correct unit in the Streamlit server config docs and consider adding an inline comment, for example:

```toml
maxUploadSize = 1000  # in MB
```

src/render/util.py (1)
1-6: New utility function for object hashing.

The hash_complex function provides a way to generate consistent hash keys for complex data structures, which is essential for state tracking and efficient UI updates in the new rendering system. However, be aware that using pickle for serialization can be a security concern if you ever process untrusted data. For improved security, consider using a safer serialization method like JSON for data that could potentially come from untrusted sources:

```diff
-import pickle
+import json
 import hashlib

 def hash_complex(d):
-    serialized = pickle.dumps(d)
+    try:
+        serialized = json.dumps(d, sort_keys=True).encode('utf-8')
+    except (TypeError, ValueError):
+        # Fallback to pickle only for trusted internal data
+        import pickle
+        serialized = pickle.dumps(d)
     return hashlib.sha256(serialized).hexdigest()
```

src/parse/masstable.py (2)
8-8: logger parameter is unused – decide whether to wire it in or drop it

parseFLASHDeconvOutput now accepts a logger but never references it. Keeping dead-code parameters is confusing and invites drift between the call-site and implementation. Either:

- Integrate the logger (e.g. logger.debug(...) for the heavy loop below), or
- Remove the parameter until it is genuinely required.

A quick fix, if you do not need logging yet:

```diff
-def parseFLASHDeconvOutput(annotated, deconvolved, logger=None):
+def parseFLASHDeconvOutput(annotated, deconvolved):
```
32-32: Unused loop index i – remove to prevent shadowing and lint noise

enumerate introduces i, but the variable is never used. Using a plain zip(...) is cleaner and avoids accidentally shadowing the inner-loop variable named i a few lines later.

```diff
- for i, (spec, aspec) in enumerate(zip(deconvolved_exp, annotated_exp)):
+ for spec, aspec in zip(deconvolved_exp, annotated_exp):
```
src/render/StateTracker.py (2)

30-33: Minor: key in dict is faster than key in dict.keys()

Static-analysis hint SIM118 is valid here.

```diff
- k: newState[k] for k in newState.keys()
+ k: working_state[k] for k in working_state
```

🧰 Tools
🪛 Ruff (0.8.2)
30-30: Use key in dict instead of key in dict.keys(). Remove .keys() (SIM118)
44-47: Return the boolean directly

You can collapse the final if/else, as flagged by SIM103:

```diff
- if modified:
-     return True
- else:
-     return False
+ return bool(modified)
```

🧰 Tools
🪛 Ruff (0.8.2)
44-47: Return the condition bool(modified) directly. Replace with return bool(modified) (SIM103)
content/FLASHDeconv/FLASHDeconvViewer.py (1)
11-12: Appending to DEFAULT_LAYOUT mutates the list between reruns

Because Streamlit re-executes the script on every interaction, the line DEFAULT_LAYOUT = DEFAULT_LAYOUT + [['sequence_view']] creates a new list each run, but if any other module imports this symbol it may already hold the original list without sequence_view. If you intend the layout to be dynamic, rename the variable (e.g. default_layout) and build it fresh each time; otherwise define two constants and pick one.
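A minimal sketch of the rename-and-rebuild option (the constant and flag names here are hypothetical; the real viewer derives the condition from input_sequence):

```python
# Build the layout fresh on every rerun instead of mutating a module-level constant.
BASE_LAYOUT = [['scan_table'], ['combined_spectrum']]  # stand-in for the viewer's default rows

def build_layout(has_input_sequence: bool):
    layout = [row[:] for row in BASE_LAYOUT]  # copy, so the constant is never mutated
    if has_input_sequence:
        layout.append(['sequence_view'])
    return layout
```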
38-52: Inefficient & verbose dictionary access + loop variable shadowing
new_tag_df.keys()is called repeatedly – use direct iteration (for col in new_tag_df:).- The inner loop re-uses the outer loop variable
i, hiding it and triggering Ruff B007.-for i, row in tag_df.iterrows(): +for row_idx, row in tag_df.iterrows(): ... - for i in range(len(sequence)): + for pos in range(len(sequence)): ...This removes the shadowing and the unnecessary
.keys()calls.🧰 Tools
🪛 Ruff (0.8.2)
38-38: Loop control variable
inot used within loop body(B007)
44-44: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
50-50: Use
key in dictinstead ofkey in dict.keys()Remove
.keys()(SIM118)
124-131: Rename ambiguous variable l (E741) for clarity

Using single-letter l is hard to read and was flagged by Ruff.

```diff
- } for s, e, m, l in zip(mod_starts, mod_ends, mod_masses, mod_labels)
+ } for s, e, m, label in zip(mod_starts, mod_ends, mod_masses, mod_labels)
```

Also update the key in the dict comprehension accordingly ('labels': label).

🧰 Tools
🪛 Ruff (0.8.2)
130-130: Ambiguous variable name: l (E741)
75-88: Coverage computation is O(N²); consider vectorisation for large proteins

Iterating over every residue for every protein will become a hotspot for long sequences and large datasets. A simple vectorised alternative:

```python
mask = (tag_df['ProteinIndex'] == pid)
starts = tag_df.loc[mask, 'StartPos'].to_numpy()
ends = tag_df.loc[mask, 'EndPos'].to_numpy()
# Boolean matrix of residue position x tag; summing over tags gives per-residue coverage
positions = np.arange(len(sequence))[:, None]
coverage = ((positions >= starts) & (positions <= ends)).sum(axis=1).astype(float)
```

Not blocking, but worth addressing for scalability.
src/render/components.py (1)
12-15: _component_func global is never used – remove to avoid confusion

The function stores the component in st.session_state but keeps an unused module-level _component_func. Dropping the global eliminates dead state and clarifies the single source of truth.
src/render/initialize.py (2)

13-18: Component name / dataset key spelling inconsistency

comp_name == 'ms1_deconv_heat_map' (extra underscore) retrieves the dataset 'ms1_deconv_heatmap' (no underscore). The mismatch is harmless now but easily leads to copy-paste errors and UI/readability confusion. Pick one spelling and use it consistently across the codebase and frontend.
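One lightweight way to keep the two spellings from drifting is a shared constant; this is only a sketch and the constant name is hypothetical:

```python
# Single source of truth for the heatmap key used by both the component check and the dataset lookup
MS1_DECONV_HEATMAP = 'ms1_deconv_heatmap'

# e.g. inside initialize_data(...):
# if comp_name == MS1_DECONV_HEATMAP:
#     data = file_manager.get_results(selected_data, [MS1_DECONV_HEATMAP])
```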
8-125: Monster initialize_data – consider splitting by component type

The 120-line if/elif chain is hard to read, fragile to edit, and impossible to unit-test granularly. A registry/dictionary approach keeps logic flat and extendable:

```python
COMPONENT_BUILDERS = {
    'scan_table': (
        lambda fm, ds, tool: {'per_scan_data': fm.get_results(ds, ['scan_table'])['scan_table']},
        lambda: Tabulator('ScanTable')
    ),
    ...
}

def initialize_data(comp_name, selected_data, file_manager, tool):
    data_fn, comp_fn = COMPONENT_BUILDERS[comp_name]
    data_to_send = {}
    additional_data = {'dataset': selected_data}
    data_to_send.update(data_fn(file_manager, selected_data, tool))
    components = [[FlashViewerComponent(comp_fn())]]
    return data_to_send, components, additional_data
```

This adheres to the Open/Closed principle and makes adding a new component a one-liner.

src/render/compression.py (1)
1-2: Remove unused pandas import.

pandas is imported but never referenced. Apart from the wasted import time, Ruff already flags this as F401. Simply drop the line – downsample_heatmap receives a pd.DataFrame, but the function never calls pd.* APIs.

```diff
-import pandas as pd
```

🧰 Tools
🪛 Ruff (0.8.2)
2-2: pandas imported but unused. Remove unused import: pandas (F401)
src/render/update.py (3)
11-18: Unused parameters and divergent return type.

render_heatmap(…, dataset_name, component_name) never consumes dataset_name nor component_name. Readers will assume these matter for the computation or Streamlit cache key (they do not), and the extra arguments get propagated throughout the call stack.

```diff
-def render_heatmap(full_data, selection, dataset_name, component_name):
+def render_heatmap(full_data, selection):
```

Remember to update the two call-sites below. This also shortens the cache key (Streamlit includes arg values).
20-33: relevant_data can remain uninitialised if the for-loop never executes.

Although full_data is expected to be non-empty, guarding for corrupt input prevents UnboundLocalError.

```python
if not full_data:
    return pd.DataFrame(columns=['rt', 'mass', 'intensity'])
```

A lightweight check above the for dataset in full_data: line improves defensive robustness.
80-88: Flatten nested if for readability and to satisfy Ruff SIM102.

```diff
- elif (component == 'Deconvolved MS1 Heatmap'):
-     if 'heatmap_deconv' in selection_store:
+ elif component == 'Deconvolved MS1 Heatmap' and 'heatmap_deconv' in selection_store:
```

Same for the raw heatmap branch below. Behaviour is identical; intent is clearer.

🧰 Tools
🪛 Ruff (0.8.2)
87-88: Use a single if statement instead of nested if statements. Combine if statements using and (SIM102)
src/render/render.py (1)
63-78: Variable leakage: default points to the last loop value.

Inside the post-initialisation block you reuse the name default from the preceding for … in zip() loop. After that loop finishes, default holds default_state, not default_data – relying on this implicit behaviour is brittle and obscures intent.

```diff
- st.session_state['state_tracker'][tool][identifier] = default
+ st.session_state['state_tracker'][tool][identifier] = StateTracker()
```

Likewise, consider assigning default_plot_data and default_tracker variables to avoid shadowing.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- js-component/dist/assets/index-422c6bea.css is excluded by !**/dist/**
- js-component/dist/assets/index-c2f12ebe.js is excluded by !**/dist/**
- js-component/dist/index.html is excluded by !**/dist/**
📒 Files selected for processing (18)
- .streamlit/config.toml (1 hunks)
- content/FLASHDeconv/FLASHDeconvViewer.py (3 hunks)
- content/FLASHQuant/FLASHQuantViewer.py (2 hunks)
- content/FLASHTnT/FLASHTnTLayoutManager.py (1 hunks)
- content/FLASHTnT/FLASHTnTViewer.py (4 hunks)
- openms-streamlit-vue-component (1 hunks)
- src/Workflow.py (5 hunks)
- src/parse/deconv.py (1 hunks)
- src/parse/masstable.py (2 hunks)
- src/parse/quant.py (1 hunks)
- src/parse/tnt.py (1 hunks)
- src/render/StateTracker.py (1 hunks)
- src/render/components.py (3 hunks)
- src/render/compression.py (1 hunks)
- src/render/initialize.py (1 hunks)
- src/render/render.py (1 hunks)
- src/render/update.py (1 hunks)
- src/render/util.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
src/parse/quant.py (1)
- src/parse/flashquant.py (2): parseFLASHQuantOutput (7-32), connectTraceWithResult (36-49)

content/FLASHDeconv/FLASHDeconvViewer.py (1)
- src/render/render.py (1): render_grid (56-107)

src/Workflow.py (3)
- src/parse/tnt.py (1): parseTnT (15-157)
- src/parse/deconv.py (1): parseDeconv (8-110)
- src/workflow/FileManager.py (2): get_results (403-445), result_exists (447-465)

src/render/render.py (5)
- src/render/util.py (1): hash_complex (4-6)
- src/render/StateTracker.py (3): StateTracker (3-55), getState (49-55), updateState (10-47)
- src/render/initialize.py (1): initialize_data (8-125)
- src/render/update.py (2): update_data (46-60), filter_data (63-118)
- src/render/components.py (1): get_component_function (13-28)

src/render/update.py (2)
- src/render/compression.py (1): downsample_heatmap (18-56)
- src/render/sequence.py (2): getFragmentDataFromSeq (87-145), getInternalFragmentDataFromSeq (258-272)
🪛 Ruff (0.8.2)
src/render/StateTracker.py
30-30: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
44-47: Return the condition bool(modified) directly
Replace with return bool(modified)
(SIM103)
src/render/update.py
87-88: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
src/render/compression.py
2-2: pandas imported but unused
Remove unused import: pandas
(F401)
src/parse/tnt.py
38-38: Loop control variable i not used within loop body
(B007)
44-44: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
50-50: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
130-130: Ambiguous variable name: l
(E741)
🔇 Additional comments (8)
openms-streamlit-vue-component (1)
1-1: Verify updated submodule commit

The submodule pointer has been bumped to 2185551476438398ca3ce8e33d3af018aeebb1f9. Please confirm that this commit exists in the openms-streamlit-vue-component repo and includes the Vue component changes required by the new rendering framework. Also ensure CI workflows properly initialize and update this submodule.
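A quick local check could look like the following (assuming the submodule path openms-streamlit-vue-component at the repository root):

```bash
# Fetch the submodule and confirm the pinned commit actually exists there
git submodule update --init openms-streamlit-vue-component
git -C openms-streamlit-vue-component cat-file -t 2185551476438398ca3ce8e33d3af018aeebb1f9  # expect "commit"
# Show which commit the superproject currently records for the submodule
git ls-tree HEAD openms-streamlit-vue-component
```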
src/parse/quant.py (1)

1-2: Import statements updated to align with project restructuring.

The imports have been moved from src.flashquant to src.parse.flashquant, which aligns with the broader code reorganization happening in the parsing modules. This change improves the code organization by keeping parsing-related functionality in the dedicated parse package.

content/FLASHQuant/FLASHQuantViewer.py (2)
7-8: Updated imports for the new rendering system.

The change from specific component imports to the centralized render_grid function aligns with the new modular rendering approach being implemented.

37-40: Simplified UI rendering with centralized render_grid function.

The implementation now uses the centralized render_grid function instead of explicit component initialization and rendering. This approach is more maintainable and consistent with the new architecture.
src/render/StateTracker.py (1)

49-55: Do not expose internal _id in public state

getState currently leaks the tracker's private identifier. External code should not be able to spoof another tracker so easily. Consider omitting _id or renaming it to something clearly internal (e.g. tracker_id) and documenting the expected usage contract.
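A sketch of an accessor that keeps the identifier internal (assuming the tracker stores its data in self.currentState, as the other comments indicate):

```python
class StateTracker:
    def __init__(self):
        self.currentState = {}
        self._id = None  # internal handle only; never returned to callers

    def getState(self):
        # Hand back a copy of the tracked state without the private identifier,
        # so external code cannot read or replay another tracker's id.
        return dict(self.currentState)
```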
src/Workflow.py (1)

234-261: Ensure files remain accessible after rmtree

parseDeconv / parseTnT read the paths returned by file_manager.get_results. If file_manager.store_file() merely records the original path (and does not copy the file into the cache), the subsequent rmtree(folder_path) (l. 264) will remove those files, breaking later retrieval/download functionality. Please verify that store_file copies or hard-links the file into a persistent cache directory; otherwise copy the files manually before deleting the temp folder.
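If store_file turns out to only record paths, a defensive sketch of copying the results into a persistent location before the cleanup might look like this (preserve_results and workspace_cache_dir are hypothetical names; the real cache layout belongs to FileManager):

```python
import shutil
from pathlib import Path

def preserve_results(result_files, workspace_cache_dir):
    """Copy result files into a persistent cache before the temp folder is removed."""
    cache = Path(workspace_cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    preserved = []
    for src in map(Path, result_files):
        dst = cache / src.name
        shutil.copy2(src, dst)  # copies data and metadata; hard-linking would also work
        preserved.append(dst)
    return preserved

# ...and only afterwards: shutil.rmtree(folder_path)
```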
content/FLASHTnT/FLASHTnTViewer.py (1)

48-53: Index lookup may crash when a previously-selected experiment disappears

name_to_index[st.session_state.selected_experiment0_tagger] will raise a KeyError when the experiment that was in the user's session state is no longer present in results (e.g. file deleted or workspace switched). Consider a safer default:

```diff
- index=name_to_index[st.session_state.selected_experiment0_tagger] if 'selected_experiment0_tagger' in st.session_state else None,
+ index=name_to_index.get(
+     st.session_state.get('selected_experiment0_tagger', ''),
+     0  # fallback to first element
+ ),
```

This prevents an unhandled exception and keeps the UI usable.
src/render/update.py (1)
9-18: hash_funcs forces cache collisions – confirm that this is intentional.

Returning a constant (lambda _: 1) makes st.cache_data treat all DataFrames as identical → different datasets can be mis-served from the same cache entry. If the goal is "ignore raw data size but keep per-call uniqueness", consider hashing cheap metadata instead:

```python
hash_funcs = {pd.DataFrame: lambda df: (df.shape, int(df.memory_usage().sum()))}
```

Otherwise callers might receive stale heatmaps after switching files.
```diff
  'internal_fragment_map',
  'tag_table',
- 'deconv_spectrum'
+ 'combined_spectrum'
```
💡 Verification agent
🧩 Analysis chain
Component name updated from 'deconv_spectrum' to 'combined_spectrum'.
This naming change aligns with the unified data rendering approach being implemented across the visualization modules. Ensure that all references to this component in other parts of the codebase (such as templates, component initialization logic, and documentation) have been updated consistently.
🏁 Script executed:
```bash
#!/bin/bash
# Verify all references to the spectrum component have been updated consistently
echo "Searching for any remaining references to 'deconv_spectrum'..."
rg "deconv_spectrum" --type py
```

Length of output: 1091
Missing references to rename ‘deconv_spectrum’ to ‘combined_spectrum’
The search uncovered several remaining usages of the old component name that must be updated:
• src/render/initialize.py
– elif comp_name == 'deconv_spectrum':
– file_manager.get_results(..., ['deconv_spectrum'])
– data_to_send['per_scan_data'] = data['deconv_spectrum']
• src/parse/deconv.py
– file_manager.store_data(dataset_id, 'deconv_spectrum', deconv_spectrum)
– References in concatenation, renaming, and list literals
• content/FLASHDeconv/FLASHDeconvViewer.py
– Initialization lists including ‘deconv_spectrum’
• content/FLASHDeconv/FLASHDeconvLayoutManager.py
– LayoutManager component list entry
Please update all of these to use 'combined_spectrum' (including import/initialization logic, data-store keys, and template references) so that the component name is consistent across the codebase.
```diff
 # # Get data
 # quant_df = file_manager.get_results(selected_exp0, 'quant_dfs')['quant_dfs']

-component = [[FlashViewerComponent(FLASHQuant())]]
-flash_viewer_grid_component(components=component, data={'quant_data': quant_df, 'dataset': selected_exp0}, component_key='flash_viewer_grid')
+# component = [[FlashViewerComponent(FLASHQuant())]]
+# flash_viewer_grid_component(components=component, data={'quant_data': quant_df, 'dataset': selected_exp0}, component_key='flash_viewer_grid')
```
🛠️ Refactor suggestion
Remove commented-out code.
Since the refactoring to use the new rendering system is now complete, the commented-out code for manual data retrieval and component initialization should be removed.
```diff
-# # Get data
-# quant_df = file_manager.get_results(selected_exp0, 'quant_dfs')['quant_dfs']
-# component = [[FlashViewerComponent(FLASHQuant())]]
-# flash_viewer_grid_component(components=component, data={'quant_data': quant_df, 'dataset': selected_exp0}, component_key='flash_viewer_grid')
```
```python
# Extract counter
counter = newState.pop('counter')

# We always take previously undefined keys
for k, v in newState.items():
    if k not in self.currentState:
        self.currentState[k] = v
        modified = True

# We only accept conflicts for new states
```
🛠️ Refactor suggestion
Avoid mutating caller’s dict & filter out meta-fields
pop('counter') mutates the input dict in-place. If the caller reuses the same object, their view of the data silently changes. Copy the dict first and strip meta-fields explicitly:
```diff
- counter = newState.pop('counter')
+ # Work on a shallow copy so that the caller's object is untouched
+ working_state = newState.copy()
+ counter = working_state.pop('counter', -1)
+ # Remove meta keys we never persist
+ working_state.pop('id', None)
```

Then use working_state below. This also stops the internal state from being polluted with the id field.
```python
if newState['id'] != self.id:
    return False
```
🛠️ Refactor suggestion
Gracefully handle missing metadata & avoid shadowing built-ins
newState['id'] will raise a KeyError if the caller forgets to include the field.
Also, naming the attribute id shadows Python’s built-in id() function and can be confusing in debuggers.
```diff
- if newState['id'] != self.id:
+ tracker_id = newState.get('id')
+ if tracker_id is None or tracker_id != self._id:
      return False
```

Don't forget to rename the instance attribute in __init__:

```diff
- self.id = np.random.random()
+ self._id = np.random.random()
```
```python
def ecdf(df):
    target_qscores = df[df['TargetDecoyType'] == 0]['Qscore']
    decoy_qscores = df[df['TargetDecoyType'] > 0]['Qscore']

    ecdf_target = pd.DataFrame({
        'x' : np.sort(target_qscores),
        'y' : np.arange(1, len(target_qscores) + 1) / len(target_qscores)
    })
    ecdf_decoy = pd.DataFrame({
        'x' : np.sort(decoy_qscores),
        'y' : np.arange(1, len(decoy_qscores) + 1) / len(decoy_qscores)
    })
    return ecdf_target, ecdf_decoy
```
ecdf can divide by zero when no target/decoy spectra are present
When either target_qscores or decoy_qscores is empty, the denominator becomes 0, yielding NaN / Inf.
```diff
 def ecdf(df):
     ...
-    ecdf_target = pd.DataFrame({
-        'x' : np.sort(target_qscores),
-        'y' : np.arange(1, len(target_qscores) + 1) / len(target_qscores)
-    })
+    if len(target_qscores) == 0:
+        ecdf_target = pd.DataFrame({'x': [], 'y': []})
+    else:
+        ecdf_target = pd.DataFrame({
+            'x': np.sort(target_qscores),
+            'y': np.arange(1, len(target_qscores) + 1) / len(target_qscores)
+        })
     ...
```

Apply the same guard for decoy_qscores.
```python
cached_compression_levels = []
for size in compute_compression_levels(20000, len(data_full)):
    cached_compression_levels.append(
        file_manager.get_results(
            selected_data, [f'ms1_deconv_heatmap_{size}']
        )[f'ms1_deconv_heatmap_{size}']
    )
cached_compression_levels.append(data_full)
```
🛠️ Refactor suggestion
I/O-heavy loop – fetch compressed heatmaps in a single call
Calling file_manager.get_results inside the loop performs N separate
disk/IO operations. Most FileManager back-ends support batched
fetches; requesting all keys at once is noticeably faster for large
projects.
```diff
- cached_compression_levels = []
- for size in compute_compression_levels(20000, len(data_full)):
-     cached_compression_levels.append(
-         file_manager.get_results(
-             selected_data, [f'ms1_deconv_heatmap_{size}']
-         )[f'ms1_deconv_heatmap_{size}']
-     )
+ keys = [f"ms1_deconv_heatmap_{s}" for s in compute_compression_levels(20000, len(data_full))]
+ cached = file_manager.get_results(selected_data, keys)
+ cached_compression_levels = [cached[k] for k in keys]
```

This reduces disk seeks / S3 requests and makes page load more responsive.
```python
# Compute maximum amount of peaks per bin that does not exceed limit
counted_peaks = 0
max_peaks_per_bin = -1
new_count = 0
while ((counted_peaks + new_count) < max_datapoints):
    # commit prev result
    max_peaks_per_bin += 1
    counted_peaks += new_count
    # compute count for next value
    new_count = np.sum(count.flatten() >= (max_peaks_per_bin + 1))

    if counted_peaks >= len(data):
        break

data = data.groupby(
    ['mass_bin', 'rt_bin'], group_keys=False, sort=False
).head(max_peaks_per_bin).reset_index(drop=True)
```
Off-by-one: downsample_heatmap can return an empty DataFrame.
When every bin already exceeds max_datapoints, the first while iteration sets
max_peaks_per_bin → 0, so the final groupby(...).head(0) yields nothing.
The practical expectation is to keep at least one peak per bin capped by
max_datapoints.
```diff
- max_peaks_per_bin = -1
+ max_peaks_per_bin = 0  # keep at least one per bin
  ...
- while ((counted_peaks + new_count) < max_datapoints):
+ while counted_peaks < max_datapoints:
      # commit prev result
-     max_peaks_per_bin += 1
+     if (counted_peaks + new_count) >= max_datapoints:
+         break
+     max_peaks_per_bin += 1
```

Also consider short-circuiting by checking len(data) <= max_datapoints at the top – it avoids all binning work for small datasets.
```python
def compute_compression_levels(minimal_size, total_amount, logger=None):
    if total_amount <= minimal_size:
        return []
    levels = np.logspace(
        int(np.log10(minimal_size)), int(np.log10(total_amount)),
        int(np.log10(total_amount)) - int(np.log10(minimal_size)) + 1,
        dtype='int'
    )*int(10**(np.log10(minimal_size)%1))
    return levels[levels<total_amount]
```
🛠️ Refactor suggestion
Validate input and simplify compute_compression_levels.
- Calling np.log10 on minimal_size <= 0 will raise a runtime warning / nan; guard early.
- The arithmetic mixing logspace with an additional *10**(fract_log) step is hard to reason about and can silently create duplicated levels for powers of ten (e.g. 100, 1000).
- You can obtain strictly-increasing integer levels with a clearer, branch-free expression.
```diff
 def compute_compression_levels(minimal_size, total_amount, logger=None):
-    if total_amount <= minimal_size:
-        return []
+    if minimal_size <= 0:
+        raise ValueError("`minimal_size` must be > 0.")
+    if total_amount <= minimal_size:
+        return []
-    levels = np.logspace(
-        int(np.log10(minimal_size)), int(np.log10(total_amount)),
-        int(np.log10(total_amount)) - int(np.log10(minimal_size)) + 1,
-        dtype='int'
-    )*int(10**(np.log10(minimal_size)%1))
-    return levels[levels<total_amount]
+    exponents = np.arange(
+        np.floor(np.log10(minimal_size)),
+        np.ceil(np.log10(total_amount)),
+        dtype=int,
+    )
+    levels = (10 ** exponents).astype(int)
+    return levels[levels < total_amount]
```
```python
if updated:
    st.rerun(scope='app')
```
st.rerun() accepts a scope argument only on recent Streamlit releases.

Older Streamlit versions define st.rerun() with no parameters, so passing scope='app' raises TypeError there and prevents any state update from propagating. Either require a Streamlit version that supports the scope parameter or drop the argument:

```diff
- st.rerun(scope='app')
+ st.rerun()
```
```python
# Hash updated. filtered data
data['hash'] = hash_complex(data)
```
🛠️ Refactor suggestion
Hashing full DataFrames is extremely expensive.
hash_complex pickles the whole data dict → 20 k-row heatmaps or large
spectra easily reach multiple MB, causing noticeable latency every render.
Instead, hash lightweight, immutable identifiers:
```diff
- data['hash'] = hash_complex(data)
+ data['hash'] = hash_complex({
+     k: v for k, v in data.items()
+     if not isinstance(v, pd.DataFrame)
+ })
```

or compute and store data_version counters during updates.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/parse/tnt.py (2)

40-55: Fix static analysis issues in the tag dataframe linearization code.

There are a few minor issues in this loop:

- Unused loop control variable i in line 41
- Unnecessarily using .keys() in the dictionary membership tests

```diff
- new_tag_df = {c : [] for c in tag_df.columns}
- for i, row in tag_df.iterrows():
+ new_tag_df = {c: [] for c in tag_df.columns}
+ for _, row in tag_df.iterrows():
      # No splitting if it is not recognized as string
      if pd.isna(row['ProteoformIndex']):
          row['ProteoformIndex'] = -1
      if isinstance(row['ProteoformIndex'], str) and (';' in row['ProteoformIndex']):
          no_items = row['ProteoformIndex'].count(';') + 1
-         for c in new_tag_df.keys():
+         for c in new_tag_df:
              if (isinstance(row[c], str)) and (';' in row[c]):
                  new_tag_df[c] += row[c].split(';')
              else:
                  new_tag_df[c] += [row[c]]*no_items
      else:
-         for c in new_tag_df.keys():
+         for c in new_tag_df:
              new_tag_df[c].append(row[c])
```

🧰 Tools
🪛 Ruff (0.8.2)
41-41: Loop control variable i not used within loop body (B007)
47-47: Use key in dict instead of key in dict.keys(). Remove .keys() (SIM118)
53-53: Use key in dict instead of key in dict.keys(). Remove .keys() (SIM118)
133-133: Avoid using ambiguous variable names.

The single-letter variable l is ambiguous and could be confused with the number 1 or capital I.

```diff
- } for s, e, m, l in zip(mod_starts, mod_ends, mod_masses, mod_labels)
+ } for s, e, m, label in zip(mod_starts, mod_ends, mod_masses, mod_labels)
```

🧰 Tools
🪛 Ruff (0.8.2)
133-133: Ambiguous variable name: l (E741)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- content/FLASHDeconv/FLASHDeconvViewer.py (5 hunks)
- content/FLASHTnT/FLASHTnTLayoutManager.py (1 hunks)
- content/FLASHTnT/FLASHTnTViewer.py (6 hunks)
- src/parse/tnt.py (1 hunks)
- src/render/components.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- content/FLASHTnT/FLASHTnTLayoutManager.py
- src/render/components.py
🧰 Additional context used
🪛 Ruff (0.8.2)
src/parse/tnt.py
41-41: Loop control variable i not used within loop body
(B007)
47-47: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
53-53: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
133-133: Ambiguous variable name: l
(E741)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-full-app
- GitHub Check: build-openms
🔇 Additional comments (13)
content/FLASHDeconv/FLASHDeconvViewer.py (6)
7-7: Good architectural change with simplified import structure.

Replacing multiple specific imports with a single render_grid import significantly simplifies this file by delegating rendering and data preparation logic to a centralized function.

11-12: LGTM! Dynamic layout adjustment based on availability of sequence data.

The conditional logic to add 'sequence_view' to DEFAULT_LAYOUT when input_sequence is available is a good enhancement that makes the UI more responsive to the available data.

60-63: LGTM! Simplified rendering logic with proper parameter passing.

Delegating the complex data processing and rendering to the render_grid function improves separation of concerns and code maintainability.

73-77: LGTM! Consistent use of render_grid for the second view.

The side-by-side view implementation correctly passes appropriate parameters to distinguish it from the first view.

90-93: LGTM! Consistent implementation for single experiment view.

The rendering approach is consistent with the other views, maintaining a clean separation of concerns.

111-116: LGTM! Appropriate parameterization for multiple experiment layouts.

The multi-experiment rendering correctly identifies each grid with unique identifiers.

src/parse/tnt.py (1)

15-15: Good function signature update aligns with architectural changes.

Adding file_manager and dataset_id parameters shifts the responsibility of data storage to the caller and makes the data flow more explicit, which is a solid architectural improvement.

content/FLASHTnT/FLASHTnTViewer.py (6)

7-7: Good architectural change with simplified import structure.

Replacing multiple specific imports with a single render_grid import significantly simplifies this file by delegating rendering and data preparation logic to a centralized function.

14-14: LGTM! Appropriate component replacement in the default layout.

Replacing 'deconv_spectrum' with 'combined_spectrum' suggests an improved or more integrated visualization component.

67-67: LGTM! Simplified rendering with proper parameter passing.

Delegating the complex data processing and rendering to the render_grid function improves separation of concerns and code maintainability.

76-76: LGTM! Consistent use of render_grid for the second view with proper grid identifier.

The side-by-side view implementation correctly passes appropriate parameters to distinguish it from the first view.

89-89: LGTM! Consistent implementation for single experiment view.

The rendering approach is consistent with the other views, maintaining a clean separation of concerns.

108-108: LGTM! Appropriate parameterization for multiple experiment layouts.

The multi-experiment rendering correctly identifies each grid with unique identifiers and handles experiment keys properly.
```python
def parseTnT(file_manager, dataset_id, deconv_mzML, anno_mzML, tag_tsv, protein_tsv, logger=None):

    deconv_df, _, tolerance, _, _, = parseFLASHDeconvOutput(
        anno_mzML, deconv_mzML
    )
    tag_df, protein_df = parseFLASHTaggerOutput(tag_tsv, protein_tsv)

    # Add tolerance as a new columnn, wasteful but gets the job done..
    deconv_df['tol'] = tolerance

    return {
        'anno_dfs' : anno_df,
        'deconv_dfs' : deconv_df,
        'tag_dfs' : tag_df,
        'protein_dfs' : protein_df

    # protein_table
    protein_df['length'] = protein_df['DatabaseSequence'].apply(lambda x : len(x))
    protein_df = protein_df.rename(
        columns={
            'ProteoformIndex' : 'index',
            'ProteinAccession' : 'accession',
            'ProteinDescription' : 'description',
            'DatabaseSequence' : 'sequence'
        }
    )
    protein_df['description'] = protein_df['description'].apply(
        lambda x: x[:50] + '...' if len(x) > 50 else x
    )
    file_manager.store_data(dataset_id, 'protein_dfs', protein_df)

    # tag_table

    # Process tag df into a linear data format
    new_tag_df = {c : [] for c in tag_df.columns}
    for i, row in tag_df.iterrows():
        # No splitting if it is not recognized as string
        if pd.isna(row['ProteoformIndex']):
            row['ProteoformIndex'] = -1
        if isinstance(row['ProteoformIndex'], str) and (';' in row['ProteoformIndex']):
            no_items = row['ProteoformIndex'].count(';') + 1
            for c in new_tag_df.keys():
                if (isinstance(row[c], str)) and (';' in row[c]):
                    new_tag_df[c] += row[c].split(';')
                else:
                    new_tag_df[c] += [row[c]]*no_items
        else:
            for c in new_tag_df.keys():
                new_tag_df[c].append(row[c])
    tag_df = pd.DataFrame(new_tag_df)

    tsv_buffer = StringIO()
    tag_df.to_csv(tsv_buffer, sep='\t', index=False)
    tsv_buffer.seek(0)
    tag_df = pd.read_csv(tsv_buffer, sep='\t')

    # Complete df
    tag_df['StartPosition'] = tag_df['StartPosition'] - 1
    tag_df['EndPos'] = tag_df['StartPosition'] + tag_df['Length'] - 1
    tag_df = tag_df.rename(
        columns={
            'ProteoformIndex' : 'ProteinIndex',
            'DeNovoScore' : 'Score',
            'Masses' : 'mzs',
            'StartPosition' : 'StartPos'
        }
    )
    file_manager.store_data(dataset_id, 'tag_dfs', tag_df)

    # sequence_view & internal_fragment_map
    sequence_data = {}
    internal_fragment_data = {}
    # Compute coverage
    for i, row in protein_df.iterrows():
        pid = row['index']
        sequence = row['sequence']
        coverage = np.zeros(len(sequence), dtype='float')
        for i in range(len(sequence)):
            coverage[i] = np.sum(
                (tag_df['ProteinIndex'] == pid) &
                (tag_df['StartPos'] <= i) &
                (tag_df['EndPos'] >= i)
            )
        p_cov = np.zeros(len(coverage))
        if np.max(coverage) > 0:
            p_cov = coverage/np.max(coverage)

        proteoform_start = row['StartPosition']
        proteoform_end = row['EndPosition']
        start_index = 0 if proteoform_start <= 0 else proteoform_start - 1
        end_index = len(sequence) - 1 if proteoform_end <= 0 else proteoform_end - 1

        if row['ModCount'] > 0:
            mod_masses = [float(m) for m in str(row['ModMass']).split(';')]
            mod_starts = [int(float(s)) for s in str(row['ModStart']).split(';')]
            mod_ends = [int(float(s)) for s in str(row['ModEnd']).split(';')]
            if pd.isna(row['ModID']):
                mod_labels = [''] * row['ModCount']
            else:
                mod_labels = [s[:-1].replace(',', '; ') for s in str(row['ModID']).split(';')]
        else:
            mod_masses = []
            mod_starts = []
            mod_ends = []
            mod_labels = []
        modifications = []
        for s, e, m in zip(mod_starts, mod_ends, mod_masses):
            modifications.append((s-start_index, e-start_index, m))

        sequence = str(sequence)
        sequence_data[pid] = getFragmentDataFromSeq(
            str(sequence)[start_index:end_index+1], p_cov, np.max(coverage),
            modifications
        )

        sequence_data[pid]['sequence'] = list(sequence)
        sequence_data[pid]['proteoform_start'] = proteoform_start - 1
        sequence_data[pid]['proteoform_end'] = proteoform_end - 1
        sequence_data[pid]['computed_mass'] = row['ProteoformMass']
        sequence_data[pid]['theoretical_mass'] = remove_ambigious(AASequence.fromString(sequence)).getMonoWeight()
        sequence_data[pid]['modifications'] = [
            {
                # Modfications are zero based
                'start' : s - 1,
                'end' : e - 1,
                'mass_diff' : m,
                'labels' : l
            } for s, e, m, l in zip(mod_starts, mod_ends, mod_masses, mod_labels)
        ]

        internal_fragment_data[pid] = getInternalFragmentDataFromSeq(
            str(sequence)[start_index:end_index+1], modifications
        )

    file_manager.store_data(dataset_id, 'sequence_data', sequence_data)
    file_manager.store_data(
        dataset_id, 'internal_fragment_data', internal_fragment_data
    )

    fragments = ['b', 'y']
    if file_manager.result_exists(dataset_id, 'FTnT_parameters_json'):
        tnt_settings_file = file_manager.get_results(
            dataset_id, ['FTnT_parameters_json']
        )['FTnT_parameters_json']
        with open(tnt_settings_file, 'r') as f:
            tnt_settings = json.load(f)
        if 'ion_type' in tnt_settings:
            fragments = tnt_settings['ion_type'].split('\n')
    settings = {
        'tolerance' : tolerance,
        'ion_types' : fragments
    }
    file_manager.store_data(
        dataset_id, 'settings', settings
    )
```
🛠️ Refactor suggestion
Consider refactoring the large parseTnT function for better maintainability.
This function handles multiple responsibilities (parsing data, processing tags, computing coverage, generating fragments, reading settings). Consider breaking it down into smaller, focused functions for improved maintainability and testability.
For example, you could extract these logical sections into separate functions:
- transform_protein_dataframe(protein_df)
- linearize_tag_dataframe(tag_df)
- compute_sequence_coverage(protein_df, tag_df)
- generate_fragment_data(protein_row, modifications)
- load_ion_type_settings(file_manager, dataset_id)
This would make the main function much cleaner and more maintainable:
```python
def parseTnT(file_manager, dataset_id, deconv_mzML, anno_mzML, tag_tsv, protein_tsv, logger=None):
    # Parse input data
    deconv_df, _, tolerance, _, _, = parseFLASHDeconvOutput(anno_mzML, deconv_mzML)
    tag_df, protein_df = parseFLASHTaggerOutput(tag_tsv, protein_tsv)

    # Transform and store protein data
    protein_df = transform_protein_dataframe(protein_df)
    file_manager.store_data(dataset_id, 'protein_dfs', protein_df)

    # Transform and store tag data
    tag_df = linearize_tag_dataframe(tag_df)
    file_manager.store_data(dataset_id, 'tag_dfs', tag_df)

    # Generate and store sequence and fragment data
    sequence_data, internal_fragment_data = generate_sequence_and_fragment_data(protein_df, tag_df)
    file_manager.store_data(dataset_id, 'sequence_data', sequence_data)
    file_manager.store_data(dataset_id, 'internal_fragment_data', internal_fragment_data)

    # Load and store settings
    settings = load_ion_type_settings(file_manager, dataset_id, tolerance)
    file_manager.store_data(dataset_id, 'settings', settings)
```

This improves readability, testability, and maintainability.
🧰 Tools
🪛 Ruff (0.8.2)
41-41: Loop control variable i not used within loop body
(B007)
47-47: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
53-53: Use key in dict instead of key in dict.keys()
Remove .keys()
(SIM118)
133-133: Ambiguous variable name: l
(E741)
Summary by CodeRabbit
New Features
Refactor
Bug Fixes
Chores
- scipy dependency for statistical computations and visualization support.

Documentation