REFACTOR: Improve dna2protein bottleneck #422
- Also simplified plot_coverage_per_gene
- Removed functions to get the API calls
- New functions to obtain the GFF and filter it
- Modified functions to process the data
- Added docstrings
- Also modified the order
- Removed unused functions
- Improved efficiency of find_exon by applying vectorized dataframe operations
- Fix errors in docstrings and parameter definition
- Add log information and remove info from matplotlib
- Added click option to script
- Added definition to Nextflow process
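The vectorized exon lookup mentioned in the commits above can be illustrated with a minimal sketch. This is not the code from `find_exon` in this PR; the function name `assign_exon_ids` and the column names `pos`, `start`, `end`, and `exon_id` are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
import numpy as np
import pandas as pd


def assign_exon_ids(positions: pd.DataFrame, exons: pd.DataFrame) -> pd.Series:
    """Return the exon ID covering each position, or None when no exon does.

    Hypothetical columns: positions["pos"], exons["start"], exons["end"],
    exons["exon_id"]. Intervals are treated as closed on both ends.
    """
    if exons.empty:
        # argmax over an empty axis would raise, so handle this case up front.
        return pd.Series([None] * len(positions), index=positions.index,
                         dtype=object, name="exon_id")

    pos = positions["pos"].to_numpy()[:, None]    # shape (n_positions, 1)
    starts = exons["start"].to_numpy()[None, :]   # shape (1, n_exons)
    ends = exons["end"].to_numpy()[None, :]

    # Broadcast to a boolean matrix: inside[i, j] is True when position i
    # falls within exon j. No Python-level loop over rows is needed.
    inside = (pos >= starts) & (pos <= ends)
    first_hit = inside.argmax(axis=1)             # first True per row, 0 if none
    has_hit = inside.any(axis=1)

    exon_ids = exons["exon_id"].to_numpy(dtype=object)[first_hit]
    exon_ids[~has_hit] = None                     # positions outside every exon
    return pd.Series(exon_ids, index=positions.index, dtype=object, name="exon_id")
```

The broadcast matrix uses memory proportional to positions × exons, so a sketch like this would be applied per gene or per transcript rather than across a whole annotation at once.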
Pull request overview
This PR refactors the DNA2PROTEINMAPPING step to reduce Ensembl REST API bottlenecks by switching to a bulk retrieval + local parsing approach using the Ensembl GFF3 (fetched from Ensembl FTP), and threads an Ensembl release parameter into the mapping step.
Changes:
- Add an optional `--ensembl-release` argument wiring in the Nextflow module.
- Replace transcript/exon/CDS retrieval from Ensembl REST with streaming download + local filtering of Ensembl GFF3 and DataFrame-based downstream processing.
- Refactor exon-ID assignment and related plotting/coverage helpers.
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `modules/local/dna2protein/main.nf` | Adds `--ensembl-release` argument passing to the mapping script. |
| `bin/panels_computedna2protein.py` | Implements GFF3 streaming retrieval/parsing and refactors coordinate mapping, exon lookup, and coverage plotting. |
FerriolCalvet
left a comment
all good! great Marta!
FerriolCalvet
left a comment
very minor changes from testing
When testing, this file needs more memory, particularly for bigger panels. Add the `process_medium_high_memory` label and then merge.
This will not be added in this branch but in the `ferriol-updates` branch.
Summary
This PR optimizes the `DNA2PROTEINMAPPING` step by reducing its bottleneck. Previously, the pipeline executed multiple serial requests to Ensembl REST APIs (roughly 2× the number of genes in a panel), which frequently led to rate-limiting blocks and significant latency for large panels.

Based on suggestions from @FerriolCalvet, the individual calls were replaced with a bulk-retrieval strategy. We now download the full GFF annotation via FTP and process it locally.
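As a rough illustration of the "download once, filter locally" idea, the sketch below streams a gzipped GFF3 file line by line and keeps only the requested feature types. It uses only the standard library and is not the implementation in `bin/panels_computedna2protein.py`; the function names (`parse_attributes`, `filter_gff3`, `read_gzipped_gff3`) are hypothetical, and the binary stream passed in is assumed to come from whatever download mechanism the script uses.

```python
import gzip
from typing import BinaryIO, Iterable, Iterator

# The nine tab-separated columns defined by the GFF3 format.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_attributes(field: str) -> dict[str, str]:
    """Split a GFF3 attributes column ("key=value;key=value") into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def filter_gff3(lines: Iterable[str], feature_types: set[str]) -> Iterator[dict]:
    """Yield one record per GFF3 line whose type column is in feature_types."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in feature_types:
            continue
        record = dict(zip(GFF3_COLUMNS, fields))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["attributes"] = parse_attributes(record["attributes"])
        yield record


def read_gzipped_gff3(raw: BinaryIO, feature_types: set[str]) -> list[dict]:
    """Decompress a gzipped GFF3 stream on the fly and keep matching features."""
    with gzip.open(raw, mode="rt", encoding="utf-8") as handle:
        return list(filter_gff3(handle, feature_types))
```

Because the file is read as a stream and filtered as it is decompressed, only the retained records are held in memory, and a single download replaces the per-gene requests described above.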
Main Changes
closes #229