REFACTOR: Improve dna2protein bottleneck by m-huertasp · Pull Request #422 · bbglab/deepCSA

m-huertasp · 2026-02-24T13:24:30Z

Summary

This PR optimizes the DNA2PROTEINMAPPING step by reducing its bottleneck. Previously, the pipeline executed multiple serial requests to Ensembl REST APIs (roughly 2× the number of genes in a panel), which frequently led to rate-limiting blocks and significant latency for large panels.

Based on suggestions from @FerriolCalvet, the individual calls were replaced with a bulk-retrieval strategy: the full GFF annotation is now fetched via FTP and processed locally.

Main Changes

- New logic: Replaced the multiple REST API calls with a single streamed read of the Ensembl GFF over FTP. The file is not saved to disk: it is filtered on the fly and held in memory only until it is converted into a DataFrame.
- Refactor: Updated downstream functions to handle DataFrames instead of the JSON objects previously returned by the API.
- Flexibility: Linked the Ensembl release version to the vep_cache_version Nextflow parameter, so users can select the Ensembl release through that parameter.
- Documentation: Added docstrings and internal comments explaining the new and old parsing logic.
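
The "filtered on the fly" idea can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the names `KEPT_FEATURES`, `parse_attributes`, `iter_gff3_rows` and `filter_panel` are hypothetical, and the sketch assumes that the GFF3 lists each gene before its transcripts and each transcript before its exons/CDS, and that gene rows carry the gene symbol in a `Name` attribute.

```python
import gzip
import io
from typing import IO, Dict, Iterable, Iterator, List, Set

# Feature types worth keeping (an assumption for this sketch).
KEPT_FEATURES = {"gene", "mRNA", "exon", "CDS"}


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn a GFF3 attribute column such as 'ID=gene:G1;Name=TP53' into a dict."""
    pairs = (item.split("=", 1) for item in field.split(";") if "=" in item)
    return {key: value for key, value in pairs}


def iter_gff3_rows(stream: IO[bytes]) -> Iterator[Dict[str, object]]:
    """Decompress a gzipped GFF3 stream lazily and yield one dict per kept feature."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # header and directive lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) != 9 or cols[2] not in KEPT_FEATURES:
                continue
            yield {
                **parse_attributes(cols[8]),
                "seqid": cols[0],
                "type": cols[2],
                "start": int(cols[3]),  # GFF3 coordinates are 1-based, inclusive
                "end": int(cols[4]),
                "strand": cols[6],
                "phase": cols[7],
            }


def filter_panel(rows: Iterable[Dict[str, object]], gene_names: Set[str]) -> List[Dict[str, object]]:
    """Keep panel genes plus every feature that descends from them."""
    kept: List[Dict[str, object]] = []
    accepted_ids: Set[object] = set()  # IDs of genes/transcripts already kept
    for row in rows:
        if row["type"] == "gene":
            accept = row.get("Name") in gene_names
        else:
            accept = row.get("Parent") in accepted_ids
        if accept:
            if "ID" in row:
                accepted_ids.add(row["ID"])
            kept.append(row)
    return kept
```

In real use, `stream` could be the response object returned by `urllib.request.urlopen(...)`, and the resulting list of dicts could be passed to `pandas.DataFrame(...)`. Only the rows that survive the filter are ever held in memory.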

closes #229


Copilot

Pull request overview

This PR refactors the DNA2PROTEINMAPPING step to reduce Ensembl REST API bottlenecks by switching to a bulk retrieval + local parsing approach using the Ensembl GFF3 (fetched from Ensembl FTP), and threads an Ensembl release parameter into the mapping step.

Changes:

- Wire an optional --ensembl-release argument into the Nextflow module.
- Replace transcript/exon/CDS retrieval from Ensembl REST with streaming download + local filtering of the Ensembl GFF3, followed by DataFrame-based downstream processing.
- Refactor exon-ID assignment and related plotting/coverage helpers.
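
As a rough illustration of how a release number could be threaded into the download location, the sketch below builds the path of a GFF3 file from release, species and assembly. The helper name `ensembl_gff3_url` and its defaults are hypothetical, and the URL layout is an assumption based on the public Ensembl FTP tree; the PR's own URL construction is not shown here.

```python
def ensembl_gff3_url(release: int, species: str = "homo_sapiens", assembly: str = "GRCh38") -> str:
    """Build the location of a gzipped Ensembl GFF3 file for one release.

    Assumed layout: <base>/release-<N>/gff3/<species>/<Species>.<assembly>.<N>.gff3.gz
    """
    if release <= 0:
        raise ValueError("release must be a positive integer")
    base = "https://ftp.ensembl.org/pub"
    # File names start with the species name with its first letter upper-cased.
    filename = f"{species.capitalize()}.{assembly}.{release}.gff3.gz"
    return f"{base}/release-{release}/gff3/{species}/{filename}"
```

Keeping the release as an explicit argument means the same value that selects the VEP cache can also select the annotation file, so both stay on the same Ensembl version.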

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 2 comments.

File	Description
`modules/local/dna2protein/main.nf`	Adds `--ensembl-release` argument passing to the mapping script.
`bin/panels_computedna2protein.py`	Implements GFF3 streaming retrieval/parsing and refactors coordinate mapping, exon lookup, and coverage plotting.


modules/local/dna2protein/main.nf

FerriolCalvet

all good! great Marta!

bin/panels_computedna2protein.py

conf/modules.config

FerriolCalvet

very minor changes from testing

FerriolCalvet · 2026-03-03T08:31:23Z

modules/local/dna2protein/main.nf

When testing, this process needs more memory, particularly for bigger panels. Add the process_medium_high_memory label and then merge.

This will not be added in this branch but in the ferriol-updates branch.

m-huertasp added 6 commits February 23, 2026 14:49

documentation: add docstrings (#229)

ada77eb

- also simplified plot_coverage_per_gene

refactor: get gff through ftp instead of api calls (#229)

d538127

- Removed functions to get the api calls
- New functions to obtain the gff and filter it
- Modified functions to process the data
- Added docstrings

documentation: add docstrings to rest of functions (#229)

683d50f

- Also modified the order
- Removed unused functions

refactor: separate parse_cds_coord into exon and cds parsers (#229)

e9eb510

refactor: vectorized find_exon (#229)

bca9ab1

- Improved efficiency of find_exon by applying vectorized dataframe operations
- Fix errors in docstrings and parameter definition
- Add log information and remove info from matplotlib
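
The commit message does not show what the vectorized lookup looks like. One way such a lookup could be written, as a sketch only: the function name `assign_exon_ids` and the column names `pos`, `start`, `end` and `exon_id` are illustrative assumptions, not taken from the PR, and the exons are assumed to belong to a single transcript so that they do not overlap.

```python
import pandas as pd


def assign_exon_ids(positions: pd.DataFrame, exons: pd.DataFrame) -> pd.DataFrame:
    """Label every genomic position with the exon containing it (NaN if none)."""
    # Closed on both sides because GFF3 start/end are inclusive.
    intervals = pd.IntervalIndex.from_arrays(
        exons["start"].to_numpy(), exons["end"].to_numpy(), closed="both"
    )
    # One call for all positions: row number of the containing exon, -1 if none.
    hit = intervals.get_indexer(positions["pos"].to_numpy())
    # Index -1 would wrap around to the last exon, so those rows are masked out.
    ids = exons["exon_id"].to_numpy()[hit]
    out = positions.copy()
    out["exon_id"] = pd.Series(ids, index=positions.index).where(hit >= 0)
    return out
```

Compared with looping over positions and scanning the exon list for each one, the interval lookup handles the whole column in a single call, which is typically where the speed-up of a vectorized rewrite comes from.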

refactor: add option to change ensembl release (#229)

385f3a9

- Added click option to script
- Added definition to nextflow process

m-huertasp requested a review from Copilot February 24, 2026 13:24

m-huertasp self-assigned this Feb 24, 2026

m-huertasp added the efficiency-related and code-review 👩‍💻 labels Feb 24, 2026

m-huertasp added this to the Current iteration milestone Feb 24, 2026

Copilot started reviewing on behalf of m-huertasp February 24, 2026 13:25

m-huertasp linked an issue Feb 24, 2026 that may be closed by this pull request

use panel annotation to define already the protein position of each CDS position #229

Closed

Copilot AI reviewed Feb 24, 2026

modules/local/dna2protein/main.nf (outdated, resolved)

modules/local/dna2protein/main.nf (outdated, resolved)

m-huertasp added 2 commits February 24, 2026 14:43

fix: forgot to add the configuration (#229)

821537a

fix: indent problem (#229)

6efbf68

FerriolCalvet requested changes Feb 24, 2026

bin/panels_computedna2protein.py (outdated, resolved)

conf/modules.config (resolved)

refactor: add species and genome as parameters (#229)

5c2460f

m-huertasp requested a review from FerriolCalvet March 2, 2026 08:07

FerriolCalvet reviewed Mar 3, 2026

FerriolCalvet merged commit a7f8e9c into dev Mar 3, 2026

FerriolCalvet deleted the feat/229-improve-dna2protein-bottleneck branch March 3, 2026 11:30

FerriolCalvet merged 9 commits into dev from feat/229-improve-dna2protein-bottleneck
