
FLARE extract tracts speedup and added a safety check to make sure the vcf file isn't empty#53

Open
jtb324 wants to merge 3 commits intoAtkinson-Lab:masterfrom
jtb324:master


@jtb324 jtb324 commented Jan 21, 2026

Original Pull request (1/21/26):

This change addresses issue #52. The parse_vcf_filepath function already validated that the file exists; it now also validates that the file is not empty. If the file is empty, a message is logged and a ValueError is raised, since the program assumes that the file has content. This gives the user one more safeguard to make sure everything is behaving properly.
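For readers who want to see the shape of the check, here is a minimal sketch. It is not the code from this PR: the helper name `check_vcf_not_empty`, its signature, and the log and error messages are hypothetical, and it assumes that "empty" means a file size of zero bytes.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def check_vcf_not_empty(vcf_path: str) -> str:
    """Hypothetical helper: return vcf_path if it exists and has content."""
    if not os.path.isfile(vcf_path):
        raise FileNotFoundError(f"VCF file not found: {vcf_path}")
    # Assumption: a zero-byte file is what counts as "empty" here.
    if os.path.getsize(vcf_path) == 0:
        logger.error("VCF file is empty: %s", vcf_path)
        raise ValueError(f"VCF file is empty: {vcf_path}")
    return vcf_path


# Usage: a freshly created temporary file has zero bytes, so the check fails.
with tempfile.NamedTemporaryFile(suffix=".vcf", delete=False) as handle:
    empty_path = handle.name

try:
    check_vcf_not_empty(empty_path)
except ValueError as err:
    print(err)
finally:
    os.remove(empty_path)
```

Note that a size check like this would not catch a compressed file that decompresses to nothing; that case would need the file to be opened and read.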

*Additionally, the code was run through the Black formatter to get consistent spacing and Pythonic style.

This has been tested on a Linux Ubuntu 24 LTS server with python 3.11

Additional Pull Request (4/3/26):

We observed that the extract_tracts_flare.py script was taking a long time to run when extracting tracts from a FLARE run on 250k participants (the jobs for chromosomes 1-12 were killed for exceeding the time limit after 10 days). Profiling revealed that the issue was the string concatenation used to generate each row in the output file. Each concatenation allocates a new string, which is slow and gets slower as the string grows. The proposed change showed a 10x speedup in testing with minimal memory increases.

Solution:
(lines 207-240) - We restructured the output_lines dictionary so that the keys are the ancestry groups and the values are lists. We append the allele count, the ancestry haplotype count, and the VCF genotype output to these lists and then join the lists into strings right before writing. This approach reduces the number of new string allocations that Python has to make throughout the runtime. This change also removed a lot of the branching: instead of branching on each condition, we evaluate the conditions to boolean values and add those values together.
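The idea can be sketched as follows. This is an illustration under stated assumptions, not the code from the script: the function `build_rows`, the inner keys `"dosage"`, `"hapcount"`, and `"vcf"`, the input layout, and the use of `"1"` as the counted allele are all hypothetical.

```python
def build_rows(samples, num_ancs):
    """Hypothetical sketch: collect per-ancestry fields in lists, join once.

    samples: iterable of ((allele_a, allele_b), (anc_a, anc_b)) where alleles
    are strings and ancestries are integer labels (an assumed layout).
    """
    output_lines = {
        anc: {"dosage": [], "hapcount": [], "vcf": []} for anc in range(num_ancs)
    }
    for (allele_a, allele_b), (anc_a, anc_b) in samples:
        for anc, parts in output_lines.items():
            on_a = anc_a == anc
            on_b = anc_b == anc
            # Booleans add as integers, so no if/else chain is needed.
            parts["hapcount"].append(str(on_a + on_b))
            parts["dosage"].append(
                str((on_a and allele_a == "1") + (on_b and allele_b == "1"))
            )
            parts["vcf"].append(
                f"{allele_a if on_a else '.'}|{allele_b if on_b else '.'}"
            )
    # Join each list a single time instead of growing a string per sample.
    return {
        anc: {name: "\t".join(values) for name, values in parts.items()}
        for anc, parts in output_lines.items()
    }


rows = build_rows([(("0", "1"), (0, 1)), (("1", "1"), (0, 0))], num_ancs=2)
print(rows[0]["vcf"])
```

Appending to a list is amortized constant time, and `str.join` builds the final string in one pass, whereas repeated `+=` on a growing string may copy the accumulated text on each step.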

(lines 195-201) - Replaced the re.split call with two separate string split calls, since the format of the genotype call is guaranteed. The string split calls are significantly faster than re.split because they avoid the overhead of the regular expression engine, and because the structure is known, the regular expression is not needed.
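A small sketch of the two approaches, for comparison. The field layout `"0|1:2:0"` (phased genotype, then one ancestry label per haplotype) and the exact pattern passed to `re.split` are assumptions made for illustration; the function names are hypothetical.

```python
import re


def split_with_regex(call: str) -> list[str]:
    # One call, but it goes through the regular expression engine.
    return re.split(r"[|:]", call)


def split_with_str(call: str) -> list[str]:
    # Two plain splits that rely on the fixed "a|b:anc_a:anc_b" layout.
    genotype, anc_a, anc_b = call.split(":")
    allele_a, allele_b = genotype.split("|")
    return [allele_a, allele_b, anc_a, anc_b]


call = "0|1:2:0"
print(split_with_regex(call) == split_with_str(call))
```

The tuple unpacking in `split_with_str` also acts as a format check: a call with a different number of fields raises a ValueError instead of silently producing a list of the wrong length. The relative speed of the two versions can be measured on real data with the standard `timeit` module.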

(line 99) - Fixed a bug in the if statement "if compress_output == True" where the log statement was never printed even if the user provided the compress_output flag because the statement was indented 1 level too far.

Testing:
I tested this on a test set of ~500 variants for 250k individuals and compared it to the original version of the script. The time and memory usage are listed below. Everything was tested on an Ubuntu 24 LTS server with Python 3.14 and profiled using Scalene (v2.2.1).

  • Original: 4h:58m:58.186s with 171 MB of memory
  • New: 28m:12.165s with 314 MB of memory
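The ratios implied by these two measurements can be worked out directly. The snippet below only restates the numbers listed above; the helper `to_seconds` is a hypothetical convenience function.

```python
def to_seconds(hours: int, minutes: int, seconds: float) -> float:
    """Convert an h:m:s reading to seconds."""
    return hours * 3600 + minutes * 60 + seconds


original_s = to_seconds(4, 58, 58.186)  # original script
new_s = to_seconds(0, 28, 12.165)       # refactored script

speedup = original_s / new_s
memory_ratio = 314 / 171  # MB, new vs. original

print(f"speedup: {speedup:.1f}x, memory: {memory_ratio:.2f}x")
```

This prints a speedup of about 10.6x and a memory ratio of about 1.84x for this test set.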

---
On branch speedup_flare_tracts_extraction
Changes to be committed:
	modified:   scripts/extract_tracts_flare.py

Refactored code to increase performance:
---

*Issue*:
It was observed that the extract_tracts_flare.py script was taking a
long time to run. Profiling revealed that the issue was the string
concatenation being used to generate each row in the output file. This
string concatenation procedure required many new string allocations
which are slow and become slower as the string grows bigger and bigger.

*Solution*:
(lines 207-240) - We restructured the output_lines dictionary so the keys are the ancestry
groups and the values are lists. We added the allele count, the ancestry haplotype
count, and the vcf genotype output to these lists and then join the lists
into strings right before writing. This approach reduces the number of
new string allocations that python has to make throughout the runtime.
We removed a lot of the branching by checking if values are true or false
and then just adding the boolean values together.

(lines 195-201) - Replace the re.split call with 2 separate string split calls since we know the guaranteed format of the genotype call. It has been shown that the string split calls are significantly faster than re.split because they don't have the overhead from the regular expression. Since we know the structure we don't need to use the regular expression.

(line 99) - Fixed a bug in the if statement "if compress_output == True" where the log statement was never printed even if the user provided the compress_output flag because the statement was indented 1 level too far.
…ing newline when the ancdos line was being added to the output file
@jtb324 jtb324 changed the title from "added a safety check to make sure the vcf file isn't empty" to "FLARE extract tracts speedup and added a safety check to make sure the vcf file isn't empty" on Apr 3, 2026