FLARE extract tracts speedup and added a safety check to make sure the vcf file isn't empty by jtb324 · Pull Request #53 · Atkinson-Lab/Tractor

jtb324 · 2026-01-21T21:47:14Z

Original Pull request (1/21/26):

This change addresses issue #52. The parse_vcf_filepath function already validated that the file exists; it now also checks that the file is not empty. If the file is empty, a message is logged and a ValueError is raised, since the program assumes the file has content. This gives the user one more safeguard that everything is behaving properly.
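A minimal sketch of this kind of check, for illustration only. The helper name `check_vcf_path`, its signature, and the log messages are assumptions; the actual parse_vcf_filepath in the script may be structured differently.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def check_vcf_path(vcf_path: str) -> Path:
    """Hypothetical stand-in for the validation described above."""
    path = Path(vcf_path)

    # Existing behaviour described in the PR: the file must exist.
    if not path.is_file():
        logger.error("VCF file not found: %s", path)
        raise FileNotFoundError(f"VCF file not found: {path}")

    # New behaviour described in the PR: a zero-byte file is rejected.
    if path.stat().st_size == 0:
        logger.error("VCF file is empty: %s", path)
        raise ValueError(f"VCF file is empty: {path}")

    return path
```

Note that a size check like this only catches files that are literally zero bytes; a gzip-compressed file with no records still has a non-zero size.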

*Additionally ran the code through the black formatter to get consistent spacing and pythonic style

This has been tested on a Linux Ubuntu 24 LTS server with python 3.11

Additional Pull Request (4/3/26):

We observed that the extract_tracts_flare.py script was taking a long time to run when extracting tracts from a FLARE run on 250k participants (the jobs for chromosomes 1-12 were being killed for exceeding the time limit after 10 days). Profiling revealed that the bottleneck was the string concatenation used to generate each row of the output file. Each concatenation allocates a new string, which is slow and gets slower as the string grows. The proposed change showed a roughly 10x speedup in testing, at the cost of a modest increase in memory.

Solution:
(lines 207-240) - We restructured the output_lines dictionary so that the keys are the ancestry groups and the values are lists. We append the allele count, the ancestry haplotype count, and the VCF genotype output to these lists and then join each list into a string right before writing. This reduces the number of new string allocations Python has to make throughout the run. The change also removed much of the branching: instead of checking whether values are true or false, the boolean values are simply added together.
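The list-then-join pattern and the boolean summing can be sketched roughly as follows. This is not the code from the script: the function name `build_rows`, the tuple layout of each call, the inner dictionary keys, and the tab separator are all assumptions made for illustration.

```python
def build_rows(calls, num_ancs):
    """Build per-ancestry output rows from phased calls.

    `calls` is assumed to be a list of (geno_a, geno_b, anc_a, anc_b)
    integer tuples, one per sample (hypothetical layout).
    """
    # One set of lists per ancestry; fields are appended, never concatenated.
    out = {k: {"dosage": [], "hapcount": [], "vcf": []} for k in range(num_ancs)}

    for geno_a, geno_b, anc_a, anc_b in calls:
        for k in range(num_ancs):
            hap_a = anc_a == k  # True if haplotype A carries ancestry k
            hap_b = anc_b == k  # True if haplotype B carries ancestry k

            # Booleans add as 0/1, so no if/else branching is needed.
            out[k]["hapcount"].append(str(hap_a + hap_b))
            out[k]["dosage"].append(
                str((hap_a and geno_a == 1) + (hap_b and geno_b == 1))
            )
            out[k]["vcf"].append(
                f"{geno_a if hap_a else '.'}|{geno_b if hap_b else '.'}"
            )

    # A single join per row replaces many incremental string allocations.
    return {
        k: {name: "\t".join(values) for name, values in cols.items()}
        for k, cols in out.items()
    }


rows = build_rows([(0, 1, 0, 1), (1, 1, 1, 1)], num_ancs=2)
print(rows[1]["dosage"])  # alternate-allele counts on ancestry-1 haplotypes
```

Appending to a list is amortized constant time, while repeated `+=` on a growing string may copy the whole string each time, which is the cost the PR describes.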

(lines 195-201) - Replaced the re.split call with two separate str.split calls, since the format of the genotype call is guaranteed. Plain string splits are significantly faster than re.split because they avoid the overhead of the regex engine, and since we know the structure we don't need a regular expression.
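A sketch of the two-split approach next to a regex equivalent. It assumes each genotype field has exactly the form `GT:AN1:AN2` with a phased `a|b` genotype (for example `0|1:2:0`); the function names and the regex pattern are illustrative and not taken from the script.

```python
import re


def parse_call_regex(call):
    # Regex version: split on either "|" or ":" in one pass.
    return tuple(re.split(r"[|:]", call))


def parse_call_split(call):
    # Two plain splits: first the colon-separated fields, then the genotype.
    gt, anc_a, anc_b = call.split(":")
    geno_a, geno_b = gt.split("|")
    return geno_a, geno_b, anc_a, anc_b


call = "0|1:2:0"
# Both versions return the same four string fields for this format.
assert parse_call_split(call) == parse_call_regex(call)
print(parse_call_split(call))
```

The unpacking in `parse_call_split` raises a ValueError if a field does not have the expected number of parts, so a malformed call fails loudly rather than being silently mis-parsed.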

(line 99) - Fixed a bug in the if statement "if compress_output == True" where the log statement was never printed even if the user provided the compress_output flag because the statement was indented 1 level too far.

Testing:
I tested this on a test set of ~500 variants for 250k individuals and compared against the original version of the script. The run time and memory usage are listed below. Everything was tested on an Ubuntu 24 LTS server with Python 3.14 and profiled using scalene (v2.2.1).

Original: 4h:58m:58.186s with 171 MB of memory
New: 28m:12.165s with 314 MB of memory
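As a quick arithmetic check on the figures above, the ratios can be computed directly from the reported times and memory values. The helper name `to_seconds` is just for illustration.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h:m:s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


original_s = to_seconds(4, 58, 58.186)  # original script
new_s = to_seconds(0, 28, 12.165)       # refactored script

speedup = original_s / new_s
memory_ratio = 314 / 171                # MB, new / original

print(f"speedup: {speedup:.1f}x")            # speedup: 10.6x
print(f"memory ratio: {memory_ratio:.2f}x")  # memory ratio: 1.84x
```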

---
On branch speedup_flare_tracts_extraction
Changes to be committed:
modified: scripts/extract_tracts_flare.py

Refactored code to increase performance:
---

*Issue*: It was observed that the extract_tracts_flare.py script was taking a long time to run. Profiling revealed that the issue was the string concatenation being used to generate each row in the output file. This string concatenation procedure required many new string allocations, which are slow and become slower as the string grows.

*Solution*:

(lines 207-240) - We restructured the output_lines dictionary so that the keys are the ancestry groups and the values are lists. We append the allele count, the ancestry haplotype count, and the VCF genotype output to these lists and then join each list into a string right before writing. This reduces the number of new string allocations Python has to make throughout the run. We also removed much of the branching: instead of checking whether values are true or false, the boolean values are simply added together.

(lines 195-201) - Replaced the re.split call with two separate str.split calls, since the format of the genotype call is guaranteed. Plain string splits are significantly faster than re.split because they avoid the overhead of the regex engine, and since we know the structure we don't need a regular expression.

(line 99) - Fixed a bug in the if statement "if compress_output == True" where the log statement was never printed even if the user provided the compress_output flag because the statement was indented 1 level too far.

jtb324 added 3 commits January 21, 2026 15:41

added a safety check to make sure the vcf file isn't empty

4442d1c

fix(extract_tracts_flare.py) - fixed an issue where there was a missing newline when the ancdos line was being added to the output file

cf9f8eb

jtb324 changed the title ~~added a safety check to make sure the vcf file isn't empty~~ FLARE extract tracts speedup and added a safety check to make sure the vcf file isn't empty Apr 3, 2026

jtb324 wants to merge 3 commits into Atkinson-Lab:master from jtb324:master
