FLARE extract tracts speedup and added a safety check to make sure the vcf file isn't empty #53
Open
jtb324 wants to merge 3 commits into Atkinson-Lab:master
---
On branch speedup_flare_tracts_extraction
Changes to be committed:
modified: scripts/extract_tracts_flare.py

Refactored code to increase performance:
---
*Issue*: It was observed that the extract_tracts_flare.py script was taking a long time to run. Profiling revealed that the bottleneck was the string concatenation used to generate each row in the output file. Each concatenation allocates a new string, which is slow and gets slower as the string grows.

*Solution*:
(lines 207-240) - We restructured the output_lines dictionary so the keys are the ancestry groups and the values are lists. We append the allele count, the ancestry haplotype count, and the vcf genotype output to these lists and then join the lists into strings right before writing. This approach reduces the number of new string allocations that Python has to make throughout the runtime. We also removed a lot of the branching by evaluating the conditions as booleans and adding the boolean values together.
(lines 195-201) - Replaced the re.split call with 2 separate string split calls since we know the guaranteed format of the genotype call. String split calls are significantly faster than re.split because they don't have the overhead of the regular expression, and since we know the structure we don't need the regular expression.
(line 99) - Fixed a bug in the if statement "if compress_output == True" where the log statement was never printed even if the user provided the compress_output flag because the statement was indented 1 level too far.

…ing newline when the ancdos line was being added to the output file
Original Pull request (1/21/26):
The change addresses issue #52. The parse_vcf_filepath function now also checks whether the file is empty. This function was already validating that the file exists, so now it also validates that the file is not empty. If the file is empty, then a message is logged and a ValueError is raised, since the program assumes that the file should have content. This change gives the user one more safeguard to make sure everything is behaving properly.
*Additionally, ran the code through the black formatter to get consistent spacing and Pythonic style.
This has been tested on a Linux Ubuntu 24 LTS server with Python 3.11.
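The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the helper name `check_vcf_not_empty`, the logger setup, and the error messages are all hypothetical, and it assumes "empty" means a file size of zero bytes (a compressed file that holds no records would still have a nonzero size and would pass this check).

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def check_vcf_not_empty(vcf_path):
    """Return the path if the file exists and has content (hypothetical helper)."""
    path = Path(vcf_path)
    # Existence check, which the PR says was already being done
    if not path.is_file():
        raise FileNotFoundError(f"VCF file not found: {path}")
    # New safeguard: a zero-byte file is logged and rejected
    if path.stat().st_size == 0:
        logger.error("VCF file is empty: %s", path)
        raise ValueError(f"VCF file is empty: {path}")
    return path
```

Raising early keeps the failure close to its cause, instead of letting an empty input produce empty output files later in the run.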
Additional Pull Request (4/3/26):
It was observed that the extract_tracts_flare.py script was taking a long time to run when we were trying to extract the tracts from a FLARE run on 250k participants (the jobs for chromosomes 1-12 were being killed at the time limit after 10 days). Profiling revealed that the bottleneck was the string concatenation used to generate each row in the output file. Each concatenation allocates a new string, which is slow and gets slower as the string grows. The proposed change showed a 10x speedup in testing with a minimal memory increase.
Solution:
(lines 207-240) - We restructured the output_lines dictionary so the keys are the ancestry groups and the values are lists. We append the allele count, the ancestry haplotype count, and the vcf genotype output to these lists and then join the lists into strings right before writing. This approach reduces the number of new string allocations that Python has to make throughout the runtime. This change also removed a lot of the branching by evaluating the conditions as booleans and adding the boolean values together.
(lines 195-201) - Replaced the re.split call with 2 separate string split calls since we know the guaranteed format of the genotype call. String split calls are significantly faster than re.split because they don't have the overhead of the regular expression, and since we know the structure we don't need the regular expression.
(line 99) - Fixed a bug in the if statement "if compress_output == True" where the log statement was never printed even if the user provided the compress_output flag because the statement was indented 1 level too far.
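The first two changes above can be sketched as follows. This is an illustration under stated assumptions, not the code from the script: it assumes each sample field looks like `0|1:0:1` (phased genotype, then the ancestry of each haplotype), and the function names, dictionary keys, and the `.` placeholder for masked alleles are hypothetical.

```python
import re


def parse_genotype_regex(field):
    # Regex-based parsing, shown only for comparison
    return re.split(r"[|:]", field)


def parse_genotype_split(field):
    # Two plain str.split calls; valid only because the layout is fixed
    genotype, anc1, anc2 = field.split(":")
    allele1, allele2 = genotype.split("|")
    return [allele1, allele2, anc1, anc2]


def build_rows(sample_fields, ancestries):
    # One set of lists per ancestry; values are appended, never concatenated
    rows = {anc: {"dosage": [], "hapcount": [], "vcf": []} for anc in ancestries}
    for field in sample_fields:
        allele1, allele2, anc1, anc2 = parse_genotype_split(field)
        for anc in ancestries:
            on_hap1 = anc1 == anc
            on_hap2 = anc2 == anc
            # Booleans add as integers, which replaces if/else counting
            rows[anc]["hapcount"].append(str(on_hap1 + on_hap2))
            dosage = (on_hap1 and allele1 == "1") + (on_hap2 and allele2 == "1")
            rows[anc]["dosage"].append(str(dosage))
            masked1 = allele1 if on_hap1 else "."
            masked2 = allele2 if on_hap2 else "."
            rows[anc]["vcf"].append(f"{masked1}|{masked2}")
    # Join each list once, right before the row would be written
    return {
        anc: {key: "\t".join(values) for key, values in parts.items()}
        for anc, parts in rows.items()
    }
```

Appending to a list and joining once builds each output row in a single pass, whereas repeated `+=` on a growing string may copy the accumulated text on every step.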
Testing:
I tested this on a test set of ~500 variants for 250k individuals and compared it to the original version of the script. The CPU usage and memory are listed below. Everything was tested on an Ubuntu 24 LTS server with Python 3.14 and profiled using scalene (v2.2.1).