Refactor parse_vcf_ref #3 #5 #8

EvaLiyt · 2025-10-08T23:38:00Z

Support nucleotide or nd16 reference FASTA by specifying --ref for reference file path and --refType for reference file data type.
Fill missing VCF sites with reference bases.
Other changes:

modify parse_vcf_ref
- remove SeqIO and write a method parse_fasta for lighter dependency
- extract methods for translating vcf
- remove write fasta code and use existing write_fasta method, modify this method to allow each sequence for one line
make debug mode log a file, change specific data debug to random index for suiting all dataset
remove prints and unhelpful comments, throw exceptions directly
throw exception when vcf comes from a different reference sequence
allow multiple chromosomes in reference fasta and vcf
move ref_sequence_str = str(ref_sequence) after for loop for simplicity
use Guard statements for check and quit for printing errors
write unit tests for all new functions
- translate_genome allow haploid reference passing in, comment the unit test
- test parse_vcf_ref
  - test parse_fasta function
  - 1. recognise multiple reference sequences correctly, throw error when sequences with same chromosome name but conflict with each other
  - 1. throw error when file not found or empty
  - test translate_fasta
  - 1. copy when refType nd16
  - 1. consider homozygous for refType nucleotide
  - test translate_vcf
  - 1. work for nd16 and binary
  - test vcf2fasta.py
  - 1. output correct number of sequences and length
  - 1. output bases are correct when using two reference sequences file

J-Moravec

Thank you for your PR. I greatly appreciate it!

There are some few general improvements regarding code structure that would make the code quite bit more maintainable. We want to minimize the time required for someone new to jump in. After all, next year we will be someone new.

If I am reading correctly, there is some code duplication happening. There are some simplification especially around deeply nested conditions that could be improved.
These improvements, along with docstrings would greatly improve the code.
Consider also type hinting, knowing which types are various parameters greatly improves understanding.

Finally, I think that the debugging code is necessarily complicated and we don't need it at this point. If we want to include it, we should talk about it. Is the debugging dev only? Do we need dev-only debugging? We can modify the code to add print statements. And the code is short, does only a few things, so tests should provide enough understanding and security.

Maybe is the debugging for users? If the code would randomly fail on invalid input (not a lot of checks currently present in the code), what would happen? How would it look like? Would errors be clear or do we need debugs for that?

src/test/test_parse_vcf_ref.py

J-Moravec · 2025-10-14T22:05:25Z

src/vcf2fasta.py


 if __name__ == "__main__":
    args = parse_args()
+    debug_mode = args.debug


Fix: Move anything that is not argument parsing into the vcf2fasta function.

User-interfaces are notoriously hard to debug. CLI is still fine, but you see that to do that, you need to run Python in a subprocess and catch its output.

A better way to do this is to isolate any functionality into a module (vcf2fasta()), and have CLI as a thin wrapper. This thin wrapper can be user-tested or whatever, but only thin it does is parsing command-line arguments and sending them to the isolated module.

Since the isolated module now encompasses all other functionality, it can be tested without doing any shenanigans because, well, its just a function (or class) without other dependencies other than some input arguments.

src/vcf2fasta.py

J-Moravec · 2025-10-14T22:26:17Z

src/vcf2fasta.py

Tasks:

Split big function into subfunctions with clearly defined parameters (see comments).

Streamline functions where possible, functions should be easy to read and understand!

Create documentation (docstring) to document what the function does and the business problem they are solving.

Functions should be hyperfocused at doing "one thing". What is the "one thing" is the difficult part and its the art of choosing the right abstraction. Big picture functions should read "Do this, change this, and then do that". This makes easy to see what is the big picture that the function is solving. Smaller functions are then defined as "do this" or "change this", their purpose is known, and they can be more detailed. At the same time, functions should not do any things at once, because many things then need to be balanced at the same time. Its fine if they do many things serially (first do this, then do that), or if really required to be done at the same time (e.g., in single iteration), the details should be wrapped in functions, if the details are complicated enough to take away from understanding the big picture.

Basically the same thing I said before. Streamlining. Try to not do to much "if if if", makes it hard to understand what is going on. Guard clauses (you used them) with early exit tend to be better. Often "if if if" is a code smell that tells you that you are doing many things at once and should think about decreasing the number of things you do, because you need to remember every single thing you do, keep it in your memory, and understand it. It is good to minimize things one need to read, remember, and understand, to understand a certain piece of code.

I think the code grew substantially to the degree that we need docstrings. I am getting lost in some of the functions I wrote myself, which suggest that they were supposed to be there in the first place.

Feel free to refactor some of the original functions as well. Now that I am reading it, getting translated VCF requires quite a substantial chain of functions. They are all very clear and hypefocused, so I kind of understand their purpose and don't see how inlining them would make stuff easier to read (since some of it is just maps).

I think typehinting would definitely make reading it so much better.

J-Moravec · 2025-10-14T22:30:11Z

src/test/test_parse_vcf_ref.py

Notes: I have two general notes:

It's up to you if you want to do parametrized tests. Some of the test cases are not even handled differently by the code, so I think it would make a great sense, but there is some beauty in having repeated test looking quite same compared to the significantly uglier parametrized tests.

unit-tests should do exactly one thing. There are few tests that try to test multiple things at once. You should split them and simplify them. Use fixtures to establish setups if needed (fixtures themselves should be parametrized). Definitely remove subprocess. No need that.

src/test/test_parse_vcf_ref.py

tests/test_parse_vcf_ref.py

EvaLiyt added 4 commits October 8, 2025 18:08

reconstruct vcf2fasta.py with unit tests

8ae87bb

remove comments don't need

929a8eb

test multiple reference sequences

0a10ff8

remove unused import

dc2dda1

J-Moravec requested changes Oct 14, 2025

View reviewed changes

EvaLiyt added 3 commits October 15, 2025 18:03

construct the repo with make

f132fa2

improve tests

93f91de

uncomment check length of genome

e28b2c8

J-Moravec reviewed Oct 15, 2025

View reviewed changes

tests/test_parse_vcf_ref.py Show resolved Hide resolved

EvaLiyt added 5 commits October 16, 2025 14:57

add helper function to check genotype at specific position for test

a171e9c

pass make check

b0e6a32

remove logging for now

a0e525f

improve vcf2fasta

05c3b08

add docstrings and typehinting

cdb3a13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor parse_vcf_ref #3 #5 #8

Refactor parse_vcf_ref #3 #5 #8

Uh oh!

EvaLiyt commented Oct 8, 2025

Uh oh!

J-Moravec left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

J-Moravec Oct 14, 2025

Uh oh!

Uh oh!

J-Moravec Oct 14, 2025

Uh oh!

J-Moravec Oct 14, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Refactor parse_vcf_ref #3 #5 #8

Are you sure you want to change the base?

Refactor parse_vcf_ref #3 #5 #8

Uh oh!

Conversation

EvaLiyt commented Oct 8, 2025

Uh oh!

J-Moravec left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

J-Moravec Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

J-Moravec Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

J-Moravec Oct 14, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants