Skip to content

Conversation

@EvaLiyt
Copy link

@EvaLiyt EvaLiyt commented Oct 8, 2025

Support nucleotide or nd16 reference FASTA by specifying --ref for reference file path and --refType for reference file data type.
Fill missing VCF sites with reference bases.
Other changes:

  1. modify parse_vcf_ref
    • remove SeqIO and write a method parse_fasta for lighter dependency
    • extract methods for translating vcf
    • remove write fasta code and use existing write_fasta method, modify this method to allow each sequence for one line
  2. make debug mode log a file, change specific data debug to random index for suiting all dataset
  3. remove prints and unhelpful comments, throw exceptions directly
  4. throw exception when vcf comes from a different reference sequence
  5. allow multiple chromosomes in reference fasta and vcf
  6. move ref_sequence_str = str(ref_sequence) after for loop for simplicity
  7. use Guard statements for check and quit for printing errors
  8. write unit tests for all new functions
    • translate_genome allow haploid reference passing in, comment the unit test
    • test parse_vcf_ref
      • test parse_fasta function
        1. recognise multiple reference sequences correctly, throw error when sequences with same chromosome name but conflict with each other
        1. throw error when file not found or empty
      • test translate_fasta
        1. copy when refType nd16
        1. consider homozygous for refType nucleotide
      • test translate_vcf
        1. work for nd16 and binary
      • test vcf2fasta.py
        1. output correct number of sequences and length
        1. output bases are correct when using two reference sequences file

Copy link
Member

@J-Moravec J-Moravec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR. I greatly appreciate it!

There are some few general improvements regarding code structure that would make the code quite bit more maintainable. We want to minimize the time required for someone new to jump in. After all, next year we will be someone new.

If I am reading correctly, there is some code duplication happening. There are some simplification especially around deeply nested conditions that could be improved.
These improvements, along with docstrings would greatly improve the code.
Consider also type hinting, knowing which types are various parameters greatly improves understanding.

Finally, I think that the debugging code is necessarily complicated and we don't need it at this point. If we want to include it, we should talk about it. Is the debugging dev only? Do we need dev-only debugging? We can modify the code to add print statements. And the code is short, does only a few things, so tests should provide enough understanding and security.

Maybe is the debugging for users? If the code would randomly fail on invalid input (not a lot of checks currently present in the code), what would happen? How would it look like? Would errors be clear or do we need debugs for that?

src/vcf2fasta.py Outdated

if __name__ == "__main__":
args = parse_args()
debug_mode = args.debug
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix: Move anything that is not argument parsing into the vcf2fasta function.


User-interfaces are notoriously hard to debug. CLI is still fine, but you see that to do that, you need to run Python in a subprocess and catch its output.

A better way to do this is to isolate any functionality into a module (vcf2fasta()), and have CLI as a thin wrapper. This thin wrapper can be user-tested or whatever, but only thin it does is parsing command-line arguments and sending them to the isolated module.

Since the isolated module now encompasses all other functionality, it can be tested without doing any shenanigans because, well, its just a function (or class) without other dependencies other than some input arguments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tasks:

  1. Split big function into subfunctions with clearly defined parameters (see comments).
  2. Streamline functions where possible, functions should be easy to read and understand!
  3. Create documentation (docstring) to document what the function does and the business problem they are solving.

  1. Functions should be hyperfocused at doing "one thing". What is the "one thing" is the difficult part and its the art of choosing the right abstraction. Big picture functions should read "Do this, change this, and then do that". This makes easy to see what is the big picture that the function is solving. Smaller functions are then defined as "do this" or "change this", their purpose is known, and they can be more detailed. At the same time, functions should not do any things at once, because many things then need to be balanced at the same time. Its fine if they do many things serially (first do this, then do that), or if really required to be done at the same time (e.g., in single iteration), the details should be wrapped in functions, if the details are complicated enough to take away from understanding the big picture.

  2. Basically the same thing I said before. Streamlining. Try to not do to much "if if if", makes it hard to understand what is going on. Guard clauses (you used them) with early exit tend to be better. Often "if if if" is a code smell that tells you that you are doing many things at once and should think about decreasing the number of things you do, because you need to remember every single thing you do, keep it in your memory, and understand it. It is good to minimize things one need to read, remember, and understand, to understand a certain piece of code.

  3. I think the code grew substantially to the degree that we need docstrings. I am getting lost in some of the functions I wrote myself, which suggest that they were supposed to be there in the first place.

Feel free to refactor some of the original functions as well. Now that I am reading it, getting translated VCF requires quite a substantial chain of functions. They are all very clear and hypefocused, so I kind of understand their purpose and don't see how inlining them would make stuff easier to read (since some of it is just maps).

I think typehinting would definitely make reading it so much better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes: I have two general notes:

  1. It's up to you if you want to do parametrized tests. Some of the test cases are not even handled differently by the code, so I think it would make a great sense, but there is some beauty in having repeated test looking quite same compared to the significantly uglier parametrized tests.

  2. unit-tests should do exactly one thing. There are few tests that try to test multiple things at once. You should split them and simplify them. Use fixtures to establish setups if needed (fixtures themselves should be parametrized). Definitely remove subprocess. No need that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants