-
Notifications
You must be signed in to change notification settings - Fork 1
Refactor parse_vcf_ref #3 #5 #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
J-Moravec
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your PR. I greatly appreciate it!
There are some few general improvements regarding code structure that would make the code quite bit more maintainable. We want to minimize the time required for someone new to jump in. After all, next year we will be someone new.
If I am reading correctly, there is some code duplication happening. There are some simplification especially around deeply nested conditions that could be improved.
These improvements, along with docstrings would greatly improve the code.
Consider also type hinting, knowing which types are various parameters greatly improves understanding.
Finally, I think that the debugging code is necessarily complicated and we don't need it at this point. If we want to include it, we should talk about it. Is the debugging dev only? Do we need dev-only debugging? We can modify the code to add print statements. And the code is short, does only a few things, so tests should provide enough understanding and security.
Maybe is the debugging for users? If the code would randomly fail on invalid input (not a lot of checks currently present in the code), what would happen? How would it look like? Would errors be clear or do we need debugs for that?
src/vcf2fasta.py
Outdated
|
|
||
| if __name__ == "__main__": | ||
| args = parse_args() | ||
| debug_mode = args.debug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix: Move anything that is not argument parsing into the vcf2fasta function.
User-interfaces are notoriously hard to debug. CLI is still fine, but you see that to do that, you need to run Python in a subprocess and catch its output.
A better way to do this is to isolate any functionality into a module (vcf2fasta()), and have CLI as a thin wrapper. This thin wrapper can be user-tested or whatever, but only thin it does is parsing command-line arguments and sending them to the isolated module.
Since the isolated module now encompasses all other functionality, it can be tested without doing any shenanigans because, well, its just a function (or class) without other dependencies other than some input arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tasks:
- Split big function into subfunctions with clearly defined parameters (see comments).
- Streamline functions where possible, functions should be easy to read and understand!
- Create documentation (docstring) to document what the function does and the business problem they are solving.
-
Functions should be hyperfocused at doing "one thing". What is the "one thing" is the difficult part and its the art of choosing the right abstraction. Big picture functions should read "Do this, change this, and then do that". This makes easy to see what is the big picture that the function is solving. Smaller functions are then defined as "do this" or "change this", their purpose is known, and they can be more detailed. At the same time, functions should not do any things at once, because many things then need to be balanced at the same time. Its fine if they do many things serially (first do this, then do that), or if really required to be done at the same time (e.g., in single iteration), the details should be wrapped in functions, if the details are complicated enough to take away from understanding the big picture.
-
Basically the same thing I said before. Streamlining. Try to not do to much "if if if", makes it hard to understand what is going on. Guard clauses (you used them) with early exit tend to be better. Often "if if if" is a code smell that tells you that you are doing many things at once and should think about decreasing the number of things you do, because you need to remember every single thing you do, keep it in your memory, and understand it. It is good to minimize things one need to read, remember, and understand, to understand a certain piece of code.
-
I think the code grew substantially to the degree that we need docstrings. I am getting lost in some of the functions I wrote myself, which suggest that they were supposed to be there in the first place.
Feel free to refactor some of the original functions as well. Now that I am reading it, getting translated VCF requires quite a substantial chain of functions. They are all very clear and hypefocused, so I kind of understand their purpose and don't see how inlining them would make stuff easier to read (since some of it is just maps).
I think typehinting would definitely make reading it so much better.
src/test/test_parse_vcf_ref.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notes: I have two general notes:
-
It's up to you if you want to do parametrized tests. Some of the test cases are not even handled differently by the code, so I think it would make a great sense, but there is some beauty in having repeated test looking quite same compared to the significantly uglier parametrized tests.
-
unit-tests should do exactly one thing. There are few tests that try to test multiple things at once. You should split them and simplify them. Use fixtures to establish setups if needed (fixtures themselves should be parametrized). Definitely remove subprocess. No need that.
Support nucleotide or nd16 reference FASTA by specifying
--reffor reference file path and--refTypefor reference file data type.Fill missing VCF sites with reference bases.
Other changes:
parse_vcf_refparse_fastafor lighter dependencywrite_fastamethod, modify this method to allow each sequence for one lineref_sequence_str = str(ref_sequence)after for loop for simplicitytranslate_genomeallow haploid reference passing in, comment the unit testparse_vcf_reftranslate_fastarefType nd16refType nucleotidetranslate_vcfvcf2fasta.py