Conversation
eseraygun
left a comment
There was a problem hiding this comment.
Summary: Let's make this PR about unit testing and drop any additions to the existing algorithms. I explained my reasoning in the comments.
| seq_alignment.second_indices = indices2 | ||
| return seq_alignment | ||
|
|
||
| class SmithWatermanAligner(object): |
There was a problem hiding this comment.
This is just another implementation of an already provided algorithm. Furthermore, the implementation does not follow any of the interfaces/principles currently provided by this library. (It just reuses some method names such as computeAlignmentMatrix.)
I'm against turning this library into a bag of separately-maintained alignment-related stuff. So, I say let's address the issues that you brought up and maybe release a backwards incompatible version 2.0 with a completely new interface.
There was a problem hiding this comment.
This was just temporarily. I wanted to leave the old implementations there while this one is in progress. Once we are happy with one implementation the others should be changed accordingly (I had something related to that in the TODO section). And the SequenceAlignment should change as well.
alignment/sequencealigner.py
Outdated
| return f | ||
|
|
||
| def _traceback(self, score_matrix, start_locs): | ||
| pending_roots = [ |
There was a problem hiding this comment.
If I'm not mistaken, this is a breadth-first search of optimal alignments meaning that its space complexity is proportional to the number of optimal alignments. I doubt that this is ideal as it's hard to put a bound on the number of optimal alignments and the memory can easily can go out of control.
Depth-first guarantees that at most O(N) space will be needed. Obviously my recursive implementation is pretty bad as it is abusing the program stack and it is returning a list of all optimal alignments. The former can be fixed by using a heap-based stack. The latter can be fixed by using iterators: That would give the client a chance to stop early.
There was a problem hiding this comment.
When I first wrote it I thought it might pick different branches on the way but now I changed it to find just one path per starting point so it could be changed to just do it for each root separately. I don't know whether it is desirable to go up and left at the same time (assuming they have the same score and diagonally is worse). If it isn't then moving that loop outside might be preferable and it could become a generator instead.
There was a problem hiding this comment.
I have to admit it was a bit late when I first wrote it and I just wanted to understand the algorithm properly. I've changed that now. It's depth first, it's going both up and left if they have the same score (there is a new test for that now). And it's a generator (although the align method isn't and has a limit parameter)
| @@ -0,0 +1,254 @@ | |||
| """ | |||
There was a problem hiding this comment.
Thanks a lot for this. I have objections about the new additions to the library. So, what would you say about keeping only the tests about the existing modules and leaving the rest for a future PR?
There was a problem hiding this comment.
I don't mind so much. But maybe we could get the other code into the right shape. It was still work in progress.
| # Unit test / coverage reports | ||
| .coverage | ||
| .tox | ||
| .cache |
There was a problem hiding this comment.
Out of curiosity: Which tool creates a .cache folder?
There was a problem hiding this comment.
It was pytest, it added test results to that folder - I wasn't quite sure about that either.
This isn't ready yet.
I've re-implemented the Smith Waterman algorithm in a non-recursive way. It creates SequenceAlignment only once but still supports multiple alignments. It would probably be good to add a limit if the caller is only interested in one alignment.
The other thing is that it doesn't creates EncodedSequence for the input if the passed in sequence was an EncodedSequence, otherwise it uses Sequence. It would also work if it only creates Sequence - at least all of the tests pass. In fact I am not sure why we need to wrap it in a Sequence object at all. Python's list is fine. And the EncodedSequence can be list like. There isn't much of an advantage using EncodedSequence in the response - the list has already been created.
The algorithm works with any sequence. The gap element to use can be passed in.
Indicies are available as first_indicies and second_indicies - a list of integers with None for gaps.
Some TODOs:
Anyway, let me know what you think.