Skip to content

Non recursive impl with indicies#7

Open
de-code wants to merge 11 commits intoeseraygun:masterfrom
de-code:non-recursive-impl-with-indicies
Open

Non recursive impl with indicies#7
de-code wants to merge 11 commits intoeseraygun:masterfrom
de-code:non-recursive-impl-with-indicies

Conversation

@de-code
Copy link
Contributor

@de-code de-code commented May 5, 2017

This isn't ready yet.

I've re-implemented the Smith Waterman algorithm in a non-recursive way. It creates SequenceAlignment only once but still supports multiple alignments. It would probably be good to add a limit if the caller is only interested in one alignment.

The other thing is that it doesn't creates EncodedSequence for the input if the passed in sequence was an EncodedSequence, otherwise it uses Sequence. It would also work if it only creates Sequence - at least all of the tests pass. In fact I am not sure why we need to wrap it in a Sequence object at all. Python's list is fine. And the EncodedSequence can be list like. There isn't much of an advantage using EncodedSequence in the response - the list has already been created.
The algorithm works with any sequence. The gap element to use can be passed in.

Indicies are available as first_indicies and second_indicies - a list of integers with None for gaps.

Some TODOs:

  • apply this method to all aligners and remove the extra SmithWatermanAligner
  • currently it is setting properties on SequenceAlignment, which should be passed to the constructor
  • remove push and pop from SequenceAlignment
  • performance

Anyway, let me know what you think.

Copy link
Owner

@eseraygun eseraygun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary: Let's make this PR about unit testing and drop any additions to the existing algorithms. I explained my reasoning in the comments.

seq_alignment.second_indices = indices2
return seq_alignment

class SmithWatermanAligner(object):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just another implementation of an already provided algorithm. Furthermore, the implementation does not follow any of the interfaces/principles currently provided by this library. (It just reuses some method names such as computeAlignmentMatrix.)

I'm against turning this library into a bag of separately-maintained alignment-related stuff. So, I say let's address the issues that you brought up and maybe release a backwards incompatible version 2.0 with a completely new interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just temporarily. I wanted to leave the old implementations there while this one is in progress. Once we are happy with one implementation the others should be changed accordingly (I had something related to that in the TODO section). And the SequenceAlignment should change as well.

return f

def _traceback(self, score_matrix, start_locs):
pending_roots = [
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, this is a breadth-first search of optimal alignments meaning that its space complexity is proportional to the number of optimal alignments. I doubt that this is ideal as it's hard to put a bound on the number of optimal alignments and the memory can easily can go out of control.

Depth-first guarantees that at most O(N) space will be needed. Obviously my recursive implementation is pretty bad as it is abusing the program stack and it is returning a list of all optimal alignments. The former can be fixed by using a heap-based stack. The latter can be fixed by using iterators: That would give the client a chance to stop early.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first wrote it I thought it might pick different branches on the way but now I changed it to find just one path per starting point so it could be changed to just do it for each root separately. I don't know whether it is desirable to go up and left at the same time (assuming they have the same score and diagonally is worse). If it isn't then moving that loop outside might be preferable and it could become a generator instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit it was a bit late when I first wrote it and I just wanted to understand the algorithm properly. I've changed that now. It's depth first, it's going both up and left if they have the same score (there is a new test for that now). And it's a generator (although the align method isn't and has a limit parameter)

@@ -0,0 +1,254 @@
"""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this. I have objections about the new additions to the library. So, what would you say about keeping only the tests about the existing modules and leaving the rest for a future PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind so much. But maybe we could get the other code into the right shape. It was still work in progress.

# Unit test / coverage reports
.coverage
.tox
.cache
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: Which tool creates a .cache folder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was pytest, it added test results to that folder - I wasn't quite sure about that either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants