Skip to content

Handle all ambiguity codes + add "strict" types#22

Merged
lynn merged 22 commits intomainfrom
ambiguity-types
Nov 14, 2022
Merged

Handle all ambiguity codes + add "strict" types#22
lynn merged 22 commits intomainfrom
ambiguity-types

Conversation

@lynn
Copy link
Copy Markdown
Contributor

@lynn lynn commented Nov 11, 2022

Closes #18.

Changes

Types

  • Nucleotide no longer has N in it, and just represents one of ACTG.
  • NucleotideAmbiguous represents ACTG or one of the 11 ambiguity codes WMRYSKBVDHN.
  • There is a trait NucleotideLike for common behavior between the two, like "to ASCII" and "complement".
  • Similarly, Codon (3x Nucleotide) and CodonAmbiguous (3x NucleotideAmbiguous) are different types now.
  • The ambiguous types have possibilities() methods for iterating over the possible realizations.

Translation

  • Ambiguity codes are properly handled instead of mapping them all to N.
  • DnaSequence is generic over the type of the contained nucleotides. So, DnaSequence::<Nucleotide>::from_str(s) is our "strict mode", and DnaSequence::<NucleotideAmbiguous>::from_str(s) is the lax mode.

Tests

  • I added a test test_dna_parses_strict, which verifies that this "strict mode" indeed only accepts "aAtTcCgG \t".
  • I added a test test_translate_ambiguous, which verifies that TTRTTV maps to protein LX:
          // R means "A or G" and both {TTA,TTG} map to L (Leucine).
          // Thus, "TTR" should map to L.
          //
          // But V means "A or G or C", and TTC maps to F (Phenylalanine).
          // Thus, "TTV" is truly ambiguous and maps to X.
    

@lynn lynn requested a review from vgel November 11, 2022 00:26
@lynn lynn changed the title Introduce NucleotideAmbiguous and CodonAmbiguous types Handle all ambiguity codes + add "strict" types Nov 11, 2022
Comment thread src/nucleotide.rs
Copy link
Copy Markdown
Contributor

@vgel vgel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally good, couple questions + we can absolve my prior unsafe crimes. You were right about the bitfields, worked out great :-)

Comment thread src/fasta.rs
Comment thread src/nucleotide.rs Outdated
Comment thread src/nucleotide.rs
Comment thread src/nucleotide.rs Outdated
Comment thread src/nucleotide.rs Outdated
Comment thread src/rust_api.rs Outdated
Comment thread src/rust_api.rs
Comment on lines +291 to 292
fn dna(dna: &str) -> DnaSequence<NucleotideAmbiguous> {
DnaSequence::from_str(dna).unwrap()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to say it since it'll be a pain, but I think the tests that use dna should be migrated to test on both Nucleotide and NucleotideAmbiguous to ensure the behavior is consistent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not sure if you've used quickcheck, but if you have or are interested in trying it, I think it could be useful to have a property test that tests behavior is consistent in the general case for DnaSequence<Nucleotide> and DnaSequence<NucleotideAmbiguous> where each NucleotideAmbiguous is A, T, C, or G.

If you want to write it and haven't used quickcheck before, LMK and I can point you at our property tests in the private repos and give some pointers.

Comment thread src/trans_table.rs
Comment thread src/trans_table.rs Outdated
Comment thread src/trans_table.rs
Comment thread quickdna/__init__.py Outdated
@vgel
Copy link
Copy Markdown
Contributor

vgel commented Nov 11, 2022

Test failures are because you also need to add the new functions to the #[pymodule] at the bottom of python_api.rs:

#[pymodule]
fn quickdna(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(_check_table, m)?)?;
    m.add_function(wrap_pyfunction!(_translate, m)?)?;
    m.add_function(wrap_pyfunction!(_reverse_complement, m)?)?;
    // here
    Ok(())
}

Comment thread src/nucleotide.rs
Comment on lines -26 to -28
pub const M_AMBIGUITY: [Self; 2] = [Self::A, Self::C];
pub const R_AMBIGUITY: [Self; 2] = Self::PURINES;
pub const W_AMBIGUITY: [Self; 2] = [Self::A, Self::T];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing these will break all upstream libraries that use these. It will be pretty annoying to bump the version of quickdna.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could keep them for backwards compatibility, but migrating should just be a matter of going from M_AMBIGUITY to NucleotideAmbiguous::M::possibilities()

Copy link
Copy Markdown
Contributor

@vgel vgel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, ready to merge IMO, waiting to approve until we resolve whether we want to keep the *_AMBIGUITY fields in for backwards compatibility.

Comment thread tests/test_wrapper.py Outdated
def test_translate():
assert DnaSequence("AAAGGGAAA").translate(table=1) == ProteinSequence("KGK")
assert DnaSequence("AAAGGGAAA").translate(
table=1) == ProteinSequence("KGK")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird formatting here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: also add python formatter and linter to this project. Given that there is some amount of python code. So far we only have Rust.

Copy link
Copy Markdown
Contributor Author

@lynn lynn Nov 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually the result of my Python formatter (autopep8 I think). I don't like it either though. I like black which is more like cargo fmt (i.e. quite strict/opinionated).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use black, maybe it outputs slightly different formats

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#23

@mkysel
Copy link
Copy Markdown
Contributor

mkysel commented Nov 14, 2022

you will have to rebase this to get #23. Otherwise CI wont let you merge.

@mkysel
Copy link
Copy Markdown
Contributor

mkysel commented Nov 14, 2022

can you fix the README? It currently says:

It doesn't support the (rarer) IUPAC ambiguity codes like B for non-A nucleotides, instead only supporting the general N ambiguity code.
If support for these codes is important to you, please make an issue! It may be possible to support them, it just isn't a priority right now.

@lynn lynn merged commit 5b09162 into main Nov 14, 2022
@vgel vgel deleted the ambiguity-types branch December 15, 2022 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Strict vs. ambiguous types for nucleotides

3 participants