Handle all ambiguity codes + add "strict" types by lynn · Pull Request #22 · SecureDNA/quickdna

lynn · 2022-11-11T00:26:27Z

Closes #18.

Changes

Types

Nucleotide no longer has N in it, and just represents one of ACTG.
NucleotideAmbiguous represents ACTG or one of the 11 ambiguity codes WMRYSKBVDHN.
There is a trait NucleotideLike for common behavior between the two, like "to ASCII" and "complement".
Similarly, Codon (3x Nucleotide) and CodonAmbiguous (3x NucleotideAmbiguous) are different types now.
The ambiguous types have possibilities() methods for iterating over the possible realizations.
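
For readers who have not seen the diff, here is a minimal sketch of how the two nucleotide types could relate, assuming a 4-bit mask with one bit per base. It is illustrative only and not the code in this PR: the bit layout, the struct-around-u8 representation and the from_ascii helper are assumptions. Only the type names and possibilities() come from the description above.

```rust
/// One of the four concrete bases. Each variant is a single bit so that
/// ambiguity codes can be expressed as unions of bits (assumed layout).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Nucleotide {
    A = 0b0001,
    C = 0b0010,
    G = 0b0100,
    T = 0b1000,
}

/// A concrete base or an IUPAC ambiguity code, stored as a 4-bit mask.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NucleotideAmbiguous(u8);

impl NucleotideAmbiguous {
    /// Hypothetical helper: parses one ASCII letter, case-insensitively.
    pub fn from_ascii(c: u8) -> Option<Self> {
        let bits = match c.to_ascii_uppercase() {
            b'A' => 0b0001,
            b'C' => 0b0010,
            b'G' => 0b0100,
            b'T' => 0b1000,
            b'W' => 0b1001, // A or T
            b'M' => 0b0011, // A or C
            b'R' => 0b0101, // A or G
            b'Y' => 0b1010, // C or T
            b'S' => 0b0110, // C or G
            b'K' => 0b1100, // G or T
            b'B' => 0b1110, // anything but A
            b'V' => 0b0111, // anything but T
            b'D' => 0b1101, // anything but C
            b'H' => 0b1011, // anything but G
            b'N' => 0b1111, // any base
            _ => return None,
        };
        Some(NucleotideAmbiguous(bits))
    }

    /// Iterates over the concrete bases this code can stand for, in ACGT order.
    pub fn possibilities(self) -> impl Iterator<Item = Nucleotide> {
        let all = [Nucleotide::A, Nucleotide::C, Nucleotide::G, Nucleotide::T];
        IntoIterator::into_iter(all).filter(move |n| self.0 & (*n as u8) != 0)
    }
}
```

With this layout, R yields A and G, and N yields all four bases; a letter outside the 15 codes is rejected.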

Translation

Ambiguity codes are properly handled instead of mapping them all to N.
DnaSequence is generic over the type of the contained nucleotides. So, DnaSequence::<Nucleotide>::from_str(s) is our "strict mode", and DnaSequence::<NucleotideAmbiguous>::from_str(s) is the lax mode.

Tests

I added a test test_dna_parses_strict, which verifies that this "strict mode" indeed only accepts "aAtTcCgG \t".
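
To make the strict/lax distinction concrete, here is a rough sketch of a sequence type that is generic over its nucleotide type. It is a simplified stand-in rather than the real implementation: the parse function (the PR uses from_str), the from_ascii trait method, the char error type and the byte-wrapping NucleotideAmbiguous are all assumptions made for the example. Skipping spaces and tabs follows the "aAtTcCgG \t" set mentioned above.

```rust
/// Common behavior shared by strict and ambiguous nucleotides (assumed shape).
pub trait NucleotideLike: Sized {
    fn from_ascii(c: u8) -> Option<Self>;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Nucleotide {
    A,
    C,
    G,
    T,
}

/// Stores the upper-case IUPAC letter; a simplification for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NucleotideAmbiguous(pub u8);

impl NucleotideLike for Nucleotide {
    fn from_ascii(c: u8) -> Option<Self> {
        match c.to_ascii_uppercase() {
            b'A' => Some(Nucleotide::A),
            b'C' => Some(Nucleotide::C),
            b'G' => Some(Nucleotide::G),
            b'T' => Some(Nucleotide::T),
            _ => None,
        }
    }
}

impl NucleotideLike for NucleotideAmbiguous {
    fn from_ascii(c: u8) -> Option<Self> {
        let up = c.to_ascii_uppercase();
        if b"ACGTWMRYSKBVDHN".contains(&up) {
            Some(NucleotideAmbiguous(up))
        } else {
            None
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct DnaSequence<T> {
    pub nucleotides: Vec<T>,
}

impl<T: NucleotideLike> DnaSequence<T> {
    /// Parses `s`, skipping spaces and tabs. On failure, returns the first
    /// character that the chosen nucleotide type does not accept.
    pub fn parse(s: &str) -> Result<Self, char> {
        let mut nucleotides = Vec::new();
        for ch in s.chars() {
            if ch == ' ' || ch == '\t' {
                continue;
            }
            if !ch.is_ascii() {
                return Err(ch);
            }
            match T::from_ascii(ch as u8) {
                Some(n) => nucleotides.push(n),
                None => return Err(ch),
            }
        }
        Ok(DnaSequence { nucleotides })
    }
}
```

In this sketch, the same input "ACGN" fails under DnaSequence::<Nucleotide> and parses under DnaSequence::<NucleotideAmbiguous>, so the choice of type parameter is the only switch between the two modes.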

I added a test test_translate_ambiguous, which verifies that TTRTTV maps to protein LX:

      // R means "A or G" and both {TTA,TTG} map to L (Leucine).
      // Thus, "TTR" should map to L.
      //
      // But V means "A or G or C", and TTC maps to F (Phenylalanine).
      // Thus, "TTV" is truly ambiguous and maps to X.
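
The rule in that comment can be sketched as follows: enumerate every concrete codon an ambiguous codon could stand for, and emit X unless they all translate to the same amino acid. This is a self-contained illustration, not the PR's code. The function names are hypothetical, it hard-codes the standard genetic code (NCBI table 1) in TCAG order, and it ignores the amino-acid ambiguity codes added later in this PR.

```rust
/// Standard genetic code (NCBI translation table 1), indexed by
/// 16*first + 4*second + third with bases ordered T=0, C=1, A=2, G=3.
const TABLE: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Concrete bases matched by an IUPAC code, as indices in TCAG order.
fn possibilities(code: u8) -> Option<&'static [usize]> {
    let bases: &'static [usize] = match code.to_ascii_uppercase() {
        b'T' => &[0],
        b'C' => &[1],
        b'A' => &[2],
        b'G' => &[3],
        b'W' => &[2, 0],
        b'M' => &[2, 1],
        b'R' => &[2, 3],
        b'Y' => &[1, 0],
        b'S' => &[1, 3],
        b'K' => &[3, 0],
        b'B' => &[1, 3, 0],
        b'V' => &[2, 1, 3],
        b'D' => &[2, 3, 0],
        b'H' => &[2, 1, 0],
        b'N' => &[0, 1, 2, 3],
        _ => return None,
    };
    Some(bases)
}

/// Translates one possibly ambiguous codon. Returns the shared amino acid if
/// every realization agrees, `X` otherwise, and `None` for an invalid letter.
fn translate_codon(codon: [u8; 3]) -> Option<u8> {
    let p0 = possibilities(codon[0])?;
    let p1 = possibilities(codon[1])?;
    let p2 = possibilities(codon[2])?;
    let mut result: Option<u8> = None;
    for &i in p0 {
        for &j in p1 {
            for &k in p2 {
                let aa = TABLE[16 * i + 4 * j + k];
                match result {
                    None => result = Some(aa),
                    Some(prev) if prev != aa => return Some(b'X'),
                    _ => {}
                }
            }
        }
    }
    result
}

/// Translates a whole sequence codon by codon; a trailing partial codon is
/// dropped in this sketch.
fn translate(dna: &str) -> Option<String> {
    dna.as_bytes()
        .chunks(3)
        .filter(|c| c.len() == 3)
        .map(|c| translate_codon([c[0], c[1], c[2]]).map(char::from))
        .collect()
}
```

Returning early on the first disagreement keeps the worst case (NNN, 64 realizations) cheap for most ambiguous codons.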

vgel

Looks generally good, couple questions + we can absolve my prior unsafe crimes. You were right about the bitfields, worked out great :-)

vgel · 2022-11-11T01:32:01Z

+    fn dna(dna: &str) -> DnaSequence<NucleotideAmbiguous> {
        DnaSequence::from_str(dna).unwrap()
    }


I hate to say it since it'll be a pain, but I think the tests that use dna should be migrated to test on both Nucleotide and NucleotideAmbiguous to ensure the behavior is consistent.

Also not sure if you've used quickcheck, but if you have or are interested in trying it, I think a property test would be useful here: it would check that, in the general case, DnaSequence<Nucleotide> and DnaSequence<NucleotideAmbiguous> behave the same whenever every NucleotideAmbiguous is A, T, C, or G.

If you want to write it and haven't used quickcheck before, LMK and I can point you at our property tests in the private repos and give some pointers.
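
As a rough idea of the property, here is a dependency-free stand-in that checks it exhaustively over short sequences instead of using quickcheck's random generation. Everything in it is hypothetical: the two complement functions play the role of the strict and ambiguous code paths, and reverse complement is used as the operation under test.

```rust
/// Stand-in for the strict code path: only A, C, G, T are accepted.
fn complement_strict(b: u8) -> Option<u8> {
    match b {
        b'A' => Some(b'T'),
        b'T' => Some(b'A'),
        b'C' => Some(b'G'),
        b'G' => Some(b'C'),
        _ => None,
    }
}

/// Stand-in for the ambiguous code path: all 15 IUPAC codes are accepted.
fn complement_ambiguous(b: u8) -> Option<u8> {
    Some(match b {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        b'W' => b'W',
        b'S' => b'S',
        b'N' => b'N',
        b'M' => b'K',
        b'K' => b'M',
        b'R' => b'Y',
        b'Y' => b'R',
        b'B' => b'V',
        b'V' => b'B',
        b'D' => b'H',
        b'H' => b'D',
        _ => return None,
    })
}

/// Reverse complement using the given per-base complement function.
fn reverse_complement(seq: &[u8], comp: fn(u8) -> Option<u8>) -> Option<Vec<u8>> {
    seq.iter().rev().map(|&b| comp(b)).collect()
}

/// Enumerates every sequence over ACGT of exactly `len` bases.
fn all_sequences(len: usize) -> Vec<Vec<u8>> {
    let mut out = vec![Vec::new()];
    for _ in 0..len {
        out = out
            .iter()
            .flat_map(|prefix| {
                b"ACGT".iter().map(move |&b| {
                    let mut s = prefix.clone();
                    s.push(b);
                    s
                })
            })
            .collect();
    }
    out
}

/// The property: on unambiguous input, both code paths give the same result.
fn strict_and_ambiguous_agree(max_len: usize) -> bool {
    (0..=max_len).all(|len| {
        all_sequences(len).iter().all(|s| {
            reverse_complement(s, complement_strict)
                == reverse_complement(s, complement_ambiguous)
        })
    })
}
```

A real quickcheck property would generate random ACGT strings rather than enumerating them, which scales to much longer sequences than this exhaustive version.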

vgel · 2022-11-11T01:59:23Z

Test failures are because you also need to add the new functions to the #[pymodule] at the bottom of python_api.rs:

#[pymodule]
fn quickdna(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(_check_table, m)?)?;
    m.add_function(wrap_pyfunction!(_translate, m)?)?;
    m.add_function(wrap_pyfunction!(_reverse_complement, m)?)?;
    // here
    Ok(())
}

mkysel · 2022-11-11T15:06:35Z

-    pub const M_AMBIGUITY: [Self; 2] = [Self::A, Self::C];
-    pub const R_AMBIGUITY: [Self; 2] = Self::PURINES;
-    pub const W_AMBIGUITY: [Self; 2] = [Self::A, Self::T];


Removing these will break all downstream libraries that use them. It will be pretty annoying to bump the version of quickdna.

We could keep them for backwards compatibility, but migrating should just be a matter of going from M_AMBIGUITY to NucleotideAmbiguous::M::possibilities()

vgel

Looks great, ready to merge IMO, waiting to approve until we resolve whether we want to keep the *_AMBIGUITY fields in for backwards compatibility.

vgel · 2022-11-11T21:35:28Z

 def test_translate():
-    assert DnaSequence("AAAGGGAAA").translate(table=1) == ProteinSequence("KGK")
+    assert DnaSequence("AAAGGGAAA").translate(
+        table=1) == ProteinSequence("KGK")


Weird formatting here?

Note to self: also add a Python formatter and linter to this project, given that there is some amount of Python code. So far we only have them for Rust.

This is actually the result of my Python formatter (autopep8, I think). I don't like it either, though. I like black, which is more like cargo fmt (i.e. quite strict/opinionated).

We use black; maybe its output is formatted slightly differently.

mkysel · 2022-11-14T17:08:23Z

You will have to rebase this to get #23. Otherwise CI won't let you merge.

mkysel · 2022-11-14T19:30:56Z

can you fix the README? It currently says:

It doesn't support the (rarer) IUPAC ambiguity codes like B for non-A nucleotides, instead only supporting the general N ambiguity code.
If support for these codes is important to you, please make an issue! It may be possible to support them, it just isn't a priority right now.

lynn requested a review from vgel November 11, 2022 00:26

lynn changed the title ~~Introduce NucleotideAmbiguous and CodonAmbiguous types~~ Handle all ambiguity codes + add "strict" types Nov 11, 2022

lynn commented Nov 11, 2022

Comment thread src/nucleotide.rs

vgel reviewed Nov 11, 2022

Comment thread quickdna/__init__.py Outdated

mkysel reviewed Nov 11, 2022

vgel reviewed Nov 11, 2022

lynn added 18 commits November 14, 2022 23:27

Introduce NucleotideAmbiguous and CodonAmbiguous types

6eeeca7

Fix a typo, clarify test comment

cff7f09

clippy

6257475

Add strict Python API

9929d39

Actually add the Python API to Python

2827a9c

Expose strict functions to Python

eaa4463

Update .gitignore

86b5ac4

Get rid of unsafe

65a2e55

Undo reverse_complement_bytes perf regression

40a117a

Always inline TryFrom<NucleotideAmbiguous> for Nucleotide

426987e

Allow converting NucleotideLike into char

c522c8e

Add FastaParser tests

deda41a

Add dna_strict versions of various dna() tests

7e01375

impl From<Nucleotide> for NucleotideAmbiguous

b8dafd4

cargo fmt

bf79a80

Add remaining ambiguity codes for amino acids

7d39030

Add Python tests for strictness errors and reverse complement

a1c0df2

Specify which Python error is raised in strict mode

2421562

lynn force-pushed the ambiguity-types branch from ace44d1 to 2421562 Compare November 14, 2022 22:27

Run poetry run black .

13c660b

lynn added 3 commits November 14, 2022 23:30

Restore Nucleotide::X_AMBIGUITY constants

5d623bf

Update README

523750d

Add a pre-1.0 warning to README

9722b1f

mkysel approved these changes Nov 14, 2022

lynn merged commit 5b09162 into main Nov 14, 2022

lynn mentioned this pull request Nov 23, 2022

Allow non-owned Sequence types #19

Open

vgel deleted the ambiguity-types branch December 15, 2022 20:11
