Update FASTA parser to be configurable, generic, track line numbers, and other changes by vgel · Pull Request #9 · SecureDNA/quickdna

vgel · 2022-07-22T22:22:12Z

Apologies for the large diff! Much of it is comments and tests, though.

These changes make the FASTA parser more flexible, and will allow us to handle FASTA files the way we need for SecureDNA (SDNA employees, refer to the "Notes" section in [[Alpha API endpoint]] on the internal wiki).

Changes

Adds 2 new settings: allow_preceding_comment and concatenate_headers. These are described in the doc comments, but briefly:

allow_preceding_comment = true will ignore any content preceding the first header, like before. If false, it will emit that content as a headerless (header = "") record.
concatenate_headers = true will glue successive headers with no content together with newlines. If false, it will use the old behavior and emit content = "" records. See the doc comment for a doctest that explains this.

The return value of FastaParser is now a FastaRecord with header and contents fields, instead of a 2-tuple. It's also grown a line_range field which gives the line numbers of the record (including header, 1-indexed, [start, end)).

Removes DnaFastaParser and SimpleFastaParser in favor of a single FastaParser type which is generic over FromStr types. Since String has an (infalliable) FromStr impl, SimpleFastaParser is now FastaParser<String>, and DnaFastaParser is now FastaParser<DnaSequence>. We didn't have this type before, but a hypothetical ProteinFastaParser is now FastaParser<ProteinSequence>, and you could even have FastaParser<u64> if you wanted that.

FASTA parsing is now state-machine-driven, which should make a non-allocating API easier to add in the future if we want, and also simplified the parsing code since it was getting hairy handling the new special cases in the old loop. Luckily most of @mkysel 's parsing logic was relatively straightforward to port into the state machine without too many changes besides handling the new settings.

The old tests are all redone to use the new API (and to test all 4 settings combinations, where it makes sense), and a bunch of new tests are added.

Misc changes:

parse now explicitly asks for a BufRead instead of taking Read and wrapping it internally in BufReader. I checked the ecosystem using SourceGraph and that seemed to be the way most Rust stuff works, so this brings our API in line with that.

There's a new parse_str method that takes advantage of &[u8] having a BufRead impl. That's part of the reason for the other change, too, since before that would have needlessly allocated a BufReader buffer which would bother me :-)

also removes the distinction between a DnaFastaParser and a "regular" SimpleFastaParser -- everything is "FastaParser" now, with a phantom generic FromStr type that will be parsed (String has an infalliable FromStr impl) So SimpleFastaParser is now FastaParser<String>, and DnaFastaParser is now FastaParser<DnaSequence>. ProteinFastaParser didn't exist before, but it's now FastaParser<ProteinSequence>. A hypothetical integer fasta format could be FastaParser<u64>, if you wanted that for some weird reason. Fasta parsing is now state-machine-driven, which should make a non-allocating API easier to add in the future, and also simplified the parsing code since it was getting hairy handling the new special cases in the old loop.

Ravna

I like the tests, though probably the protein-specific tests should use things that can't possibly be nuceleotides, like maybe L, K, R, S, etc so we're sure they're not being parsed as DNA.

I didn't notice a test with nonunique header lines. Probably this can't cause a problem, but good to have.

Similarly use of Unicode in a header line, which should not fail. (I should probably specify "UTF-8" in the spec, actually; "Unicode" is too vague and allows atrocities like UTF-16 and byte-order marks and all kinds of other weird things Windows might do.)

Ditto showing that case is ignored in sequences.

Also, do we document (and test) that N or other ambiguity codes are/are not allowed? That would probably be a useful option to have available, so a parse error for something that must not have ambiguity happens as early as possible.

mkysel · 2022-07-25T13:06:59Z

-                .unwrap();
-        assert_eq!(r.len(), 0);
+    fn test_comment_fasta_with_no_allow_preceding_comment() {
+        assert_parse_with_allow_preceding_comment(


I don't think that this behavior is correct for the default case.

Could you elaborate? Looking at it, the setting says "don't allow ignoring comments, parse it as a headerless record" and that's what the code seems to be doing, but I might be missing something (or the name might be misleading?)

So, this test variant is testing a situation in which we do not allow preceding comments.

The existing code ignores anything that precedes the first header. In this case, everything.

I don't like the idea of assuming that anything without a header is some content.

If there is no >, it should not parse.

Ohh, I see what you mean -- only allow a comment if there's at least one record. Hmm.

I guess my feeling is, we allow parsing an empty file, right? And comment with no records = an empty file? Like, if you had some system that spit out a comment like:

// sequenced-on: 2022-07-30

on every record, and some sequencing result came back empty, then you'd want to parse that as an empty file. And if you don't want comments, you can use strict() mode instead. But I see what you mean that this could bite you if you expected comments on non-empty files, but accidentally omit a header and then lose a record on something that shouldn't be an empty file. Argh. FASTA is a sucky format :-(

(We could introduce an option like "comments must start with this prefix" to prevent this, but that's non-standard -- if hewing to the FASTA "standard" even makes sense! :-p )

So, this test variant is testing a situation in which we do not allow preceding comments.

The existing code ignores anything that precedes the first header. In this case, everything.

I don't like the idea of assuming that anything without a header is some content.

If there is no >, it should not parse.

Nope, disagree, for security reasons, which is why I specified this behavior in the API.

Here's the problem: We can't be sure that some provider, or some benchtop, won't accept a sequence without a header. And if anyone in the world does that, it provides the most trivial bypass possible for screening: just never include a header in your order, and we won't screen it, but the provider will make it. And if some provider insists on seeing some header but still considers stuff before the header as something to be manufactured, then a malicious customer submits "malicious stuff, header, benign stuff" and still escapes screening.

This way, we put the onus, both technical and legal, directly on the provider who is giving us the task of screening. If they consider stuff before a header to be a comment, then they are required to remove it before passing it to us. That guarantees that the behavior they think they want and the behavior they give to us will match. They are free to either ignore it (and remove it before handing it to us) or complain to the customer and make them remove it, but anything they pass to us will be screened, and we aren't going to try to heuristicate either whether pre-any-header stuff is a comment or is valid content: by treating it all as content, we force the provider or their UI to make that choice, which is a trivial amount of programming for them either way.

Here is my immediate thought. The provider still has to somehow match the result of the screening to the sequences in the order, right? For each sequence you should verify whether it is a hazard or not. Not having any reply for a sequence is clearly a problem that someone would notice, no?

I also just noticed some of the upthread commentary on this re comments.

The deal is that anything which is not a sequence is a comment in FASTA. That means any line which begins with > or ; (though ; is semi-obsolete). Anything else is parse error. By definition, at least as far as we're concerned, comments are *opaque blobs" with no semantic content whatsoever, because there is no standard about what goes in there. Particular consumers of such files, such as NCBI, may have standards for what they accept (since in NCBI's case they're using them to create database keys for submissions), but in general there's just no way at all to predict what's going on and no enforcement even if there was a standard, which there isn't. All we're doing with preserving these > or ; lines is making it slightly easier to make a UI that tells the customer where in the file we're objecting to something. That's it. So we really don't care about comments at all. And the only fly in that ointment is the practice in some FASTA files of assuming that anything before the first > is also ignored. We're enabling that with a switch, but we ourselves never intend to enable that for screening---you have something that could plausibly be a sequence before the first > and you don't want it synthesized or screened, then you put a ; or a > in front of each line of it. Or if you're a provider and you want to allow that, you take it out before handing it to us. Then it's clear who's responsible if you, as a provider, deleted that stuff before you gave it to us but then went ahead and made it anyway.

(Also the way to deal with "you accidentally created a null record and didn't notice it" is not to overload the commenting/screening system into pointing out your error. Either you as a customer need to run QA first on what you're ordering, or the provider may have its own rules/helpful hints and tells you, but that's not our job. And it can't be, because valid FASTA files exist in the wild with multiple consecutive header lines and we can't tell that from an error.)

(We could introduce an option like "comments must start with this prefix" to prevent this, but that's non-standard -- if hewing to the FASTA "standard" even makes sense! :-p )

No, let's not; nobody will do this. :) What they do instead is to just pack all kinds of trash into the > line, including random lab notes like "sequencing added ACCTTGGACA to the previous record" and all sorts of other random asides. If they want something more sophisticated, they should be using tools that maintain their sequences in a DB (for example) and then spit out valid FASTA files with the extra info suppressed, especially if that info is just going into the maw of a synth, which is certainly not reading those comments anyway. Trying to put all this metadata into the FASTA file itself is like (true story!) what I saw once where the exec dir of a ~$1M/yr nonprofit thought the right way to issue PO's was to just create them directly as PDF's from a template and was amazed to hear that actual software existed in the world which tracked all that stuff in a DB and could then issue PO's & invoices & so forth downstream on demand.

mkysel · 2022-07-25T13:20:38Z

There's a new parse_str method that takes advantage of &[u8] having a BufRead impl. That's part of the reason for the other change, too, since before that would have needlessly allocated a BufReader buffer which would bother me :-)

I suspect @hwchen will have opinions here :)

hwchen · 2022-07-25T15:54:10Z

re bufreader:

if it's not in the hotloop, it probably doesn't matter if there's an extra allocation for &[u8] (looks like csv takes this approach), but I understand the principle of it :)
if it's in the hotloop, perhaps it's better to just read the whole file instead of using bufreader? I don't know how large the FASTAs we expect to parse are, though, and perhaps this doesn't really make a difference in perf. edit: I guess this is the approach that parse_str allows

Those are my opinions :) but I don't have a problem with this approach. There's definitely less hidden allocation and I appreciate that.

Ravna · 2022-07-25T17:35:55Z

My one concern with reading the whole file at once is what happens if someone tries to shove a gigabyte down our throats. OTOH, if something else will break upstream, maybe that's less of a concern. We probably want an optional parameter somewhere (the first place we might otherwise allocate a user-controlled and indefinitely-large object) which is a maximum-size limit, so we/providers can supply that with something reasonable. I'd guess quickdna is probably actually not a bad place to put that, because it's the first stop for untrusted data. And then we set that to something reasonable at our end and give providers the ability to override it if for some reason they want to accept a Gn input or something and know they can handle it.. [I'm less concerned about giant allocs on a benchtop because you're only hurting yourself, but we still probably want to have the BT just reject gracefully and not have some C/Java-side thing inside it blow up instead, and their developers may want this limit anyway.]

mkysel · 2022-07-25T18:10:27Z

The rocket server already rejects too large requests. But it only applies to users that use the REST API endpoint, rather than using it as a library.

Ravna · 2022-07-25T18:16:00Z

We won't be using Rocket in production in general---we could be getting some giant thing uploaded by a customer to a provider's web page or non-web API and then handing it to us as a library. And we probably want this available for, e.g., a benchtop into which someone plugs a flash drive. :) In general we should be as as paranoid as possible about user-supplied inputs, and this is an easy one for users to supply. (We also need to be paranoid about network inputs, but that's a different issue.)

vgel · 2022-07-26T18:33:26Z

@hwchen re: BufReader, my feeling is both a) I prefer to not silently allocate, but more b) depending on system you may want to tune your bufreader size to optimal for your disk / ram / etc., and by hiding our bufreader inside the other one we make it harder for people to do that without stacking bufreaders and wasting memory (which may be important on a benchtop that only has <4kb of RAM...). Glad you agree :-)

@Ravna re: DOS with large file, I agree but don't want to address that yet in this PR. I think the Vec-interface is working OK for alpha, and the state-machine-based parser will make implementing a streaming parser a lot easier in the future, and this PR is a bit big already...

Ravna · 2022-07-26T19:07:48Z

Sure, re DoS, leave it the way it is, but can you make an issue so we don't forget? (I may already have this in the testing zoo but should put it there if I don't. Reading FASTA input is pretty much the only input we'll take where we don't know up front how big it's supposed to be anyway; I'm planning for the network protocol to give an up-front count anyway for everything where we control both ends.)

benwr

Looks good overall; thanks for pretty thorough test suite!

benwr · 2022-07-26T19:29:15Z

 }

 #[cfg(test)]
 mod tests {


oh man this is just screaming for quickcheck tests

Hmm, maybe? I don't see how this could fail / what quickcheck would uncover (I guess that's the point of property testing :-) ). I actually realized in the latest commit that the trim_end_matches here is pointless, so the line is just Some(&line[1..]) now, which I really really don't think can fail :-)

Oh, I didn't mean this specific snippet, just this module in general.

benwr · 2022-07-26T19:47:20Z

-                        result.push((existing_title, lines.join("")));
-                        lines.clear();
-                    }
+enum ParserState {


doing this as an FST seems pretty reasonable, but it is a lot of code to check. I wonder if using parser combinators might have been easier.

Hmm, interesting point. I leaning towards the state machine approach since I had already written a similar one (that I need to rip out and replace at some point) in our pipeline code, and I wasn't sure the best parser combinator libraries for Rust -- I'd played with nom and had mixed experiences with it.

benwr · 2022-07-26T19:53:40Z

+enum ParserState {
+    StartOfFile {
+        /// only non-empty if settings.allow_preceding_comment is false
+        contents: String,


my string concatenation speed alarms went off here, but actually I think this is basically fine. Might be slightly faster to accumulate, like, a Vec<&str> or something and then join them at the end, but 🤷

Yeah I thought about that as well, but I think the quadratic behavior won't manifest here. Maybe I should add 10,000 line comment, concatenated header, and sequence body checks to be safe :-)

Also I don't think we can hold a Vec<&str> because lifetimes™, we get chunks of data fed into us from a BufRead instead of a single string we can hold onto and reference chunks out of :-(

Yeah, you're right :/

(I agree re not quadratic, since Strings in Rust are mutable)

Added some tests for quadratic-ness in most recent commit, using a panic_after helper so they'll panic if the runtime ever grows significantly.

vgel requested review from benwr and mkysel July 22, 2022 22:22

Ravna reviewed Jul 22, 2022

View reviewed changes

mkysel reviewed Jul 25, 2022

View reviewed changes

vgel mentioned this pull request Jul 26, 2022

Streaming FASTA parser #10

Open

benwr approved these changes Jul 26, 2022

View reviewed changes

More tests and remove pointless line-stripping

b1485a1

vgel merged commit 1322910 into main Jul 26, 2022

vgel deleted the configurable-generic-fasta-parser branch July 26, 2022 21:40

Conversation

vgel commented Jul 22, 2022

Changes

Misc changes:

Uh oh!

Ravna left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkysel Jul 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mkysel commented Jul 25, 2022

Uh oh!

hwchen commented Jul 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ravna commented Jul 25, 2022

Uh oh!

mkysel commented Jul 25, 2022

Uh oh!

Ravna commented Jul 25, 2022

Uh oh!

vgel commented Jul 26, 2022

Uh oh!

Ravna commented Jul 26, 2022

Uh oh!

benwr left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vgel Jul 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mkysel Jul 26, 2022 •

edited

Loading

hwchen commented Jul 25, 2022 •

edited

Loading

vgel Jul 26, 2022 •

edited

Loading