Conversation
This also removes the distinction between a DnaFastaParser and a "regular" SimpleFastaParser -- everything is "FastaParser" now, with a phantom generic FromStr type that the record contents will be parsed into (String has an infallible FromStr impl). So SimpleFastaParser is now FastaParser&lt;String&gt;, and DnaFastaParser is now FastaParser&lt;DnaSequence&gt;. ProteinFastaParser didn't exist before, but it's now FastaParser&lt;ProteinSequence&gt;. A hypothetical integer FASTA format could be FastaParser&lt;u64&gt;, if you wanted that for some weird reason. FASTA parsing is now state-machine-driven, which should make a non-allocating API easier to add in the future, and it also simplified the parsing code, which was getting hairy handling the new special cases in the old loop.
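As a sketch of the phantom-generic idea (illustrative only -- the real FastaParser also carries parser settings and the state machine), the genericity boils down to deferring to the content type's FromStr impl:

```rust
use std::marker::PhantomData;
use std::str::FromStr;

// Illustrative sketch, not the crate's actual code: a parser generic over
// the record's content type.
struct FastaParser<T: FromStr> {
    _content_type: PhantomData<T>,
}

impl<T: FromStr> FastaParser<T> {
    // Each record body is handed off to the target type's FromStr impl.
    fn parse_contents(raw: &str) -> Result<T, T::Err> {
        raw.parse::<T>()
    }
}

fn main() {
    // String's FromStr impl is infallible, so this is the "simple" parser.
    let simple = FastaParser::<String>::parse_contents("ACGT").unwrap();
    assert_eq!(simple, "ACGT");

    // The hypothetical integer FASTA format mentioned above.
    let n = FastaParser::<u64>::parse_contents("42").unwrap();
    assert_eq!(n, 42);
}
```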
Ravna
left a comment
I like the tests, though the protein-specific tests should probably use residues that can't possibly be nucleotides, like L, K, R, S, etc., so we're sure they're not being parsed as DNA.
I didn't notice a test with non-unique header lines. This probably can't cause a problem, but it would be good to have.
Similarly use of Unicode in a header line, which should not fail. (I should probably specify "UTF-8" in the spec, actually; "Unicode" is too vague and allows atrocities like UTF-16 and byte-order marks and all kinds of other weird things Windows might do.)
Ditto showing that case is ignored in sequences.
Also, do we document (and test) that N or other ambiguity codes are/are not allowed? That would probably be a useful option to have available, so a parse error for something that must not have ambiguity happens as early as possible.
```rust
    .unwrap();
assert_eq!(r.len(), 0);

fn test_comment_fasta_with_no_allow_preceding_comment() {
    assert_parse_with_allow_preceding_comment(
```
I don't think that this behavior is correct for the default case.
Could you elaborate? Looking at it, the setting says "don't allow ignoring comments, parse it as a headerless record" and that's what the code seems to be doing, but I might be missing something (or the name might be misleading?)
So, this test variant is testing a situation in which we do not allow preceding comments.
The existing code ignores anything that precedes the first header. In this case, everything.
I don't like the idea of assuming that anything without a header is some content.
If there is no >, it should not parse.
Ohh, I see what you mean -- only allow a comment if there's at least one record. Hmm.
I guess my feeling is, we allow parsing an empty file, right? And comment with no records = an empty file? Like, if you had some system that spit out a comment like:
// sequenced-on: 2022-07-30
on every record, and some sequencing result came back empty, then you'd want to parse that as an empty file. And if you don't want comments, you can use strict() mode instead. But I see what you mean that this could bite you if you expected comments on non-empty files but accidentally omitted a header, and then lost a record on something that shouldn't be an empty file. Argh. FASTA is a sucky format :-(
(We could introduce an option like "comments must start with this prefix" to prevent this, but that's non-standard -- if hewing to the FASTA "standard" even makes sense! :-p )
> So, this test variant is testing a situation in which we do not allow preceding comments.
> The existing code ignores anything that precedes the first header. In this case, everything.
> I don't like the idea of assuming that anything without a header is some content.
> If there is no >, it should not parse.
Nope, disagree, for security reasons, which is why I specified this behavior in the API.
Here's the problem: We can't be sure that some provider, or some benchtop, won't accept a sequence without a header. And if anyone in the world does that, it provides the most trivial bypass possible for screening: just never include a header in your order, and we won't screen it, but the provider will make it. And if some provider insists on seeing some header but still considers stuff before the header as something to be manufactured, then a malicious customer submits "malicious stuff, header, benign stuff" and still escapes screening.
This way, we put the onus, both technical and legal, directly on the provider who is giving us the task of screening. If they consider stuff before a header to be a comment, then they are required to remove it before passing it to us. That guarantees that the behavior they think they want and the behavior they give to us will match. They are free to either ignore it (and remove it before handing it to us) or complain to the customer and make them remove it, but anything they pass to us will be screened, and we aren't going to try to heuristicate either whether pre-any-header stuff is a comment or is valid content: by treating it all as content, we force the provider or their UI to make that choice, which is a trivial amount of programming for them either way.
Here is my immediate thought. The provider still has to somehow match the result of the screening to the sequences in the order, right? For each sequence you should verify whether it is a hazard or not. Not having any reply for a sequence is clearly a problem that someone would notice, no?
I also just noticed some of the upthread commentary on this re comments.
The deal is that anything which is not a sequence is a comment in FASTA. That means any line which begins with > or ; (though ; is semi-obsolete). Anything else is a parse error. By definition, at least as far as we're concerned, comments are *opaque blobs* with no semantic content whatsoever, because there is no standard about what goes in there. Particular consumers of such files, such as NCBI, may have standards for what they accept (since in NCBI's case they're using them to create database keys for submissions), but in general there's just no way at all to predict what's going on, and no enforcement even if there were a standard, which there isn't.

All we're doing by preserving these > or ; lines is making it slightly easier to build a UI that tells the customer where in the file we're objecting to something. That's it. So we really don't care about comments at all. The only fly in that ointment is the practice in some FASTA files of assuming that anything before the first > is also ignored. We're enabling that with a switch, but we ourselves never intend to enable it for screening---if you have something that could plausibly be a sequence before the first > and you don't want it synthesized or screened, then you put a ; or a > in front of each line of it. Or if you're a provider and you want to allow that, you take it out before handing it to us. Then it's clear who's responsible if you, as a provider, deleted that stuff before you gave it to us but then went ahead and made it anyway.
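The line taxonomy described here is tiny; a hedged sketch (not the crate's actual code, and ignoring the before-first-header wrinkle) might look like:

```rust
// Illustrative sketch of the FASTA line taxonomy: '>' starts a header,
// ';' starts a (semi-obsolete) comment, and everything else is treated
// as sequence content (or rejected later as a parse error).
#[derive(Debug, PartialEq)]
enum FastaLine<'a> {
    Header(&'a str),
    Comment(&'a str),
    Sequence(&'a str),
}

fn classify(line: &str) -> FastaLine<'_> {
    if let Some(rest) = line.strip_prefix('>') {
        FastaLine::Header(rest)
    } else if let Some(rest) = line.strip_prefix(';') {
        FastaLine::Comment(rest)
    } else {
        FastaLine::Sequence(line)
    }
}

fn main() {
    assert_eq!(classify(">seq1 free-form notes"), FastaLine::Header("seq1 free-form notes"));
    assert_eq!(classify("; an old-style comment"), FastaLine::Comment(" an old-style comment"));
    assert_eq!(classify("ACGT"), FastaLine::Sequence("ACGT"));
}
```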
(Also, the way to deal with "you accidentally created a null record and didn't notice it" is not to overload the commenting/screening system into pointing out your error. Either you as a customer need to run QA first on what you're ordering, or the provider may have its own rules/helpful hints and tell you, but that's not our job. And it can't be, because valid FASTA files exist in the wild with multiple consecutive header lines, and we can't tell that apart from an error.)
> (We could introduce an option like "comments must start with this prefix" to prevent this, but that's non-standard -- if hewing to the FASTA "standard" even makes sense! :-p )
No, let's not; nobody will do this. :) What they do instead is to just pack all kinds of trash into the > line, including random lab notes like "sequencing added ACCTTGGACA to the previous record" and all sorts of other random asides. If they want something more sophisticated, they should be using tools that maintain their sequences in a DB (for example) and then spit out valid FASTA files with the extra info suppressed, especially if that info is just going into the maw of a synth, which is certainly not reading those comments anyway. Trying to put all this metadata into the FASTA file itself is like (true story!) what I saw once where the exec dir of a ~$1M/yr nonprofit thought the right way to issue PO's was to just create them directly as PDF's from a template and was amazed to hear that actual software existed in the world which tracked all that stuff in a DB and could then issue PO's & invoices & so forth downstream on demand.
I suspect @hwchen will have opinions here :)
re bufreader:
Those are my opinions :) but I don't have a problem with this approach. There's definitely less hidden allocation, and I appreciate that.
My one concern with reading the whole file at once is what happens if someone tries to shove a gigabyte down our throats. OTOH, if something else will break upstream, maybe that's less of a concern. We probably want an optional parameter somewhere (the first place we might otherwise allocate a user-controlled and indefinitely-large object) which is a maximum-size limit, so we/providers can supply that with something reasonable. I'd guess quickdna is probably not a bad place to put that, because it's the first stop for untrusted data. And then we set that to something reasonable at our end and give providers the ability to override it if for some reason they want to accept a Gn input and know they can handle it. [I'm less concerned about giant allocs on a benchtop because you're only hurting yourself, but we still probably want the BT to just reject gracefully, not have some C/Java-side thing inside it blow up instead, and their developers may want this limit anyway.]
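A minimal sketch of such a size guard, assuming a hypothetical `max_bytes` parameter (quickdna has no such option yet), using the standard `Read::take` adaptor:

```rust
use std::io::{Error, ErrorKind, Read};

// Hypothetical size guard (max_bytes is an assumed parameter, not an
// existing quickdna option). Read::take caps how many bytes we'll pull
// from an untrusted source, so a gigabyte upload can't force a gigabyte
// allocation.
fn read_limited<R: Read>(source: R, max_bytes: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Read one byte past the limit so we can distinguish "exactly at the
    // limit" from "too big".
    source.take(max_bytes + 1).read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "input exceeds maximum allowed size",
        ));
    }
    Ok(buf)
}

fn main() {
    let small: &[u8] = b">seq1\nACGT\n";
    assert!(read_limited(small, 1024).is_ok());

    let big = vec![b'A'; 2048];
    assert!(read_limited(big.as_slice(), 1024).is_err());
}
```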
The Rocket server already rejects requests that are too large. But that only applies to users of the REST API endpoint, not to users of the library.
We won't be using Rocket in production in general---we could be getting some giant thing uploaded by a customer to a provider's web page or non-web API and then handed to us as a library. And we probably want this available for, e.g., a benchtop into which someone plugs a flash drive. :) In general we should be as paranoid as possible about user-supplied inputs, and this is an easy one for users to supply. (We also need to be paranoid about network inputs, but that's a different issue.)
@hwchen re: BufReader, my feeling is both a) I prefer not to silently allocate, but more b) depending on the system you may want to tune your BufReader size to be optimal for your disk / RAM / etc., and by hiding our BufReader inside the other one we make it harder for people to do that without stacking BufReaders and wasting memory (which may be important on a benchtop that only has <4kb of RAM...). Glad you agree :-) @Ravna re: DoS with a large file, I agree, but I don't want to address that yet in this PR. I think the Vec interface is working OK for alpha, the state-machine-based parser will make implementing a streaming parser a lot easier in the future, and this PR is a bit big already...
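For illustration, taking BufRead at the API boundary lets the caller pick the buffer size via the standard `BufReader::with_capacity` (a sketch using an in-memory `Cursor` in place of a file):

```rust
use std::io::{BufRead, BufReader, Cursor};

fn main() {
    // Stand-in for a file or socket.
    let data = Cursor::new(b">seq1\nACGT\n".to_vec());

    // Caller-chosen buffer size (the default is 8 KiB). If the parser hid
    // its own BufReader inside, a caller tuning for their disk/RAM would
    // end up stacking a second buffer on top and paying for both.
    let mut reader = BufReader::with_capacity(64 * 1024, data);

    let mut line = String::new();
    reader.read_line(&mut line).unwrap();
    assert_eq!(line, ">seq1\n");
}
```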
Sure, re DoS, leave it the way it is, but can you make an issue so we don't forget? (I may already have this in the testing zoo, but I should put it there if I don't. Reading FASTA input is pretty much the only input we'll take where we don't know up front how big it's supposed to be; I'm planning for the network protocol to give an up-front count for everything where we control both ends.)
benwr
left a comment
Looks good overall; thanks for the pretty thorough test suite!
```rust
}

#[cfg(test)]
mod tests {
```
oh man this is just screaming for quickcheck tests
Hmm, maybe? I don't see how this could fail / what quickcheck would uncover (I guess that's the point of property testing :-) ). I actually realized in the latest commit that the trim_end_matches here is pointless, so the line is just Some(&line[1..]) now, which I really really don't think can fail :-)
Oh, I didn't mean this specific snippet, just this module in general.
```rust
result.push((existing_title, lines.join("")));
lines.clear();
}

enum ParserState {
```
doing this as an FST seems pretty reasonable, but it is a lot of code to check. I wonder if using parser combinators might have been easier.
Hmm, interesting point. I was leaning towards the state machine approach since I had already written a similar one (that I need to rip out and replace at some point) in our pipeline code, and I wasn't sure which parser combinator libraries for Rust were best -- I'd played with nom and had mixed experiences with it.
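For reference, a toy version of the state-machine approach (illustrative names; the PR's real ParserState has more states and settings) can be quite compact:

```rust
// Toy sketch of a state-machine FASTA parser. The state owns any
// partially-built record; each input line drives one transition.
enum ParserState {
    StartOfFile,
    InRecord { header: String, body: String },
}

fn parse(input: &str) -> Vec<(String, String)> {
    let mut records = Vec::new();
    let mut state = ParserState::StartOfFile;
    for line in input.lines() {
        state = match (state, line.strip_prefix('>')) {
            // A header line finishes the current record and starts a new one.
            (ParserState::InRecord { header, body }, Some(new_header)) => {
                records.push((header, body));
                ParserState::InRecord { header: new_header.to_string(), body: String::new() }
            }
            // First header in the file.
            (ParserState::StartOfFile, Some(new_header)) => {
                ParserState::InRecord { header: new_header.to_string(), body: String::new() }
            }
            // Sequence lines accumulate into the current record's body.
            (ParserState::InRecord { header, mut body }, None) => {
                body.push_str(line);
                ParserState::InRecord { header, body }
            }
            // Content before any header is ignored in this sketch.
            (ParserState::StartOfFile, None) => ParserState::StartOfFile,
        };
    }
    // End of input finishes the last record, if any.
    if let ParserState::InRecord { header, body } = state {
        records.push((header, body));
    }
    records
}

fn main() {
    let records = parse(">a\nAC\nGT\n>b\nTT\n");
    assert_eq!(records, vec![
        ("a".to_string(), "ACGT".to_string()),
        ("b".to_string(), "TT".to_string()),
    ]);
}
```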
```rust
enum ParserState {
    StartOfFile {
        /// only non-empty if settings.allow_preceding_comment is false
        contents: String,
```
my string concatenation speed alarms went off here, but actually I think this is basically fine. Might be slightly faster to accumulate, like, a Vec<&str> or something and then join them at the end, but 🤷
Yeah, I thought about that as well, but I think the quadratic behavior won't manifest here. Maybe I should add 10,000-line comment, concatenated-header, and sequence-body checks to be safe :-)
Also, I don't think we can hold a Vec<&str> because lifetimes™ -- we get chunks of data fed to us from a BufRead instead of a single string we could hold onto and reference chunks out of :-(
(I agree it's not quadratic, since Rust Strings are mutable and push_str grows the buffer with amortized reallocation)
Added some tests for quadratic-ness in most recent commit, using a panic_after helper so they'll panic if the runtime ever grows significantly.
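One common shape for such a panic_after helper (the PR's actual helper may differ) is a worker thread plus a channel with a receive timeout:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Sketch of a panic_after-style test helper (assumed shape, not
// necessarily the PR's implementation): run `f` on a worker thread and
// panic if it hasn't finished within `timeout`, so an accidentally
// quadratic parse fails the test instead of just running slowly.
fn panic_after<T, F>(timeout: Duration, f: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let _ = tx.send(f());
    });
    match rx.recv_timeout(timeout) {
        Ok(result) => {
            handle.join().expect("worker thread panicked");
            result
        }
        Err(_) => panic!("test took longer than {:?}", timeout),
    }
}

fn main() {
    // A fast operation completes well within the deadline.
    let sum = panic_after(Duration::from_secs(5), || (0u64..1_000).sum::<u64>());
    assert_eq!(sum, 499_500);
}
```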
Apologies for the large diff! Much of it is comments and tests, though.
These changes make the FASTA parser more flexible, and will allow us to handle FASTA files the way we need for SecureDNA (SDNA employees, refer to the "Notes" section in [[Alpha API endpoint]] on the internal wiki).
Changes
- Adds 2 new settings: `allow_preceding_comment` and `concatenate_headers`. These are described in the doc comments, but briefly: `allow_preceding_comment = true` will ignore any content preceding the first header, like before. If false, it will emit that content as a headerless (`header = ""`) record. `concatenate_headers = true` will glue successive headers with no content together with newlines. If `false`, it will use the old behavior and emit `content = ""` records. See the doc comment for a doctest that explains this.
- The return value of `FastaParser` is now a `FastaRecord` with `header` and `contents` fields, instead of a 2-tuple. It's also grown a `line_range` field which gives the line numbers of the record (including header, 1-indexed, `[start, end)`).
- Removes `DnaFastaParser` and `SimpleFastaParser` in favor of a single `FastaParser` type which is generic over `FromStr` types. Since `String` has an (infallible) `FromStr` impl, `SimpleFastaParser` is now `FastaParser<String>`, and `DnaFastaParser` is now `FastaParser<DnaSequence>`. We didn't have this type before, but a hypothetical `ProteinFastaParser` is now `FastaParser<ProteinSequence>`, and you could even have `FastaParser<u64>` if you wanted that.
- FASTA parsing is now state-machine-driven, which should make a non-allocating API easier to add in the future if we want, and it also simplified the parsing code, which was getting hairy handling the new special cases in the old loop. Luckily most of @mkysel's parsing logic was relatively straightforward to port into the state machine without too many changes besides handling the new settings.
- The old tests are all redone to use the new API (and to test all 4 settings combinations, where it makes sense), and a bunch of new tests are added.
Misc changes:
- `parse` now explicitly asks for a `BufRead` instead of taking `Read` and wrapping it internally in a `BufReader`. I checked the ecosystem using Sourcegraph, and that seemed to be the way most Rust stuff works, so this brings our API in line with that.
- There's a new `parse_str` method that takes advantage of `&[u8]` having a `BufRead` impl. That's part of the reason for the other change, too, since before it would have needlessly allocated a `BufReader` buffer, which would bother me :-)
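To make the record shape concrete, here's a sketch using the field names from this description (the real struct may differ):

```rust
// Sketch of the FastaRecord shape described above; field names follow the
// PR description, but the actual struct in the crate may differ.
#[derive(Debug, PartialEq)]
struct FastaRecord {
    header: String,
    contents: String,
    /// 1-indexed, half-open [start, end) range of lines covered by the
    /// record, including its header line.
    line_range: (usize, usize),
}

fn main() {
    // For the input ">seq1\nACGT\n", the single record spans lines 1 and 2,
    // so its half-open range is [1, 3).
    let record = FastaRecord {
        header: "seq1".to_string(),
        contents: "ACGT".to_string(),
        line_range: (1, 3),
    };
    assert_eq!(record.line_range.1 - record.line_range.0, 2);
}
```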