Skip to content

Update FASTA parser to be configurable, generic, track line numbers, and other changes#9

Merged
vgel merged 2 commits intomainfrom
configurable-generic-fasta-parser
Jul 26, 2022
Merged

Update FASTA parser to be configurable, generic, track line numbers, and other changes#9
vgel merged 2 commits intomainfrom
configurable-generic-fasta-parser

Conversation

@vgel
Copy link
Copy Markdown
Contributor

@vgel vgel commented Jul 22, 2022

Apologies for the large diff! Much of it is comments and tests, though.

These changes make the FASTA parser more flexible, and will allow us to handle FASTA files the way we need for SecureDNA (SDNA employees, refer to the "Notes" section in [[Alpha API endpoint]] on the internal wiki).

Changes

Adds 2 new settings: allow_preceding_comment and concatenate_headers. These are described in the doc comments, but briefly:

  • allow_preceding_comment = true will ignore any content preceding the first header, like before. If false, it will emit that content as a headerless (header = "") record.
  • concatenate_headers = true will glue successive headers with no content together with newlines. If false, it will use the old behavior and emit content = "" records. See the doc comment for a doctest that explains this.

The return value of FastaParser is now a FastaRecord with header and contents fields, instead of a 2-tuple. It's also grown a line_range field which gives the line numbers of the record (including header, 1-indexed, [start, end)).

Removes DnaFastaParser and SimpleFastaParser in favor of a single FastaParser type which is generic over FromStr types. Since String has an (infalliable) FromStr impl, SimpleFastaParser is now FastaParser<String>, and DnaFastaParser is now FastaParser<DnaSequence>. We didn't have this type before, but a hypothetical ProteinFastaParser is now FastaParser<ProteinSequence>, and you could even have FastaParser<u64> if you wanted that.

FASTA parsing is now state-machine-driven, which should make a non-allocating API easier to add in the future if we want, and also simplified the parsing code since it was getting hairy handling the new special cases in the old loop. Luckily most of @mkysel 's parsing logic was relatively straightforward to port into the state machine without too many changes besides handling the new settings.

The old tests are all redone to use the new API (and to test all 4 settings combinations, where it makes sense), and a bunch of new tests are added.

Misc changes:

parse now explicitly asks for a BufRead instead of taking Read and wrapping it internally in BufReader. I checked the ecosystem using SourceGraph and that seemed to be the way most Rust stuff works, so this brings our API in line with that.

There's a new parse_str method that takes advantage of &[u8] having a BufRead impl. That's part of the reason for the other change, too, since before that would have needlessly allocated a BufReader buffer which would bother me :-)

also removes the distinction between a DnaFastaParser and a "regular"
SimpleFastaParser -- everything is "FastaParser" now, with a phantom
generic FromStr type that will be parsed (String has an infalliable FromStr impl)
So SimpleFastaParser is now FastaParser<String>, and DnaFastaParser is now
FastaParser<DnaSequence>. ProteinFastaParser didn't exist before, but it's now
FastaParser<ProteinSequence>. A hypothetical integer fasta format could be
FastaParser<u64>, if you wanted that for some weird reason.

Fasta parsing is now state-machine-driven, which should make a non-allocating
API easier to add in the future, and also simplified the parsing code since it was
getting hairy handling the new special cases in the old loop.
@vgel vgel requested review from benwr and mkysel July 22, 2022 22:22
Copy link
Copy Markdown
Contributor

@Ravna Ravna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the tests, though probably the protein-specific tests should use things that can't possibly be nuceleotides, like maybe L, K, R, S, etc so we're sure they're not being parsed as DNA.

I didn't notice a test with nonunique header lines. Probably this can't cause a problem, but good to have.

Similarly use of Unicode in a header line, which should not fail. (I should probably specify "UTF-8" in the spec, actually; "Unicode" is too vague and allows atrocities like UTF-16 and byte-order marks and all kinds of other weird things Windows might do.)

Ditto showing that case is ignored in sequences.

Also, do we document (and test) that N or other ambiguity codes are/are not allowed? That would probably be a useful option to have available, so a parse error for something that must not have ambiguity happens as early as possible.

Comment thread src/fasta.rs
.unwrap();
assert_eq!(r.len(), 0);
fn test_comment_fasta_with_no_allow_preceding_comment() {
assert_parse_with_allow_preceding_comment(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this behavior is correct for the default case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate? Looking at it, the setting says "don't allow ignoring comments, parse it as a headerless record" and that's what the code seems to be doing, but I might be missing something (or the name might be misleading?)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this test variant is testing a situation in which we do not allow preceding comments.

The existing code ignores anything that precedes the first header. In this case, everything.

I don't like the idea of assuming that anything without a header is some content.

If there is no >, it should not parse.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, I see what you mean -- only allow a comment if there's at least one record. Hmm.

I guess my feeling is, we allow parsing an empty file, right? And comment with no records = an empty file? Like, if you had some system that spit out a comment like:

// sequenced-on: 2022-07-30

on every record, and some sequencing result came back empty, then you'd want to parse that as an empty file. And if you don't want comments, you can use strict() mode instead. But I see what you mean that this could bite you if you expected comments on non-empty files, but accidentally omit a header and then lose a record on something that shouldn't be an empty file. Argh. FASTA is a sucky format :-(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We could introduce an option like "comments must start with this prefix" to prevent this, but that's non-standard -- if hewing to the FASTA "standard" even makes sense! :-p )

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this test variant is testing a situation in which we do not allow preceding comments.

The existing code ignores anything that precedes the first header. In this case, everything.

I don't like the idea of assuming that anything without a header is some content.

If there is no >, it should not parse.

Nope, disagree, for security reasons, which is why I specified this behavior in the API.

Here's the problem: We can't be sure that some provider, or some benchtop, won't accept a sequence without a header. And if anyone in the world does that, it provides the most trivial bypass possible for screening: just never include a header in your order, and we won't screen it, but the provider will make it. And if some provider insists on seeing some header but still considers stuff before the header as something to be manufactured, then a malicious customer submits "malicious stuff, header, benign stuff" and still escapes screening.

This way, we put the onus, both technical and legal, directly on the provider who is giving us the task of screening. If they consider stuff before a header to be a comment, then they are required to remove it before passing it to us. That guarantees that the behavior they think they want and the behavior they give to us will match. They are free to either ignore it (and remove it before handing it to us) or complain to the customer and make them remove it, but anything they pass to us will be screened, and we aren't going to try to heuristicate either whether pre-any-header stuff is a comment or is valid content: by treating it all as content, we force the provider or their UI to make that choice, which is a trivial amount of programming for them either way.

Copy link
Copy Markdown
Contributor

@mkysel mkysel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my immediate thought. The provider still has to somehow match the result of the screening to the sequences in the order, right? For each sequence you should verify whether it is a hazard or not. Not having any reply for a sequence is clearly a problem that someone would notice, no?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also just noticed some of the upthread commentary on this re comments.

The deal is that anything which is not a sequence is a comment in FASTA. That means any line which begins with > or ; (though ; is semi-obsolete). Anything else is parse error. By definition, at least as far as we're concerned, comments are *opaque blobs" with no semantic content whatsoever, because there is no standard about what goes in there. Particular consumers of such files, such as NCBI, may have standards for what they accept (since in NCBI's case they're using them to create database keys for submissions), but in general there's just no way at all to predict what's going on and no enforcement even if there was a standard, which there isn't. All we're doing with preserving these > or ; lines is making it slightly easier to make a UI that tells the customer where in the file we're objecting to something. That's it. So we really don't care about comments at all. And the only fly in that ointment is the practice in some FASTA files of assuming that anything before the first > is also ignored. We're enabling that with a switch, but we ourselves never intend to enable that for screening---you have something that could plausibly be a sequence before the first > and you don't want it synthesized or screened, then you put a ; or a > in front of each line of it. Or if you're a provider and you want to allow that, you take it out before handing it to us. Then it's clear who's responsible if you, as a provider, deleted that stuff before you gave it to us but then went ahead and made it anyway.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also the way to deal with "you accidentally created a null record and didn't notice it" is not to overload the commenting/screening system into pointing out your error. Either you as a customer need to run QA first on what you're ordering, or the provider may have its own rules/helpful hints and tells you, but that's not our job. And it can't be, because valid FASTA files exist in the wild with multiple consecutive header lines and we can't tell that from an error.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We could introduce an option like "comments must start with this prefix" to prevent this, but that's non-standard -- if hewing to the FASTA "standard" even makes sense! :-p )

No, let's not; nobody will do this. :) What they do instead is to just pack all kinds of trash into the > line, including random lab notes like "sequencing added ACCTTGGACA to the previous record" and all sorts of other random asides. If they want something more sophisticated, they should be using tools that maintain their sequences in a DB (for example) and then spit out valid FASTA files with the extra info suppressed, especially if that info is just going into the maw of a synth, which is certainly not reading those comments anyway. Trying to put all this metadata into the FASTA file itself is like (true story!) what I saw once where the exec dir of a ~$1M/yr nonprofit thought the right way to issue PO's was to just create them directly as PDF's from a template and was amazed to hear that actual software existed in the world which tracked all that stuff in a DB and could then issue PO's & invoices & so forth downstream on demand.

@mkysel
Copy link
Copy Markdown
Contributor

mkysel commented Jul 25, 2022

There's a new parse_str method that takes advantage of &[u8] having a BufRead impl. That's part of the reason for the other change, too, since before that would have needlessly allocated a BufReader buffer which would bother me :-)

I suspect @hwchen will have opinions here :)

@hwchen
Copy link
Copy Markdown
Contributor

hwchen commented Jul 25, 2022

re bufreader:

  • if it's not in the hotloop, it probably doesn't matter if there's an extra allocation for &[u8] (looks like csv takes this approach), but I understand the principle of it :)
  • if it's in the hotloop, perhaps it's better to just read the whole file instead of using bufreader? I don't know how large the FASTAs we expect to parse are, though, and perhaps this doesn't really make a difference in perf. edit: I guess this is the approach that parse_str allows

Those are my opinions :) but I don't have a problem with this approach. There's definitely less hidden allocation and I appreciate that.

@Ravna
Copy link
Copy Markdown
Contributor

Ravna commented Jul 25, 2022

My one concern with reading the whole file at once is what happens if someone tries to shove a gigabyte down our throats. OTOH, if something else will break upstream, maybe that's less of a concern. We probably want an optional parameter somewhere (the first place we might otherwise allocate a user-controlled and indefinitely-large object) which is a maximum-size limit, so we/providers can supply that with something reasonable. I'd guess quickdna is probably actually not a bad place to put that, because it's the first stop for untrusted data. And then we set that to something reasonable at our end and give providers the ability to override it if for some reason they want to accept a Gn input or something and know they can handle it.. [I'm less concerned about giant allocs on a benchtop because you're only hurting yourself, but we still probably want to have the BT just reject gracefully and not have some C/Java-side thing inside it blow up instead, and their developers may want this limit anyway.]

@mkysel
Copy link
Copy Markdown
Contributor

mkysel commented Jul 25, 2022

The rocket server already rejects too large requests. But it only applies to users that use the REST API endpoint, rather than using it as a library.

@Ravna
Copy link
Copy Markdown
Contributor

Ravna commented Jul 25, 2022

We won't be using Rocket in production in general---we could be getting some giant thing uploaded by a customer to a provider's web page or non-web API and then handing it to us as a library. And we probably want this available for, e.g., a benchtop into which someone plugs a flash drive. :) In general we should be as as paranoid as possible about user-supplied inputs, and this is an easy one for users to supply. (We also need to be paranoid about network inputs, but that's a different issue.)

@vgel
Copy link
Copy Markdown
Contributor Author

vgel commented Jul 26, 2022

@hwchen re: BufReader, my feeling is both a) I prefer to not silently allocate, but more b) depending on system you may want to tune your bufreader size to optimal for your disk / ram / etc., and by hiding our bufreader inside the other one we make it harder for people to do that without stacking bufreaders and wasting memory (which may be important on a benchtop that only has <4kb of RAM...). Glad you agree :-)

@Ravna re: DOS with large file, I agree but don't want to address that yet in this PR. I think the Vec-interface is working OK for alpha, and the state-machine-based parser will make implementing a streaming parser a lot easier in the future, and this PR is a bit big already...

@Ravna
Copy link
Copy Markdown
Contributor

Ravna commented Jul 26, 2022

Sure, re DoS, leave it the way it is, but can you make an issue so we don't forget? (I may already have this in the testing zoo but should put it there if I don't. Reading FASTA input is pretty much the only input we'll take where we don't know up front how big it's supposed to be anyway; I'm planning for the network protocol to give an up-front count anyway for everything where we control both ends.)

@vgel vgel mentioned this pull request Jul 26, 2022
Copy link
Copy Markdown
Collaborator

@benwr benwr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall; thanks for pretty thorough test suite!

Comment thread src/fasta.rs
}

#[cfg(test)]
mod tests {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh man this is just screaming for quickcheck tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe? I don't see how this could fail / what quickcheck would uncover (I guess that's the point of property testing :-) ). I actually realized in the latest commit that the trim_end_matches here is pointless, so the line is just Some(&line[1..]) now, which I really really don't think can fail :-)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't mean this specific snippet, just this module in general.

Comment thread src/fasta.rs
result.push((existing_title, lines.join("")));
lines.clear();
}
enum ParserState {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doing this as an FST seems pretty reasonable, but it is a lot of code to check. I wonder if using parser combinators might have been easier.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting point. I leaning towards the state machine approach since I had already written a similar one (that I need to rip out and replace at some point) in our pipeline code, and I wasn't sure the best parser combinator libraries for Rust -- I'd played with nom and had mixed experiences with it.

Comment thread src/fasta.rs
enum ParserState {
StartOfFile {
/// only non-empty if settings.allow_preceding_comment is false
contents: String,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my string concatenation speed alarms went off here, but actually I think this is basically fine. Might be slightly faster to accumulate, like, a Vec<&str> or something and then join them at the end, but 🤷

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought about that as well, but I think the quadratic behavior won't manifest here. Maybe I should add 10,000 line comment, concatenated header, and sequence body checks to be safe :-)

Copy link
Copy Markdown
Contributor Author

@vgel vgel Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I don't think we can hold a Vec<&str> because lifetimes™, we get chunks of data fed into us from a BufRead instead of a single string we can hold onto and reference chunks out of :-(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right :/

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I agree re not quadratic, since Strings in Rust are mutable)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some tests for quadratic-ness in most recent commit, using a panic_after helper so they'll panic if the runtime ever grows significantly.

@vgel vgel merged commit 1322910 into main Jul 26, 2022
@vgel vgel deleted the configurable-generic-fasta-parser branch July 26, 2022 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants