Improve performance of rispy's parser and refactor parser#66
Improve performance of rispy's parser and refactor parser#66shapiromatron merged 20 commits intoMrTango:mainfrom
Conversation
There is no need for a Baseparser. This lib is on parsing RIS, so the RisParser should be leading.
|
The changes I made for the multiline parsing impacted the API. I chose to deprecate I took this freedom to break some other parts of the API and simplify things. I removed the regexp and deprecated BaseParser in favor of RisParser (no need to offer multiple options for the same thing). I'm happy to discuss these API changes with you. Would this impact you as a user, and how? |
|
I'm planning to make a PubMed parser (subclass of RisParser). I will use this as an experiment to see if this PR's API design is flexible enough to implement such a format (example https://github.com/asreview/citation-file-formatting/tree/main/Datasets/RIS). |
|
Implemented suggestions and checks by @PeterLombaers. Updated benchmark results: Compared to main branch: |
|
Tested against this repo with many example exports: https://github.com/asreview/citation-file-formatting/tree/ecddcceff24791f5454f53f64515d5d0990649aa/Datasets/RIS. This PR will add support for Cochrane RIS exports (#61). All other example RIS files work except from outlier Pubmed. |
shapiromatron
left a comment
There was a problem hiding this comment.
Looks great, a complete overhaul like you said but I think the code is more readable simpler. I had a few suggestions about adding type annotations if it's easy to do; I think they're largely just str so hopefully it's not too bad to add.
Excited to get this one in finally!
|
@J535D165 Please let me know when you're ready to merge it in and I'll go ahead and do so. |
|
@shapiromatron Thanks for the feedback, all checks are green now. Feel free to merge. |
|
Received a real world example RIS file today of 650MB: old: I hope we can speed this up further in the near future (for example, with Rust). |
Still a nice speedup, this will be valuable even w/o migrating to 🦀 |
|
Thanks for merging. I will continue working on the PubMed parser now. Hopefully, I can implement it without API changes. If that works out, we are ready to prepare v1. |
This PR proposes to refactor and improve performance. Although breaking changes are not necessary, I would recommend changing parts of the API as well. Happy to hear your thoughts
Performance
I used PR #65 to test the performance of rispy on my machine.
Current main branch:
This PR:
According to this benchmark, this implementation is at least two times faster. This seems to hold for both small and larger datasets.
No more regexp
This PR removes the need for regular expressions. Everything between the start and end tags is a tag with a specific format. So, you don't need to apply a regular expression to each line. This saves a lot of time and potential mistakes.
I left the current regexp class attribute definition, but I would suggest to remove it.
Issues resolved
This PR also resolves:
API changes
We can simplify, improve, and refactor a bit more. I'll leave that for a moment until the community has looked over this proposal.
Other
I added PR #64's changes to this PR for convenience. They can/will be changed after a decision has been made on PR #64.