Improve performance of rispy's parser and refactor parser by J535D165 · Pull Request #66 · MrTango/rispy

J535D165 · 2024-08-27T13:09:08Z

This PR refactors the parser and improves its performance. Although breaking changes are not necessary, I would recommend changing parts of the API as well. Happy to hear your thoughts.

Performance

I used PR #65 to test the performance of rispy on my machine.

Current main branch:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               484.5452 (1.0)      517.4370 (1.0)      502.2159 (1.0)      14.4051 (1.0)      499.9848 (1.0)      25.6566 (1.0)           2;0  1.9912 (1.0)           5           1
test_benchmark_rispy_large_multiline     569.1948 (1.17)     626.2228 (1.21)     592.6684 (1.18)     27.2004 (1.89)     581.3098 (1.16)     50.2680 (1.96)          1;0  1.6873 (0.85)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

This PR:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               227.7945 (1.0)      288.0316 (1.0)      258.4050 (1.0)      21.6520 (1.07)     258.2270 (1.0)      23.2613 (1.0)           2;0  3.8699 (1.0)           5           1
test_benchmark_rispy_large_multiline     263.7817 (1.16)     313.3475 (1.09)     282.7213 (1.09)     20.1495 (1.0)      274.9604 (1.06)     29.4744 (1.27)          1;0  3.5371 (0.91)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

According to this benchmark, this implementation is roughly two times faster (mean 502 ms vs. 258 ms on the large file, 593 ms vs. 283 ms on the multiline file). This seems to hold for both small and larger datasets.
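For reference, the speedup factors can be recomputed from the mean timings in the two tables above. This is only a small arithmetic sketch; the dictionary and function names are made up for illustration.

```python
# Mean times in milliseconds, copied from the two benchmark tables above.
MAIN_MEAN_MS = {
    "test_benchmark_rispy_large": 502.2159,
    "test_benchmark_rispy_large_multiline": 592.6684,
}
PR_MEAN_MS = {
    "test_benchmark_rispy_large": 258.4050,
    "test_benchmark_rispy_large_multiline": 282.7213,
}


def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster the `after_ms` run is than `before_ms`."""
    return before_ms / after_ms


# One speedup factor per benchmark, main branch mean divided by PR mean.
speedups = {
    name: speedup(MAIN_MEAN_MS[name], PR_MEAN_MS[name]) for name in MAIN_MEAN_MS
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

With only five rounds per benchmark and a standard deviation of 14 to 27 ms, these factors should be read as approximate.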

No more regexp

This PR removes the need for regular expressions. Everything between the start tag and the end tag of a record is a tag line with a specific format, so there is no need to apply a regular expression to each line. This saves a lot of time and avoids potential mistakes.

I left the current regexp class attribute definition in place, but I would suggest removing it.
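To illustrate the idea, here is a minimal sketch of regex-free line splitting. It is not the code from this PR. It assumes the common RIS layout of a two-character tag, two spaces, a hyphen, and a space (for example `TY  - JOUR`), and the name `split_ris_line` is hypothetical.

```python
import string
from typing import Optional, Tuple

# Assumption: a RIS tag is an uppercase letter followed by an uppercase
# letter or a digit (e.g. "TY", "AU", "T1").
_UPPER = frozenset(string.ascii_uppercase)
_UPPER_OR_DIGIT = frozenset(string.ascii_uppercase + string.digits)


def split_ris_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (tag, content) if `line` looks like a RIS tag line, else None."""
    line = line.rstrip("\r\n")
    # Fixed positions: tag at [0:2], then two spaces and a hyphen at [2:5].
    if len(line) < 5 or line[2:5] != "  -":
        return None
    if line[0] not in _UPPER or line[1] not in _UPPER_OR_DIGIT:
        return None
    content = line[5:]
    # Drop the single space after the hyphen when present; "ER  -" has none.
    if content.startswith(" "):
        content = content[1:]
    return line[:2], content


print(split_ris_line("TY  - JOUR"))
print(split_ris_line("a continuation line without a tag"))
```

Plain slicing and membership checks like these avoid running a regular expression match on every line, while still rejecting lines that do not have the tag layout.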

Issues resolved

This PR also resolves:

Handling tags with empty values (#62 and #63) by @holub008. No extra changes were needed to pass the test defined by @holub008 in PR #63.
Ignore tags outside of a record (#61) by @holub008. Resolved by design.
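A rough sketch of why both cases can fall out of the design: if records are only built between a start tag and an end tag, anything outside is skipped, and an empty value is just an empty string. This is an illustration under assumptions, not the PR's implementation; `iter_records` is a hypothetical name, `TY` and `ER` are assumed as start and end tags, and continuation lines are simply skipped here for brevity.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Record = Dict[str, List[str]]


def iter_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one dict per record, collecting tag values between TY and ER."""
    record: Optional[Record] = None  # None means "outside a record"
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line[2:5] != "  -":
            continue  # not a tag line; this sketch ignores such lines
        tag, content = line[:2], line[5:].strip()
        if tag == "TY":
            record = {"TY": [content]}  # start tag opens a new record
        elif record is None:
            continue  # tag outside a record: ignored by construction
        elif tag == "ER":
            yield record  # end tag closes the record
            record = None
        else:
            record.setdefault(tag, []).append(content)  # "" for empty values


SAMPLE = [
    "Provider: Example export header",
    "DB  - tag before any record",
    "TY  - JOUR",
    "AU  - Doe, J.",
    "KW  - ",
    "AU  - Roe, R.",
    "ER  - ",
    "N1  - tag after the record",
]

records = list(iter_records(SAMPLE))
print(records)
```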

API changes

We can simplify, improve, and refactor a bit more. I'll leave that for a moment until the community has looked over this proposal.

Other

I added PR #64's changes to this PR for convenience. They can/will be changed after a decision has been made on PR #64.

There is no need for a BaseParser. This library is about parsing RIS, so the RisParser should be leading.

J535D165 · 2024-08-28T08:07:18Z

The changes I made for the multiline parsing impacted the API. I chose to deprecate get_tag and get_content because there were zero guarantees that they actually return the tag and the content (they just slice the line).
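A small illustration of the problem with slicing. The two functions below are hypothetical stand-ins written for this example (the slice positions are an assumption based on the usual `TY  - value` layout); they are not copied from rispy.

```python
def get_tag_by_slice(line: str) -> str:
    """Hypothetical stand-in: return the first two characters, whatever they are."""
    return line[0:2]


def get_content_by_slice(line: str) -> str:
    """Hypothetical stand-in: return everything from position 6 onwards."""
    return line[6:]


tag_line = "AB  - An abstract that starts here"
continuation = "and continues on a second line."

# On a well-formed tag line the slices happen to be right.
print(get_tag_by_slice(tag_line), "|", get_content_by_slice(tag_line))

# On a continuation line they still return something, but it is neither
# a tag nor the full content: the first six characters are silently lost.
print(get_tag_by_slice(continuation), "|", get_content_by_slice(continuation))
```

A single function that validates the line and returns both parts (or signals that the line is not a tag line) avoids this silent failure mode.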

I took the liberty of breaking some other parts of the API to simplify things. I removed the regexp and deprecated BaseParser in favor of RisParser (no need to offer multiple options for the same thing).

I'm happy to discuss these API changes with you. Would this impact you as a user, and how?

J535D165 · 2024-08-28T08:11:27Z

I'm planning to make a PubMed parser (subclass of RisParser). I will use this as an experiment to see if this PR's API design is flexible enough to implement such a format (example https://github.com/asreview/citation-file-formatting/tree/main/Datasets/RIS).
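As a thought experiment, this is roughly the kind of flexibility such a subclass would need. The classes below are hypothetical and are not rispy's RisParser or its API. The sketch assumes that PubMed exports pad tags to four characters followed by `- ` (for example `PMID- 12345678`), with indented continuation lines.

```python
from typing import Optional, Tuple


class SketchRisParser:
    """Hypothetical parser: tag width and separator are class attributes."""

    TAG_WIDTH = 2
    SEPARATOR = "  - "

    def parse_line(self, line: str) -> Optional[Tuple[str, str]]:
        """Return (tag, content) for a tag line, or None for anything else."""
        line = line.rstrip("\r\n")
        end = self.TAG_WIDTH + len(self.SEPARATOR)
        # Compare without trailing whitespace so that "ER  -" also matches.
        if line[self.TAG_WIDTH:end].rstrip() != self.SEPARATOR.rstrip():
            return None
        tag = line[: self.TAG_WIDTH].strip()
        if not tag:
            return None
        return tag, line[end:]


class SketchPubMedParser(SketchRisParser):
    """Hypothetical subclass: only the line layout differs."""

    TAG_WIDTH = 4
    SEPARATOR = "- "


print(SketchRisParser().parse_line("TY  - JOUR"))
print(SketchPubMedParser().parse_line("PMID- 12345678"))
```

If the real format turns out to need more than a different tag width and separator, overriding a single line-parsing method would be the next thing to try.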

J535D165 · 2024-08-29T08:38:14Z

Implemented suggestions and checks by @PeterLombaers.

Updated benchmark results:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               241.2698 (1.0)      294.8977 (1.0)      259.4689 (1.0)      21.0500 (1.0)      252.0566 (1.0)      23.0623 (1.0)           1;0  3.8540 (1.0)           5           1
test_benchmark_rispy_large_multiline     282.0263 (1.17)     345.4905 (1.17)     300.7980 (1.16)     26.1683 (1.24)     288.5437 (1.14)     28.6715 (1.24)          1;0  3.3245 (0.86)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

Compared to main branch:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               484.5452 (1.0)      517.4370 (1.0)      502.2159 (1.0)      14.4051 (1.0)      499.9848 (1.0)      25.6566 (1.0)           2;0  1.9912 (1.0)           5           1
test_benchmark_rispy_large_multiline     569.1948 (1.17)     626.2228 (1.21)     592.6684 (1.18)     27.2004 (1.89)     581.3098 (1.16)     50.2680 (1.96)          1;0  1.6873 (0.85)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

J535D165 · 2024-09-02T15:56:18Z

Tested against this repo with many example exports: https://github.com/asreview/citation-file-formatting/tree/ecddcceff24791f5454f53f64515d5d0990649aa/Datasets/RIS. This PR will add support for Cochrane RIS exports (#61). All other example RIS files work, except for the outlier PubMed.

shapiromatron

Looks great. A complete overhaul, like you said, but I think the code is more readable and simpler. I had a few suggestions about adding type annotations if it's easy to do; I think they're largely just str, so hopefully it's not too bad to add.

Excited to get this one in finally!

shapiromatron · 2025-04-17T02:17:18Z

@J535D165 Please let me know when you're ready to merge it in and I'll go ahead and do so.

J535D165 · 2025-05-20T17:31:20Z

@shapiromatron Thanks for the feedback, all checks are green now. Feel free to merge.

J535D165 · 2025-05-20T17:59:53Z

Received a real-world example RIS file of 650 MB today:

-------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in s)                              Min      Max     Mean  StdDev   Median      IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_big_example_file     16.7548  32.5059  25.0655  6.7390  27.9454  11.2026       2;0  0.0399       5           1
-----------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

old:

------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in s)                              Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_big_example_file     30.5232  43.5556  35.9390  5.2012  35.9779  7.8023       2;0  0.0278       5           1
----------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

I hope we can speed this up further in the near future (for example, with Rust).

shapiromatron · 2025-05-22T14:48:15Z

> I hope we can speed this up further in the near future (for example, with Rust).

Still a nice speedup, this will be valuable even w/o migrating to 🦀

J535D165 · 2025-05-22T14:56:28Z

Thanks for merging. I will continue working on the PubMed parser now. Hopefully, I can implement it without API changes. If that works out, we are ready to prepare v1.

J535D165 added 7 commits August 26, 2024 23:15

Revert strip UTF-8 BOM strip

c0e8614

Improve performance and API of rispy parser

6edcb4a

Fix test

05426a6

Fix doctest

ed804a1

Add test from PR MrTango#63

af5fdaf

Happy lint

ee20d34

Merge branch 'revert-bom-removal' into improve-performance

14fafb1

This was referenced Aug 27, 2024

Add benchmark for rispy #65

Merged

Revert strip UTF-8 BOM strip #64

Merged

PeterLombaers reviewed Aug 27, 2024

Comment thread rispy/parser.py Outdated

PeterLombaers reviewed Aug 27, 2024

Comment thread rispy/parser.py Outdated

J535D165 added 6 commits August 28, 2024 09:07

Remove BaseParser in favor of RisParser

5bc9917

There is no need for a Baseparser. This lib is on parsing RIS, so the RisParser should be leading.

Deprecate get_tag and get_content in favor of parse_line

0138384

Refactor, no changes

be54a1a

Add test that tests multiple multiline formats

b624849

Add support for more complex multiline RIS tags

93d5f27

Happy lint

92e523d

PeterLombaers reviewed Aug 28, 2024

Comment thread rispy/parser.py Outdated

PeterLombaers reviewed Aug 28, 2024

Comment thread rispy/parser.py Outdated

J535D165 added 3 commits August 29, 2024 10:24

Add changes as suggested by Peter

82defc8

Fix other issue discussed by Peter

38a25ce

Add more challenges to multiline test

b2aa55a

This was referenced Sep 2, 2024

Add test for loading RIS files from citation-file-formatting asreview/asreview#1824

Merged

Add support for PubMed RIS files asreview/asreview#1825

Open

J535D165 mentioned this pull request Mar 10, 2025

Add (example of) PubMed parser #68

Open

J535D165 requested a review from shapiromatron March 10, 2025 14:56

shapiromatron self-assigned this Mar 26, 2025

shapiromatron mentioned this pull request Apr 11, 2025

Handling tags with empty values #63

Closed

shapiromatron approved these changes Apr 17, 2025

Comment thread rispy/parser.py Outdated

Comment thread rispy/parser.py Outdated

shapiromatron mentioned this pull request Apr 17, 2025

Add support for Python 3.13 and remove Python 3.8 #67

Merged

shapiromatron assigned J535D165 and unassigned shapiromatron Apr 17, 2025

J535D165 added 3 commits May 20, 2025 19:11

Add type annotation

1848ac4

Merge branch 'main' into improve-performance

4e056d7

Fix failing type annotation for Python 3.8

fea8ee9

Merge branch 'main' into improve-performance

d431843

shapiromatron merged commit 2fe60b7 into MrTango:main May 22, 2025
5 checks passed

J535D165 deleted the improve-performance branch May 22, 2025 14:53
