
Improve performance of rispy's parser and refactor parser#66

Merged
shapiromatron merged 20 commits into MrTango:main from J535D165:improve-performance
May 22, 2025

@J535D165
Collaborator

@J535D165 J535D165 commented Aug 27, 2024

This PR refactors the parser and improves its performance. Although breaking changes are not necessary, I would recommend changing parts of the API as well. Happy to hear your thoughts.

Performance

I used PR #65 to test the performance of rispy on my machine.

Current main branch:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               484.5452 (1.0)      517.4370 (1.0)      502.2159 (1.0)      14.4051 (1.0)      499.9848 (1.0)      25.6566 (1.0)           2;0  1.9912 (1.0)           5           1
test_benchmark_rispy_large_multiline     569.1948 (1.17)     626.2228 (1.21)     592.6684 (1.18)     27.2004 (1.89)     581.3098 (1.16)     50.2680 (1.96)          1;0  1.6873 (0.85)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

This PR:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               227.7945 (1.0)      288.0316 (1.0)      258.4050 (1.0)      21.6520 (1.07)     258.2270 (1.0)      23.2613 (1.0)           2;0  3.8699 (1.0)           5           1
test_benchmark_rispy_large_multiline     263.7817 (1.16)     313.3475 (1.09)     282.7213 (1.09)     20.1495 (1.0)      274.9604 (1.06)     29.4744 (1.27)          1;0  3.5371 (0.91)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

According to this benchmark, this implementation is roughly twice as fast as the main branch (mean of about 258 ms versus 502 ms for the large file). This seems to hold for both small and larger datasets.
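The speedup can be checked directly from the Mean column of the two tables above. The snippet below is only an arithmetic sketch; the dictionary names are made up for illustration and are not part of rispy or its benchmark suite.

```python
# Mean times in ms, copied from the two benchmark tables above.
main_branch = {"large": 502.2159, "large_multiline": 592.6684}
this_pr = {"large": 258.4050, "large_multiline": 282.7213}

# Speedup factor = old mean / new mean.
speedups = {name: main_branch[name] / this_pr[name] for name in main_branch}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# large: 1.94x
# large_multiline: 2.10x
```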

No more regexp

This PR removes the need for regular expressions. Everything between the start and end tags is a tag with a specific format, so there is no need to apply a regular expression to each line. This saves a lot of time and avoids potential mistakes.

I left the current regexp class attribute definition in place, but I suggest removing it.
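To illustrate the idea, here is a minimal sketch of recognising a tag line by slicing at fixed positions instead of matching a regular expression. It assumes the common RIS layout of a two-character tag, two spaces, a hyphen, and a space before the content. The function name split_tag_line and the DELIMITER constant are hypothetical and are not taken from this PR's code.

```python
from typing import Optional, Tuple

DELIMITER = "  -"  # assumed layout: two spaces and a hyphen follow the tag


def split_tag_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (tag, content) if the line looks like 'XX  - content', else None."""
    tag = line[:2]
    # Fixed-position checks replace the per-line regular expression.
    if line[2:5] != DELIMITER or not (tag.isalnum() and tag.isupper()):
        return None
    return tag, line[6:].strip()


print(split_tag_line("TY  - JOUR"))              # ('TY', 'JOUR')
print(split_tag_line("a wrapped line of text"))  # None
```

The check is a couple of string comparisons per line, which is where the time saving over a compiled pattern match would come from in a sketch like this.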

Issues resolved

This PR also resolves:

API changes

We can simplify, improve, and refactor a bit more. I'll leave that for a moment until the community has looked over this proposal.

Other

I added PR #64's changes to this PR for convenience. They can/will be changed after a decision has been made on PR #64.

@J535D165
Collaborator Author

The changes I made for the multiline parsing impacted the API. I chose to deprecate get_tag and get_content because there were no guarantees that they actually return the tag and the content (they just slice the line).
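A small sketch of why blind slicing gives no guarantees, and of one possible way to handle wrapped lines. The get_tag and get_content functions below are stand-ins written for this example, not the library's actual code, and parse_record is a hypothetical helper rather than the approach implemented in this PR.

```python
from typing import Dict, List


def get_tag(line: str) -> str:
    # Stand-in: slices the first two characters, whatever they are.
    return line[0:2]


def get_content(line: str) -> str:
    # Stand-in: slices from position 6, whether or not a tag precedes it.
    return line[6:].strip()


lines = [
    "TY  - JOUR",
    "AB  - First line of a long abstract",
    "that wraps onto a second line",
    "ER  - ",
]

# On the wrapped line, slicing yields 'th', which is not a RIS tag.
print(get_tag(lines[2]))


def parse_record(record_lines: List[str]) -> Dict[str, List[str]]:
    """Collect tag values, appending wrapped lines to the previous value."""
    record: Dict[str, List[str]] = {}
    last_tag = None
    for line in record_lines:
        if line[2:5] == "  -":  # assumed tag-line layout
            last_tag = line[:2]
            if last_tag == "ER":  # end of record
                break
            record.setdefault(last_tag, []).append(line[6:].strip())
        elif last_tag is not None:
            record[last_tag][-1] += " " + line.strip()
    return record


print(parse_record(lines))
```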

I took the liberty of breaking some other parts of the API to simplify things. I removed the regexp and deprecated BaseParser in favor of RisParser (no need to offer multiple options for the same thing).

I'm happy to discuss these API changes with you. Would this impact you as a user, and how?

@J535D165
Collaborator Author

I'm planning to make a PubMed parser (a subclass of RisParser). I will use this as an experiment to see whether this PR's API design is flexible enough to implement such a format (example: https://github.com/asreview/citation-file-formatting/tree/main/Datasets/RIS).

@J535D165
Collaborator Author

Implemented suggestions and checks by @PeterLombaers.

Updated benchmark results:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               241.2698 (1.0)      294.8977 (1.0)      259.4689 (1.0)      21.0500 (1.0)      252.0566 (1.0)      23.0623 (1.0)           1;0  3.8540 (1.0)           5           1
test_benchmark_rispy_large_multiline     282.0263 (1.17)     345.4905 (1.17)     300.7980 (1.16)     26.1683 (1.24)     288.5437 (1.14)     28.6715 (1.24)          1;0  3.3245 (0.86)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

Compared to main branch:

--------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------
Name (time in ms)                             Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_large               484.5452 (1.0)      517.4370 (1.0)      502.2159 (1.0)      14.4051 (1.0)      499.9848 (1.0)      25.6566 (1.0)           2;0  1.9912 (1.0)           5           1
test_benchmark_rispy_large_multiline     569.1948 (1.17)     626.2228 (1.21)     592.6684 (1.18)     27.2004 (1.89)     581.3098 (1.16)     50.2680 (1.96)          1;0  1.6873 (0.85)          5           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

@J535D165
Collaborator Author

J535D165 commented Sep 2, 2024

Tested against this repo with many example exports: https://github.com/asreview/citation-file-formatting/tree/ecddcceff24791f5454f53f64515d5d0990649aa/Datasets/RIS. This PR will add support for Cochrane RIS exports (#61). All other example RIS files work, except for the outlier PubMed.

Collaborator

@shapiromatron shapiromatron left a comment

Looks great. A complete overhaul, like you said, but I think the code is more readable and simpler. I had a few suggestions about adding type annotations if it's easy to do; I think they're largely just str, so hopefully it's not too bad to add.

Excited to get this one in finally!

@shapiromatron
Collaborator

@J535D165 Please let me know when you're ready to merge it in and I'll go ahead and do so.

@J535D165
Collaborator Author

@shapiromatron Thanks for the feedback, all checks are green now. Feel free to merge.

@J535D165
Collaborator Author

Received a real-world example RIS file of 650 MB today:

-------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in s)                              Min      Max     Mean  StdDev   Median      IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_big_example_file     16.7548  32.5059  25.0655  6.7390  27.9454  11.2026       2;0  0.0399       5           1
-----------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

old:

------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in s)                              Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------
test_benchmark_rispy_big_example_file     30.5232  43.5556  35.9390  5.2012  35.9779  7.8023       2;0  0.0278       5           1
----------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

I hope we can speed this up further in the near future (for example, with Rust).

@shapiromatron
Collaborator

> I hope we can speed this up further in the near future (for example, with Rust).

Still a nice speedup, this will be valuable even w/o migrating to 🦀

@shapiromatron shapiromatron merged commit 2fe60b7 into MrTango:main May 22, 2025
5 checks passed
@J535D165 J535D165 deleted the improve-performance branch May 22, 2025 14:53
@J535D165
Collaborator Author

Thanks for merging. I will continue working on the PubMed parser now. Hopefully, I can implement it without API changes. If that works out, we are ready to prepare v1.

4 participants