Conversation
Why these changes are being introduced: * Alma MARCXML lacks namespaces in the collection element which are required for validation by POD How this addresses that need: * Insert namespaces with replace string method * Add fixture and unit test for new functionality Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/ENSY-70
hakbailey
reviewed
May 25, 2022
Contributor
|
I just tried to run this on Dev1 against our actual file exports and we hit a memory error. Can you investigate explicitly setting the buffer size for reading from the BytesIO object in the add namespaces function. And instead of reading the whole object into memory, read chunks of it into a StringIO object that gets yielded from the function. |
Contributor
Author
|
Yes, I'll take a look at that after I finish these other changes |
* Add context manager to mocked_s3 fixture * Add empty tar file fixture * Update fixtures to match expected format of MARCXML files * Add context manager to lambda_handler * Change dash to underscore for lambda_handler output due to Step Function requirements * Add exception for failed tar file extraction * Update add_namespace_to_alma_marcxml for more efficient processing * Add test for an empty tar file * Add context managers to tests
hakbailey
suggested changes
May 26, 2022
Contributor
hakbailey
left a comment
There was a problem hiding this comment.
See inline comment for an example of how to read and write in chunks to avoid the memory issue.
* Add invalid XML fixture * Update add_namespaces_to_xml function with streaming chunks to avoid memory issues and change output to BytesIO * Update unit test for new approach * Add test for invalid xml
hakbailey
approved these changes
May 27, 2022
Contributor
hakbailey
left a comment
There was a problem hiding this comment.
Looks good, ran fine on the full Alma export in Dev1!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Helpful background context
I looked at several different approaches and this seems to be the best in terms of minimizing code and new dependencies. Other approaches I explored (
readlines,ET.fromstring+.iter()) did involve loading the file into memory and if we have to do that, it seemed best to just keep it simple withreplace. I'm happy to be wrong about this if there are better ways though.How can a reviewer manually see the effects of these changes?
Local testing functionality will be added as a part of ENSY-85
What are the relevant tickets?
Developer
Code Reviewer
(not just this pull request message)
Includes new or updated dependencies?
NO