
ENSY-70-insert-namespaces#3

Merged
ehanson8 merged 3 commits into main from ENSY-70-insert-namespaces on May 27, 2022


@ehanson8 ehanson8 commented May 24, 2022

Helpful background context

I looked at several approaches, and this one seems best for minimizing code and new dependencies. The other approaches I explored (readlines, ET.fromstring + .iter()) also involved loading the file into memory, and if we have to do that anyway, it seemed best to keep it simple with replace. I'm happy to be wrong about this if there are better ways, though.
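For readers without access to the diff, here is a minimal sketch of what a replace-based approach could look like. The function name `add_namespaces_simple` and the exact namespace attributes are assumptions for illustration; the thread does not show the string the PR actually inserts.

```python
# Hypothetical replacement tag: the MARC21 slim namespace is standard,
# but the exact attributes used in this PR are not shown in the thread.
NAMESPACED_COLLECTION = (
    '<collection xmlns="http://www.loc.gov/MARC21/slim" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://www.loc.gov/MARC21/slim '
    'http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">'
)


def add_namespaces_simple(xml_text: str) -> str:
    """Replace the first bare <collection> tag with a namespaced one.

    The whole document must already be in memory as a string, which is
    the trade-off discussed above.
    """
    return xml_text.replace("<collection>", NAMESPACED_COLLECTION, 1)
```

The `count=1` argument limits the replacement to the opening tag, so the rest of the document is left untouched.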

How can a reviewer manually see the effects of these changes?

Local testing functionality will be added as a part of ENSY-85

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Includes new or updated dependencies?

NO

Why these changes are being introduced:
* Alma MARCXML lacks namespaces in the collection element which are required for validation by POD

How this addresses that need:
* Insert namespaces with replace string method
* Add fixture and unit test for new functionality

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/ENSY-70
@ehanson8 ehanson8 requested a review from hakbailey May 24, 2022 18:17
@hakbailey

I just tried to run this on Dev1 against our actual file exports and we hit a memory error. Can you investigate explicitly setting the buffer size for reading from the BytesIO object in the add namespaces function? Instead of reading the whole object into memory, it could read chunks of it into a StringIO object that gets yielded from the function.

@ehanson8

Yes, I'll take a look at that after I finish these other changes.

* Add context manager to mocked_s3 fixture
* Add empty tar file fixture
* Update fixtures to match expected format of MARCXML files
* Add context manager to lambda_handler
* Change dash to underscore for lambda_handler output due to Step Function requirements
* Add exception for failed tar file extraction
* Update add_namespace_to_alma_marcxml for more efficient processing
* Add test for an empty tar file
* Add context managers to tests

@hakbailey hakbailey left a comment


See inline comment for an example of how to read and write in chunks to avoid the memory issue.

* Add invalid XML fixture
* Update add_namespaces_to_xml function with streaming chunks to avoid memory issues and change output to BytesIO
* Update unit test for new approach
* Add test for invalid xml
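
The inline review comment and the final diff are not part of this thread, so the following is only a sketch of how a chunked version writing to a BytesIO might work, not the merged implementation. The name `add_namespaces_chunked`, the shortened replacement tag, and the default chunk size are all assumptions.

```python
from io import BytesIO

BARE_TAG = b"<collection>"
# Hypothetical, shortened replacement tag for illustration only.
NAMESPACED_TAG = b'<collection xmlns="http://www.loc.gov/MARC21/slim">'


def add_namespaces_chunked(source, chunk_size=1024 * 1024):
    """Copy a binary stream into a new BytesIO in fixed-size chunks,
    replacing the first bare <collection> tag along the way."""
    output = BytesIO()
    replaced = False
    carry = b""  # unwritten tail that might hold the start of a split tag
    keep = len(BARE_TAG) - 1
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        if replaced:
            # The tag is already handled, so later chunks pass straight through.
            output.write(chunk)
            continue
        data = carry + chunk
        if BARE_TAG in data:
            output.write(data.replace(BARE_TAG, NAMESPACED_TAG, 1))
            replaced = True
            carry = b""
        else:
            # Hold back the last few bytes in case the tag straddles chunks.
            output.write(data[:-keep])
            carry = data[-keep:]
    output.write(carry)
    output.seek(0)
    return output
```

The output buffer still holds the full document, but this avoids the extra full-size copies that reading, decoding, and calling replace on the whole file would create. The carry-over bytes cover the case where the tag is split across two reads.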
@ehanson8 ehanson8 requested a review from hakbailey May 27, 2022 16:21

@hakbailey hakbailey left a comment


Looks good, ran fine on the full Alma export in Dev1!

@ehanson8 ehanson8 merged commit 63a14c1 into main May 27, 2022
@ehanson8 ehanson8 deleted the ENSY-70-insert-namespaces branch May 27, 2022 17:59