Search before asking
Motivation
The current implementation of ObjectRefresh first collects a list of all files under object-location, then writes them out to the table. This requires the driver node to have enough memory to hold the entire object listing.
Another problem is that the current implementation generates a new commit for each file in the listing. This can result in an enormous number of snapshots and poor refresh performance.
Solution
- Use FileIO#listFilesIterative to load the file listing into memory in batches. The actual memory savings depend on the FileIO implementation, but in the worst case it falls back to what we already have now.
- Issue a commit periodically, once per batch of writes of a certain size, instead of once per file.
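
A rough sketch of how the two points above could fit together. This is not the actual ObjectRefresh code: a plain `java.util.Iterator<String>` stands in for the iterative file listing, and `BatchedRefreshSketch`, `refreshInBatches`, `batchSize`, and `commitBatch` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: names and signatures are assumptions, not the project's API.
public class BatchedRefreshSketch {

    /**
     * Drains the listing lazily and hands it to commitBatch in chunks of at
     * most batchSize entries, so only one batch is buffered at a time.
     * Returns the number of commits issued.
     */
    static int refreshInBatches(
            Iterator<String> listing,            // stand-in for the iterative file listing
            int batchSize,                       // assumed to be configurable
            Consumer<List<String>> commitBatch)  // stand-in for "write batch + commit"
    {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>(batchSize);
        int commits = 0;
        while (listing.hasNext()) {
            batch.add(listing.next());
            if (batch.size() == batchSize) {
                // Pass a copy so the buffer can be reused for the next batch.
                commitBatch.accept(new ArrayList<>(batch));
                batch.clear();
                commits++;
            }
        }
        // Flush the final, possibly smaller, batch.
        if (!batch.isEmpty()) {
            commitBatch.accept(new ArrayList<>(batch));
            commits++;
        }
        return commits;
    }
}
```

With this shape, a listing of N files produces about N / batchSize commits rather than N, and driver memory is bounded by the batch size plus whatever the underlying listing buffers internally.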
Anything else?
No response
Are you willing to submit a PR?