[Feature] Optimize ObjectRefresh for lower memory usage and better performance #4971

@smdsbz


Search before asking

  • I searched in the issues and found nothing similar.

Motivation

The current implementation of ObjectRefresh first collects a list of all files under object-location, then writes them out to the table. This requires the driver node to have enough memory to hold the entire object listing.

Another problem is that the current implementation generates a new commit for each file in the listing. This can produce an enormous number of snapshots and makes refresh slow.

Solution

  1. Use FileIO#listFilesIterative to load the file listing into memory in batches.
    The actual memory savings depend on the FileIO implementation, but in the worst case it falls back to the behavior we have today.
  2. Commit writes periodically, once per batch of a given size, instead of once per file.
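
A minimal sketch of how the two steps could fit together, not the actual ObjectRefresh code. It assumes the listing is exposed as a plain `java.util.Iterator` (standing in for whatever FileIO#listFilesIterative returns) and that a commit can be modeled as a callback receiving one batch. The class name `BatchedRefreshSketch`, the method `refreshInBatches`, and the `commitBatch` parameter are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from the Paimon code base.
public class BatchedRefreshSketch {

    /**
     * Drains a lazily produced listing in fixed-size batches and hands each
     * batch to commitBatch, so at most batchSize entries are buffered at once.
     * Returns the number of commits issued.
     */
    public static <T> int refreshInBatches(
            Iterator<T> listing, int batchSize, Consumer<List<T>> commitBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (listing.hasNext()) {
            batch.add(listing.next());
            if (batch.size() == batchSize) {
                // One commit per full batch instead of one per file.
                commitBatch.accept(batch);
                commits++;
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            // Flush the final, possibly smaller, batch.
            commitBatch.accept(batch);
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        // Stand-in for an iterative file listing.
        Iterator<String> listing = List.of("a", "b", "c", "d", "e").iterator();
        List<Integer> sizes = new ArrayList<>();
        int commits = refreshInBatches(listing, 2, b -> sizes.add(b.size()));
        // prints "3 commits, batch sizes [2, 2, 1]"
        System.out.println(commits + " commits, batch sizes " + sizes);
    }
}
```

With this shape, driver memory is bounded by the batch size plus whatever the underlying iterator buffers internally, and the number of snapshots drops from one per file to roughly the file count divided by the batch size.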

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
