@carloea2 carloea2 commented Dec 18, 2025

What changes were proposed in this PR?

  • DB / schema

    • Add dataset_upload_session to track multipart upload sessions, including:

      • (uid, did, file_path) as the primary key
      • upload_id (UNIQUE), physical_address
      • num_parts_requested to enforce expected part count
    • Add dataset_upload_session_part to track per-part completion for a multipart upload:

      • (upload_id, part_number) as the primary key
      • etag (TEXT NOT NULL DEFAULT '') to persist per-part ETags for finalize
      • CHECK (part_number > 0) for sanity
      • FOREIGN KEY (upload_id) REFERENCES dataset_upload_session(upload_id) ON DELETE CASCADE
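Taken together, the schema described above would look roughly like the following DDL sketch; column types other than those stated in the PR (e.g. `etag TEXT NOT NULL DEFAULT ''`) are assumptions, not the PR's actual DDL:

```sql
-- Sketch only: types for uid/did/upload_id/physical_address are assumed.
CREATE TABLE dataset_upload_session (
    uid                 INT  NOT NULL,
    did                 INT  NOT NULL,
    file_path           TEXT NOT NULL,
    upload_id           TEXT NOT NULL UNIQUE,
    physical_address    TEXT NOT NULL,
    num_parts_requested INT  NOT NULL,
    PRIMARY KEY (uid, did, file_path)
);

CREATE TABLE dataset_upload_session_part (
    upload_id   TEXT NOT NULL,
    part_number INT  NOT NULL,
    etag        TEXT NOT NULL DEFAULT '',
    PRIMARY KEY (upload_id, part_number),
    CHECK (part_number > 0),
    FOREIGN KEY (upload_id)
        REFERENCES dataset_upload_session (upload_id)
        ON DELETE CASCADE
);
```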
  • Backend (DatasetResource)

    • Multipart upload API (server-side streaming to S3, LakeFS manages multipart state):

      • POST /dataset/multipart-upload?type=init

        • Validates permissions and input.
        • Creates a LakeFS multipart upload session.
        • Inserts a DB session row including num_parts_requested.
        • Pre-creates placeholder rows in dataset_upload_session_part for part numbers 1..num_parts_requested with etag = '' (enables deterministic per-part locking and simple completeness checks).
        • Rejects init if a session already exists for (uid, did, file_path) (409 Conflict). The init race is handled via primary-key duplicate detection plus a best-effort LakeFS abort for the losing initializer.
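The init flow above can be sketched as follows, with an in-memory map standing in for the session tables and the LakeFS client; all identifiers are illustrative, not the PR's actual code:

```typescript
// Sketch of init: validate input, reject duplicate sessions with 409,
// and pre-create placeholder part rows 1..numParts with etag = ''.
interface UploadSession {
  uploadId: string;
  numPartsRequested: number;
  // part_number -> etag ('' means placeholder, not yet uploaded)
  parts: Map<number, string>;
}

const sessions = new Map<string, UploadSession>(); // key: `${uid}:${did}:${filePath}`

function initMultipartUpload(
  uid: number,
  did: number,
  filePath: string,
  numParts: number
): { status: number; uploadId?: string } {
  if (numParts <= 0 || filePath.length === 0) return { status: 400 };
  const key = `${uid}:${did}:${filePath}`;
  // A session already exists for (uid, did, filePath): 409 Conflict.
  if (sessions.has(key)) return { status: 409 };
  const uploadId = `upload-${key}`; // real code gets this from LakeFS
  const parts = new Map<number, string>();
  for (let p = 1; p <= numParts; p++) parts.set(p, "");
  sessions.set(key, { uploadId, numPartsRequested: numParts, parts });
  return { status: 200, uploadId };
}
```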
      • POST /dataset/multipart-upload/part?filePath=...&partNumber=...

        • Requires dataset write access and an existing upload session.
        • Requires Content-Length for streaming uploads.
        • Enforces partNumber <= num_parts_requested.
        • Per-part locking: locks the (upload_id, part_number) row using SELECT … FOR UPDATE NOWAIT to prevent concurrent uploads of the same part.
        • Uploads the part to S3 and persists the returned ETag into dataset_upload_session_part.etag (upsert/overwrite for retries).
        • Implements idempotency for retries by returning success if the ETag is already present for that part.
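The per-part bounds check and retry idempotency can be sketched as below; the real code additionally locks the (upload_id, part_number) row and streams the bytes to S3, both elided here:

```typescript
// Sketch of the part-upload checks: reject out-of-range part numbers,
// and treat a retry of an already-recorded part as a success.
interface PartSession {
  numPartsRequested: number;
  parts: Map<number, string>; // part_number -> etag, '' = not uploaded
}

function uploadPart(
  session: PartSession,
  partNumber: number,
  etagFromS3: string
): { status: number; etag?: string } {
  // Enforce 1 <= partNumber <= num_parts_requested (mirrors the CHECK).
  if (partNumber < 1 || partNumber > session.numPartsRequested) {
    return { status: 400 };
  }
  const existing = session.parts.get(partNumber) ?? "";
  // Idempotent retry: an ETag is already recorded for this part.
  if (existing !== "") return { status: 200, etag: existing };
  // Otherwise persist the ETag returned by the storage layer.
  session.parts.set(partNumber, etagFromS3);
  return { status: 200, etag: etagFromS3 };
}
```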
      • POST /dataset/multipart-upload?type=finish

        • Locks the session row using SELECT … FOR UPDATE NOWAIT to prevent concurrent finalize/abort.

        • Validates completeness using DB state:

          • Confirms the part table has num_parts_requested rows for the upload_id.
          • Confirms all parts have non-empty ETags (no missing parts).
          • Optionally surfaces a bounded list of missing part numbers (without relying on error-message asserts in tests).
        • Fetches (part_number, etag) ordered by part_number from DB and completes multipart upload via LakeFS.

        • Deletes the DB session row; part rows are cleaned up via ON DELETE CASCADE.

        • NOWAIT lock contention is handled (mapped to “already being finalized/aborted”, 409).
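The completeness validation can be sketched as a pure check over the per-part ETag state; the function name and the bound on reported part numbers are illustrative:

```typescript
// Sketch of the finish-time completeness check: every expected part must
// have a non-empty ETag; missing part numbers are surfaced as a bounded list.
function checkComplete(
  numPartsRequested: number,
  parts: Map<number, string>, // part_number -> etag
  maxReported = 10
): { complete: boolean; missing: number[] } {
  const missing: number[] = [];
  for (let p = 1; p <= numPartsRequested; p++) {
    const etag = parts.get(p) ?? "";
    // Report at most maxReported missing parts; any hit makes it incomplete.
    if (etag === "" && missing.length < maxReported) missing.push(p);
  }
  return { complete: missing.length === 0, missing };
}
```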

      • POST /dataset/multipart-upload?type=abort

        • Locks the session row using SELECT … FOR UPDATE NOWAIT.
        • Aborts the multipart upload via LakeFS and deletes the DB session row (parts cascade-delete).
        • NOWAIT lock contention is handled similarly to finish.
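The NOWAIT-to-409 mapping used by both finish and abort can be sketched like this, assuming a PostgreSQL-style backend where a failed SELECT … FOR UPDATE NOWAIT raises SQLSTATE 55P03 (lock_not_available); the error class and wrapper are illustrative:

```typescript
// Sketch: translate row-lock contention into a 409 Conflict response,
// meaning the session is already being finalized/aborted elsewhere.
class LockNotAvailableError extends Error {
  readonly sqlState = "55P03"; // PostgreSQL lock_not_available
}

function withSessionLock<T>(
  acquireAndRun: () => T
): { status: number; body?: T } {
  try {
    return { status: 200, body: acquireAndRun() };
  } catch (e) {
    if (e instanceof LockNotAvailableError) {
      // Another request holds the session row lock.
      return { status: 409 };
    }
    throw e;
  }
}
```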
    • Access control and dataset permissions remain enforced on all endpoints.

  • Frontend service (dataset.service.ts)

    • multipartUpload(...) updated to reflect the server flow and return values (ETag persistence is server-side; frontend does not need to track ETags).
  • Frontend component (dataset-detail.component.ts)

    • Uses the same init/part/finish flow.
    • Abort triggers backend type=abort to clean up the upload session.
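The client-side flow above can be sketched as follows; the endpoint paths come from the PR, while the `numParts` query parameter, the fetch-based wrapper, and the chunking helper are assumptions about the actual service code:

```typescript
// Part-count helper: every file, including an empty one, uses at least one part.
function partCount(totalSize: number, partSize: number): number {
  return Math.max(1, Math.ceil(totalSize / partSize));
}

// Sketch of the init -> upload parts -> finish flow, with a best-effort
// backend abort on any failure so the DB session is cleaned up.
async function multipartUpload(file: Blob, filePath: string, partSize: number): Promise<void> {
  const numParts = partCount(file.size, partSize);
  const fp = encodeURIComponent(filePath);
  const init = await fetch(
    `/dataset/multipart-upload?type=init&filePath=${fp}&numParts=${numParts}`,
    { method: "POST" }
  );
  if (!init.ok) throw new Error("init failed");
  try {
    for (let p = 1; p <= numParts; p++) {
      const chunk = file.slice((p - 1) * partSize, p * partSize);
      const res = await fetch(
        `/dataset/multipart-upload/part?filePath=${fp}&partNumber=${p}`,
        { method: "POST", body: chunk } // Content-Length is set from the chunk
      );
      if (!res.ok) throw new Error(`part ${p} failed`);
    }
    const fin = await fetch(`/dataset/multipart-upload?type=finish&filePath=${fp}`, { method: "POST" });
    if (!fin.ok) throw new Error("finish failed");
  } catch (e) {
    // Best-effort cleanup: backend aborts in LakeFS and deletes the session.
    await fetch(`/dataset/multipart-upload?type=abort&filePath=${fp}`, { method: "POST" });
    throw e;
  }
}
```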

Any related issues, documentation, discussions?

Closes #4110: task(dataset): Redirect multipart upload through File Service


How was this PR tested?

  • Unit tests added/updated (multipart upload spec):

    • Init validation (invalid numParts, invalid filePath, permission denied).
    • Upload part validation (missing/invalid Content-Length, partNumber bounds, minimum size enforcement for non-final parts).
    • Per-part lock behavior under contention (no concurrent streams for the same part; deterministic assertions).
    • Finish/abort locking behavior (NOWAIT contention returns 409).
    • Successful end-to-end path (init → upload parts → finish) with DB cleanup assertions.
    • Integrity checks: positive + negative SHA-256 tests by downloading the finalized object and verifying it matches (or does not match) the expected concatenated bytes.
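The positive and negative SHA-256 checks can be sketched with Node's crypto module; the download of the finalized object itself is elided, and the helper names are illustrative:

```typescript
// Sketch: hash the expected concatenated part bytes and compare against
// a hash of the downloaded object's bytes.
import { createHash } from "node:crypto";

function sha256Hex(buffers: Uint8Array[]): string {
  const h = createHash("sha256");
  for (const b of buffers) h.update(b);
  return h.digest("hex");
}

function verifyDownload(expectedParts: Uint8Array[], downloaded: Uint8Array): boolean {
  return sha256Hex(expectedParts) === sha256Hex([downloaded]);
}
```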
  • Manual testing via the dataset detail page (single and multiple uploads), verified:

    • Progress, speed, and ETA updates.
    • Abort behavior (UI state + DB session cleanup).
    • Successful completion path (all expected parts uploaded, LakeFS object present, dataset version creation works).

Was this PR authored or co-authored using generative AI tooling?

Yes; GPT was used for part of this PR.

@github-actions github-actions bot added ddl-change Changes to the TexeraDB DDL refactor Refactor the code frontend Changes related to the frontend GUI service common labels Dec 18, 2025
@carloea2 carloea2 marked this pull request as ready for review December 19, 2025 23:10
@aicam aicam left a comment

LGTM

@xuang7 xuang7 left a comment

Thanks for the PR! I have tested it, and the main functionality works for files of different sizes. There is one issue when uploading the same file twice: both uploads were canceled due to the previous approach on the frontend. I left a few comments.

chenlica commented Jan 3, 2026

@carloea2 Please resolve those finished conversations.

carloea2 commented Jan 3, 2026

> @carloea2 Please resolve those finished conversations.

There are no conversations that need to be resolved at the moment.

carloea2 commented Jan 3, 2026

@chenlica can we run the workflows to see if they pass?

@carloea2 carloea2 requested a review from aicam January 3, 2026 23:56
@chenlica chenlica left a comment

I left comments. Please check.

@chenlica chenlica left a comment

See my comments.

@chenlica chenlica left a comment

See new comments.

@chenlica chenlica merged commit 253409a into apache:main Jan 5, 2026
10 checks passed
