refactor(dataset): Redirect multipart upload through File Service #4136
aicam
left a comment
LGTM
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala
Outdated
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala
Outdated
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala
Outdated
xuang7
left a comment
Thanks for the PR! I have tested it, and the main functionality works across a range of file sizes. There is one issue when the same file is uploaded twice: both uploads were canceled because of the previous approach on the frontend. I left a few comments.
...flow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala
Outdated
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala
Outdated
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala
Outdated
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala
Outdated
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala
Outdated
.../app/dashboard/component/user/user-dataset/user-dataset-explorer/dataset-detail.component.ts
Outdated
…ttps://github.com/carloea2/texera into refactor/multipart_upload_through_dataset_resource
@carloea2 Please resolve those finished conversations.
…ttps://github.com/carloea2/texera into refactor/multipart_upload_through_dataset_resource
@chenlica can we run the workflows to see if they pass?
chenlica
left a comment
I left comments. Please check.
...flow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala
.../app/dashboard/component/user/user-dataset/user-dataset-explorer/dataset-detail.component.ts
Outdated
file-service/src/test/scala/org/apache/texera/service/MockLakeFS.scala
Outdated
chenlica
left a comment
See my comments.
...flow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala
chenlica
left a comment
See new comments.
What changes were proposed in this PR?
DB / schema
- Add `dataset_upload_session` to track multipart upload sessions, including:
  - `(uid, did, file_path)` as the primary key
  - `upload_id` (UNIQUE), `physical_address`
  - `num_parts_requested` to enforce expected part count
- Add `dataset_upload_session_part` to track per-part completion for a multipart upload:
  - `(upload_id, part_number)` as the primary key
  - `etag` (`TEXT NOT NULL DEFAULT ''`) to persist per-part ETags for finalize
  - `CHECK (part_number > 0)` for sanity
  - `FOREIGN KEY (upload_id) REFERENCES dataset_upload_session(upload_id) ON DELETE CASCADE`

Backend (`DatasetResource`)
Multipart upload API (server-side streaming to S3, LakeFS manages multipart state):
`POST /dataset/multipart-upload?type=init`
- […] `num_parts_requested`.
- […] `dataset_upload_session_part` for part numbers `1..num_parts_requested` with `etag = ''` (enables deterministic per-part locking and simple completeness checks).
- […] `(uid, did, file_path)` (409 Conflict). Race is handled via PK/duplicate handling + best-effort LakeFS abort for the losing initializer.

`POST /dataset/multipart-upload/part?filePath=...&partNumber=...`
- […] `Content-Length` for streaming uploads.
- […] `partNumber <= num_parts_requested`.
- […] `(upload_id, part_number)` row using `SELECT … FOR UPDATE NOWAIT` to prevent concurrent uploads of the same part.
- […] `dataset_upload_session_part.etag` (upsert/overwrite for retries).

`POST /dataset/multipart-upload?type=finish`
- Locks the session row using `SELECT … FOR UPDATE NOWAIT` to prevent concurrent finalize/abort.
- Validates completeness using DB state:
  - […] `num_parts_requested` rows for the `upload_id`.
- Fetches `(part_number, etag)` ordered by `part_number` from DB and completes multipart upload via LakeFS.
- Deletes the DB session row; part rows are cleaned up via `ON DELETE CASCADE`.
- NOWAIT lock contention is handled (mapped to “already being finalized/aborted”, 409).
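
As a rough illustration of the bookkeeping behind `init`, `part` and `finish`, here is a minimal in-memory TypeScript model. It is not the Scala code in `DatasetResource`: the class and method names (`UploadSessionModel`, `init`, `recordPart`, `finish`) are hypothetical, a plain array stands in for the `dataset_upload_session_part` rows, and it assumes that "complete" means every pre-created part holds a non-empty ETag. Row locking with `NOWAIT` and the LakeFS calls are left out.

```typescript
// Illustrative model only; all names here are hypothetical.
interface CompletedPart {
  partNumber: number; // 1-based, mirrors CHECK (part_number > 0)
  etag: string;
}

class UploadSessionModel {
  // session key (stands in for (uid, did, file_path)) -> ETag per part;
  // index i holds part i + 1, and '' means "not uploaded yet"
  private sessions: { [key: string]: string[] } = {};

  private lookup(key: string): string[] {
    if (!Object.prototype.hasOwnProperty.call(this.sessions, key)) {
      throw new Error("no upload session for this key");
    }
    return this.sessions[key];
  }

  init(key: string, numPartsRequested: number): void {
    if (numPartsRequested < 1 || Math.floor(numPartsRequested) !== numPartsRequested) {
      throw new Error("num_parts_requested must be a positive integer");
    }
    if (Object.prototype.hasOwnProperty.call(this.sessions, key)) {
      // The PR maps this situation to 409 Conflict.
      throw new Error("409: an upload session already exists for this key");
    }
    const etags: string[] = [];
    for (let n = 1; n <= numPartsRequested; n++) {
      etags.push(""); // pre-create one "row" per part with etag = ''
    }
    this.sessions[key] = etags;
  }

  recordPart(key: string, partNumber: number, etag: string): void {
    const etags = this.lookup(key);
    const isInteger = Math.floor(partNumber) === partNumber;
    if (!isInteger || partNumber < 1 || partNumber > etags.length) {
      throw new Error("partNumber must be between 1 and num_parts_requested");
    }
    etags[partNumber - 1] = etag; // a retry simply overwrites the old ETag
  }

  finish(key: string): CompletedPart[] {
    const etags = this.lookup(key);
    const parts = etags.map((etag, i) => ({ partNumber: i + 1, etag }));
    const missing = parts.filter((p) => p.etag === "").map((p) => p.partNumber);
    if (missing.length > 0) {
      throw new Error("upload incomplete, missing parts: " + missing.join(", "));
    }
    delete this.sessions[key]; // the part "rows" go away with the session
    return parts; // already ordered by partNumber
  }
}
```

Because the parts are pre-created at `init`, the completeness check in this model reduces to looking for any part whose ETag is still empty.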

`POST /dataset/multipart-upload?type=abort`
- […] `SELECT … FOR UPDATE NOWAIT`.
- […] `finish`.

Access control and dataset permissions remain enforced on all endpoints.
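
From a client's point of view, the `part` endpoint implies splitting a file into numbered parts and sending each one with its `filePath` and `partNumber`. The sketch below shows one way to plan those parts and build the request path. The helper names (`planParts`, `partUrl`) and the idea of a fixed part size are assumptions for illustration; only the endpoint path and the two query parameter names come from the description above, and any additional parameters the real endpoint expects are not shown.

```typescript
// Hypothetical helpers; not taken from dataset.service.ts.
interface PartPlan {
  partNumber: number; // 1-based
  start: number;      // inclusive byte offset
  end: number;        // exclusive byte offset
}

// Split a file of `fileSize` bytes into consecutive parts of at most
// `partSize` bytes each. The number of parts is what a client would
// report as num_parts_requested at init time.
function planParts(fileSize: number, partSize: number): PartPlan[] {
  if (fileSize <= 0 || Math.floor(fileSize) !== fileSize) {
    throw new Error("fileSize must be a positive integer");
  }
  if (partSize <= 0 || Math.floor(partSize) !== partSize) {
    throw new Error("partSize must be a positive integer");
  }
  const numParts = Math.ceil(fileSize / partSize);
  const parts: PartPlan[] = [];
  for (let i = 0; i < numParts; i++) {
    const start = i * partSize;
    parts.push({
      partNumber: i + 1,
      start,
      end: Math.min(start + partSize, fileSize), // last part may be shorter
    });
  }
  return parts;
}

// Build the relative path for uploading one part, URL-encoding the file path.
function partUrl(filePath: string, partNumber: number): string {
  return (
    "/dataset/multipart-upload/part" +
    "?filePath=" + encodeURIComponent(filePath) +
    "&partNumber=" + encodeURIComponent(String(partNumber))
  );
}
```

For example, `planParts(10, 4)` yields three parts covering bytes 0 to 4, 4 to 8, and 8 to 10, so the final part is shorter than the others.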

Frontend service (`dataset.service.ts`)
- `multipartUpload(...)` updated to reflect the server flow and return values (ETag persistence is server-side; frontend does not need to track ETags).

Frontend component (`dataset-detail.component.ts`)
- […] `type=abort` to clean up the upload session.

Any related issues, documentation, discussions?
Closes #4110
How was this PR tested?
Unit tests added/updated (multipart upload spec):
Manual testing via the dataset detail page (single and multiple uploads), verified:
Was this PR authored or co-authored using generative AI tooling?
GPT partial use.