
Conversation

@bobbai00 (Contributor) commented Mar 2, 2025

This PR introduces the FileService as another microservice parallel to WorkflowCompilingService, ComputingUnitMaster/Worker, and TexeraWebApplication.

Purpose of the FileService

  • We want to improve the performance of our current Git-based Dataset implementation.
  • We decided to go with LakeFS + S3: LakeFS handles the version-control metadata, and S3 handles the data transfer. However, LakeFS does not provide an access-control layer.
  • Therefore, we built the FileService, which provides
    • all the APIs related to versioned files in datasets
    • access control

Architecture before and after adding FileService

Before:
Screenshot 2025-03-02 at 9 01 20 AM

After:
Screenshot 2025-03-02 at 8 53 32 AM

Key Changes

  • A new service, FileService, is introduced. All dataset-related endpoints are hosted on FileService.
  • Several configuration items related to LakeFS and S3 are introduced in storage-config.yaml.
  • Frontend UI updates to incorporate the new changes.
  • ComputingUnitMaster and ComputingUnitWorker call FileService to read files, and their access is verified during those calls. In the dynamic computing architecture (to be introduced in Add computing unit manager service #3298), they send requests along with the current user's token. In the single-machine architecture, they bypass the network requests by making direct local function calls.
  • Python UDFs can now read a dataset's files directly, as in the following example:
file = DatasetFileDocument("The URL of the file")
file_content = file.read_file()  # returns the file content as an in-memory bytes object

You may refer to core/amber/src/main/python/pytexera/storage/dataset_file_document.py for implementation details. This feature is only available in the dynamic computing architecture.
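For a fuller picture, here is a minimal sketch of how this might look inside a Python UDF operator. The operator skeleton follows the usual pytexera UDF template, the import path mirrors the module location mentioned above, the dataset file URL and the added file_size column are hypothetical, and the sketch assumes read_file() returns an in-memory file-like object:

```python
from pytexera import *
from pytexera.storage.dataset_file_document import DatasetFileDocument


class ProcessTupleOperator(UDFOperatorV2):
    @overrides
    def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
        # Hypothetical dataset file URL; point it at a file in a dataset
        # the current user is allowed to access.
        doc = DatasetFileDocument("/some-owner/some-dataset/v1/data/sample.csv")
        data = doc.read_file().read()  # raw bytes of the file
        tuple_["file_size"] = len(data)  # attach the file size as a new column
        yield tuple_
```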

How to migrate previous datasets to the new datasets managed by LakeFS

As we did quite some refactoring, the two dataset implementations are NOT compatible with each other. To migrate previous datasets to the latest implementation, you will need to re-upload the data via the new UI.

How to deploy the new architecture

Step 1. Deploy LakeFS & MinIO

Use Docker (Highly recommended for local development)

  • Install Docker Desktop, which contains both the Docker engine and Docker Compose. Do NOT just use brew install docker, because it only installs the Docker engine.
  • Go to the directory core/file-service/src/main/resources
  • Execute docker compose up -d in that directory
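
Once the containers are up, a quick sanity check (assuming the default ports of the bundled setup, LakeFS on 8000 and MinIO on 9000) is:

```
docker compose ps                                  # both the LakeFS and MinIO containers should be "Up"
curl -i http://127.0.0.1:8000/api/v1/healthcheck   # LakeFS should reply with a 2xx status
```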

Use Binary (Recommended for production deployment)

Refer to https://docs.lakefs.io/howto/deploy/

Step 2. Configure the storage-config.yaml

Use Docker

If you deployed LakeFS using docker compose in Step 1, you don't need to change this file; you can proceed directly with the given default values.
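
For reference, the defaults that match the docker-compose deployment look like this (sample values; verify them against the shipped storage-config.yaml):

```yaml
  lakefs:
    endpoint: "http://127.0.0.1:8000/api/v1"
    auth:
      api-secret: "random_string_for_lakefs"
      username: "AKIAIOSFOLKFSSAMPLES"
      password: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    block-storage:
      type: "s3"
      bucket-name: "texera-dataset"

  s3:
    endpoint: "http://localhost:9000"
    auth:
      username: "texera_minio"
      password: "password"
```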

Use Binary

If you deployed the LakeFS binary, you need to configure the following section in storage-config.yaml:

  lakefs:
    endpoint: ""
    auth:
      api-secret: ""
      username: ""
      password: ""
    block-storage:
      type: ""
      bucket-name: ""

  s3:
    endpoint: ""
    auth:
      username: ""
      password: ""

Step 3. Launch services

Launch FileService, in addition to TexeraWebApplication, WorkflowCompilingService, and ComputingUnitMaster.

Future PRs after this one

  • Remove the dataset-related endpoints completely from the amber package.
  • Incorporate the deployment of LakeFS + S3 into the Helm chart of the K8s-based deployment.
  • Some optimizations:
    • for small files, upload them directly instead of using multipart upload (see the sketch after this list)
    • when exporting results, use multipart upload, as the result size can be quite large
    • support re-transmission of partially uploaded files
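
To illustrate the small-file vs. multipart trade-off mentioned in the list above, here is a generic S3-client sketch using boto3 against the MinIO endpoint and credentials from the sample configuration; it is not the FileService implementation, and the file name, bucket, and object key are hypothetical:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Local MinIO endpoint and credentials taken from the sample configuration.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="texera_minio",
    aws_secret_access_key="password",
)

# Files below the threshold are uploaded with a single PUT; larger files
# are automatically split into parts and sent as a multipart upload.
config = TransferConfig(
    multipart_threshold=16 * 1024 * 1024,  # 16 MiB
    multipart_chunksize=16 * 1024 * 1024,
)

s3.upload_file("exported_result.bin", "texera-dataset", "exports/exported_result.bin", Config=config)
```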

@bobbai00 bobbai00 changed the title Add FileService as a standalone microservice, and LakeFS+S3 as dataset storage Add FileService as a standalone microservice, LakeFS+S3 as dataset storage Mar 2, 2025
@bobbai00 bobbai00 marked this pull request as ready for review March 2, 2025 22:16
@bobbai00 bobbai00 self-assigned this Mar 3, 2025
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 5db607e to 0e7a9d8 Compare March 3, 2025 18:39
@aglinxinyuan (Contributor) left a comment


Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 6d101a1 to da96a2a Compare March 4, 2025 21:26
@bobbai00 (Contributor, Author) commented Mar 4, 2025

> Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

OK. I have added a flag in environment.default.ts

@aglinxinyuan (Contributor) left a comment


LGTM!
Tested on both Windows and Mac. The setup is very smooth.
Please add more details on Step 3, for example, how developers can migrate to this from the current master.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch 3 times, most recently from 2e38fad to 6accd5b Compare March 10, 2025 05:20
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from db36804 to e294ac7 Compare March 11, 2025 04:24
@bobbai00 bobbai00 merged commit dceed87 into master Mar 11, 2025
8 checks passed
@bobbai00 bobbai00 deleted the jiadong-add-file-service branch March 11, 2025 05:05
Ma77Ball pushed a commit that referenced this pull request Apr 2, 2025
Development

Successfully merging this pull request may close these issues.

Large file like pre-trained model cannot be properly loaded due to mongo size limit
