Add FileService as a standalone microservice, LakeFS+S3 as dataset storage #3296
aglinxinyuan left a comment:
Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.
OK. I have added a flag in …
aglinxinyuan left a comment:
LGTM!
Tested on both Windows and Mac. The setup is very smooth.
Please add more details to Step 3. For example, how can developers migrate to this from the current master?
This PR introduces the FileService as another microservice parallel to WorkflowCompilingService, ComputingUnitMaster/Worker, and TexeraWebApplication.
## Purpose of the FileService

- We want to improve the performance of our current Git-based dataset implementation.
- We decided to go with `LakeFS` + `S3`: LakeFS for the version-control metadata and S3 for the data transfer. But `LakeFS` doesn't have an access-control layer.
- Therefore, we built the `FileService`, providing
  - all the APIs related to versioned files in datasets
  - access control

## Architecture before and after adding FileService
Before:

After:

## Key Changes

- A new service, `FileService`, is introduced. All the dataset-related endpoints are hosted on `FileService`.
- Several configuration items related to LakeFS and S3 are introduced in `storage-config.yaml`.
- Frontend UI updates to incorporate the new changes.
- `ComputingUnitMaster` and `ComputingUnitWorker` will call `FileService` to read files, during which their access will be verified. In the dynamic computing architecture (which will be introduced in #3298, "Add computing unit manager service"), they will send requests along with the current user's token; a request sketch follows this section. In the single-machine architecture, they bypass the network requests by making direct local function calls.
- Python UDFs can now directly read a dataset's file with the following example code:

```python
file = DatasetFileDocument("The URL of the file")
bytes = file.read_file()  # returns an io.BytesIO object
```

You may refer to `core/amber/src/main/python/pytexera/storage/dataset_file_document.py` for implementation details. **This feature is only available in the dynamic computing architecture.**
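To make the token-forwarding path concrete, here is a minimal sketch of such a request. All specifics here are hypothetical — the port, the endpoint path, and the `USER_JWT` variable are made up for illustration, and the real client code lives in the `FileService` and `amber` packages:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object DatasetFileFetchSketch {
  def main(args: Array[String]): Unit = {
    // hypothetical endpoint and token source, for illustration only
    val fileServiceBase = "http://localhost:9092/api/dataset"
    val userJwt = sys.env.getOrElse("USER_JWT", "<token>")

    val request = HttpRequest
      .newBuilder()
      .uri(URI.create(s"$fileServiceBase/file?path=ownerEmail/datasetName/versionHash/a.csv"))
      // the computing unit forwards the current user's token so that
      // FileService can verify dataset access before serving the bytes
      .header("Authorization", s"Bearer $userJwt")
      .GET()
      .build()

    val response = HttpClient
      .newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofByteArray())

    println(s"status=${response.statusCode()}, received ${response.body().length} bytes")
  }
}
```

The essential part is only the `Authorization` header: because the user's identity travels with every read, `FileService` can enforce dataset access control per request.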
## How to migrate the previous datasets to the new datasets managed by LakeFS

As we did quite some refactoring, the two dataset implementations are NOT compatible with each other. To migrate the previous datasets to the latest implementation, you will need to re-upload the data via the new UI.
## How to deploy the new architecture

### Step 1. Deploy LakeFS & MinIO

#### Use Docker (highly recommended for local development)

- Avoid `brew install docker`, because it will only install the Docker engine.
- Go to the directory `core/file-service/src/main/resources`.
- Execute `docker compose up -d` in that directory.

#### Use Binary (recommended for production deployment)
Refer to https://docs.lakefs.io/howto/deploy/
### Step 2. Configure the `storage-config.yaml`

#### Use Docker

If you deployed LakeFS using docker compose in Step 1, you don't need to change this file and can proceed directly with the given default values:

```yaml
lakefs:
  endpoint: "http://127.0.0.1:8000/api/v1"
  auth:
    api-secret: "random_string_for_lakefs"
    username: "AKIAIOSFOLKFSSAMPLES"
    password: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  block-storage:
    type: "s3"
    bucket-name: "texera-dataset"
s3:
  endpoint: "http://localhost:9000"
  auth:
    username: "texera_minio"
    password: "password"
```
#### Use Binary

If you deployed the LakeFS binary, you need to configure the section below in the `storage-config.yaml`:

```yaml
lakefs:
  endpoint: ""
  auth:
    api-secret: ""
    username: ""
    password: ""
  block-storage:
    type: ""
    bucket-name: ""
s3:
  endpoint: ""
  auth:
    username: ""
    password: ""
```

### Step 3. Launch services

Launch `FileService`, in addition to `TexeraWebApplication`, `WorkflowCompilingService`, and `ComputingUnitMaster`.
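To verify that the Step 2 configuration is usable before relying on it, you can run a small connectivity check against the LakeFS instance. A minimal sketch, assuming the docker-compose defaults above and LakeFS's list-repositories endpoint (`GET /api/v1/repositories` with HTTP basic auth):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Base64

object LakeFSPing {
  def main(args: Array[String]): Unit = {
    // values taken from the docker-compose defaults in storage-config.yaml
    val endpoint = "http://127.0.0.1:8000/api/v1"
    val username = "AKIAIOSFOLKFSSAMPLES"
    val password = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

    val basic = Base64.getEncoder.encodeToString(s"$username:$password".getBytes("UTF-8"))
    val request = HttpRequest
      .newBuilder()
      .uri(URI.create(s"$endpoint/repositories"))
      .header("Authorization", s"Basic $basic")
      .GET()
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    // 200 with a JSON body means the endpoint and credentials are good
    println(s"status=${response.statusCode()}")
    println(response.body())
  }
}
```

A `200` with a JSON body means the endpoint and credentials in `storage-config.yaml` are usable; a `401` points at the username/password pair.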
## Future PRs after this one

- Remove the dataset-related endpoints completely from the `amber` package.
- Incorporate the deployment of LakeFS+S3 in the Helm chart of the K8s-based deployment.
- Some optimizations:
  - for small files, upload directly instead of using multipart upload;
  - when doing result exports, use multipart upload, as the result size can be quite big (a rough sketch follows this list);
  - support re-transmission for partially-uploaded files.
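For the result-export optimization above, here is a rough sketch of a multipart upload against the bundled MinIO, using the AWS SDK v2 for Java. The bucket and credentials come from the Step 2 defaults; the export key and the `resultChunks` helper are made up for illustration, and the actual logic would presumably live alongside `S3StorageClient`:

```scala
import java.net.URI
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, StaticCredentialsProvider}
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model._

object MultipartExportSketch {
  def main(args: Array[String]): Unit = {
    // MinIO defaults from the docker-compose setup; the region is arbitrary for MinIO
    val s3 = S3Client.builder()
      .endpointOverride(URI.create("http://localhost:9000"))
      .credentialsProvider(StaticCredentialsProvider.create(
        AwsBasicCredentials.create("texera_minio", "password")))
      .region(Region.US_EAST_1)
      .forcePathStyle(true)
      .build()

    val bucket = "texera-dataset"
    val key = "exports/result.csv" // hypothetical export key

    // 1. start the multipart upload and remember its id
    val uploadId = s3.createMultipartUpload(
      CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId()

    // 2. upload each chunk as a numbered part (every part except the last must be >= 5 MiB)
    val completedParts = resultChunks().zipWithIndex.map { case (chunk, idx) =>
      val partNumber = idx + 1
      val etag = s3.uploadPart(
        UploadPartRequest.builder()
          .bucket(bucket).key(key).uploadId(uploadId).partNumber(partNumber).build(),
        RequestBody.fromBytes(chunk)).eTag()
      CompletedPart.builder().partNumber(partNumber).eTag(etag).build()
    }.toList

    // 3. commit all parts as one object
    s3.completeMultipartUpload(
      CompleteMultipartUploadRequest.builder()
        .bucket(bucket).key(key).uploadId(uploadId)
        .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts.asJava).build())
        .build())
  }

  // placeholder for however the export actually produces its data
  def resultChunks(): Iterator[Array[Byte]] =
    Iterator.fill(2)(Array.fill(5 * 1024 * 1024)(0.toByte))
}
```

The upside of this path is that no single result buffer has to fit in memory, and a failed part can be retried individually, which also lines up with the re-transmission item above.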