
Conversation

@bobbai00 (Contributor) commented Mar 2, 2025

This PR introduces the FileService as another microservice parallel to WorkflowCompilingService, ComputingUnitMaster/Worker, and TexeraWebApplication.

Purpose of the FileService

  • We want to improve the performance of our current Git-based Dataset implementation.
  • We decided to go with LakeFS + S3: LakeFS handles the version-control metadata, and S3 handles the data transfer. However, LakeFS does not provide an access-control layer.
  • Therefore, we built the FileService, which provides
    • all the APIs related to versioned files in datasets
    • access control

Architecture before and after adding FileService

Before:
Screenshot 2025-03-02 at 9 01 20 AM

After:
Screenshot 2025-03-02 at 8 53 32 AM

Key Changes

  • A new service, FileService, is introduced. All dataset-related endpoints are hosted on FileService.
  • Several configuration items related to LakeFS and S3 are introduced in storage-config.yaml.
  • Frontend UI updates to incorporate the new changes.
  • ComputingUnitMaster and ComputingUnitWorker call FileService to read files, and their access is verified during those calls. In the dynamic computing architecture (to be introduced in Add computing unit manager service #3298), they send requests along with the current user's token. In the single-machine architecture, they bypass the network requests by making direct local function calls.
  • Python UDFs can now read a dataset's files directly, as in the following example:
file = DatasetFileDocument("The URL of the file")
file_content = file.read_file()  # returns the file content as an in-memory bytes object

You may refer to core/amber/src/main/python/pytexera/storage/dataset_file_document.py for implementation details. This feature is only available in the dynamic computing architecture.
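For a fuller picture, here is a minimal sketch of how this might look inside a Python UDF operator. The operator skeleton follows the usual pytexera UDF template, the import path mirrors the module location mentioned above, the dataset file URL and the added file_size column are hypothetical, and the sketch assumes read_file() returns an in-memory file-like object:

```python
from pytexera import *
from pytexera.storage.dataset_file_document import DatasetFileDocument


class ProcessTupleOperator(UDFOperatorV2):
    @overrides
    def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
        # Hypothetical dataset file URL; point it at a file in a dataset
        # the current user is allowed to access.
        doc = DatasetFileDocument("/some-owner/some-dataset/v1/data/sample.csv")
        data = doc.read_file().read()  # raw bytes of the file
        tuple_["file_size"] = len(data)  # attach the file size as a new column
        yield tuple_
```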

How to migrate previous datasets to the new datasets managed by LakeFS

As we did quite some refactoring, the two dataset implementations are NOT compatible with each other. To migrate previous datasets to the latest implementation, you will need to re-upload the data via the new UI.

How to deploy the new architecture

Step 1. Deploy LakeFS & MinIO

Use Docker (Highly recommended for local development)

  • Install Docker Desktop, which contains both the Docker engine and Docker Compose. Do NOT just use brew install docker, because it only installs the Docker engine.
  • Go to the directory core/file-service/src/main/resources
  • Execute docker compose up -d in that directory
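
Once the containers are up, a quick sanity check (assuming the default ports of the bundled setup, LakeFS on 8000 and MinIO on 9000) is:

```
docker compose ps                                  # both the LakeFS and MinIO containers should be "Up"
curl -i http://127.0.0.1:8000/api/v1/healthcheck   # LakeFS should reply with a 2xx status
```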

Use Binary (Recommended for production deployment)

Refer to https://docs.lakefs.io/howto/deploy/

Step 2. Configure the storage-config.yaml

Use Docker

If you deployed LakeFS using docker compose in Step 1, you don't need to change this file; you can proceed directly with the given default values.
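
For reference, the defaults that match the docker-compose deployment look like this (sample values; verify them against the shipped storage-config.yaml):

```yaml
  lakefs:
    endpoint: "http://127.0.0.1:8000/api/v1"
    auth:
      api-secret: "random_string_for_lakefs"
      username: "AKIAIOSFOLKFSSAMPLES"
      password: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    block-storage:
      type: "s3"
      bucket-name: "texera-dataset"

  s3:
    endpoint: "http://localhost:9000"
    auth:
      username: "texera_minio"
      password: "password"
```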

Use Binary

If you deployed the LakeFS binary, you need to configure the following section in storage-config.yaml:

  lakefs:
    endpoint: ""
    auth:
      api-secret: ""
      username: ""
      password: ""
    block-storage:
      type: ""
      bucket-name: ""

  s3:
    endpoint: ""
    auth:
      username: ""
      password: ""

Step 3. Launch services

Launch FileService, in addition to TexeraWebApplication, WorkflowCompilingService, and ComputingUnitMaster.

Future PRs after this one

  • Remove the dataset-related endpoints completely from the amber package.
  • Incorporate the deployment of LakeFS + S3 into the Helm chart of the K8s-based deployment.
  • Some optimizations:
    • for small files, upload them directly instead of using multipart upload (see the sketch after this list)
    • when exporting results, use multipart upload, as the result size can be quite large
    • support re-transmission of partially uploaded files
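
To illustrate the small-file vs. multipart trade-off mentioned in the list above, here is a generic S3-client sketch using boto3 against the MinIO endpoint and credentials from the sample configuration; it is not the FileService implementation, and the file name, bucket, and object key are hypothetical:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Local MinIO endpoint and credentials taken from the sample configuration.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="texera_minio",
    aws_secret_access_key="password",
)

# Files below the threshold are uploaded with a single PUT; larger files
# are automatically split into parts and sent as a multipart upload.
config = TransferConfig(
    multipart_threshold=16 * 1024 * 1024,  # 16 MiB
    multipart_chunksize=16 * 1024 * 1024,
)

s3.upload_file("exported_result.bin", "texera-dataset", "exports/exported_result.bin", Config=config)
```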

@bobbai00 bobbai00 changed the title Add FileService as a standalone microservice, and LakeFS+S3 as dataset storage Add FileService as a standalone microservice, LakeFS+S3 as dataset storage Mar 2, 2025
@bobbai00 bobbai00 marked this pull request as ready for review March 2, 2025 22:16
@bobbai00 bobbai00 self-assigned this Mar 3, 2025
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 5db607e to 0e7a9d8 Compare March 3, 2025 18:39
@aglinxinyuan (Contributor) left a comment


Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 6d101a1 to da96a2a Compare March 4, 2025 21:26
@bobbai00 (Contributor, Author) commented Mar 4, 2025

> Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

OK. I have added a flag in environment.default.ts

@aglinxinyuan (Contributor) left a comment


LGTM!
Tested on both Windows and Mac. The setup is very smooth.
Please add more details on Step 3, for example, how developers can migrate to this from the current master.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch 3 times, most recently from 2e38fad to 6accd5b Compare March 10, 2025 05:20
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from db36804 to e294ac7 Compare March 11, 2025 04:24
@bobbai00 bobbai00 merged commit dceed87 into master Mar 11, 2025
8 checks passed
@bobbai00 bobbai00 deleted the jiadong-add-file-service branch March 11, 2025 05:05
Ma77Ball pushed a commit that referenced this pull request Apr 2, 2025
Development

Successfully merging this pull request may close these issues.

Large file like pre-trained model cannot be properly loaded due to mongo size limit
