@shengquan-ni (Contributor)

WIP

kunwp1 and others added 30 commits December 3, 2024 18:28
This PR enhances schema handling by retrieving schema information directly from the backend, rather than having the frontend infer it. Accurate schema retrieval is required to handle binary data correctly.

Previously, binary data was incorrectly treated as a string on the
frontend, leading to issues differentiating between string and binary
types. For instance, strings containing '0' and '1' were misinterpreted
as binary data, resulting in byte representations being displayed in the
result panel. This update ensures that the frontend receives the correct
schema information, improving data type accuracy and presentation.

By aligning the schema across the backend and frontend, this change
resolves existing issues and provides more reliable handling of binary
data.
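The type-inference pitfall described above can be illustrated with a minimal Python sketch (the function names `guess_type` and `display_type` are hypothetical, not Texera code): a value-based guesser cannot distinguish a string made only of '0'/'1' characters from binary data, whereas an explicit backend-provided schema removes the ambiguity.

```python
# Hypothetical illustration of the inference pitfall: a value-based
# guesser misclassifies '0'/'1' strings as binary data.
def guess_type(value: str) -> str:
    # Naive frontend-style inference: all-'0'/'1' strings look "binary".
    return "binary" if set(value) <= {"0", "1"} else "string"

def display_type(value: str, backend_type: str) -> str:
    # With the backend schema available, trust it instead of guessing.
    return backend_type

assert guess_type("0101") == "binary"          # misclassified string
assert display_type("0101", "string") == "string"
```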

<img width="1530" alt="Screenshot 2024-10-15 at 1 41 20 AM"
src="https://github.com/user-attachments/assets/d8a13536-34ea-4c2c-89a2-969b9ee1f8fb">

Co-authored-by: Kunwoo Park <kunwoopark@Kunwoos-MacBook-Pro.local>
Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
…2942)

This PR adds a new sub-project in sbt called `workflow-core`, under
micro-services.
It contains a duplicated codebase of workflow core dependencies,
including:
- `Tuple`, `Attribute`, `Schema`;
- `OperatorExecutor`;
- `PhysicalOp`, `PhysicalPlan`;
- Identities;
- and other utility functions.

### Migration plan:
After exporting workflow-core as a local dependency, we can build
workflow-compiling-service and workflow-execution-service based on it.
This PR fixes:
1. Adds a timeout (currently set to 1 second) to prevent the language server from trying to connect forever.
2. Handles the diff editor in the case where no code is found. This can happen when an older version contains an operator A, but the latest version being compared no longer has A's code (A has been removed).
**Purpose:**
Currently, opening the dashboard triggers some browser errors. Although
they do not affect normal functionality, they could potentially lead to
other unknown risks. This PR fixes these issues in the dashboard.

**Changes:**
1. Use `ChangeDetectorRef` to trigger view updates, which ensures that the UI stays synchronized with asynchronous data changes.
2. Remove `nz-tooltip` due to conflicts with `nz-icon` that caused the `nz0100` error. Additionally, the usage of `nz-icon` overlapped with the `title` attribute.

**Demo:**
Before:

![image](https://github.com/user-attachments/assets/154f08a5-77ff-4cf1-b58d-eceec798b922)

After:

![image](https://github.com/user-attachments/assets/5d2829d2-64ff-4005-b22a-8197058797fc)
…ion (#2954)

This PR fixes CI issues introduced in #2912. The symptom: the frontend test CI job failed without failing the entire CI. There are a few underlying reasons:

- **Alias dependency resolution issue.** The newly introduced `Monaco-editor-wrapper` has a peer dependency with the alias `@codingame/monaco-vscode-api@8.0.4`. Yarn@1 installs it as `vscode@npm:@codingame/monaco-vscode-api@8.0.4` in the lock file. However, when parsing the lock file, Nx reports that it cannot find a package named `npm:vscode@npm:@codingame/monaco-vscode-api` (note the `npm:` at the beginning).

> Solution (2 steps):
> 1. Added a resolution in package.json, `monaco-editor: npm:vscode@npm:@codingame/monaco-vscode-api`, so that the generated lock file has the correct handle.
> 2. To parse this added resolution field, Yarn@1 had to be upgraded to Yarn@4.5.1.

- **Multiple webpack resolution issue.** Our project uses a custom webpack configuration. Monaco also introduces its own webpack dependency. With multiple webpacks present, the version in use resolved to `@5.92.3`, which caused errors when bundling the compiled frontend files.

> Solution: pin the webpack version to `@5.89.0` via resolution.

- **Nx not failing upon a failed execution.** In #2912, we also updated Nx from @18.1.3 to @18.2.0. The frontend tests failed to execute, but Nx reported success (it did not return an error).

> Solution: this appears to be a bug on the Nx side. Release @18.2.0 introduced a bug where errors reported from the project graph build during the daemon process were not rethrown, so the main thread did not fail properly. We updated Nx to version @20.0.3, which is confirmed to resolve the issue. See https://github.com/Texera/texera/actions/runs/11429004023 (it properly failed the CI).


### Steps to migrate this PR:
Under `core` directory:
- You need to switch from Yarn@1 to Yarn@4.5.1: run `corepack enable &&
corepack prepare yarn@4.5.1 --activate && yarn --cwd gui set version
4.5.1`

Under `core/gui` directory:
- You can verify the Yarn version with `yarn --version`. It should show
4.5.1
- Remove all cached/dependency files: `rm -rf node_modules .angular`
- Reinstall with the upgraded Yarn: `yarn install`
This PR addresses a port handling issue in URL generation to prevent
invalid redirection. It introduces logic to check if the port is `-1`
(not specified), `80` (default HTTP port), or `443` (default HTTPS
port). When the port matches any of these conditions, it is omitted from
the URL.

Currently, even if the port number is not specified, it still appears in
the generated URL, leading to incorrect redirection. This fix ensures
that URLs are properly formatted to avoid such issues.
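The port-handling rule above can be sketched in Python (a hypothetical illustration; `format_host` is not the actual Texera helper): the port is dropped when it is -1, 80, or 443, and kept otherwise.

```python
from urllib.parse import urlunparse

def format_host(scheme: str, host: str, port: int) -> str:
    """Build a URL, omitting the port when it is -1 (unspecified),
    80 (default HTTP), or 443 (default HTTPS)."""
    if port in (-1, 80, 443):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunparse((scheme, netloc, "", "", "", ""))
```

For example, `format_host("https", "example.com", 443)` yields `https://example.com`, while a nonstandard port such as 8080 is kept in the URL.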

[The screenshot of the current issue]
<img width="740" alt="Screenshot 2024-10-15 at 11 48 55 AM"
src="https://github.com/user-attachments/assets/2da0d8ad-a58f-48bb-8c22-6f128d36021f">

Co-authored-by: Kunwoo Park <kunwoopark@Kunwoos-MacBook-Pro.local>
Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
This PR fixes #2951. Due to #2913, the change of the package for Tuple
and TupleLike should be reflected on the JavaUDF template.
This PR correctly handles an empty table in ArrowTupleProvider. In such
a case, the `next` function should raise `StopIteration` directly.
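A minimal Python sketch of the intended behavior (a hypothetical class, not the actual ArrowTupleProvider implementation): iterating an empty table must raise `StopIteration` on the first `next` call instead of failing later.

```python
class ArrowTupleProvider:
    """Hypothetical sketch: yields rows from a list of record batches.

    An empty table (no batches, or only empty batches) raises
    StopIteration directly on the first `next` call."""

    def __init__(self, batches):
        # Flatten batches lazily; an empty table yields nothing.
        self._rows = (row for batch in batches for row in batch)

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration propagates immediately for an empty table.
        return next(self._rows)
```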
Upgrade pandas version to avoid the need for compiling wheels during CI
builds with Python 3.12. The previous version led to delays due to
time-consuming compilation.
This PR fixes the issue of `FileDocumentSpec`. Two test cases were
written incorrectly.

## The Behavior of the Bug

Two test cases occasionally fail in `FileDocumentSpec`. One is for
concurrently writing to the same file document, and the other is for
concurrently reading from one file document. The failure occurs due to
race conditions where multiple threads access the file without proper
synchronization.

## How the Fix is Done

- **Concurrent Writes**: The fix ensures all threads share the same
`FileDocument` instance, allowing `ReentrantReadWriteLock` to properly
synchronize write access, preventing race conditions.
  
- **Concurrent Reads**: The fix ensures that multiple threads can safely
read from the file after the write operation is complete, using the same
instance to coordinate access.
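The shared-instance idea behind the fix can be illustrated with a minimal Python analog (the actual code is Scala using `ReentrantReadWriteLock`; this `FileDocument` class is hypothetical): a lock only coordinates threads that share the same instance, which is exactly what the corrected tests ensure.

```python
import threading

class FileDocument:
    """Hypothetical Python analog: the lock serializes only the
    writers that share this SAME instance."""

    def __init__(self):
        self._lock = threading.Lock()
        self.content = []

    def append(self, item):
        with self._lock:  # serializes writers sharing this instance
            self.content.append(item)

# One shared instance across all threads, as in the corrected test.
doc = FileDocument()
threads = [threading.Thread(target=doc.append, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(doc.content) == 8  # no lost updates
```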
This PR improves the user interface by updating the message displayed
when a user selects files to create a dataset. Previously, the message
read "# files uploaded successfully!" after the files were selected from
the local machine. However, based on feedback from @chenlica, the term
"uploaded" was found to be misleading, as the files are not yet uploaded
to the server at this stage.

To clarify the process, this PR changes the wording from "uploaded" to
"selected," making it clear that the files have only been chosen locally
and have not yet been transferred to the server.

Additionally, I added a conditional expression to determine whether to
use "file" or "files" based on the count.
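A minimal Python sketch of the reworded message (a hypothetical function; the actual change is in the Angular frontend): "selected" instead of "uploaded", with singular/plural chosen by the count.

```python
def selection_message(count: int) -> str:
    # "file" for exactly one, "files" otherwise; "selected", not "uploaded".
    noun = "file" if count == 1 else "files"
    return f"{count} {noun} selected successfully!"
```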

[Before the change]
<img width="711" alt="Screenshot 2024-10-24 at 11 17 02 AM"
src="https://github.com/user-attachments/assets/343b0218-d669-4cfd-9175-cf58f362416b">

[After the change]
<img width="550" alt="Screenshot 2024-10-24 at 11 20 57 AM"
src="https://github.com/user-attachments/assets/55f2f537-6bf7-4136-9957-47c964d08f0e">

---------

Co-authored-by: Kunwoo Park <kunwoopark@dhcp-172-31-218-249.mobile.uci.edu>
This PR addresses two issues related to modifying the workflow name and
description:

1. Write Access Issue: Previously, when a user with write access tried
to modify the workflow name or description, the backend would block the
action due to an unnecessary validation check. This PR removes the
redundant check, allowing users with write access to update the workflow
as expected.

2. Read-Only Access Issue: Users with read-only access were able to
modify the workflow name or description in the frontend, but the backend
would correctly reject these changes. However, the frontend continued to
display the modified values until the page was refreshed, leading to a
confusing user experience. This PR improves the frontend logic to revert
the workflow name and description to their original state if the backend
returns an error, ensuring a more consistent and user-friendly
experience.


https://github.com/user-attachments/assets/ed6aa9c4-67fe-46f3-8b33-64c68198d82a

---------

Co-authored-by: Kunwoo Park <kunwoopark@dhcp-v093-134.mobile.uci.edu>
This PR introduces an `nz-spin` loader to the workspace to indicate that
a workflow is being loaded. Previously, when uploading a large workflow,
the workspace would appear blank during the loading process, leaving
users unsure of what was happening. With this change, the `nz-spin`
loader ensures users are aware that the workflow loading is in progress,
thereby creating a more user-friendly experience.
<img width="1433" alt="Screenshot 2024-10-24 at 3 55 49 PM"
src="https://github.com/user-attachments/assets/d76a388c-a15d-48c6-9066-73dd2619cfd7">


https://github.com/user-attachments/assets/3888a5f8-a12c-40f3-9140-ed383db7a3c6

Co-authored-by: Kunwoo Park <kunwoopark@dhcp-172-31-218-249.mobile.uci.edu>
This PR adds the Udon UI for breakpoint-related operations, including
setting a breakpoint, removing a breakpoint, adding a condition to a
breakpoint, and hitting a breakpoint.

Adding/Removing breakpoints with optional conditions.
![2024-10-19 17 56
47](https://github.com/user-attachments/assets/829c9dca-b085-4a78-9df9-437730c08ccd)

Manually continue the execution and breakpoint hit:
![2024-10-19 18 02
09](https://github.com/user-attachments/assets/e01f2d50-3956-4094-a958-1ad6d7c72020)


### Limitations:
- Any breakpoint-related operation pauses the UDF; execution is not automatically continued/resumed.
- Currently, breakpoints can only be set while a Python UDF operator is running. Submitting a new execution removes all debugging states, including previously set breakpoints.
- Some states are stored only in the frontend (in UDFDebugService); they still need to be retrieved back from the backend debugger.
As discussed in #2950, we plan to remove obsolete RPCs and reconfiguration-related RPCs in this PR. We will bring reconfiguration back after #2950.

Removed:
1. QueryCurrentInputTuple.
2. ShutdownDPThread.

Disabled, will be added later:
1. Reconfiguration.
2. UpdateExecutor.
3. UpdateMultipleExecutors.
This PR includes a few efforts to improve frontend CI on macOS.

1. Changed the macOS CI to run on the arm64 architecture instead of x64. This was the main cause of the disconnection issue.
2. Upgraded testing-related packages to the latest: 
- karma to 6.4.4 (applied a custom fix for Chrome > 128.0.0.0.0 on macOS
arm64, see my post
angular/angular-cli#28271 (comment))
    - jasmine-core to 5.4.0
3. Fixed many problematic test cases, including:
    - Empty test cases (no `it` cases).
- Wrong dependency injection (in particular, HttpClient should use `HttpClientTestingModule`).
    - Wrong global NzMessageModule import.
…thout depending on userSystemEnabled flag (#2969)

This PR changes the logic of the scan source operator for resolving a file.

Specifically, the scan source operator no longer depends on the `userSystemEnabled` flag to decide whether it is scanning a local file or a file from a dataset.

Instead, the resolving logic is:
Input: fileName (a user-friendly name provided by the user when configuring the scan source operator)
- Check whether the file pointed to by fileName exists locally.
  - If it exists, resolve it as a local file.
  - If not, check whether the file exists in a dataset.
    - If it exists, resolve it as a DatasetFileDocument (the file handle of the dataset file).
    - If not, throw a `SourceFileNotFound` error.
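The resolution order above can be sketched in Python (hypothetical names; the actual implementation is in Scala, and the `dataset_files` dict stands in for the dataset file-handle lookup):

```python
from pathlib import Path

class SourceFileNotFound(Exception):
    pass

def resolve_file(file_name: str, dataset_files: dict):
    """Resolve fileName: local file first, then dataset, else error."""
    local = Path(file_name)
    if local.exists():
        return ("local", local)          # resolve as a local file
    if file_name in dataset_files:
        return ("dataset", dataset_files[file_name])  # dataset file handle
    raise SourceFileNotFound(file_name)
```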
…r User Interaction Tracking (#2902)

**Purpose:**
Add the like button while refactoring the clone button.

**Changes:**
1. Added two new tables to the database to track likes and clones. Use
16.sql to update the database.
2. Added a like button to the list item, which will be displayed in the
hub interface, but disabled when the user is not logged in.
3. Added a like button and a clone button to the detail page, both of which are disabled when the user is not logged in. The clone button is also disabled in the detail preview interface.

**Demo:**
**Old:**
list item before login:
![old list item before
login](https://github.com/user-attachments/assets/2729d307-2564-43cc-a852-1fb869633ae8)

detail page before login:
![old detail page before
login](https://github.com/user-attachments/assets/296ae4d0-11df-4486-9c7f-074fbc022a07)

list item after login:
![old list item after
login](https://github.com/user-attachments/assets/0923770c-2e1a-4122-a31f-6c5410f64334)

detail page after login:
![old detail page after
login](https://github.com/user-attachments/assets/92a0f08c-e1db-4aca-90a4-819595ba3bf6)

**New:**
list item before login:
![new hub list item before
login](https://github.com/user-attachments/assets/cb5e8ea8-75e0-4241-a1f4-fd3c46cefd7a)

detail page before login:
![new detail page before
login](https://github.com/user-attachments/assets/17c2ebac-9799-44aa-acb1-bd3670f1a9b8)

list item after login:
![new hub list item after
login](https://github.com/user-attachments/assets/7253cce9-ed85-4916-b686-dc286c71b4e9)

detail page after login:
![new detail page after
login](https://github.com/user-attachments/assets/356cb92e-7d5e-4a5c-895e-2135b7843301)

---------

Co-authored-by: Kyuho (Kyu) Oh <80994706+sixsage@users.noreply.github.com>
## Purpose
The issues that this PR addresses:
1. The docked buttons for the left panel and the result panel appeared on top of the UDF panel when they overlapped.
2. The UDF panel could be dragged outside of the workflow editor workspace, while other panels cannot.
## Changes
- Changed the z-index of the UDF code editor to a higher value.
- Changed the drag boundary of the UDF code editor to the workflow editor workspace.
## Demo
Before

![udf-z-index-before](https://github.com/user-attachments/assets/56a03da3-e631-47ac-9256-35bde8928bb4)
After

![udf-z-index-after](https://github.com/user-attachments/assets/b3e68709-aa90-421f-b4f7-83fe8658c4aa)

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
## Purpose
- Fix an issue where it was not possible to scroll on the hub workflow details page.
- Fix an issue where it was not possible to scroll down to the actual bottom of the quota page.
## Change
- Remove the `overflow: hidden` in styles.scss.
- Fix the scrollbar on the quota page so it properly reaches the bottom of the page.
## Demo
Quota Page Before

![chrome_LDAX8mGEUr](https://github.com/user-attachments/assets/2996d0a9-881a-4e7a-86c2-591124326055)
Quota Page After

![chrome_eqRK2111SC](https://github.com/user-attachments/assets/15022bb3-ecdd-49de-ac91-35ed26dcd532)
Hub Details Page Before

![image](https://github.com/user-attachments/assets/664052f9-fd5c-4e23-b843-858cf40e201b)
Hub Details Page After

![chrome_SLHzScl21R](https://github.com/user-attachments/assets/77060755-ae11-4e0f-8d82-f81362f97f9e)

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
This PR enhances the user experience by enabling the debugger frontend
to automatically send a "continue" command to resume execution after
receiving most breakpoint events, except for breakpoint hit and
exception events.

If the debugger is already in a hit state, no "continue" command will be
sent, ensuring the debugger pauses appropriately for user inspection.
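The decision rule can be summarized in a small Python sketch (hypothetical function and event names; the actual logic lives in the frontend debug service): auto-continue after most breakpoint events, but never for "hit" or "exception" events, and never while already paused at a hit.

```python
# Events after which the debugger must stay paused for user inspection.
HOLD_EVENTS = {"breakpoint_hit", "exception"}

def should_send_continue(event: str, in_hit_state: bool) -> bool:
    """Decide whether the frontend auto-sends a 'continue' command."""
    if in_hit_state:
        return False  # already paused at a hit: never auto-continue
    return event not in HOLD_EVENTS
```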
## Purpose
Remove a scrollbar bug in the workflow workspace.
## Changes
- Remove `display: block` for the left panel.
- Change the spinner-container height from 100vh to 100%.
## Demo
Before

![image](https://github.com/user-attachments/assets/a66c9b34-48fd-4f3b-8dc0-264c4c151bd8)
After

![image](https://github.com/user-attachments/assets/be39176b-2a64-40a5-baf7-6bef3d86ec64)
…urce operators. (#2975)

This PR refines the logic for showing the selected dataset and version in scan source operators when a file path is selected.

The logic no longer relies on the backend's API, yielding a cleaner implementation. This is the basis for follow-up refactoring PRs such as #2972.
…he file system (#2974)

This PR fixes size retrieval when constructing the DatasetFileNode. Since `PhysicalFileNode` already has the file size, that size is now reused directly instead of being read via the file system's API.

The bug: when deleting some files to create a new dataset version, the old version's file tree could not be retrieved correctly, because the code tried to read the file size using the file system's API, but the file no longer physically exists.

---------

Co-authored-by: Chris <143021053+kunwp1@users.noreply.github.com>
This PR refactors the API for downloading a version of a dataset. The purpose of this refactoring is to support future refactoring PRs such as #2972.

### New API

GET `/version-zip`
- `did`: the dataset's ID. Must be provided; specifies which dataset.
- `dvid`: the dataset version's ID. Optional; if provided, retrieve this version, otherwise retrieve the latest version.
This PR introduces the abstraction of a read-only resource document, and
an implementation representing a read-only local file.
## Purpose
Add view count to workflows
## Changes
1. Created a new table, workflow view count, in the Texera DDL. A new file, 17.sql, was also added for the table.
2. The view count for a workflow is incremented whenever a user accesses the hub-workflow-detail page or the workflow workspace for that workflow.
3. The API call to increment the view count is limited to once per second using throttleTime.
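The once-per-second limit uses RxJS `throttleTime` in the frontend; as a rough Python analog (hypothetical, not the actual implementation), a leading-edge throttle passes the first call through and drops calls arriving within the interval:

```python
import time

def throttled(interval_s: float):
    """Leading-edge throttle: run the first call, drop calls for the
    next `interval_s` seconds (rough analog of RxJS throttleTime)."""
    def decorate(fn):
        last = [float("-inf")]  # time of the last call that went through
        def wrapper(*args, **kwargs):
            now = time.monotonic()
            if now - last[0] >= interval_s:
                last[0] = now
                return fn(*args, **kwargs)
            return None  # call suppressed
        return wrapper
    return decorate

calls = []

@throttled(1.0)
def increment_view(workflow_id):
    calls.append(workflow_id)

increment_view(42)
increment_view(42)  # arrives within 1 s, so it is dropped
assert calls == [42]
```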
## Demo
List item before

![image](https://github.com/user-attachments/assets/ede5e665-e50c-496e-a239-f270017f7407)
List item after

![image](https://github.com/user-attachments/assets/0e9274a8-d9c9-4165-8c33-bab204e74602)
hub-workflow-detail page before

![image](https://github.com/user-attachments/assets/e6e33e33-724f-4e52-a1e5-34f45f23ade9)
hub-workflow-detail page after
Default display

![image](https://github.com/user-attachments/assets/5aab223e-f419-43bf-9cfc-db72681d67d1)
After clicking on button (precise display)

![image](https://github.com/user-attachments/assets/486569e2-d19e-4d43-a624-2a5001331e41)

---------

Co-authored-by: gspikehalo <2318002579@qq.com>
Co-authored-by: GspikeHalo <109092664+GspikeHalo@users.noreply.github.com>
Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>
**Purpose:**
Add a landing page for Texera as the first page users see.

**Changes:**
1. Move the original homepage to the About page.
2. Add a brand new landing page as the home page.
3. Remove the GitHub-related content from the top bar and display it
only on the About page.
4. The like button will always be visible on list items, instead of only
appearing on hover.

**Demo:**
landing page:
![new landing
page](https://github.com/user-attachments/assets/7e8b2c60-9b48-4660-8cec-f03177f5797d)

about page:
![new about
page](https://github.com/user-attachments/assets/23a41de5-ff32-493f-945f-ef4c9f877b66)

new like button without login:
![like button without
login](https://github.com/user-attachments/assets/528efdeb-6c28-49ba-852e-c7f90123d2d4)

new like button after login:
![new like button after
login](https://github.com/user-attachments/assets/40c3bc51-947b-4b01-a217-1f7dd4879e96)

---------

Co-authored-by: Kyuho (Kyu) Oh <80994706+sixsage@users.noreply.github.com>
**Purpose:**
Addresses some visual issues caused by the current layout when the user hovers over dashboard entries.
Fixes #2963.

**Changes:**
1. When the user hovers over the item, add transparency to the
resource-info (including dataset size, creation time, and edit time) to
reduce its visibility.
2. Change the background of the button group to transparent, and add a
darker background color and border to the buttons.

**Demo:**
old list item:

![image](https://github.com/user-attachments/assets/65579f2b-5165-4068-864d-a119f6c0d211)

new list item:

![image](https://github.com/user-attachments/assets/7f5013f6-cec5-499f-a4a5-d3e524c827ad)
bobbai00 added a commit that referenced this pull request Mar 11, 2025
…orage (#3296)

This PR introduces the FileService as another microservice parallel to
WorkflowCompilingService, ComputingUnitMaster/Worker, and
TexeraWebApplication.

## Purpose of the FileService

- We want to improve the performance of our current Git-based Dataset implementation.
- We decided to go with `LakeFS` + `S3`: LakeFS for the version-control metadata and S3 for data transfer. However, `LakeFS` does not have an access-control layer.
- Therefore, we built the `FileService`, providing
   - all the APIs related to versioned files in datasets
   - access control

## Architecture before and after adding FileService
Before:
<img width="767" alt="Screenshot 2025-03-02 at 9 01 20 AM"
src="https://github.com/user-attachments/assets/7d039f7a-49f2-4d48-9d15-bb9889a5c5ed"
/>

After
<img width="827" alt="Screenshot 2025-03-02 at 8 53 32 AM"
src="https://github.com/user-attachments/assets/28aa72b9-97b1-4789-b46f-2050e0dd8547"
/>


## Key Changes
- A new service, `FileService`, is introduced. All dataset-related endpoints are hosted on `FileService`.
- Several configuration items related to LakeFS and S3 are introduced in `storage-config.yaml`.
- Frontend UI updates to incorporate the new changes.
- `ComputingUnitMaster` and `ComputingUnitWorker` call `FileService` to read files, during which their access is verified. In the dynamic computing architecture (to be introduced in #3298), they send requests along with the current user's token. In the single-machine architecture, they bypass the network requests with direct local function calls.
- Python UDF code can now directly read a dataset's file, for example:
```python
file = DatasetFileDocument("The URL of the file")
content = file.read_file()  # returns the file content as a bytes-like object
```
You may refer to
`core/amber/src/main/python/pytexera/storage/dataset_file_document.py`
for implementation details. **This feature is only available in the
dynamic computing architecture**.

## How to migrate previous datasets to the new datasets managed by LakeFS

Since we did quite a bit of refactoring, the two dataset implementations are NOT compatible with each other. To migrate previous datasets to the latest implementation, you will need to re-upload the data via the new UI.

## How to deploy the new architecture
### Step 1. Deploy LakeFS & MinIO
#### Use Docker (highly recommended for local development)
- Go to the directory `core/file-service/src/main/resources`
- Execute `docker-compose --profile local-lakefs up -d` in that directory

#### Use Binary (recommended for production deployment)
Refer to https://docs.lakefs.io/howto/deploy/


### Step 2. Configure `storage-config.yaml`

Configure the section below in `storage-config.yaml`:
```yaml
  lakefs:
    endpoint: ""
    auth:
      api-secret: ""
      username: ""
      password: ""
    block-storage:
      type: ""
      bucket-name: ""

  s3:
    endpoint: ""
    auth:
      username: ""
      password: ""
```

Here is a configuration you can use directly if you installed LakeFS & MinIO via `core/file-service/src/main/resources/docker-compose.yml`:
```yaml
  lakefs:
    endpoint: "http://127.0.0.1:8000/api/v1"
    auth:
      api-secret: "random_string_for_lakefs"
      username: "AKIAIOSFOLKFSSAMPLES"
      password: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    block-storage:
      type: "s3"
      bucket-name: "texera-dataset"

  s3:
    endpoint: "http://localhost:9000"
    auth:
      username: "texera_minio"
      password: "password"
```

### Step 3. Launch services

Launch `FileService`, in addition to `TexeraWebApplication`, `WorkflowCompilingService`, and `ComputingUnitMaster`.


## Future PRs after this one

- Remove the dataset-related endpoints completely from the `amber` package.
- Incorporate the deployment of LakeFS + S3 into the Helm chart of the K8s-based deployment.
- Some optimizations:
  - For small files, upload directly instead of using multipart upload.
  - When exporting results, use multipart upload, as the result size can be quite big.
  - Support re-transmission of partially uploaded files.
@shengquan-ni shengquan-ni marked this pull request as draft March 14, 2025 02:08
Ma77Ball pushed a commit that referenced this pull request Apr 2, 2025
…orage (#3296)

@aglinxinyuan aglinxinyuan deleted the shengquan-add-cu-mgr branch September 6, 2025 00:54