Conversation

@bobbai00 bobbai00 commented Feb 18, 2024

This PR introduces the dataset APIs to the webserver. The frontend can use these APIs to manage and view datasets, dataset versions, and the files under each version. This PR depends on the DDL and the version-control file service; see #2369.

## API Overview

The added APIs include:

### 1. Dataset and Version APIs

```
# create a dataset
POST    /api/dataset/create

# delete datasets given a list of dids
POST    /api/dataset/delete

# update a dataset's description
POST    /api/dataset/update/description

# update a dataset's name
POST    /api/dataset/update/name

# get the dataset info by did
GET     /api/dataset/{did}

# create a new version of a dataset
POST    /api/dataset/{did}/version/create

# retrieve the latest version of a dataset
GET     /api/dataset/{did}/version/latest

# list all versions of a dataset
GET     /api/dataset/{did}/version/list

# get the content of a file in a certain version of a dataset
GET     /api/dataset/{did}/version/{dvid}/file

# get the root file nodes of a certain version of a dataset
GET     /api/dataset/{did}/version/{dvid}/rootFileNodes
```

### 2. Dataset Access Control APIs

```
PUT     /api/access/dataset/grant/{did}/{email}/{privilege}
GET     /api/access/dataset/list/{did}
GET     /api/access/dataset/owner/{did}
DELETE  /api/access/dataset/revoke/{did}/{email}
```

## Dataset + Version Filesystem Path Design

Each dataset is stored in the directory `user-resources/datasets/{did}`. Each directory is managed by `GitVersionControlLocalFileStorage`, and each dataset version corresponds to one commit in that Git repository.
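As a sketch of this layout (the root directory name comes from the description above; the `datasetPath` helper is illustrative, not the actual Texera code):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DatasetPaths {
    // Base directory for all dataset repositories, per the design above.
    static final Path DATASETS_ROOT = Paths.get("user-resources", "datasets");

    // Resolve the Git-managed directory for a dataset id (did).
    static Path datasetPath(int did) {
        return DATASETS_ROOT.resolve(Integer.toString(did));
    }
}
```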

## DB Schema Change

A minor update to the `Dataset` table is needed:

```sql
USE `texera_db`;

ALTER TABLE dataset
MODIFY COLUMN storage_path VARCHAR(512) NOT NULL DEFAULT '';
```

With this change, `storage_path` defaults to `''` when it is not set during insert. The default is needed because the storage path is derived from the `did`, and the `did` only becomes available after the record has been inserted into the `Dataset` table.
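A sketch of the resulting two-step insert flow (column names other than `storage_path` and `did` are assumptions for illustration):

```sql
-- Insert without a path; storage_path falls back to the default ''.
INSERT INTO dataset (name, description) VALUES ('my-dataset', 'a demo dataset');

-- Once the auto-generated did is known, derive and persist the path.
UPDATE dataset
SET storage_path = CONCAT('user-resources/datasets/', did)
WHERE did = LAST_INSERT_ID();
```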

## Some Details

### 1. Concurrency Control of Dataset Modification & Access

A static lock store is initialized in `object DatasetResource`; it maps each `did` to a `ReentrantLock`:

```scala
  val datasetLocks: scala.collection.concurrent.Map[UInteger, ReentrantLock] =
    new scala.collection.concurrent.TrieMap[UInteger, ReentrantLock]()
```

When `createNewDatasetVersion` is called, it first acquires (or creates) the lock for the given `did` before performing any operation:

```scala
  private def createNewDatasetVersion(
      ctx: DSLContext,
      did: UInteger,
      uid: UInteger,
      versionName: String,
      multiPart: FormDataMultiPart
  ): Option[DashboardDatasetVersion] = {

    // Acquire or create the lock for the dataset identified by did
    val lock = DatasetResource.datasetLocks.getOrElseUpdate(did, new ReentrantLock())

    // Atomically attempt to acquire the lock; a separate isLocked check
    // followed by lock() would be racy.
    if (!lock.tryLock()) {
      return None
    }
```

The lock is released in the `finally` block:

```scala
    finally {
      // Release the lock
      lock.unlock()
    }
```

Based on this mechanism, the following questions can be answered:

#### What happens when Alice creates `v2` while Bob is reading a `v1` file/file tree of dataset `d1`?

Both requests succeed. Reads do NOT acquire any locks.

#### What happens when Alice and Bob both create `v2` for dataset `d1` simultaneously?

One of them fails: whichever of Alice or Bob acquires the lock first succeeds, and the other fails.

#### What happens when Alice creates `v2` for dataset `d1` while Bob creates `v2` for dataset `d2`?

Both requests succeed. The lock is at the dataset level: one lock per `did`.
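The per-`did` granularity can be illustrated with a minimal sketch (simplified from the code above; `tryStartVersionCreation` and `finishVersionCreation` are hypothetical helpers, and a plain `int` stands in for `UInteger`):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class DatasetLocks {
    // One lock per dataset id, mirroring the per-did lock store above.
    static final ConcurrentHashMap<Integer, ReentrantLock> LOCKS =
        new ConcurrentHashMap<>();

    // Returns true if the caller wins the right to create a version for did.
    static boolean tryStartVersionCreation(int did) {
        return LOCKS.computeIfAbsent(did, d -> new ReentrantLock()).tryLock();
    }

    // Called from a finally block once version creation finishes.
    static void finishVersionCreation(int did) {
        LOCKS.get(did).unlock();
    }
}
```

Because each `did` gets its own lock, creates on different datasets never contend with each other.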

### 2. Representing the File Tree as JSON

To serialize the root file nodes as JSON so that the frontend can parse and display the file tree, a `FileNodeSerializer` is added and plugged into the Dropwizard bootstrap.

```java
public class FileNodeSerializer extends StdSerializer<FileNode> {

  public FileNodeSerializer() {
    this(null);
  }

  public FileNodeSerializer(Class<FileNode> t) {
    super(t);
  }

  @Override
  public void serialize(FileNode value, JsonGenerator gen, SerializerProvider provider) throws IOException {
    gen.writeStartObject();
    gen.writeStringField("path", value.getRelativePath().toString());
    gen.writeBooleanField("isFile", value.isFile());
    if (value.isDirectory()) {
      gen.writeFieldName("children");
      gen.writeStartArray();
      for (FileNode child : value.getChildren()) {
        serialize(child, gen, provider); // Recursively serialize children
      }
      gen.writeEndArray();
    }
    gen.writeEndObject();
  }
}
```

```scala
    // register a new custom module and add the custom serializer to it
    val customSerializerModule = new SimpleModule("CustomSerializers")
    customSerializerModule.addSerializer(classOf[FileNode], new FileNodeSerializer())
    bootstrap.getObjectMapper.registerModule(customSerializerModule)
```

To give an example, the JSON representation of this file tree:

```
a.csv
1.txt
dir
- dir/1.pdf
```

will be:

```json
{
    "fileNodes": [
        {
            "path": "a.csv",
            "isFile": true
        },
        {
            "path": "1.txt",
            "isFile": true
        },
        {
            "path": "dir",
            "isFile": false,
            "children": [
                {
                    "path": "dir/1.pdf",
                    "isFile": true
                }
            ]
        }
    ]
}
```

@bobbai00 bobbai00 marked this pull request as ready for review February 18, 2024 17:34
@bobbai00 bobbai00 self-assigned this Feb 18, 2024
@aglinxinyuan (Contributor) left a comment, later marked as resolved:
LGTM! Please remove your config file and fix backend test cases.

@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-apis branch from fa8692b to eaddea7 Compare February 20, 2024 18:24
@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-apis branch from eaddea7 to 915986e Compare February 20, 2024 19:47
@bobbai00 bobbai00 merged commit 0832f99 into master Feb 20, 2024
@bobbai00 bobbai00 deleted the jiadong-introduce-dataset-apis branch February 20, 2024 20:25
Yicong-Huang pushed a commit that referenced this pull request Feb 23, 2024
shengquan-ni added a commit that referenced this pull request Feb 23, 2024
**IMPORTANT NOTE:** _Due to the introduction of dataset in #2391, we
need to add a new dataset search query builder to this PR. However, the
dataset changes are not merged completely yet, we decide not to support
dataset as a searchable resource. After the dataset changes are all
merged, we need another PR to add the support._

**Background:**
In Texera, we have 3 resource types: File, Workflow, and Project. Each resource has its own access control and schema. We want to be able to search all resources through a single search input box. Our approach is to create a unified schema by unioning the different schemas into one big query. This approach also simplifies the handling of `offset` and `limit`.

**Refactoring:**
This PR breaks down the construction of full-text search query into
several components for better maintainability:
1. `FulltextSearchQueryUtils` contains helper functions to formulate
`where` conditions given query parameters.
2. `UnifiedResourceSchema` provides the unified schema, each resource
maps its own schema to the unified schema.
3. `SearchQueryBuilder` provides a general framework to build a search
query of one type of resource. We have `FileSearchQueryBuilder`,
`ProjectSearchQueryBuilder` and `WorkflowSearchQueryBuilder` for the
existing 3 resource types.
4. `searchAllResources` in `DashBoardResouce` is the endpoint of the
full-text search. It unifies all the results and returns them to the
front end.

**Note:**
To improve the quality of search results, I added a substring-search (`LIKE`) condition to the query. If performance degrades because of this, we should remove it.
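The union approach described above can be sketched in SQL (all table and column names here are purely illustrative, not the actual Texera schema):

```sql
-- Each resource type is projected onto a shared set of columns, then
-- UNION ALL combines them so offset/limit apply once, globally.
SELECT 'workflow' AS resource_type, wid AS id, name, description
FROM workflow
WHERE name LIKE CONCAT('%', 'keyword', '%')
UNION ALL
SELECT 'project', pid, name, description
FROM project
WHERE name LIKE CONCAT('%', 'keyword', '%')
UNION ALL
SELECT 'file', fid, name, description
FROM file
WHERE name LIKE CONCAT('%', 'keyword', '%')
ORDER BY name
LIMIT 20 OFFSET 0;
```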
@bobbai00 bobbai00 added the ddl-change Changes to the TexeraDB DDL label Mar 6, 2024
This was referenced Mar 6, 2024
chenlica pushed a commit that referenced this pull request Dec 28, 2025

### What changes were proposed in this PR?
This PR proposes to remove the unused `retrieveDatasetSingleFile()` endpoint (`GET /api/dataset/file`), which allowed unauthenticated downloads of non-downloadable datasets.

### Any related issues, documentation, discussions?

The endpoint was introduced in PR #2391, which added the dataset APIs to the webserver, and later modified in PR #2719, which removed the concept of `Environment`.

### How was this PR tested?
Manually tested:
<img width="690" height="404" alt="Screenshot 2025-12-27 at 1 15 21 AM"
src="https://github.com/user-attachments/assets/91bea787-d447-4abe-ad39-74eb581fa657"
/>

### Was this PR authored or co-authored using generative AI tooling?
No.
carloea2 pushed a commit to carloea2/texera that referenced this pull request Jan 6, 2026
Labels

ddl-change Changes to the TexeraDB DDL webserver
