Conversation

@bobbai00 bobbai00 commented Feb 18, 2024

This PR introduces the dataset APIs to the webserver. The frontend can use these APIs to manage and view datasets, dataset versions, and the files under each version. This PR depends on the DDL and the version-control file service; see #2369.

## API Overview

The added APIs include:

### 1. Dataset and Version APIs

```
# create a dataset
POST    /api/dataset/create

# delete datasets given a list of dids
POST    /api/dataset/delete

# update a dataset's description
POST    /api/dataset/update/description

# update a dataset's name
POST    /api/dataset/update/name

# get the dataset info by did
GET     /api/dataset/{did}

# create a new version of a dataset
POST    /api/dataset/{did}/version/create

# retrieve the latest version of a dataset
GET     /api/dataset/{did}/version/latest

# list all versions of a dataset
GET     /api/dataset/{did}/version/list

# get the content of a file in a certain version of a dataset
GET     /api/dataset/{did}/version/{dvid}/file

# get the root file nodes of a certain version of a dataset
GET     /api/dataset/{did}/version/{dvid}/rootFileNodes
```

### 2. Dataset Access Control APIs

```
PUT     /api/access/dataset/grant/{did}/{email}/{privilege}
GET     /api/access/dataset/list/{did}
GET     /api/access/dataset/owner/{did}
DELETE  /api/access/dataset/revoke/{did}/{email}
```

## Dataset + Version Filesystem Path Design

Each dataset is stored in the directory `user-resources/datasets/{did}`. Each directory is managed by `GitVersionControlLocalFileStorage`, and each dataset version corresponds to one commit in that Git repository.
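As a sketch of this layout (the root directory name comes from the description above; the `datasetPath` helper is illustrative, not the actual Texera code):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DatasetPaths {
    // Base directory for all dataset repositories, per the design above.
    static final Path DATASETS_ROOT = Paths.get("user-resources", "datasets");

    // Resolve the Git-managed directory for a dataset id (did).
    static Path datasetPath(int did) {
        return DATASETS_ROOT.resolve(Integer.toString(did));
    }
}
```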

## DB Schema Change

A minor update to the `Dataset` table is needed:

```sql
USE `texera_db`;

ALTER TABLE dataset
MODIFY COLUMN storage_path VARCHAR(512) NOT NULL DEFAULT '';
```

With this change, `storage_path` defaults to `''` when it is not set during insert. The default is needed because the storage path is derived from the `did`, and the `did` only becomes available after the record has been inserted into the `Dataset` table.
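A sketch of the resulting two-step insert flow (column names other than `storage_path` and `did` are assumptions for illustration):

```sql
-- Insert without a path; storage_path falls back to the default ''.
INSERT INTO dataset (name, description) VALUES ('my-dataset', 'a demo dataset');

-- Once the auto-generated did is known, derive and persist the path.
UPDATE dataset
SET storage_path = CONCAT('user-resources/datasets/', did)
WHERE did = LAST_INSERT_ID();
```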

## Some Details

### 1. Concurrency Control of Dataset Modification & Access

A static lock store is initialized in `object DatasetResource`; it maps each `did` to a `ReentrantLock`:

```scala
  val datasetLocks: scala.collection.concurrent.Map[UInteger, ReentrantLock] =
    new scala.collection.concurrent.TrieMap[UInteger, ReentrantLock]()
```

When `createNewDatasetVersion` is called, it first acquires (or creates) the lock for the given `did` before performing any operation:

```scala
  private def createNewDatasetVersion(
      ctx: DSLContext,
      did: UInteger,
      uid: UInteger,
      versionName: String,
      multiPart: FormDataMultiPart
  ): Option[DashboardDatasetVersion] = {

    // Acquire or create the lock for the dataset identified by did
    val lock = DatasetResource.datasetLocks.getOrElseUpdate(did, new ReentrantLock())

    // Atomically attempt to acquire the lock; a separate isLocked check
    // followed by lock() would be racy.
    if (!lock.tryLock()) {
      return None
    }
```

The lock is released in the `finally` block:

```scala
    finally {
      // Release the lock
      lock.unlock()
    }
```

Based on this mechanism, the following questions can be answered:

#### What happens when Alice creates `v2` while Bob is reading a `v1` file/file tree of dataset `d1`?

Both requests succeed. Reads do NOT acquire any locks.

#### What happens when Alice and Bob both create `v2` for dataset `d1` simultaneously?

One of them fails: whichever of Alice or Bob acquires the lock first succeeds, and the other fails.

#### What happens when Alice creates `v2` for dataset `d1` while Bob creates `v2` for dataset `d2`?

Both requests succeed. The lock is at the dataset level: one lock per `did`.
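The per-`did` granularity can be illustrated with a minimal sketch (simplified from the code above; `tryStartVersionCreation` and `finishVersionCreation` are hypothetical helpers, and a plain `int` stands in for `UInteger`):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class DatasetLocks {
    // One lock per dataset id, mirroring the per-did lock store above.
    static final ConcurrentHashMap<Integer, ReentrantLock> LOCKS =
        new ConcurrentHashMap<>();

    // Returns true if the caller wins the right to create a version for did.
    static boolean tryStartVersionCreation(int did) {
        return LOCKS.computeIfAbsent(did, d -> new ReentrantLock()).tryLock();
    }

    // Called from a finally block once version creation finishes.
    static void finishVersionCreation(int did) {
        LOCKS.get(did).unlock();
    }
}
```

Because each `did` gets its own lock, creates on different datasets never contend with each other.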

### 2. Representing the File Tree as JSON

To serialize the root file nodes as JSON so that the frontend can parse and display the file tree, a `FileNodeSerializer` is added and plugged into the Dropwizard bootstrap.

```java
public class FileNodeSerializer extends StdSerializer<FileNode> {

  public FileNodeSerializer() {
    this(null);
  }

  public FileNodeSerializer(Class<FileNode> t) {
    super(t);
  }

  @Override
  public void serialize(FileNode value, JsonGenerator gen, SerializerProvider provider) throws IOException {
    gen.writeStartObject();
    gen.writeStringField("path", value.getRelativePath().toString());
    gen.writeBooleanField("isFile", value.isFile());
    if (value.isDirectory()) {
      gen.writeFieldName("children");
      gen.writeStartArray();
      for (FileNode child : value.getChildren()) {
        serialize(child, gen, provider); // Recursively serialize children
      }
      gen.writeEndArray();
    }
    gen.writeEndObject();
  }
}
```

```scala
    // register a new custom module and add the custom serializer to it
    val customSerializerModule = new SimpleModule("CustomSerializers")
    customSerializerModule.addSerializer(classOf[FileNode], new FileNodeSerializer())
    bootstrap.getObjectMapper.registerModule(customSerializerModule)
```

To give an example, the JSON representation of this file tree:

```
a.csv
1.txt
dir
- dir/1.pdf
```

will be:

```json
{
    "fileNodes": [
        {
            "path": "a.csv",
            "isFile": true
        },
        {
            "path": "1.txt",
            "isFile": true
        },
        {
            "path": "dir",
            "isFile": false,
            "children": [
                {
                    "path": "dir/1.pdf",
                    "isFile": true
                }
            ]
        }
    ]
}
```

@bobbai00 bobbai00 marked this pull request as ready for review February 18, 2024 17:34
@bobbai00 bobbai00 self-assigned this Feb 18, 2024
@aglinxinyuan (Contributor) left a comment, later marked as resolved:
LGTM! Please remove your config file and fix backend test cases.

@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-apis branch from fa8692b to eaddea7 Compare February 20, 2024 18:24
@bobbai00 bobbai00 force-pushed the jiadong-introduce-dataset-apis branch from eaddea7 to 915986e Compare February 20, 2024 19:47
@bobbai00 bobbai00 merged commit 0832f99 into master Feb 20, 2024
@bobbai00 bobbai00 deleted the jiadong-introduce-dataset-apis branch February 20, 2024 20:25
Yicong-Huang pushed a commit that referenced this pull request Feb 23, 2024
shengquan-ni added a commit that referenced this pull request Feb 23, 2024
**IMPORTANT NOTE:** _Due to the introduction of dataset in #2391, we
need to add a new dataset search query builder to this PR. However, the
dataset changes are not merged completely yet, we decide not to support
dataset as a searchable resource. After the dataset changes are all
merged, we need another PR to add the support._

**Background:**
In Texera, we have 3 resource types: File, Workflow, and Project. Each resource has its own access control and schema. We want to be able to search all resources through a single search input box. Our approach is to create a unified schema by unioning the different schemas into one big query. This approach also simplifies the handling of `offset` and `limit`.

**Refactoring:**
This PR breaks down the construction of full-text search query into
several components for better maintainability:
1. `FulltextSearchQueryUtils` contains helper functions to formulate
`where` conditions given query parameters.
2. `UnifiedResourceSchema` provides the unified schema, each resource
maps its own schema to the unified schema.
3. `SearchQueryBuilder` provides a general framework to build a search
query of one type of resource. We have `FileSearchQueryBuilder`,
`ProjectSearchQueryBuilder` and `WorkflowSearchQueryBuilder` for the
existing 3 resource types.
4. `searchAllResources` in `DashBoardResouce` is the endpoint of the
full-text search. It unifies all the results and returns them to the
front end.

**Note:**
To improve the quality of search results, I added a substring-search (`LIKE`) condition to the query. If performance degrades because of this, we should remove it.
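The union approach described above can be sketched in SQL (all table and column names here are purely illustrative, not the actual Texera schema):

```sql
-- Each resource type is projected onto a shared set of columns, then
-- UNION ALL combines them so offset/limit apply once, globally.
SELECT 'workflow' AS resource_type, wid AS id, name, description
FROM workflow
WHERE name LIKE CONCAT('%', 'keyword', '%')
UNION ALL
SELECT 'project', pid, name, description
FROM project
WHERE name LIKE CONCAT('%', 'keyword', '%')
UNION ALL
SELECT 'file', fid, name, description
FROM file
WHERE name LIKE CONCAT('%', 'keyword', '%')
ORDER BY name
LIMIT 20 OFFSET 0;
```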
@bobbai00 bobbai00 added the ddl-change Changes to the TexeraDB DDL label Mar 6, 2024
This was referenced Mar 6, 2024
chenlica pushed a commit that referenced this pull request Dec 28, 2025

### What changes were proposed in this PR?
This PR proposes to remove the unused `retrieveDatasetSingleFile()` endpoint (`GET /api/dataset/file`), which allowed unauthenticated downloads of non-downloadable datasets.

### Any related issues, documentation, discussions?

The endpoint was introduced in PR #2391, which added the dataset APIs to the webserver, and later modified in PR #2719, which removed the concept of `Environment`.

### How was this PR tested?
Manually tested:
<img width="690" height="404" alt="Screenshot 2025-12-27 at 1 15 21 AM"
src="https://github.com/user-attachments/assets/91bea787-d447-4abe-ad39-74eb581fa657"
/>

### Was this PR authored or co-authored using generative AI tooling?
No.
carloea2 pushed a commit to carloea2/texera that referenced this pull request Jan 6, 2026
Labels

ddl-change Changes to the TexeraDB DDL webserver
