# Add Dataset APIs to the webserver #2391
Merged
**aglinxinyuan** (Contributor) approved these changes on Feb 19, 2024:

> LGTM! Please remove your config file and fix backend test cases.
**Yicong-Huang** pushed a commit that referenced this pull request on Feb 23, 2024:
This PR introduces the dataset APIs to the webserver. The frontend can use
these APIs to manage and view datasets, dataset versions, and the files
under each version.
## API Overview
The added APIs include:
### 1. Dataset and Version related APIs:
```
# create a dataset
POST /api/dataset/create
# delete datasets given a list of dids
POST /api/dataset/delete
# update a dataset's description
POST /api/dataset/update/description
# update a dataset's name
POST /api/dataset/update/name
# get the dataset info by did
GET /api/dataset/{did}
# create a new version of a dataset
POST /api/dataset/{did}/version/create
# retrieve the latest version of a dataset
GET /api/dataset/{did}/version/latest
# list all versions of a dataset
GET /api/dataset/{did}/version/list
# get the content of a file in a certain version of a dataset
GET /api/dataset/{did}/version/{dvid}/file
# get the root file nodes of a certain version of a dataset
GET /api/dataset/{did}/version/{dvid}/rootFileNodes
```
### 2. Dataset Access Control
```
PUT /api/access/dataset/grant/{did}/{email}/{privilege}
GET /api/access/dataset/list/{did}
GET /api/access/dataset/owner/{did}
DELETE /api/access/dataset/revoke/{did}/{email}
```
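For illustration, a client-side call to one of the endpoints above can be sketched with Java's built-in HTTP client (usable directly from Scala). The host, port, payload, and auth header here are assumptions for the sketch, not part of this PR:

```scala
// Hypothetical client-side sketch; the base URL and token are placeholders.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val base = "http://localhost:8080" // assumed webserver address
val did = 1

// Build a request for GET /api/dataset/{did}/version/list
val listVersions: HttpRequest = HttpRequest.newBuilder()
  .uri(URI.create(s"$base/api/dataset/$did/version/list"))
  .header("Authorization", "Bearer <token>") // placeholder; the endpoints require an authenticated user
  .GET()
  .build()

// Sending it (requires a running server):
// val resp = HttpClient.newHttpClient().send(listVersions, HttpResponse.BodyHandlers.ofString())
```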
## Dataset+Version Filesystem Path Design
Each dataset will be stored in the directory
`user-resources/datasets/{did}`. Each such directory is managed by
`GitVersionControlLocalFileStorage`, and each version corresponds to one
commit of that Git repository.
## DB Schema change
A minor update to the `Dataset` table is needed:
```sql
USE `texera_db`;
ALTER TABLE dataset
MODIFY COLUMN storage_path VARCHAR(512) NOT NULL DEFAULT '';
```
With this change, the `storage_path` of a dataset defaults to `''` when it
is not set during insert. This default is needed because the storage path
is derived from the `did`, but the `did` is only assigned once the record
has been inserted into the `Dataset` table.
## Some Details
### 1. Concurrency Control of Dataset Modification & Access
A static lock store is initialized in `object DatasetResource`; it maps
each `did` to a `ReentrantLock`:
```scala
val datasetLocks: scala.collection.concurrent.Map[UInteger, ReentrantLock] =
new scala.collection.concurrent.TrieMap[UInteger, ReentrantLock]()
```
When `createNewDatasetVersion` is called, it first acquires (or creates)
the lock for the given `did` before performing any operation:
```scala
private def createNewDatasetVersion(
ctx: DSLContext,
did: UInteger,
uid: UInteger,
versionName: String,
multiPart: FormDataMultiPart
): Option[DashboardDatasetVersion] = {
// Acquire or create the lock for dataset {did}
val lock = DatasetResource.datasetLocks.getOrElseUpdate(did, new ReentrantLock())
// tryLock checks and acquires atomically; checking isLocked and then
// calling lock() separately would leave a race window between the two calls
if (!lock.tryLock()) {
  return None
}
```
In the `finally` block, the lock is released:
```scala
finally {
// Release the lock
lock.unlock()
}
```
Based on this mechanism, the following questions can be answered:
#### What happens when Alice creates `v2` while Bob reads a `v1` file/file tree of dataset `d1` simultaneously?
Both requests succeed. Reads do NOT acquire any locks.
#### What happens when Alice and Bob both create `v2` for dataset `d1` simultaneously?
One of them fails: only one of them can acquire the lock, and the other fails fast.
#### What happens when Alice creates `v2` for dataset `d1` and Bob creates `v2` for dataset `d2`?
Both requests succeed. The lock is at the dataset level, i.e., one lock per `did`.
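The locking behavior described above can be sketched in isolation. This is a simplified model, not the PR's exact code; `withDatasetLock` is a hypothetical helper that wraps the acquire/release pattern:

```scala
// Simplified model of the per-dataset lock (illustrative, not the PR's code).
import java.util.concurrent.locks.ReentrantLock
import scala.collection.concurrent.TrieMap

val datasetLocks = new TrieMap[Int, ReentrantLock]() // keyed by did

// Runs body only if no other writer currently holds the lock for this did.
def withDatasetLock[T](did: Int)(body: => T): Option[T] = {
  val lock = datasetLocks.getOrElseUpdate(did, new ReentrantLock())
  if (!lock.tryLock()) None // fail fast, mirroring createNewDatasetVersion returning None
  else
    try Some(body)
    finally lock.unlock()
}

// Alice creates v2 of d1 while Bob creates v2 of d2: different dids, both succeed.
val alice = withDatasetLock(1) { "v2 of d1" }
val bob   = withDatasetLock(2) { "v2 of d2" }

// Concurrent creates on the same did: the second writer (another thread) fails fast.
var second: Option[String] = None
val first = withDatasetLock(1) {
  val t = new Thread(() => { second = withDatasetLock(1) { "v2 again" } })
  t.start(); t.join()
  "v2 of d1"
}
```

Because the map is a `TrieMap`, `getOrElseUpdate` is atomic, so two threads racing to create the lock for a new `did` still end up sharing a single `ReentrantLock` instance.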
### 2. Representation of the File tree using JSON
To represent the root file nodes in JSON so that the frontend can parse
and display the file tree, a `FileNodeSerializer` is added and plugged
into the Dropwizard bootstrap.
```java
import java.io.IOException;

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.SerializerProvider;
import com.fasterxml.jackson.databind.ser.std.StdSerializer;

public class FileNodeSerializer extends StdSerializer<FileNode> {
public FileNodeSerializer() {
this(null);
}
public FileNodeSerializer(Class<FileNode> t) {
super(t);
}
@Override
public void serialize(FileNode value, JsonGenerator gen, SerializerProvider provider) throws IOException {
gen.writeStartObject();
gen.writeStringField("path", value.getRelativePath().toString());
gen.writeBooleanField("isFile", value.isFile());
if (value.isDirectory()) {
gen.writeFieldName("children");
gen.writeStartArray();
for (FileNode child : value.getChildren()) {
serialize(child, gen, provider); // Recursively serialize children
}
gen.writeEndArray();
}
gen.writeEndObject();
}
}
```
```scala
// register a new custom module and add the custom serializer into it
val customSerializerModule = new SimpleModule("CustomSerializers")
customSerializerModule.addSerializer(classOf[FileNode], new FileNodeSerializer())
bootstrap.getObjectMapper.registerModule(customSerializerModule)
```
To give an example, the JSON representation of the file tree:
```
a.csv
1.txt
dir
- dir/1.pdf
```
will be:
```json
{
"fileNodes": [
{
"path": "a.csv",
"isFile": true
},
{
"path": "1.txt",
"isFile": true
},
{
"path": "dir",
"isFile": false,
"children": [
{
"path": "dir/1.pdf",
"isFile": true
}
]
}
]
}
```
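The recursive shape that `FileNodeSerializer` produces can be reproduced in a few lines of plain Scala. This is a stand-in sketch without Jackson; `Node`, `FileLeaf`, and `Dir` are illustrative types, not the PR's `FileNode`:

```scala
// Stand-in model of the file tree; not the PR's FileNode class.
sealed trait Node { def path: String }
final case class FileLeaf(path: String) extends Node
final case class Dir(path: String, children: List[Node]) extends Node

// Mirrors the serializer's logic: files get {path, isFile}; directories
// additionally get a recursively serialized "children" array.
def toJson(n: Node): String = n match {
  case FileLeaf(p)  => s"""{"path":"$p","isFile":true}"""
  case Dir(p, kids) => s"""{"path":"$p","isFile":false,"children":[${kids.map(toJson).mkString(",")}]}"""
}

// The example tree from above: a.csv, 1.txt, and dir containing dir/1.pdf.
val tree = List(FileLeaf("a.csv"), FileLeaf("1.txt"), Dir("dir", List(FileLeaf("dir/1.pdf"))))
val json = s"""{"fileNodes":[${tree.map(toJson).mkString(",")}]}"""
```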
**shengquan-ni** added a commit that referenced this pull request on Feb 23, 2024:
> **IMPORTANT NOTE:** _Due to the introduction of datasets in #2391, we need to add a new dataset search query builder to this PR. However, since the dataset changes are not fully merged yet, we decided not to support datasets as a searchable resource. After the dataset changes are all merged, another PR will add that support._
>
> **Background:** In Texera, we have 3 resource types: File, Workflow, and Project. Each resource has its own access control and schema. We want to be able to search all resources using a single search input box. Our approach is to create a unified schema by unioning all the different schemas into one big query. This approach also simplifies the handling of `offset` and `limit`.
>
> **Refactoring:** This PR breaks the construction of the full-text search query into several components for better maintainability:
> 1. `FulltextSearchQueryUtils` contains helper functions that formulate `where` conditions from query parameters.
> 2. `UnifiedResourceSchema` provides the unified schema; each resource maps its own schema to it.
> 3. `SearchQueryBuilder` provides a general framework for building a search query for one resource type. We have `FileSearchQueryBuilder`, `ProjectSearchQueryBuilder`, and `WorkflowSearchQueryBuilder` for the existing 3 resource types.
> 4. `searchAllResources` in `DashBoardResouce` is the endpoint of the full-text search. It unifies all the results and returns them to the frontend.
>
> **Note:** To improve the quality of search results, I added a substring search (`LIKE`) condition to the query. If it degrades performance, we should remove it.
**chenlica** pushed a commit that referenced this pull request on Dec 28, 2025:
> ### What changes were proposed in this PR?
> This PR proposes to remove the unused `retrieveDatasetSingleFile()` endpoint (`GET /api/dataset/file`), which allowed unauthenticated downloads of non-downloadable datasets.
>
> ### Any related issues, documentation, discussions?
> The endpoint was introduced in PR #2391, which adds dataset APIs to the webserver. It was then modified in PR #2719, which removes the concept of `Environment`.
>
> ### How was this PR tested?
> Manually tested:
> <img width="690" height="404" alt="Screenshot 2025-12-27 at 1 15 21 AM" src="https://github.com/user-attachments/assets/91bea787-d447-4abe-ad39-74eb581fa657" />
>
> ### Was this PR authored or co-authored using generative AI tooling?
> No.
**carloea2** pushed a commit to carloea2/texera that referenced this pull request on Jan 6, 2026 (same commit message as above).