Add Environment WebServer APIs #2434

bobbai00 · 2024-02-29T00:31:17Z

This PR introduces the Environment APIs to the web server. It depends on the dataset feature (#2413 and #2391).

Designs

Each workflow gets a unique environment, created when the workflow is created and persisted.
An environment currently stores the datasets, and the dataset versions, that are visible to the workflow.
When a workflow is executed, the environment id is recorded as part of the workflow execution record.
When using the source scan operator, a workflow can ONLY scan files in the datasets specified by its environment.
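
The visibility rule above can be illustrated with a minimal in-memory sketch. The case classes and the `canScan` helper are hypothetical names chosen for illustration; they are not the classes added by this PR.

```scala
// Hypothetical in-memory model; these names are illustrative, not Texera's classes.
final case class DatasetVersion(did: Int, dvid: Int, files: Set[String])
final case class Environment(eid: Int, datasets: Map[Int, DatasetVersion]) // keyed by did

object EnvironmentScope {
  // A file is scannable only if some dataset version in the environment contains it.
  def canScan(env: Environment, filePath: String): Boolean =
    env.datasets.values.exists(_.files.contains(filePath))

  def main(args: Array[String]): Unit = {
    val env = Environment(1, Map(7 -> DatasetVersion(7, 3, Set("/sales/2023.csv"))))
    println(canScan(env, "/sales/2023.csv")) // file belongs to a dataset in the environment
    println(canScan(env, "/tmp/other.csv"))  // file is outside the environment
  }
}
```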

DB Schema Changes

Three new tables are added:

CREATE TABLE IF NOT EXISTS environment
(
    `eid`              INT UNSIGNED AUTO_INCREMENT NOT NULL,
    `owner_uid`        INT UNSIGNED NOT NULL,
    `name`             VARCHAR(128) NOT NULL DEFAULT 'Untitled Environment',
    `description`      VARCHAR(1000),
    `creation_time`    TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`eid`),
    FOREIGN KEY (`owner_uid`) REFERENCES `user` (`uid`) ON DELETE CASCADE
) ENGINE = INNODB;

CREATE TABLE IF NOT EXISTS environment_of_workflow
(
    `eid`              INT UNSIGNED NOT NULL,
    `wid`              INT UNSIGNED NOT NULL,
    PRIMARY KEY (`eid`, `wid`),
    FOREIGN KEY (`wid`) REFERENCES `workflow` (`wid`) ON DELETE CASCADE,
    FOREIGN KEY (`eid`) REFERENCES `environment` (`eid`) ON DELETE CASCADE
) ENGINE = INNODB;

CREATE TABLE IF NOT EXISTS dataset_of_environment
(
    `did`                   INT UNSIGNED NOT NULL,
    `eid`                   INT UNSIGNED NOT NULL,
    `dvid`                  INT UNSIGNED NOT NULL,
    PRIMARY KEY (`did`, `eid`),
    FOREIGN KEY (`eid`) REFERENCES `environment` (`eid`) ON DELETE CASCADE,
    FOREIGN KEY (`dvid`) REFERENCES `dataset_version` (`dvid`) ON DELETE CASCADE
) ENGINE = INNODB;

environment stores the environment info.
environment_of_workflow records which environment each workflow belongs to. Currently, workflows and environments are in 1-to-1 correspondence.
dataset_of_environment records which dataset(s), and which version of each, are visible to the workflow that uses this environment.
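
As a rough illustration of how the two relation tables fit together, the sketch below models them as in-memory maps. `Db`, `visibleDatasets`, and `addDataset` are hypothetical names, not part of this PR; the only facts taken from the DDL are the columns and the composite primary key `(did, eid)`, which allows at most one version of a dataset per environment.

```scala
object EnvironmentTables {
  // environment_of_workflow: wid -> eid (currently 1-to-1)
  // dataset_of_environment: (did, eid) -> dvid, mirroring PRIMARY KEY (did, eid)
  final case class Db(
      environmentOfWorkflow: Map[Int, Int],
      datasetOfEnvironment: Map[(Int, Int), Int]
  )

  // Returns the (did, dvid) pairs visible to a workflow, or Nil if it has no environment.
  def visibleDatasets(db: Db, wid: Int): List[(Int, Int)] =
    db.environmentOfWorkflow.get(wid).toList.flatMap { eid =>
      db.datasetOfEnvironment.toList.collect { case ((did, `eid`), dvid) => (did, dvid) }.sorted
    }

  // Rejects a second row for the same (did, eid), as the composite primary key would.
  def addDataset(db: Db, did: Int, eid: Int, dvid: Int): Either[String, Db] =
    if (db.datasetOfEnvironment.contains((did, eid)))
      Left(s"dataset $did is already in environment $eid")
    else
      Right(db.copy(datasetOfEnvironment = db.datasetOfEnvironment + ((did, eid) -> dvid)))
}
```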

A new column is added to the workflow_executions table:

-- Add the `environment_eid` column to the `workflow_executions` table
ALTER TABLE workflow_executions
ADD COLUMN `environment_eid` INT UNSIGNED;

-- Add the foreign key constraint for `environment_eid`
ALTER TABLE workflow_executions
ADD CONSTRAINT fk_environment_eid
FOREIGN KEY (`environment_eid`) REFERENCES environment(`eid`) ON DELETE SET NULL;

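The new column environment_eid records which environment was used for that workflow execution. Because the foreign key uses ON DELETE SET NULL, deleting an environment keeps its execution records and only clears this column.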

New APIs

Several environment-related APIs are added.

    POST    /api/environment/create                       // create an environment
    POST    /api/environment/delete                       // delete an environment
    GET     /api/environment/{eid}                        // get the environment info by eid
    GET     /api/environment/{eid}/dataset/list           // list all datasets (datasetID, datasetVersionID) of the environment
    POST    /api/environment/{eid}/dataset/add            // add a dataset to the environment
    GET     /api/environment/{eid}/dataset/list/details   // list detailed info of the datasets in the environment
    POST    /api/environment/{eid}/dataset/remove         // remove a dataset from the environment
    GET     /api/environment/{eid}/files/{query:.*}       // for file auto-complete: retrieve the file paths matching the query
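
For illustration, the two simplest GET endpoints could be addressed from a JVM client roughly as follows. This is only a sketch: the base URL is an assumed local deployment, `EnvironmentApiClient` is a hypothetical name, and authentication headers and POST request bodies are omitted because this description does not specify them.

```scala
import java.net.URI
import java.net.http.HttpRequest

object EnvironmentApiClient {
  // Assumed base URL of a local deployment; adjust to the actual host and port.
  val baseUrl = "http://localhost:8080"

  // GET /api/environment/{eid}
  def getEnvironment(eid: Int): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$baseUrl/api/environment/$eid")).GET().build()

  // GET /api/environment/{eid}/dataset/list
  def listDatasets(eid: Int): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$baseUrl/api/environment/$eid/dataset/list")).GET().build()
}
```

The requests are only built here; sending them would go through `java.net.http.HttpClient` with whatever credentials the deployment requires.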

Existing API Updates

I changed the implementation of persistWorkflow in WorkflowResource. Specifically, when a workflow is persisted and has no corresponding environment, an environment is created for it. The code snippet is:

    val wid = workflow.getWid
    // check if the runtime environment of this workflow exists, if not, create one
    if (!doesWorkflowHaveEnvironment(context, wid)) {
      // create an environment, and associate this environment to this workflow
      val createdEnvironment = createEnvironment(
        context,
        uid,
        "Environment of Workflow #%d %s".format(wid.intValue(), workflow.getName),
        "Runtime Environment of Workflow #%d %s".format(wid.intValue(), workflow.getName)
      )

      environmentOfWorkflowDao.insert(new EnvironmentOfWorkflow(createdEnvironment.getEid, wid))
    }
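
The effect of this check is that persisting the same workflow repeatedly creates its environment only once. A minimal in-memory sketch of that create-if-absent behavior follows; the maps stand in for the `environment` and `environment_of_workflow` tables, and `ensureEnvironment` is a hypothetical helper, not the code in this PR.

```scala
import scala.collection.mutable

object PersistWorkflowSketch {
  // Hypothetical in-memory stand-ins for the environment and environment_of_workflow tables.
  private val environments = mutable.Map.empty[Int, String]       // eid -> name
  private val environmentOfWorkflow = mutable.Map.empty[Int, Int] // wid -> eid
  private var nextEid = 1

  // Returns the eid for the workflow, creating an environment only on the first call.
  def ensureEnvironment(wid: Int, workflowName: String): Int =
    environmentOfWorkflow.getOrElseUpdate(wid, {
      val eid = nextEid
      nextEid += 1
      environments(eid) = "Environment of Workflow #%d %s".format(wid, workflowName)
      eid
    })

  def environmentName(eid: Int): Option[String] = environments.get(eid)
}
```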

aglinxinyuan

LGTM!

This PR introduces the GUI of `environment` and some fixes to previous `dataset` features. For the backend of `environment`, see #2434

### Features

- Environment Tab on the left panel

  ![2024-03-04 23 26 19](https://github.com/Texera/texera/assets/43344272/09738edd-aa91-4a1f-a915-4d41f25afc9b)

- Auto Complete only from the files in datasets

  ![2024-03-04 23 27 32](https://github.com/Texera/texera/assets/43344272/36178dea-24fd-41f3-92e2-800c10f21ae5)

### Implementation Details

1. The changes on the `ScanSourceOperatorDesc`

   Previously, the source file was located by its absolute path and scanned into the workflow. Now, since all files are within the dataset and managed by JGit, the physical file may not be directly available. The current solution is to write the target file into a temporary file, which is identified by an absolute path generated by the JVM. The file will be deleted when the JVM quits.

2. When a workflow execution request is submitted, the webserver will also persist the environment eid to the `WorkflowExecutions` table.
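
The temporary-file approach described in item 1 can be sketched with standard JDK calls. This is an illustration under assumptions, not the PR's implementation: `TempFileMaterializer`, the file prefix, and the suffix are made up here, and the input stream stands in for whatever JGit returns for the dataset file.

```scala
import java.io.{File, InputStream}
import java.nio.file.{Files, StandardCopyOption}

object TempFileMaterializer {
  // Copies a stream into a JVM-generated temporary file and returns its absolute path.
  // The prefix and suffix are illustrative choices.
  def materialize(in: InputStream): String = {
    val tmp: File = File.createTempFile("dataset-file-", ".tmp")
    tmp.deleteOnExit() // request deletion when the JVM terminates normally
    try Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    tmp.getAbsolutePath
  }
}
```

Note that `deleteOnExit` only takes effect on normal JVM termination, so a crashed process can leave temporary files behind.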

This PR introduces the GUI of environment and some fixes to previous dataset features. For the backend of environment, see #2434

After introducing the environment, the way of uploading data and scanning data using a workflow is presented in this [blog](https://github.com/Texera/texera/wiki/Create-Dataset,-upload-data-to-it-and-use-it-in-Workflow). For more specific information, there is a [demo video](https://www.youtube.com/watch?app=desktop&v=EJ269aWnHv4&ab_channel=TexeraProject).

## Features

- View the Environment information at the workspace

  ![2024-03-21 23 02 09](https://github.com/Texera/texera/assets/43344272/23d76935-bcf1-4879-a06e-628f556609cf)

- Add dataset to the current environment

  ![2024-03-21 23 03 14](https://github.com/Texera/texera/assets/43344272/4277e236-6d8b-4127-9be5-cb0be4965e58)

- Preview Data File in Dataset of environment

  ![2024-03-21 23 04 08](https://github.com/Texera/texera/assets/43344272/c5db33ed-9bef-46f4-bd63-3b17dd7cde9d)

- Scan Files that are in the datasets

  ![2024-03-21 23 05 53](https://github.com/Texera/texera/assets/43344272/b7ceccbd-6487-4731-bec3-3d254a73dcb3)

## Implementation Details

### The changes on the ScanSourceOperatorDesc

Previously, the source file was located by its absolute path and scanned into the workflow. Now, since all files are within the dataset and managed by JGit, the physical file may not be directly available. Therefore, a couple of changes are made to the way the source operator scans the file.

1. In the source operator descriptor: ScanSourceOpDesc

   A new member variable is added:

   ```scala
   @JsonIgnore
   var filePath: Option[String] = None

   // new
   @JsonIgnore
   var datasetFileDesc: Option[DatasetFileDesc] = None
   ```

   The class `DatasetFileDesc` contains the softlink to the file in the dataset, and has utilities to read the file as a stream or as a temporary file.

   `datasetFileDesc` is initialized when `setContext` is called:

   ```scala
   if (getContext.userId.isDefined) {
     val environmentEid = WorkflowResource.getEnvironmentEidOfWorkflow(
       UInteger.valueOf(workflowContext.workflowId.id)
     )
     // if user system is defined, a datasetFileDesc will be initialized, which is the handle of reading file from the dataset
     datasetFileDesc = Some(
       getEnvironmentDatasetFilePathAndVersion(getContext.userId.get, environmentEid, fileName.get)
     )
   }
   ```

2. For each source operator executor, e.g. CSVScanSourceExec

   A new parameter is added in the constructor:

   ```scala
   class CSVScanSourceOpExec private[csv] (
       filePath: String,
       datasetFileDesc: DatasetFileDesc,
   ```

   If `datasetFileDesc` is non-null (i.e. the user system is enabled), the input stream reader is created using `datasetFileDesc.fileInputStream`:

   ```scala
   // this function create the input stream accordingly:
   // - if filePath is set, create the stream from the file
   // - if fileDesc is set, create the stream via JGit call
   def createInputStream(filePath: String, fileDesc: DatasetFileDesc): InputStream = {
     if (filePath != null && fileDesc != null) {
       throw new RuntimeException(
         "File Path and Dataset File Descriptor cannot present at the same time."
       )
     }
     if (filePath != null) {
       new FileInputStream(filePath)
     } else {
       // create stream from dataset file desc
       fileDesc.fileInputStream()
     }
   }
   ```

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>

bobbai00 added the webserver label Feb 29, 2024

bobbai00 requested a review from aglinxinyuan February 29, 2024 00:31

bobbai00 self-assigned this Feb 29, 2024

bobbai00 added 4 commits February 29, 2024 10:35

add db schemas

74cee03

migrate the first part of the changes

800a9d4

format

def34db

format

0047eb4

bobbai00 force-pushed the jiadong-introduce-environment-webserver branch from ceceeca to 0047eb4 Compare February 29, 2024 18:35

bobbai00 added 2 commits February 29, 2024 15:47

fix ddl

a910e24

fix ddl

f7c91ed

aglinxinyuan approved these changes Mar 1, 2024

bobbai00 merged commit 066da03 into master Mar 1, 2024

bobbai00 deleted the jiadong-introduce-environment-webserver branch March 1, 2024 05:20

bobbai00 mentioned this pull request Mar 5, 2024

Add Environment GUI #2444

Merged

bobbai00 added the ddl-change Changes to the TexeraDB DDL label Mar 6, 2024

bobbai00 mentioned this pull request Mar 18, 2024

Add environment GUI and new mechanism to scan source files in workflow #2481

Closed

bobbai00 mentioned this pull request Mar 27, 2024

Enable environment tab and file scan from files in environment #2515

Merged
