Conversation

@bobbai00 bobbai00 commented Feb 29, 2024

This PR introduces the Environment APIs to the web server. It depends on the dataset feature (#2413 and #2391).

Designs

  1. Each workflow gets a unique environment when the workflow is created and persisted.
  2. An environment currently stores the datasets, and their versions, that are visible to the workflow.
  3. When a workflow is executed, the environment ID is recorded as part of the workflow_executions record.
  4. When using the source scan operator, a workflow can ONLY scan files in the datasets included in its environment.
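
As a rough illustration of design points 2 and 4, the visibility rule can be modeled in memory as below. This is a minimal sketch, not the PR's code: `DatasetVersionRef`, `Environment`, `canScan`, and `pinnedVersion` are hypothetical names chosen for the example.

```scala
// Hypothetical in-memory model; these are not Texera's classes.
final case class DatasetVersionRef(did: Int, dvid: Int)
final case class Environment(eid: Int, datasets: Set[DatasetVersionRef])

object EnvironmentVisibility {
  // A scan is allowed only if the file's dataset is attached to the environment.
  def canScan(env: Environment, did: Int): Boolean =
    env.datasets.exists(_.did == did)

  // The dataset version recorded in the environment, if the dataset is attached.
  def pinnedVersion(env: Environment, did: Int): Option[Int] =
    env.datasets.find(_.did == did).map(_.dvid)
}
```

For example, an environment holding only dataset 10 would reject a scan of a file in dataset 11.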

DB Schema Changes

Three new tables are added:

CREATE TABLE IF NOT EXISTS environment
(
    `eid`              INT UNSIGNED AUTO_INCREMENT NOT NULL,
    `owner_uid`        INT UNSIGNED NOT NULL,
    `name`             VARCHAR(128) NOT NULL DEFAULT 'Untitled Environment',
    `description`      VARCHAR(1000),
    `creation_time`    TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`eid`),
    FOREIGN KEY (`owner_uid`) REFERENCES `user` (`uid`) ON DELETE CASCADE
) ENGINE = INNODB;

CREATE TABLE IF NOT EXISTS environment_of_workflow
(
    `eid`              INT UNSIGNED NOT NULL,
    `wid`              INT UNSIGNED NOT NULL,
    PRIMARY KEY (`eid`, `wid`),
    FOREIGN KEY (`wid`) REFERENCES `workflow` (`wid`) ON DELETE CASCADE,
    FOREIGN KEY (`eid`) REFERENCES `environment` (`eid`) ON DELETE CASCADE
) ENGINE = INNODB;

CREATE TABLE IF NOT EXISTS dataset_of_environment
(
    `did`                   INT UNSIGNED NOT NULL,
    `eid`                   INT UNSIGNED NOT NULL,
    `dvid`                  INT UNSIGNED NOT NULL,
    PRIMARY KEY (`did`, `eid`),
    FOREIGN KEY (`eid`) REFERENCES `environment` (`eid`) ON DELETE CASCADE,
    FOREIGN KEY (`dvid`) REFERENCES `dataset_version` (`dvid`) ON DELETE CASCADE
) ENGINE = INNODB;

  • environment stores the environment info.
  • environment_of_workflow records which environment each workflow belongs to. Currently, each workflow corresponds to exactly one environment.
  • dataset_of_environment records which dataset(s), and which version of each, are visible to the workflow that uses this environment.

A new column is added to workflow_executions:

-- Add the `environment_eid` column to the `workflow_executions` table
ALTER TABLE workflow_executions
ADD COLUMN `environment_eid` INT UNSIGNED;

-- Add the foreign key constraint for `environment_eid`
ALTER TABLE workflow_executions
ADD CONSTRAINT fk_environment_eid
FOREIGN KEY (`environment_eid`) REFERENCES environment(`eid`) ON DELETE SET NULL;

The new column environment_eid records which environment was used for that workflow execution.
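
Because the column is nullable and the foreign key uses ON DELETE SET NULL, deleting an environment keeps its execution records but clears their reference. A minimal sketch of that behavior, using a hypothetical `WorkflowExecutionRow` model that shows only the new column:

```scala
// Hypothetical row model for workflow_executions; only the new column is modeled.
final case class WorkflowExecutionRow(executionId: Int, environmentEid: Option[Int])

object ExecutionEnvironment {
  // Mimics ON DELETE SET NULL: rows that referenced the deleted environment
  // are kept, but their environment reference becomes None.
  def afterEnvironmentDeleted(
      rows: List[WorkflowExecutionRow],
      deletedEid: Int
  ): List[WorkflowExecutionRow] =
    rows.map { r =>
      if (r.environmentEid.contains(deletedEid)) r.copy(environmentEid = None) else r
    }
}
```

Code that reads `environment_eid` should therefore treat a missing value as "environment no longer exists" rather than as an error.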

New APIs

Several APIs related to environments are added.

    POST    /api/environment/create                       // create an environment
    POST    /api/environment/delete                       // delete an environment
    GET     /api/environment/{eid}                        // get the environment info by eid
    GET     /api/environment/{eid}/dataset/list           // list all datasets (datasetID, datasetVersionID) of the environment
    POST    /api/environment/{eid}/dataset/add            // add a dataset to the environment
    GET     /api/environment/{eid}/dataset/list/details   // list detailed info of the datasets in the environment
    POST    /api/environment/{eid}/dataset/remove         // remove a dataset from the environment
    GET     /api/environment/{eid}/files/{query:.*}       // for file auto-complete: retrieve the file paths matching the query
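
A client could build requests for the read-only endpoints roughly as follows. This is a sketch under assumptions: the base URL `http://localhost:8080` is a placeholder, the `EnvironmentApiRequests` object is hypothetical, and authentication headers and the request bodies of the POST endpoints (not specified in this PR description) are omitted.

```scala
import java.net.URI
import java.net.http.HttpRequest

object EnvironmentApiRequests {
  // Assumed base URL of a local deployment; adjust to the actual host and port.
  val baseUrl = "http://localhost:8080"

  // GET /api/environment/{eid}
  def getEnvironment(eid: Int): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$baseUrl/api/environment/$eid")).GET().build()

  // GET /api/environment/{eid}/dataset/list
  def listDatasets(eid: Int): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$baseUrl/api/environment/$eid/dataset/list")).GET().build()
}
```

The resulting requests can be sent with `java.net.http.HttpClient` once the deployment's authentication scheme is added.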

Existing API Updates

I changed the implementation of persistWorkflow in WorkflowResource. Specifically, if the workflow has no corresponding environment when it is persisted, an environment is created for it. The code snippet is:

    val wid = workflow.getWid
    // check if the runtime environment of this workflow exists, if not, create one
    if (!doesWorkflowHaveEnvironment(context, wid)) {
      // create an environment, and associate this environment to this workflow
      val createdEnvironment = createEnvironment(
        context,
        uid,
        "Environment of Workflow #%d %s".format(wid.intValue(), workflow.getName),
        "Runtime Environment of Workflow #%d %s".format(wid.intValue(), workflow.getName)
      )

      environmentOfWorkflowDao.insert(new EnvironmentOfWorkflow(createdEnvironment.getEid, wid))
    }
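
The effect of this check is that persisting the same workflow repeatedly creates at most one environment. A minimal in-memory sketch of that behavior follows; the maps stand in for the `environment` and `environment_of_workflow` tables, and `ensureEnvironment` and `nameOf` are hypothetical helpers, not functions from the PR.

```scala
import scala.collection.mutable

object EnvironmentOnPersist {
  final case class Env(eid: Int, name: String)

  // Hypothetical stand-ins for the environment and environment_of_workflow tables.
  private val environments = mutable.Map.empty[Int, Env]  // eid -> environment
  private val envOfWorkflow = mutable.Map.empty[Int, Int] // wid -> eid
  private var nextEid = 1

  // Returns the eid bound to the workflow, creating an environment only on first persist.
  def ensureEnvironment(wid: Int, workflowName: String): Int =
    envOfWorkflow.getOrElseUpdate(wid, {
      val env = Env(nextEid, "Environment of Workflow #%d %s".format(wid, workflowName))
      nextEid += 1
      environments(env.eid) = env
      env.eid
    })

  def nameOf(eid: Int): Option[String] = environments.get(eid).map(_.name)
}
```

Calling `ensureEnvironment` twice with the same `wid` returns the same `eid`, which mirrors the 1-to-1 correspondence described above.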

@bobbai00 bobbai00 self-assigned this Feb 29, 2024
@bobbai00 bobbai00 force-pushed the jiadong-introduce-environment-webserver branch from ceceeca to 0047eb4 Compare February 29, 2024 18:35
@aglinxinyuan aglinxinyuan left a comment

LGTM!

@bobbai00 bobbai00 merged commit 066da03 into master Mar 1, 2024
@bobbai00 bobbai00 deleted the jiadong-introduce-environment-webserver branch March 1, 2024 05:20
@bobbai00 bobbai00 mentioned this pull request Mar 5, 2024
bobbai00 added a commit that referenced this pull request Mar 6, 2024
This PR introduces the GUI of `environment` and some fixes to previous
`dataset` features. For the backend of `environment`, see #2434

### Features

- Environment Tab on the left panel
![2024-03-04 23 26 19](https://github.com/Texera/texera/assets/43344272/09738edd-aa91-4a1f-a915-4d41f25afc9b)

- Auto-complete only from the files in datasets
![2024-03-04 23 27 32](https://github.com/Texera/texera/assets/43344272/36178dea-24fd-41f3-92e2-800c10f21ae5)

### Implementation Details

1. The changes on the `ScanSourceOperatorDesc`

Previously, the source file was located by its absolute path and scanned
into the workflow. Now that all files live in datasets managed by JGit,
the physical file may not be directly available. The current solution is
to write the target file to a temporary file, identified by an absolute
path generated by the JVM. The temporary file is deleted when the JVM
exits.

2. When a workflow execution request is submitted, the web server also
persists the environment eid to the `WorkflowExecutions` table.
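
The temporary-file approach from item 1 can be sketched with standard JDK calls as
below. This is an illustration under assumptions, not the commit's code: the
`TempFileMaterializer` object, its `materialize` method, and the `dataset-file-`
prefix are hypothetical.

```scala
import java.io.{File, InputStream}
import java.nio.file.{Files, StandardCopyOption}

object TempFileMaterializer {
  // Copies a stream (e.g. a file read out of a dataset version) into a
  // temporary file and returns it; the stream is closed afterwards.
  def materialize(in: InputStream, suffix: String): File = {
    val tmp = Files.createTempFile("dataset-file-", suffix)
    try Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    val file = tmp.toFile
    file.deleteOnExit() // requests deletion on normal JVM termination
    file
  }
}
```

Note that `deleteOnExit` only takes effect on normal JVM termination, so a
long-running server accumulates temporary files until it shuts down.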
@bobbai00 bobbai00 added the ddl-change Changes to the TexeraDB DDL label Mar 6, 2024
bobbai00 added a commit that referenced this pull request Mar 28, 2024
This PR introduces the GUI of environment and some fixes to previous
dataset features. For the backend of environment, see
#2434

With the environment in place, the way to upload data and scan it from a
workflow is described in this
[blog](https://github.com/Texera/texera/wiki/Create-Dataset,-upload-data-to-it-and-use-it-in-Workflow).
For more details, see the
[demo video](https://www.youtube.com/watch?app=desktop&v=EJ269aWnHv4&ab_channel=TexeraProject).

## Features

- View the environment information in the workspace
![2024-03-21 23 02 09](https://github.com/Texera/texera/assets/43344272/23d76935-bcf1-4879-a06e-628f556609cf)

- Add a dataset to the current environment
![2024-03-21 23 03 14](https://github.com/Texera/texera/assets/43344272/4277e236-6d8b-4127-9be5-cb0be4965e58)

- Preview a data file in a dataset of the environment
![2024-03-21 23 04 08](https://github.com/Texera/texera/assets/43344272/c5db33ed-9bef-46f4-bd63-3b17dd7cde9d)

- Scan files that are in the datasets
![2024-03-21 23 05 53](https://github.com/Texera/texera/assets/43344272/b7ceccbd-6487-4731-bec3-3d254a73dcb3)


## Implementation Details
### The changes on the ScanSourceOperatorDesc
Previously, the source file was located by its absolute path and scanned
into the workflow. Now that all files live in datasets managed by JGit,
the physical file may not be directly available. Therefore, a couple of
changes are made to the way the source operator scans the file.

1. In the source operator descriptor, `ScanSourceOpDesc`,
a new member variable is added:
```scala
  @JsonIgnore
  var filePath: Option[String] = None

// new
  @JsonIgnore
  var datasetFileDesc: Option[DatasetFileDesc] = None
```
The class `DatasetFileDesc` contains the soft link to the file in the
dataset and has utilities to read the file as a stream or a temporary file.

`datasetFileDesc` will be initialized when `setContext` is called:
```scala
    if (getContext.userId.isDefined) {
      val environmentEid = WorkflowResource.getEnvironmentEidOfWorkflow(
        UInteger.valueOf(workflowContext.workflowId.id)
      )
      // if user system is defined, a datasetFileDesc will be initialized, which is the handle of reading file from the dataset
      datasetFileDesc = Some(
        getEnvironmentDatasetFilePathAndVersion(getContext.userId.get, environmentEid, fileName.get)
      )
    }
```

2. For each source operator executor, e.g. `CSVScanSourceOpExec`,

a new parameter is added to the constructor:
```scala
class CSVScanSourceOpExec private[csv] (
    filePath: String,
    datasetFileDesc: DatasetFileDesc,
```

If `datasetFileDesc` is non-null (i.e. the user system is enabled), the
input stream reader is created from `datasetFileDesc.fileInputStream`:
```scala
  // This function creates the input stream from whichever source is set:
  // - if filePath is set, create the stream from the file
  // - if fileDesc is set, create the stream via a JGit call
  def createInputStream(filePath: String, fileDesc: DatasetFileDesc): InputStream = {
    if (filePath != null && fileDesc != null) {
      throw new RuntimeException(
        "File Path and Dataset File Descriptor cannot present at the same time."
      )
    }
    if (filePath != null) {
      new FileInputStream(filePath)
    } else {
      // create stream from dataset file desc
      fileDesc.fileInputStream()
    }
  }
```
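
In the function above, if both arguments are null the call falls through to
`fileDesc.fileInputStream()` and fails with a `NullPointerException`. One
possible alternative, shown only as a sketch and not as the code in this
commit, models the two sources as `Option`s so that every combination is
handled explicitly. The `FileDesc` trait is a hypothetical stand-in for
`DatasetFileDesc`.

```scala
import java.io.{FileInputStream, InputStream}

object ScanSource {
  // Hypothetical stand-in for DatasetFileDesc; only the stream accessor is modeled.
  trait FileDesc { def fileInputStream(): InputStream }

  // Exactly one source must be provided; any other combination is rejected.
  def createInputStream(filePath: Option[String], fileDesc: Option[FileDesc]): InputStream =
    (filePath, fileDesc) match {
      case (Some(path), None) => new FileInputStream(path)
      case (None, Some(desc)) => desc.fileInputStream()
      case (Some(_), Some(_)) =>
        throw new IllegalArgumentException("file path and dataset file descriptor are both set")
      case (None, None) =>
        throw new IllegalArgumentException("neither file path nor dataset file descriptor is set")
    }
}
```

Since the descriptor already stores these fields as `Option[String]` and
`Option[DatasetFileDesc]`, passing them through unchanged would avoid the
conversion to null at the executor boundary.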

---------

Co-authored-by: Xinyuan Lin <xinyual3@uci.edu>

Labels

ddl-change (Changes to the TexeraDB DDL), webserver

2 participants