Skip to content

Conversation

@bobbai00
Copy link
Contributor

This PR refactors the logic of resolving a user-given filename.

Previous Logic and the problem

1. FileResolver.resolve not consistent across different file types

For FileResolver.resolve

  • input parameter: user-given filename
  • output: Either[String, DatasetFileDocument]

This logic is not consistent, as it return either a string, which is the URI of the local file, or a DatasetFileDocument, which is the fileHandle of a file in dataset.

We want to make this logic consistent, meaning that it only output one type, URI, for all kinds of files.

2. FileHandle is opened when doing setContext for ScanSourceOp, not before actually reading it

Currently for the ScanSourceOpDesc, the setContext function opens the fileHandle,

override def setContext(workflowContext: WorkflowContext): Unit = {
    super.setContext(workflowContext)

    if (fileName.isEmpty) {
      throw new RuntimeException("no input file name")
    }

    // Resolve the file and assign the result to fileHandle
    fileHandle = FileResolver.resolve(fileName.get)
  }

This is not proper, as the file is not read at this moment yet. The ScanSourceOpDesc should carry fileUri, instead of fileHandle. And when doing inferSchema, or OpExec executing, will the fileHandle be created.

New Logic

For problem 1, FileResolver.resolve will return URI instead of FileHandle

For problem 2,

  • FileResolve.open will return the FileHandle
  • ScanSourceOpDesc change the fileHandle: FileHandle to fileUri: String
  • A new method LogicalPlan.resolveScanSourceOpFileName is added to resolving user-given filenames to URI, this method will be called when doing workflow compilation
  • fileHandle will be created by FileResolve.open right before the inferSchema and PhysicalOp executing.

@bobbai00 bobbai00 added the refactor Refactor the code label Oct 27, 2024
@bobbai00 bobbai00 requested a review from Yicong-Huang October 27, 2024 16:55
@bobbai00 bobbai00 self-assigned this Oct 27, 2024
bobbai00 added a commit that referenced this pull request Oct 28, 2024
…urce operators. (#2975)

This PR refines the logic of showing selected dataset and version in
scan source operators, if a file path is being selected.

It makes it no longer rely on the backend's API, a cleaner
implementation. This is the base of the following refactor PR like #2972
bobbai00 added a commit that referenced this pull request Oct 28, 2024
This PR refactors the API of downloading a version of a dataset. The
purpose of this refactoring is for future's refactor PRs, like #2972 .

### New API

GET `/version-zip`
- did: dataset's ID. must be provided, specify which dataset
- dvid: dataset version's ID, optional. If provided, retrieve this
version; otherwise, retrieve the latest version.
bobbai00 and others added 4 commits October 28, 2024 13:08
# Conflicts:
#	core/amber/src/main/scala/edu/uci/ics/amber/engine/common/storage/ReadonlyLocalFileDocument.scala
#	core/amber/src/main/scala/edu/uci/ics/amber/engine/common/storage/ReadonlyVirtualDocument.scala
#	core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/dataset/DatasetResource.scala
@bobbai00 bobbai00 requested a review from Yicong-Huang October 29, 2024 00:20
Copy link
Contributor

@Yicong-Huang Yicong-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. left some comments. We can manage URIs as string, primitive types are more easy to use accross languages. Not very clear of the life cycle to open the file: 1) given a resolved URI, can we open the file (either local or remote) with new File(resolvedUri)? if so maybe we don't need the FileHandle class and we can use standard File object? 2) no matter if we are managing FileHandle or using standard File object, please make sure the opened files are closed properly.

Copy link
Contributor

@Yicong-Huang Yicong-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bobbai00 bobbai00 merged commit fe684b5 into master Oct 30, 2024
@bobbai00 bobbai00 deleted the jiadong-refactor-resolve branch October 30, 2024 01:29
PurelyBlank pushed a commit that referenced this pull request Dec 4, 2024
…urce operators. (#2975)

This PR refines the logic of showing selected dataset and version in
scan source operators, if a file path is being selected.

It makes it no longer rely on the backend's API, a cleaner
implementation. This is the base of the following refactor PR like #2972
PurelyBlank pushed a commit that referenced this pull request Dec 4, 2024
This PR refactors the API of downloading a version of a dataset. The
purpose of this refactoring is for future's refactor PRs, like #2972 .

### New API

GET `/version-zip`
- did: dataset's ID. must be provided, specify which dataset
- dvid: dataset version's ID, optional. If provided, retrieve this
version; otherwise, retrieve the latest version.
PurelyBlank pushed a commit that referenced this pull request Dec 4, 2024
…SourceOp (#2972)

This PR refactors the logic of resolving a user-given filename. 

### Previous Logic and the problem
#### 1. `FileResolver.resolve` not consistent across different file
types
For `FileResolver.resolve`
- input parameter: user-given filename
- output: Either[String, DatasetFileDocument]

This logic is not consistent, as it return either a string, which is the
URI of the local file, or a `DatasetFileDocument`, which is the
fileHandle of a file in dataset.

We want to make this logic consistent, meaning that it only output one
type, URI, for all kinds of files.

#### 2. FileHandle is opened when doing `setContext` for ScanSourceOp,
not before actually reading it

Currently for the ScanSourceOpDesc, the setContext function opens the
fileHandle,
```
override def setContext(workflowContext: WorkflowContext): Unit = {
    super.setContext(workflowContext)

    if (fileName.isEmpty) {
      throw new RuntimeException("no input file name")
    }

    // Resolve the file and assign the result to fileHandle
    fileHandle = FileResolver.resolve(fileName.get)
  }
```
This is not proper, as the file is not read at this moment yet. The
ScanSourceOpDesc should carry `fileUri`, instead of `fileHandle`. And
when doing `inferSchema`, or OpExec executing, will the `fileHandle` be
created.

### New Logic

For problem 1, `FileResolver.resolve` will return `URI` instead of
`FileHandle`

For problem 2, 
- `FileResolve.open` will return the FileHandle
- ScanSourceOpDesc change the `fileHandle: FileHandle` to `fileUri:
String`
- A new method `LogicalPlan.resolveScanSourceOpFileName` is added to
resolving user-given filenames to URI, this method will be called when
doing workflow compilation
- fileHandle will be created by `FileResolve.open` right before the
`inferSchema` and PhysicalOp executing.

---------

Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactor Refactor the code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants