Decouple creating source file's handler with resolving as URI in ScanSourceOp #2972
This PR refactors the API for downloading a version of a dataset. The refactoring prepares for future refactoring PRs, such as #2972.

### New API

GET `/version-zip`
- did: the dataset's ID. Required; specifies which dataset to download.
- dvid: the dataset version's ID. Optional. If provided, that version is retrieved; otherwise, the latest version is retrieved.
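As a rough illustration of the `did`/`dvid` contract described above, the following sketch builds a request URI for the endpoint. The base URL is hypothetical, and passing `did` and `dvid` as query parameters is an assumption; the description only names the parameters, not how they are transmitted.

```scala
import java.net.URI

// Hypothetical helper, not part of the PR: builds a GET /version-zip request URI.
object VersionZipRequest {
  // Assumption: did and dvid travel as query parameters.
  // dvid is omitted when absent, which asks the server for the latest version.
  def uri(baseUrl: String, did: Int, dvid: Option[Int] = None): URI = {
    val query = (Seq(s"did=$did") ++ dvid.map(v => s"dvid=$v")).mkString("&")
    URI.create(s"${baseUrl.stripSuffix("/")}/version-zip?$query")
  }
}
```

Leaving `dvid` as `None` corresponds to the "retrieve the latest version" case.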
# Conflicts:
#	core/amber/src/main/scala/edu/uci/ics/amber/engine/common/storage/ReadonlyLocalFileDocument.scala
#	core/amber/src/main/scala/edu/uci/ics/amber/engine/common/storage/ReadonlyVirtualDocument.scala
#	core/amber/src/main/scala/edu/uci/ics/texera/web/resource/dashboard/user/dataset/DatasetResource.scala
Yicong-Huang
left a comment
LGTM. Left some comments. We can manage URIs as strings; primitive types are easier to use across languages. The life cycle of opening the file is not very clear to me:
1) Given a resolved URI, can we open the file (either local or remote) with new File(resolvedUri)? If so, maybe we don't need the FileHandle class and can use the standard File object.
2) Whether we manage a FileHandle or use the standard File object, please make sure the opened files are closed properly.
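On the second point, one common way to guarantee that an opened file is closed is `scala.util.Using` (Scala 2.13+). The sketch below is only an illustration with a hypothetical object name, and it covers local `file:` URIs only. Note that `java.io.File(URI)` likewise accepts only `file:` URIs, so a remote file would need a different opening mechanism. `readAllBytes` requires Java 9 or later.

```scala
import java.net.URI
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.{Try, Using}

// Hypothetical helper, not the PR's code.
object CloseOpenedFiles {
  // Reads the whole file behind a local file: URI.
  // Using closes the stream whether the read succeeds or throws,
  // and a failure to open the file is captured in the returned Try.
  def readAll(uri: URI): Try[String] =
    Using(Files.newInputStream(Paths.get(uri))) { in =>
      new String(in.readAllBytes(), StandardCharsets.UTF_8)
    }
}
```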
Yicong-Huang
left a comment
LGTM!
…SourceOp (#2972)

This PR refactors the logic of resolving a user-given filename.

### Previous Logic and the problem

#### 1. `FileResolver.resolve` is not consistent across different file types

For `FileResolver.resolve`:
- input parameter: user-given filename
- output: Either[String, DatasetFileDocument]

This logic is not consistent, as it returns either a string, which is the URI of the local file, or a `DatasetFileDocument`, which is the fileHandle of a file in a dataset.

We want to make this logic consistent, meaning that it outputs only one type, URI, for all kinds of files.

#### 2. FileHandle is opened during `setContext` for ScanSourceOp, not right before the file is actually read

Currently, for the ScanSourceOpDesc, the setContext function opens the fileHandle:

```
override def setContext(workflowContext: WorkflowContext): Unit = {
  super.setContext(workflowContext)

  if (fileName.isEmpty) {
    throw new RuntimeException("no input file name")
  }

  // Resolve the file and assign the result to fileHandle
  fileHandle = FileResolver.resolve(fileName.get)
}
```

This is not appropriate, as the file is not read at this point. The ScanSourceOpDesc should carry `fileUri` instead of `fileHandle`. The `fileHandle` should be created only when `inferSchema` is called or when the OpExec executes.

### New Logic

For problem 1, `FileResolver.resolve` will return `URI` instead of `FileHandle`.

For problem 2,
- `FileResolver.open` will return the FileHandle
- ScanSourceOpDesc changes `fileHandle: FileHandle` to `fileUri: String`
- A new method `LogicalPlan.resolveScanSourceOpFileName` is added to resolve user-given filenames to URIs; this method is called during workflow compilation
- The fileHandle is created by `FileResolver.open` right before `inferSchema` and before the PhysicalOp executes.

---------

Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
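The split between resolving and opening described above could look roughly like the following sketch. It is not the PR's implementation: `FileResolverSketch` and `ScanSourceDescSketch` are hypothetical stand-ins, only local files are handled, and a plain `InputStream` takes the place of the FileHandle type.

```scala
import java.io.InputStream
import java.net.URI
import java.nio.file.{Files, Paths}
import scala.io.Source
import scala.util.{Try, Using}

// Hypothetical stand-in for FileResolver; handles local files only.
object FileResolverSketch {
  // Step 1 (workflow compilation): map a user-given file name to a URI
  // without opening the file.
  def resolve(fileName: String): URI =
    Paths.get(fileName).toAbsolutePath.normalize().toUri

  // Step 2 (right before reading): open the file behind the URI.
  def open(fileUri: URI): InputStream =
    Files.newInputStream(Paths.get(fileUri))
}

// Hypothetical descriptor: carries the resolved URI as a String, never an open handle.
final case class ScanSourceDescSketch(fileName: Option[String], fileUri: Option[String] = None) {
  // Plays the role of the compile-time resolution step.
  def withResolvedUri: ScanSourceDescSketch = {
    val name = fileName.getOrElse(throw new RuntimeException("no input file name"))
    copy(fileUri = Some(FileResolverSketch.resolve(name).toString))
  }

  // The file is opened only here, at the point of use, and Using closes it afterwards.
  def readFirstLine(): Try[String] =
    Try(new URI(fileUri.getOrElse(throw new IllegalStateException("file name not resolved yet"))))
      .flatMap { uri =>
        Using(Source.fromInputStream(FileResolverSketch.open(uri), "UTF-8")) { src =>
          src.getLines().next()
        }
      }
}
```

Because the descriptor holds only a string, it stays cheap to copy and serialize, and no file is opened until a caller such as schema inference or the executor actually reads it.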