Revisit index metadata for mutable dataset #198
Description
Describe the issue
This issue is for the discussion started in #194 (comment)
We are currently proposing features for mutable datasets, and have introduced appendedFiles and deletedFiles for the new refresh modes: incremental and quick (planned).
/**
 * Hdfs file properties.
 * @param content Content object representing Hdfs file based data source.
 * @param appendedFiles Appended files since the last time derived dataset was updated.
 * @param deletedFiles Deleted files since the last time derived dataset was updated.
 */
case class Properties(
    content: Content,
    appendedFiles: Seq[String] = Nil,
    deletedFiles: Seq[String] = Nil)
relations.head.data.properties.deletedFiles
relations.head.data.properties.appendedFiles
The current master uses a "mixed" semantic for source.Content in IndexLogEntry: after one of the new refresh modes, there can be "unhandled" appendedFiles and deletedFiles. "Unhandled" means the index data has not been updated for them, so they need to be handled appropriately for correctness. When recording unhandled appendedFiles and deletedFiles, the "mixed" version updates the list of source files and its signature value, even though the index data does not yet reflect those files. This "mixed" semantic might add complexity to the code base.
@imback82 suggested a new semantic for this: always use the "since" approach.
The following example might help explain what "mixed" and "since" mean (from @imback82's #194 (comment)):
There are source files: f1, f2, f3
- An index is created.
- A source file is added: f4
- A source file is deleted: f1
- Refresh "quick" (metadata only operation)
With the current semantics (mix of "latest" and "since"), we will have the following in the latest index log entry:
- source files / signature: f2, f3, f4 / sig(f2, f3, f4)
- appendedFiles: f4
- deletedFiles: f1
With the semantics of always using "since":
- source files / signature: f1, f2, f3 / sig(f1, f2, f3)
- appendedFiles: f4
- deletedFiles: f1
- latestSourceFilesSignature: sig(f2, f3, f4)
However, to support Hybrid scan effectively with the quick refresh, we need to keep the updated signature leveraging appendedFiles and deletedFiles.
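The two semantics from the example above can be sketched in a few lines of Scala. This is only a rough illustration under stated assumptions: Entry, sig, create, quickRefreshMixed and quickRefreshSince are hypothetical names, not the actual IndexLogEntry API, and sig is a stand-in that just renders the sorted file names rather than a real signature.

```scala
// Hypothetical, simplified model for illustration; not the real IndexLogEntry.
object RefreshSemantics {
  // Stand-in signature: an order-insensitive rendering of the file names.
  def sig(files: Set[String]): String =
    files.toSeq.sorted.mkString("sig(", ", ", ")")

  final case class Entry(
      sourceFiles: Set[String],
      signature: String,
      appendedFiles: Set[String] = Set.empty,
      deletedFiles: Set[String] = Set.empty,
      latestSourceFilesSignature: Option[String] = None)

  // Index creation: the index data covers exactly the given source files.
  def create(files: Set[String]): Entry = Entry(files, sig(files))

  // "Mixed": source files and signature move to the latest listing, while
  // appendedFiles/deletedFiles stay relative to what the index data covers.
  def quickRefreshMixed(e: Entry, current: Set[String]): Entry = {
    // Recover the files the index data actually covers.
    val indexed = (e.sourceFiles -- e.appendedFiles) ++ e.deletedFiles
    Entry(current, sig(current), current -- indexed, indexed -- current)
  }

  // Always "since": source files and signature keep describing the index data;
  // the latest listing is tracked only through a separate signature.
  def quickRefreshSince(e: Entry, current: Set[String]): Entry =
    e.copy(
      appendedFiles = current -- e.sourceFiles,
      deletedFiles = e.sourceFiles -- current,
      latestSourceFilesSignature = Some(sig(current)))
}
```

Calling create(Set("f1", "f2", "f3")) and then either refresh function with the current listing Set("f2", "f3", "f4") gives the two log entries listed in the example. In the "since" sketch, the extra latestSourceFilesSignature field is what would let a Hybrid scan check compare against the latest source state without touching the original source files and signature.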