This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Revisit index metadata for mutable dataset #198

@sezruby

Describe the issue

This issue is for the discussion started in #194 (comment)

We are currently proposing features for mutable datasets, and have introduced appendedFiles and deletedFiles for the new refresh modes: incremental and quick (planned).

/**
 * Hdfs file properties.
 * @param content Content object representing Hdfs file based data source.
 * @param appendedFiles Appended files since the last time derived dataset was updated.
 * @param deletedFiles Deleted files since the last time derived dataset was updated.
 */
case class Properties(
    content: Content,
    appendedFiles: Seq[String] = Nil,
    deletedFiles: Seq[String] = Nil)

relations.head.data.properties.deletedFiles
relations.head.data.properties.appendedFiles

The current master uses a "mixed" semantic for source.Content in IndexLogEntry: with the new refresh modes, there can be "unhandled" appendedFiles and deletedFiles. "Unhandled" means their data is not reflected in the index data, so they must be handled appropriately for correctness. When recording unhandled appendedFiles and deletedFiles, the "mixed" version updates the list of source files and its signature value even though the index data does not reflect the changes from appendedFiles and deletedFiles. This "mixed" semantic might add complexity to the code base.
@imback82 suggested a new semantic for this: always using the "since" approach.

The following example might help explain what "mixed" and "since" mean (from @imback82's #194 (comment)):

There are source files: f1, f2, f3

  • An index is created.
  • A source file is added: f4
  • A source file is deleted: f1
  • Refresh "quick" (metadata only operation)

With the current semantics (mix of "latest" and "since"), we will have the following in the latest index log entry:

  • source files / signature: f2, f3, f4 / sig(f2, f3, f4)
  • appendedFiles: f4
  • deletedFiles: f1

With the semantics of always using "since":

  • source files / signature: f1, f2, f3 / sig(f1, f2, f3)
  • appendedFiles: f4
  • deletedFiles: f1
  • latestSourceFilesSignature: sig(f2, f3, f4)
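
The two layouts above can be sketched in Scala as follows. This is only an illustrative model, not Hyperspace's actual classes: `Entry`, `sig`, `quickRefreshMixed`, and `quickRefreshSince` are hypothetical names, and `sig` is a stand-in that hashes sorted file names rather than the real signature provider.

```scala
// Hypothetical model of the two metadata layouts; names are assumptions.
object RefreshSemantics {
  // Stand-in signature: order-insensitive hash of the file names.
  def sig(files: Set[String]): String =
    files.toSeq.sorted.mkString(",").hashCode.toHexString

  final case class Entry(
      sourceFiles: Set[String],
      signature: String,
      appendedFiles: Set[String],
      deletedFiles: Set[String],
      latestSourceFilesSignature: Option[String] = None)

  // "Mixed": source files and signature follow the latest file listing,
  // even though the index data still covers only `indexed`.
  def quickRefreshMixed(indexed: Set[String], current: Set[String]): Entry =
    Entry(current, sig(current), current -- indexed, indexed -- current)

  // "Since": source files and signature stay at what the index data covers;
  // the latest listing is tracked only through a separate signature.
  def quickRefreshSince(indexed: Set[String], current: Set[String]): Entry =
    Entry(indexed, sig(indexed), current -- indexed, indexed -- current,
      Some(sig(current)))
}
```

With `indexed = Set("f1", "f2", "f3")` and `current = Set("f2", "f3", "f4")`, both functions yield the same appendedFiles (f4) and deletedFiles (f1); they differ only in which file set the source files and signature describe.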

However, to support Hybrid scan effectively with the quick refresh, we need to keep the updated signature leveraging appendedFiles and deletedFiles.
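
One possible reading of this requirement, as a rough sketch: if the log entry stores a signature of the latest source files, a query can compare it with the relation's current files and, on a match, reuse the recorded appendedFiles and deletedFiles without recomputing the diff. The names below (`HybridScanCheck`, `Entry`, `isCandidate`, `effectiveFiles`) and the string-based `sig` are assumptions for illustration and do not reflect the actual Hybrid scan implementation.

```scala
// Hypothetical sketch; not the actual Hybrid scan logic.
object HybridScanCheck {
  // Stand-in signature: sorted, comma-joined file names.
  def sig(files: Set[String]): String = files.toSeq.sorted.mkString(",")

  // A "since"-style entry that also keeps the latest source signature.
  final case class Entry(
      indexedFiles: Set[String],
      appendedFiles: Set[String],
      deletedFiles: Set[String],
      latestSourceFilesSignature: String)

  // Files a query must cover: indexed data minus deletes, plus appends.
  def effectiveFiles(e: Entry): Set[String] =
    (e.indexedFiles -- e.deletedFiles) ++ e.appendedFiles

  // The recorded appended/deleted lists are trustworthy only when the stored
  // latest signature matches the files the relation has right now.
  def isCandidate(e: Entry, currentFiles: Set[String]): Boolean =
    e.latestSourceFilesSignature == sig(currentFiles)
}
```

If the source changes again after the quick refresh (for example, f5 is added), `isCandidate` returns false under this sketch, signalling that the recorded lists are stale.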

Labels: enhancement (New feature or request)
