Conversation


@zsxwing zsxwing commented Dec 12, 2016

What changes were proposed in this pull request?

When starting a stream over a large backlog of files with maxFilesPerTrigger set, users often want to process the most recent files first. This keeps latency low for recent data while historical data is slowly backfilled.

This PR adds a new option, latestFirst, to control this behavior. When it is true, FileStreamSource sorts files by modification time from latest to oldest and takes the first maxFilesPerTrigger files as the next batch.
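The selection logic described above can be sketched in plain Scala. This is a minimal illustration, not the actual FileStreamSource code; `FileEntry` and the parameter names are stand-ins.

```scala
// Minimal sketch of latest-first batch selection, assuming a simple
// FileEntry stand-in; the real FileStreamSource also tracks seen
// files, offsets, and more.
case class FileEntry(path: String, modTime: Long)

def selectBatch(
    unprocessed: Seq[FileEntry],
    latestFirst: Boolean,
    maxFilesPerTrigger: Int): Seq[FileEntry] = {
  val ordered =
    if (latestFirst) unprocessed.sortBy(f => -f.modTime) // newest first
    else unprocessed.sortBy(_.modTime)                   // oldest first
  ordered.take(maxFilesPerTrigger)
}
```

With latestFirst enabled, a backlog of old files no longer delays new arrivals: each trigger serves the newest files first and works backwards through the backlog.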

How was this patch tested?

The added test.


SparkQA commented Dec 12, 2016

Test build #70002 has finished for PR 16251 at commit 58a57d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val optionMapWithoutPath: Map[String, String] =
  parameters.filterKeys(_ != "path")

/** Whether to scan new files first. */

Can you elaborate on this comment? Something like: in a trigger, when unprocessed files are found, the latest files are processed first.

Also, isn't "latest" more common than "newest"?

@zsxwing zsxwing changed the title [SPARK-18826][SS]Add 'newestFirst' option to FileStreamSource [SPARK-18826][SS]Add 'latestFirst' option to FileStreamSource Dec 14, 2016

SparkQA commented Dec 14, 2016

Test build #70149 has finished for PR 16251 at commit ce1d57e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


zsxwing commented Dec 14, 2016

retest this please


SparkQA commented Dec 15, 2016

Test build #70154 has finished for PR 16251 at commit ce1d57e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Prepare two files: 1.txt, 2.txt, and make sure they have different modified time.
val f1 = stringToFile(new File(src, "1.txt"), "1")
val f2 = stringToFile(new File(src, "2.txt"), "2")
eventually(timeout(streamingTimeout)) {

Why use eventually? Why not just call f1.setLastModified(f2.lastModified + 1000)?
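The suggestion above can be sketched as a stand-alone snippet (directory and file names are arbitrary stand-ins). Pinning the timestamps explicitly makes the ordering deterministic instead of polling until the two files happen to get distinct modification times; here 1.txt is forced to be strictly older than 2.txt, matching the ordering the test expects.

```scala
// Sketch of the reviewer's suggestion: pin modification times
// explicitly rather than waiting for them to differ.
import java.io.File
import java.nio.file.Files

val src = Files.createTempDirectory("file-source-test").toFile
val f1 = new File(src, "1.txt")
Files.write(f1.toPath, "1".getBytes)
val f2 = new File(src, "2.txt")
Files.write(f2.toPath, "2".getBytes)

// Make 1.txt strictly older, regardless of filesystem timestamp
// granularity, so a latest-first source must pick 2.txt first.
f1.setLastModified(f2.lastModified - 10000)
```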

val clock = new StreamManualClock()
testStream(fileStream)(
  StartStream(trigger = ProcessingTime(10), triggerClock = clock),
  AssertOnQuery { _ =>

Why do you need to wait on the manual clock? Won't CheckLastBatch automatically wait for the batch to complete?

Member Author (zsxwing) replied:

> why do you need to wait on the manual clock? CheckLastBatch will automatically wait for the batch to complete?

CheckLastBatch waits only when AddData is used, but in this test, I need to add data before starting the query.
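For context, the manual-clock pattern used in this test can be sketched as follows. This is a hypothetical minimal clock, not Spark's actual StreamManualClock: the trigger thread asks the clock for the time, and the test advances it explicitly, so batch boundaries are fully deterministic.

```scala
// Hypothetical minimal manual clock in the spirit of StreamManualClock:
// time only moves when the test calls advance(), so triggers fire
// exactly when the test says so.
class ManualClock(start: Long = 0L) {
  private var now = start

  def getTimeMillis(): Long = synchronized { now }

  def advance(ms: Long): Unit = synchronized {
    now += ms
    notifyAll() // wake threads blocked in waitTillTime
  }

  // Block the caller (e.g. a trigger thread) until the clock reaches `target`.
  def waitTillTime(target: Long): Long = synchronized {
    while (now < target) wait()
    now
  }
}
```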

)

// Read latest files first, so the first batch is "2", and the second batch is "1".
val fileStream2 = createFileStream(

I think this code can be deduplicated by writing a function that makes the query run two batches and collects the results in order. Then call that function with latestFirst set to true and to false, and check the result order in each case.
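A sketch of that shape, with all names hypothetical. The helper body below only fakes the two-batch run so the loop structure is visible; in the real test it would start the stream with maxFilesPerTrigger = 1 over the two files and collect each batch's output.

```scala
// Hypothetical dedupe of the two tests: one helper runs the scenario,
// and a single loop checks both flag values.
def runTwoBatchQuery(latestFirst: Boolean): Seq[String] = {
  // Stand-in for the real harness: files "1" (older) and "2" (newer),
  // one file per batch.
  val oldestToNewest = Seq("1", "2")
  if (latestFirst) oldestToNewest.reverse else oldestToNewest
}

for ((latestFirst, expected) <- Seq(true -> Seq("2", "1"), false -> Seq("1", "2"))) {
  val result = runTwoBatchQuery(latestFirst)
  assert(result == expected, s"latestFirst=$latestFirst: got $result")
}
```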


tdas commented Dec 15, 2016

LGTM, pending tests.


SparkQA commented Dec 15, 2016

Test build #70200 has finished for PR 16251 at commit 2847738.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


tdas commented Dec 15, 2016

Merging to master and 2.1

asfgit pushed a commit that referenced this pull request Dec 15, 2016

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16251 from zsxwing/newest-first.

(cherry picked from commit 68a6dc9)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
@asfgit asfgit closed this in 68a6dc9 Dec 15, 2016
@zsxwing zsxwing deleted the newest-first branch January 4, 2017 19:58
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#16251 from zsxwing/newest-first.