[SPARK-18826][SS]Add 'latestFirst' option to FileStreamSource #16251
Conversation
Test build #70002 has finished for PR 16251 at commit
val optionMapWithoutPath: Map[String, String] =
  parameters.filterKeys(_ != "path")

/** Whether to scan new files first. */
Can you elaborate on this comment further? In a trigger, when it finds unprocessed files, it will process the latest files first.
Also, isn't "latest" more common than "newest"?
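For readers skimming the thread, here is a minimal sketch of the kind of latest-first selection being discussed; `FileEntry`, `selectBatch`, and the field names are illustrative and are not the actual FileStreamSource internals.

```scala
// Illustrative sketch only: how a "latest first" batch could be chosen.
case class FileEntry(path: String, timestamp: Long)

def selectBatch(
    newFiles: Seq[FileEntry],
    latestFirst: Boolean,
    maxFilesPerTrigger: Option[Int]): Seq[FileEntry] = {
  // Sort unprocessed files by modification time; latestFirst reverses the order
  // so recent files are picked up before the historical backfill.
  val sorted =
    if (latestFirst) newFiles.sortBy(-_.timestamp)
    else newFiles.sortBy(_.timestamp)
  maxFilesPerTrigger.map(n => sorted.take(n)).getOrElse(sorted)
}
```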
Test build #70149 has finished for PR 16251 at commit
retest this please
Test build #70154 has finished for PR 16251 at commit
// Prepare two files: 1.txt, 2.txt, and make sure they have different modified time.
val f1 = stringToFile(new File(src, "1.txt"), "1")
val f2 = stringToFile(new File(src, "2.txt"), "2")
eventually(timeout(streamingTimeout)) {
Why use eventually? Why not just set f1.setLastModified(f2.lastModified + 1000)?
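A sketch of the suggested alternative, assuming the surrounding suite's `stringToFile` helper and `src` directory; the +1000 offset is arbitrary, and 2.txt is made the newer file to match the test's expected ordering.

```scala
import java.io.File

// Sketch of the suggestion: set deterministic modification times instead of
// polling with eventually(). Assumes the suite's stringToFile helper and src dir.
val f1 = stringToFile(new File(src, "1.txt"), "1")
val f2 = stringToFile(new File(src, "2.txt"), "2")
// Make 2.txt strictly newer than 1.txt, independent of filesystem timestamp granularity.
f2.setLastModified(f1.lastModified() + 1000)
assert(f1.lastModified() < f2.lastModified())
```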
val clock = new StreamManualClock()
testStream(fileStream)(
  StartStream(trigger = ProcessingTime(10), triggerClock = clock),
  AssertOnQuery { _ =>
Why do you need to wait on the manual clock? Won't CheckLastBatch automatically wait for the batch to complete?
> Why do you need to wait on the manual clock? Won't CheckLastBatch automatically wait for the batch to complete?
CheckLastBatch waits only when AddData is used, but in this test, I need to add data before starting the query.
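To illustrate the point, a rough sketch of that shape; the explicit wait inside `AssertOnQuery` is illustrative, and the exact condition the real test polls may differ.

```scala
// Rough sketch: the files exist before StartStream, so no AddData action is
// involved and CheckLastBatch has no offset to wait on. The test therefore
// blocks explicitly until the first triggered batch has been processed.
testStream(fileStream)(
  StartStream(trigger = ProcessingTime(10), triggerClock = clock),
  AssertOnQuery { query =>
    // Illustrative wait condition; the real test may poll something else.
    eventually(timeout(streamingTimeout)) {
      assert(query.lastProgress != null)
    }
    true
  },
  CheckLastBatch("2"),      // with latestFirst = true, the newest file comes first
  AdvanceManualClock(10),   // manually fire the next ProcessingTime(10) trigger
  CheckLastBatch("1")
)
```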
)

// Read latest files first, so the first batch is "2", and the second batch is "1".
val fileStream2 = createFileStream(
I think this code can be deduped by writing a function that makes the query run two batches and collects the results in order. The function can then be called with latestFirst true or false, and the result order checked.
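Something along these lines, roughly; `runTwoBatchesAndVerifyResults` and the wait inside `AssertOnQuery` are illustrative names rather than the code as merged.

```scala
// Hypothetical dedupe helper: run two maxFilesPerTrigger=1 batches over the same
// two files and assert on the order in which their contents show up.
def runTwoBatchesAndVerifyResults(
    src: File,
    latestFirst: Boolean,
    firstBatch: String,
    secondBatch: String): Unit = {
  val fileStream = createFileStream(
    "text",
    src.getCanonicalPath,
    options = Map("latestFirst" -> latestFirst.toString, "maxFilesPerTrigger" -> "1"))
  val clock = new StreamManualClock()
  testStream(fileStream)(
    StartStream(trigger = ProcessingTime(10), triggerClock = clock),
    AssertOnQuery { query =>
      // Illustrative wait for the first batch, since no AddData is involved.
      eventually(timeout(streamingTimeout)) { assert(query.lastProgress != null) }
      true
    },
    CheckLastBatch(firstBatch),
    AdvanceManualClock(10),
    CheckLastBatch(secondBatch)
  )
}

// Oldest-first (default) vs. latest-first orderings:
runTwoBatchesAndVerifyResults(src, latestFirst = false, firstBatch = "1", secondBatch = "2")
runTwoBatchesAndVerifyResults(src, latestFirst = true, firstBatch = "2", secondBatch = "1")
```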
LGTM, pending tests.
Test build #70200 has finished for PR 16251 at commit
Merging to master and 2.1
Merged to master as commit 68a6dc9 and cherry-picked to branch-2.1. Closes #16251 from zsxwing/newest-first. Author: Shixiong Zhu <shixiong@databricks.com>. Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
What changes were proposed in this pull request?
When starting a stream with a lot of backfill and `maxFilesPerTrigger`, the user often wants to start with the most recent files first. This keeps latency low for recent data while slowly backfilling historical data.
This PR adds a new option `latestFirst` to control this behavior. When it's true, `FileStreamSource` will sort the files by the modified time from latest to oldest, and take the first `maxFilesPerTrigger` files as a new batch.

How was this patch tested?
The added test.
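For reference, a hedged usage sketch of the option; it assumes an active `spark` session, and the format, input path, and 100-file cap are placeholders.

```scala
// Illustrative usage; the input path and per-trigger cap are placeholders.
val stream = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", "100")  // cap each micro-batch at 100 files
  .option("latestFirst", "true")        // take the newest files first, backfill later
  .load("/data/events")                 // hypothetical backfilled directory
```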