[SPARK-20971][SS] purge metadata log in FileStreamSource #18410
CodingCat wants to merge 3 commits into apache:master
Test build #78549 has finished for PR 18410 at commit
Jenkins, test it please
Jenkins, retest this please
Test build #78558 has finished for PR 18410 at commit
retest this please
Test build #78560 has started for PR 18410 at commit
retest this please
Test build #78570 has finished for PR 18410 at commit
@zsxwing would you mind taking a look at this PR? Also, what does this pip packaging test failure mean? Is it a flaky test?
(I believe the discussion here might be helpful for pip packaging failure -
@HyukjinKwon thanks for the pointer, is it fixed now?
retest this please
Test build #78704 has finished for PR 18410 at commit
retest this please
Test build #78919 has finished for PR 18410 at commit
cc @zsxwing Could you check whether we should continue or close this PR?
Looks like deleting log for outdated batch is happening when "spark.sql.streaming.fileSource.log.deletion" is
Test build #97736 has finished for PR 18410 at commit
Test build #97707 has finished for PR 18410 at commit
Test build #97806 has finished for PR 18410 at commit
// No-op for now; FileStreamSource currently garbage-collects files based on timestamp
// and the value of the maxFileAge parameter.
if (currentLogOffset > minBatchesToRetain) {
  metadataLog.purge(currentLogOffset - minBatchesToRetain)
What would be the behavior of this when spark.sql.streaming.fileSource.log.deletion=false? Looks like HDFSMetadataLog.purge will always delete the files.
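The question above can be made concrete with a minimal sketch. It assumes a purge that removes every batch strictly below the threshold, and it adds a guard on a deletion flag that the diff under review does not have. `ToyMetadataLog`, `PurgeSketch`, and `logDeletionEnabled` are hypothetical names for illustration only, not Spark classes or parameters.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a metadata log; not Spark's HDFSMetadataLog.
final class ToyMetadataLog {
  private val batches = mutable.TreeMap.empty[Long, String]

  def add(batchId: Long, metadata: String): Unit = batches.update(batchId, metadata)

  // Removes every batch whose id is strictly below the threshold.
  def purge(thresholdBatchId: Long): Unit = {
    val stale = batches.keys.filter(_ < thresholdBatchId).toList
    stale.foreach(id => batches.remove(id))
  }

  def batchIds: List[Long] = batches.keys.toList
}

object PurgeSketch {
  // `logDeletionEnabled` is an assumed parameter standing in for
  // spark.sql.streaming.fileSource.log.deletion; the diff has no such guard.
  def commit(
      log: ToyMetadataLog,
      currentLogOffset: Long,
      minBatchesToRetain: Int,
      logDeletionEnabled: Boolean): Unit = {
    if (logDeletionEnabled && currentLogOffset > minBatchesToRetain) {
      log.purge(currentLogOffset - minBatchesToRetain)
    }
  }
}
```

With batches 0 through 9 in the toy log, a commit at offset 9 with minBatchesToRetain = 3 purges everything below batch 6 when the flag is true and leaves the log untouched when it is false.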
HDFSMetadataLog is not aware of such configuration.
Btw, I've put my observation on CompactibleFileStreamLog.purge() in a comment on SPARK-20971.
Let me quote it here:
Btw, calling purge breaks CompactibleFileStreamLog, since CompactibleFileStreamLog expects non-compacted batches to exist, but purge just removes all metadata files matching the criteria. The safest way seems to be simply disallowing purge for CompactibleFileStreamLog; otherwise we have to worry about the intention behind calling purge, just as I was curious about the rationale for this issue above.
So I've got a feeling that this may bring unexpected behavior and should be avoided.
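A toy model can show one way a threshold purge might interact badly with compaction. It assumes compaction batches hold all entries written so far and that readers start from the latest compaction batch. `CompactionSketch` and its layout are illustrative simplifications, not Spark's CompactibleFileStreamLog, and the real class may fail differently (for example by raising an error rather than silently losing entries).

```scala
// Toy model of a compactible log; names and layout are illustrative only.
object CompactionSketch {
  // Assumed convention for this sketch: every `interval`-th batch is a compaction batch.
  def isCompactionBatch(batchId: Long, interval: Int): Boolean =
    (batchId + 1) % interval == 0

  // Build the stored view: a compaction batch holds all entries up to and
  // including itself, any other batch holds only its own entries.
  def write(perBatch: Map[Long, Seq[String]], interval: Int): Map[Long, Seq[String]] = {
    val ids = perBatch.keys.toSeq.sorted
    ids.map { id =>
      val content =
        if (isCompactionBatch(id, interval)) ids.filter(_ <= id).flatMap(i => perBatch(i))
        else perBatch(id)
      id -> content
    }.toMap
  }

  // Read everything: the latest compaction batch plus the batches after it.
  // If no compaction batch survives, fall back to whatever batches remain.
  def allEntries(stored: Map[Long, Seq[String]], interval: Int): Seq[String] = {
    val ids = stored.keys.toSeq.sorted
    val lastCompact = ids.filter(id => isCompactionBatch(id, interval)).lastOption
    val start = lastCompact.getOrElse(ids.headOption.getOrElse(0L))
    ids.filter(_ >= start).flatMap(i => stored(i))
  }

  // Threshold purge in the style discussed above: drop every batch below the threshold.
  def purge(stored: Map[Long, Seq[String]], threshold: Long): Map[Long, Seq[String]] =
    stored.filter { case (id, _) => id >= threshold }
}
```

In this model, with an interval of 3 and batches 0 through 4, purging below batch 3 removes the compaction batch 2, so the entries first recorded in batches 0 to 2 are no longer reachable from the log.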
I played a bit with my idea (filtering out entries that were in a committed batch when compacting) and realized it cannot solve the issue of filtered-out files being re-read (not sure whether I'm just missing something here). Once the files are filtered out of the metadata, they could be included in the source of a new batch (even if SeenFilesMap could help a bit, it's not persisted to storage, so the issue remains the same when the query is rerun). I think the best way to do this safely would be to incorporate this into #22952: when files are successfully archived or deleted, we're safe to filter them out of the metadata log as well. I'll address it in #22952.
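The re-read hazard described above can be sketched as follows, assuming a source treats any listed file that is absent from the log as new. `CleanupFilterSketch`, `FileEntry`, `unseen`, and `dropCleaned` are hypothetical names for illustration; Spark's actual entry type and listing logic differ.

```scala
object CleanupFilterSketch {
  // Hypothetical log entry; Spark's real entry type has different fields.
  final case class FileEntry(path: String, batchId: Long)

  // Files a source would treat as new: listed in the directory but absent from the log.
  def unseen(listed: Seq[String], log: Seq[FileEntry]): Seq[String] = {
    val known = log.map(_.path).toSet
    listed.filterNot(p => known.contains(p))
  }

  // Drop entries for files that a cleanup step reports as archived or deleted.
  // `cleanedPaths` is an assumed input, not something the original PR defines.
  def dropCleaned(log: Seq[FileEntry], cleanedPaths: Set[String]): Seq[FileEntry] =
    log.filterNot(e => cleanedPaths.contains(e.path))
}
```

If an entry is dropped while its file is still listed in the source directory, `unseen` returns that file again; once the file has actually been archived or deleted and no longer appears in the listing, dropping its entry has no such effect.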
FYI: I've addressed removing obsolete file entries from compacted metadata logs. Please refer to #22952 as well as 5dcece0
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Currently, there is no cleanup mechanism for FileStreamSource's metadata log, so the log grows without bound.
This PR purges the log entries that fall outside the retention window.
How was this patch tested?
Existing tests.