
[SPARK-20971][SS] purge metadata log in FileStreamSource#18410

Closed
CodingCat wants to merge 3 commits into apache:master from CodingCat:SPARK-20971

Conversation

@CodingCat
Contributor

What changes were proposed in this pull request?

Currently there is no cleanup mechanism for FileStreamSource's metadata log, so the log grows without bound.

This PR purges log entries that fall outside the retention window.
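The retention rule proposed here can be sketched as follows. This is a minimal illustration, not the actual patch; the names `purgeThreshold`, `currentBatchId`, and `minBatchesToRetain` are used for exposition (Spark does have a `minBatchesToRetain` setting, but this object is not Spark's API):

```scala
// Illustrative sketch of the proposed retention rule: keep only the last
// `minBatchesToRetain` batches of the metadata log and purge everything older.
object MetadataLogPurge {
  // Returns the batch id below which entries may be purged, or None when
  // the log is still entirely within the retention window.
  def purgeThreshold(currentBatchId: Long, minBatchesToRetain: Long): Option[Long] =
    if (currentBatchId > minBatchesToRetain) Some(currentBatchId - minBatchesToRetain)
    else None
}
```

For example, with 100 batches written and a retention window of 10, everything before batch 90 would be eligible for purging.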

How was this patch tested?

existing tests

@CodingCat CodingCat changed the title [SS][SPARK-20971] purge metadata log in FileStreamSource [SPARK-20971][SS] purge metadata log in FileStreamSource Jun 23, 2017
@SparkQA

SparkQA commented Jun 24, 2017

Test build #78549 has finished for PR 18410 at commit b37ec11.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

Jenkins, test it please

@CodingCat
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78558 has finished for PR 18410 at commit b37ec11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78560 has started for PR 18410 at commit b37ec11.

@CodingCat
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 24, 2017

Test build #78570 has finished for PR 18410 at commit b37ec11.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

CodingCat commented Jun 24, 2017

@zsxwing would you mind taking a look at this PR? What does this pip packaging test failure mean? Is it a flaky test?

@HyukjinKwon
Member

(I believe the discussion here might be helpful for the pip packaging failure: https://github.com/apache/spark/pull/15821#issuecomment-310894657)

@CodingCat
Contributor Author

@HyukjinKwon thanks for the pointer. Is it fixed now?

@CodingCat
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 27, 2017

Test build #78704 has finished for PR 18410 at commit b37ec11.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@CodingCat
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 29, 2017

Test build #78919 has finished for PR 18410 at commit b37ec11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @zsxwing Could you check whether we should continue or close this PR?

@HeartSaVioR
Contributor

HeartSaVioR commented Jul 30, 2018

Looks like this PR is not needed, since CompactibleFileStreamLog also takes care of the metadata log.

https://github.com/apache/spark/commits/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala

Deletion of logs for outdated batches appears to happen when "spark.sql.streaming.fileSource.log.deletion" is true, and its default value is true. So I'm curious whether this patch is needed to remove outdated logs explicitly, or whether it is fine to leave things as they are.
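The gating described above can be sketched as follows. The config key `spark.sql.streaming.fileSource.log.deletion` and its default of true come from the comment; the surrounding object is an illustrative simplification, not Spark's actual code:

```scala
// Illustrative simplification of how expired-log deletion is gated on
// spark.sql.streaming.fileSource.log.deletion (default: true).
// This object is a stand-in for exposition, not Spark's implementation.
object FileSourceLogCleanup {
  val DeletionKey = "spark.sql.streaming.fileSource.log.deletion"

  // Expired log files are deleted only when the flag is (or defaults to) true.
  def shouldDelete(conf: Map[String, String]): Boolean =
    conf.getOrElse(DeletionKey, "true").toBoolean
}
```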

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97736 has finished for PR 18410 at commit b37ec11.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97707 has finished for PR 18410 at commit b37ec11.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97806 has finished for PR 18410 at commit b37ec11.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// No-op for now; FileStreamSource currently garbage-collects files based on
// timestamp and the value of the maxFileAge parameter.
if (currentLogOffset > minBatchesToRetain) {
  metadataLog.purge(currentLogOffset - minBatchesToRetain)
}

What would be the behavior of this when spark.sql.streaming.fileSource.log.deletion=false? Looks like HDFSMetadataLog.purge will always delete the files.

Contributor


HDFSMetadataLog is not aware of that configuration.

Btw, I've put my observation on CompactibleFileStreamLog.purge() in comment on SPARK-20971.

https://issues.apache.org/jira/browse/SPARK-20971?focusedCommentId=16772632&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16772632

Let me quote it here:

Btw, calling purge breaks CompactibleFileStreamLog, since CompactibleFileStreamLog expects non-compacted batches to exist, but purge simply removes all metadata files matching the criteria. The safest way seems to be to just disallow purge for CompactibleFileStreamLog; otherwise we have to worry about the intention behind each purge call, just as I was curious about the rationale for this issue above.

So I've got a feeling that this may bring unexpected behavior and should be avoided.

@HeartSaVioR
Contributor

HeartSaVioR commented Feb 20, 2019

I played a bit with my idea (filtering out entries that were in a committed batch when compacting), and realized it cannot solve the problem of filtered-out files being re-read (unless I'm missing something here).

HeartSaVioR@00534b8

Once the files are filtered out of the metadata, they could be included as input for a new batch. (SeenFilesMap could help a bit, but it is not persisted to storage, so the issue remains the same when the query is rerun.)
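The restart problem above comes down to the dedup state living only in memory. A toy model (this class is illustrative, not Spark's SeenFilesMap) makes the failure mode concrete:

```scala
// Toy model of an in-memory seen-files map. Because its state lives only
// in process memory, a restarted query gets a fresh, empty map, so files
// purged from the persisted metadata log can be re-read as "new".
class InMemorySeenFiles {
  private val seen = scala.collection.mutable.Set.empty[String]

  // Returns true the first time a path is offered, false for duplicates.
  def markIfNew(path: String): Boolean = seen.add(path)
}
```

A restarted instance accepts a previously seen path again, which is exactly why the metadata log, not the in-memory map, has to remain the source of truth.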

I think the best way to do this safely would be to incorporate it into #22952: once files are successfully archived or deleted, it is safe to filter them out of the metadata log as well. I'll address it in #22952.

@HeartSaVioR
Contributor

HeartSaVioR commented Feb 21, 2019

FYI: I've addressed removing obsolete file entries from compacted metadata logs. Please refer to #22952 as well as 5dcece0.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions Bot added the Stale label Jan 16, 2020
@github-actions github-actions Bot closed this Jan 17, 2020