Skip to content

Conversation

@lw-lin
Copy link
Contributor

@lw-lin lw-lin commented Nov 9, 2016

What changes were proposed in this pull request?

Problem

Right now CompactibleFileStreamLog uses compactInterval to check if a batch is a compaction batch. However, since this conf is controlled by the user, they may just change it, and CompactibleFileStreamLog will just silently return the wrong answer.

Changes

With this patch the compactInterval configuration should only be used when deciding if we should perform a new compaction. The identification of a compaction vs a delta is done by looking for the compact suffix.

In more details, upon each start/restart, we first examine what metadata files were there, what files among them are compacted files. We introduce knownCompactionBatches to hold the existing compaction batches before this run, and zeroBatch to hold the first batch this run should write to.

Thus we can support the following situations:

  (1) a fresh run
  (2) the previous run with `compactInterval` = 2
      0
  (3) the previous run with `compactInterval` = 2
      0   1.compact
  (4) previous runs with `compactInterval` = 2 and `compactInterval` = 5
      0   1.compact    2   3.compact   4.compact
  (5) previous runs with `compactInterval` = 2 and `compactInterval` = 5
      0   1.compact    2   3.compact   4.compact   5   6   7   8

with:

  (1) `knownCompactionBatches` = (), `zeroBatch` = 0
  (2) `knownCompactionBatches` = (), `zeroBatch` = 1
  (3) `knownCompactionBatches` = (1), `zeroBatch` = 2
  (4) `knownCompactionBatches` = (1, 3, 4), `zeroBatch` = 5
  (5) `knownCompactionBatches` = (1, 3, 4), `zeroBatch` = 9

Note this appoach does not require the compactInteral for a new run must be consistent with previous runs.

How was this patch tested?

added test cases

@uncleGen
Copy link
Contributor

uncleGen commented Nov 9, 2016

related to #15827

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68402 has finished for PR 15828 at commit 72de68d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CompactibleFileStreamLogSuite extends SparkFunSuite with SharedSQLContext
    • class FakeCompactibleFileStreamLog(

@SparkQA
Copy link

SparkQA commented Nov 9, 2016

Test build #68412 has finished for PR 15828 at commit 58c1031.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin lw-lin changed the title [WIP][SS] CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch [WIP][SPARK-18187][SS] CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch Nov 9, 2016
@SparkQA
Copy link

SparkQA commented Nov 10, 2016

Test build #68454 has finished for PR 15828 at commit 7126665.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin lw-lin force-pushed the test-compact-1109- branch from 7126665 to ca5e984 Compare November 13, 2016 15:02
@lw-lin
Copy link
Contributor Author

lw-lin commented Nov 13, 2016

rebased.

@lw-lin lw-lin changed the title [WIP][SPARK-18187][SS] CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch [SPARK-18187][SS] CompactibleFileStreamLog should not rely on "compactInterval" to detect a compaction batch Nov 13, 2016
@lw-lin
Copy link
Contributor Author

lw-lin commented Nov 13, 2016

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Nov 13, 2016

Test build #68584 has finished for PR 15828 at commit ca5e984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2016

Test build #68585 has finished for PR 15828 at commit ca5e984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lw-lin lw-lin closed this Nov 19, 2016
asfgit pushed a commit that referenced this pull request Nov 22, 2016
…mLog` directly

## What changes were proposed in this pull request?

Right now we are testing the most of `CompactibleFileStreamLog` in `FileStreamSinkLogSuite` (because `FileStreamSinkLog` once was the only subclass of `CompactibleFileStreamLog`, but now it's not the case any more).

Let's refactor the tests so that `CompactibleFileStreamLog` is directly tested, making future changes (like #15828, #15827) to `CompactibleFileStreamLog` much easier to test and much easier to review.

## How was this patch tested?

the PR itself is about tests

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15870 from lw-lin/test-compact-1113.
asfgit pushed a commit that referenced this pull request Nov 22, 2016
…mLog` directly

## What changes were proposed in this pull request?

Right now we are testing the most of `CompactibleFileStreamLog` in `FileStreamSinkLogSuite` (because `FileStreamSinkLog` once was the only subclass of `CompactibleFileStreamLog`, but now it's not the case any more).

Let's refactor the tests so that `CompactibleFileStreamLog` is directly tested, making future changes (like #15828, #15827) to `CompactibleFileStreamLog` much easier to test and much easier to review.

## How was this patch tested?

the PR itself is about tests

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15870 from lw-lin/test-compact-1113.

(cherry picked from commit ebeb083)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
…mLog` directly

## What changes were proposed in this pull request?

Right now we are testing the most of `CompactibleFileStreamLog` in `FileStreamSinkLogSuite` (because `FileStreamSinkLog` once was the only subclass of `CompactibleFileStreamLog`, but now it's not the case any more).

Let's refactor the tests so that `CompactibleFileStreamLog` is directly tested, making future changes (like apache#15828, apache#15827) to `CompactibleFileStreamLog` much easier to test and much easier to review.

## How was this patch tested?

the PR itself is about tests

Author: Liwei Lin <lwlin7@gmail.com>

Closes apache#15870 from lw-lin/test-compact-1113.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…mLog` directly

## What changes were proposed in this pull request?

Right now we are testing the most of `CompactibleFileStreamLog` in `FileStreamSinkLogSuite` (because `FileStreamSinkLog` once was the only subclass of `CompactibleFileStreamLog`, but now it's not the case any more).

Let's refactor the tests so that `CompactibleFileStreamLog` is directly tested, making future changes (like apache#15828, apache#15827) to `CompactibleFileStreamLog` much easier to test and much easier to review.

## How was this patch tested?

the PR itself is about tests

Author: Liwei Lin <lwlin7@gmail.com>

Closes apache#15870 from lw-lin/test-compact-1113.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants