Skip to content

Fix flaky realtime index task tests#8999

Merged
clintropolis merged 4 commits intoapache:masterfrom
jihoonson:fix-realtime-task-tests
Dec 18, 2019
Merged

Fix flaky realtime index task tests#8999
clintropolis merged 4 commits intoapache:masterfrom
jihoonson:fix-realtime-task-tests

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

@jihoonson jihoonson commented Dec 6, 2019

Fixes #8918.

Description

Those tests have been flaky since they are optimistically waiting for tasks to ingest all data in 5 seconds. This PR modifies those tests to actually check the number of rows ingested.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

This change is Reviewable

@jihoonson jihoonson removed the WIP label Dec 7, 2019
@@ -88,25 +89,30 @@ void doTest()
// the task will run for 3 minutes and then shutdown itself
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment to 10 minutes as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thanks. Changed back to 5 mins since 10 mins was for testing. I think 3 mins can be too short. Updated the comment as well.


// sleep for a while to let peons finish starting up
TimeUnit.SECONDS.sleep(5);
TimeUnit.SECONDS.sleep(60);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 60? can we use ITRetryUtil.retryUntil... instead of making every test wait 1 minute?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have any way to check whether the task is ready to consume data or not for now. 5 seconds was too short and I think 1 min of waiting isn't that harm in a sense that our tests usually runs for 30 mins.

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, once a sleep is added, it's pretty hard to remove it. I think it would be helpful to mention in the comment that we don't know when the start up completes because there is no endpoint and if we had one, we could get rid of the sleep

Try till success, means we'd wait at most 1 minute. Right now every time our tests will run for 31 minutes instead of 30. Happy to discuss this offline - maybe it's not such a big deal if we add an extra minute to the tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on adding a sleep. But RealtimeIndexTask and AppenderatorRealtimeIndexTask are sort of deprecated (in a sense that Tranquility is deprecated which is the main user of those tasks) and I don't think it's worth to spend much time for them.

Regarding the test run time, I think the most 2 common cases when the test run time matters are the CI and manual running of tests. For CI, LGTM and Teamcity are running more than an hour and Travis is not a bottleneck. For manual running of tests, I usually do something else while tests are running and adding 1 more min doesn't bother me much. What do you think?

BTW, anything happened only offline is regarded as that hasn't happened in Apache way.

Open Communications: as a virtual organization, the ASF requires all communications related to code and decision-making to be publicly accessible to ensure asynchronous collaboration, as necessitated by a globally-distributed community.

We can discuss offline, but that discussion and the conclusion must be filed somewhere public.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline - these tests are for parts of the codebase that is currently deprecated. They will likely be removed in the near future. Right now the bottleneck for PR checks is LGTM (90 min) So even though this adds 2 min extra - there is no visible difference to the time it takes to merge a PR.

@Override
int getNumExpectedRowsIngested()
{
return 22;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does 22 come from?

nit: make this a constant so we can add an explanation in the javadocs? Similar comment for ITRealtimeIndexTaskTest

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Added.

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to discuss the sleep statement offline so we can decide if any changes are needed.


// sleep for a while to let peons finish starting up
TimeUnit.SECONDS.sleep(5);
TimeUnit.SECONDS.sleep(60);
Copy link
Copy Markdown
Contributor

@suneet-s suneet-s Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, once a sleep is added, it's pretty hard to remove it. I think it would be helpful to mention in the comment that we don't know when the start up completes because there is no endpoint and if we had one, we could get rid of the sleep

Try till success, means we'd wait at most 1 minute. Right now every time our tests will run for 31 minutes instead of 30. Happy to discuss this offline - maybe it's not such a big deal if we add an extra minute to the tests

private static final String REALTIME_QUERIES_RESOURCE = "/indexer/wikipedia_realtime_index_queries.json";
/**
* The expected number of rows ingested for this test.
* The total number of rows of raw data is 22, but two rows will be rolled up into one row.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pardon my ignorance, where does the raw data come from?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@clintropolis clintropolis merged commit 94a23fb into apache:master Dec 18, 2019
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test ITAppenderatorDriverRealtimeIndexTaskTest testRealtimeIndexTask()

3 participants