Skip to content

Fix ingestion task failure when no input split to process#11553

Merged
maytasm merged 3 commits intoapache:masterfrom
maytasm:REQ-2430
Aug 9, 2021
Merged

Fix ingestion task failure when no input split to process#11553
maytasm merged 3 commits intoapache:masterfrom
maytasm:REQ-2430

Conversation

@maytasm
Copy link
Copy Markdown
Contributor

@maytasm maytasm commented Aug 5, 2021

Fix ingestion task failure when no input split to process

Description

This PR fix a regression introduced by #11189
The regression causes ingestion task in parallel mode to fail when there is no input split to process (previously the task would succeed).
This is because of a null pointer exception occurred in getReports() at

. This is because this method accesses taskMonitor without checking if it is null or not. This is in contrast to all other code in the class, which check if taskMonitor is null or not. Also see in the run() method of the class ,
that taskMonitor is initialized in line 121, but that does not always happen. If the method returned in line 116, which is when there is no input source to split, then taskMonitor stays null.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@maytasm maytasm requested a review from jihoonson August 5, 2021 09:27
@suneet-s
Copy link
Copy Markdown
Contributor

suneet-s commented Aug 5, 2021

This sounds like an interesting edge case. Hope can there be an interesting y'all with no input split? How can there be an ingestion spec with no input split

I think we should add an integration test for this.

EDIT: Wow auto-correct on my phone - fixed my comment

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @maytasm. Thanks for fixing this bug.

Comment on lines +292 to +294
if (taskMonitor != null) {
taskMonitor.collectReport(report);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collectReport should be called only when there is a subtask sending its report. Since TaskMonitor is responsible for spawning subtasks, it seems quite strange if taskMonitor is null when this method is called. How about adding a precondition that explodes when it's null?

Copy link
Copy Markdown
Contributor Author

@maytasm maytasm Aug 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just saw that the taskMonitor was accessed without a null check so I added the null check just in case. The bug mentioned in this ticket can be fix with just the change in getReports(). Given the context from your comment, maybe it is better to leave it as is and adds a comment mentioning something like... collectReport is only called when there is a subtask sending its report. Since TaskMonitor is responsible for spawning subtasks, the taskMonitor cannot be null

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment sounds good. I think having a null check is also a good convention because someone like me could break the contract around taskMonitor unintentionally in the future. Perhaps we can add assert taskMonitor != null;. This will be ran only in tests, but not in production.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to have static analysis flag all these uses? I've tried using combinations of @Nullable(doesn't realize that once it's set, it's good for the rest of the method) and @MonotonicNonNull but haven't been able to get intelliJ to flag these issues for me locally yet. I figured I would just leave a comment here and maybe you will have better luck

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can set the severity of the "Nullability problems" intellij inspection to error instead of warning. I think we haven't done it because there could be lots of places where the inspection will complain about nulls. It could be worth doing it at some point though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public Map<String, SubTaskReportType> getReports()
{
return taskMonitor.getReports();
return taskMonitor == null ? Collections.emptyMap() : taskMonitor.getReports();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Aug 7, 2021

This sounds like an interesting edge case. Hope can there be an interesting y'all with no input split? How can there be an ingestion spec with no input split

I think we should add an integration test for this.

EDIT: Wow auto-correct on my phone - fixed my comment

To name a few cases:

  • Reindex with datasource name that doesn’t exist
  • Reindex with interval that does not contain any data
  • Ingestion with interval in granularitySpec mismatching the inputSpec

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Aug 7, 2021

This sounds like an interesting edge case. Hope can there be an interesting y'all with no input split? How can there be an ingestion spec with no input split

I think we should add an integration test for this.

EDIT: Wow auto-correct on my phone - fixed my comment

Added integration test

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Thanks for the integration test!

@suneet-s
Copy link
Copy Markdown
Contributor

suneet-s commented Aug 9, 2021

To name a few cases:

  • Reindex with datasource name that doesn’t exist
  • Reindex with interval that does not contain any data
  • Ingestion with interval in granularitySpec mismatching the inputSpec

The first 2 cases sound like a user doing something they shouldn't be doing - and the re-index is a no-op. Maybe we should surface this as a typo to the end user?

Can you provide an example of the 3rd case?

I'm wondering if all these cases are situations where the end user should be warned of something "unexpected" happening.

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Aug 9, 2021

To name a few cases:

  • Reindex with datasource name that doesn’t exist
  • Reindex with interval that does not contain any data
  • Ingestion with interval in granularitySpec mismatching the inputSpec

The first 2 cases sound like a user doing something they shouldn't be doing - and the re-index is a no-op. Maybe we should surface this as a typo to the end user?

Can you provide an example of the 3rd case?

I'm wondering if all these cases are situations where the end user should be warned of something "unexpected" happening.

For the 3rd case, imagine you have a hash partitioning with the following spec:

   ...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "mydatasource",
        "interval": "2020-07-31T00:00:00.000Z/2020-08-01T00:00:00.000Z",
        "filter": null,
        "dimensions": null,
        "metrics": [
          ...
        ]
      },
      "inputFormat": null,
      "appendToExisting": false
    },
    ....
    "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true,
        "intervals": [
          "2020-08-01T00:00:00.000Z/2020-08-02T00:00:00.000Z"
        ]
      },
    ....
    "partitionsSpec": {
        "type": "hashed",
        "numShards": null,
        "partitionDimensions": [],
        "partitionFunction": "murmur3_32_abs",
        "maxRowsPerSegment": 5000000
      },
    ....
    

The partial_segment_merge phase tasks will have no input split to process as the interval in the ioConfig does not match with intervals in granularitySpec

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Aug 9, 2021

To name a few cases:

  • Reindex with datasource name that doesn’t exist
  • Reindex with interval that does not contain any data
  • Ingestion with interval in granularitySpec mismatching the inputSpec

The first 2 cases sound like a user doing something they shouldn't be doing - and the re-index is a no-op. Maybe we should surface this as a typo to the end user?
Can you provide an example of the 3rd case?
I'm wondering if all these cases are situations where the end user should be warned of something "unexpected" happening.

For the 3rd case, imagine you have a hash partitioning with the following spec:

   ...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "mydatasource",
        "interval": "2020-07-31T00:00:00.000Z/2020-08-01T00:00:00.000Z",
        "filter": null,
        "dimensions": null,
        "metrics": [
          ...
        ]
      },
      "inputFormat": null,
      "appendToExisting": false
    },
    ....
    "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "rollup": true,
        "intervals": [
          "2020-08-01T00:00:00.000Z/2020-08-02T00:00:00.000Z"
        ]
      },
    ....
    "partitionsSpec": {
        "type": "hashed",
        "numShards": null,
        "partitionDimensions": [],
        "partitionFunction": "murmur3_32_abs",
        "maxRowsPerSegment": 5000000
      },
    ....
    

The partial_segment_merge phase tasks will have no input split to process as the interval in the ioConfig does not match with intervals in granularitySpec

The cases I thought of are due to user error in their ingestionSpec. I am not sure if there are valid cases where we can run into no input split to process situation though.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I'm wondering if all these cases are situations where the end user should be warned of something "unexpected" happening.

I think this is likely because of the user error in their ingestion spec, and thus it makes sense to bubble up the user error to their end. The thing is how. To me, from the Druid perspective, the most reasonable way is ending the task with the SUCCEEDED state (because the task technically succeeded with no error) and showing the ingestion metrics, such as # of rows ingested, in the task report, so that users can notice the strangeness of their ingestion job from those metrics. We could probably improve our UI to make this more noticeable.

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Aug 9, 2021

Merging this in to fix regression. Improvement for logging and UI can be done separately

@maytasm maytasm merged commit 06bae29 into apache:master Aug 9, 2021
@maytasm maytasm deleted the REQ-2430 branch August 9, 2021 16:11
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants