Skip to content

KAFKA-8586: Fail source tasks when producers fail to send records#6993

Merged
rhauch merged 5 commits intoapache:trunkfrom
C0urante:kafka-8586
Aug 25, 2019
Merged

KAFKA-8586: Fail source tasks when producers fail to send records#6993
rhauch merged 5 commits intoapache:trunkfrom
C0urante:kafka-8586

Conversation

@C0urante
Copy link
Copy Markdown
Contributor

Jira

Previously, if the producer for a source task failed to send a record with a non-retriable error, the record would be silently skipped over. The source task would be allowed to commit offsets for the skipped record, and its status would remain at RUNNING.

The changes here cause source tasks to transition to the FAILED state if their producers fail to send a record with a non-retriable error, and they also change the logic for offset commits to wait for confirmation that records have made it to Kafka before their offsets can be committed.

Tested by running Connect unit tests locally.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@C0urante
Copy link
Copy Markdown
Contributor Author

@wicknicks would you mind taking a look?

@C0urante
Copy link
Copy Markdown
Contributor Author

@rayokota would you mind taking a look?

@wicknicks
Copy link
Copy Markdown
Contributor

@C0urante do we know what errors cause the producer.send() callback to be invoked? Ideally, with the source connector, we should retry forever (internally by the producer API, and we should never see the if (e!=null) block executed in that callback). I'm wondering if the data loss you observed was a result of overriding any of these configs.

@C0urante
Copy link
Copy Markdown
Contributor Author

@wicknicks no, this is not the result of overriding any producer configs. Like is detailed in the Jira, the producer only retries on retriable errors, and TopicAuthorizationExceptions are not retriable.

Copy link
Copy Markdown
Contributor

@wicknicks wicknicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some questions/comments

@C0urante
Copy link
Copy Markdown
Contributor Author

C0urante commented Jul 16, 2019

@wicknicks I've rebased on the latest trunk and addressed all of your comments; could you please make a second pass when you get a chance?

Copy link
Copy Markdown
Contributor

@wicknicks wicknicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modulo some comments where you wanted external inputs, LGTM.

@C0urante
Copy link
Copy Markdown
Contributor Author

Thanks @wicknicks!

@rhauch could you take a look at this when you get a chance?

Copy link
Copy Markdown
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, though I have one comment and one question about potentially removing the second call to maybeThrowProducerSendException().

@rhauch rhauch merged commit 237e83d into apache:trunk Aug 25, 2019
rhauch pushed a commit that referenced this pull request Aug 25, 2019
)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
rhauch pushed a commit that referenced this pull request Aug 25, 2019
)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
rhauch pushed a commit that referenced this pull request Aug 25, 2019
)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
rhauch pushed a commit that referenced this pull request Aug 25, 2019
)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
rhauch pushed a commit that referenced this pull request Aug 25, 2019
)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
rhauch pushed a commit that referenced this pull request Aug 25, 2019
)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.
xiowu0 pushed a commit to linkedin/kafka that referenced this pull request Aug 27, 2019
…rs fail to send records (apache#6993)

TICKET = KAFKA-8586
LI_DESCRIPTION =

EXIT_CRITERIA = HASH [1a5062c]
ORIGINAL_DESCRIPTION =

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.

(cherry picked from commit 1a5062c)
mageshn pushed a commit to confluentinc/kafka that referenced this pull request Nov 20, 2019
…ache#6993)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.

(cherry picked from commit 237e83d)
mageshn added a commit to confluentinc/kafka that referenced this pull request Nov 20, 2019
mageshn pushed a commit to confluentinc/kafka that referenced this pull request Nov 20, 2019
…ache#6993)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.

(cherry picked from commit 3aa9f99)
mageshn added a commit to confluentinc/kafka that referenced this pull request Nov 23, 2019
…ache#6993) (#239)

Changed Connect's `WorkerSourceTask` to capture non-retriable exceptions from the `producer.send(...)` (e.g., authentication or authorization errors) and to fail the connector task when such an error is encountered. Modified the existing unit tests to verify this functionality.

Note that most producer errors are retriable, and Connect will (by default) set up each producer with 1 max in-flight message and infinite retries. This change only affects non-retriable errors.

(cherry picked from commit 3aa9f99)
@C0urante C0urante deleted the kafka-8586 branch January 12, 2021 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants