KAFKA-15121: Implement the alterOffsets method in the FileStreamSourceConnector and the FileStreamSinkConnector #13945
Conversation
gharris1727
left a comment
Hey thanks for handling this, Yash!
I had one comment about validating filenames from previous configurations.
// This connector makes use of a single source partition which represents the file that it is configured to read from
if (offsets.size() > 1) {
    throw new ConnectException("The " + FileStreamSourceConnector.class.getSimpleName() + " supports only a single source partition / file");
Suppose that:
- The connector is configured with one file a
- The task commits offsets for a
- The connector is reconfigured for a different file b
- The task commits offsets for b
- A user calls GET /offsets and sees both a and b
- A user calls DELETE /offsets with a, or a and b in the same request.
I think in this situation, it is reasonable for the delete to succeed; the user clearly intends to delete the old offset, and we shouldn't force them to revert the configuration to contain a in order to delete that offset. Also, a PATCH /offsets which clears a and updates b in the same request should succeed.
I think it is reasonable to fail a request which updates a, since that is not an offset which the current configuration could produce.
So in other words: If the value is non-null, then we should validate the key. If the value is null, then any filename should be allowed, since it might have been from a previous configuration.
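The rule above could be sketched as follows. This is only an illustration, not the connector's actual code: the class, method name, and the configured-file constant are all hypothetical.

```java
import java.util.Map;

public class OffsetValidationSketch {
    // Hypothetical: the file the connector is currently configured to read.
    static final String CONFIGURED_FILE = "/data/current.txt";

    // Returns true if an offset alteration for the given partition should be
    // accepted. Per the rule above: a null offset (a delete) is allowed for any
    // filename, since it may come from a previous configuration; a non-null
    // offset must target the currently configured file.
    static boolean isValidAlteration(Map<String, ?> partition, Map<String, ?> offset) {
        if (offset == null) {
            return true; // deletes may target offsets left over from older configurations
        }
        return CONFIGURED_FILE.equals(partition.get("filename"));
    }

    public static void main(String[] args) {
        // Deleting an offset for a previously configured file succeeds.
        if (!isValidAlteration(Map.of("filename", "/data/old.txt"), null))
            throw new AssertionError("delete of old partition should be allowed");
        // Updating the current file's offset succeeds.
        if (!isValidAlteration(Map.of("filename", CONFIGURED_FILE), Map.of("position", 0L)))
            throw new AssertionError("update of current partition should be allowed");
        // Updating an offset the current configuration could never produce fails.
        if (isValidAlteration(Map.of("filename", "/data/old.txt"), Map.of("position", 0L)))
            throw new AssertionError("update of old partition should be rejected");
        System.out.println("ok");
    }
}
```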
Oh wow, that's a really good catch, thanks! One point to note is that in a DELETE /offsets request for a source connector, users don't specify the partitions / offsets (we write null offsets for every known partition).
Thinking about this a bit more, we shouldn't fail the request even if there are multiple partitions with non-null values. Taking the same example you used, if a user calls GET /offsets and copies the response body as the request body for a PATCH /offsets request while only modifying the offset value for file b, I don't think it makes sense to fail that request. Requests that don't even include the current source partition are fairly harmless, too. I've relaxed the validation criteria in the latest patch.
C0urante
left a comment
Thanks Yash! Left some thoughts.
BTW, have you also filed a ticket to implement these APIs in MM2? If not, no worries; I'd be happy to take that work on if you don't have the cycles.
Thanks Chris, I've addressed your review comments and created this ticket for MM2 (but haven't self-assigned it).
C0urante
left a comment
Thanks Yash! Still have some thoughts. Also wondering if maybe it's time we add integration tests for these connectors, given possible interactions between the Connect runtime and the file Connector and Task classes that are tricky to test. Definitely not a blocker for this PR but interested in your take.
// The 'position' in the offset represents the position in the file's byte stream and should be a non-negative long value
try {
    long offsetPosition = Long.parseLong(String.valueOf(offset.get(POSITION_FIELD)));
Isn't this too permissive? Based on the testAlterOffsetsOffsetPositionValues test case, this would allow values of both "10" (string) and 10 (number) for the offset. But it looks like the source task class would only work with numbers.
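The difference between the two checks can be seen in a small standalone sketch (the class and method names here are hypothetical, chosen just to isolate the parse being discussed):

```java
public class PositionParsingSketch {
    // The permissive parse used in the connector's validation path above.
    static long parsePosition(Object raw) {
        return Long.parseLong(String.valueOf(raw));
    }

    public static void main(String[] args) {
        // Both the JSON number 10 (an Integer in memory) and the string "10"
        // pass the parse-based validation:
        System.out.println(parsePosition(10));   // 10
        System.out.println(parsePosition("10")); // 10

        // ...whereas a strict instanceof check accepts neither of those inputs:
        System.out.println((Object) 10 instanceof Long);   // false
        System.out.println((Object) "10" instanceof Long); // false
    }
}
```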
Ah, good catch, thanks. I think it might be a bit friendlier if we update the task class instead to do similar parsing, WDYT? I'm okay either way, since the most common use case would be copy-pasting the output from GET /offsets and modifying it, in which case users would end up using a number rather than a string anyway.
I'd prefer to leave the task parsing the same; less work on our part, and less risk of a regression in existing parts of the code base.
I found something interesting when experimenting with this. When passing a request body to the PATCH /offsets endpoint like:
{
"offsets": [
{
"partition": {
"filename": "/path/to/filename"
},
"offset": {
"position": 20
}
}
]
}
the position 20 is deserialized to an Integer by Jackson (the JSON library we use in the REST layer for Connect) which seems fine because JSON doesn't have separate types for 32 bit and 64 bit numbers. So, the offsets map that is passed to FileStreamSourceConnector::alterOffsets by the runtime also contains 20 as an Integer value. I initially thought that this would cause the FileStreamSourceTask to fail at startup because it uses an instanceof Long check here (and an Integer value is obviously not an instance of Long). However, interestingly, the task did not fail and doing some debugging revealed that after the offsets are serialized and deserialized by the JsonConverter in OffsetsStorageWriter (in Worker::modifySourceConnectorOffsets) and OffsetsStorageReader respectively, the offsets map that is retrieved by the task on startup through its context contains the position 20 as a Long value.
While this particular case is easily handled by simply accepting Integer values as valid in the FileStreamSourceConnector::alterOffsets method, I'm thinking we probably need to make some changes so that the offsets map passed to source connectors in their alterOffsets method is the same as the offsets map that connectors / tasks will retrieve via the OffsetsStorageReader from their context (otherwise, this could lead to some hard to debug issues in other connectors implementing the SourceConnector::alterOffsets method). The easiest way off the top of my head would probably be to serialize and deserialize the offsets map using the JsonConverter before invoking SourceConnector::alterOffsets. WDYT?
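The effect of the proposed round trip can be approximated with a pure-JDK sketch. To be clear, this is not the runtime's actual code (the real work would be done by the JsonConverter); it only illustrates why the types change: schemaless JSON has no 32-bit/64-bit distinction, so integral values come back as Long regardless of the in-memory type the REST layer produced.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetNormalizationSketch {
    // Illustrative stand-in for the JsonConverter round trip described above:
    // integral values are widened to Long, the type a schemaless JSON
    // deserialization would hand back to the task.
    static Map<String, Object> normalize(Map<String, Object> offset) {
        Map<String, Object> result = new HashMap<>();
        offset.forEach((k, v) -> {
            if (v instanceof Integer || v instanceof Short || v instanceof Byte) {
                result.put(k, ((Number) v).longValue());
            } else {
                result.put(k, v);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // The REST layer hands us an Integer (Jackson's type for small JSON numbers)...
        Map<String, Object> fromRest = Map.of("position", 20);
        // ...but after the round trip the task would observe a Long.
        Object position = normalize(fromRest).get("position");
        System.out.println(position instanceof Long); // true
    }
}
```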
Furthermore, just checking whether the offset position is an instance of Long (Jackson uses a Long if the number doesn't fit in an Integer) or Integer in the FileStreamSourceConnector::alterOffsets method seems sub-optimal because:
- To someone just reading through the FileStreamSourceConnector and FileStreamSourceTask classes, accepting Integer instances during validation but requiring Long instances in the actual task would look like a bug since the serialization + deserialization aspect isn't transparent.
- It's an extremely unlikely scenario, but any changes in Jackson's deser logic could break things here - for instance, if smaller numbers are deserialized into Shorts instead of Integers. Parsing using a combination of String::valueOf and Long::parseLong seems a lot more robust.
Ah, nice catch! I noticed the discrepancy in numeric types while working on KAFKA-15177 but hadn't even considered the possibility of aligning the types across invocations of alterOffsets and OffsetStorageReader::offset/OffsetStorageReader::offsets.
I think re-deserializing the offsets before passing them to alterOffsets is a great idea. Unless the request body is gigantic there shouldn't be serious performance concerns, and it also acts as a nice preflight check to ensure that the offsets can be successfully propagated to the connector's tasks through the offsets topic.
I still don't love permitting string types for the connector's position offset values--it doesn't seem like a great endorsement of our API if we have to implement workarounds in the file connectors, which are the first example of the connector API that many developers see.
Since the type alignment issue seemed like a broader one (i.e. not scoped to the file connectors being touched here), I've created a separate ticket and PR for it.
it doesn't seem like a great endorsement of our API if we have to implement workarounds in the file connectors, which are the first example of the connector API that many developers see.
I'd argue that it isn't really a workaround and that the current check itself is bad. If the (de)serialization happened to use Integer for values that fit in a 32 bit signed type (which would be perfectly valid and is exactly what we see currently before the values are passed through the converter), the current check in the FileStreamSourceTask would cause it to bail.
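The brittleness being described can be shown in miniature (hypothetical class and method names; the real check lives in FileStreamSourceTask):

```java
public class TaskCheckSketch {
    // Stand-in for the task's startup check discussed above: the stored
    // position must literally be a Long, so an Integer holding the same
    // numeric value fails the instanceof test.
    static boolean positionValid(Object position) {
        return position instanceof Long;
    }

    public static void main(String[] args) {
        System.out.println(positionValid(20L)); // true
        System.out.println(positionValid(20));  // false: autoboxed to Integer, not Long
    }
}
```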
Hmmm... wouldn't that be a pretty serious breaking change if we accidentally switched up how the JSON converter deserializes integer types? Not just for the file source connector, but for plenty of others.
It feels like it might be a better use of our time to make note of this possibility and ensure that we have sufficient unit testing in place to prevent that kind of regression (I suspect we already do but haven't verified this yet).
Of course, because things aren't interesting enough already--it turns out that there are actually two different scenarios in which tasks observe offsets for their connector. The first, which we're all familiar with, is when they query them using an OffsetStorageReader, which in distributed mode reflects the contents of the offsets topic. The second is when SourceTask::commitRecord is invoked, which carries with it the just-ack'd SourceRecord instance originally provided by the task, including the original in-memory source partition and source offset, which may use types that get lost when written to and read back from the offsets topic.
I don't know if this significantly changes the conversation but it seems subtle and counterintuitive enough to bring up so that we can avoid accidentally breaking connector code that relies on this behavior.
Hmmm... wouldn't that be a pretty serious breaking change if we accidentally switched up how the JSON converter deserializes integer types? Not just for the file source connector, but for plenty of others.
Okay, that's fair enough, I've changed the check in FileStreamSourceConnector::alterOffsets to mirror the one made in the task at startup for consistency (and avoided making changes in the existing task logic). This does mean that this PR should be merged after #14003 has been merged (assuming that that approach is acceptable).
I don't know if this significantly changes the conversation but it seems subtle and counterintuitive enough to bring up so that we can avoid accidentally breaking connector code that relies on this behavior.
Hm yeah, that's definitely another interesting one to bring up - however, I'd contend that that one kinda makes sense since we're passing the SourceRecord itself - tasks already deal with SourceRecord and their offsets (and associated types) in their regular lifecycle. It would be highly confusing if the SourceRecord that they get in commitRecord doesn't match the one they dispatched to the framework via poll. Of course, ideally, the offsets that they read via OffsetStorageReader should also not have any type mismatches compared to the SourceRecord ones, but I don't think we'd want to (or safely could) change that at this point.
Since the offsets being altered externally would correspond to the ones that the connector / tasks read at startup, I think it makes sense to align the types across invocations to SourceConnector::alterOffsets and offsets queried from an OffsetStorageReader (and an implicit separate alignment between the SourceRecord's offsets types).
💯 Sounds good, thanks for thinking this through.
I was toying with the idea of adding ITs for these file connectors in this very PR earlier, but after a bit of tinkering, left it for later 😄 I'd be happy to file a separate ticket for that and assign it to myself.
…eConnector and the FileStreamSinkConnector (apache#13945) Reviewers: Chris Egerton <chrise@aiven.io>