Add Kafka message metadata fields (offset, partition, topic) to batch processing for deduplication support #141
New pipeline config file (`@@ -0,0 +1,83 @@`):

```yaml
commands:
  - name: attach to motherduck
    sql: |
      ATTACH 'md:my_db'

  - name: create events table
    sql: |
      CREATE TABLE IF NOT EXISTS my_db.events (
        ip VARCHAR,
        event VARCHAR,
        properties_city VARCHAR,
        properties_country VARCHAR,
        timestamp TIMESTAMP,
        type VARCHAR,
        userId VARCHAR
      )

  - name: create events metadata table
    sql: |
      CREATE TABLE IF NOT EXISTS my_db.events_metadata (
        partition INTEGER,
        "offset" BIGINT,
        topic VARCHAR,
        updated_at TIMESTAMP DEFAULT now(),
        PRIMARY KEY (topic, partition)
      )

pipeline:
  name: kafka-motherduck-sink
  description: "Sinks data from kafka to motherduck"
  batch_size: {{ SQLFLOW_BATCH_SIZE|default(100000) }}

  source:
    type: kafka
    kafka:
      brokers: [{{ SQLFLOW_KAFKA_BROKERS|default('localhost:9092') }}]
      group_id: motherduck-sink
      auto_offset_reset: earliest
      topics:
        - "input-user-clicks-motherduck"

  handler:
    type: "handlers.InferredMemBatch"
    sql: |
      BEGIN TRANSACTION;

      CREATE OR REPLACE TEMPORARY TABLE filtered_batch AS
      SELECT b.*
      FROM batch b
      LEFT JOIN my_db.events_metadata em
        ON b.kafka_topic = em.topic
        AND b.kafka_partition = em.partition
      WHERE em."offset" IS NULL
        OR b.kafka_offset > em."offset";

      INSERT INTO my_db.events
      SELECT
        ip,
        event,
        properties ->> 'city' AS properties_city,
        properties ->> 'country' AS properties_country,
        CAST(timestamp AS TIMESTAMP) AS timestamp,
        type,
        userId
      FROM filtered_batch;

      INSERT INTO my_db.events_metadata
        (partition, "offset", topic)
      SELECT
        kafka_partition AS partition,
        MAX(kafka_offset) AS "offset",
        kafka_topic AS topic,
      FROM filtered_batch
      WHERE kafka_offset IS NOT NULL
      GROUP BY kafka_partition, kafka_topic
      ON CONFLICT (topic, partition)
      DO UPDATE SET
        "offset" = EXCLUDED."offset",
        updated_at = now();

      COMMIT;

  sink:
    type: noop
```
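The filtering step in the handler SQL keeps a row only when its offset is past the high-water mark recorded for its (topic, partition) pair, then advances the watermark for whatever it ingested. A minimal pure-Python sketch of that rule (the function name and dict shapes here are hypothetical, not from the PR):

```python
def filter_batch(batch, watermarks):
    """Drop already-ingested rows, mirroring the LEFT JOIN / WHERE filter.

    batch: list of dicts carrying kafka_topic, kafka_partition, kafka_offset.
    watermarks: {(topic, partition): max offset already ingested}.
    """
    kept = []
    for row in batch:
        key = (row["kafka_topic"], row["kafka_partition"])
        hwm = watermarks.get(key)  # None -> partition never seen, keep the row
        if hwm is None or row["kafka_offset"] > hwm:
            kept.append(row)
    # Advance the watermark, mirroring the INSERT ... ON CONFLICT upsert.
    for row in kept:
        key = (row["kafka_topic"], row["kafka_partition"])
        watermarks[key] = max(watermarks.get(key, -1), row["kafka_offset"])
    return kept
```

Replaying the same batch a second time yields nothing, which is what makes the at-least-once Kafka delivery safe to re-run.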
Batch handler (`@@ -101,8 +101,14 @@ def init(self):`):

```diff
         self.rows = []
         return self

-    def write(self, bs):
+    def write(self, bs, offset=None, partition=None, topic=None):
         o = self.deserializer.decode(bs)
+        if offset is not None:
+            o['kafka_offset'] = offset
+        if partition is not None:
+            o['kafka_partition'] = partition
+        if topic is not None:
+            o['kafka_topic'] = topic
         self.rows.append(o)

     def invoke(self) -> Optional[pa.Table]:
```

> **Owner** (on `def write`): I think this is great for right now, and love that you have concrete use cases for this! We may have to rethink this in a future version. Kafka is, by far, the most popular source, but the dream is to have other sources available as well. I think this is great, just commenting for future!
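A self-contained sketch of the enriched `write` path, assuming a JSON deserializer (the real `handlers.InferredMemBatch` uses whatever deserializer is configured; the class name below is a stand-in, not the project's):

```python
import json


class InferredMemBatchSketch:
    """Simplified stand-in for the PR's batch handler write() method."""

    def __init__(self):
        self.rows = []

    def write(self, bs, offset=None, partition=None, topic=None):
        o = json.loads(bs)
        # Attach metadata only when the source provided it, so non-Kafka
        # sources (or older callers) still produce rows without kafka_* keys.
        if offset is not None:
            o["kafka_offset"] = offset
        if partition is not None:
            o["kafka_partition"] = partition
        if topic is not None:
            o["kafka_topic"] = topic
        self.rows.append(o)
```

The `is not None` guards are what let the Owner's comment about future non-Kafka sources hold: callers that never pass metadata get exactly the old behavior.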
Kafka `Message` class (`@@ -6,12 +6,23 @@`):

```diff
 logger = logging.getLogger(__name__)

 class Message:
-    def __init__(self, value: bytes):
+    def __init__(self, value: bytes, topic: str | None, partition: int | None, offset: int | None):
```

Suggested change:

```suggestion
    def __init__(self, value: bytes, topic: str | None = None, partition: int | None = None, offset: int | None = None):
```
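For illustration, a runnable sketch of the `Message` carrier using the suggested keyword defaults (the attribute names are assumed from the signature; the class body is not shown in the diff):

```python
from __future__ import annotations


class Message:
    """Carrier for a consumed record plus its Kafka coordinates.

    Defaults of None keep the constructor backward compatible: callers
    that only have a payload (e.g. non-Kafka sources) pass just `value`.
    """

    def __init__(self, value: bytes, topic: str | None = None,
                 partition: int | None = None, offset: int | None = None):
        self.value = value
        self.topic = topic
        self.partition = partition
        self.offset = offset
```

Downstream, the handler's `write(bs, offset=..., partition=..., topic=...)` can be fed directly from these attributes.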
> Yes! I need to figure out why CI didn't run on a fork! 👀

> Missing comma after 'kafka_topic AS topic' in the SELECT statement. This will cause a SQL syntax error.
> ooo https://duckdb.org/docs/stable/sql/dialect/friendly_sql.html <3 I had no idea that duckdb does this!