The tap doesn't respect existing replication state: it does not filter out records older than the replication key value stored in the state.
How to reproduce
Github tap configuration
- name: tap-github-repos
  inherit_from: tap-github
  pip_url: git+https://github.com/MeltanoLabs/tap-github.git
  config:
    user_agent: ''
    start_date: '2023-01-01T00:00:00Z'
    searches:
    - name: All repos
      query: apache/*
  variant: meltanolabs
  select:
  - repositories.*
  metadata:
    repositories:
      replication-method: INCREMENTAL
Run a sync that produces 1000 (limit for the 'repositories' stream) records and a state record.
meltano run tap-github-repos target-jsonl
Run the same sync one more time
meltano run tap-github-repos target-jsonl
Result: the target JSON file contains 2000 records, and every record is duplicated in full.
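To confirm the duplication, something like the following could be run against the target output. This is only a sketch: `count_duplicates` is a hypothetical helper, and the path in the usage comment is an assumption about where target-jsonl writes the stream, so adjust it to your setup.

```python
import json
from collections import Counter
from typing import Iterable


def count_duplicates(lines: Iterable[str]) -> int:
    """Return how many JSONL lines repeat a record already seen earlier."""
    # Re-serialize with sorted keys so key order does not hide a duplicate.
    counts = Counter(
        json.dumps(json.loads(line), sort_keys=True)
        for line in lines
        if line.strip()  # skip blank lines
    )
    return sum(n - 1 for n in counts.values())


# Usage (the path is an assumption, not taken from the tap or target docs):
# with open("output/repositories.jsonl") as fh:
#     print(count_duplicates(fh))
```

After the two runs above, I would expect this to report one duplicate for every record emitted by the first run.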
The issue can be reproduced on the repositories stream.
I couldn't reproduce this on the issues stream.
I haven't tested other streams.
If the GitHub API does not allow fetching data from a specific replication point (at least for the repositories stream), then the tap should filter out those records itself instead of sending them down the pipeline.
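A minimal sketch of the client-side filtering I have in mind is below. It is not the tap's actual code: `filter_by_bookmark` and `parse_ts` are hypothetical names, and I'm assuming the replication key is an ISO 8601 timestamp field such as `updated_at`, with the bookmark stored in the same format.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Iterator, Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_by_bookmark(
    records: Iterable[Dict[str, Any]],
    replication_key: str,
    bookmark: Optional[str],
) -> Iterator[Dict[str, Any]]:
    """Yield only records that are not older than the bookmark."""
    if bookmark is None:
        # No prior state: this is the first sync, so emit everything.
        yield from records
        return
    threshold = parse_ts(bookmark)
    for record in records:
        # Drop records strictly older than the stored replication key value.
        if parse_ts(record[replication_key]) >= threshold:
            yield record
```

With `>=`, a record whose timestamp equals the bookmark is emitted again on the next run. That avoids missing records that share a timestamp, at the cost of a small number of repeats rather than a full duplicate of the stream.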