Description
What happened?
Since #22718, our Spotify Scio-based streaming pipelines on Google Cloud Dataflow have been failing with an `AvroCoder` exception while reading data from BigQuery (with `TypedRead`).
The last released version of Beam that works properly is 2.42.0, and we cannot upgrade some of our pipelines further because of this issue.
We read `GenericRecord`s from a temporary BigQuery table and apply a `parseFn` function to them to create arbitrary (non-Avro) types, which is effectively Case 3 from the table described in `AvroSource.Mode`.
```
// pseudo code
BigQueryIO
  .read(SerializableFunction[SchemaAndRecord, T] parseFn)
  .withCoder(Coder[T] coder)
```

I analysed the issue. It is complex, but the gist of it is that:
- Support custom avro DatumReader when reading from BigQuery #22718 adds the ability to use a custom `AvroSource.DatumReaderFactory` implementation for reading from BigQuery;
- it creates its "default" / "backwards compatibility" implementation and uses it in `BigQueryIO.read`;
- this "default" implementation is in fact using the `parseFn` function (supplied to `BigQueryIO.read`) to actually return the parsed type from the custom `DatumReader`;
- however, it does not (and cannot) propagate the output `Coder` to the `AvroSource` used for reading the data;
- Dataflow (in streaming mode) wraps the `AvroSource` in an `UnboundedReadFromBoundedSource` wrapper in order to use it as an `UnboundedSource`;
- on the way, it tries to get the output `Coder` from the underlying `AvroSource` to use it as the `CheckpointCoder` for checkpointing;
- `AvroSource` does not have a clue about `parseFn` actually being used, so it returns an `AvroCoder` instance, which of course cannot `encode` arbitrary (non-Avro) types.
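The failure mechanism can be illustrated with a small, self-contained toy model. This is plain Java, not actual Beam code; all class and method names below are invented for illustration. The point is that the streaming wrapper checkpoints elements with whatever coder the bounded source reports, and a source that only knows about Avro reports a coder that cannot encode the type produced by `parseFn`:

```java
// Simplified model of the failure (NOT actual Beam code; names are invented).
import java.util.function.Function;

public class CoderMismatchSketch {
    interface Coder<T> { byte[] encode(T value); }

    // Stands in for AvroCoder: only understands "avro" values
    // (modeled here as String payloads).
    static final Coder<Object> AVRO_CODER = value -> {
        if (!(value instanceof String)) {
            throw new ClassCastException(
                "AvroCoder cannot encode non-avro type: " + value.getClass().getName());
        }
        return ((String) value).getBytes();
    };

    // Stands in for AvroSource after #22718: parseFn is applied inside the
    // DatumReader, but getOutputCoder() still reports the Avro coder.
    static class ToySource<T> {
        final Function<String, T> parseFn;
        ToySource(Function<String, T> parseFn) { this.parseFn = parseFn; }
        T read(String rawRecord) { return parseFn.apply(rawRecord); }
        @SuppressWarnings("unchecked")
        Coder<T> getOutputCoder() { return (Coder<T>) AVRO_CODER; } // wrong for non-avro T
    }

    // Stands in for UnboundedReadFromBoundedSource: it checkpoints elements
    // using the coder reported by the underlying source.
    static <T> byte[] checkpoint(ToySource<T> source, String rawRecord) {
        T element = source.read(rawRecord);
        return source.getOutputCoder().encode(element); // throws for non-avro types
    }

    public static void main(String[] args) {
        // parseFn produces an Integer, which the reported coder cannot encode.
        ToySource<Integer> source = new ToySource<>(Integer::parseInt);
        try {
            checkpoint(source, "42");
            throw new AssertionError("expected ClassCastException");
        } catch (ClassCastException expected) {
            System.out.println("checkpointing failed as described: " + expected.getMessage());
        }
    }
}
```

In the real pipeline the same mismatch surfaces when Dataflow tries to checkpoint elements of the parsed type with the `AvroCoder` reported by the `AvroSource`.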
The biggest issue I see is that the contract that `AvroSource` enforces (applying `parseFn` during the read together with the supplied output `Coder`) is broken by moving the responsibility of applying `parseFn` into `GenericDatumTransformer`.
I am thinking about contributing a fix and I am pondering on the following solution:
- removal of `BigQueryIO.GenericDatumTransformer`;
- bringing back `parseFn` to the `BigQueryBaseSource` hierarchy;
- simplifying the `datumReaderFactory` type to `AvroSource.DatumReaderFactory<T>` and no longer applying `parseFn` in it;
- adding validation that only one of `parseFn` or `datumReaderFactory` is used; I believe that the purpose of a custom `DatumReader` is to actually read `SpecificRecord`s and output them without the need for additional parsing;
- creating `AvroSource`s according to which param was actually provided in `BigQuerySourceBase.createSources`.
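For the validation step, a minimal sketch of what I have in mind (method and parameter names are hypothetical, not actual Beam code) would be a mutual-exclusion check at read construction time:

```java
// Hypothetical sketch of the proposed validation (names invented, not actual
// Beam code): parseFn and a custom datumReaderFactory must not both be set.
public class ReadValidationSketch {
    static void validateReadArguments(Object parseFn, Object datumReaderFactory) {
        if (parseFn != null && datumReaderFactory != null) {
            throw new IllegalArgumentException(
                "Only one of parseFn or datumReaderFactory may be specified");
        }
    }

    public static void main(String[] args) {
        validateReadArguments(new Object(), null); // ok: parseFn only
        validateReadArguments(null, new Object()); // ok: custom factory only
        try {
            validateReadArguments(new Object(), new Object());
            throw new AssertionError("expected IllegalArgumentException");
        } catch (IllegalArgumentException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

Whether such a check should live in `BigQueryIO.read` itself or deeper in the source hierarchy is an open design question.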
This will of course add more complexity to an already complex process, but it will preserve backwards compatibility in more scenarios.
CC: @steveniemitz @kkdoon
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner