[Bug]: Configuration row arguments may get misplaced between Python SchemaTransformPayload encoding and Java RowCoder decoding #25669

@ahmedabu98

Description

What happened?

Was testing a SchemaTransform Python wrapper (#25521) and found that I had to use a specific ordering of kwargs for the input arguments to reach the Java transform in the right fields. This is odd, because the ordering of kwargs should have no impact.

For example, where self._table="my_project:my_dataset.xlang_table",

the following works fine:

external_storage_write = SchemaAwareExternalTransform(
    identifier=self.schematransform_config.identifier,
    expansion_service=self._expansion_service,
    createDisposition=self._create_disposition,
    writeDisposition=self._write_disposition,     #<---
    triggeringFrequencySeconds=self._triggering_frequency,
    useAtLeastOnceSemantics=self._use_at_least_once,
    table=self._table)                            #<---

and I get a configuration object in Java transform that looks like this:

BigQueryStorageWriteApiSchemaTransformConfiguration{
  table=my_project:my_dataset.xlang_table, 
  createDisposition=, 
  writeDisposition=, 
  triggeringFrequencySeconds=0, 
  useAtLeastOnceSemantics=false}

However, if I change the kwargs to look like this (swapping the positions of table and writeDisposition):

external_storage_write = SchemaAwareExternalTransform(
    identifier=self.schematransform_config.identifier,
    expansion_service=self._expansion_service,
    createDisposition=self._create_disposition,
    table=self._table,                            #<---
    triggeringFrequencySeconds=self._triggering_frequency,
    useAtLeastOnceSemantics=self._use_at_least_once,
    writeDisposition=self._write_disposition)     #<---

then I get the following configuration object. Notice the value intended for table has landed in the writeDisposition field.

BigQueryStorageWriteApiSchemaTransformConfiguration{
  table=, 
  createDisposition=, 
  writeDisposition=my_project:my_dataset.xlang_table, 
  triggeringFrequencySeconds=0, 
  useAtLeastOnceSemantics=false}
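A minimal Python sketch of the suspected mismatch (the encode/decode helpers and schema_fields list here are illustrative stand-ins, not Beam's actual SchemaTransformPayload or RowCoder code): if the configuration row's values are serialized positionally in kwarg/insertion order, but the Java side assigns them positionally in the schema's declared field order, then swapping two kwargs moves their values into the wrong fields, exactly as observed above.

```python
# Hypothetical schema-declared field order on the Java side.
schema_fields = ["table", "createDisposition", "writeDisposition"]

def encode(kwargs):
    # Suspected Python-side behavior: values serialized positionally,
    # in the order the kwargs were given (dicts preserve insertion order).
    return list(kwargs.values())

def decode(values):
    # Java RowCoder-side behavior: values assigned positionally,
    # in the schema's declared field order.
    return dict(zip(schema_fields, values))

# Kwargs given in schema order: every value lands in its own field.
ok = decode(encode({
    "table": "my_project:my_dataset.xlang_table",
    "createDisposition": "",
    "writeDisposition": "",
}))

# Kwargs given out of order: table's value lands in writeDisposition.
bad = decode(encode({
    "createDisposition": "",
    "writeDisposition": "",
    "table": "my_project:my_dataset.xlang_table",
}))
```

With this model, `ok["table"]` holds the table spec, while in `bad` the table spec ends up in `writeDisposition` and `table` is empty, matching the two configuration dumps above.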

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
