Skip to content

Conversation

@shunping
Copy link
Collaborator

@shunping shunping commented Jun 25, 2025

fixes #35512

PR #35243 has fixed the logical type for JdbcTimeType (TIME field in SQL) and JdbcDateType (DATE field in SQL). However, it fails when the TIMESTAMP field is used. Notice that when this happens, the schema proto of this field is of logical type MillisInstant.

fields {
  name: "event_timestamp"
  type {
    nullable: true
    logical_type {
      urn: "beam:logical_type:millis_instant:v1"
      representation {
        atomic_type: INT64
      }
    }
  }
  id: 1
  encoding_position: 1
}

This causes failures in one of the post commit xlang test #35285.

There is already a fix proposed #35400, and this PR is another attempt.


As stated in #35243 (comment), the root cause of this problem is that the same language type (e.g. TimeStamp) can be associated to multiple logical types (MillisInstant and MicrosInstant). It becomes a problem because of how these types are passed through pipeline construction and submission.

  • First, during pipeline expansion, when jdbcio is used, expansion service is called to retrieve the sub-graph which includes necessary PCollections, PTransforms, and Coders. For a Row, the schema proto is decoded from the payload of the coder and will be converted into a Beam Schema (of which the typehint starts with "BeamSchema_"). Here we get the logical type (MillisInstant) from urn (beam:logical_type:millis_instant:v1) in the schema proto and convert it into the language type (Timestamp). The typehint of Beam Schema is generated for the pcollection as an element type.
  • Before we submit the pipeline, we need to call to_runner_api to convert the whole pipeline into a proto. When it tries to serialize the RowCoder, it tries to revert what has done in the pipeline expansion by generating the schema proto again based on the typehint from the RowCoder. In our case, Timestamp is a component in the Row, but when the SDK trying to determine its logical type for the schema proto, it resolves it to MicrosInstant because it is defined after MillisInstant.

The proposed approach here has three parts:

  1. We get rid of the "unnecessary" conversion of "schema proto -> typehint -> schema proto" by using schema proto in the schema registry. (In addition, we also propagated field description correctly in schema fields).
  2. We revert the temporary workaround we brought back due to post commit test failure Fix post commit xlang jdbcio test failure #35428
  3. For explicit schema provided to ReadFromJdbc, the fix in 1. won't work because the schema is provided by the users (rather than coming from the Expansion service). There won't be a schema proto in the schema registry. In this case, we apply the workaround inside ReadFromJdbc only in the scope of converting typehint to schema proto, making sure the language type of Timestamp will be mapped to logical type MillisInstant during the conversion.

@shunping
Copy link
Collaborator Author

cc'ed @claudevdm

@shunping shunping changed the title Another fix for jdbc timestamp logical type. Another attempt to fix jdbc timestamp logical type. Jun 25, 2025
- Used schema registry directly if it exists.
- Fixed a bug on schema registry.
- Triggered post commit tests for jdbcio.
- Fixed failed tests and trigger dataflow test on xlang.
- Fix schema test by adding field description in named_fields_to_schema
@shunping shunping force-pushed the jdbc-timestamp-logical-type branch from adc399a to 996fed9 Compare June 29, 2025 04:34

def add(self, typing, schema):
if not schema.id:
if schema.id:
Copy link
Collaborator Author

@shunping shunping Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we change this line in #28169 to disable the look up of schema registry.

Let's see if this breaks anythong.

Copy link
Collaborator Author

@shunping shunping Jun 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Abacn, do you know which test workflow covers this test?

sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/providers/BigQueryStorageWriteApiSchemaTransformProviderTest.java

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local tests passed with ./gradlew --info sdk:java:io:google-cloud-platform:test --tests 'org.apache.beam.sdk.io.gcp.bigquery.providers.*'.

@github-actions github-actions bot added the io label Jun 29, 2025
@shunping shunping force-pushed the jdbc-timestamp-logical-type branch from d497915 to 19add94 Compare July 1, 2025 13:22
@shunping
Copy link
Collaborator Author

shunping commented Jul 1, 2025

Both PostCommit Python Xlang Gcp Direct and PostCommit Python Xlang Gcp Dataflow are failing on master for a day, so it may not be related.

@shunping shunping requested review from Abacn and claudevdm July 1, 2025 18:15
@shunping shunping self-assigned this Jul 1, 2025
@shunping shunping marked this pull request as ready for review July 1, 2025 18:15
@github-actions
Copy link
Contributor

github-actions bot commented Jul 1, 2025

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@codecov
Copy link

codecov bot commented Jul 2, 2025

Codecov Report

Attention: Patch coverage is 55.55556% with 8 lines in your changes missing coverage. Please review.

Project coverage is 56.54%. Comparing base (623e998) to head (0c30b08).
Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
sdks/python/apache_beam/io/jdbc.py 27.27% 8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #35426      +/-   ##
============================================
+ Coverage     56.51%   56.54%   +0.02%     
  Complexity     3319     3319              
============================================
  Files          1198     1199       +1     
  Lines        182839   183113     +274     
  Branches       3426     3426              
============================================
+ Hits         103332   103535     +203     
- Misses        76208    76279      +71     
  Partials       3299     3299              
Flag Coverage Δ
python 80.78% <55.55%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@shunping shunping marked this pull request as draft July 2, 2025 17:46
@shunping
Copy link
Collaborator Author

shunping commented Jul 2, 2025

I am seeing the following error in the test(test_xlang_jdbc_write_read_0_postgres (apache_beam.io.external.xlang_jdbcio_it_test.CrossLanguageJdbcIOTest):

apache_beam/coders/coder_impl.py:1549: in decode_from_stream
    value = self._value_coder.decode_from_stream(in_stream, nested)
apache_beam/coders/coder_impl.py:1641: in decode_from_stream
    return self._value_coder.decode(in_stream.read(value_length))
apache_beam/coders/coder_impl.py:239: in decode
    return self.decode_from_stream(create_InputStream(encoded), False)
apache_beam/coders/coder_impl.py:1954: in decode_from_stream
    item = component_coder.decode_from_stream(in_stream, True)
apache_beam/coders/coder_impl.py:2007: in decode_from_stream
    return self.logical_type.to_language_type(
apache_beam/typehints/schemas.py:1038: in to_language_type
    return DecimalLogicalType().to_language_type(value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <apache_beam.typehints.schemas.DecimalLogicalType object at 0x7ecfb6e93520>
value = Decimal('-1.23')

    def to_language_type(self, value):
      # type: (bytes) -> decimal.Decimal
>     return decimal.Decimal(value.decode())
E     AttributeError: 'decimal.Decimal' object has no attribute 'decode'

@shunping
Copy link
Collaborator Author

shunping commented Jul 3, 2025

The failed tests have been red for a few days and are not related to this change.

@shunping shunping marked this pull request as ready for review July 3, 2025 22:50
@github-actions
Copy link
Contributor

github-actions bot commented Jul 3, 2025

Assigning reviewers:

R: @liferoad for label python.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn
Copy link
Contributor

Abacn commented Jul 11, 2025

Consider run https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Xlang_Gcp_Direct.yml?query=event%3Aschedule which contains more relevant tests (e.g. bigquery IO write xlang test that also involves Decimal)

@shunping
Copy link
Collaborator Author

@liferoad liferoad added this to the 2.67.0 Release milestone Jul 15, 2025
Copy link
Contributor

@Abacn Abacn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, feel free to merge after tests passed

@shunping
Copy link
Collaborator Author

Thanks, feel free to merge after tests passed

Thanks. Tests are not healthy at the moment. Hopefully I can push before 2.67 branch cut.

@shunping shunping changed the title Another attempt to fix jdbc timestamp logical type. Another attempt to fix timestamp logical type. Jul 16, 2025
@shunping
Copy link
Collaborator Author

shunping commented Jul 17, 2025

PostCommit Python tests have been red for a while and Python Coverage failure is not related to the commit (Timeout).

Xlang Gcp Direct (https://github.com/apache/beam/actions/runs/16331090330/job/46133582842?pr=35426) is green.

Merging it one week before the branch cut so that we can see if there is any breakage before it is too late.

@shunping shunping merged commit 86a71ca into apache:master Jul 17, 2025
89 of 96 checks passed
@shunping shunping deleted the jdbc-timestamp-logical-type branch July 19, 2025 03:12
@shunping shunping changed the title Another attempt to fix timestamp logical type. Another attempt to fix timestamp and decimal logical type. Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: JDBCIO failed to handle TIMESTAMP field.

3 participants