
Conversation

@mszb
Contributor

@mszb mszb commented Sep 18, 2019

This is the read implementation of the Spanner I/O connector.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

Post-Commit Tests Status (on master branch)

[Post-commit status badge table: Go, Java, Python, and XLang builds across the SDK, Apex, Dataflow, Flink, Gearpump, Samza, and Spark columns; badge images not reproducible in text.]

Pre-Commit Tests Status (on master branch)

[Pre-commit status badge table: non-portable and portable Java, Python, Go, and Website builds; badge images not reproducible in text.]

See .test-infra/jenkins/README for the trigger phrases, status, and links of all Jenkins jobs.

@mszb
Contributor Author

mszb commented Sep 18, 2019

Run Python PreCommit

@mszb mszb force-pushed the BEAM-7246_gcp_spanner_io branch from a48c033 to 66d0e49 on September 20, 2019 10:00
@mszb
Contributor Author

mszb commented Sep 20, 2019

R: @aaltay @chamikaramj
Hi Ahmet and Chamikara.
We implemented the Spanner read transform (the write part is in progress). Could you please review it and share your feedback on the approach, or point out any missing bits and pieces?
Thanks!

@aaltay aaltay requested a review from chamikaramj September 20, 2019 16:08
@shehzaadn-vd
Contributor

@chamikaramj @aaltay please take a look. thanks.

Contributor

@chamikaramj chamikaramj left a comment

Thanks.

Encapsulates a spanner read operation.
"""

__slots__ = ()
Contributor

Is this needed?

Contributor Author

This prevents the creation of per-instance dictionaries and minimizes memory usage.

ref: https://docs.python.org/3/library/collections.html#collections.somenamedtuple._field_defaults

snapshot_exact_staleness=exact_staleness
)

def with_query(self, sql, params=None, param_types=None):
Contributor

We usually use keyword arguments instead of the builder pattern for Beam Python SDK connectors (see textio, bigqueryio, etc.).

Contributor Author

Sure, I'll look into this.

Contributor Author

I've updated the code and removed the builder pattern.

__all__ = ['ReadFromSpanner', 'ReadOperation',]


class ReadOperation(collections.namedtuple("ReadOperation",
Contributor

Can you document why this has to be a part of the public API?

Contributor Author

As in Java, we have a ReadOperation which executes multiple reads (via SQL or a table). In Python, we also have the read_all method, which executes read operations in the same manner.

Unfortunately, I forgot to add the test cases for read_all (I will add them now).

Example:

reads = [
    ReadOperation.with_query('SELECT * FROM users'),
    ReadOperation.with_table("roles", ['key', 'rolename'])
]

records = pipeline | ReadFromSpanner(...).read_all(reads)

return self.read_all(read_operation)

def read_all(self, read_operations):
if self._transaction is None:
Contributor

Please document why we need to fork here.

Contributor Author

Are you asking about the read_all or about the transaction?


@staticmethod
@experimental(extra_message="(ReadFromSpanner)")
def create_transaction(project_id, instance_id, database_id, credentials=None,
Contributor

Does this method have to be public? (Please prefix all methods/classes that should not be part of the public API with _.)

Contributor Author

Yes, it is a public method. Users can create a transaction with this method and pass it to the constructor of the Spanner read/write transforms (the same is available in Java).

I will add some docs to make it clearer.

Example:

transaction = ReadFromSpanner.create_transaction(
    project_id=TEST_PROJECT_ID, instance_id=TEST_INSTANCE_ID,
    database_id=TEST_DATABASE_NAME,
    exact_staleness=datetime.timedelta(seconds=10))

records = (pipeline
           | ReadFromSpanner(project_id=TEST_PROJECT_ID,
                             instance_id=TEST_INSTANCE_ID,
                             database_id=TEST_DATABASE_NAME)
           .with_transaction(transaction)
           .with_query('SELECT * FROM users'))


class _BatchRead(PTransform):
"""
This transform uses the Cloud Spanner BatchSnapshot to perform reads from
Contributor

`BatchSnapshot`

Also please describe what BatchSnapshot is.

Contributor Author

Sure, I will.

"google.cloud.spanner_v1.keyset.KeySet")
return cls(
read_operation="process_read_batch",
batch_action="generate_read_batches", transaction_action="read",
Contributor

Storing the name of the attribute to execute in string form is pretty brittle. Please update the code to directly invoke the method from the class instead.

Contributor Author

Okay, good idea! I'll update the code. Thanks!

Contributor

Was this addressed? It looks like we are still creating magic strings like "process_read_batch" and "generate_read_batches". If you need to enable certain properties, please use booleans instead.
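A rough sketch of what the reviewer is asking for (all names here are hypothetical, and the Spanner BatchSnapshot is replaced by a stub): keep a boolean on the config object and call the batch-generation methods directly, rather than storing method names as strings and resolving them with getattr.

```python
class _ReadOperationConfig:
    """Hypothetical config: a boolean flag replaces magic method-name strings."""

    def __init__(self, is_table_read, kwargs):
        self.is_table_read = is_table_read
        self.kwargs = kwargs

    def generate_partitions(self, snapshot):
        # Direct method calls instead of getattr(snapshot, "generate_..._batches").
        if self.is_table_read:
            return snapshot.generate_read_batches(**self.kwargs)
        return snapshot.generate_query_batches(**self.kwargs)


# Stub standing in for google.cloud.spanner_v1.database.BatchSnapshot:
class _FakeSnapshot:
    def generate_read_batches(self, **kwargs):
        return ["read-partition"]

    def generate_query_batches(self, **kwargs):
        return ["query-partition"]


table_config = _ReadOperationConfig(is_table_read=True, kwargs={})
query_config = _ReadOperationConfig(is_table_read=False, kwargs={})
snapshot = _FakeSnapshot()
assert table_config.generate_partitions(snapshot) == ["read-partition"]
assert query_config.generate_partitions(snapshot) == ["query-partition"]
```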

.snapshot_options)

reads = [
{"read_operation": ro.read_operation, "partitions": p}
Contributor

Ditto regarding not storing the method name in string form. Please fork here instead.

Also, we cannot do critical IO operations during job construction. For example: (1) the node that submits the job might not have access to Spanner; (2) some jobs (for example, Dataflow templates) will not invoke the constructor in repeated executions. So this logic for initial splitting has to be moved to a DoFn.

Contributor Author

Sure, thanks for pointing it out!

reads = [
{"read_operation": ro.read_operation, "partitions": p}
for ro in self._read_operations
for p in getattr(snapshot, ro.batch_action)(**ro.kwargs)
Contributor

Please make sure that the way you generate partitions here is compatible with the Java source.

@mock.patch('apache_beam.io.gcp.spannerio.BatchSnapshot')
def test_read_with_table_batch(self, mock_batch_snapshot_class,
mock_client_class):
mock_client = mock.MagicMock()
Contributor

Also please consider adding an integration test similar to the following for BigQuery:

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py

Contributor Author

Sure! I'll add some more tests as per the reference.

Contributor Author

I just saw the ticket (https://issues.apache.org/jira/browse/BEAM-7246); it says ITs are not included in this ticket.
Please suggest!

@mszb mszb force-pushed the BEAM-7246_gcp_spanner_io branch 2 times, most recently from 544b3e4 to 470bddf on October 21, 2019 07:43
@mszb
Contributor Author

mszb commented Oct 22, 2019

Run Python PreCommit

1 similar comment
@mszb
Contributor Author

mszb commented Oct 23, 2019

Run Python PreCommit

@mszb mszb force-pushed the BEAM-7246_gcp_spanner_io branch from 8daaefc to ed90437 on October 23, 2019 10:35
@mszb
Contributor Author

mszb commented Oct 25, 2019

Run Python PreCommit

1 similar comment
@mszb
Contributor Author

mszb commented Oct 27, 2019

Run Python PreCommit

@mszb
Contributor Author

mszb commented Oct 28, 2019

Hi @chamikaramj. I've made the changes you requested: removed the builder pattern and changed it to keyword arguments. Also, all the IO operations are now in the pipeline instead of in the constructor.

Thanks!

@mszb mszb requested a review from chamikaramj October 28, 2019 07:31
@chamikaramj
Contributor

R: @udim will you be able to do a review round on this?

@chamikaramj
Contributor

Thanks @mszb for the updates.

@chamikaramj chamikaramj requested a review from udim October 28, 2019 15:09
@chamikaramj
Contributor

@mszb also please remove the "DO NOT MERGE" tag and add a JIRA assuming this is ready for review.

@mszb
Contributor Author

mszb commented Oct 28, 2019

@mszb also please remove the "DO NOT MERGE" tag and add a JIRA assuming this is ready for review.

sure!

@mszb mszb changed the title [DO NOT MERGE] Add Google Spanner IO Read on Python SDK [BEAM-7246] Add Google Spanner IO Read on Python SDK Oct 28, 2019
Member

@udim udim left a comment

I've made an initial pass. I've tried to understand the Spanner Python API, but it's not clear to me how you're supposed to run the partitions generated by snapshot.generate_read_batches on remote machines.
(https://cloud.google.com/spanner/docs/reads#read_data_in_parallel).

Member

Could this be moved to GCP_REQUIREMENTS?

Member

Also, what is this dependency used for?

Contributor Author

This dependency is required by the Spanner client (ref https://googleapis.dev/python/spanner/latest/_modules/google/cloud/spanner_v1/client.html#Client).
It is a good idea to move it to GCP_REQUIREMENTS.

Member

s/And/An/

Member

Please decorate all DoFns with input and output type hints.
This makes the code easier to read and allows Beam to do type checks.

For example:

Suggested change:

-class _NaiveSpannerReadDoFn(DoFn):
+from typing import Any, Dict
+from apache_beam import typehints
+
+SerializedBatchSnapshot = Dict[Any, Any]
+
+@typehints.with_input_types(SerializedBatchSnapshot)
+@typehints.with_output_types(<row type>)
+class _NaiveSpannerReadDoFn(DoFn):

Contributor Author

Sure, I'll work on this.

Contributor

Was this addressed in all relevant locations? Please reply to addressed comments with "Done" or resolve addressed comments.

Contributor Author

Done

Member

Please add element['read_operation'] to this error message.

Contributor Author

Sure, I'll update this.

Contributor

Looks like this wasn't addressed?

Contributor Author

Done

Member

Should this be a second query example? Perhaps make this an example that uses params?

Contributor Author

Sure, I'll add the param example here.
What I intend to convey here is that the user can run ReadOperations with both sql and table at the same time.

Member

Suggested change:

-ReadOperation.sql('Select name, email from customers'),
+ReadOperation.query('Select name, email from customers'),

Contributor Author

Thanks for the change. It looks better; I'll update the example and test cases as well.

Member

This needs an integration test before we can be sure it works as expected.
For example, I'm not sure if closing the BatchSnapshot here closes the read transaction.

Contributor Author

The JIRA mentions that the IT and performance testing will be on a separate ticket:
https://issues.apache.org/jira/browse/BEAM-7246

Contributor

@chamikaramj chamikaramj Dec 11, 2019

Please change the classes here to private (for example, _ReadFromSpanner) till we have proper integration tests in place, to prevent users from running an insufficiently tested solution in production. Also create a separate ticket for integration tests and refer to it here.

Having some integration tests in place is a must before making this publicly available, IMHO. Performance tests are also helpful but optional.

Contributor Author

Hi @chamikaramj. Thanks for the feedback. Yes, IT tests should be there to make sure it's working properly. It's a good idea to make this transform private while some tests are still remaining.

I have made the changes you mentioned, and also created the ticket for the IT tests:
BEAM-8949

@mszb mszb force-pushed the BEAM-7246_gcp_spanner_io branch from a5e299f to 243e04e on November 14, 2019 06:44
@mszb
Contributor Author

mszb commented Nov 18, 2019

retest this please

@mszb mszb force-pushed the BEAM-7246_gcp_spanner_io branch from 0d8c4ba to d1326d1 on November 20, 2019 08:43
@mszb
Contributor Author

mszb commented Nov 20, 2019

Run Portable_Python PreCommit

@mszb
Contributor Author

mszb commented Nov 20, 2019

Run Python PreCommit

@mszb
Contributor Author

mszb commented Nov 25, 2019

Hi @udim @chamikaramj. I've made the changes you mentioned. Please review them.

@chamikaramj
Contributor

Sorry about the delay here. Still reviewing.

Contributor

@chamikaramj chamikaramj left a comment

Thanks. Added some more comments.

BTW, to make the review process easier, either resolve already addressed comments or add a comment "Done". Also, no need to reply to comments that you hope to address in the future :).

"google.cloud.spanner_v1.keyset.KeySet")
return cls(
read_operation="process_read_batch",
batch_action="generate_read_batches", transaction_action="read",
Contributor

Was this addressed? It looks like we are still creating magic strings like "process_read_batch" and "generate_read_batches". If you need to enable certain properties, please use booleans instead.

Contributor

Ditto. Can we use booleans to configure parameters instead of introducing magic strings?

Contributor Author

Done.

Contributor

Was this addressed in all relevant locations? Please reply to addressed comments with "Done" or resolve addressed comments.

Contributor

Why do we need this in the DoFn implementation?

Contributor Author

Done.
There is no need for this code; I used it for local testing. Thanks for pointing this out!

Contributor

Do we need to close/shutdown the spanner_client as well?

Contributor Author

There are no close/shutdown methods on the Spanner client object.

Contributor

I'm not sure why we are forking here based on whether a transaction is provided or not. It seems like, in the Java version, we use transactions for both native and batch versions of the read transform: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L420

Contributor Author

If the user does not provide a transaction, we use BatchSnapshot to generate query batches. The reason we cannot use the transaction here is that google.cloud.spanner_v1.database.BatchSnapshot.generate_read_batches requires a snapshot with multi_use=True and uses a private method to create the snapshot instance (google.cloud.spanner_v1.database.BatchSnapshot._get_snapshot).

On the other hand, create_transaction simply returns the transaction_id and session_id, which we reconstruct in the naive read using google.cloud.spanner_v1.database.BatchSnapshot.from_dict, with no way to set multi_use=True. Ref: https://github.com/googleapis/google-cloud-python/blob/master/spanner/google/cloud/spanner_v1/database.py#L651

Contributor

So not using a transaction may offer better performance? We should clarify this in the documentation.

Contributor

I didn't review the unit tests in detail yet, but please make sure that we at least have the same set of unit tests as Java.

Contributor Author

Yes, I took the references from org.apache.beam.sdk.io.gcp.spanner.SpannerIOReadTest and am implementing them in a Pythonic way.

@chamikaramj
Contributor

cc: @nielm @nithinsujir, who are more familiar with Cloud Spanner, in case they have additional comments.


What are "naive" reads? Do they mean single reads (point reads)?

Contributor Author

@mszb mszb Dec 24, 2019

Yes, you are right. In the naive read we do not use the Spanner partitioning query in the transform.


Don't we need an error here if both table and sql are empty?


+1 Also, is there a better name than Naive? I'm not able to deduce what it's trying to convey.

@mszb
Contributor Author

mszb commented Dec 26, 2019

Run Python PreCommit

@mszb mszb force-pushed the BEAM-7246_gcp_spanner_io branch from 88bc4fe to 3502025 on December 26, 2019 12:03
@aaltay
Member

aaltay commented Jan 3, 2020

@chamikaramj is this ready to be merged? Are all the open comments resolved?

@mszb mszb requested a review from chamikaramj January 3, 2020 10:54
@chamikaramj
Contributor

Still reviewing the latest round of updates. Thanks.

def process(self, element, transaction_info):
# We used batch snapshot to reuse the same transaction passed through the
# side input
self._snapshot = BatchSnapshot.from_dict(self._database, transaction_info)
Contributor

How can we make sure that what's passed in transaction_info is consistent with what BatchSnapshot.from_dict() expects (for example, index)? Can we introduce some sort of validation before this call?

@mock.patch('apache_beam.io.gcp.experimental.spannerio.BatchSnapshot')
class SpannerReadTest(unittest.TestCase):

def test_read_with_query_batch(self, mock_batch_snapshot_class,
Contributor

How about runReadUsingIndex?

@chamikaramj
Contributor

Thanks. Mostly looks good.

Added a few more comments.

@chamikaramj
Contributor

Any updates?

@mszb mszb requested a review from chamikaramj January 17, 2020 11:12
@mszb
Contributor Author

mszb commented Jan 17, 2020

retest this please

1 similar comment
@iemejia
Member

iemejia commented Jan 17, 2020

retest this please

Shoaib and others added 6 commits January 17, 2020 20:46
added spanne read io onto python sdk

refactor code

fix docstrings

fix linting issue

fix linting issue

added display data method and its test cases.

fix lint

Update sdks/python/apache_beam/io/gcp/spannerio.py

Co-Authored-By: Udi Meiri <udim@users.noreply.github.com>

Update sdks/python/apache_beam/io/gcp/spannerio.py

Co-Authored-By: Udi Meiri <udim@users.noreply.github.com>

Update sdks/python/apache_beam/io/gcp/spannerio.py

Co-Authored-By: Udi Meiri <udim@users.noreply.github.com>

Update sdks/python/apache_beam/io/gcp/spannerio.py

Co-Authored-By: Udi Meiri <udim@users.noreply.github.com>

Update sdks/python/apache_beam/io/gcp/spannerio.py

Co-Authored-By: Udi Meiri <udim@users.noreply.github.com>

adds typehints and refactor code

fix pylint

fix import issues

fix type hints

changed the classes to private to prevent users from using the functionality while integration tests are in development.

moved spannerio files to gcp experimental folder

refactor code

add docs
@mszb mszb force-pushed the BEAM-7246_gcp_spanner_io branch from f5e1098 to f556f22 on January 17, 2020 16:31
@chamikaramj
Contributor

Retest this please

1 similar comment
@chamikaramj
Contributor

Retest this please

@chamikaramj
Contributor

LGTM. Thanks.

We can get this in when tests pass.

@shehzaadn-vd
Contributor

Thanks @chamikaramj for your support! @aaltay looks like the tests are passing. Would you be able to merge this?

@chamikaramj chamikaramj merged commit 5b6a0ea into apache:master Jan 17, 2020
@chamikaramj
Contributor

Thank you.
Let's get integration tests in so that we can move this out of experimental :)
