-
Notifications
You must be signed in to change notification settings - Fork 61
Description
Table.read_rows does not set any deadline, so it can hang forever if the Bigtable server connection hangs. We see this happening once every week or two when running inside GCP, which causes our server to get stuck indefinitely. There appears to be no way in the API to set a deadline, even though the documentation says that the retry parameter should do this. Due to a bug, it does not.
Details:
We are calling Table.read_rows to read ~2 rows from BigTable. Using pyflame on a stuck process, both worker threads were waiting on Bigtable, with the stack trace below. I believe the bug is the following:
- Call
Table.read_rows. This callsPartialRowsData, passing theretryargument which defaults toDEFAULT_RETRY_READ_ROWS. The default misleadingly setsdeadline=60.0. ; It also passesread_method=self._instance._client.table_data_client.transport.read_rowstoPartialRowsData, which is a method onBigtableGrpcTransport. PartialRowsData.__init__callsread_method(); this is actually raw gRPC_UnaryStreamMultiCallable, not the gapicBigtableClient.read_rows, which AFAICS, is never called. Hence, this gRPC streaming call is started with any deadline.PartialRowsData.__iter__callsself._read_next_response, which callsreturn self.retry(self._read_next, on_error=self._on_error)(). This gives the impression thatretryis used, but if I understand gRPC streams correctly, I'm not sure that even makes sense. I think even if the gRPC stream return some error, callingnextwon't actually retry the gRPC, it will just immediately raise the same exception. To retry, I believe you need to actually restart it by callingread_rowsagain.- If the Bigtable server now "hangs", the client hangs forever.
Possible fix:
Change Table.read_rows call the gapic BigtableClient.read_rows with the retry parameter., and change PartialRowsData.__init__ to take this response iterator, and not take a retry parameter at all. This would at least allow setting the gRPC streaming call deadline, although I don't think it will make retrying work (since I think the gRPC streaming client will just immediately returns an iterator without actually waiting for a response from the server?)
I haven't actually tried implementing this to see if it works. For now, we will probably just make a raw gRPC read_rows call so we can set an appropriate timeout.
Environment details
OS: Linux, ContainerOS (GKE), Container is Debian9 (using distroless)
Python: 3.5.3
API: google-cloud-bigtable 0.33.0
Steps to reproduce
This program loads the Bigtable emulator with 1000 rows, calls read_rows(retry=DEFAULT.with_deadline(5.0)), then sends SIGSTOP to pause the emulator. This SHOULD cause a DeadlineExceeded exception to be raised after 5 seconds. Instead, it hangs forever.
- Start the Bigtable emulator:
gcloud beta emulators bigtable start - Find the PID:
ps ax | grep cbtemulator - Run the following program with
BIGTABLE_EMULATOR_HOST=localhost:8086 python3 bug.py $PID
from google.api_core import exceptions
from google.cloud import bigtable
from google.rpc import code_pb2
from google.rpc import status_pb2
import os
import signal
import sys
COLUMN_FAMILY_ID = 'column_family_id'
def main():
emulator_pid = int(sys.argv[1])
client = bigtable.Client(project="testing", admin=True)
instance = client.instance("emulator")
# create/open a table
table = instance.table("emulator_table")
column_family = table.column_family(COLUMN_FAMILY_ID)
try:
table.create()
column_family.create()
except exceptions.AlreadyExists:
print('table exists')
# write a bunch of data
for i in range(1000):
k = 'some_key_{:04d}'.format(i)
print(k)
row = table.row(k)
row.set_cell(COLUMN_FAMILY_ID, 'column', 'some_value{:d}'.format(i) * 1000)
result = table.mutate_rows([row])
assert len(result) == 1 and result[0].code == code_pb2.OK
assert table.read_row(k) is not None
print('starting read')
rows = table.read_rows(retry=bigtable.table.DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
rows_iter = iter(rows)
r1 = next(rows_iter)
print('read', r1)
os.kill(emulator_pid, signal.SIGSTOP)
print('sent sigstop')
for r in rows_iter:
print(r)
print('done')
if __name__ == '__main__':
main()Stack trace of hung server (using slightly older version of the google-cloud-bigtable library
/usr/local/lib/python2.7/threading.py:wait:340
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:_next:348
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:next:366
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next:426
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_target:179
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_wrapped_func:270
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next_response:430
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:__iter__:441