Skip to content

download data as msgpack#59

Merged
chekunkov merged 2 commits intomasterfrom
msgpack
Sep 27, 2016
Merged

download data as msgpack#59
chekunkov merged 2 commits intomasterfrom
msgpack

Conversation

@bertinatto
Copy link
Contributor

Any thoughts?

I'm assuming current tests, like [1], still cover this patch.

Addresses #58.

[1] https://github.com/scrapinghub/python-hubstorage/blob/msgpack/tests/test_project.py#L228

url = urlpathjoin(self.url, _path)
logger.error('Failed %d times reading items from %s, params %s, '
'last error was: %s', self.MAX_RETRIES, url,
apiparams, lastexc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we don't stream msgpack?
The difference should be at the "accept" header and the decode function right? The iteration & retrying should be reused from json-lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed BytesIO would stream the data while reading it, but I noticed requests downloads everything [1] anyway. I re-factored out the common code. Thanks for pointing it out.

Is there any reason the retry logic isn't done using retrying?

[1] https://github.com/kennethreitz/requests/blob/master/requests/models.py#L737

@laurentsenta
Copy link
Contributor

How test_bulkdata checks the mpack implementation?
I can't find any hint of fixture that uses different accept headers.

@bertinatto bertinatto force-pushed the msgpack branch 2 times, most recently from bcf7f01 to fd72001 Compare March 21, 2016 00:28
@bertinatto
Copy link
Contributor Author

Here [1] resource.iter_values() is called and if the resource in question has msgpack support the job data will be dowloaded as msgpack [2]. If the unpacked data is correct this assert [3] should pass.

[1] https://github.com/scrapinghub/python-hubstorage/blob/master/tests/test_project.py#L241
[2] https://github.com/scrapinghub/python-hubstorage/blob/msgpack/hubstorage/resourcetype.py#L92-L99
[3] https://github.com/scrapinghub/python-hubstorage/blob/master/tests/test_project.py#L242

return r.iter_lines()

def apirequest(self, _path=None, **kwargs):
if self._supports_msgpack and kwargs.get('method') == 'GET':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's possible to get method='get' or method='gEt' here, it's converted to upper only in requests lib

@bertinatto
Copy link
Contributor Author

Last push has some improvements (suggestions) and fixes for test failures that were shadowed before #64.

@vshlapakov vshlapakov force-pushed the msgpack branch 2 times, most recently from cd81655 to f4ef0a8 Compare September 27, 2016 10:28
@vshlapakov
Copy link
Contributor

@chekunkov Updated, review please.

@chekunkov
Copy link
Contributor

LGTM

@chekunkov chekunkov merged commit 3e5b983 into master Sep 27, 2016
@chekunkov chekunkov deleted the msgpack branch September 27, 2016 12:51
@jdemaeyer
Copy link

Hey guys, there are two issues with this PR:

  • It breaks py3 compatibility (xrange is now range)
  • Items returned by iter_values() no longer contain the _key. This breaks shub log -f. I can possibly change it in shub but I wouldn't be sure that there's no other users using _key

@chekunkov
Copy link
Contributor

@jdemaeyer we will fix py3 compatibility asap

I'm afraid for now you have to pin python-hubstorage version in shub setup.py to <= 0.23.2, logs are using msgpack format by default and it seems we have a bug in Hubstorage - it returns _key field only for json and jsonlines formats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants