Closed
Changes from all commits (25 commits)
eb643e4
ARROW-2066 Add documentation for Arrow/Azure/Parquet solution
rjrussell77 Jan 31, 2018
6841116
Polish the formatting
rjrussell77 Jan 31, 2018
5365a9c
Add helpful notes about Azure properties
rjrussell77 Jan 31, 2018
5fbea89
Add a note about keys and add polish
rjrussell77 Jan 31, 2018
26a53e4
Fix formatting
rjrussell77 Jan 31, 2018
7bab640
Refine indented bullet and fix title underline
rjrussell77 Jan 31, 2018
718bd94
Fix unintended italics
rjrussell77 Jan 31, 2018
f130e04
Change wording a bit
rjrussell77 Jan 31, 2018
83a38c4
Try to fix italics
rjrussell77 Feb 1, 2018
6fd9f70
remove inline edits
rjrussell77 Feb 1, 2018
34c5a16
Fix formatting
rjrussell77 Feb 1, 2018
599e04f
Fix formatting
rjrussell77 Feb 1, 2018
1815816
fix formatting
rjrussell77 Feb 1, 2018
a015deb
Fix formatting
rjrussell77 Feb 1, 2018
051b91d
Fix formatting
rjrussell77 Feb 1, 2018
803cbca
Use asterisks for list
rjrussell77 Feb 1, 2018
4c75824
Try moving the bullet to remove italics
rjrussell77 Feb 1, 2018
4770de1
fix
rjrussell77 Feb 1, 2018
5d450fc
fix
rjrussell77 Feb 1, 2018
654a6f9
Add back original Notes bullets
rjrussell77 Feb 1, 2018
36f7378
Replace usage of tempfile buffer with BytesIO stream
rjrussell77 Feb 22, 2018
1fe9866
Add try/except/finally blocks to ensure closure of the byte stream
rjrussell77 Feb 22, 2018
f056888
Clean up white space
rjrussell77 Feb 22, 2018
a5addb0
use more common 'df' instead of 'pd' for pandas dataframe variable, r…
rjrussell77 Feb 22, 2018
0d3972c
Add missing byte_stream declaration/assignment
rjrussell77 Feb 22, 2018
41 changes: 41 additions & 0 deletions python/doc/source/parquet.rst
@@ -237,3 +237,44 @@ throughput:

pq.read_table(where, nthreads=4)
pq.ParquetDataset(where).read(nthreads=4)

Reading a Parquet File from Azure Blob storage
----------------------------------------------

The code below shows how to use Azure's storage SDK together with pyarrow to
read a Parquet file into a pandas DataFrame.
It is suitable for running inside a Jupyter notebook on a Python 3 kernel.

Dependencies:

* python 3.6.2
* azure-storage 0.36.0
* pyarrow 0.8.0

.. code-block:: python

   from io import BytesIO

   from azure.storage.blob import BlockBlobService
   import pyarrow.parquet as pq

   account_name = '...'
   account_key = '...'
   container_name = '...'
   parquet_file = 'mysample.parquet'

   # In-memory buffer that will receive the blob's bytes
   byte_stream = BytesIO()
   block_blob_service = BlockBlobService(account_name=account_name,
                                         account_key=account_key)
   try:
       block_blob_service.get_blob_to_stream(container_name=container_name,
                                             blob_name=parquet_file,
                                             stream=byte_stream)
       df = pq.read_table(source=byte_stream).to_pandas()
       # Do work on df ...
   finally:
       # Ensure the stream is closed even if reading fails
       byte_stream.close()
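
If only a subset of the columns is needed, ``pq.read_table`` also accepts a
``columns`` argument; the call below could replace the ``read_table`` call in
the ``try`` block above. The column names here are placeholders for whatever
the file actually contains.

.. code-block:: python

   # Hypothetical column names; replace with columns present in your file
   df = pq.read_table(source=byte_stream,
                      columns=['col_a', 'col_b']).to_pandas()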

rjrussell77 (Contributor, Author) commented:

@xhochy Ok, I've responded to your last set of feedback. How are we looking now?

Notes:

* The ``account_key`` can be found under ``Settings -> Access keys`` in the Microsoft Azure portal for the storage account that holds the container (rather than hard-coding it, the key can also be read from the environment; see the sketch below)
* The code above works for a container with private access, Lease State = Available, and Lease Status = Unlocked
* The Parquet file was stored with Blob Type = Block blob
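
Rather than hard-coding the key in the notebook, the credentials can be read
from the environment at run time. A minimal sketch, assuming the (hypothetical)
environment variables ``AZURE_ACCOUNT_NAME`` and ``AZURE_ACCOUNT_KEY`` have
been set beforehand:

.. code-block:: python

   import os

   # Hypothetical variable names; set these in the shell or notebook environment
   account_name = os.environ['AZURE_ACCOUNT_NAME']
   account_key = os.environ['AZURE_ACCOUNT_KEY']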