From c02955bfbf51e3415e80b74f4e0b120a49e0bead Mon Sep 17 00:00:00 2001 From: Crystal Yan Date: Fri, 7 Jul 2017 17:53:28 -0700 Subject: [PATCH 01/21] Plasma documentation- initial writeup of installation for linux. Installation for mac incomplete --- python/doc/source/index.rst | 1 + python/doc/source/plasma.rst | 338 +++++++++++++++++++++++++++++++++++ 2 files changed, 339 insertions(+) create mode 100644 python/doc/source/plasma.rst diff --git a/python/doc/source/index.rst b/python/doc/source/index.rst index a12853c4482..c2ae769b23e 100644 --- a/python/doc/source/index.rst +++ b/python/doc/source/index.rst @@ -40,6 +40,7 @@ structures. data ipc filesystems + plasma pandas parquet api diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst new file mode 100644 index 00000000000..00238ceeb23 --- /dev/null +++ b/python/doc/source/plasma.rst @@ -0,0 +1,338 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. currentmodule:: pyarrow +.. _io: + +The Plasma In-Memory Object Store +================================= + +Installation on Ubuntu +---------------------- +The following install instructions have been tested for Ubuntu 16.04. + + +First, install Anaconda in your terminal as follows. 
This will download +the Anaconda Linux installer and run it. Be sure to invoke the installer +with the ``bash`` command, whether or not you are using the Bash shell. + +.. code-block:: bash + + wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh + bash Anaconda3-4.4.0-Linux-x86_64.sh + +.. note:: + + As an alternative to the wget command above, you can also download the + Anaconda installer script through your web browser at their + `Download Webpage here `_. + + +Accept the Anaconda license agreement and follow the prompt. Allow the +installer to prepend the Anaconda location to your PATH. + +Then, either close and reopen your terminal window, or run the following +command, so that the new PATH takes effect: + +.. code-block:: bash + + source ~/.bashrc + +Anaconda should now be installed. For more information on installing +Anaconda, see their `documentation here `_. + + +Next, update your system and install the following dependency packages +as below: + +.. code-block:: bash + + sudo apt-get update + sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev + sudo apt-get install -y unzip libjemalloc-dev pkg-config + sudo ldconfig + + +Now, we need to install arrow. First download the arrow package from +github: + +.. code-block:: bash + + cd ~ + git clone https://github.com/apache/arrow + +Next, create a build directory as follows: + +.. code-block:: bash + + cd arrow/cpp + git checkout plasma-cython + mkdir build + cd build + +You should now be in the ~/arrow/cpp/build directory. Run cmake and +make to build Arrow. + +.. code-block:: bash + + cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + make + sudo make install + +.. note:: + + Running the ``cmake`` command above may give an ``ImportError`` + concerning numpy. If that is the case, see `ImportError when Running Cmake`_. + + +After installing arrow, you need to install pyarrow as follows: + +.. 
code-block:: bash + + cd ~/arrow/python + python setup.py install + +Once you've installed pyarrow, you should verify that you are able to +import it when running python in the terminal: + +.. code-block:: shell + + ubuntu:~/arrow/cpp/src/plasma$ python + Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) + [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux + Type "help", "copyright", "credits" or "license" for more information. + >>> import pyarrow + >>> + +If you encounter an ImportError when running the above, see `ImportError After Installing Pyarrow`_. + +Finally, you can install Plasma. + +.. code-block:: bash + + cd ~/arrow/cpp/src/plasma + python setup.py install + + +Installation on Mac OS X (TODO) +------------------------------- +The following install instructions have been tested for Mac OS X 10.9 +Mavericks. + + +First, install Anaconda as follows. Download the Graphical MacOS +Installer for your version of Python at the `Anaconda Download Webpage here `_. + +Double-click on the ``.pkg`` file, accept the license agreement, and +follow the step-by-step wizard to install Anaconda. Anaconda will be +installed for the current user's use only, and will require about 1.44 +GB of space. + +To verify that Anaconda has been installed, click on the Launchpad and +select Anaconda Navigator. It should open if you have successfully +installed Anaconda. For more information on installing Anaconda, see +their `documentation here `_. + +The next step is to install the following dependency packages as below: + +.. code-block:: bash + + brew update + brew install cmake autoconf libtool pkg-config jemalloc + +Plasma also requires the build-essential, curl, unzip, libboost-all-dev, +and libjemalloc-dev packages. MacOS should already come with curl, unzip, +and the compilation tools found in build-essential. Ldconfig is not supported +on Mac. + +Now, install arrow as follows. 
Open your terminal window and download the +arrow package from github with the following commands: + +.. code-block:: bash + + cd ~ + git clone https://github.com/apache/arrow + +Create a directory for the arrow build: + +.. code-block:: bash + + cd arrow/cpp + git checkout plasma-cython + mkdir build + cd build + +You should now be in the ~/arrow/cpp/build directory. Run cmake and +make to build Arrow. + +.. code-block:: bash + + cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + make + sudo make install + +TODO: + +* Install Pyarrow +* Verify Pyarrow +* Install Plasma + + + +Troubleshooting Installation Issues +----------------------------------- + +ImportError when Running Cmake +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +While installing arrow, if you run into the following error when running +the ``cmake`` command, there may be an issue with finding numpy. + +.. code-block:: shell + + NumPy import failure: + + Traceback (most recent call last): + + File "", line 1, in + + ImportError: No module named numpy + +First, verify that numpy has been installed alongside anaconda. Running +``conda list`` outputs all the packages that have been installed with +anaconda: + +.. code-block:: shell + + ubuntu:~/arrow/cpp/build$ conda list + numpy 1.12.1 py36_0 + +If something similar to the above numpy line is not listed in the +output, numpy has not yet been installed. + +If numpy has not been installed, try running the following command: + +.. code-block:: bash + + conda install numpy + +If numpy is still not installed, try reinstalling anaconda. + +Second, verify that you are running the python version that comes with +anaconda. ``which`` should point to the python in the newly-installed +Anaconda package: + +.. code-block:: shell + + ubuntu:~/arrow/cpp/build$ which python + /home/ubuntu/anaconda3/bin/python + +If this issue comes up, most likely the anaconda library has not yet +been properly prepended to your PATH and the new PATH reloaded. 
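As a quick cross-check of the PATH issue described above, the following is a minimal sketch that reports which interpreter a bare ``python`` resolves to, using only the standard library. The helper names are hypothetical, and the ``anaconda`` substring test is only an assumption about where Anaconda was installed (for example ``/home/ubuntu/anaconda3``); adjust it for your setup.

```python
import shutil
import sys


def first_python_on_path():
    """Return the path of the first `python` executable on PATH, or None."""
    return shutil.which("python")


def looks_like_anaconda(path):
    """Heuristic only: assumes the Anaconda install directory contains 'anaconda'."""
    return path is not None and "anaconda" in path.lower()


if __name__ == "__main__":
    resolved = first_python_on_path()
    print("python on PATH:     ", resolved)
    print("running interpreter:", sys.executable)
    print("looks like Anaconda:", looks_like_anaconda(resolved))
```

If the last line prints ``False``, the Anaconda ``bin`` directory is probably not first on your PATH yet.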
+ +If your machine already has other python versions installed, the Anaconda +python path should precede any other python version path. You can find +the paths to all python versions installed on your machine by running +``whereis python`` in the terminal: + +.. code-block:: shell + + ubuntu:~/arrow/cpp/build$ whereis python + python: /usr/bin/python3.5m /usr/bin/python2.7 /usr/bin/python /usr/bin/python2.7-config /usr/bin/python3.5 /usr/lib/python2.7 /usr/lib/python3.5 /etc/python2.7 /etc/python /etc/python3.5 /usr/local/lib/python2.7 /usr/local/lib/python3.5 /usr/include/python2.7 /usr/share/python /home/ubuntu/anaconda3/bin/python3.6m-config /home/ubuntu/anaconda3/bin/python3.6m /home/ubuntu/anaconda3/bin/python3.6 /home/ubuntu/anaconda3/bin/python3.6-config /home/ubuntu/anaconda3/bin/python /usr/share/man/man1/python.1.gz + +Anaconda usually modifies your ``~/.bashrc`` file in its installation. +You may need to manually add the following line or similar to the bottom +of your ``~/.bashrc`` file, then reload your terminal window: + +.. code-block:: bash + + # added by Anaconda3 4.4.0 installer + export PATH="/home/ubuntu/anaconda3/bin:$PATH" + +You can also create a persistent ``python`` shell alias to point to your +Anaconda python version by adding the following to the bottom of your +``~/.bashrc`` file: + +.. code-block:: bash + + alias python=/home/ubuntu/anaconda3/bin/python + +At this point, if you no longer have any issues with your anaconda +installation or with your python version, you should be able to run Python +in the terminal and import numpy with no errors: + +.. code-block:: shell + + ubuntu:~/arrow/cpp/build$ python + Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) + [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux + Type "help", "copyright", "credits" or "license" for more information. 
+ >>> import numpy + >>> + +Finally, if you are confident that numpy has been installed and that you are +using Anaconda's version of python, cmake may be looking for python and +finding the wrong version (not Anaconda's version of python). Run the following +command instead (setting the ``FILEPATH`` to the path of your Anaconda python +version) to force ``cmake`` to use the correct python version: + +.. code-block:: bash + + cmake -DPYTHON_EXECUTABLE:FILEPATH=/home/ubuntu/anaconda3/bin/python -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + +You may now proceed with the rest of the arrow installation. + + +ImportError After Installing Pyarrow +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +You may encounter the following error output when trying to ``import pyarrow`` +inside Python: + +.. code-block:: shell + + >>> import pyarrow + Traceback (most recent call last): + File "<stdin>", line 1, in <module> + File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in <module> + from pyarrow.lib import cpu_count, set_cpu_count + ImportError: libarrow.so.0: cannot open shared object file: No such file or directory + +If this is the case, after you have built Arrow, try running the following line +again in the terminal to remove this ImportError: + +.. code-block:: bash + + sudo ldconfig + +You may also encounter the following error output when trying to ``import pyarrow`` +inside Python: + +.. 
code-block:: shell + + >>> import pyarrow + Traceback (most recent call last): + File "", line 1, in + File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in + from pyarrow.lib import cpu_count, set_cpu_count + ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) + +If this is the case, run the following command to remove this ImportError: + +.. code-block:: bash + + conda install -y libgcc From 5cf63e92dbc778ffd56cf273cd04c8b4e0c4af96 Mon Sep 17 00:00:00 2001 From: Crystal Yan Date: Sat, 8 Jul 2017 00:07:11 -0700 Subject: [PATCH 02/21] Plasma documentation- Copied and edited Plasma API section, added a contents header at top, minor tweaks to Linux Installation section. Still need to do Installation on Mac OS and storing Arrow/Panda in Plasma --- python/doc/source/plasma.rst | 155 ++++++++++++++++++++++++++++++++++- 1 file changed, 152 insertions(+), 3 deletions(-) diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 00238ceeb23..6ca7ae1acb7 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -21,6 +21,9 @@ The Plasma In-Memory Object Store ================================= +.. contents:: Contents + :depth: 3 + Installation on Ubuntu ---------------------- The following install instructions have been tested for Ubuntu 16.04. @@ -67,8 +70,8 @@ as below: sudo ldconfig -Now, we need to install arrow. First download the arrow package from -github: +Now, we need to install arrow. These instructions will install everything +to your home directory. First download the arrow package from github: .. code-block:: bash @@ -93,7 +96,7 @@ make to build Arrow. make sudo make install -.. note:: +.. 
warning:: Running the ``cmake`` command above may give an ``ImportError`` concerning numpy. If that is the case, see `ImportError when Running Cmake`_. @@ -127,6 +130,24 @@ Finally, you can install Plasma. cd ~/arrow/cpp/src/plasma python setup.py install +Similar to pyarrow, you can verify that Plasma has been installed by +trying to import it when running python. Make sure to try this from +outside of the ~/arrow/cpp/src/plasma directory, otherwise you may +encounter the following error: + +.. code-block:: shell + + ubuntu:~/arrow/cpp/src/plasma$ python + Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) + [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux + Type "help", "copyright", "credits" or "license" for more information. + >>> import plasma + Traceback (most recent call last): + File "<stdin>", line 1, in <module> + File "/home/ubuntu/arrow/cpp/src/plasma/plasma/__init__.py", line 18, in <module> + from .plasma import * + ModuleNotFoundError: No module named 'plasma.plasma' + Installation on Mac OS X (TODO) ------------------------------- @@ -336,3 +357,131 @@ If this is the case, run the following command to remove this ImportError: .. code-block:: bash conda install -y libgcc + + +The Plasma API +-------------- + +Creating a Plasma client +^^^^^^^^^^^^^^^^^^^^^^^^ + +First locate your plasma directory. This can be printed out by +importing plasma in python and running the command ``print(plasma.__path__)``. +If running python from the terminal, be sure to run this command outside of the ~/arrow/cpp/src/plasma directory, or you may encounter an error. + +For example, to find your plasma directory, you can run the following one-liner +from the terminal as follows: + +.. 
code-block:: shell + + ubuntu:~$ python -c "import plasma; print(plasma.__path__)" + ['/home/ubuntu/anaconda3/lib/python3.6/site-packages/plasma-0.0.1-py3.6-linux-x86_64.egg/plasma'] + +From inside the plasma directory, you can start the plasma store in the +foreground by issuing a terminal command similar to the following: + +.. code-block:: bash + + ./plasma_store -m 1000000000 -s /tmp/plasma + +This command must be issued inside the plasma directory to work. The -m flag +specifies the size of the store in bytes, and the -s flag specifies the socket +that the store will listen on. Thus, the above command sets the Plasma store +to use up to 1 GB of memory, and sets the socket to ``/tmp/plasma``. + +Leave the current terminal window open for as long as the Plasma store should keep +running. Messages, such as notices about disconnecting clients, may occasionally be printed. +To stop running the plasma store, you can press ``CTRL-C`` in the terminal. + +Finally, from within python, the same socket given to ``./plasma_store`` +should then be passed into the Plasma client as shown below: + +.. code-block:: python + + import plasma + client = plasma.PlasmaClient() + client.connect("/tmp/plasma", "", 0) + +If the following error occurs from running the above Python code, that +means that either the socket given is incorrect, or the ``./plasma_store`` is +not currently running. Make sure that you are still running the ``./plasma_store`` +process in your plasma directory. + +.. code-block:: shell + + >>> client.connect("/tmp/plasma", "", 0) + Connection to socket failed for pathname /tmp/plasma + Could not connect to socket /tmp/plasma + + +Object IDs +^^^^^^^^^^ + +Each object in the Plasma store should be associated with a unique ID. The +object ID then serves as a key that any client can use to fetch that object from +the Plasma store. You can form an ``ObjectID`` object from a byte string of +length 20. + +.. 
code-block:: shell + + # Create ObjectID of 20 bytes, each being the byte (b) encoding of the letter "a" + >>> id = plasma.ObjectID(20 * b"a") + + # "a" is encoded as 61 + >>> id + ObjectID(6161616161616161616161616161616161616161) + +Creating an Object +^^^^^^^^^^^^^^^^^^ + +Objects are created in Plasma in two stages. First, they are *created*, which +allocates a buffer for the object. At this point, the client can write to the +buffer and construct the object within the allocated buffer. + +.. code-block:: python + + # Create an object. + object_id = plasma.ObjectID(20 * b"a") # Note that this is an ObjectID object, not a string + object_size = 1000 + buffer = memoryview(client.create(object_id, object_size)) + + # Write to the buffer. + for i in range(1000): + buffer[i] = i % 128 + +When the client is done, the client *seals* the buffer, making the object +immutable, and making it available to other Plasma clients. + +.. code-block:: python + + # Seal the object. This makes the object immutable and available to other clients. + client.seal(object_id) + + +Getting an Object +^^^^^^^^^^^^^^^^^ + +After an object has been sealed, any client who knows the object ID can get +the object. + +.. code-block:: python + + # Get the object from a different client. This blocks until the object has been sealed. + object_id = plasma.ObjectID(20 * b"a") + [buffer] = client.get([object_id]) # Note that you pass in as an ObjectID object, not a string + + +If the object has not been sealed yet, then the call to client.get will block +until the object has been sealed by the client constructing the object. 
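The create, seal, and blocking-get lifecycle described above can be illustrated without a running store. The following is a toy, in-process model written only for illustration: ``ToyObjectStore`` is a hypothetical class, not part of Plasma, and it uses a ``threading.Condition`` in place of the real shared-memory store. Its ``timeout`` argument is an assumption loosely modelled on the ``timeout_ms`` argument that the test file in this patch series passes to ``client.get``.

```python
import threading


class ToyObjectStore:
    """Toy stand-in for an object store; NOT the Plasma implementation."""

    def __init__(self):
        self._unsealed = {}  # object id -> mutable bytearray being written
        self._sealed = {}    # object id -> immutable bytes
        self._cond = threading.Condition()

    def create(self, object_id, size):
        """Allocate a writable buffer for a new object."""
        with self._cond:
            if object_id in self._unsealed or object_id in self._sealed:
                raise ValueError("object id already exists")
            buf = bytearray(size)
            self._unsealed[object_id] = buf
            return buf

    def seal(self, object_id):
        """Freeze the object and wake up any waiting getters."""
        with self._cond:
            self._sealed[object_id] = bytes(self._unsealed.pop(object_id))
            self._cond.notify_all()

    def get(self, object_id, timeout=None):
        """Block until the object is sealed; return None on timeout."""
        with self._cond:
            ready = self._cond.wait_for(
                lambda: object_id in self._sealed, timeout)
            return self._sealed[object_id] if ready else None


store = ToyObjectStore()
object_id = 20 * b"a"

# Stage 1: create the object and fill its buffer.
buf = store.create(object_id, 1000)
for i in range(1000):
    buf[i] = i % 128

# Not sealed yet, so a get with a short timeout gives up.
unsealed_result = store.get(object_id, timeout=0.05)

# Stage 2: seal from another thread; the blocking get then returns.
threading.Timer(0.1, store.seal, args=(object_id,)).start()
sealed_result = store.get(object_id)
```

The sealed result is an immutable ``bytes`` object, which mirrors the rule that sealing makes an object read-only for every client.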
+ +Storing Arrow Objects and Pandas DataFrames in Plasma (TODO) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +You can copy the examples from test_store_arrow_objects as well as +test_store_pandas_dataframe in arrow/cpp/src/plasma/test/test.py; maybe explain +a little bit what is going on (should become clear if you look into the pyarrow +documentation a bit, let me know if it is not). Note that the +test_store_pandas_dataframe test doesn't work at the moment, it probably is a +bug in arrow; we are working on it and it will be fixed before the documentation +goes online. + From 25abf830093f8e658a953f8b716315b1082862a0 Mon Sep 17 00:00:00 2001 From: Crystal Yan Date: Sat, 8 Jul 2017 00:41:12 -0700 Subject: [PATCH 03/21] Plasma documentation- tweaked contents headings hierarchy, added a bit to 'Getting an Object' subsection in Plasma API. --- python/doc/source/plasma.rst | 48 +++++++++++++++++++++++++++++------- 1 file changed, 39 insertions(+), 9 deletions(-) diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 6ca7ae1acb7..a0d202f460e 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -24,8 +24,12 @@ The Plasma In-Memory Object Store .. contents:: Contents :depth: 3 +Installing Plasma +----------------- + Installation on Ubuntu ----------------------- +^^^^^^^^^^^^^^^^^^^^^^ + The following install instructions have been tested for Ubuntu 16.04. @@ -150,7 +154,8 @@ encounter the following error: Installation on Mac OS X (TODO) -------------------------------- +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + The following install instructions have been tested for Mac OS X 10.9 Mavericks. 
@@ -215,10 +220,10 @@ TODO: Troubleshooting Installation Issues ------------------------------------ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ImportError when Running Cmake -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> While installing arrow, if you run into the following error when running the ``cmake`` command, there may be an issue with finding numpy. @@ -319,7 +324,7 @@ You may now proceed with the rest of the arrow installation. ImportError After Installing Pyarrow -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> You may encounter the following error output when trying to ``import pyarrow`` inside Python: @@ -466,16 +471,41 @@ the object. .. code-block:: python - # Get the object from a different client. This blocks until the object has been sealed. - object_id = plasma.ObjectID(20 * b"a") - [buffer] = client.get([object_id]) # Note that you pass in as an ObjectID object, not a string + # Create a different client. Note that this second client could be + # created in the same or in a separate, concurrent Python session. + client2 = plasma.PlasmaClient() + client2.connect("/tmp/plasma", "", 0) + # Get the object in the second client. This blocks until the object has been sealed. + object_id2 = plasma.ObjectID(20 * b"a") + [buffer2] = client2.get([object_id2]) # Note that you pass in an ObjectID object, not a string If the object has not been sealed yet, then the call to client.get will block until the object has been sealed by the client constructing the object. +Note that the buffer fetched is not of the same type as the buffer the +original client created to store the object in the first place. The +buffer the original client created is a Python ``memoryview`` buffer object, +while the buffer returned from ``client.get`` is a custom ``PlasmaBuffer`` +object. + +However, the ``PlasmaBuffer`` object should behave like a ``memoryview`` +object, and supports slicing and indexing to expose its data. + +.. 
code-block:: shell + + >>> buffer + + >>> buffer[1] + 1 + >>> buffer2 + + >>> buffer2[1] + 1 + + Storing Arrow Objects and Pandas DataFrames in Plasma (TODO) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +------------------------------------------------------------ You can copy the examples from test_store_arrow_objects as well as test_store_pandas_dataframe in arrow/cpp/src/plasma/test/test.py; maybe explain From a49e122293f4df0777124e895489cf1ab1aa4aa7 Mon Sep 17 00:00:00 2001 From: Crystal Yan Date: Sat, 8 Jul 2017 02:25:31 -0700 Subject: [PATCH 04/21] Plasma documentation- Added parts on using Arrow with Plasma --- cpp/src/plasma/test/test.py | 641 +++++++++++++++++++++++++++++++++++ python/doc/source/plasma.rst | 233 ++++++++++--- 2 files changed, 819 insertions(+), 55 deletions(-) create mode 100644 cpp/src/plasma/test/test.py diff --git a/cpp/src/plasma/test/test.py b/cpp/src/plasma/test/test.py new file mode 100644 index 00000000000..1b2a6d21edf --- /dev/null +++ b/cpp/src/plasma/test/test.py @@ -0,0 +1,641 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import glob +import numpy as np +import os +import random +import signal +import site +import subprocess +import sys +import threading +import time +import unittest + +import plasma +import pyarrow as pa +import pandas as pd + +DEFAULT_PLASMA_STORE_MEMORY = 10 ** 9 + +USE_VALGRIND = False + +def random_name(): + return str(random.randint(0, 99999999)) + +def random_object_id(): + return plasma.ObjectID(np.random.bytes(20)) + +def generate_metadata(length): + metadata_buffer = bytearray(length) + if length > 0: + metadata_buffer[0] = random.randint(0, 255) + metadata_buffer[-1] = random.randint(0, 255) + for _ in range(100): + metadata_buffer[random.randint(0, length - 1)] = random.randint(0, 255) + return metadata_buffer + +def write_to_data_buffer(buff, length): + array = np.frombuffer(buff, dtype="uint8") + if length > 0: + array[0] = random.randint(0, 255) + array[-1] = random.randint(0, 255) + for _ in range(100): + array[random.randint(0, length - 1)] = random.randint(0, 255) + +def create_object_with_id(client, object_id, data_size, metadata_size, + seal=True): + metadata = generate_metadata(metadata_size) + memory_buffer = client.create(object_id, data_size, metadata) + write_to_data_buffer(memory_buffer, data_size) + if seal: + client.seal(object_id) + return memory_buffer, metadata + +def create_object(client, data_size, metadata_size, seal=True): + object_id = random_object_id() + memory_buffer, metadata = create_object_with_id(client, object_id, data_size, + metadata_size, seal=seal) + return object_id, memory_buffer, metadata + +def assert_get_object_equal(unit_test, client1, client2, object_id, + memory_buffer=None, metadata=None): + client1_buff = client1.get([object_id])[0] + client2_buff = client2.get([object_id])[0] + client1_metadata = client1.get_metadata([object_id])[0] + client2_metadata = client2.get_metadata([object_id])[0] + 
unit_test.assertEqual(len(client1_buff), len(client2_buff)) + unit_test.assertEqual(len(client1_metadata), len(client2_metadata)) + # Check that the buffers from the two clients are the same. + unit_test.assertTrue(plasma.buffers_equal(client1_buff, client2_buff)) + # Check that the metadata buffers from the two clients are the same. + unit_test.assertTrue(plasma.buffers_equal(client1_metadata, + client2_metadata)) + # If a reference buffer was provided, check that it is the same as well. + if memory_buffer is not None: + unit_test.assertTrue(plasma.buffers_equal(memory_buffer, client1_buff)) + # If reference metadata was provided, check that it is the same as well. + if metadata is not None: + unit_test.assertTrue(plasma.buffers_equal(metadata, client1_metadata)) + +def start_plasma_store(plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY, + use_valgrind=False, use_profiler=False, + stdout_file=None, stderr_file=None): + """Start a plasma store process. + Args: + use_valgrind (bool): True if the plasma store should be started inside of + valgrind. If this is True, use_profiler must be False. + use_profiler (bool): True if the plasma store should be started inside a + profiler. If this is True, use_valgrind must be False. + stdout_file: A file handle opened for writing to redirect stdout to. If no + redirection should happen, then this should be None. + stderr_file: A file handle opened for writing to redirect stderr to. If no + redirection should happen, then this should be None. + Return: + A tuple of the name of the plasma store socket and the process ID of the + plasma store process. 
+ """ + if use_valgrind and use_profiler: + raise Exception("Cannot use valgrind and profiler at the same time.") + module_dir = site.getsitepackages() + [plasma_dir] = glob.glob(os.path.join(module_dir[0], "plasma*")) + plasma_store_executable = os.path.join(os.path.abspath(plasma_dir), "plasma/plasma_store") + plasma_store_name = "/tmp/plasma_store{}".format(random_name()) + command = [plasma_store_executable, + "-s", plasma_store_name, + "-m", str(plasma_store_memory)] + if use_valgrind: + pid = subprocess.Popen(["valgrind", + "--track-origins=yes", + "--leak-check=full", + "--show-leak-kinds=all", + "--error-exitcode=1"] + command, + stdout=stdout_file, stderr=stderr_file) + time.sleep(1.0) + elif use_profiler: + pid = subprocess.Popen(["valgrind", "--tool=callgrind"] + command, + stdout=stdout_file, stderr=stderr_file) + time.sleep(1.0) + else: + pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file) + time.sleep(0.1) + return plasma_store_name, pid + +class TestPlasmaClient(unittest.TestCase): + + def setUp(self): + # Start Plasma store. + plasma_store_name, self.p = start_plasma_store( + use_valgrind=USE_VALGRIND) + # Connect to Plasma. + self.plasma_client = plasma.PlasmaClient() + self.plasma_client.connect(plasma_store_name, "", 64) + # For the eviction test + self.plasma_client2 = plasma.PlasmaClient() + self.plasma_client2.connect(plasma_store_name, "", 0) + + def tearDown(self): + # Check that the Plasma store is still alive. + self.assertEqual(self.p.poll(), None) + # Kill the plasma store process. + if USE_VALGRIND: + self.p.send_signal(signal.SIGTERM) + self.p.wait() + if self.p.returncode != 0: + os._exit(-1) + else: + self.p.kill() + + def test_create(self): + # Create an object id string. + object_id = random_object_id() + # Create a new buffer and write to it. 
+ length = 50 + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length), dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + # Seal the object. + self.plasma_client.seal(object_id) + # Get the object. + memory_buffer = np.frombuffer(self.plasma_client.get([object_id])[0], dtype="uint8") + for i in range(length): + self.assertEqual(memory_buffer[i], i % 256) + + def test_create_with_metadata(self): + for length in range(1000): + # Create an object id string. + object_id = random_object_id() + # Create a random metadata string. + metadata = generate_metadata(length) + # Create a new buffer and write to it. + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length, metadata), dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + # Seal the object. + self.plasma_client.seal(object_id) + # Get the object. + memory_buffer = np.frombuffer(self.plasma_client.get([object_id])[0], dtype="uint8") + for i in range(length): + self.assertEqual(memory_buffer[i], i % 256) + # Get the metadata. + metadata_buffer = np.frombuffer(self.plasma_client.get_metadata([object_id])[0], dtype="uint8") + self.assertEqual(len(metadata), len(metadata_buffer)) + for i in range(len(metadata)): + self.assertEqual(metadata[i], metadata_buffer[i]) + + def test_create_existing(self): + # This test is partially used to test the code path in which we create an + # object with an ID that already exists + length = 100 + for _ in range(1000): + object_id = random_object_id() + self.plasma_client.create(object_id, length, generate_metadata(length)) + try: + self.plasma_client.create(object_id, length, generate_metadata(length)) + # TODO(pcm): Introduce a more specific error type here + except pa.lib.ArrowException as e: + pass + else: + self.assertTrue(False) + + def test_get(self): + num_object_ids = 100 + # Test timing out of get with various timeouts. 
+ for timeout in [0, 10, 100, 1000]: + object_ids = [random_object_id() for _ in range(num_object_ids)] + results = self.plasma_client.get(object_ids, timeout_ms=timeout) + self.assertEqual(results, num_object_ids * [None]) + + data_buffers = [] + metadata_buffers = [] + for i in range(num_object_ids): + if i % 2 == 0: + data_buffer, metadata_buffer = create_object_with_id( + self.plasma_client, object_ids[i], 2000, 2000) + data_buffers.append(data_buffer) + metadata_buffers.append(metadata_buffer) + + # Test timing out from some but not all get calls with various timeouts. + for timeout in [0, 10, 100, 1000]: + data_results = self.plasma_client.get(object_ids, timeout_ms=timeout) + # metadata_results = self.plasma_client.get_metadata(object_ids, + # timeout_ms=timeout) + for i in range(num_object_ids): + if i % 2 == 0: + array1 = np.frombuffer(data_buffers[i // 2], dtype="uint8") + array2 = np.frombuffer(data_results[i], dtype="uint8") + np.testing.assert_equal(array1, array2) + # TODO(rkn): We should compare the metadata as well. But currently + # the types are different (e.g., memoryview versus bytearray). + # self.assertTrue(plasma.buffers_equal(metadata_buffers[i // 2], + # metadata_results[i])) + else: + self.assertIsNone(results[i]) + + def test_store_arrow_objects(self): + data = np.random.randn(10, 4) + # Write an arrow object. + object_id = random_object_id() + tensor = pa.Tensor.from_numpy(data) + data_size = pa.get_tensor_size(tensor) + buf = self.plasma_client.create(object_id, data_size) + stream = plasma.FixedSizeBufferOutputStream(buf) + pa.write_tensor(tensor, stream) + self.plasma_client.seal(object_id) + # Read the arrow object. + [tensor] = self.plasma_client.get([object_id]) + reader = pa.BufferReader(tensor) + array = pa.read_tensor(reader).to_numpy() + # Assert that they are equal. 
+ np.testing.assert_equal(data, array) + + def test_store_pandas_dataframe(self): + d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), + 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} + df = pd.DataFrame(d) + + # Write the DataFrame. + record_batch = pa.RecordBatch.from_pandas(df) + data_size = pa.get_record_batch_size(record_batch) + object_id = plasma.ObjectID(np.random.bytes(20)) + + buf = self.plasma_client.create(object_id, data_size) + stream = plasma.FixedSizeBufferOutputStream(buf) + stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) + stream_writer.write_batch(record_batch) + + self.plasma_client.seal(object_id) + + # Read the DataFrame. + [data] = self.plasma_client.get([object_id]) + reader = pa.RecordBatchStreamReader(pa.BufferReader(data)) + result = reader.read_next_batch().to_pandas() + + pd.util.testing.assert_frame_equal(df, result) + + def test_pickle_object_ids(self): + # This can be used for sharing object IDs between processes. + import pickle + object_id = random_object_id() + data = pickle.dumps(object_id) + object_id2 = pickle.loads(data) + self.assertEqual(object_id, object_id2) + + def test_store_full(self): + # The store is started with 1GB, so make sure that create throws an + # exception when it is full. + def assert_create_raises_plasma_full(unit_test, size): + partial_size = np.random.randint(size) + try: + _, memory_buffer, _ = create_object(unit_test.plasma_client, + partial_size, + size - partial_size) + # TODO(pcm): More specific error here. + except pa.lib.ArrowException as e: + pass + else: + # For some reason the above didn't throw an exception, so fail. + unit_test.assertTrue(False) + + # Create a list to keep some of the buffers in scope. + memory_buffers = [] + _, memory_buffer, _ = create_object(self.plasma_client, 5 * 10 ** 8, 0) + memory_buffers.append(memory_buffer) + # Remaining space is 5 * 10 ** 8. 
Make sure that we can't create an object + # of size 5 * 10 ** 8 + 1, but we can create one of size 2 * 10 ** 8. + assert_create_raises_plasma_full(self, 5 * 10 ** 8 + 1) + _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) + del memory_buffer + _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) + del memory_buffer + assert_create_raises_plasma_full(self, 5 * 10 ** 8 + 1) + + _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) + memory_buffers.append(memory_buffer) + # Remaining space is 3 * 10 ** 8. + assert_create_raises_plasma_full(self, 3 * 10 ** 8 + 1) + + _, memory_buffer, _ = create_object(self.plasma_client, 10 ** 8, 0) + memory_buffers.append(memory_buffer) + # Remaining space is 2 * 10 ** 8. + assert_create_raises_plasma_full(self, 2 * 10 ** 8 + 1) + + def test_contains(self): + fake_object_ids = [random_object_id() for _ in range(100)] + real_object_ids = [random_object_id() for _ in range(100)] + for object_id in real_object_ids: + self.assertFalse(self.plasma_client.contains(object_id)) + self.plasma_client.create(object_id, 100) + self.plasma_client.seal(object_id) + self.assertTrue(self.plasma_client.contains(object_id)) + for object_id in fake_object_ids: + self.assertFalse(self.plasma_client.contains(object_id)) + for object_id in real_object_ids: + self.assertTrue(self.plasma_client.contains(object_id)) + + def test_hash(self): + # Check the hash of an object that doesn't exist. + object_id1 = random_object_id() + try: + self.plasma_client.hash(object_id1) + # TODO(pcm): Introduce a more specific error type here + except pa.lib.ArrowException as e: + pass + else: + self.assertTrue(False) + + length = 1000 + # Create a random object, and check that the hash function always returns + # the same value. 
+ metadata = generate_metadata(length) + memory_buffer = np.frombuffer(self.plasma_client.create(object_id1, length, metadata), dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + self.plasma_client.seal(object_id1) + self.assertEqual(self.plasma_client.hash(object_id1), + self.plasma_client.hash(object_id1)) + + # Create a second object with the same value as the first, and check that + # their hashes are equal. + object_id2 = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id2, length, metadata), dtype="uint8") + for i in range(length): + memory_buffer[i] = i % 256 + self.plasma_client.seal(object_id2) + self.assertEqual(self.plasma_client.hash(object_id1), + self.plasma_client.hash(object_id2)) + + # Create a third object with a different value from the first two, and + # check that its hash is different. + object_id3 = random_object_id() + metadata = generate_metadata(length) + memory_buffer = np.frombuffer(self.plasma_client.create(object_id3, length, metadata), dtype="uint8") + for i in range(length): + memory_buffer[i] = (i + 1) % 256 + self.plasma_client.seal(object_id3) + self.assertNotEqual(self.plasma_client.hash(object_id1), + self.plasma_client.hash(object_id3)) + + # Create a fourth object with the same value as the third, but different + # metadata. Check that its hash is different from any of the previous + # three. 
+ object_id4 = random_object_id() + metadata4 = generate_metadata(length) + memory_buffer = np.frombuffer(self.plasma_client.create(object_id4, length, metadata4), dtype="uint8") + for i in range(length): + memory_buffer[i] = (i + 1) % 256 + self.plasma_client.seal(object_id4) + self.assertNotEqual(self.plasma_client.hash(object_id1), + self.plasma_client.hash(object_id4)) + self.assertNotEqual(self.plasma_client.hash(object_id3), + self.plasma_client.hash(object_id4)) + + def test_many_hashes(self): + hashes = [] + length = 2 ** 10 + + for i in range(256): + object_id = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length), dtype="uint8") + for j in range(length): + memory_buffer[j] = i + self.plasma_client.seal(object_id) + hashes.append(self.plasma_client.hash(object_id)) + + # Create objects of varying length. Each pair has two bits different. + for i in range(length): + object_id = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length), dtype="uint8") + for j in range(length): + memory_buffer[j] = 0 + memory_buffer[i] = 1 + self.plasma_client.seal(object_id) + hashes.append(self.plasma_client.hash(object_id)) + + # Create objects of varying length, all with value 0. + for i in range(length): + object_id = random_object_id() + memory_buffer = np.frombuffer(self.plasma_client.create(object_id, i), dtype="uint8") + for j in range(i): + memory_buffer[j] = 0 + self.plasma_client.seal(object_id) + hashes.append(self.plasma_client.hash(object_id)) + + # Check that all hashes were unique. + self.assertEqual(len(set(hashes)), 256 + length + length) + + # def test_individual_delete(self): + # length = 100 + # # Create an object id string. + # object_id = random_object_id() + # # Create a random metadata string. + # metadata = generate_metadata(100) + # # Create a new buffer and write to it. 
+ # memory_buffer = self.plasma_client.create(object_id, length, metadata) + # for i in range(length): + # memory_buffer[i] = chr(i % 256) + # # Seal the object. + # self.plasma_client.seal(object_id) + # # Check that the object is present. + # self.assertTrue(self.plasma_client.contains(object_id)) + # # Delete the object. + # self.plasma_client.delete(object_id) + # # Make sure the object is no longer present. + # self.assertFalse(self.plasma_client.contains(object_id)) + # + # def test_delete(self): + # # Create some objects. + # object_ids = [random_object_id() for _ in range(100)] + # for object_id in object_ids: + # length = 100 + # # Create a random metadata string. + # metadata = generate_metadata(100) + # # Create a new buffer and write to it. + # memory_buffer = self.plasma_client.create(object_id, length, metadata) + # for i in range(length): + # memory_buffer[i] = chr(i % 256) + # # Seal the object. + # self.plasma_client.seal(object_id) + # # Check that the object is present. + # self.assertTrue(self.plasma_client.contains(object_id)) + # + # # Delete the objects and make sure they are no longer present. + # for object_id in object_ids: + # # Delete the object. + # self.plasma_client.delete(object_id) + # # Make sure the object is no longer present. + # self.assertFalse(self.plasma_client.contains(object_id)) + + def test_illegal_functionality(self): + # Create an object id string. + object_id = random_object_id() + # Create a new buffer and write to it. + length = 1000 + memory_buffer = self.plasma_client.create(object_id, length) + # Make sure we cannot access memory out of bounds. + self.assertRaises(Exception, lambda: memory_buffer[length]) + # Seal the object. + self.plasma_client.seal(object_id) + # This test is commented out because it currently fails. + # # Make sure the object is ready only now. + # def illegal_assignment(): + # memory_buffer[0] = chr(0) + # self.assertRaises(Exception, illegal_assignment) + # Get the object. 
+ memory_buffer = self.plasma_client.get([object_id])[0] + + # Make sure the object is read only. + def illegal_assignment(): + memory_buffer[0] = chr(0) + self.assertRaises(Exception, illegal_assignment) + + def test_evict(self): + client = self.plasma_client2 + object_id1 = random_object_id() + b1 = client.create(object_id1, 1000) + client.seal(object_id1) + del b1 + self.assertEqual(client.evict(1), 1000) + + object_id2 = random_object_id() + object_id3 = random_object_id() + b2 = client.create(object_id2, 999) + b3 = client.create(object_id3, 998) + client.seal(object_id3) + del b3 + self.assertEqual(client.evict(1000), 998) + + object_id4 = random_object_id() + b4 = client.create(object_id4, 997) + client.seal(object_id4) + del b4 + client.seal(object_id2) + del b2 + self.assertEqual(client.evict(1), 997) + self.assertEqual(client.evict(1), 999) + + object_id5 = random_object_id() + object_id6 = random_object_id() + object_id7 = random_object_id() + b5 = client.create(object_id5, 996) + b6 = client.create(object_id6, 995) + b7 = client.create(object_id7, 994) + client.seal(object_id5) + client.seal(object_id6) + client.seal(object_id7) + del b5 + del b6 + del b7 + self.assertEqual(client.evict(2000), 996 + 995 + 994) + + def test_subscribe(self): + # Subscribe to notifications from the Plasma Store. + self.plasma_client.subscribe() + for i in [1, 10, 100, 1000, 10000, 100000]: + object_ids = [random_object_id() for _ in range(i)] + metadata_sizes = [np.random.randint(1000) for _ in range(i)] + data_sizes = [np.random.randint(1000) for _ in range(i)] + for j in range(i): + self.plasma_client.create( + object_ids[j], data_sizes[j], + metadata=bytearray(np.random.bytes(metadata_sizes[j]))) + self.plasma_client.seal(object_ids[j]) + # Check that we received notifications for all of the objects. 
+ for j in range(i): + notification_info = self.plasma_client.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + self.assertEqual(object_ids[j], recv_objid) + self.assertEqual(data_sizes[j], recv_dsize) + self.assertEqual(metadata_sizes[j], recv_msize) + + def test_subscribe_deletions(self): + # Subscribe to notifications from the Plasma Store. We use plasma_client2 + # to make sure that all used objects will get evicted properly. + self.plasma_client2.subscribe() + for i in [1, 10, 100, 1000, 10000, 100000]: + object_ids = [random_object_id() for _ in range(i)] + # Add 1 to the sizes to make sure we have nonzero object sizes. + metadata_sizes = [np.random.randint(1000) + 1 for _ in range(i)] + data_sizes = [np.random.randint(1000) + 1 for _ in range(i)] + for j in range(i): + x = self.plasma_client2.create( + object_ids[j], data_sizes[j], + metadata=bytearray(np.random.bytes(metadata_sizes[j]))) + self.plasma_client2.seal(object_ids[j]) + del x + # Check that we received notifications for creating all of the objects. + for j in range(i): + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + self.assertEqual(object_ids[j], recv_objid) + self.assertEqual(data_sizes[j], recv_dsize) + self.assertEqual(metadata_sizes[j], recv_msize) + + # Check that we receive notifications for deleting all objects, as we + # evict them. + for j in range(i): + self.assertEqual(self.plasma_client2.evict(1), + data_sizes[j] + metadata_sizes[j]) + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + self.assertEqual(object_ids[j], recv_objid) + self.assertEqual(-1, recv_dsize) + self.assertEqual(-1, recv_msize) + + # Test multiple deletion notifications. The first 9 object IDs have size 0, + # and the last has a nonzero size. 
When Plasma evicts 1 byte, it will evict + # all objects, so we should receive deletion notifications for each. + num_object_ids = 10 + object_ids = [random_object_id() for _ in range(num_object_ids)] + metadata_sizes = [0] * (num_object_ids - 1) + data_sizes = [0] * (num_object_ids - 1) + metadata_sizes.append(np.random.randint(1000)) + data_sizes.append(np.random.randint(1000)) + for i in range(num_object_ids): + x = self.plasma_client2.create( + object_ids[i], data_sizes[i], + metadata=bytearray(np.random.bytes(metadata_sizes[i]))) + self.plasma_client2.seal(object_ids[i]) + del x + for i in range(num_object_ids): + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + self.assertEqual(object_ids[i], recv_objid) + self.assertEqual(data_sizes[i], recv_dsize) + self.assertEqual(metadata_sizes[i], recv_msize) + self.assertEqual(self.plasma_client2.evict(1), + data_sizes[-1] + metadata_sizes[-1]) + for i in range(num_object_ids): + notification_info = self.plasma_client2.get_next_notification() + recv_objid, recv_dsize, recv_msize = notification_info + self.assertEqual(object_ids[i], recv_objid) + self.assertEqual(-1, recv_dsize) + self.assertEqual(-1, recv_msize) + +if __name__ == "__main__": + if len(sys.argv) > 1: + # Pop the argument so we don't mess with unittest's own argument parser. + if sys.argv[-1] == "valgrind": + arg = sys.argv.pop() + USE_VALGRIND = True + print("Using valgrind for tests") + unittest.main(verbosity=2) \ No newline at end of file diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index a0d202f460e..40305df8448 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -22,7 +22,7 @@ The Plasma In-Memory Object Store ================================= .. contents:: Contents - :depth: 3 + :depth: 3 Installing Plasma ----------------- @@ -39,8 +39,8 @@ with the ``bash`` command, whether or not you are using the Bash shell. .. 
code-block:: bash - wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh - bash Anaconda3-4.4.0-Linux-x86_64.sh + wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh + bash Anaconda3-4.4.0-Linux-x86_64.sh .. note:: @@ -57,7 +57,7 @@ command, so that the new PATH takes effect: .. code-block:: bash - source ~/.bashrc + source ~/.bashrc Anaconda should now be installed. For more information on installing Anaconda, see their `documentation here `_. @@ -68,10 +68,10 @@ as below: .. code-block:: bash - sudo apt-get update - sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev - sudo apt-get install -y unzip libjemalloc-dev pkg-config - sudo ldconfig + sudo apt-get update + sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev + sudo apt-get install -y unzip libjemalloc-dev pkg-config + sudo ldconfig Now, we need to install arrow. These instructions will install everything @@ -79,26 +79,25 @@ to your home directory. First download the arrow package from github: .. code-block:: bash - cd ~ - git clone https://github.com/apache/arrow - + cd ~ + git clone https://github.com/apache/arrow + Next, create a build directory as follows: .. code-block:: bash - cd arrow/cpp - git checkout plasma-cython - mkdir build - cd build + cd arrow/cpp + mkdir build + cd build You should now be in the ~/arrow/cpp/build directory. Run cmake and make to build Arrow. .. code-block:: bash - cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. - make - sudo make install + cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + make + sudo make install .. 
warning:: @@ -149,7 +148,7 @@ encounter the following error: Traceback (most recent call last): File "", line 1, in File "/home/ubuntu/arrow/cpp/src/plasma/plasma/__init__.py", line 18, in - from .plasma import * + from .plasma import * ModuleNotFoundError: No module named 'plasma.plasma' @@ -177,8 +176,8 @@ The next step is to install the following dependency packages as below: .. code-block:: bash - brew update - brew install cmake autoconf libtool pkg-config jemalloc + brew update + brew install cmake autoconf libtool pkg-config jemalloc Plasma also requires the build-essential, curl, unzip, libboost-all-dev, and libjemalloc-dev packages. MacOS should already come with curl, unzip, @@ -190,26 +189,26 @@ arrow package from github with the following commands: .. code-block:: bash - cd ~ - git clone https://github.com/apache/arrow - + cd ~ + git clone https://github.com/apache/arrow + Create a directory for the arrow build: .. code-block:: bash - cd arrow/cpp - git checkout plasma-cython - mkdir build - cd build + cd arrow/cpp + git checkout plasma-cython + mkdir build + cd build You should now be in the ~/arrow/cpp/build directory. Run cmake and make to build Arrow. .. code-block:: bash - cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. - make - sudo make install + cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + make + sudo make install TODO: @@ -230,13 +229,13 @@ the ``cmake`` command, there may be an issue with finding numpy. .. code-block:: shell - NumPy import failure: + NumPy import failure: - Traceback (most recent call last): + Traceback (most recent call last): - File "", line 1, in + File "", line 1, in - ImportError: No module named numpy + ImportError: No module named numpy First, verify that numpy has been installed alongside anaconda. 
Running ``conda list`` outputs all the packages that have been installed with @@ -335,7 +334,7 @@ inside Python: Traceback (most recent call last): File "", line 1, in File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in - from pyarrow.lib import cpu_count, set_cpu_count + from pyarrow.lib import cpu_count, set_cpu_count ImportError: libarrow.so.0: cannot open shared object file: No such file or directory If this is the case, after you have built Arrow, try running the following line @@ -354,7 +353,7 @@ inside Python: Traceback (most recent call last): File "", line 1, in File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in - from pyarrow.lib import cpu_count, set_cpu_count + from pyarrow.lib import cpu_count, set_cpu_count ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) If this is the case, run the following command to remove this ImportError: @@ -395,8 +394,8 @@ that the store will listen at. Thus, the above command sets the Plasma store to use up to 1 GB of memory, and sets the socket to ``/tmp/plasma``. Leave the current terminal window open as long as Plasma store should keep -running. Error messages, such as disconnecting clients, may occasionally be outputted. -To stop running the plasma store, you can press ``CTRL-C`` in the terminal. +running. Messages, such as notices about disconnecting clients, may occasionally be +printed. To stop running the Plasma store, you can press ``CTRL-C`` in the terminal.
Finally, from within python, the same socket given to ``./plasma_store`` should then be passed into the Plasma client as shown below: @@ -425,17 +424,28 @@ Object IDs Each object in the Plasma store should be associated with a unique id. The Object ID then serves as a key for any client to fetch that object from the Plasma store. You can form an ObjectID object from a byte string of -length 20. +20 bytes. .. code-block:: shell - # Create ObjectID of 20 bytes, each being the byte (b) encoding of the letter "a" + # Create ObjectID of 20 bytes, each byte being the byte (b) encoding of the letter "a" >>> id = plasma.ObjectID(20 * b"a") # "a" is encoded as 61 >>> id ObjectID(6161616161616161616161616161616161616161) +Random generation of Object IDs is often good enough to ensure unique ids. +You can easily create a helper function that randomizes object ids as follows: + +.. code-block:: python + + import numpy as np + + def random_object_id(): + return plasma.ObjectID(np.random.bytes(20)) + + Creating an Object ^^^^^^^^^^^^^^^^^^ @@ -443,6 +453,9 @@ Objects are created in Plasma in two stages. First, they are *created*, which allocates a buffer for the object. At this point, the client can write to the buffer and construct the object within the allocated buffer. +To create an object for Plasma, you need to create an object id, as well as +give the object's maximum size in bytes. + .. code-block:: python # Create an object. @@ -452,7 +465,7 @@ buffer and construct the object within the allocated buffer. # Write to the buffer. for i in range(1000): - buffer[i] = i % 128 + buffer[i] = i % 128 When the client is done, the client *seals* the buffer, making the object immutable, and making it available to other Plasma clients. @@ -486,12 +499,9 @@ until the object has been sealed by the client constructing the object. Note that the buffer fetched is not in the same object type as the buffer the original client created to store the object in the first place. 
The buffer the original client created is a Python ``memoryview`` buffer object, -while the buffer returned from ``client.get`` is a custom ``PlasmaBuffer`` +while the buffer returned from ``client.get`` is a Plasma-specific ``PlasmaBuffer`` object. -However, the ``PlasmaBuffer`` object should behave like a ``memoryview`` -object, and supports slicing and indexing to expose its data. - .. code-block:: shell >>> buffer @@ -502,16 +512,129 @@ object, and supports slicing and indexing to expose its data. >>> buffer2[1] 1 + +However, the ``PlasmaBuffer`` object should behave like a ``memoryview`` +object, and supports slicing and indexing to expose its data. + +.. code-block:: shell + + >>> buffer[5] + 5 + >>> buffer[129] + 1 + >>> bytes(buffer[1:4]) + b'\x01\x02\x03' + >>> bytes(buffer2[1:4]) + b'\x01\x02\x03' -Storing Arrow Objects and Pandas DataFrames in Plasma (TODO) ------------------------------------------------------------- +Using Arrow and Pandas with Plasma +---------------------------------- + +Storing Arrow Objects in Plasma +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Creating an Arrow object still follows the two steps of *creating* it with +a buffer, then *sealing* it, however Arrow objects such as tensors may be +more complicated to write than simple binary data. + +To create the object in Plasma, you still need an ObjectID and a size to +pass in. To find out the size of your Arrow object, you can use pyarrow +API such as ``pyarrow.get_tensor_size``. + +.. 
code-block:: python + + import numpy as np + import pyarrow as pa + + # Create a pyarrow.Tensor object from a numpy random 2-dimensional array + data = np.random.randn(10, 4) + tensor = pa.Tensor.from_numpy(data) + + # Create the object in Plasma + object_id = random_object_id() + data_size = pa.get_tensor_size(tensor) + buf = client.create(object_id, data_size) + +To write the Arrow tensor object into the buffer, you can use Plasma to +convert the ``memoryview`` buffer into a ``plasma.FixedSizeBufferOutputStream`` +object. A ``plasma.FixedSizeBufferOutputStream`` is a format suitable for Arrow's +``pyarrow.write_tensor``: + +.. code-block:: python + + # Write the tensor into the Plasma-allocated buffer + stream = plasma.FixedSizeBufferOutputStream(buf) + pa.write_tensor(tensor, stream) # Writes tensor's 552 bytes to Plasma stream + +To finish storing the Arrow object in Plasma, you can seal it just like +for any other data: + +.. code-block:: python + + # Seal the Plasma object + client.seal(object_id) + +Getting Arrow Objects from Plasma +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For reading the object from Plasma to Arrow, you can fetch it as a ``PlasmaBuffer`` +using its object id as usual. + +.. code-block:: python + + # Get the arrow object by ObjectID. + [buf2] = client.get([object_id]) + +To convert the ``PlasmaBuffer`` back into the Arrow tensor, first you have to +create a pyarrow ``BufferReader`` object from it. You can then pass the +``BufferReader`` into ``pyarrow.read_tensor`` to reconstruct the Arrow tensor +object: + +.. code-block:: python + + # Reconstruct the Arrow tensor object. + reader = pa.BufferReader(buf2) # Plasma buffer -> Arrow reader + tensor2 = pa.read_tensor(reader) # Arrow reader -> Arrow tensor + +Finally, you can use the tensor's ``to_numpy`` method to convert the Arrow object +back into numpy data: + +..
code-block:: python + + # Convert back to numpy + array = tensor2.to_numpy() # Arrow tensor -> numpy array + +Storing Pandas DataFrames in Plasma (TODO) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Getting Pandas DataFrames from Plasma (TODO) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Example code: + +.. code-block:: python + + import pandas as pd + + d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), + 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} + df = pd.DataFrame(d) + + # Write the DataFrame. + record_batch = pa.RecordBatch.from_pandas(df) + data_size = pa.get_record_batch_size(record_batch) + object_id = plasma.ObjectID(np.random.bytes(20)) + + buf = self.plasma_client.create(object_id, data_size) + stream = plasma.FixedSizeBufferOutputStream(buf) + stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) + stream_writer.write_batch(record_batch) + + self.plasma_client.seal(object_id) -You can copy the examples from test_store_arrow_objects as well as -test_store_pandas_dataframe in arrow/cpp/src/plasma/test/test.py; maybe explain -a little bit what is going on (should become clear if you look into the pyarrow -documentation a bit, let me know if it is not). Not that the -test_store_pandas_dataframe test doesn't work at the moment, it probably is a -bug in arrow; we are working on it and it will be fixed before the documentation -goes online. + # Read the DataFrame. + [data] = self.plasma_client.get([object_id]) + reader = pa.RecordBatchStreamReader(pa.BufferReader(data)) + result = reader.read_next_batch().to_pandas() From 2be9eab652c0373b4ac86ea355689b308a121915 Mon Sep 17 00:00:00 2001 From: Crystal Yan Date: Sat, 8 Jul 2017 03:23:11 -0700 Subject: [PATCH 05/21] Plasma documentation- Added using Pandas with Plasma sections. 
--- python/doc/source/plasma.rst | 100 ++++++++++++++++++++++++++++------- 1 file changed, 80 insertions(+), 20 deletions(-) diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 40305df8448..9e04539b275 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -423,7 +423,7 @@ Object IDs Each object in the Plasma store should be associated with a unique id. The Object ID then serves as a key for any client to fetch that object from -the Plasma store. You can form an ObjectID object from a byte string of +the Plasma store. You can form an ``ObjectID`` object from a byte string of 20 bytes. .. code-block:: shell @@ -535,10 +535,10 @@ Storing Arrow Objects in Plasma ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Creating an Arrow object still follows the two steps of *creating* it with -a buffer, then *sealing* it, however Arrow objects such as tensors may be +a buffer, then *sealing* it, however Arrow objects such as ``Tensors`` may be more complicated to write than simple binary data. -To create the object in Plasma, you still need an ObjectID and a size to +To create the object in Plasma, you still need an ``ObjectID`` and a size to pass in. To find out the size of your Arrow object, you can use pyarrow API such as ``pyarrow.get_tensor_size``. @@ -552,11 +552,11 @@ API such as ``pyarrow.get_tensor_size``. tensor = pa.Tensor.from_numpy(data) # Create the object in Plasma - object_id = random_object_id() + object_id = plasma.ObjectID(np.random.bytes(20)) data_size = pa.get_tensor_size(tensor) buf = client.create(object_id, data_size) -To write the Arrow tensor object into the buffer, you can use Plasma to +To write the Arrow ``Tensor`` object into the buffer, you can use Plasma to convert the ``memoryview`` buffer into a ``plasma.FixedSizeBufferOutputStream`` object. A ``plasma.FixedSizeBufferOutputStream`` is a format suitable for Arrow's ``pyarrow.write_tensor``: @@ -586,9 +586,9 @@ using its object id as usual. 
# Get the arrow object by ObjectID. [buf2] = client.get([object_id]) -To convert the ``PlasmaBuffer`` back into the Arrow tensor, first you have to +To convert the ``PlasmaBuffer`` back into the Arrow ``Tensor``, first you have to create a pyarrow ``BufferReader`` object from it. You can then pass the -``BufferReader`` into ``pyarrow.read_tensor`` to reconstruct the Arrow tensor +``BufferReader`` into ``pyarrow.read_tensor`` to reconstruct the Arrow ``Tensor`` object: .. code-block:: python @@ -605,36 +605,96 @@ back into numpy data: # Convert back to numpy array = tensor2.to_numpy() # Arrow tensor -> numpy array -Storing Pandas DataFrames in Plasma (TODO) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Storing Pandas DataFrames in Plasma +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Storing a Pandas ``DataFrame`` still follows the *create* then *seal* process +of storing an object in the Plasma store, however one cannot directly write +the ``DataFrame`` to Plasma with Pandas alone. Plasma also needs to know the +size of the ``DataFrame`` to allocate a buffer for. -Getting Pandas DataFrames from Plasma (TODO) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +One can instead use pyarrow and its supportive API as an intermediary step +to import the Pandas ``DataFrame`` into Plasma. Arrow has multiple equivalent +types to the various Pandas structures; see the :ref:`pandas` page for more. -Example code: +You can create the pyarrow equivalent of a Pandas ``DataFrame`` by using +``pyarrow.RecordBatch.from_pandas`` to convert it to a ``RecordBatch``. .. code-block:: python + import pyarrow as pa import pandas as pd + # Create a Pandas DataFrame d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} df = pd.DataFrame(d) - # Write the DataFrame.
+ # Convert the Pandas DataFrame into a PyArrow RecordBatch record_batch = pa.RecordBatch.from_pandas(df) - data_size = pa.get_record_batch_size(record_batch) + +Creating the Plasma object requires an ``ObjectID`` and the size of the +data. Now that we have converted the Pandas ``DataFrame`` into a PyArrow +``RecordBatch``, use ``pyarrow.get_record_batch_size`` to determine the +size of the Plasma object. + +.. code-block:: python + + # Create the Plasma object from the PyArrow RecordBatch object_id = plasma.ObjectID(np.random.bytes(20)) + data_size = pa.get_record_batch_size(record_batch) + buf = client.create(object_id, data_size) + +Similar to storing an Arrow object, you have to convert the ``memoryview`` +object into a ``plasma.FixedSizeBufferOutputStream`` object in order to +work with pyarrow's API. Then convert the ``FixedSizeBufferOutputStream`` +object into a pyarrow ``RecordBatchStreamWriter`` object to write out +the PyArrow ``RecordBatch`` into Plasma as follows: + +.. code-block:: python - buf = self.plasma_client.create(object_id, data_size) + # Write the PyArrow RecordBatch to Plasma stream = plasma.FixedSizeBufferOutputStream(buf) stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) stream_writer.write_batch(record_batch) - self.plasma_client.seal(object_id) +Finally, seal the finished object for use by all clients: + +.. code-block:: python + + # Seal the Plasma object + client.seal(object_id) + +Getting Pandas DataFrames from Plasma +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Since we store the Pandas DataFrame as a PyArrow ``RecordBatch`` object, +to get the object back from the Plasma store, we follow similar steps +to those specified in `Getting Arrow Objects from Plasma`_. + +We first have to convert the ``PlasmaBuffer`` returned from ``client.get`` +into an Arrow ``BufferReader`` object. + +.. 
code-block:: python + + # Fetch the Plasma object + [data] = client.get([object_id]) # Get PlasmaBuffer from ObjectID + buffer = pa.BufferReader(data) # PlasmaBuffer -> Arrow BufferReader + +From the ``BufferReader``, we can create a specific ``RecordBatchStreamReader`` +in Arrow to reconstruct the stored PyArrow ``RecordBatch`` object. + +.. code-block:: python + + # Convert object back into an Arrow RecordBatch + reader = pa.RecordBatchStreamReader(buffer) # Arrow BufferReader -> Arrow RecordBatchStreamReader + rec_batch = reader.read_next_batch() # Arrow RecordBatchStreamReader -> Arrow RecordBatch + +The last step is to convert the PyArrow ``RecordBatch`` object back into +the original Pandas ``DataFrame`` structure. + +.. code-block:: python - # Read the DataFrame. - [data] = self.plasma_client.get([object_id]) - reader = pa.RecordBatchStreamReader(pa.BufferReader(data)) - result = reader.read_next_batch().to_pandas() + # Convert back into Pandas + result = rec_batch.to_pandas() # Arrow RecordBatch -> Pandas DataFrame From f51f41e080ff9c3570de270ec6c59216c0dc7ad1 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 24 Jul 2017 13:26:06 -0700 Subject: [PATCH 06/21] remove old test.py --- cpp/src/plasma/test/test.py | 641 ------------------------------------ 1 file changed, 641 deletions(-) delete mode 100644 cpp/src/plasma/test/test.py diff --git a/cpp/src/plasma/test/test.py b/cpp/src/plasma/test/test.py deleted file mode 100644 index 1b2a6d21edf..00000000000 --- a/cpp/src/plasma/test/test.py +++ /dev/null @@ -1,641 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. 
You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import glob -import numpy as np -import os -import random -import signal -import site -import subprocess -import sys -import threading -import time -import unittest - -import plasma -import pyarrow as pa -import pandas as pd - -DEFAULT_PLASMA_STORE_MEMORY = 10 ** 9 - -USE_VALGRIND = False - -def random_name(): - return str(random.randint(0, 99999999)) - -def random_object_id(): - return plasma.ObjectID(np.random.bytes(20)) - -def generate_metadata(length): - metadata_buffer = bytearray(length) - if length > 0: - metadata_buffer[0] = random.randint(0, 255) - metadata_buffer[-1] = random.randint(0, 255) - for _ in range(100): - metadata_buffer[random.randint(0, length - 1)] = random.randint(0, 255) - return metadata_buffer - -def write_to_data_buffer(buff, length): - array = np.frombuffer(buff, dtype="uint8") - if length > 0: - array[0] = random.randint(0, 255) - array[-1] = random.randint(0, 255) - for _ in range(100): - array[random.randint(0, length - 1)] = random.randint(0, 255) - -def create_object_with_id(client, object_id, data_size, metadata_size, - seal=True): - metadata = generate_metadata(metadata_size) - memory_buffer = client.create(object_id, data_size, metadata) - write_to_data_buffer(memory_buffer, data_size) - if seal: - client.seal(object_id) - return memory_buffer, metadata - -def create_object(client, data_size, metadata_size, seal=True): - object_id = random_object_id() - memory_buffer, metadata = 
create_object_with_id(client, object_id, data_size, - metadata_size, seal=seal) - return object_id, memory_buffer, metadata - -def assert_get_object_equal(unit_test, client1, client2, object_id, - memory_buffer=None, metadata=None): - client1_buff = client1.get([object_id])[0] - client2_buff = client2.get([object_id])[0] - client1_metadata = client1.get_metadata([object_id])[0] - client2_metadata = client2.get_metadata([object_id])[0] - unit_test.assertEqual(len(client1_buff), len(client2_buff)) - unit_test.assertEqual(len(client1_metadata), len(client2_metadata)) - # Check that the buffers from the two clients are the same. - unit_test.assertTrue(plasma.buffers_equal(client1_buff, client2_buff)) - # Check that the metadata buffers from the two clients are the same. - unit_test.assertTrue(plasma.buffers_equal(client1_metadata, - client2_metadata)) - # If a reference buffer was provided, check that it is the same as well. - if memory_buffer is not None: - unit_test.assertTrue(plasma.buffers_equal(memory_buffer, client1_buff)) - # If reference metadata was provided, check that it is the same as well. - if metadata is not None: - unit_test.assertTrue(plasma.buffers_equal(metadata, client1_metadata)) - -def start_plasma_store(plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY, - use_valgrind=False, use_profiler=False, - stdout_file=None, stderr_file=None): - """Start a plasma store process. - Args: - use_valgrind (bool): True if the plasma store should be started inside of - valgrind. If this is True, use_profiler must be False. - use_profiler (bool): True if the plasma store should be started inside a - profiler. If this is True, use_valgrind must be False. - stdout_file: A file handle opened for writing to redirect stdout to. If no - redirection should happen, then this should be None. - stderr_file: A file handle opened for writing to redirect stderr to. If no - redirection should happen, then this should be None. 
- Return: - A tuple of the name of the plasma store socket and the process ID of the - plasma store process. - """ - if use_valgrind and use_profiler: - raise Exception("Cannot use valgrind and profiler at the same time.") - module_dir = site.getsitepackages() - [plasma_dir] = glob.glob(os.path.join(module_dir[0], "plasma*")) - plasma_store_executable = os.path.join(os.path.abspath(plasma_dir), "plasma/plasma_store") - plasma_store_name = "/tmp/plasma_store{}".format(random_name()) - command = [plasma_store_executable, - "-s", plasma_store_name, - "-m", str(plasma_store_memory)] - if use_valgrind: - pid = subprocess.Popen(["valgrind", - "--track-origins=yes", - "--leak-check=full", - "--show-leak-kinds=all", - "--error-exitcode=1"] + command, - stdout=stdout_file, stderr=stderr_file) - time.sleep(1.0) - elif use_profiler: - pid = subprocess.Popen(["valgrind", "--tool=callgrind"] + command, - stdout=stdout_file, stderr=stderr_file) - time.sleep(1.0) - else: - pid = subprocess.Popen(command, stdout=stdout_file, stderr=stderr_file) - time.sleep(0.1) - return plasma_store_name, pid - -class TestPlasmaClient(unittest.TestCase): - - def setUp(self): - # Start Plasma store. - plasma_store_name, self.p = start_plasma_store( - use_valgrind=USE_VALGRIND) - # Connect to Plasma. - self.plasma_client = plasma.PlasmaClient() - self.plasma_client.connect(plasma_store_name, "", 64) - # For the eviction test - self.plasma_client2 = plasma.PlasmaClient() - self.plasma_client2.connect(plasma_store_name, "", 0) - - def tearDown(self): - # Check that the Plasma store is still alive. - self.assertEqual(self.p.poll(), None) - # Kill the plasma store process. - if USE_VALGRIND: - self.p.send_signal(signal.SIGTERM) - self.p.wait() - if self.p.returncode != 0: - os._exit(-1) - else: - self.p.kill() - - def test_create(self): - # Create an object id string. - object_id = random_object_id() - # Create a new buffer and write to it. 
- length = 50 - memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length), dtype="uint8") - for i in range(length): - memory_buffer[i] = i % 256 - # Seal the object. - self.plasma_client.seal(object_id) - # Get the object. - memory_buffer = np.frombuffer(self.plasma_client.get([object_id])[0], dtype="uint8") - for i in range(length): - self.assertEqual(memory_buffer[i], i % 256) - - def test_create_with_metadata(self): - for length in range(1000): - # Create an object id string. - object_id = random_object_id() - # Create a random metadata string. - metadata = generate_metadata(length) - # Create a new buffer and write to it. - memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length, metadata), dtype="uint8") - for i in range(length): - memory_buffer[i] = i % 256 - # Seal the object. - self.plasma_client.seal(object_id) - # Get the object. - memory_buffer = np.frombuffer(self.plasma_client.get([object_id])[0], dtype="uint8") - for i in range(length): - self.assertEqual(memory_buffer[i], i % 256) - # Get the metadata. - metadata_buffer = np.frombuffer(self.plasma_client.get_metadata([object_id])[0], dtype="uint8") - self.assertEqual(len(metadata), len(metadata_buffer)) - for i in range(len(metadata)): - self.assertEqual(metadata[i], metadata_buffer[i]) - - def test_create_existing(self): - # This test is partially used to test the code path in which we create an - # object with an ID that already exists - length = 100 - for _ in range(1000): - object_id = random_object_id() - self.plasma_client.create(object_id, length, generate_metadata(length)) - try: - self.plasma_client.create(object_id, length, generate_metadata(length)) - # TODO(pcm): Introduce a more specific error type here - except pa.lib.ArrowException as e: - pass - else: - self.assertTrue(False) - - def test_get(self): - num_object_ids = 100 - # Test timing out of get with various timeouts. 
- for timeout in [0, 10, 100, 1000]: - object_ids = [random_object_id() for _ in range(num_object_ids)] - results = self.plasma_client.get(object_ids, timeout_ms=timeout) - self.assertEqual(results, num_object_ids * [None]) - - data_buffers = [] - metadata_buffers = [] - for i in range(num_object_ids): - if i % 2 == 0: - data_buffer, metadata_buffer = create_object_with_id( - self.plasma_client, object_ids[i], 2000, 2000) - data_buffers.append(data_buffer) - metadata_buffers.append(metadata_buffer) - - # Test timing out from some but not all get calls with various timeouts. - for timeout in [0, 10, 100, 1000]: - data_results = self.plasma_client.get(object_ids, timeout_ms=timeout) - # metadata_results = self.plasma_client.get_metadata(object_ids, - # timeout_ms=timeout) - for i in range(num_object_ids): - if i % 2 == 0: - array1 = np.frombuffer(data_buffers[i // 2], dtype="uint8") - array2 = np.frombuffer(data_results[i], dtype="uint8") - np.testing.assert_equal(array1, array2) - # TODO(rkn): We should compare the metadata as well. But currently - # the types are different (e.g., memoryview versus bytearray). - # self.assertTrue(plasma.buffers_equal(metadata_buffers[i // 2], - # metadata_results[i])) - else: - self.assertIsNone(results[i]) - - def test_store_arrow_objects(self): - data = np.random.randn(10, 4) - # Write an arrow object. - object_id = random_object_id() - tensor = pa.Tensor.from_numpy(data) - data_size = pa.get_tensor_size(tensor) - buf = self.plasma_client.create(object_id, data_size) - stream = plasma.FixedSizeBufferOutputStream(buf) - pa.write_tensor(tensor, stream) - self.plasma_client.seal(object_id) - # Read the arrow object. - [tensor] = self.plasma_client.get([object_id]) - reader = pa.BufferReader(tensor) - array = pa.read_tensor(reader).to_numpy() - # Assert that they are equal. 
- np.testing.assert_equal(data, array) - - def test_store_pandas_dataframe(self): - d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), - 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} - df = pd.DataFrame(d) - - # Write the DataFrame. - record_batch = pa.RecordBatch.from_pandas(df) - data_size = pa.get_record_batch_size(record_batch) - object_id = plasma.ObjectID(np.random.bytes(20)) - - buf = self.plasma_client.create(object_id, data_size) - stream = plasma.FixedSizeBufferOutputStream(buf) - stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) - stream_writer.write_batch(record_batch) - - self.plasma_client.seal(object_id) - - # Read the DataFrame. - [data] = self.plasma_client.get([object_id]) - reader = pa.RecordBatchStreamReader(pa.BufferReader(data)) - result = reader.read_next_batch().to_pandas() - - pd.util.testing.assert_frame_equal(df, result) - - def test_pickle_object_ids(self): - # This can be used for sharing object IDs between processes. - import pickle - object_id = random_object_id() - data = pickle.dumps(object_id) - object_id2 = pickle.loads(data) - self.assertEqual(object_id, object_id2) - - def test_store_full(self): - # The store is started with 1GB, so make sure that create throws an - # exception when it is full. - def assert_create_raises_plasma_full(unit_test, size): - partial_size = np.random.randint(size) - try: - _, memory_buffer, _ = create_object(unit_test.plasma_client, - partial_size, - size - partial_size) - # TODO(pcm): More specific error here. - except pa.lib.ArrowException as e: - pass - else: - # For some reason the above didn't throw an exception, so fail. - unit_test.assertTrue(False) - - # Create a list to keep some of the buffers in scope. - memory_buffers = [] - _, memory_buffer, _ = create_object(self.plasma_client, 5 * 10 ** 8, 0) - memory_buffers.append(memory_buffer) - # Remaining space is 5 * 10 ** 8. 
Make sure that we can't create an object - # of size 5 * 10 ** 8 + 1, but we can create one of size 2 * 10 ** 8. - assert_create_raises_plasma_full(self, 5 * 10 ** 8 + 1) - _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) - del memory_buffer - _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) - del memory_buffer - assert_create_raises_plasma_full(self, 5 * 10 ** 8 + 1) - - _, memory_buffer, _ = create_object(self.plasma_client, 2 * 10 ** 8, 0) - memory_buffers.append(memory_buffer) - # Remaining space is 3 * 10 ** 8. - assert_create_raises_plasma_full(self, 3 * 10 ** 8 + 1) - - _, memory_buffer, _ = create_object(self.plasma_client, 10 ** 8, 0) - memory_buffers.append(memory_buffer) - # Remaining space is 2 * 10 ** 8. - assert_create_raises_plasma_full(self, 2 * 10 ** 8 + 1) - - def test_contains(self): - fake_object_ids = [random_object_id() for _ in range(100)] - real_object_ids = [random_object_id() for _ in range(100)] - for object_id in real_object_ids: - self.assertFalse(self.plasma_client.contains(object_id)) - self.plasma_client.create(object_id, 100) - self.plasma_client.seal(object_id) - self.assertTrue(self.plasma_client.contains(object_id)) - for object_id in fake_object_ids: - self.assertFalse(self.plasma_client.contains(object_id)) - for object_id in real_object_ids: - self.assertTrue(self.plasma_client.contains(object_id)) - - def test_hash(self): - # Check the hash of an object that doesn't exist. - object_id1 = random_object_id() - try: - self.plasma_client.hash(object_id1) - # TODO(pcm): Introduce a more specific error type here - except pa.lib.ArrowException as e: - pass - else: - self.assertTrue(False) - - length = 1000 - # Create a random object, and check that the hash function always returns - # the same value. 
- metadata = generate_metadata(length) - memory_buffer = np.frombuffer(self.plasma_client.create(object_id1, length, metadata), dtype="uint8") - for i in range(length): - memory_buffer[i] = i % 256 - self.plasma_client.seal(object_id1) - self.assertEqual(self.plasma_client.hash(object_id1), - self.plasma_client.hash(object_id1)) - - # Create a second object with the same value as the first, and check that - # their hashes are equal. - object_id2 = random_object_id() - memory_buffer = np.frombuffer(self.plasma_client.create(object_id2, length, metadata), dtype="uint8") - for i in range(length): - memory_buffer[i] = i % 256 - self.plasma_client.seal(object_id2) - self.assertEqual(self.plasma_client.hash(object_id1), - self.plasma_client.hash(object_id2)) - - # Create a third object with a different value from the first two, and - # check that its hash is different. - object_id3 = random_object_id() - metadata = generate_metadata(length) - memory_buffer = np.frombuffer(self.plasma_client.create(object_id3, length, metadata), dtype="uint8") - for i in range(length): - memory_buffer[i] = (i + 1) % 256 - self.plasma_client.seal(object_id3) - self.assertNotEqual(self.plasma_client.hash(object_id1), - self.plasma_client.hash(object_id3)) - - # Create a fourth object with the same value as the third, but different - # metadata. Check that its hash is different from any of the previous - # three. 
- object_id4 = random_object_id() - metadata4 = generate_metadata(length) - memory_buffer = np.frombuffer(self.plasma_client.create(object_id4, length, metadata4), dtype="uint8") - for i in range(length): - memory_buffer[i] = (i + 1) % 256 - self.plasma_client.seal(object_id4) - self.assertNotEqual(self.plasma_client.hash(object_id1), - self.plasma_client.hash(object_id4)) - self.assertNotEqual(self.plasma_client.hash(object_id3), - self.plasma_client.hash(object_id4)) - - def test_many_hashes(self): - hashes = [] - length = 2 ** 10 - - for i in range(256): - object_id = random_object_id() - memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length), dtype="uint8") - for j in range(length): - memory_buffer[j] = i - self.plasma_client.seal(object_id) - hashes.append(self.plasma_client.hash(object_id)) - - # Create objects of varying length. Each pair has two bits different. - for i in range(length): - object_id = random_object_id() - memory_buffer = np.frombuffer(self.plasma_client.create(object_id, length), dtype="uint8") - for j in range(length): - memory_buffer[j] = 0 - memory_buffer[i] = 1 - self.plasma_client.seal(object_id) - hashes.append(self.plasma_client.hash(object_id)) - - # Create objects of varying length, all with value 0. - for i in range(length): - object_id = random_object_id() - memory_buffer = np.frombuffer(self.plasma_client.create(object_id, i), dtype="uint8") - for j in range(i): - memory_buffer[j] = 0 - self.plasma_client.seal(object_id) - hashes.append(self.plasma_client.hash(object_id)) - - # Check that all hashes were unique. - self.assertEqual(len(set(hashes)), 256 + length + length) - - # def test_individual_delete(self): - # length = 100 - # # Create an object id string. - # object_id = random_object_id() - # # Create a random metadata string. - # metadata = generate_metadata(100) - # # Create a new buffer and write to it. 
- # memory_buffer = self.plasma_client.create(object_id, length, metadata) - # for i in range(length): - # memory_buffer[i] = chr(i % 256) - # # Seal the object. - # self.plasma_client.seal(object_id) - # # Check that the object is present. - # self.assertTrue(self.plasma_client.contains(object_id)) - # # Delete the object. - # self.plasma_client.delete(object_id) - # # Make sure the object is no longer present. - # self.assertFalse(self.plasma_client.contains(object_id)) - # - # def test_delete(self): - # # Create some objects. - # object_ids = [random_object_id() for _ in range(100)] - # for object_id in object_ids: - # length = 100 - # # Create a random metadata string. - # metadata = generate_metadata(100) - # # Create a new buffer and write to it. - # memory_buffer = self.plasma_client.create(object_id, length, metadata) - # for i in range(length): - # memory_buffer[i] = chr(i % 256) - # # Seal the object. - # self.plasma_client.seal(object_id) - # # Check that the object is present. - # self.assertTrue(self.plasma_client.contains(object_id)) - # - # # Delete the objects and make sure they are no longer present. - # for object_id in object_ids: - # # Delete the object. - # self.plasma_client.delete(object_id) - # # Make sure the object is no longer present. - # self.assertFalse(self.plasma_client.contains(object_id)) - - def test_illegal_functionality(self): - # Create an object id string. - object_id = random_object_id() - # Create a new buffer and write to it. - length = 1000 - memory_buffer = self.plasma_client.create(object_id, length) - # Make sure we cannot access memory out of bounds. - self.assertRaises(Exception, lambda: memory_buffer[length]) - # Seal the object. - self.plasma_client.seal(object_id) - # This test is commented out because it currently fails. - # # Make sure the object is ready only now. - # def illegal_assignment(): - # memory_buffer[0] = chr(0) - # self.assertRaises(Exception, illegal_assignment) - # Get the object. 
- memory_buffer = self.plasma_client.get([object_id])[0] - - # Make sure the object is read only. - def illegal_assignment(): - memory_buffer[0] = chr(0) - self.assertRaises(Exception, illegal_assignment) - - def test_evict(self): - client = self.plasma_client2 - object_id1 = random_object_id() - b1 = client.create(object_id1, 1000) - client.seal(object_id1) - del b1 - self.assertEqual(client.evict(1), 1000) - - object_id2 = random_object_id() - object_id3 = random_object_id() - b2 = client.create(object_id2, 999) - b3 = client.create(object_id3, 998) - client.seal(object_id3) - del b3 - self.assertEqual(client.evict(1000), 998) - - object_id4 = random_object_id() - b4 = client.create(object_id4, 997) - client.seal(object_id4) - del b4 - client.seal(object_id2) - del b2 - self.assertEqual(client.evict(1), 997) - self.assertEqual(client.evict(1), 999) - - object_id5 = random_object_id() - object_id6 = random_object_id() - object_id7 = random_object_id() - b5 = client.create(object_id5, 996) - b6 = client.create(object_id6, 995) - b7 = client.create(object_id7, 994) - client.seal(object_id5) - client.seal(object_id6) - client.seal(object_id7) - del b5 - del b6 - del b7 - self.assertEqual(client.evict(2000), 996 + 995 + 994) - - def test_subscribe(self): - # Subscribe to notifications from the Plasma Store. - self.plasma_client.subscribe() - for i in [1, 10, 100, 1000, 10000, 100000]: - object_ids = [random_object_id() for _ in range(i)] - metadata_sizes = [np.random.randint(1000) for _ in range(i)] - data_sizes = [np.random.randint(1000) for _ in range(i)] - for j in range(i): - self.plasma_client.create( - object_ids[j], data_sizes[j], - metadata=bytearray(np.random.bytes(metadata_sizes[j]))) - self.plasma_client.seal(object_ids[j]) - # Check that we received notifications for all of the objects. 
- for j in range(i): - notification_info = self.plasma_client.get_next_notification() - recv_objid, recv_dsize, recv_msize = notification_info - self.assertEqual(object_ids[j], recv_objid) - self.assertEqual(data_sizes[j], recv_dsize) - self.assertEqual(metadata_sizes[j], recv_msize) - - def test_subscribe_deletions(self): - # Subscribe to notifications from the Plasma Store. We use plasma_client2 - # to make sure that all used objects will get evicted properly. - self.plasma_client2.subscribe() - for i in [1, 10, 100, 1000, 10000, 100000]: - object_ids = [random_object_id() for _ in range(i)] - # Add 1 to the sizes to make sure we have nonzero object sizes. - metadata_sizes = [np.random.randint(1000) + 1 for _ in range(i)] - data_sizes = [np.random.randint(1000) + 1 for _ in range(i)] - for j in range(i): - x = self.plasma_client2.create( - object_ids[j], data_sizes[j], - metadata=bytearray(np.random.bytes(metadata_sizes[j]))) - self.plasma_client2.seal(object_ids[j]) - del x - # Check that we received notifications for creating all of the objects. - for j in range(i): - notification_info = self.plasma_client2.get_next_notification() - recv_objid, recv_dsize, recv_msize = notification_info - self.assertEqual(object_ids[j], recv_objid) - self.assertEqual(data_sizes[j], recv_dsize) - self.assertEqual(metadata_sizes[j], recv_msize) - - # Check that we receive notifications for deleting all objects, as we - # evict them. - for j in range(i): - self.assertEqual(self.plasma_client2.evict(1), - data_sizes[j] + metadata_sizes[j]) - notification_info = self.plasma_client2.get_next_notification() - recv_objid, recv_dsize, recv_msize = notification_info - self.assertEqual(object_ids[j], recv_objid) - self.assertEqual(-1, recv_dsize) - self.assertEqual(-1, recv_msize) - - # Test multiple deletion notifications. The first 9 object IDs have size 0, - # and the last has a nonzero size. 
When Plasma evicts 1 byte, it will evict - # all objects, so we should receive deletion notifications for each. - num_object_ids = 10 - object_ids = [random_object_id() for _ in range(num_object_ids)] - metadata_sizes = [0] * (num_object_ids - 1) - data_sizes = [0] * (num_object_ids - 1) - metadata_sizes.append(np.random.randint(1000)) - data_sizes.append(np.random.randint(1000)) - for i in range(num_object_ids): - x = self.plasma_client2.create( - object_ids[i], data_sizes[i], - metadata=bytearray(np.random.bytes(metadata_sizes[i]))) - self.plasma_client2.seal(object_ids[i]) - del x - for i in range(num_object_ids): - notification_info = self.plasma_client2.get_next_notification() - recv_objid, recv_dsize, recv_msize = notification_info - self.assertEqual(object_ids[i], recv_objid) - self.assertEqual(data_sizes[i], recv_dsize) - self.assertEqual(metadata_sizes[i], recv_msize) - self.assertEqual(self.plasma_client2.evict(1), - data_sizes[-1] + metadata_sizes[-1]) - for i in range(num_object_ids): - notification_info = self.plasma_client2.get_next_notification() - recv_objid, recv_dsize, recv_msize = notification_info - self.assertEqual(object_ids[i], recv_objid) - self.assertEqual(-1, recv_dsize) - self.assertEqual(-1, recv_msize) - -if __name__ == "__main__": - if len(sys.argv) > 1: - # Pop the argument so we don't mess with unittest's own argument parser. 
- if sys.argv[-1] == "valgrind": - arg = sys.argv.pop() - USE_VALGRIND = True - print("Using valgrind for tests") - unittest.main(verbosity=2) \ No newline at end of file From 3f3f373bfdeb2125a9f9dbdf145991cec7128e3f Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 24 Jul 2017 15:09:36 -0700 Subject: [PATCH 07/21] fix plasma documentation --- python/doc/source/plasma.rst | 433 ++++++++++++++++------------------- 1 file changed, 199 insertions(+), 234 deletions(-) diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 9e04539b275..ad6beeac1bd 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -22,7 +22,7 @@ The Plasma In-Memory Object Store ================================= .. contents:: Contents - :depth: 3 + :depth: 3 Installing Plasma ----------------- @@ -39,14 +39,14 @@ with the ``bash`` command, whether or not you are using the Bash shell. .. code-block:: bash - wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh - bash Anaconda3-4.4.0-Linux-x86_64.sh + wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh + bash Anaconda3-4.4.0-Linux-x86_64.sh .. note:: - As an alternative to the wget command above, you can also download the - Anaconda installer script through your web browser at their - `Download Webpage here `_. + As an alternative to the wget command above, you can also download the + Anaconda installer script through your web browser at their + `Download Webpage here `_. Accept the Anaconda license agreement and follow the prompt. Allow the @@ -57,7 +57,7 @@ command, so that the new PATH takes effect: .. code-block:: bash - source ~/.bashrc + source ~/.bashrc Anaconda should now be installed. For more information on installing Anaconda, see their `documentation here `_. @@ -68,10 +68,8 @@ as below: .. 
code-block:: bash - sudo apt-get update - sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev - sudo apt-get install -y unzip libjemalloc-dev pkg-config - sudo ldconfig + sudo apt-get update + sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev unzip libjemalloc-dev pkg-config Now, we need to install arrow. These instructions will install everything @@ -79,78 +77,58 @@ to your home directory. First download the arrow package from github: .. code-block:: bash - cd ~ - git clone https://github.com/apache/arrow - + cd ~ + git clone https://github.com/apache/arrow + Next, create a build directory as follows: .. code-block:: bash - cd arrow/cpp - mkdir build - cd build + cd arrow/cpp + mkdir build + cd build You should now be in the ~/arrow/cpp/build directory. Run cmake and make to build Arrow. .. code-block:: bash - cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. - make - sudo make install + cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + make + sudo make install .. warning:: - Running the ``cmake`` command above may give an ``ImportError`` - concerning numpy. If that is the case, see `ImportError when Running Cmake`_. + Running the ``cmake`` command above may give an ``ImportError`` + concerning numpy. If that is the case, see `ImportError when Running Cmake`_. -After installing arrow, you need to install pyarrow as follows: +After installing arrow, you need to install pyarrow with the Plasma client as follows: .. code-block:: bash - cd ~/arrow/python - python setup.py install + cd ~/arrow/python + PYARROW_WITH_PLASMA=1 python setup.py install Once you've installed pyarrow, you should verify that you are able to -import it when running python in the terminal: +import it when running python in the terminal. Also make sure you can import +the Plasma client library. 
Make sure to try this from +outside of the ``~/arrow/cpp/src/plasma`` directory, otherwise you may +encounter a ModuleNotFoundError. .. code-block:: shell - ubuntu:~/arrow/cpp/src/plasma$ python - Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux - Type "help", "copyright", "credits" or "license" for more information. - >>> import pyarrow - >>> + ubuntu:~/arrow/cpp/src/plasma$ cd ~ + ubuntu:~/$ python + Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) + [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux + Type "help", "copyright", "credits" or "license" for more information. + >>> import pyarrow + >>> import pyarrow.plasma If you encounter an ImportError when running the above, see `ImportError After Installing Pyarrow`_. -Finally, you can install Plasma. - -.. code-block:: bash - - cd ~/arrow/cpp/src/plasma - python setup.py install - -Similar to pyarrow, you can verify that Plasma has been installed by -trying to import it when running python. Make sure to try this from -outside of the ~/arrow/cpp/src/plasma directory, otherwise you may -encounter the following error: - -.. code-block:: shell - - ubuntu:~/arrow/cpp/src/plasma$ python - Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux - Type "help", "copyright", "credits" or "license" for more information. - >>> import plasma - Traceback (most recent call last): - File "", line 1, in - File "/home/ubuntu/arrow/cpp/src/plasma/plasma/__init__.py", line 18, in - from .plasma import * - ModuleNotFoundError: No module named 'plasma.plasma' - +Congratulations! Plasma is now set up and you can look at `The Plasma API`_. Installation on Mac OS X (TODO) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -176,8 +154,8 @@ The next step is to install the following dependency packages as below: .. 
code-block:: bash - brew update - brew install cmake autoconf libtool pkg-config jemalloc + brew update + brew install cmake autoconf libtool pkg-config jemalloc Plasma also requires the build-essential, curl, unzip, libboost-all-dev, and libjemalloc-dev packages. MacOS should already come with curl, unzip, @@ -189,26 +167,26 @@ arrow package from github with the following commands: .. code-block:: bash - cd ~ - git clone https://github.com/apache/arrow - + cd ~ + git clone https://github.com/apache/arrow + Create a directory for the arrow build: .. code-block:: bash - cd arrow/cpp - git checkout plasma-cython - mkdir build - cd build + cd arrow/cpp + git checkout plasma-cython + mkdir build + cd build You should now be in the ~/arrow/cpp/build directory. Run cmake and make to build Arrow. .. code-block:: bash - cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. - make - sudo make install + cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + make + sudo make install TODO: @@ -229,13 +207,13 @@ the ``cmake`` command, there may be an issue with finding numpy. .. code-block:: shell - NumPy import failure: + NumPy import failure: - Traceback (most recent call last): + Traceback (most recent call last): - File "", line 1, in + File "", line 1, in - ImportError: No module named numpy + ImportError: No module named numpy First, verify that numpy has been installed alongside anaconda. Running ``conda list`` outputs all the packages that have been installed with @@ -243,8 +221,8 @@ anaconda: .. code-block:: shell - ubuntu:~/arrow/cpp/build$ conda list - numpy 1.12.1 py36_0 + ubuntu:~/arrow/cpp/build$ conda list + numpy 1.12.1 py36_0 If something similar to the above numpy line is not listed in the output, numpy has not yet been installed. @@ -253,7 +231,7 @@ If numpy has not been installed, try running the following command: .. 
code-block:: bash - conda install numpy + conda install numpy If numpy is still not installed, try reinstalling anaconda. @@ -263,8 +241,8 @@ Anaconda package: .. code-block:: shell - ubuntu:~/arrow/cpp/build$ which python - /home/ubuntu/anaconda3/bin/python + ubuntu:~/arrow/cpp/build$ which python + /home/ubuntu/anaconda3/bin/python If this issue comes up, most likely the anaconda library has not yet been properly prepended to your PATH and the new PATH reloaded. @@ -276,8 +254,8 @@ the paths to all python versions installed on your machine by running .. code-block:: shell - ubuntu:~/arrow/cpp/build$ whereis python - python: /usr/bin/python3.5m /usr/bin/python2.7 /usr/bin/python /usr/bin/python2.7-config /usr/bin/python3.5 /usr/lib/python2.7 /usr/lib/python3.5 /etc/python2.7 /etc/python /etc/python3.5 /usr/local/lib/python2.7 /usr/local/lib/python3.5 /usr/include/python2.7 /usr/share/python /home/ubuntu/anaconda3/bin/python3.6m-config /home/ubuntu/anaconda3/bin/python3.6m /home/ubuntu/anaconda3/bin/python3.6 /home/ubuntu/anaconda3/bin/python3.6-config /home/ubuntu/anaconda3/bin/python /usr/share/man/man1/python.1.gz + ubuntu:~/arrow/cpp/build$ whereis python + python: /usr/bin/python3.5m /usr/bin/python2.7 /usr/bin/python /usr/bin/python2.7-config /usr/bin/python3.5 /usr/lib/python2.7 /usr/lib/python3.5 /etc/python2.7 /etc/python /etc/python3.5 /usr/local/lib/python2.7 /usr/local/lib/python3.5 /usr/include/python2.7 /usr/share/python /home/ubuntu/anaconda3/bin/python3.6m-config /home/ubuntu/anaconda3/bin/python3.6m /home/ubuntu/anaconda3/bin/python3.6 /home/ubuntu/anaconda3/bin/python3.6-config /home/ubuntu/anaconda3/bin/python /usr/share/man/man1/python.1.gz Anaconda usually modifies your ``~/.bashrc`` file in its installation. You may need to manually add the following line or similar to the bottom @@ -285,8 +263,8 @@ of your ``~/.bashrc`` file, then reload your terminal window: .. 
code-block:: bash - # added by Anaconda3 4.4.0 installer - export PATH="/home/ubuntu/anaconda3/bin:$PATH" + # added by Anaconda3 4.4.0 installer + export PATH="/home/ubuntu/anaconda3/bin:$PATH" You can also create a persistent ``python`` shell alias to point to your Anaconda python version by adding to following to the bottom of your @@ -294,7 +272,7 @@ Anaconda python version by adding to following to the bottom of your .. code-block:: bash - alias python=/home/ubuntu/anaconda3/bin/python + alias python=/home/ubuntu/anaconda3/bin/python At this point, if you no longer have any issues with your anaconda installation or with your python version, you should be able to run Python @@ -302,12 +280,12 @@ in the terminal and import numpy with no errors: .. code-block:: shell - ubuntu:~/arrow/cpp/build$ python - Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux - Type "help", "copyright", "credits" or "license" for more information. - >>> import numpy - >>> + ubuntu:~/arrow/cpp/build$ python + Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) + [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux + Type "help", "copyright", "credits" or "license" for more information. + >>> import numpy + >>> Finally, if you are confident that numpy has been installed and that you are using Anaconda's version of python, cmake may be looking for python and @@ -317,7 +295,7 @@ version) to force ``cmake`` to use the correct python version: .. code-block:: bash - cmake -DPYTHON_EXECUTABLE:FILEPATH=/home/ubuntu/anaconda3/bin/python -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. + cmake -DPYTHON_EXECUTABLE:FILEPATH=/home/ubuntu/anaconda3/bin/python -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. You may now proceed with the rest of the arrow installation. @@ -330,81 +308,71 @@ inside Python: .. 
code-block:: shell - >>> import pyarrow - Traceback (most recent call last): - File "", line 1, in - File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in - from pyarrow.lib import cpu_count, set_cpu_count - ImportError: libarrow.so.0: cannot open shared object file: No such file or directory + >>> import pyarrow + Traceback (most recent call last): + File "", line 1, in + File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in + from pyarrow.lib import cpu_count, set_cpu_count + ImportError: libarrow.so.0: cannot open shared object file: No such file or directory If this is the case, after you have built Arrow, try running the following line again in the terminal to remove this ImportError: .. code-block:: bash - - sudo ldconfig + + sudo ldconfig You may also encounter the following error output when trying to ``import pyarrow`` inside Python: .. 
code-block:: shell - >>> import pyarrow - Traceback (most recent call last): - File "", line 1, in - File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in - from pyarrow.lib import cpu_count, set_cpu_count - ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) + >>> import pyarrow + Traceback (most recent call last): + File "", line 1, in + File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in + from pyarrow.lib import cpu_count, set_cpu_count + ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) If this is the case, run the following command to remove this ImportError: .. code-block:: bash - - conda install -y libgcc + + conda install -y libgcc The Plasma API -------------- -Creating a Plasma client -^^^^^^^^^^^^^^^^^^^^^^^^ - -First locate your plasma directory. This can be printed out by -importing plasma in python and running the command ``print(plasma.__path__)``. -If running python from the terminal, be sure to run this command outside of the ~/arrow/cpp/src/plasma directory, or you may encounter an error. +Starting the Plasma store +^^^^^^^^^^^^^^^^^^^^^^^^^ -For example, to find your plasma directory, you can run the following one-liner -from the terminal like follows: - -.. 
code-block:: shell - - ubuntu:~$ python -c "import plasma; print(plasma.__path__)" - ['/home/ubuntu/anaconda3/lib/python3.6/site-packages/plasma-0.0.1-py3.6-linux-x86_64.egg/plasma'] - -From inside the plasma directory, you can start the plasma store in the -foreground by issuing a terminal command similar to the following: +You can start the Plasma store by issuing a terminal command similar to the +following: .. code-block:: bash - ./plasma_store -m 1000000000 -s /tmp/plasma + plasma_store -m 1000000000 -s /tmp/plasma -This command must be issued inside the plasma directory to work. The -m flag -specifies the size of the store in bytes, and the -s flag specifies the socket -that the store will listen at. Thus, the above command sets the Plasma store -to use up to 1 GB of memory, and sets the socket to ``/tmp/plasma``. +The -m flag specifies the size of the store in bytes, and the -s flag specifies +the socket that the store will listen at. Thus, the above command sets the +Plasma store to use up to 1 GB of memory, and sets the socket to +``/tmp/plasma``. Leave the current terminal window open as long as Plasma store should keep running. Messages, concerning such as disconnecting clients, may occasionally be outputted. To stop running the Plasma store, you can press ``CTRL-C`` in the terminal. -Finally, from within python, the same socket given to ``./plasma_store`` -should then be passed into the Plasma client as shown below: +Creating a Plasma client +^^^^^^^^^^^^^^^^^^^^^^^^ + +To create a Plasma client from within Python, pass the same socket that was +given to ``plasma_store`` into the Plasma client as shown below: .. 
code-block:: python - import plasma - client = plasma.PlasmaClient() - client.connect("/tmp/plasma", "", 0) + import pyarrow.plasma as plasma + client = plasma.PlasmaClient("/tmp/plasma", "", 0) If the following error occurs from running the above Python code, that means that either the socket given is incorrect, or the ``./plasma_store`` is @@ -413,9 +381,9 @@ process in your plasma directory. .. code-block:: shell - >>> client.connect("/tmp/plasma", "", 0) - Connection to socket failed for pathname /tmp/plasma - Could not connect to socket /tmp/plasma + >>> client = plasma.PlasmaClient("/tmp/plasma", "", 0) + Connection to socket failed for pathname /tmp/plasma + Could not connect to socket /tmp/plasma Object IDs @@ -428,22 +396,22 @@ the Plasma store. You can form an ``ObjectID`` object from a byte string of .. code-block:: shell - # Create ObjectID of 20 bytes, each byte being the byte (b) encoding of the letter "a" - >>> id = plasma.ObjectID(20 * b"a") + # Create ObjectID of 20 bytes, each byte being the byte (b) encoding of the letter "a" + >>> id = plasma.ObjectID(20 * b"a") - # "a" is encoded as 61 - >>> id - ObjectID(6161616161616161616161616161616161616161) + # "a" is encoded as 61 + >>> id + ObjectID(6161616161616161616161616161616161616161) Random generation of Object IDs is often good enough to ensure unique ids. You can easily create a helper function that randomizes object ids as follows: .. code-block:: python - import numpy as np + import numpy as np - def random_object_id(): - return plasma.ObjectID(np.random.bytes(20)) + def random_object_id(): + return plasma.ObjectID(np.random.bytes(20)) Creating an Object @@ -458,22 +426,22 @@ give the object's maximum size in bytes. .. code-block:: python - # Create an object. - object_id = plasma.ObjectID(20 * b"a") # Note that this is an ObjectID object, not a string - object_size = 1000 - buffer = memoryview(client.create(object_id, object_size)) + # Create an object. 
+ object_id = plasma.ObjectID(20 * b"a") # Note that this is an ObjectID object, not a string + object_size = 1000 + buffer = memoryview(client.create(object_id, object_size)) - # Write to the buffer. - for i in range(1000): - buffer[i] = i % 128 + # Write to the buffer. + for i in range(1000): + buffer[i] = i % 128 When the client is done, the client *seals* the buffer, making the object immutable, and making it available to other Plasma clients. .. code-block:: python - # Seal the object. This makes the object immutable and available to other clients. - client.seal(object_id) + # Seal the object. This makes the object immutable and available to other clients. + client.seal(object_id) Getting an Object @@ -484,49 +452,44 @@ the object. .. code-block:: python - # Create a different client. Note that this second client could be - # created in the same or in a separate, concurrent Python session. - client2 = plasma.PlasmaClient() - client2.connect("/tmp/plasma", "", 0) + # Create a different client. Note that this second client could be + # created in the same or in a separate, concurrent Python session. + client2 = plasma.PlasmaClient("/tmp/plasma", "", 0) - # Get the object in the second client. This blocks until the object has been sealed. - object_id2 = plasma.ObjectID(20 * b"a") - [buffer2] = client2.get([object_id]) # Note that you pass in as an ObjectID object, not a string + # Get the object in the second client. This blocks until the object has been sealed. + object_id2 = plasma.ObjectID(20 * b"a") + [buffer2] = client2.get([object_id2]) # Note that you pass in an ObjectID object, not a string If the object has not been sealed yet, then the call to client.get will block -until the object has been sealed by the client constructing the object. +until the object has been sealed by the client constructing the object. Using +the ``timeout_ms`` argument to ``get``, you can specify a timeout for this (in +milliseconds).
After the timeout, the interpreter will yield control back. Note that the buffer fetched is not in the same object type as the buffer the original client created to store the object in the first place. The buffer the original client created is a Python ``memoryview`` buffer object, while the buffer returned from ``client.get`` is a Plasma-specific ``PlasmaBuffer`` -object. +object. It supports the Python buffer protocol, so you can create a memoryview +from it, which supports slicing and indexing to expose its data. .. code-block:: shell - >>> buffer - - >>> buffer[1] - 1 - >>> buffer2 - - >>> buffer2[1] - 1 - -However, the ``PlasmaBuffer`` object should behave like a ``memoryview`` -object, and supports slicing and indexing to expose its data. + >>> buffer + + >>> buffer[1] + 1 + >>> buffer2 + + >>> view2 = memoryview(buffer2) + >>> view2[1] + 1 + >>> view2[129] + 1 + >>> bytes(buffer[1:4]) + b'\x01\x02\x03' + >>> bytes(view2[1:4]) + b'\x01\x02\x03' -.. code-block:: shell - - >>> buffer[5] - 5 - >>> buffer[129] - 1 - >>> bytes(buffer[1:4]) - b'\x01\x02\x03' - >>> bytes(buffer2[1:4]) - b'\x01\x02\x03' - Using Arrow and Pandas with Plasma ---------------------------------- @@ -544,36 +507,36 @@ API such as ``pyarrow.get_tensor_size``. .. 
code-block:: python - import numpy as np - import pyarrow as pa + import numpy as np + import pyarrow as pa - # Create a pyarrow.Tensor object from a numpy random 2-dimensional array - data = np.random.randn(10, 4) - tensor = pa.Tensor.from_numpy(data) + # Create a pyarrow.Tensor object from a numpy random 2-dimensional array + data = np.random.randn(10, 4) + tensor = pa.Tensor.from_numpy(data) - # Create the object in Plasma - object_id = plasma.ObjectID(np.random.bytes(20)) - data_size = pa.get_tensor_size(tensor) - buf = client.create(object_id, data_size) + # Create the object in Plasma + object_id = plasma.ObjectID(np.random.bytes(20)) + data_size = pa.get_tensor_size(tensor) + buf = client.create(object_id, data_size) To write the Arrow ``Tensor`` object into the buffer, you can use Plasma to -convert the ``memoryview`` buffer into a ``plasma.FixedSizeBufferOutputStream`` -object. A ``plasma.FixedSizeBufferOutputStream`` is a format suitable for Arrow's +convert the ``memoryview`` buffer into a ``pyarrow.FixedSizeBufferOutputStream`` +object. A ``pyarrow.FixedSizeBufferOutputStream`` is a format suitable for Arrow's ``pyarrow.write_tensor``: .. code-block:: python - # Write the tensor into the Plasma-allocated buffer - stream = plasma.FixedSizeBufferOutputStream(buf) - pa.write_tensor(tensor, stream) # Writes tensor's 552 bytes to Plasma stream + # Write the tensor into the Plasma-allocated buffer + stream = pa.FixedSizeBufferOutputStream(buf) + pa.write_tensor(tensor, stream) # Writes tensor's 552 bytes to Plasma stream To finish storing the Arrow object in Plasma, you can seal it just like for any other data: .. code-block:: python - # Seal the Plasma object - client.seal(object_id) + # Seal the Plasma object + client.seal(object_id) Getting Arrow Objects from Plasma ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -583,8 +546,8 @@ using its object id as usual. .. code-block:: python - # Get the arrow object by ObjectID. 
- [buf2] = client.get([object_id]) + # Get the arrow object by ObjectID. + [buf2] = client.get([object_id]) To convert the ``PlasmaBuffer`` back into the Arrow ``Tensor``, first you have to create a pyarrow ``BufferReader`` object from it. You can then pass the @@ -593,17 +556,17 @@ object: .. code-block:: python - # Reconstruct the Arrow tensor object. - reader = pa.BufferReader(buf2) # Plasma buffer -> Arrow reader - tensor2 = pa.read_tensor(reader) # Arrow reader -> Arrow tensor + # Reconstruct the Arrow tensor object. + reader = pa.BufferReader(buf2) # Plasma buffer -> Arrow reader + tensor2 = pa.read_tensor(reader) # Arrow reader -> Arrow tensor Finally, you can use ``pyarrow.read_tensor`` to convert the Arrow object back into numpy data: .. code-block:: python - # Convert back to numpy - array = tensor2.to_numpy() # Arrow tensor -> numpy array + # Convert back to numpy + array = tensor2.to_numpy() # Arrow tensor -> numpy array Storing Pandas DataFrames in Plasma ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -622,28 +585,31 @@ You can create the pyarrow equivalent of a Pandas ``DataFrame`` by using .. code-block:: python - import pyarrow as pa - import pandas as pd + import pyarrow as pa + import pandas as pd - # Create a Pandas DataFrame - d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), - 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} - df = pd.DataFrame(d) + # Create a Pandas DataFrame + d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), + 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])} + df = pd.DataFrame(d) - # Convert the Pandas DataFrame into a PyArrow RecordBatch - record_batch = pa.RecordBatch.from_pandas(df) + # Convert the Pandas DataFrame into a PyArrow RecordBatch + record_batch = pa.RecordBatch.from_pandas(df) Creating the Plasma object requires an ``ObjectID`` and the size of the data. 
Now that we have converted the Pandas ``DataFrame`` into a PyArrow -``RecordBatch``, use ``pyarrow.get_record_batch_size`` to determine the +``RecordBatch``, use the ``MockOutputStream`` to determine the size of the Plasma object. .. code-block:: python - # Create the Plasma object from the PyArrow RecordBatch - object_id = plasma.ObjectID(np.random.bytes(20)) - data_size = pa.get_record_batch_size(record_batch) - buf = client.create(object_id, data_size) + # Create the Plasma object from the PyArrow RecordBatch + object_id = plasma.ObjectID(np.random.bytes(20)) + mock_sink = pa.MockOutputStream() + stream_writer = pa.RecordBatchStreamWriter(mock_sink, record_batch.schema) + stream_writer.write_batch(record_batch) + data_size = mock_sink.size() + buf = client.create(object_id, data_size) Similar to storing an Arrow object, you have to convert the ``memoryview`` object into a ``plasma.FixedSizeBufferOutputStream`` object in order to @@ -653,17 +619,17 @@ the PyArrow ``RecordBatch`` into Plasma as follows: .. code-block:: python - # Write the PyArrow RecordBatch to Plasma - stream = plasma.FixedSizeBufferOutputStream(buf) - stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) - stream_writer.write_batch(record_batch) + # Write the PyArrow RecordBatch to Plasma + stream = pa.FixedSizeBufferOutputStream(buf) + stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) + stream_writer.write_batch(record_batch) Finally, seal the finished object for use by all clients: .. code-block:: python - # Seal the Plasma object - client.seal(object_id) + # Seal the Plasma object + client.seal(object_id) Getting Pandas DataFrames from Plasma ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -677,24 +643,23 @@ into an Arrow ``BufferReader`` object. .. 
code-block:: python - # Fetch the Plasma object - [data] = client.get([object_id]) # Get PlasmaBuffer from ObjectID - buffer = pa.BufferReader(data) # PlasmaBuffer -> Arrow BufferReader + # Fetch the Plasma object + [data] = client.get([object_id]) # Get PlasmaBuffer from ObjectID + buffer = pa.BufferReader(data) # PlasmaBuffer -> Arrow BufferReader From the ``BufferReader``, we can create a specific ``RecordBatchStreamReader`` in Arrow to reconstruct the stored PyArrow ``RecordBatch`` object. .. code-block:: python - # Convert object back into an Arrow RecordBatch - reader = pa.RecordBatchStreamReader(buffer) # Arrow BufferReader -> Arrow RecordBatchStreamReader - rec_batch = reader.read_next_batch() # Arrow RecordBatchStreamReader -> Arrow RecordBatch + # Convert object back into an Arrow RecordBatch + reader = pa.RecordBatchStreamReader(buffer) # Arrow BufferReader -> Arrow RecordBatchStreamReader + rec_batch = reader.read_next_batch() # Arrow RecordBatchStreamReader -> Arrow RecordBatch The last step is to convert the PyArrow ``RecordBatch`` object back into the original Pandas ``DataFrame`` structure. .. code-block:: python - # Convert back into Pandas - result = rec_batch.to_pandas() # Arrow RecordBatch -> Pandas DataFrame - + # Convert back into Pandas + result = rec_batch.to_pandas() # Arrow RecordBatch -> Pandas DataFrame From bc078ff81df1badbf6a9a008c4b2195eb8e6d869 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 24 Jul 2017 15:40:06 -0700 Subject: [PATCH 08/21] complete installation instructions on macOS --- python/doc/source/plasma.rst | 37 +++++++++++++++++++++++++----------- 1 file changed, 26 insertions(+), 11 deletions(-) diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index ad6beeac1bd..9e1cdb6d05f 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -130,11 +130,11 @@ If you encounter an ImportError when running the above, see `ImportError After I Congratulations! 
Plasma is now set up and you can look at `The Plasma API`_. -Installation on Mac OS X (TODO) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Installation on Mac OS X +^^^^^^^^^^^^^^^^^^^^^^^^ -The following install instructions have been tested for Mac OS X 10.9 -Mavericks. +The following install instructions have been tested for Mac OS X 10.11 +El Capitan. First, install Anaconda as follows. Download the Graphical MacOS @@ -159,8 +159,7 @@ The next step is to install the following dependency packages as below: Plasma also requires the build-essential, curl, unzip, libboost-all-dev, and libjemalloc-dev packages. MacOS should already come with curl, unzip, -and the compilation tools found in build-essential. Ldconfig is not supported -on Mac. +and the compilation tools found in build-essential. Now, install arrow as follows. Open your terminal window and download the arrow package from github with the following commands: @@ -175,7 +174,6 @@ Create a directory for the arrow build: .. code-block:: bash cd arrow/cpp - git checkout plasma-cython mkdir build cd build @@ -188,13 +186,30 @@ make to build Arrow. make sudo make install -TODO: +After installing arrow, you need to install pyarrow with the Plasma client as follows: -* Install Pyarrow -* Verify Pyarrow -* Install Plasma +.. code-block:: bash + cd ~/arrow/python + PYARROW_WITH_PLASMA=1 python setup.py install +Once you've installed pyarrow, you should verify that you are able to +import it when running python in the terminal. Also make sure you can import +the Plasma client library. Make sure to try this from +outside of the ``~/arrow/cpp/src/plasma`` directory, otherwise you may +encounter a ModuleNotFoundError. + +.. code-block:: shell + + $ cd ~ + $ python + Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) + [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux + Type "help", "copyright", "credits" or "license" for more information. + >>> import pyarrow + >>> import pyarrow.plasma + +Congratulations! 
Plasma is now set up and you can look at `The Plasma API`_. Troubleshooting Installation Issues ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ From 5a8433e9fed846b808992dfe695e5b345e03103c Mon Sep 17 00:00:00 2001 From: Crystal Yan Date: Sat, 22 Jul 2017 05:24:55 -0700 Subject: [PATCH 09/21] Plasma C++ tutorial documentation - created a tutorial on C++ Plasma for Starting the Object Store, Creating Clients, Creating Objects, Getting Objects, Transferring to Remote Stores, Querying Status, Releasing Objects, and Shutting Down Clients and Stores. Basically all of the PlasmaClient API. Warning- I could not get C++ running on my machine to verify that any of the code runs properly/works. Please verify all code and tutorial content --- cpp/apidoc/index.md | 1 + cpp/apidoc/tutorials/plasma.md | 547 +++++++++++++++++++++++++++++++++ 2 files changed, 548 insertions(+) create mode 100644 cpp/apidoc/tutorials/plasma.md diff --git a/cpp/apidoc/index.md b/cpp/apidoc/index.md index 8389d16b4aa..ab9bbaa405a 100644 --- a/cpp/apidoc/index.md +++ b/cpp/apidoc/index.md @@ -39,6 +39,7 @@ Table of Contents * How to access [HDFS](HDFS.md) * Tutorials * [Convert a vector of row-wise data into an Arrow table](tutorials/row_wise_conversion.md) + * [Using the Plasma In-Memory Object Store](tutorials/plasma.md) Getting Started --------------- diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md new file mode 100644 index 00000000000..3f9e4287818 --- /dev/null +++ b/cpp/apidoc/tutorials/plasma.md @@ -0,0 +1,547 @@ + + +Using the Plasma In-Memory Object Store +======================================= + +Apache Arrow offers the ability to share your data structures among multiple +processes simultaneously through Plasma, an in-memory object store. + +Plasma object stores can be local, as in being on the same node, or remote. +Plasma can communicate between local and remote object stores to share +objects between nodes as well. + +Like in Apache Arrow, Plasma objects are immutable. 
+ +The following goes over the basics so you can begin using Plasma in your big +data applications. + +Starting up Plasma +------------------ + +Any process trying to access the Plasma object store needs to be set up as a +Plasma client. To start running the Plasma object store so that clients may +connect, you'll need to first locate your plasma directory. + +Most likely, your plasma directory is inside your anaconda installation. If you +have python installed on your machine, you can easily find out the location of +your plasma directory by running the following one-liner from the terminal: + +``` +ubuntu:~$ python -c "import plasma; print(plasma.__path__)" +['/home/ubuntu/anaconda3/lib/python3.6/site-packages/plasma-0.0.1-py3.6-linux-x86_64.egg/plasma'] + +``` + +Change (`cd`) into your plasma directory. To start running the plasma object store, you can +run the `plasma_store` process in the foreground with a terminal command similar +to the one below: + +``` +./plasma_store -m 1000000000 -s /tmp/plasma + +``` + +This command only works from inside the plasma directory where the `plasma_store` +executable is located. This command takes two flags: the -m flag specifies +the size of the object store in bytes, and the -s flag specifies the filepath of +the UNIX domain socket that the store will listen on. + +Therefore, the above command initializes a Plasma store that can use up to 1 GB of memory, and +sets the socket to `/tmp/plasma`. + +The Plasma store will remain available as long as the `plasma_store` process is +running in a terminal window. Messages, such as alerts for disconnecting clients, +may occasionally be printed. To stop running the Plasma store, you can press +`CTRL-C` in the terminal window.
+ +Alternatively, you can run the Plasma store in the background and ignore all +message output with the following terminal command: + +``` +./plasma_store -m 1000000000 -s /tmp/plasma 1> /dev/null 2> /dev/null & + +``` + +The Plasma store will instead run silently in the background without having your +current window hang. To stop running the Plasma store in this case, issue the +below terminal command: + +``` +killall plasma_store + +``` + +Creating a Plasma client +------------------------ + +Now that the Plasma object store is up and running, it's time to make client +processes (such as an instance of your C++ program) connect to it. To use the +Plasma object store as a client, your application should initialize a +`plasma::PlasmaClient` object and tell it to connect to the socket specified when +starting up the Plasma object store. + +``` +#include +#include +#include +#include +using namespace plasma; + +int main(int argc, char** argv) { + // Start up and connect a Plasma client. + PlasmaClient client_; + client_.Connect("/tmp/plasma", "", PLASMA_DEFAULT_RELEASE_DELAY); +} + +``` + +Note that multiple clients can be created within the same process, and +clients can be created among multiple concurrent processes. + + +Object IDs +---------- + +The Plasma object store uses a key-value system for accessing objects stored +in the shared memory. Each object in the Plasma store should be associated +with a unique id. The Object ID then serves as a key for *any* client to fetch +that object from the Plasma store. You can form an ``ObjectID`` object from a +byte string of 20 bytes. + +``` +// Create an Object ID. +ObjectID object_id; +uint8_t* data = object_id.mutable_data(); + +// Write out the byte string 'aaaaaaaaaaaaaaaaaaaa'. +// plasma::kUniqueIDSize = 20 +for (int i = 0; i < kUniqueIDSize; i++) { + data[i] = (uint8_t)'a'; +} + +``` + +Random generation of Object IDs is often good enough to ensure unique ids.
+Alternatively, you can simply create a random Object ID as follows: + +``` +// Randomly generate an Object ID. +ObjectID object_id = ObjectID::from_random(); + +``` + +Now, any connected client that knows the object's Object ID can access the +same object from the Plasma object store. For easy transportation of Object IDs, +you can convert/serialize an Object ID into a binary string and back as +follows: + +``` +// From ObjectID to binary string +std::string id_string = object_id.binary(); + +// From binary string to ObjectID +ObjectID id_object = ObjectID::from_binary(id_string); + +``` + +Creating an Object +------------------ + +Now that you have an Object ID to refer to the object with, you can create an object +to store into Plasma with that Object ID. + +Objects are created in Plasma in two stages. First, they are *created*, at +which point you receive a pointer to the buffer in which the object's data will be +constructed. At this point, the client can still modify the contents +of the data array. + +To create an object for Plasma, you need to create an object id, as well as +give the object's maximum data size in bytes. All metadata for the object +should be passed in at point of creation as well: + +``` +// Create the Plasma object by specifying its size and metadata. +int64_t data_size = 100; +uint8_t metadata[] = {5}; +int64_t metadata_size = sizeof(metadata); +uint8_t* data; +client_.Create(object_id, data_size, metadata, metadata_size, &data); + +``` + +If there is no metadata for the object, you should pass in NULL instead: + +``` +// Create a Plasma object without metadata. +int64_t data_size = 100; +uint8_t* metadata = NULL; +int64_t metadata_size = 0; +uint8_t* data; +client_.Create(object_id, data_size, metadata, metadata_size, &data); + +``` + +Now that we've obtained the pointer to our object's data, we can +write our data to it: + +``` +// Write the data for the Plasma object.
+for (int64_t i = 0; i < data_size; i++) { + data[i] = static_cast<uint8_t>(i % 4); +} + +``` + +When the client is done, the client *seals* the buffer, making the object +immutable, and making it available to other Plasma clients: + +``` +// Seal the object. This makes it available for all clients. +client_.Seal(object_id); + +``` + +To verify that an object exists in the Plasma object store, you can +call `PlasmaClient::Contains()` to check if an object has +been created and sealed for a given Object ID. Note that this function +will still report `false` if the object has been created, but not yet +sealed: + +``` +// Check if an object has been created and sealed. +bool has_object; +client_.Contains(object_id, &has_object); + +``` + +Getting an Object +----------------- + +After an object has been sealed, any client who knows the Object ID can get +the object. To store the retrieved object contents, you should create an +`ObjectBuffer`, then call `PlasmaClient::Get()` as follows: + +``` +// Get from the Plasma store by Object ID. +ObjectBuffer object_buffer; +client_.Get(&object_id, 1, -1, &object_buffer); + +``` + +`PlasmaClient::Get()` isn't limited to fetching a single object +from the Plasma store at once. You can specify an array of Object IDs and +`ObjectBuffers` to fetch at once, so long as you also specify the +number of objects being fetched: + +``` +// Get two objects at once from the Plasma store. This function +// call will block until both objects have been fetched. +ObjectBuffer multiple_buffers[2]; +ObjectID multiple_ids[2] = {object_id1, object_id2}; +int64_t number_of_objects = 2; +client_.Get(multiple_ids, number_of_objects, -1, multiple_buffers); + +``` + +Since `PlasmaClient::Get()` is a blocking function call, it may be +necessary to limit the amount of time the function is allowed to take +when trying to fetch from the Plasma store.
You can pass in a timeout +in milliseconds when calling `PlasmaClient::Get().` To use `PlasmaClient::Get()` +without a timeout, just pass in -1 like in the previous example calls: + +``` +// Make the function call give up fetching the object if it takes +// more than 100 milliseconds. +int64_t timeout = 100; +client_.Get(object_id, 1, timeout, object_buffer); + +``` + +Finally, to reconstruct the object, you can access the `data` and +`metadata` attributes of the `ObjectBuffer.` The `data` can be indexed +like any array: + +``` +// Reconstruct object data +uint8_t* retrieved_data = object_buffer.data; +uint8_t retrieved_data_length = object_buffer.data_size; + +// Reconstruct object metadata +uint8_t* retrieved_metadata = object_buffer.metadata; +uint8_t retrieved_metadata_length = object_buffer.metadata_size; + +// Index into data array +uint8_t first_data_byte = retrieved_data[0]; + +``` + +Working with Remote Plasma Stores +--------------------------------- + +So far, we've worked with making our client store and get from the +local Plasma store instance. This is enough if we want to share our +data among processes on the same node/machine. However, if we want +to share data across networks, we'll have to expand our API a little. + +* ** Transfer Objects to a Remote Plasma Instance** + + If we know the IP address and port of a remote Plasma manager, we can + transfer a local object over to the remote Plasma store as follows: + + ``` + // Transferring an object to a remote Plasma manager. 
+ const char* addr = "192.168.0.25"; // Dummy value + int port = 50108; // Dummy value + client_.Transfer(addr, port, &object_id); + + ``` + +* ** Fetching Objects from Remote Plasma Stores** + + If we know their Object IDs, we can attempt to fetch objects from remote + Plasma managers into our local Plasma store by calling `PlasmaClient::Fetch().` + This method is safe in that it is non-blocking, checks if the object is in the + local object store already, and can be called multiple times without side effects. + + ``` + // Fetching an object from remote Plasma managers. + int number_of_ids = 5; + ObjectID obj_ids[5] = {obj_id1, obj_id2, obj_id3, obj_id4, obj_id5}; + client_.Fetch(number_of_ids, obj_ids); + + ``` + + Of course, since `PlasmaClient::Fetch()` is non-blocking, the objects won't + necessarily be ready right after you call the function. This is where the next + section of this tutorial comes in. + + +Querying Status from Plasma +--------------------------- + +The power of Plasma is that we are able to share our data structures +between different processes and even different nodes. However, it may +be difficult for your process to know what is going with the other processes, +have objects been stored into Plasma yet, etc. + +Plasma provides the following API to query the status of objects and to +coordinate among different Plasma clients. + +* **Object Location and Status** + + You can find out the current status of an object in the Plasma store by + querying using its Object ID. From the status, you can find out if the + object doesn't exist, if the object is in a local vs. 
a remote Plasma + store, and if the object is in the middle of being transferred: + + ``` + // Query the object's status + int object_status; + client_.Info(object_id, &object_status); + + switch(object_status) { + case PLASMA_CLIENT_LOCAL : + // Object is in a local Plasma store + break; + case PLASMA_CLIENT_TRANSFER : + // Object is being transferred + break; + case PLASMA_CLIENT_REMOTE : + // Object is in a remote Plasma store + break; + case PLASMA_CLIENT_DOES_NOT_EXIST : + // Object does not exist in the system + break; + } + + ``` +* **Sealed Object Notifications** + + Additionally, you can arrange Plasma to notify you when objects are + sealed in the object store. This may especially be handy when your + program is collaborating with other Plasma clients, and needs to know + when they make objects available. + + First, you can subscribe your current Plasma client to such notifications + by getting a file descriptor: + + ``` + // Start receiving notifications into file_descriptor. + int file_descriptor; + client_.Subscribe(&fd); + + ``` + + Once you have the file descriptor, you can have your current Plasma client + wait to receive the next object notification. Object notifications + include information such as Object ID, data size, and metadata size of + the next newly available object: + + ``` + // Receive notification of the next newly available object. + // Notification information is stored in new_object_id, new_data_size, and new_metadata_size + ObjectID new_object_id; + int64_t new_data_size; + int64_t new_metadata_size; + client_.GetNotification(file_descriptor, &new_object_id, &new_data_size, &new_metadata_size); + + // Fetch the newly available object. 
+ ObjectBuffer object_buffer; + client_.Get(&new_object_id, 1, -1, &object_buffer); + + ``` + +* **Waiting for Objects to be Ready** + + If your program already has the Object IDs from other clients that it wants to + process (whether they be in a local or remote store), however said objects have + yet to be sealed, you can instead call `PlasmaClient::Wait()` to block your program's + control flow until the objects have been sealed. + + For each object desired, you have to form an `ObjectRequest` from its Object ID + as follows: + + ``` + // Request the objects by Object ID by forming ObjectRequests + ObjectRequest obj1; + obj1.object_id = obj_ID_1; + obj1.type = PLASMA_QUERY_ANYWHERE; + + ``` + + You can specify an `ObjectRequest` to wait for an object anywhere, or to + wait for an object from its local object store. The latter would be + created instead as follows: + + ``` + ObjectRequest obj2; + obj2.object_id = obj_ID_2; + obj2.type = PLASMA_QUERY_LOCAL; + + ``` + + You can also form an `ObjectRequest` to wait for any object in general, and + not for a particular Object ID as follows: + + ``` + ObjectRequest obj3; + obj3.object_id = ID_NIL; + obj3.type = PLASMA_QUERY_ANYWHERE; + + ``` + + Once you have formed your `ObjectRequests,` you can call `PlasmaClient.Wait()`: + + ``` + ObjectRequest requests[3] = {obj1, obj2, obj3}; + + // Block until 2 of 3 desired objects become available. + int64_t num_of_desired_objects = 3; + int64_t num_of_objects_min = 2; + + // Where to return how many objects did successfully become available. + int64_t num_of_objects_satisfied; + + client_.Wait(num_of_desired_objects, requests, num_of_objects_min, -1, &num_of_objects_satisfied); + + ``` + +Finish Using Objects in Plasma +------------------------------ + +* **Releasing Objects from Get** + + Once your client is done with using an object in the Plasma store, you should + call `PlasmaClient::Release()` to notify Plasma. 
`PlasmaClient::Release()` + should be called once for every call made to `PlasmaClient::Get()` for this + specific Object ID. Note that after calling this function, the address + returned by `PlasmaClient::Get()` will no longer be valid. + + ``` + // Free the fetched object from the client. + client_.Release(object_id); + + ``` + +* **Delete Objects from the Plasma Store** + + You can also choose to delete an object from the Plasma object store entirely. + This should only be done for objects that are present and sealed: + + ``` + // Verify object is present and sealed first + bool has_object; + client_.Contains(object_id, &has_object); + + if (has_object) { + // Delete object by Object ID + client_.Delete(object_id); + } + + ``` + +* **Clearing Memory from the Plasma Store** + + Occasionally, the Plasma store may become too full if not allocated enough + memory, and creating new objects in Plasma will fail. You can check if + the Plasma store is full by checking the `arrow::Status` that `PlasmaClient::Create` + returns. + + If the Plasma store is too full, you can force Plasma to try to clear up + a given amount of memory (in bytes) by asking it to delete objects that + haven't been used in a while: + + ``` + // Attempt to create a new object + int64_t data_size2 = 100; + uint8_t* metadata2 = NULL; + int64_t metadata_size2 = 0; + uint8_t* data2; + Status returnStatus = client_.Create(object_id2, data_size2, metadata2, metadata_size2, &data2); + + // If Plasma is too full, evict to make more room + if (returnStatus.IsPlasmaStoreFull()) { + num_bytes = data_size2 + metadata_size2; + int64_t bytes_successfully_evicted; + client_.Evict(num_bytes, &bytes_successfully_evicted); + } + + ``` + +Shutting Down Plasma +-------------------- + +* **Disconnecting the Client from the Local Plasma Store** + + Once your program finishes using the Plasma object store, you should disconnect + your client as follows: + + ``` + // Disconnect the client from the Plasma store's socket. 
+ client_.Disconnect(); + + ``` + +* **Shut Down the Plasma Object Store** + + Finally, to shut down the Plasma object store itself, you can terminate the + `plasma_store` process from within your C++ program as follows: + + ``` + // Shut down the Plasma object store. + system("killall plasma_store &"); + + ``` + From caac479166b6f926456acefce841dba3d34b6562 Mon Sep 17 00:00:00 2001 From: Crystal Yan Date: Sat, 22 Jul 2017 05:40:46 -0700 Subject: [PATCH 10/21] Plasma C++ tutorial documentation - minor formatting fixes --- cpp/apidoc/tutorials/plasma.md | 30 ++++++++++++++++++++++-------- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md index 3f9e4287818..c3238a89753 100644 --- a/cpp/apidoc/tutorials/plasma.md +++ b/cpp/apidoc/tutorials/plasma.md @@ -44,7 +44,7 @@ ubuntu:~$ python -c "import plasma; print(plasma.__path__)" ``` -Cd into your plasma directory. To start running the plasma object store, you can +`cd` into your plasma directory. To start running the plasma object store, you can run the `plasma_store` process in the foreground with a terminal command similar to below: @@ -54,8 +54,8 @@ to below: ``` This command only works from inside the plasma directory where the `plasma_store` -executable is located. This command takes in two flags-- the -m flag specifies -the size of the object store in bytes, and the -s flag specifies the filepath of +executable is located. This command takes in two flags-- the `-m` flag specifies +the size of the object store in bytes, and the `-s` flag specifies the filepath of the UNIX domain socket that the store will listen at. Therefore, the above command initializes a Plasma store up to 1 GB of memory, and @@ -197,7 +197,7 @@ Now that we've specified the pointer to our object's data, we can write our data to it: ``` -// Write the data for the Plasma object. +// Write some data for the Plasma object. Writes the values '012301230123...' 
for (int64_t i = 0; i < data_size; i++) { data[i] = static_cast<uint8_t>(i % 4); } @@ -223,6 +223,9 @@ sealed: // Check if an object has been created and sealed. bool has_object; client_.Contains(object_id, &has_object); +if (has_object) { + // Object has been created and sealed, proceed +} ``` @@ -265,7 +268,7 @@ without a timeout, just pass in -1 like in the previous example calls: // Make the function call give up fetching the object if it takes // more than 100 milliseconds. int64_t timeout = 100; -client_.Get(object_id, 1, timeout, object_buffer); +client_.Get(&object_id, 1, timeout, &object_buffer); ``` @@ -295,7 +298,7 @@ local Plasma store instance. This is enough if we want to share our data among processes on the same node/machine. However, if we want to share data across networks, we'll have to expand our API a little. -* ** Transfer Objects to a Remote Plasma Instance** +* **Transfer Objects to a Remote Plasma Instance** If we know the IP address and port of a remote Plasma manager, we can transfer a local object over to the remote Plasma store as follows: @@ -308,7 +311,7 @@ to share data across networks, we'll have to expand our API a little. ``` -* ** Fetching Objects from Remote Plasma Stores** +* **Fetching Objects from Remote Plasma Stores** If we know their Object IDs, we can attempt to fetch objects from remote Plasma managers into our local Plasma store by calling `PlasmaClient::Fetch().` @@ -458,6 +461,17 @@ coordinate among different Plasma clients. ``` + Similar to `PlasmaClient.Get()`, since `PlasmaClient.Wait()` is a blocking function + call, you can specify a timeout in milliseconds for when the function should + return regardless of success. Otherwise, pass in -1 to have no timeout: + + ``` + // Wait. Timeout if it takes more than 100 milliseconds. 
+ int64_t timeout = 100; + client_.Wait(num_of_desired_objects, requests, num_of_objects_min, timeout, &num_of_objects_satisfied); + + ``` + Finish Using Objects in Plasma ------------------------------ @@ -534,7 +548,7 @@ Shutting Down Plasma ``` -* **Shut Down the Plasma Object Store** +* **Shut Down the Plasma Object Store** Finally, to shut down the Plasma object store itself, you can terminate the `plasma_store` process from within your C++ program as follows: From 9a8437c9c5c403e173e037e8a582537723a2a992 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 24 Jul 2017 22:00:59 -0700 Subject: [PATCH 11/21] edit the C++ tutorial (work in progress) --- cpp/apidoc/tutorials/plasma.md | 370 +++++++++++++++------------------ 1 file changed, 170 insertions(+), 200 deletions(-) diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md index c3238a89753..3ac2be3c77e 100644 --- a/cpp/apidoc/tutorials/plasma.md +++ b/cpp/apidoc/tutorials/plasma.md @@ -12,139 +12,115 @@ limitations under the License. See accompanying LICENSE file. --> -Using the Plasma In-Memory Object Store -======================================= +Using the Plasma In-Memory Object Store from C++ +================================================ -Apache Arrow offers the ability to share your data structures among multiple +Apache Arrow offers the ability to share your data structures among multiple processes simultaneously through Plasma, an in-memory object store. -Plasma object stores can be local, as in being on the same node, or remote. -Plasma can communicate between local and remote object stores to share +Plasma object stores can be local, as in being on the same node, or remote. +Plasma can communicate between local and remote object stores to share objects between nodes as well. -Like in Apache Arrow, Plasma objects are immutable. +Like in Apache Arrow, Plasma objects are immutable. 
-The following goes over the basics so you can begin using Plasma in your big +The following goes over the basics so you can begin using Plasma in your big data applications. -Starting up Plasma ------------------- - -Any process trying to access the Plasma object store needs to be set up as a -Plasma client. To start running the Plasma object store so that clients may -connect, you'll need to first locate your plasma directory. - -Most likely, your plasma directory is inside your anaconda installation. If you -have python installed in your machine, you can easily find out the location of -your plasma directory by running the following one-liner from the terminal: - -``` -ubuntu:~$ python -c "import plasma; print(plasma.__path__)" -['/home/ubuntu/anaconda3/lib/python3.6/site-packages/plasma-0.0.1-py3.6-linux-x86_64.egg/plasma'] - -``` +Starting the Plasma store +------------------------- -`cd` into your plasma directory. To start running the plasma object store, you can -run the `plasma_store` process in the foreground with a terminal command similar -to below: +To start running the Plasma object store so that clients may +connect and access the data, run the following command: ``` -./plasma_store -m 1000000000 -s /tmp/plasma - +plasma_store -m 1000000000 -s /tmp/plasma ``` -This command only works from inside the plasma directory where the `plasma_store` -executable is located. This command takes in two flags-- the `-m` flag specifies -the size of the object store in bytes, and the `-s` flag specifies the filepath of +This command takes in two flags -- the `-m` flag specifies +the size of the object store in bytes, and the `-s` flag specifies the path of the UNIX domain socket that the store will listen at. 
-Therefore, the above command initializes a Plasma store up to 1 GB of memory, and +Therefore, the above command initializes a Plasma store up to 1 GB of memory, and sets the socket to `/tmp/plasma.` -The Plasma store will remain available as long as the `plasma_store` process is +The Plasma store will remain available as long as the `plasma_store` process is running in a terminal window. Messages, such as alerts for disconnecting clients, may occasionally be outputted. To stop running the Plasma store, you can press `CTRL-C` in the terminal window. -Alternatively, you can run the Plasma store in the background and ignore all +Alternatively, you can run the Plasma store in the background and ignore all message output with the following terminal command: ``` -./plasma_store -m 1000000000 -s /tmp/plasma 1> /dev/null 2> /dev/null & - +plasma_store -m 1000000000 -s /tmp/plasma 1> /dev/null 2> /dev/null & ``` -The Plasma store will instead run silently in the background without having your -current window hang. To stop running the Plasma store in this case, issue the -below terminal command: +The Plasma store will instead run silently in the background. To stop running the Plasma store in this case, issue the below terminal command: ``` killall plasma_store & - ``` Creating a Plasma client ------------------------ -Now that the Plasma object store is up and running, it's time to make client -processes (such as an instance of your C++ program) connect to it. To use the -Plasma object store as a client, your application should initialize a -`plasma::PlasmaClient` object and tell it to connect to socket specified when +Now that the Plasma object store is up and running, it is time to make client +processes (such as an instance of your C++ program) connect to it. To use the +Plasma object store as a client, your application should initialize a +`plasma::PlasmaClient` object and tell it to connect to socket specified when starting up the Plasma object store. 
``` #include -#include -#include -#include + using namespace plasma; int main(int argc, char** argv) { - // Start up and connect a Plasma client. - PlasmaClient client_; - client_.Connect("/tmp/plasma", "", PLASMA_DEFAULT_RELEASE_DELAY); + // Start up and connect a Plasma client. + PlasmaClient client; + ARROW_CHECK_OK(client.Connect("/tmp/plasma", "", PLASMA_DEFAULT_RELEASE_DELAY)); + // Disconnect the Plasma client. + ARROW_CHECK_OK(client.Disconnect()); } - ``` -Note that multiple clients can be created within the same process, and -clients can be created among multiple concurrent processes. +Save this program in a file `test.cc` and compile it with +``` +g++ test.cc `pkg-config --cflags --libs plasma` --std=c++11 +``` -Object IDs ----------- +Note that multiple clients can be created within the same process, and +clients can be created among multiple concurrent processes. -The Plasma object store uses a key-value system for accessing objects stored -in the shared memory. Each object in the Plasma store should be associated -with a unique id. The Object ID then serves as a key for *any* client to fetch -that object from the Plasma store. You can form an ``ObjectID`` object from a -byte string of 20 bytes. +If the Plasma store is still running, you can now execute the `a.out` executable +and the store will print something like ``` -// Create an Object ID. -ObjectID object_id; -uint8_t* data = id.mutable_data(); +Disconnecting client on fd 5 +``` -// Write out the byte string 'aaaaaaaaaaaaaaaaaaaa'. -// plasma::kUniqueIDSize = 20 -for (int i = 0; i < kUniqueIDSize; i++) { - data[i] = (uint8_t)'a'; -} +which shows that the client was successfully disconnected. -``` +Object IDs +---------- + +The Plasma object store uses SHA-1 identifiers for accessing objects stored +in shared memory. Each object in the Plasma store should be associated +with a unique id. The Object ID then serves as a key for *any* client to fetch +that object from the Plasma store. 
-Random generation of Object IDs is often good enough to ensure unique ids. -Alternatively, you can simply create a random Object ID as follows: +Random generation of Object IDs is often good enough to ensure unique ids: ``` // Randomly generate an Object ID. ObjectID object_id = ObjectID::from_random(); - ``` -Now, any connected client that knows the object's Object ID can access the -same object from the Plasma object store. For easy transportation of Object IDs, -you can convert/serialize an Object ID into a binary string and back as +Now, any connected client that knows the object's Object ID can access the +same object from the Plasma object store. For easy transportation of Object IDs, +you can convert/serialize an Object ID into a binary string and back as follows: ``` @@ -153,22 +129,43 @@ std:string id_string = object_id.binary(); // From binary string to ObjectID ObjectID id_object = ObjectID::from_binary(&id_string); +``` + +You can also get a human readable representation of ObjectIDs in the same +format that git uses for commit hashes by running `ObjectID::hex`. + +Here is a test program you can run: + +``` +#include +#include +#include +using namespace plasma; + +int main(int argc, char** argv) { + ObjectID object_id1 = ObjectID::from_random(); + std::cout << "object_id1 is " << object_id1.hex() << std::endl; + + std::string id_string = object_id1.binary(); + ObjectID object_id2 = ObjectID::from_binary(id_string); + std::cout << "object_id2 is " << object_id2.hex() << std::endl; +} ``` Creating an Object ------------------ -Now that you have an Object ID to refer with, you can now create an object -to store into Plasma with that Object ID. +Now that you learned about Object IDs that are used to refer to objects, +let's look into how objects can be stored in Plasma. -Objects are created in Plasma in two stages. First, they are *created*, in -which you specify a pointer for which the object's data and contents will be -constructed from. 
At this point, the client can still modify the contents +of the data array. -To create an object for Plasma, you need to create an object id, as well as -give the object's maximum data size in bytes. All metadata for the object +To create an object for Plasma, you need to create an object id, as well as +give the object's maximum data size in bytes. All metadata for the object should be passed in at point of creation as well: ``` @@ -177,8 +174,7 @@ int64_t data_size = 100; uint8_t metadata[] = {5}; int64_t metadata_size = sizeof(metadata); uint8_t* data; -client_.Create(object_id, data_size, metadata, metadata_size, &data); - +ARROW_CHECK_OK(client.Create(object_id, data_size, metadata, metadata_size, &data)); ``` If there is no metadata for the object, you should pass in NULL instead: @@ -186,94 +182,85 @@ If there is no metadata for the object, you should pass in NULL instead: ``` // Create a Plasma object without metadata. int64_t data_size = 100; -uint8_t* metadata = NULL; -int64_t metadata_size = 0; uint8_t* data; -client_.Create(object_id, data_size, metadata, metadata_size, &data); - +client.Create(object_id, data_size, NULL, 0, &data); ``` -Now that we've specified the pointer to our object's data, we can +Now that we've specified the pointer to our object's data, we can write our data to it: ``` -// Write some data for the Plasma object. Writes the values '012301230123...' +// Write some data for the Plasma object. for (int64_t i = 0; i < data_size; i++) { data[i] = static_cast<uint8_t>(i % 4); } - ``` -When the client is done, the client *seals* the buffer, making the object +When the client is done, the client *seals* the buffer, making the object immutable, and making it available to other Plasma clients: ``` // Seal the object. 
This makes it available for all clients. -client_.Seal(object_id); - +client.Seal(object_id); ``` -To verify that an object exists in the Plasma object store, you can -call `PlasmaClient::Contains()` to check if an object has -been created and sealed for a given Object ID. Note that this function -will still return False if the object has been created, but not yet +To verify that an object exists in the Plasma object store, you can +call `PlasmaClient::Contains()` to check if an object has +been created and sealed for a given Object ID. Note that this function +will still return False if the object has been created, but not yet sealed: ``` // Check if an object has been created and sealed. bool has_object; -client_.Contains(object_id, &has_object); +client.Contains(object_id, &has_object); if (has_object) { // Object has been created and sealed, proceed } - ``` Getting an Object ----------------- -After an object has been sealed, any client who knows the Object ID can get -the object. To store the retrieved object contents, you should create an +After an object has been sealed, any client who knows the Object ID can get +the object. To store the retrieved object contents, you should create an `ObjectBuffer,` then call `PlasmaClient::Get()` as follows: ``` // Get from the Plasma store by Object ID. ObjectBuffer object_buffer; -client_.Get(&object_id, 1, -1, &object_buffer); - +client.Get(&object_id, 1, -1, &object_buffer); ``` -`PlasmaClient::Get()` isn't limited to fetching a single object -from the Plasma store at once. You can specify an array of Object IDs and -`ObjectBuffers` to fetch at once, so long as you also specify the +`PlasmaClient::Get()` isn't limited to fetching a single object +from the Plasma store at once. You can specify an array of Object IDs and +`ObjectBuffers` to fetch at once, so long as you also specify the number of objects being fetched: ``` -// Get two objects at once from the Plasma store. 
This function +// Get two objects at once from the Plasma store. This function // call will block until both objects have been fetched. ObjectBuffer multiple_buffers[2]; ObjectID multiple_ids[2] = {object_id1, object_id2}; int64_t number_of_objects = 2; -client_.Get(multiple_ids, number_of_objects, -1, multiple_buffers); - +client.Get(multiple_ids, number_of_objects, -1, multiple_buffers); ``` -Since `PlasmaClient::Get()` is a blocking function call, it may be +Since `PlasmaClient::Get()` is a blocking function call, it may be necessary to limit the amount of time the function is allowed to take -when trying to fetch from the Plasma store. You can pass in a timeout -in milliseconds when calling `PlasmaClient::Get().` To use `PlasmaClient::Get()` +when trying to fetch from the Plasma store. You can pass in a timeout +in milliseconds when calling `PlasmaClient::Get().` To use `PlasmaClient::Get()` without a timeout, just pass in -1 like in the previous example calls: ``` -// Make the function call give up fetching the object if it takes +// Make the function call give up fetching the object if it takes // more than 100 milliseconds. int64_t timeout = 100; -client_.Get(&object_id, 1, timeout, &object_buffer); - +client.Get(&object_id, 1, timeout, &object_buffer); ``` -Finally, to reconstruct the object, you can access the `data` and -`metadata` attributes of the `ObjectBuffer.` The `data` can be indexed +Finally, to reconstruct the object, you can access the `data` and +`metadata` attributes of the `ObjectBuffer.` The `data` can be indexed like any array: ``` @@ -287,15 +274,14 @@ uint8_t retrieved_metadata_length = object_buffer.metadata_size; // Index into data array uint8_t first_data_byte = retrieved_data[0]; - ``` Working with Remote Plasma Stores --------------------------------- -So far, we've worked with making our client store and get from the -local Plasma store instance. This is enough if we want to share our -data among processes on the same node/machine. 
However, if we want +So far, we've worked with making our client store and get from the +local Plasma store instance. This is enough if we want to share our +data among processes on the same node/machine. However, if we want to share data across networks, we'll have to expand our API a little. * **Transfer Objects to a Remote Plasma Instance** @@ -307,52 +293,50 @@ to share data across networks, we'll have to expand our API a little. // Transferring an object to a remote Plasma manager. const char* addr = "192.168.0.25"; // Dummy value int port = 50108; // Dummy value - client_.Transfer(addr, port, &object_id); - + client.Transfer(addr, port, &object_id); ``` * **Fetching Objects from Remote Plasma Stores** - - If we know their Object IDs, we can attempt to fetch objects from remote - Plasma managers into our local Plasma store by calling `PlasmaClient::Fetch().` - This method is safe in that it is non-blocking, checks if the object is in the + + If we know their Object IDs, we can attempt to fetch objects from remote + Plasma managers into our local Plasma store by calling `PlasmaClient::Fetch().` + This method is safe in that it is non-blocking, checks if the object is in the local object store already, and can be called multiple times without side effects. ``` // Fetching an object from remote Plasma managers. int number_of_ids = 5; ObjectID obj_ids[5] = {obj_id1, obj_id2, obj_id3, obj_id4, obj_id5}; - client_.Fetch(number_of_ids, obj_ids); - + client.Fetch(number_of_ids, obj_ids); ``` - Of course, since `PlasmaClient::Fetch()` is non-blocking, the objects won't - necessarily be ready right after you call the function. This is where the next + Of course, since `PlasmaClient::Fetch()` is non-blocking, the objects won't + necessarily be ready right after you call the function. This is where the next section of this tutorial comes in. 
Querying Status from Plasma --------------------------- -The power of Plasma is that we are able to share our data structures -between different processes and even different nodes. However, it may -be difficult for your process to know what is going with the other processes, -have objects been stored into Plasma yet, etc. +The power of Plasma is that we are able to share our data structures +between different processes and even different nodes. However, it may +be difficult for your process to know what is going with the other processes, +have objects been stored into Plasma yet, etc. -Plasma provides the following API to query the status of objects and to +Plasma provides the following API to query the status of objects and to coordinate among different Plasma clients. * **Object Location and Status** - You can find out the current status of an object in the Plasma store by - querying using its Object ID. From the status, you can find out if the - object doesn't exist, if the object is in a local vs. a remote Plasma + You can find out the current status of an object in the Plasma store by + querying using its Object ID. From the status, you can find out if the + object doesn't exist, if the object is in a local vs. a remote Plasma store, and if the object is in the middle of being transferred: ``` // Query the object's status int object_status; - client_.Info(object_id, &object_status); + client.Info(object_id, &object_status); switch(object_status) { case PLASMA_CLIENT_LOCAL : @@ -368,81 +352,75 @@ coordinate among different Plasma clients. // Object does not exist in the system break; } - ``` * **Sealed Object Notifications** - Additionally, you can arrange Plasma to notify you when objects are - sealed in the object store. This may especially be handy when your - program is collaborating with other Plasma clients, and needs to know + Additionally, you can arrange Plasma to notify you when objects are + sealed in the object store. 
This may especially be handy when your + program is collaborating with other Plasma clients, and needs to know when they make objects available. - First, you can subscribe your current Plasma client to such notifications + First, you can subscribe your current Plasma client to such notifications by getting a file descriptor: ``` // Start receiving notifications into file_descriptor. int file_descriptor; - client_.Subscribe(&fd); - + client.Subscribe(&fd); ``` - Once you have the file descriptor, you can have your current Plasma client - wait to receive the next object notification. Object notifications - include information such as Object ID, data size, and metadata size of + Once you have the file descriptor, you can have your current Plasma client + wait to receive the next object notification. Object notifications + include information such as Object ID, data size, and metadata size of the next newly available object: ``` - // Receive notification of the next newly available object. + // Receive notification of the next newly available object. // Notification information is stored in new_object_id, new_data_size, and new_metadata_size ObjectID new_object_id; int64_t new_data_size; int64_t new_metadata_size; - client_.GetNotification(file_descriptor, &new_object_id, &new_data_size, &new_metadata_size); + client.GetNotification(file_descriptor, &new_object_id, &new_data_size, &new_metadata_size); // Fetch the newly available object. ObjectBuffer object_buffer; - client_.Get(&new_object_id, 1, -1, &object_buffer); - + client.Get(&new_object_id, 1, -1, &object_buffer); ``` * **Waiting for Objects to be Ready** - If your program already has the Object IDs from other clients that it wants to - process (whether they be in a local or remote store), however said objects have - yet to be sealed, you can instead call `PlasmaClient::Wait()` to block your program's - control flow until the objects have been sealed. 
+ If your program already has the Object IDs from other clients that it wants to + process (whether they be in a local or remote store), however said objects have + yet to be sealed, you can instead call `PlasmaClient::Wait()` to block your program's + control flow until the objects have been sealed. - For each object desired, you have to form an `ObjectRequest` from its Object ID + For each object desired, you have to form an `ObjectRequest` from its Object ID as follows: ``` // Request the objects by Object ID by forming ObjectRequests - ObjectRequest obj1; + ObjectRequest obj1; obj1.object_id = obj_ID_1; obj1.type = PLASMA_QUERY_ANYWHERE; - ``` - You can specify an `ObjectRequest` to wait for an object anywhere, or to - wait for an object from its local object store. The latter would be + You can specify an `ObjectRequest` to wait for an object anywhere, or to + wait for an object from its local object store. The latter would be created instead as follows: ``` - ObjectRequest obj2; + ObjectRequest obj2; obj2.object_id = obj_ID_2; obj2.type = PLASMA_QUERY_LOCAL; - ``` - You can also form an `ObjectRequest` to wait for any object in general, and + You can also form an `ObjectRequest` to wait for any object in general, and not for a particular Object ID as follows: ``` - ObjectRequest obj3; + ObjectRequest obj3; obj3.object_id = ID_NIL; obj3.type = PLASMA_QUERY_ANYWHERE; - ``` Once you have formed your `ObjectRequests,` you can call `PlasmaClient.Wait()`: @@ -457,19 +435,17 @@ coordinate among different Plasma clients. // Where to return how many objects did successfully become available. 
int64_t num_of_objects_satisfied; - client_.Wait(num_of_desired_objects, requests, num_of_objects_min, -1, &num_of_objects_satisfied); - + client.Wait(num_of_desired_objects, requests, num_of_objects_min, -1, &num_of_objects_satisfied); ``` - Similar to `PlasmaClient.Get()`, since `PlasmaClient.Wait()` is a blocking function - call, you can specify a timeout in milliseconds for when the function should + Similar to `PlasmaClient.Get()`, since `PlasmaClient.Wait()` is a blocking function + call, you can specify a timeout in milliseconds for when the function should return regardless of success. Otherwise, pass in -1 to have no timeout: ``` // Wait. Timeout if it takes more than 100 milliseconds. int64_t timeout = 100; - client_.Wait(num_of_desired_objects, requests, num_of_objects_min, timeout, &num_of_objects_satisfied); - + client.Wait(num_of_desired_objects, requests, num_of_objects_min, timeout, &num_of_objects_satisfied); ``` Finish Using Objects in Plasma @@ -477,44 +453,42 @@ Finish Using Objects in Plasma * **Releasing Objects from Get** - Once your client is done with using an object in the Plasma store, you should - call `PlasmaClient::Release()` to notify Plasma. `PlasmaClient::Release()` - should be called once for every call made to `PlasmaClient::Get()` for this - specific Object ID. Note that after calling this function, the address + Once your client is done with using an object in the Plasma store, you should + call `PlasmaClient::Release()` to notify Plasma. `PlasmaClient::Release()` + should be called once for every call made to `PlasmaClient::Get()` for this + specific Object ID. Note that after calling this function, the address returned by `PlasmaClient::Get()` will no longer be valid. ``` // Free the fetched object from the client. - client_.Release(object_id); - + client.Release(object_id); ``` * **Delete Objects from the Plasma Store** - You can also choose to delete an object from the Plasma object store entirely. 
+ You can also choose to delete an object from the Plasma object store entirely. This should only be done for objects that are present and sealed: ``` // Verify object is present and sealed first bool has_object; - client_.Contains(object_id, &has_object); + client.Contains(object_id, &has_object); if (has_object) { // Delete object by Object ID - client_.Delete(object_id); + client.Delete(object_id); } - ``` * **Clearing Memory from the Plasma Store** - Occasionally, the Plasma store may become too full if not allocated enough - memory, and creating new objects in Plasma will fail. You can check if - the Plasma store is full by checking the `arrow::Status` that `PlasmaClient::Create` - returns. + Occasionally, the Plasma store may become too full if not allocated enough + memory, and creating new objects in Plasma will fail. You can check if + the Plasma store is full by checking the `arrow::Status` that `PlasmaClient::Create` + returns. - If the Plasma store is too full, you can force Plasma to try to clear up - a given amount of memory (in bytes) by asking it to delete objects that + If the Plasma store is too full, you can force Plasma to try to clear up + a given amount of memory (in bytes) by asking it to delete objects that haven't been used in a while: ``` @@ -523,15 +497,14 @@ Finish Using Objects in Plasma uint8_t* metadata2 = NULL; int64_t metadata_size2 = 0; uint8_t* data2; - Status returnStatus = client_.Create(object_id2, data_size2, metadata2, metadata_size2, &data2); + Status returnStatus = client.Create(object_id2, data_size2, metadata2, metadata_size2, &data2); // If Plasma is too full, evict to make more room if (returnStatus.IsPlasmaStoreFull()) { num_bytes = data_size2 + metadata_size2; int64_t bytes_successfully_evicted; - client_.Evict(num_bytes, &bytes_successfully_evicted); + client.Evict(num_bytes, &bytes_successfully_evicted); } - ``` Shutting Down Plasma @@ -539,23 +512,20 @@ Shutting Down Plasma * **Disconnecting the Client from the 
Local Plasma Store** - Once your program finishes using the Plasma object store, you should disconnect + Once your program finishes using the Plasma object store, you should disconnect your client as follows: ``` // Disconnect the client from the Plasma store's socket. - client_.Disconnect(); - + client.Disconnect(); ``` * **Shut Down the Plasma Object Store** - Finally, to shut down the Plasma object store itself, you can terminate the + Finally, to shut down the Plasma object store itself, you can terminate the `plasma_store` process from within your C++ program as follows: ``` // Shut down the Plasma object store. system("killall plasma_store &"); - ``` - From 84141b6c361693e4e29a1617ebb4f6e7217deae3 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Sat, 29 Jul 2017 18:37:15 -0700 Subject: [PATCH 12/21] update C++ documentation --- cpp/apidoc/tutorials/plasma.md | 446 +++++++++++++-------------------- 1 file changed, 175 insertions(+), 271 deletions(-) diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md index 3ac2be3c77e..89b84c4d0fd 100644 --- a/cpp/apidoc/tutorials/plasma.md +++ b/cpp/apidoc/tutorials/plasma.md @@ -165,25 +165,24 @@ constructed from. At this point, the client can still modify the contents of the data array. To create an object for Plasma, you need to create an object id, as well as -give the object's maximum data size in bytes. All metadata for the object -should be passed in at point of creation as well: +give the object's maximum data size in bytes. ``` -// Create the Plasma object by specifying its size and metadata. +// Create the Plasma object by specifying its size. 
int64_t data_size = 100; -uint8_t metadata[] = {5}; -int64_t metadata_size = sizeof(metadata); uint8_t* data; -ARROW_CHECK_OK(client.Create(object_id, data_size, metadata, metadata_size, &data)); +ARROW_CHECK_OK(client.Create(object_id, data_size, NULL, 0, &data)); ``` -If there is no metadata for the object, you should pass in NULL instead: +You can also specify metadata for the object; the third argument is the +metadata (as raw bytes) and the fourth argument is the size of the metadata. ``` // Create a Plasma object without metadata. int64_t data_size = 100; +std::string metadata = "{'author': 'john'}"; uint8_t* data; -client.Create(object_id, data_size, NULL, 0, &data); +client.Create(object_id, data_size, (uint8_t*) metadata.data(), metadata.size(), &data); ``` Now that we've specified the pointer to our object's data, we can @@ -204,6 +203,39 @@ immutable, and making it available to other Plasma clients: client.Seal(object_id); ``` +Here is an example that combines all these features: + +``` +#include <plasma/client.h> + +using namespace plasma; + +int main(int argc, char** argv) { + // Start up and connect a Plasma client. + PlasmaClient client; + ARROW_CHECK_OK(client.Connect("/tmp/plasma", "", PLASMA_DEFAULT_RELEASE_DELAY)); + // Create an object with a fixed 20-byte ObjectID. + ObjectID object_id = ObjectID::from_binary("00000000000000000000"); + int64_t data_size = 1000; + uint8_t *data; + std::string metadata = "{'author': 'john'}"; + ARROW_CHECK_OK(client.Create(object_id, data_size, (uint8_t*) metadata.data(), metadata.size(), &data)); + // Write some data into the object. + for (int64_t i = 0; i < data_size; i++) { + data[i] = static_cast<uint8_t>(i % 4); + } + // Seal the object. + ARROW_CHECK_OK(client.Seal(object_id)); + // Disconnect the client.
+ ARROW_CHECK_OK(client.Disconnect()); +} +``` + +This example can be compiled with +``` +g++ create.cc `pkg-config --cflags --libs plasma` --std=c++11 -o create +``` + To verify that an object exists in the Plasma object store, you can call `PlasmaClient::Contains()` to check if an object has been created and sealed for a given Object ID. Note that this function @@ -224,7 +256,7 @@ Getting an Object After an object has been sealed, any client who knows the Object ID can get the object. To store the retrieved object contents, you should create an -`ObjectBuffer,` then call `PlasmaClient::Get()` as follows: +`ObjectBuffer`, then call `PlasmaClient::Get()` as follows: ``` // Get from the Plasma store by Object ID. @@ -242,8 +274,7 @@ number of objects being fetched: // call will block until both objects have been fetched. ObjectBuffer multiple_buffers[2]; ObjectID multiple_ids[2] = {object_id1, object_id2}; -int64_t number_of_objects = 2; -client.Get(multiple_ids, number_of_objects, -1, multiple_buffers); +client.Get(multiple_ids, 2, -1, multiple_buffers); ``` Since `PlasmaClient::Get()` is a blocking function call, it may be @@ -259,273 +290,146 @@ int64_t timeout = 100; client.Get(&object_id, 1, timeout, &object_buffer); ``` -Finally, to reconstruct the object, you can access the `data` and -`metadata` attributes of the `ObjectBuffer.` The `data` can be indexed +Finally, to access the object, you can access the `data` and +`metadata` attributes of the `ObjectBuffer`. 
The `data` can be indexed like any array: ``` -// Reconstruct object data -uint8_t* retrieved_data = object_buffer.data; -uint8_t retrieved_data_length = object_buffer.data_size; - -// Reconstruct object metadata -uint8_t* retrieved_metadata = object_buffer.metadata; -uint8_t retrieved_metadata_length = object_buffer.metadata_size; - -// Index into data array -uint8_t first_data_byte = retrieved_data[0]; -``` - -Working with Remote Plasma Stores ---------------------------------- - -So far, we've worked with making our client store and get from the -local Plasma store instance. This is enough if we want to share our -data among processes on the same node/machine. However, if we want -to share data across networks, we'll have to expand our API a little. - -* **Transfer Objects to a Remote Plasma Instance** - - If we know the IP address and port of a remote Plasma manager, we can - transfer a local object over to the remote Plasma store as follows: - - ``` - // Transferring an object to a remote Plasma manager. - const char* addr = "192.168.0.25"; // Dummy value - int port = 50108; // Dummy value - client.Transfer(addr, port, &object_id); - ``` - -* **Fetching Objects from Remote Plasma Stores** - - If we know their Object IDs, we can attempt to fetch objects from remote - Plasma managers into our local Plasma store by calling `PlasmaClient::Fetch().` - This method is safe in that it is non-blocking, checks if the object is in the - local object store already, and can be called multiple times without side effects. - - ``` - // Fetching an object from remote Plasma managers. - int number_of_ids = 5; - ObjectID obj_ids[5] = {obj_id1, obj_id2, obj_id3, obj_id4, obj_id5}; - client.Fetch(number_of_ids, obj_ids); - ``` - - Of course, since `PlasmaClient::Fetch()` is non-blocking, the objects won't - necessarily be ready right after you call the function. This is where the next - section of this tutorial comes in. 
- - -Querying Status from Plasma ---------------------------- - -The power of Plasma is that we are able to share our data structures -between different processes and even different nodes. However, it may -be difficult for your process to know what is going with the other processes, -have objects been stored into Plasma yet, etc. - -Plasma provides the following API to query the status of objects and to -coordinate among different Plasma clients. - -* **Object Location and Status** - - You can find out the current status of an object in the Plasma store by - querying using its Object ID. From the status, you can find out if the - object doesn't exist, if the object is in a local vs. a remote Plasma - store, and if the object is in the middle of being transferred: - - ``` - // Query the object's status - int object_status; - client.Info(object_id, &object_status); - - switch(object_status) { - case PLASMA_CLIENT_LOCAL : - // Object is in a local Plasma store - break; - case PLASMA_CLIENT_TRANSFER : - // Object is being transferred - break; - case PLASMA_CLIENT_REMOTE : - // Object is in a remote Plasma store - break; - case PLASMA_CLIENT_DOES_NOT_EXIST : - // Object does not exist in the system - break; - } - ``` -* **Sealed Object Notifications** - - Additionally, you can arrange Plasma to notify you when objects are - sealed in the object store. This may especially be handy when your - program is collaborating with other Plasma clients, and needs to know - when they make objects available. - - First, you can subscribe your current Plasma client to such notifications - by getting a file descriptor: - - ``` - // Start receiving notifications into file_descriptor. - int file_descriptor; - client.Subscribe(&fd); - ``` - - Once you have the file descriptor, you can have your current Plasma client - wait to receive the next object notification. 
Object notifications - include information such as Object ID, data size, and metadata size of - the next newly available object: - - ``` - // Receive notification of the next newly available object. - // Notification information is stored in new_object_id, new_data_size, and new_metadata_size - ObjectID new_object_id; - int64_t new_data_size; - int64_t new_metadata_size; - client.GetNotification(file_descriptor, &new_object_id, &new_data_size, &new_metadata_size); - - // Fetch the newly available object. - ObjectBuffer object_buffer; - client.Get(&new_object_id, 1, -1, &object_buffer); - ``` - -* **Waiting for Objects to be Ready** - - If your program already has the Object IDs from other clients that it wants to - process (whether they be in a local or remote store), however said objects have - yet to be sealed, you can instead call `PlasmaClient::Wait()` to block your program's - control flow until the objects have been sealed. - - For each object desired, you have to form an `ObjectRequest` from its Object ID - as follows: - - ``` - // Request the objects by Object ID by forming ObjectRequests - ObjectRequest obj1; - obj1.object_id = obj_ID_1; - obj1.type = PLASMA_QUERY_ANYWHERE; - ``` - - You can specify an `ObjectRequest` to wait for an object anywhere, or to - wait for an object from its local object store. The latter would be - created instead as follows: - - ``` - ObjectRequest obj2; - obj2.object_id = obj_ID_2; - obj2.type = PLASMA_QUERY_LOCAL; - ``` - - You can also form an `ObjectRequest` to wait for any object in general, and - not for a particular Object ID as follows: - - ``` - ObjectRequest obj3; - obj3.object_id = ID_NIL; - obj3.type = PLASMA_QUERY_ANYWHERE; - ``` - - Once you have formed your `ObjectRequests,` you can call `PlasmaClient.Wait()`: - - ``` - ObjectRequest requests[3] = {obj1, obj2, obj3}; - - // Block until 2 of 3 desired objects become available. 
- int64_t num_of_desired_objects = 3; - int64_t num_of_objects_min = 2; - - // Where to return how many objects did successfully become available. - int64_t num_of_objects_satisfied; - - client.Wait(num_of_desired_objects, requests, num_of_objects_min, -1, &num_of_objects_satisfied); - ``` - - Similar to `PlasmaClient.Get()`, since `PlasmaClient.Wait()` is a blocking function - call, you can specify a timeout in milliseconds for when the function should - return regardless of success. Otherwise, pass in -1 to have no timeout: - - ``` - // Wait. Timeout if it takes more than 100 milliseconds. - int64_t timeout = 100; - client.Wait(num_of_desired_objects, requests, num_of_objects_min, timeout, &num_of_objects_satisfied); - ``` - -Finish Using Objects in Plasma ------------------------------- - -* **Releasing Objects from Get** - - Once your client is done with using an object in the Plasma store, you should - call `PlasmaClient::Release()` to notify Plasma. `PlasmaClient::Release()` - should be called once for every call made to `PlasmaClient::Get()` for this - specific Object ID. Note that after calling this function, the address - returned by `PlasmaClient::Get()` will no longer be valid. - - ``` - // Free the fetched object from the client. - client.Release(object_id); - ``` - -* **Delete Objects from the Plasma Store** - - You can also choose to delete an object from the Plasma object store entirely. - This should only be done for objects that are present and sealed: - - ``` - // Verify object is present and sealed first - bool has_object; - client.Contains(object_id, &has_object); - - if (has_object) { - // Delete object by Object ID - client.Delete(object_id); - } - ``` - -* **Clearing Memory from the Plasma Store** - - Occasionally, the Plasma store may become too full if not allocated enough - memory, and creating new objects in Plasma will fail. You can check if - the Plasma store is full by checking the `arrow::Status` that `PlasmaClient::Create` - returns. 
- - If the Plasma store is too full, you can force Plasma to try to clear up - a given amount of memory (in bytes) by asking it to delete objects that - haven't been used in a while: - - ``` - // Attempt to create a new object - int64_t data_size2 = 100; - uint8_t* metadata2 = NULL; - int64_t metadata_size2 = 0; - uint8_t* data2; - Status returnStatus = client.Create(object_id2, data_size2, metadata2, metadata_size2, &data2); - - // If Plasma is too full, evict to make more room - if (returnStatus.IsPlasmaStoreFull()) { - num_bytes = data_size2 + metadata_size2; - int64_t bytes_successfully_evicted; - client.Evict(num_bytes, &bytes_successfully_evicted); - } - ``` - -Shutting Down Plasma +// Access object data. +uint8_t* data = object_buffer.data; +int64_t data_size = object_buffer.data_size; + +// Access object metadata. +uint8_t* metadata = object_buffer.metadata; +int64_t metadata_size = object_buffer.metadata_size; + +// Index into data array. +uint8_t first_data_byte = data[0]; +``` + +Here is a longer example that shows these capabilities: + +``` +#include <plasma/client.h> + +using namespace plasma; + +int main(int argc, char** argv) { + // Start up and connect a Plasma client. + PlasmaClient client; + ARROW_CHECK_OK(client.Connect("/tmp/plasma", "", PLASMA_DEFAULT_RELEASE_DELAY)); + ObjectID object_id = ObjectID::from_binary("00000000000000000000"); + ObjectBuffer object_buffer; + ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer)); + + // Retrieve object data. + uint8_t* data = object_buffer.data; + int64_t data_size = object_buffer.data_size; + + // Check that the data agrees with what was written in the other process. + for (int64_t i = 0; i < data_size; i++) { + ARROW_CHECK(data[i] == static_cast<uint8_t>(i % 4)); + } + + // Disconnect the client.
+ ARROW_CHECK_OK(client.Disconnect()); +} +``` + +If you compile it with + +``` +g++ get.cc `pkg-config --cflags --libs plasma` --std=c++11 -o get +``` + +and run it with `./get`, all the assertions will pass if you first ran the `create` +example from above against the same Plasma store. + + +Object Lifetime Management +-------------------------- + +The Plasma store uses reference counting internally to make sure that objects +mapped into the address space of one of the clients with `PlasmaClient::Get` +remain accessible. To unmap objects from a client, call `PlasmaClient::Release`. +All objects that are mapped into a client's address space will automatically +be released when the client is disconnected from the store. + +If a new object is created and there is not enough space in the Plasma store, +the store will evict the least recently used released object. If all objects +are mapped into the address space of some client, the + +Object Notifications -------------------- -* **Disconnecting the Client from the Local Plasma Store** +You can also arrange for Plasma to notify you when objects are +sealed in the object store. This is especially handy when your +program is collaborating with other Plasma clients and needs to know +when they make objects available. + +First, subscribe your current Plasma client to such notifications +by getting a file descriptor: + +``` +// Start receiving notifications on the file descriptor fd. +int fd; +ARROW_CHECK_OK(client.Subscribe(&fd)); +``` + +Once you have the file descriptor, you can have your current Plasma client +wait to receive the next object notification. Object notifications +include information such as Object ID, data size, and metadata size of +the next newly available object: + +``` +// Receive notification of the next newly available object.
+// Notification information is stored in object_id, data_size, and metadata_size +ObjectID object_id; +int64_t data_size; +int64_t metadata_size; +ARROW_CHECK_OK(client.GetNotification(fd, &object_id, &data_size, &metadata_size)); + +// Fetch the newly available object. +ObjectBuffer object_buffer; +ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer)); +``` + +Here is a full program that shows this capability: + +``` +#include <plasma/client.h> - Once your program finishes using the Plasma object store, you should disconnect - your client as follows: +using namespace plasma; - ``` - // Disconnect the client from the Plasma store's socket. - client.Disconnect(); - ``` +int main(int argc, char** argv) { + // Start up and connect a Plasma client. + PlasmaClient client; + ARROW_CHECK_OK(client.Connect("/tmp/plasma", "", PLASMA_DEFAULT_RELEASE_DELAY)); -* **Shut Down the Plasma Object Store** + int fd; + ARROW_CHECK_OK(client.Subscribe(&fd)); - Finally, to shut down the Plasma object store itself, you can terminate the - `plasma_store` process from within your C++ program as follows: + ObjectID object_id; + int64_t data_size; + int64_t metadata_size; + while (true) { + ARROW_CHECK_OK(client.GetNotification(fd, &object_id, &data_size, &metadata_size)); + + std::cout << "Received object notification for object_id = " + << object_id.hex() << ", with data_size = " << data_size + << ", and metadata_size = " << metadata_size << std::endl; + } + + // Disconnect the client. + ARROW_CHECK_OK(client.Disconnect()); +} +``` + +If you compile it with + +``` +g++ subscribe.cc `pkg-config --cflags --libs plasma` --std=c++11 -o subscribe +``` - ``` - // Shut down the Plasma object store. - system("killall plasma_store &"); - ``` +and invoke `./create` and `./subscribe` while the Plasma store is running, +you can observe the notification for the new object arriving.
From 193e00b00c886330392021dc85f51555bc21c19d Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 31 Jul 2017 17:24:41 -0700 Subject: [PATCH 13/21] unify installation instructions --- python/doc/source/development.rst | 22 +- python/doc/source/plasma.rst | 331 ------------------------------ 2 files changed, 18 insertions(+), 335 deletions(-) diff --git a/python/doc/source/development.rst b/python/doc/source/development.rst index 55b3efdad17..8cb7b43e295 100644 --- a/python/doc/source/development.rst +++ b/python/doc/source/development.rst @@ -165,6 +165,18 @@ Now build and install the Arrow C++ libraries: make install popd +If you want to build and install the Plasma in-memory object store too, +replace the cmake command with the following one: + +.. code-block:: shell + + cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ + -DARROW_PYTHON=on \ + -DARROW_PLASMA=on \ + -DARROW_BUILD_TESTS=OFF \ + .. + Now, optionally build and install the Apache Parquet libraries in your toolchain: @@ -190,9 +202,10 @@ Now, build pyarrow: cd arrow/python python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \ - --with-parquet --inplace + --with-parquet --with-plasma --inplace -If you did not build parquet-cpp, you can omit ``--with-parquet``. +If you did not build parquet-cpp, you can omit ``--with-parquet`` and if +you did not build with plasma, you can omit ``--with-plasma``. You should be able to run the unit tests with: @@ -224,9 +237,10 @@ You can build a wheel by running: .. code-block:: shell python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \ - --with-parquet --bundle-arrow-cpp bdist_wheel + --with-parquet --with-plasma --bundle-arrow-cpp bdist_wheel -Again, if you did not build parquet-cpp, you should omit ``--with-parquet``. +Again, if you did not build parquet-cpp, you should omit ``--with-parquet`` and +if you did not build with plasma, you should omit ``--with-plasma``. 
Developing on Windows ===================== diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 9e1cdb6d05f..526f736aff9 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -24,337 +24,6 @@ The Plasma In-Memory Object Store .. contents:: Contents :depth: 3 -Installing Plasma ------------------ - -Installation on Ubuntu -^^^^^^^^^^^^^^^^^^^^^^ - -The following install instructions have been tested for Ubuntu 16.04. - - -First, install Anaconda in your terminal as follows. This will download -the Anaconda Linux installer and run it. Be sure to invoke the installer -with the ``bash`` command, whether or not you are using the Bash shell. - -.. code-block:: bash - - wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh - bash Anaconda3-4.4.0-Linux-x86_64.sh - -.. note:: - - As an alternative to the wget command above, you can also download the - Anaconda installer script through your web browser at their - `Download Webpage here `_. - - -Accept the Anaconda license agreement and follow the prompt. Allow the -installer to prepend the Anaconda location to your PATH. - -Then, either close and reopen your terminal window, or run the following -command, so that the new PATH takes effect: - -.. code-block:: bash - - source ~/.bashrc - -Anaconda should now be installed. For more information on installing -Anaconda, see their `documentation here `_. - - -Next, update your system and install the following dependency packages -as below: - -.. code-block:: bash - - sudo apt-get update - sudo apt-get install -y cmake build-essential autoconf curl libtool libboost-all-dev unzip libjemalloc-dev pkg-config - - -Now, we need to install arrow. These instructions will install everything -to your home directory. First download the arrow package from github: - -.. code-block:: bash - - cd ~ - git clone https://github.com/apache/arrow - -Next, create a build directory as follows: - -.. 
code-block:: bash - - cd arrow/cpp - mkdir build - cd build - -You should now be in the ~/arrow/cpp/build directory. Run cmake and -make to build Arrow. - -.. code-block:: bash - - cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. - make - sudo make install - -.. warning:: - - Running the ``cmake`` command above may give an ``ImportError`` - concerning numpy. If that is the case, see `ImportError when Running Cmake`_. - - -After installing arrow, you need to install pyarrow with the Plasma client as follows: - -.. code-block:: bash - - cd ~/arrow/python - PYARROW_WITH_PLASMA=1 python setup.py install - -Once you've installed pyarrow, you should verify that you are able to -import it when running python in the terminal. Also make sure you can import -the Plasma client library. Make sure to try this from -outside of the ``~/arrow/cpp/src/plasma`` directory, otherwise you may -encounter a ModuleNotFoundError. - -.. code-block:: shell - - ubuntu:~/arrow/cpp/src/plasma$ cd ~ - ubuntu:~/$ python - Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux - Type "help", "copyright", "credits" or "license" for more information. - >>> import pyarrow - >>> import pyarrow.plasma - -If you encounter an ImportError when running the above, see `ImportError After Installing Pyarrow`_. - -Congratulations! Plasma is now set up and you can look at `The Plasma API`_. - -Installation on Mac OS X -^^^^^^^^^^^^^^^^^^^^^^^^ - -The following install instructions have been tested for Mac OS X 10.11 -El Capitan. - - -First, install Anaconda as follows. Download the Graphical MacOS -Installer for your version of Python at the `Anaconda Download Webpage here `_. - -Double-click on the ``.pkg`` file, accept the license agreement, and -follow the step-by-step wizard to install Anaconda. Anaconda will be -installed for the current user's use only, and will require about 1.44 -GB of space. 
- -To verify that Anaconda has been installed, click on the Launchpad and -select Anaconda Navigator. It should open if you have successfully -installed Anaconda. For more information on installing Anaconda, see -their `documentation here `_. - -The next step is to install the following dependency packages as below: - -.. code-block:: bash - - brew update - brew install cmake autoconf libtool pkg-config jemalloc - -Plasma also requires the build-essential, curl, unzip, libboost-all-dev, -and libjemalloc-dev packages. MacOS should already come with curl, unzip, -and the compilation tools found in build-essential. - -Now, install arrow as follows. Open your terminal window and download the -arrow package from github with the following commands: - -.. code-block:: bash - - cd ~ - git clone https://github.com/apache/arrow - -Create a directory for the arrow build: - -.. code-block:: bash - - cd arrow/cpp - mkdir build - cd build - -You should now be in the ~/arrow/cpp/build directory. Run cmake and -make to build Arrow. - -.. code-block:: bash - - cmake -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. - make - sudo make install - -After installing arrow, you need to install pyarrow with the Plasma client as follows: - -.. code-block:: bash - - cd ~/arrow/python - PYARROW_WITH_PLASMA=1 python setup.py install - -Once you've installed pyarrow, you should verify that you are able to -import it when running python in the terminal. Also make sure you can import -the Plasma client library. Make sure to try this from -outside of the ``~/arrow/cpp/src/plasma`` directory, otherwise you may -encounter a ModuleNotFoundError. - -.. code-block:: shell - - $ cd ~ - $ python - Python 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux - Type "help", "copyright", "credits" or "license" for more information. - >>> import pyarrow - >>> import pyarrow.plasma - -Congratulations! 
Plasma is now set up and you can look at `The Plasma API`_. - -Troubleshooting Installation Issues -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -ImportError when Running Cmake ->>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - -While installing arrow, if you run into the following error when running -the ``cmake`` command, there may be an issue with finding numpy. - -.. code-block:: shell - - NumPy import failure: - - Traceback (most recent call last): - - File "", line 1, in - - ImportError: No module named numpy - -First, verify that numpy has been installed alongside anaconda. Running -``conda list`` outputs all the packages that have been installed with -anaconda: - -.. code-block:: shell - - ubuntu:~/arrow/cpp/build$ conda list - numpy 1.12.1 py36_0 - -If something similar to the above numpy line is not listed in the -output, numpy has not yet been installed. - -If numpy has not been installed, try running the following command: - -.. code-block:: bash - - conda install numpy - -If numpy is still not installed, try reinstalling anaconda. - -Second, verify that you are running the python version that comes with -anaconda. ``which`` should point to the python in the newly-installed -Anaconda package: - -.. code-block:: shell - - ubuntu:~/arrow/cpp/build$ which python - /home/ubuntu/anaconda3/bin/python - -If this issue comes up, most likely the anaconda library has not yet -been properly prepended to your PATH and the new PATH reloaded. - -If your machine already has other python versions installed, the Anaconda -python path should precede any other python version path. You can find -the paths to all python versions installed on your machine by running -``whereis python`` in the terminal: - -.. 
code-block:: shell - - ubuntu:~/arrow/cpp/build$ whereis python - python: /usr/bin/python3.5m /usr/bin/python2.7 /usr/bin/python /usr/bin/python2.7-config /usr/bin/python3.5 /usr/lib/python2.7 /usr/lib/python3.5 /etc/python2.7 /etc/python /etc/python3.5 /usr/local/lib/python2.7 /usr/local/lib/python3.5 /usr/include/python2.7 /usr/share/python /home/ubuntu/anaconda3/bin/python3.6m-config /home/ubuntu/anaconda3/bin/python3.6m /home/ubuntu/anaconda3/bin/python3.6 /home/ubuntu/anaconda3/bin/python3.6-config /home/ubuntu/anaconda3/bin/python /usr/share/man/man1/python.1.gz - -Anaconda usually modifies your ``~/.bashrc`` file in its installation. -You may need to manually add the following line or similar to the bottom -of your ``~/.bashrc`` file, then reload your terminal window: - -.. code-block:: bash - - # added by Anaconda3 4.4.0 installer - export PATH="/home/ubuntu/anaconda3/bin:$PATH" - -You can also create a persistent ``python`` shell alias to point to your -Anaconda python version by adding to following to the bottom of your -``~/.bashrc`` file: - -.. code-block:: bash - - alias python=/home/ubuntu/anaconda3/bin/python - -At this point, if you no longer have any issues with your anaconda -installation or with your python version, you should be able to run Python -in the terminal and import numpy with no errors: - -.. code-block:: shell - - ubuntu:~/arrow/cpp/build$ python - Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) - [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux - Type "help", "copyright", "credits" or "license" for more information. - >>> import numpy - >>> - -Finally, if you are confident that numpy has been installed and that you are -using Anaconda's version of python, cmake may be looking for python and -finding the wrong version (not Anaconda's version of python). 
Run the following -command instead (setting the ``FILEPATH`` to the path of your Anaconda python -version) to force ``cmake`` to use the correct python version: - -.. code-block:: bash - - cmake -DPYTHON_EXECUTABLE:FILEPATH=/home/ubuntu/anaconda3/bin/python -DARROW_PYTHON=on -DARROW_PLASMA=on -DARROW_BUILD_TESTS=off .. - -You may now proceed with the rest of the arrow installation. - - -ImportError After Installing Pyarrow ->>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - -You may encounter the following error output when trying to ``import pyarrow`` -inside Python: - -.. code-block:: shell - - >>> import pyarrow - Traceback (most recent call last): - File "", line 1, in - File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in - from pyarrow.lib import cpu_count, set_cpu_count - ImportError: libarrow.so.0: cannot open shared object file: No such file or directory - -If this is the case, after you have built Arrow, try running the following line -again in the terminal to remove this ImportError: - -.. code-block:: bash - - sudo ldconfig - -You may also encounter the following error output when trying to ``import pyarrow`` -inside Python: - -.. code-block:: shell - - >>> import pyarrow - Traceback (most recent call last): - File "", line 1, in - File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/__init__.py", line 28, in - from pyarrow.lib import cpu_count, set_cpu_count - ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/anaconda3/lib/python3.6/site-packages/pyarrow-0.1.1.dev625+ge08c220-py3.6-linux-x86_64.egg/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) - -If this is the case, run the following command to remove this ImportError: - -.. 
code-block:: bash - - conda install -y libgcc - The Plasma API -------------- From ba8b0dfafe4b6973877240d6964a5459db401e27 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 31 Jul 2017 18:33:19 -0700 Subject: [PATCH 14/21] fix docs --- python/doc/source/development.rst | 26 +++++++++++++++----------- 1 file changed, 15 insertions(+), 11 deletions(-) diff --git a/python/doc/source/development.rst b/python/doc/source/development.rst index 8cb7b43e295..a735c0b0db9 100644 --- a/python/doc/source/development.rst +++ b/python/doc/source/development.rst @@ -159,23 +159,15 @@ Now build and install the Arrow C++ libraries: cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ -DARROW_PYTHON=on \ + -DARROW_PLASMA=on \ -DARROW_BUILD_TESTS=OFF \ .. make -j4 make install popd -If you want to build and install the Plasma in-memory object store too, -replace the cmake command with the following one: - -.. code-block:: shell - - cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ - -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ - -DARROW_PYTHON=on \ - -DARROW_PLASMA=on \ - -DARROW_BUILD_TESTS=OFF \ - .. +If you don't want to build and install the Plasma in-memory object store, +you can omit the `-DARROW_PLASMA=on` flag. Now, optionally build and install the Apache Parquet libraries in your toolchain: @@ -232,6 +224,18 @@ You should be able to run the unit tests with: ====================== 181 passed, 17 skipped in 0.98 seconds =========== +On some configurations this might give an error like the following: + +.. conda-block:: shell + + ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/repos/arrow/python/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) + +This can be fixed by running the following: + +.. conda-block:: shell + + conda install -y libgcc + You can build a wheel by running: .. 
code-block:: shell From c8847204a4ed2a7136d7a0e439bb0a6bbb2435c6 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 31 Jul 2017 18:39:00 -0700 Subject: [PATCH 15/21] more fixes --- python/doc/source/development.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/python/doc/source/development.rst b/python/doc/source/development.rst index a735c0b0db9..455c9f8c0fd 100644 --- a/python/doc/source/development.rst +++ b/python/doc/source/development.rst @@ -167,7 +167,7 @@ Now build and install the Arrow C++ libraries: popd If you don't want to build and install the Plasma in-memory object store, -you can omit the `-DARROW_PLASMA=on` flag. +you can omit the ``-DARROW_PLASMA=on`` flag. Now, optionally build and install the Apache Parquet libraries in your toolchain: @@ -226,13 +226,13 @@ You should be able to run the unit tests with: On some configurations this might give an error like the following: -.. conda-block:: shell +.. code-block:: shell ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/repos/arrow/python/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) This can be fixed by running the following: -.. conda-block:: shell +.. 
code-block:: shell conda install -y libgcc From 80aaf89d3cf30ff986aca9d4f5806f1b79ab6078 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 31 Jul 2017 20:36:44 -0700 Subject: [PATCH 16/21] cleanup --- python/doc/source/development.rst | 12 ------------ python/doc/source/plasma.rst | 19 +++++++++++-------- 2 files changed, 11 insertions(+), 20 deletions(-) diff --git a/python/doc/source/development.rst b/python/doc/source/development.rst index 455c9f8c0fd..4114ab0f3dc 100644 --- a/python/doc/source/development.rst +++ b/python/doc/source/development.rst @@ -224,18 +224,6 @@ You should be able to run the unit tests with: ====================== 181 passed, 17 skipped in 0.98 seconds =========== -On some configurations this might give an error like the following: - -.. code-block:: shell - - ImportError: /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /home/ubuntu/repos/arrow/python/pyarrow/lib.cpython-36m-x86_64-linux-gnu.so) - -This can be fixed by running the following: - -.. code-block:: shell - - conda install -y libgcc - You can build a wheel by running: .. code-block:: shell diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 526f736aff9..519f806dde5 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -241,8 +241,8 @@ object: .. code-block:: python # Reconstruct the Arrow tensor object. - reader = pa.BufferReader(buf2) # Plasma buffer -> Arrow reader - tensor2 = pa.read_tensor(reader) # Arrow reader -> Arrow tensor + reader = pa.BufferReader(buf2) + tensor2 = pa.read_tensor(reader) Finally, you can use ``pyarrow.read_tensor`` to convert the Arrow object back into numpy data: @@ -250,7 +250,7 @@ back into numpy data: .. 
code-block:: python # Convert back to numpy - array = tensor2.to_numpy() # Arrow tensor -> numpy array + array = tensor2.to_numpy() Storing Pandas DataFrames in Plasma ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -262,7 +262,8 @@ size of the ``DataFrame`` to allocate a buffer for. One can instead use pyarrow and its supportive API as an intermediary step to import the Pandas ``DataFrame`` into Plasma. Arrow has multiple equivalent -types to the various Pandas structures, see the :ref:`pandas` page for more. +types to the various Pandas structures, see the :ref:`pandas` page for more +information. You can create the pyarrow equivalent of a Pandas ``DataFrame`` by using ``pyarrow.from_pandas`` to convert it to a ``RecordBatch``. @@ -292,6 +293,7 @@ size of the Plasma object. mock_sink = pa.MockOutputStream() stream_writer = pa.RecordBatchStreamWriter(mock_sink, record_batch.schema) stream_writer.write_batch(record_batch) + stream_writer.close() data_size = mock_sink.size() buf = client.create(object_id, data_size) @@ -307,6 +309,7 @@ the PyArrow ``RecordBatch`` into Plasma as follows: stream = pa.FixedSizeBufferOutputStream(buf) stream_writer = pa.RecordBatchStreamWriter(stream, record_batch.schema) stream_writer.write_batch(record_batch) + stream_writer.close() Finally, seal the finished object for use by all clients: @@ -329,7 +332,7 @@ into an Arrow ``BufferReader`` object. # Fetch the Plasma object [data] = client.get([object_id]) # Get PlasmaBuffer from ObjectID - buffer = pa.BufferReader(data) # PlasmaBuffer -> Arrow BufferReader + buffer = pa.BufferReader(data) From the ``BufferReader``, we can create a specific ``RecordBatchStreamReader`` in Arrow to reconstruct the stored PyArrow ``RecordBatch`` object. @@ -337,8 +340,8 @@ in Arrow to reconstruct the stored PyArrow ``RecordBatch`` object. .. 
code-block:: python # Convert object back into an Arrow RecordBatch - reader = pa.RecordBatchStreamReader(buffer) # Arrow BufferReader -> Arrow RecordBatchStreamReader - rec_batch = reader.read_next_batch() # Arrow RecordBatchStreamReader -> Arrow RecordBatch + reader = pa.RecordBatchStreamReader(buffer) + record_batch = reader.read_next_batch() The last step is to convert the PyArrow ``RecordBatch`` object back into the original Pandas ``DataFrame`` structure. @@ -346,4 +349,4 @@ the original Pandas ``DataFrame`` structure. .. code-block:: python # Convert back into Pandas - result = rec_batch.to_pandas() # Arrow RecordBatch -> Pandas DataFrame + result = record_batch.to_pandas() From 791e5b0bf80509a0e82deaa41316ae77ef2310ca Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 31 Jul 2017 20:53:03 -0700 Subject: [PATCH 17/21] API changes --- python/doc/source/plasma.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 519f806dde5..4e67a50ff3f 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -51,12 +51,12 @@ Creating a Plasma client ^^^^^^^^^^^^^^^^^^^^^^^^ To start the Plasma client, from within python, the same socket given to -``./plasma_store`` should then be passed into the Plasma client as shown below: +``./plasma_store`` should then be passed into the connect method as shown below: .. code-block:: python import pyarrow.plasma as plasma - client = plasma.PlasmaClient("/tmp/plasma", "", 0) + client = plasma.connect("/tmp/plasma", "", 0) If the following error occurs from running the above Python code, that means that either the socket given is incorrect, or the ``./plasma_store`` is @@ -65,7 +65,7 @@ process in your plasma directory. .. 
code-block:: shell - >>> client = plasma.PlasmaClient("/tmp/plasma", "", 0) + >>> client = plasma.connect("/tmp/plasma", "", 0) Connection to socket failed for pathname /tmp/plasma Could not connect to socket /tmp/plasma @@ -138,7 +138,7 @@ the object. # Create a different client. Note that this second client could be # created in the same or in a separate, concurrent Python session. - client2 = plasma.PlasmaClient("/tmp/plasma", "", 0) + client2 = plasma.connect("/tmp/plasma", "", 0) # Get the object in the second client. This blocks until the object has been sealed. object_id2 = plasma.ObjectID(20 * b"a") From 4163ccfaaf5b46eab2cf77cd5898ac6f9dbb0c7d Mon Sep 17 00:00:00 2001 From: Robert Nishihara Date: Mon, 31 Jul 2017 21:57:57 -0700 Subject: [PATCH 18/21] Some changes to plasma.md and add syntax highlighting. --- cpp/apidoc/tutorials/plasma.md | 139 +++++++++++++++++---------------- 1 file changed, 73 insertions(+), 66 deletions(-) diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md index 89b84c4d0fd..7fd40232b3b 100644 --- a/cpp/apidoc/tutorials/plasma.md +++ b/cpp/apidoc/tutorials/plasma.md @@ -18,11 +18,13 @@ Using the Plasma In-Memory Object Store from C++ Apache Arrow offers the ability to share your data structures among multiple processes simultaneously through Plasma, an in-memory object store. -Plasma object stores can be local, as in being on the same node, or remote. -Plasma can communicate between local and remote object stores to share -objects between nodes as well. +Note that **the Plasma API is not stable**. -Like in Apache Arrow, Plasma objects are immutable. +Plasma clients are processes that run on the same machine as the object store. +They communicate with the object store over Unix domain sockets, and they read +and write data in the object store through shared memory. + +Plasma objects are immutable once they have been created. 
The following goes over the basics so you can begin using Plasma in your big data applications. @@ -33,45 +35,44 @@ Starting the Plasma store To start running the Plasma object store so that clients may connect and access the data, run the following command: -``` +```shell plasma_store -m 1000000000 -s /tmp/plasma ``` -This command takes in two flags -- the `-m` flag specifies -the size of the object store in bytes, and the `-s` flag specifies the path of -the UNIX domain socket that the store will listen at. +The `-m` flag specifies the size of the object store in bytes. The `-s` flag +specifies the path of the Unix domain socket that the store will listen at. -Therefore, the above command initializes a Plasma store up to 1 GB of memory, and -sets the socket to `/tmp/plasma.` +Therefore, the above command initializes a Plasma store up to 1 GB of memory +and sets the socket to `/tmp/plasma.` The Plasma store will remain available as long as the `plasma_store` process is -running in a terminal window. Messages, such as alerts for disconnecting clients, -may occasionally be outputted. To stop running the Plasma store, you can press -`CTRL-C` in the terminal window. +running in a terminal window. Messages, such as alerts for disconnecting +clients, may occasionally be output. To stop running the Plasma store, you +can press `Ctrl-C` in the terminal window. Alternatively, you can run the Plasma store in the background and ignore all message output with the following terminal command: -``` +```shell plasma_store -m 1000000000 -s /tmp/plasma 1> /dev/null 2> /dev/null & ``` -The Plasma store will instead run silently in the background. To stop running the Plasma store in this case, issue the below terminal command: +The Plasma store will instead run silently in the background. 
To stop running +the Plasma store in this case, issue the command below: -``` +```shell killall plasma_store & ``` Creating a Plasma client ------------------------ -Now that the Plasma object store is up and running, it is time to make client -processes (such as an instance of your C++ program) connect to it. To use the -Plasma object store as a client, your application should initialize a -`plasma::PlasmaClient` object and tell it to connect to socket specified when -starting up the Plasma object store. +Now that the Plasma object store is up and running, it is time to make a client +process connect to it. To use the Plasma object store as a client, your +application should initialize a `plasma::PlasmaClient` object and tell it to +connect to the socket specified when starting up the Plasma object store. -``` +```cpp #include using namespace plasma; @@ -87,12 +88,13 @@ int main(int argc, char** argv) { Save this program in a file `test.cc` and compile it with -``` +```shell g++ test.cc `pkg-config --cflags --libs plasma` --std=c++11 ``` -Note that multiple clients can be created within the same process, and -clients can be created among multiple concurrent processes. +Note that multiple clients can be created within the same process. + +Note that a `PlasmaClient` object is **not thread safe**. If the Plasma store is still running, you can now execute the `a.out` executable and the store will print something like @@ -106,14 +108,14 @@ which shows that the client was successfully disconnected. Object IDs ---------- -The Plasma object store uses SHA-1 identifiers for accessing objects stored -in shared memory. Each object in the Plasma store should be associated -with a unique id. The Object ID then serves as a key for *any* client to fetch -that object from the Plasma store. +The Plasma object store uses twenty-byte identifiers for accessing objects +stored in shared memory. Each object in the Plasma store should be associated +with a unique ID. 
The Object ID is then a key that can be used by **any** client +to fetch that object from the Plasma store. -Random generation of Object IDs is often good enough to ensure unique ids: +Random generation of Object IDs is often good enough to ensure unique IDs: -``` +```cpp // Randomly generate an Object ID. ObjectID object_id = ObjectID::from_random(); ``` @@ -123,7 +125,7 @@ same object from the Plasma object store. For easy transportation of Object IDs, you can convert/serialize an Object ID into a binary string and back as follows: -``` +```cpp // From ObjectID to binary string std::string id_string = object_id.binary(); @@ -136,7 +138,7 @@ format that git uses for commit hashes by running `ObjectID::hex`. Here is a test program you can run: -``` +```cpp #include #include #include @@ -157,27 +159,30 @@ Creating an Object ------------------ Now that you learned about Object IDs that are used to refer to objects, -let's look into how objects can be stored in Plasma. +let's look at how objects can be stored in Plasma. -Storing objects is a two-stage process. First, an object is *created*, in -which you specify a pointer for which the object's data and contents will be -constructed from. At this point, the client can still modify the contents -of the data array. +Storing objects is a two-stage process. First a buffer is allocated with a call +to `Create`. Then it can be constructed in place by the client. Then it is made +immutable and shared with other clients via a call to `Seal`. -To create an object for Plasma, you need to create an object id, as well as -give the object's maximum data size in bytes. +The `Create` call blocks while the Plasma store allocates a buffer of the +appropriate size. The client will then map the buffer into its own address +space. At this point the object can be constructed in place using a pointer that +was written by the `Create` command. -``` -// Create the Plasma object by specifying its size.
+```cpp int64_t data_size = 100; +// The address of the buffer allocated by the Plasma store will be written at +// this address. uint8_t* data; +// Create a Plasma object by specifying its ID and size. ARROW_CHECK_OK(client.Create(object_id, data_size, NULL, 0, &data)); ``` You can also specify metadata for the object; the third argument is the -metadata (as raw bytes) and the forth argument is the size of the metadata. +metadata (as raw bytes) and the fourth argument is the size of the metadata. -``` +```cpp // Create a Plasma object with metadata. int64_t data_size = 100; std::string metadata = "{'author': 'john'}"; uint8_t* data; @@ -185,27 +190,27 @@ uint8_t* data; client.Create(object_id, data_size, (uint8_t*) metadata.data(), metadata.size(), &data); ``` -Now that we've specified the pointer to our object's data, we can +Now that we've obtained a pointer to our object's data, we can write our data to it: -``` +```cpp // Write some data for the Plasma object. for (int64_t i = 0; i < data_size; i++) { data[i] = static_cast<uint8_t>(i % 4); } ``` -When the client is done, the client *seals* the buffer, making the object +When the client is done, the client **seals** the buffer, making the object immutable, and making it available to other Plasma clients: -``` +```cpp // Seal the object. This makes it available for all clients. client.Seal(object_id); ``` Here is an example that combines all these features: -``` +```cpp #include using namespace plasma; @@ -232,7 +237,8 @@ int main(int argc, char** argv) { ``` This example can be compiled with -``` + +```shell g++ create.cc `pkg-config --cflags --libs plasma` --std=c++11 -o create ``` @@ -242,7 +248,7 @@ been created and sealed for a given Object ID. Note that this function will still return False if the object has been created, but not yet sealed: -``` +```cpp // Check if an object has been created and sealed.
bool has_object; client.Contains(object_id, &has_object); @@ -258,7 +264,7 @@ After an object has been sealed, any client who knows the Object ID can get the object. To store the retrieved object contents, you should create an `ObjectBuffer`, then call `PlasmaClient::Get()` as follows: -``` +```cpp // Get from the Plasma store by Object ID. ObjectBuffer object_buffer; client.Get(&object_id, 1, -1, &object_buffer); @@ -269,7 +275,7 @@ from the Plasma store at once. You can specify an array of Object IDs and `ObjectBuffers` to fetch at once, so long as you also specify the number of objects being fetched: -``` +```cpp // Get two objects at once from the Plasma store. This function // call will block until both objects have been fetched. ObjectBuffer multiple_buffers[2]; @@ -283,7 +289,7 @@ when trying to fetch from the Plasma store. You can pass in a timeout in milliseconds when calling `PlasmaClient::Get().` To use `PlasmaClient::Get()` without a timeout, just pass in -1 like in the previous example calls: -``` +```cpp // Make the function call give up fetching the object if it takes // more than 100 milliseconds. int64_t timeout = 100; @@ -294,7 +300,7 @@ Finally, to access the object, you can access the `data` and `metadata` attributes of the `ObjectBuffer`. The `data` can be indexed like any array: -``` +```cpp // Access object data. uint8_t* data = object_buffer.data; int64_t data_size = object_buffer.data_size; @@ -309,7 +315,7 @@ uint8_t first_data_byte = data[0]; Here is a longer example that shows these capabilities: -``` +```cpp #include using namespace plasma; @@ -338,7 +344,7 @@ int main(int argc, char** argv) { If you compile it with -``` +```shell g++ get.cc `pkg-config --cflags --libs plasma` --std=c++11 -o get ``` @@ -353,16 +359,17 @@ The Plasma store internally does reference counting to make sure objects that are mapped into the address space of one of the clients with `PlasmaClient::Get` are accessible. 
To unmap objects from a client, call `PlasmaClient::Release`. All objects that are mapped into a clients address space will automatically -be released when the client is disconnected from the store. +be released when the client is disconnected from the store (this happens even +if the client process crashes or otherwise fails to call `Disconnect`). If a new object is created and there is not enough space in the Plasma store, -the store will evict the least recently used released object. If all objects -are mapped into the address space of some client, the +the store will evict the least recently used object (an object is in use if at +least one client has gotten it but not released it). Object notifications -------------------- -Additionally, you can arrange Plasma to notify you when objects are +Additionally, you can arrange to have Plasma notify you when objects are sealed in the object store. This may especially be handy when your program is collaborating with other Plasma clients, and needs to know when they make objects available. @@ -370,7 +377,7 @@ when they make objects available. First, you can subscribe your current Plasma client to such notifications by getting a file descriptor: -``` +```cpp // Start receiving notifications into file_descriptor. int fd; ARROW_CHECK_OK(client.Subscribe(&fd)); @@ -381,7 +388,7 @@ wait to receive the next object notification. Object notifications include information such as Object ID, data size, and metadata size of the next newly available object: -``` +```cpp // Receive notification of the next newly available object. // Notification information is stored in object_id, data_size, and metadata_size ObjectID new_object_id; @@ -389,14 +396,14 @@ int64_t data_size; int64_t metadata_size; ARROW_CHECK_OK(client.GetNotification(fd, &object_id, &data_size, &metadata_size)); -// Fetch the newly available object. +// Get the newly available object. 
ObjectBuffer object_buffer; ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer)); ``` Here is a full program that shows this capability: -``` +```cpp #include using namespace plasma; @@ -427,7 +434,7 @@ int main(int argc, char** argv) { If you compile it with -``` +```shell g++ subscribe.cc `pkg-config --cflags --libs plasma` --std=c++11 -o subscribe ``` From 21bdc0146b759808328287378ad4f3a4a47ca94a Mon Sep 17 00:00:00 2001 From: Robert Nishihara Date: Mon, 31 Jul 2017 22:19:38 -0700 Subject: [PATCH 19/21] Small changes to python plasma documentation. --- python/doc/source/plasma.rst | 145 ++++++++++++++++------------------- 1 file changed, 65 insertions(+), 80 deletions(-) diff --git a/python/doc/source/plasma.rst b/python/doc/source/plasma.rst index 4e67a50ff3f..98dd62f97e9 100644 --- a/python/doc/source/plasma.rst +++ b/python/doc/source/plasma.rst @@ -38,20 +38,21 @@ following: plasma_store -m 1000000000 -s /tmp/plasma -The -m flag specifies the size of the store in bytes, and the -s flag specifies -the socket that the store will listen at. Thus, the above command sets the -Plasma store to use up to 1 GB of memory, and sets the socket to +The ``-m`` flag specifies the size of the store in bytes, and the ``-s`` flag +specifies the socket that the store will listen at. Thus, the above command +allows the Plasma store to use up to 1GB of memory, and sets the socket to ``/tmp/plasma``. -Leave the current terminal window open as long as Plasma store should keep -running. Messages, concerning such as disconnecting clients, may occasionally be -outputted. To stop running the Plasma store, you can press ``CTRL-C`` in the terminal. +Leave the current terminal window open as long as the Plasma store should keep +running. Messages, such as those concerning disconnecting clients, may occasionally be +printed to the screen. To stop running the Plasma store, you can press +``Ctrl-C`` in the terminal.
Creating a Plasma client ^^^^^^^^^^^^^^^^^^^^^^^^ -To start the Plasma client, from within python, the same socket given to -``./plasma_store`` should then be passed into the connect method as shown below: +To start a Plasma client from Python, call ``plasma.connect`` using the same +socket name: .. code-block:: python @@ -59,9 +60,8 @@ To start the Plasma client, from within python, the same socket given to client = plasma.connect("/tmp/plasma", "", 0) If the following error occurs from running the above Python code, that -means that either the socket given is incorrect, or the ``./plasma_store`` is -not currently running. Make sure that you are still running the ``./plasma_store`` -process in your plasma directory. +means that either the socket given is incorrect, or the ``./plasma_store`` is +not currently running. Check to see if the Plasma store is still running. .. code-block:: shell @@ -70,25 +70,26 @@ process in your plasma directory. Could not connect to socket /tmp/plasma -Object IDs +Object IDs ^^^^^^^^^^ -Each object in the Plasma store should be associated with a unique id. The -Object ID then serves as a key for any client to fetch that object from -the Plasma store. You can form an ``ObjectID`` object from a byte string of -20 bytes. +Each object in the Plasma store should be associated with a unique ID. The +Object ID then serves as a key that any client can use to retrieve that object +from the Plasma store. You can form an ``ObjectID`` object from a byte string of +length 20. .. code-block:: shell - # Create ObjectID of 20 bytes, each byte being the byte (b) encoding of the letter "a" - >>> id = plasma.ObjectID(20 * b"a") + # Create an ObjectID. + >>> id = plasma.ObjectID(20 * b"a") - # "a" is encoded as 61 + # The character "a" is encoded as 61 in hex. >>> id ObjectID(6161616161616161616161616161616161616161) -Random generation of Object IDs is often good enough to ensure unique ids. 
-You can easily create a helper function that randomizes object ids as follows:
+The random generation of Object IDs is often good enough to ensure unique IDs.
+You can easily create a helper function that randomly generates object IDs as
+follows:
 
 .. code-block:: python
 
@@ -101,17 +102,17 @@ You can easily create a helper function that randomizes object ids as follows:
 Creating an Object
 ^^^^^^^^^^^^^^^^^^
 
-Objects are created in Plasma in two stages. First, they are *created*, which
-allocates a buffer for the object. At this point, the client can write to the
-buffer and construct the object within the allocated buffer.
+Objects are created in Plasma in two stages. First, they are **created**, which
+allocates a buffer for the object. At this point, the client can write to the
+buffer and construct the object within the allocated buffer.
 
-To create an object for Plasma, you need to create an object id, as well as
+To create an object for Plasma, you need to create an object ID, as well as
 give the object's maximum size in bytes.
 
 .. code-block:: python
 
   # Create an object.
-  object_id = plasma.ObjectID(20 * b"a") # Note that this is an ObjectID object, not a string
+  object_id = plasma.ObjectID(20 * b"a")
   object_size = 1000
   buffer = memoryview(client.create(object_id, object_size))
 
@@ -119,7 +120,7 @@ give the object's maximum size in bytes.
   for i in range(1000):
     buffer[i] = i % 128
 
-When the client is done, the client *seals* the buffer, making the object
+When the client is done, the client **seals** the buffer, making the object
 immutable, and making it available to other Plasma clients.
 
 .. code-block:: python
@@ -128,10 +129,10 @@ immutable, and making it available to other Plasma clients.
   client.seal(object_id)
 
 
-Getting an Object
+Getting an Object
 ^^^^^^^^^^^^^^^^^
 
-After an object has been sealed, any client who knows the object ID can get
+After an object has been sealed, any client who knows the object ID can get
 the object.
 
 .. code-block:: python
@@ -141,21 +142,14 @@ the object.
   client2 = plasma.connect("/tmp/plasma", "", 0)
 
   # Get the object in the second client. This blocks until the object has been sealed.
-  object_id2 = plasma.ObjectID(20 * b"a")
-  [buffer2] = client2.get([object_id]) # Note that you pass in as an ObjectID object, not a string
+  object_id2 = plasma.ObjectID(20 * b"a")
+  [buffer2] = client2.get([object_id])
 
-If the object has not been sealed yet, then the call to client.get will block
+If the object has not been sealed yet, then the call to client.get will block
 until the object has been sealed by the client constructing the object.
 Using the ``timeout_ms`` argument to get, you can specify a timeout for this
 (in milliseconds). After the timeout, the interpreter will yield control back.
 
-Note that the buffer fetched is not in the same object type as the buffer the
-original client created to store the object in the first place. The
-buffer the original client created is a Python ``memoryview`` buffer object,
-while the buffer returned from ``client.get`` is a Plasma-specific ``PlasmaBuffer``
-object. It supports the Python buffer protocol, so you can create a memoryview
-from it, which supports slicing and indexing to expose its data.
-
 .. code-block:: shell
 
   >>> buffer
@@ -181,12 +175,12 @@ Using Arrow and Pandas with Plasma
 Storing Arrow Objects in Plasma
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Creating an Arrow object still follows the two steps of *creating* it with
-a buffer, then *sealing* it, however Arrow objects such as ``Tensors`` may be
-more complicated to write than simple binary data.
+To store an Arrow object in Plasma, we must first **create** the object and then
+**seal** it. However, Arrow objects such as ``Tensors`` may be more complicated
+to write than simple binary data.
 
-To create the object in Plasma, you still need an ``ObjectID`` and a size to
-pass in. To find out the size of your Arrow object, you can use pyarrow
+To create the object in Plasma, you still need an ``ObjectID`` and a size to
+pass in. To find out the size of your Arrow object, you can use pyarrow
 API such as ``pyarrow.get_tensor_size``.
 
 .. code-block:: python
@@ -199,13 +193,13 @@ API such as ``pyarrow.get_tensor_size``.
   tensor = pa.Tensor.from_numpy(data)
 
   # Create the object in Plasma
-  object_id = plasma.ObjectID(np.random.bytes(20))
+  object_id = plasma.ObjectID(np.random.bytes(20))
   data_size = pa.get_tensor_size(tensor)
   buf = client.create(object_id, data_size)
 
-To write the Arrow ``Tensor`` object into the buffer, you can use Plasma to
-convert the ``memoryview`` buffer into a ``pyarrow.FixedSizeBufferOutputStream``
-object. A ``pyarrow.FixedSizeBufferOutputStream`` is a format suitable for Arrow's
+To write the Arrow ``Tensor`` object into the buffer, you can use Plasma to
+convert the ``memoryview`` buffer into a ``pyarrow.FixedSizeBufferOutputStream``
+object. A ``pyarrow.FixedSizeBufferOutputStream`` is a format suitable for Arrow's
 ``pyarrow.write_tensor``:
 
 .. code-block:: python
@@ -214,8 +208,7 @@ object. A ``pyarrow.FixedSizeBufferOutputStream`` is a format suitable for Arrow
   stream = pa.FixedSizeBufferOutputStream(buf)
   pa.write_tensor(tensor, stream) # Writes tensor's 552 bytes to Plasma stream
 
-To finish storing the Arrow object in Plasma, you can seal it just like
-for any other data:
+To finish storing the Arrow object in Plasma, call ``seal``:
 
 .. code-block:: python
 
@@ -225,18 +218,16 @@ for any other data:
 Getting Arrow Objects from Plasma
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-For reading the object from Plasma to Arrow, you can fetch it as a ``PlasmaBuffer``
-using its object id as usual.
+To read the object, first retrieve it as a ``PlasmaBuffer`` using its object ID.
 
 .. code-block:: python
 
   # Get the arrow object by ObjectID.
   [buf2] = client.get([object_id])
 
-To convert the ``PlasmaBuffer`` back into the Arrow ``Tensor``, first you have to
-create a pyarrow ``BufferReader`` object from it. You can then pass the
-``BufferReader`` into ``pyarrow.read_tensor`` to reconstruct the Arrow ``Tensor``
-object:
+To convert the ``PlasmaBuffer`` back into an Arrow ``Tensor``, first create a
+pyarrow ``BufferReader`` object from it. You can then pass the ``BufferReader``
+into ``pyarrow.read_tensor`` to reconstruct the Arrow ``Tensor`` object:
 
 .. code-block:: python
 
@@ -244,7 +235,7 @@ object:
   reader = pa.BufferReader(buf2)
   tensor2 = pa.read_tensor(reader)
 
-Finally, you can use ``pyarrow.read_tensor`` to convert the Arrow object
+Finally, you can use ``pyarrow.read_tensor`` to convert the Arrow object
 back into numpy data:
 
 .. code-block:: python
@@ -255,17 +246,14 @@ back into numpy data:
 Storing Pandas DataFrames in Plasma
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Storing a Pandas ``DataFrame`` still follows the *create* then *seal* process
-of storing an object in the Plasma store, however one cannot directly write
-the ``DataFrame`` to Plasma with Pandas alone. Plasma also needs to know the
-size of the ``DataFrame`` to allocate a buffer for.
+Storing a Pandas ``DataFrame`` still follows the **create** then **seal**
+process of storing an object in the Plasma store, however one cannot directly
+write the ``DataFrame`` to Plasma with Pandas alone. Plasma also needs to know
+the size of the ``DataFrame`` to allocate a buffer for.
 
-One can instead use pyarrow and its supportive API as an intermediary step
-to import the Pandas ``DataFrame`` into Plasma. Arrow has multiple equivalent
-types to the various Pandas structures, see the :ref:`pandas` page for more
-information.
+See :ref:`pandas` for more information on using Arrow with Pandas.
 
-You can create the pyarrow equivalent of a Pandas ``DataFrame`` by using
+You can create the pyarrow equivalent of a Pandas ``DataFrame`` by using
 ``pyarrow.from_pandas`` to convert it to a ``RecordBatch``.
 
 .. code-block:: python
@@ -282,13 +270,14 @@ You can create the pyarrow equivalent of a Pandas ``DataFrame`` by using
   record_batch = pa.RecordBatch.from_pandas(df)
 
 Creating the Plasma object requires an ``ObjectID`` and the size of the
-data. Now that we have converted the Pandas ``DataFrame`` into a PyArrow
+data. Now that we have converted the Pandas ``DataFrame`` into a PyArrow
 ``RecordBatch``, use the ``MockOutputStream`` to determine the
 size of the Plasma object.
 
 .. code-block:: python
 
-  # Create the Plasma object from the PyArrow RecordBatch
+  # Create the Plasma object from the PyArrow RecordBatch. Most of the work here
+  # is done to determine the size of buffer to request from the object store.
   object_id = plasma.ObjectID(np.random.bytes(20))
   mock_sink = pa.MockOutputStream()
   stream_writer = pa.RecordBatchStreamWriter(mock_sink, record_batch.schema)
@@ -297,11 +286,7 @@ size of the Plasma object.
   data_size = mock_sink.size()
   buf = client.create(object_id, data_size)
 
-Similar to storing an Arrow object, you have to convert the ``memoryview``
-object into a ``plasma.FixedSizeBufferOutputStream`` object in order to
-work with pyarrow's API. Then convert the ``FixedSizeBufferOutputStream``
-object into a pyarrow ``RecordBatchStreamWriter`` object to write out
-the PyArrow ``RecordBatch`` into Plasma as follows:
+The DataFrame can now be written to the buffer as follows.
 
 .. code-block:: python
 
@@ -321,12 +306,12 @@ Finally, seal the finished object for use by all clients:
 Getting Pandas DataFrames from Plasma
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Since we store the Pandas DataFrame as a PyArrow ``RecordBatch`` object,
-to get the object back from the Plasma store, we follow similar steps
-to those specified in `Getting Arrow Objects from Plasma`_.
+Since we store the Pandas DataFrame as a PyArrow ``RecordBatch`` object,
+to get the object back from the Plasma store, we follow similar steps
+to those specified in `Getting Arrow Objects from Plasma`_.
 
-We first have to convert the ``PlasmaBuffer`` returned from ``client.get``
-into an Arrow ``BufferReader`` object.
+We first have to convert the ``PlasmaBuffer`` returned from ``client.get``
+into an Arrow ``BufferReader`` object.
 
 .. code-block:: python
 
@@ -334,7 +319,7 @@ into an Arrow ``BufferReader`` object.
   [data] = client.get([object_id]) # Get PlasmaBuffer from ObjectID
   buffer = pa.BufferReader(data)
 
-From the ``BufferReader``, we can create a specific ``RecordBatchStreamReader``
+From the ``BufferReader``, we can create a specific ``RecordBatchStreamReader``
 in Arrow to reconstruct the stored PyArrow ``RecordBatch`` object.
 
 .. code-block:: python
@@ -343,7 +328,7 @@ in Arrow to reconstruct the stored PyArrow ``RecordBatch`` object.
   reader = pa.RecordBatchStreamReader(buffer)
   record_batch = reader.read_next_batch()
 
-The last step is to convert the PyArrow ``RecordBatch`` object back into
+The last step is to convert the PyArrow ``RecordBatch`` object back into
 the original Pandas ``DataFrame`` structure.
 
 .. code-block:: python

From 4b987e83a10f96d406c1f6496a6bf0bc7175ac80 Mon Sep 17 00:00:00 2001
From: Robert Nishihara
Date: Tue, 1 Aug 2017 10:32:12 -0700
Subject: [PATCH 20/21] Fix typo.

---
 cpp/apidoc/tutorials/plasma.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md
index 7fd40232b3b..952b4414e1c 100644
--- a/cpp/apidoc/tutorials/plasma.md
+++ b/cpp/apidoc/tutorials/plasma.md
@@ -61,7 +61,7 @@ The Plasma store will instead run silently in the background.
 To stop running the Plasma store in this case, issue the command below:
 
 ```shell
-killall plasma_store &
+killall plasma_store
 ```
 
 Creating a Plasma client

From c4ab47e0b26d6b8f25321ab135da4eae56b85964 Mon Sep 17 00:00:00 2001
From: Robert Nishihara
Date: Tue, 1 Aug 2017 12:08:05 -0700
Subject: [PATCH 21/21] Remove unsupported shell keyword from plasma.md.

---
 cpp/apidoc/tutorials/plasma.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/cpp/apidoc/tutorials/plasma.md b/cpp/apidoc/tutorials/plasma.md
index 952b4414e1c..9911546ed5c 100644
--- a/cpp/apidoc/tutorials/plasma.md
+++ b/cpp/apidoc/tutorials/plasma.md
@@ -35,7 +35,7 @@ Starting the Plasma store
 To start running the Plasma object store so that clients may
 connect and access the data, run the following command:
 
-```shell
+```
 plasma_store -m 1000000000 -s /tmp/plasma
 ```
 
@@ -53,14 +53,14 @@ can press `Ctrl-C` in the terminal window.
 Alternatively, you can run the Plasma store in the background and ignore all
 message output with the following terminal command:
 
-```shell
+```
 plasma_store -m 1000000000 -s /tmp/plasma 1> /dev/null 2> /dev/null &
 ```
 
 The Plasma store will instead run silently in the background.
 To stop running the Plasma store in this case, issue the command below:
 
-```shell
+```
 killall plasma_store
 ```
 
@@ -88,7 +88,7 @@ int main(int argc, char** argv) {
 
 Save this program in a file `test.cc` and compile it with
 
-```shell
+```
 g++ test.cc `pkg-config --cflags --libs plasma` --std=c++11
 ```
 
@@ -238,7 +238,7 @@ int main(int argc, char** argv) {
 
 This example can be compiled with
 
-```shell
+```
 g++ create.cc `pkg-config --cflags --libs plasma` --std=c++11 -o create
 ```
 
@@ -344,7 +344,7 @@ int main(int argc, char** argv) {
 
 If you compile it with
 
-```shell
+```
 g++ get.cc `pkg-config --cflags --libs plasma` --std=c++11 -o get
 ```
 
@@ -434,7 +434,7 @@ int main(int argc, char** argv) {
 
 If you compile it with
 
-```shell
+```
 g++ subscribe.cc `pkg-config --cflags --libs plasma` --std=c++11 -o subscribe
 ```