From 8c877d9c9224c826960216f1fe43a356635f7b4d Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 18 Sep 2017 16:26:22 -0400
Subject: [PATCH 1/3] Draft 0.7.0 release post

Change-Id: Ie718ee1b03cf983036e5391f4334a57fb21ec953
---
 site/_posts/2017-09-19-0.7.0-release.md | 188 ++++++++++++++++++++++++
 site/_release/index.md                  |   2 +
 site/index.html                         |  27 +++-
 site/install.md                         |  32 ++--
 4 files changed, 225 insertions(+), 24 deletions(-)
 create mode 100644 site/_posts/2017-09-19-0.7.0-release.md

diff --git a/site/_posts/2017-09-19-0.7.0-release.md b/site/_posts/2017-09-19-0.7.0-release.md
new file mode 100644
index 00000000000..7d94c995929
--- /dev/null
+++ b/site/_posts/2017-09-19-0.7.0-release.md
@@ -0,0 +1,188 @@
+---
+layout: post
+title: "Apache Arrow 0.7.0 Release"
+date: "2017-09-18 00:00:00 -0400"
+author: wesm
+categories: [release]
+---
+
+
+The Apache Arrow team is pleased to announce the 0.7.0 release. It includes
+[**133 resolved JIRAs**][1], with many new features and bug fixes across the
+various language implementations. The Arrow memory format has remained stable
+since the 0.3.x release.
+
+See the [Install Page][2] to learn how to get the libraries for your
+platform. The [complete changelog][3] is also available.
+
+We include some highlights from the release in this post.
+
+## New PMC Member: Kouhei Sutou
+
+Since the last release we have added [Kou][4] to the Arrow Project Management
+Committee. He is also a PMC member for Apache Subversion and a major
+contributor to many other open source projects.
+
+As an active member of the Ruby community in Japan, Kou has been developing the
+GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby
+users to benefit from the work that's happening in Apache Arrow.
+
+We are excited to be collaborating with the Ruby community on shared
+infrastructure for in-memory analytics and data science.
+
+## Expanded JavaScript (TypeScript) Implementation
+
+[Paul Taylor][5] from the [Falcor][7] and [ReactiveX][6] projects has worked to
+expand the JavaScript implementation (which is written in TypeScript), using
+the latest in modern JavaScript build and packaging technology. We are looking
+forward to building out the JS implementation and bringing it up to full
+functionality with the C++ and Java implementations.
+
+We are looking for more JavaScript developers to join the project and work
+together to make Arrow for JS work well with many kinds of front end use cases,
+like real time data visualization.
+
+## Type casting for C++ and Python
+
+As part of longer-term efforts to build an Arrow-native in-memory analytics
+library, we implemented a variety of type conversion functions. These functions
+are essential in ETL tasks when conforming one table schema to another. These
+are similar to the `astype` function in NumPy.
+
+```python
+In [17]: import pyarrow as pa
+
+In [18]: arr = pa.array([True, False, None, True])
+
+In [19]: arr
+Out[19]:
+
+[
+  True,
+  False,
+  NA,
+  True
+]
+
+In [20]: arr.cast(pa.int32())
+Out[20]:
+
+[
+  1,
+  0,
+  NA,
+  1
+]
+```
+
+Over time these will expand to support as many input-and-output type
+combinations as possible, with optimized conversions.
+
+## New Arrow GPU (CUDA) Extension Library for C++
+
+To help with GPU-related projects using Arrow, like the [GPU Open Analytics
+Initiative][8], we have started a C++ add-on library to simplify Arrow memory
+management on CUDA-enabled graphics cards. We would like to expand this to
+include a library of reusable CUDA kernel functions for GPU analytics on Arrow
+columnar memory.
+
+For example, we could write a record batch from CPU memory to GPU device memory
+like so (some error checking omitted):
+
+```c++
+#include <arrow/api.h>
+#include <arrow/gpu/cuda_api.h>
+
+using namespace arrow;
+
+gpu::CudaDeviceManager* manager;
+std::shared_ptr<gpu::CudaContext> context;
+
+gpu::CudaDeviceManager::GetInstance(&manager);
+manager->GetContext(kGpuNumber, &context);
+
+std::shared_ptr<RecordBatch> batch = GetCpuData();
+
+std::shared_ptr<gpu::CudaBuffer> device_serialized;
+gpu::SerializeRecordBatch(*batch, context.get(), &device_serialized);
+```
+
+We can then "read" the GPU record batch, but the returned `arrow::RecordBatch`
+internally will contain GPU device pointers that you can use for CUDA kernel
+calls:
+
+```
+std::shared_ptr<RecordBatch> device_batch;
+gpu::ReadRecordBatch(batch->schema(), device_serialized,
+                     default_memory_pool(), &device_batch);
+
+// Now run some CUDA kernels on device_batch
+```
+
+## Decimal Integration Tests
+
+[Phillip Cloud][9] has been working on decimal support in C++ to enable Parquet
+read/write support in C++ and Python, and also end-to-end testing against the
+Arrow Java libraries.
+
+In the upcoming releases, we hope to complete the remaining data types that
+need end-to-end testing between Java and C++:
+
+* Fixed size lists (variable-size lists already implemented)
+* Fixed size binary
+* Unions
+* Maps
+* Time intervals
+
+## Other Notable Python Changes
+
+Some highlights of Python development outside of bug fixes and general API
+improvements include:
+
+* Simplified `put` and `get` arbitrary Python objects in Plasma objects
+* Object serialization functions: LINK TO DOCS
+
+* New `flavor='spark'` option to `pyarrow.parquet.write_table` to enable easy
+  writing of Parquet files maximized for Spark compatibility
+
+* `parquet.write_to_dataset` function with support for partitioning
+* Improved support for Dask filesystems
+* Improved usability for IPC (schema, record batch read/write)
+
+## The Road Ahead
+
+Upcoming Arrow releases will continue to expand the project to cover more use
+cases. In addition to completing end-to-end testing for all the major data
+types, some of us will be shifting attention to building Arrow-native in-memory
+analytics libraries.
+
+We are looking for more JavaScript, R, and other programming language
+developers to join the project and expand the available implementations and
+bindings to more languages.
+
+[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.7.0
+[2]: http://arrow.apache.org/install
+[3]: http://arrow.apache.org/release/0.7.0.html
+[4]: https://github.com/kou
+[5]: https://github.com/trxcllnt
+[6]: http://reactivex.io
+[7]: https://github.com/netflix/falcor
+[8]: http://gpuopenanalytics.com/
+[9]: http://github.com/cpcloud
\ No newline at end of file
diff --git a/site/_release/index.md b/site/_release/index.md
index b373d8bfe19..ad78de8f395 100644
--- a/site/_release/index.md
+++ b/site/_release/index.md
@@ -26,6 +26,7 @@ limitations under the License.
Navigate to the release page for downloads and the changelog. +* [0.7.0 (17 September 2017)][8] * [0.6.0 (14 August 2017)][7] * [0.5.0 (23 July 2017)][6] * [0.4.1 (9 June 2017)][5] @@ -41,3 +42,4 @@ Navigate to the release page for downloads and the changelog. [5]: {{ site.baseurl }}/release/0.4.1.html [6]: {{ site.baseurl }}/release/0.5.0.html [7]: {{ site.baseurl }}/release/0.6.0.html +[8]: {{ site.baseurl }}/release/0.7.0.html diff --git a/site/index.html b/site/index.html index 95d9f4df1e5..01836d7bf5b 100644 --- a/site/index.html +++ b/site/index.html @@ -7,26 +7,37 @@

Apache Arrow

Powering Columnar In-Memory Analytics

Join Mailing List - Install (0.6.0 Release - August 14, 2017) + Install (0.7.0 Release - September 17, 2017)

-

Latest News: Apache Arrow 0.6.0 release

+

Latest News: Apache Arrow 0.7.0 release

Fast

-

Apache Arrow™ enables execution engines to take advantage of the latest SIM -D (Single input multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout of data also allows for a better use of CPU caches by placing all data relevant to a column operation in as compact of a format - as possible.

+

Apache Arrow™ enables execution engines to take advantage of + the latest SIMD (Single instruction, multiple data) operations included in modern + processors, for native vectorized optimization of analytical data + processing. Columnar layout is optimized for data locality for better + performance on modern hardware like CPUs and GPUs.

+

The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.

+

Flexible

-

Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. Java, C, C++, Python, Ruby, and JavaScript implementations are in progress and more languages are welcome.

+

Arrow acts as a new high-performance interface between various + systems. It is also focused on supporting a wide variety of + industry-standard programming languages. Java, C, C++, Python, Ruby, + and JavaScript implementations are in progress and more languages are + welcome.

Standard

-

Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto standard for columnar in-memory analytics.

+

Apache Arrow is backed by key developers of 13 major open source + projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, + Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it + the de-facto standard for columnar in-memory analytics.

@@ -41,7 +52,7 @@

Advantages of a Common Data Layer

common data layer
  • Each system has its own internal memory format
  • -
  • 70-80% CPU wasted on serialization and deserialization
  • +
  • 70-80% computation wasted on serialization and deserialization
  • Similar functionality implemented in multiple projects
diff --git a/site/install.md b/site/install.md index 6cb80c1336f..74d298667d6 100644 --- a/site/install.md +++ b/site/install.md @@ -20,17 +20,17 @@ limitations under the License. {% endcomment %} --> -## Current Version: 0.6.0 +## Current Version: 0.7.0 -### Released: 14 August 2017 +### Released: 17 September 2017 See the [release notes][10] for more about what's new. ### Source release -* **Source Release**: [apache-arrow-0.6.0.tar.gz][6] -* **Verification**: [md5][3], [asc][7] -* [Git tag b173334][2] +* **Source Release**: [apache-arrow-0.7.0.tar.gz][6] +* **Verification**: [sha512][3], [asc][7] +* [Git tag 97f9029][2] ### Java Packages @@ -52,8 +52,8 @@ Install them with: ```shell -conda install arrow-cpp=0.6.* -c conda-forge -conda install pyarrow==0.6.* -c conda-forge +conda install arrow-cpp=0.7.* -c conda-forge +conda install pyarrow==0.7.* -c conda-forge ``` ### Python Wheels on PyPI (Unofficial) @@ -61,10 +61,10 @@ conda install pyarrow==0.6.* -c conda-forge We have provided binary wheels on PyPI for Linux, macOS, and Windows: ```shell -pip install pyarrow==0.6.* +pip install pyarrow==0.7.* ``` -We recommend pinning `0.6.*` in `requirements.txt` to install the latest patch +We recommend pinning `0.7.*` in `requirements.txt` to install the latest patch release. 
These include the Apache Arrow and Apache Parquet C++ binary libraries bundled @@ -149,13 +149,13 @@ conda install arrow-cpp -c twosigma conda install pyarrow -c twosigma ``` -[1]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/ -[2]: https://github.com/apache/arrow/releases/tag/apache-arrow-0.6.0 -[3]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/apache-arrow-0.6.0.tar.gz.md5 -[4]: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.arrow%22%20AND%20v%3A%220.6.0%22 +[1]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/ +[2]: https://github.com/apache/arrow/releases/tag/apache-arrow-0.7.0 +[3]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/apache-arrow-0.7.0.tar.gz.sha512 +[4]: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.arrow%22%20AND%20v%3A%220.7.0%22 [5]: http://conda-forge.github.io -[6]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/apache-arrow-0.6.0.tar.gz -[7]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/apache-arrow-0.6.0.tar.gz.asc +[6]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/apache-arrow-0.7.0.tar.gz +[7]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/apache-arrow-0.7.0.tar.gz.asc [8]: https://github.com/red-data-tools/parquet-glib [9]: https://github.com/red-data-tools/arrow-packages -[10]: http://arrow.apache.org/release/0.6.0.html +[10]: http://arrow.apache.org/release/0.7.0.html From a9f8770e39ef31ef7e98ad5b34c297672f5783ef Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 18 Sep 2017 17:13:02 -0400 Subject: [PATCH 2/3] More edits, links Change-Id: Ic138eca127a4a98bb20803fef33e6c64560f5e9d --- site/_posts/2017-09-19-0.7.0-release.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/site/_posts/2017-09-19-0.7.0-release.md b/site/_posts/2017-09-19-0.7.0-release.md index 7d94c995929..6fe2b87ddca 100644 --- a/site/_posts/2017-09-19-0.7.0-release.md +++ b/site/_posts/2017-09-19-0.7.0-release.md @@ -157,14 
+157,14 @@ Some highlights of Python development outside of bug fixes and general API improvements include: * Simplified `put` and `get` arbitrary Python objects in Plasma objects -* Object serialization functions: LINK TO DOCS - +* [High-speed, memory efficient object serialization][10]. This is important + enough that we will likely write a dedicated blog post about it. * New `flavor='spark'` option to `pyarrow.parquet.write_table` to enable easy writing of Parquet files maximized for Spark compatibility - -* `parquet.write_to_dataset` function with support for partitioning +* `parquet.write_to_dataset` function with support for partitioned writes * Improved support for Dask filesystems -* Improved usability for IPC (schema, record batch read/write) +* Improved Python usability for IPC: read and write schemas and record batches + more easily. See the [API docs][11] for more about these. ## The Road Ahead @@ -185,4 +185,6 @@ bindings to more languages. [6]: http://reactivex.io [7]: https://github.com/netflix/falcor [8]: http://gpuopenanalytics.com/ -[9]: http://github.com/cpcloud \ No newline at end of file +[9]: http://github.com/cpcloud +[10]: http://arrow.apache.org/docs/python/ipc.html +[11]: http://arrow.apache.org/docs/python/api.html \ No newline at end of file From 3e05047e29995f8592e3a697a8d9d2ff2a3a79d2 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 19 Sep 2017 00:25:15 -0400 Subject: [PATCH 3/3] Update publication date to 19 September Change-Id: I2f134011e2974dab568ce1292a6cb8711d903bb3 --- site/_posts/2017-09-19-0.7.0-release.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site/_posts/2017-09-19-0.7.0-release.md b/site/_posts/2017-09-19-0.7.0-release.md index 6fe2b87ddca..dd253df61cf 100644 --- a/site/_posts/2017-09-19-0.7.0-release.md +++ b/site/_posts/2017-09-19-0.7.0-release.md @@ -1,7 +1,7 @@ --- layout: post title: "Apache Arrow 0.7.0 Release" -date: "2017-09-18 00:00:00 -0400" +date: "2017-09-19 00:00:00 -0400" 
author: wesm categories: [release] ---