190 changes: 190 additions & 0 deletions site/_posts/2017-09-19-0.7.0-release.md
@@ -0,0 +1,190 @@
---
layout: post
title: "Apache Arrow 0.7.0 Release"
date: "2017-09-19 00:00:00 -0400"
author: wesm
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

The Apache Arrow team is pleased to announce the 0.7.0 release. It includes
[**133 resolved JIRAs**][1], with many new features and bug fixes across the
various language implementations. The Arrow memory format has remained stable
since the 0.3.x release.

See the [Install Page][2] to learn how to get the libraries for your
platform. The [complete changelog][3] is also available.

We include some highlights from the release in this post.

## New PMC Member: Kouhei Sutou

Since the last release we have added [Kou][4] to the Arrow Project Management
Committee. He is also a PMC member for Apache Subversion, and a major
contributor to many other open source projects.

As an active member of the Ruby community in Japan, Kou has been developing the
GLib-based C bindings for Arrow with associated Ruby wrappers, to enable Ruby
users to benefit from the work that's happening in Apache Arrow.

We are excited to be collaborating with the Ruby community on shared
infrastructure for in-memory analytics and data science.

## Expanded JavaScript (TypeScript) Implementation

[Paul Taylor][5] from the [Falcor][7] and [ReactiveX][6] projects has worked to
expand the JavaScript implementation (which is written in TypeScript), using
the latest in modern JavaScript build and packaging technology. We are looking
forward to building out the JS implementation and bringing it to feature
parity with the C++ and Java implementations.

We are looking for more JavaScript developers to join the project and work
together to make Arrow for JS work well with many kinds of front end use cases,
like real time data visualization.

## Type casting for C++ and Python

As part of longer-term efforts to build an Arrow-native in-memory analytics
library, we implemented a variety of type conversion functions. These functions
are essential in ETL tasks when conforming one table schema to another. These
are similar to the `astype` function in NumPy.

```python
In [17]: import pyarrow as pa

In [18]: arr = pa.array([True, False, None, True])

In [19]: arr
Out[19]:
<pyarrow.lib.BooleanArray object at 0x7ff6fb069b88>
[
True,
False,
NA,
True
]

In [20]: arr.cast(pa.int32())
Out[20]:
<pyarrow.lib.Int32Array object at 0x7ff6fb0383b8>
[
1,
0,
NA,
1
]
```

Over time these will expand to support as many input-and-output type
combinations as possible, with optimized conversions.

## New Arrow GPU (CUDA) Extension Library for C++

To help with GPU-related projects using Arrow, like the [GPU Open Analytics
Initiative][8], we have started a C++ add-on library to simplify Arrow memory
management on CUDA-enabled graphics cards. We would like to expand this to
include a library of reusable CUDA kernel functions for GPU analytics on Arrow
columnar memory.

For example, we could write a record batch from CPU memory to GPU device memory
like so (some error checking omitted):

```c++
#include <arrow/api.h>
#include <arrow/gpu/cuda_api.h>

using namespace arrow;

gpu::CudaDeviceManager* manager;
std::shared_ptr<gpu::CudaContext> context;

// Get the device manager, then a context for the chosen GPU (kGpuNumber)
gpu::CudaDeviceManager::GetInstance(&manager);
manager->GetContext(kGpuNumber, &context);

std::shared_ptr<RecordBatch> batch = GetCpuData();

// Serialize the CPU record batch into a buffer in GPU device memory
std::shared_ptr<gpu::CudaBuffer> device_serialized;
gpu::SerializeRecordBatch(*batch, context.get(), &device_serialized);
```

We can then "read" the GPU record batch, but the returned `arrow::RecordBatch`
internally will contain GPU device pointers that you can use for CUDA kernel
calls:

```c++
std::shared_ptr<RecordBatch> device_batch;
gpu::ReadRecordBatch(batch->schema(), device_serialized,
                     default_memory_pool(), &device_batch);

// Now run some CUDA kernels on device_batch
```

## Decimal Integration Tests

[Phillip Cloud][9] has been working on decimal support in C++ to enable Parquet
read/write support in C++ and Python, and also end-to-end testing against the
Arrow Java libraries.

In the upcoming releases, we hope to complete the remaining data types that
need end-to-end testing between Java and C++:

* Fixed size lists (variable-size lists already implemented)
* Fixed size binary
* Unions
* Maps
* Time intervals

## Other Notable Python Changes

Some highlights of Python development outside of bug fixes and general API
improvements include:

* Simplified `put` and `get` of arbitrary Python objects in the Plasma object store
* [High-speed, memory efficient object serialization][10]. This is important
enough that we will likely write a dedicated blog post about it.
* New `flavor='spark'` option to `pyarrow.parquet.write_table` to enable easy
writing of Parquet files maximized for Spark compatibility
* `parquet.write_to_dataset` function with support for partitioned writes
* Improved support for Dask filesystems
* Improved Python usability for IPC: read and write schemas and record batches
more easily. See the [API docs][11] for more about these.

## The Road Ahead

Upcoming Arrow releases will continue to expand the project to cover more use
cases. In addition to completing end-to-end testing for all the major data
types, some of us will be shifting attention to building Arrow-native in-memory
analytics libraries.

We are looking for more JavaScript, R, and other programming language
developers to join the project and expand the available implementations and
bindings to more languages.

[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.7.0
[2]: http://arrow.apache.org/install
[3]: http://arrow.apache.org/release/0.7.0.html
[4]: https://github.com/kou
[5]: https://github.com/trxcllnt
[6]: http://reactivex.io
[7]: https://github.com/netflix/falcor
[8]: http://gpuopenanalytics.com/
[9]: http://github.com/cpcloud
[10]: http://arrow.apache.org/docs/python/ipc.html
[11]: http://arrow.apache.org/docs/python/api.html
2 changes: 2 additions & 0 deletions site/_release/index.md
@@ -26,6 +26,7 @@ limitations under the License.

Navigate to the release page for downloads and the changelog.

* [0.7.0 (17 September 2017)][8]
* [0.6.0 (14 August 2017)][7]
* [0.5.0 (23 July 2017)][6]
* [0.4.1 (9 June 2017)][5]
@@ -41,3 +42,4 @@ Navigate to the release page for downloads and the changelog.
[5]: {{ site.baseurl }}/release/0.4.1.html
[6]: {{ site.baseurl }}/release/0.5.0.html
[7]: {{ site.baseurl }}/release/0.6.0.html
[8]: {{ site.baseurl }}/release/0.7.0.html
27 changes: 19 additions & 8 deletions site/index.html
@@ -7,26 +7,37 @@ <h1>Apache Arrow</h1>
<p class="lead">Powering Columnar In-Memory Analytics</p>
<p>
<a class="btn btn-lg btn-success" href="mailto:dev-subscribe@arrow.apache.org" role="button">Join Mailing List</a>
<a class="btn btn-lg btn-primary" href="{{ site.baseurl }}/install/" role="button">Install (0.6.0 Release - August 14, 2017)</a>
<a class="btn btn-lg btn-primary" href="{{ site.baseurl }}/install/" role="button">Install (0.7.0 Release - September 17, 2017)</a>
</p>
</div>
<h4><strong>Latest News</strong>: <a href="{{ site.baseurl }}/blog/">Apache Arrow 0.6.0 release</a></h4>
<h4><strong>Latest News</strong>: <a href="{{ site.baseurl }}/blog/">Apache Arrow 0.7.0 release</a></h4>
<div class="row">
<div class="col-lg-4">
<h2>Fast</h2>
<p>Apache Arrow&#8482; enables execution engines to take advantage of the latest SIM
D (Single input multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout of data also allows for a better use of CPU caches by placing all data relevant to a column operation in as compact of a format
as possible.</p>
<p>Apache Arrow&#8482; enables execution engines to take advantage of
the latest SIMD (Single instruction, multiple data) operations included in modern
processors, for native vectorized optimization of analytical data
processing. Columnar layout is optimized for data locality for better
performance on modern hardware like CPUs and GPUs.</p>

<p>The Arrow memory format supports <strong>zero-copy reads</strong>
for lightning-fast data access without serialization overhead.</p>

</div>
<div class="col-lg-4">
<h2>Flexible</h2>
<p>Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. Java, C, C++, Python, Ruby, and JavaScript implementations are in progress and more languages are welcome.</p>
<p>Arrow acts as a new high-performance interface between various
systems. It is also focused on supporting a wide variety of
industry-standard programming languages. Java, C, C++, Python, Ruby,
and JavaScript implementations are in progress and more languages are
welcome.</p>
</div>
<div class="col-lg-4">
<h2>Standard</h2>
<p>Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it the de-facto standard for columnar in-memory analytics.</p>
<p>Apache Arrow is backed by key developers of 13 major open source
projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis,
Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm making it
the de-facto standard for columnar in-memory analytics.</p>
</div>
</div> <!-- close "row" div -->

@@ -41,7 +52,7 @@ <h2>Advantages of a Common Data Layer</h2>
<img src="img/copy2.png" alt="common data layer" style="width:100%" />
<ul>
<li>Each system has its own internal memory format</li>
<li>70-80% CPU wasted on serialization and deserialization</li>
<li>70-80% computation wasted on serialization and deserialization</li>
<li>Similar functionality implemented in multiple projects</li>
</ul>
</div>
32 changes: 16 additions & 16 deletions site/install.md
@@ -20,17 +20,17 @@ limitations under the License.
{% endcomment %}
-->

## Current Version: 0.6.0
## Current Version: 0.7.0

### Released: 14 August 2017
### Released: 17 September 2017

See the [release notes][10] for more about what's new.

### Source release

* **Source Release**: [apache-arrow-0.6.0.tar.gz][6]
* **Verification**: [md5][3], [asc][7]
* [Git tag b173334][2]
* **Source Release**: [apache-arrow-0.7.0.tar.gz][6]
* **Verification**: [sha512][3], [asc][7]
* [Git tag 97f9029][2]

### Java Packages

@@ -52,19 +52,19 @@ Install them with:


```shell
conda install arrow-cpp=0.6.* -c conda-forge
conda install pyarrow==0.6.* -c conda-forge
conda install arrow-cpp=0.7.* -c conda-forge
conda install pyarrow==0.7.* -c conda-forge
```

### Python Wheels on PyPI (Unofficial)

We have provided binary wheels on PyPI for Linux, macOS, and Windows:

```shell
pip install pyarrow==0.6.*
pip install pyarrow==0.7.*
```

We recommend pinning `0.6.*` in `requirements.txt` to install the latest patch
We recommend pinning `0.7.*` in `requirements.txt` to install the latest patch
release.

These include the Apache Arrow and Apache Parquet C++ binary libraries bundled
@@ -149,13 +149,13 @@ conda install arrow-cpp -c twosigma
conda install pyarrow -c twosigma
```

[1]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/
[2]: https://github.com/apache/arrow/releases/tag/apache-arrow-0.6.0
[3]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/apache-arrow-0.6.0.tar.gz.md5
[4]: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.arrow%22%20AND%20v%3A%220.6.0%22
[1]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/
[2]: https://github.com/apache/arrow/releases/tag/apache-arrow-0.7.0
[3]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/apache-arrow-0.7.0.tar.gz.sha512
[4]: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.arrow%22%20AND%20v%3A%220.7.0%22
[5]: http://conda-forge.github.io
[6]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/apache-arrow-0.6.0.tar.gz
[7]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.6.0/apache-arrow-0.6.0.tar.gz.asc
[6]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/apache-arrow-0.7.0.tar.gz
[7]: https://www.apache.org/dyn/closer.cgi/arrow/arrow-0.7.0/apache-arrow-0.7.0.tar.gz.asc
[8]: https://github.com/red-data-tools/parquet-glib
[9]: https://github.com/red-data-tools/arrow-packages
[10]: http://arrow.apache.org/release/0.6.0.html
[10]: http://arrow.apache.org/release/0.7.0.html