ARROW-1252: [Website] Updates for 0.5.0 and short blog post summarizing the release

Also updated the CHANGELOG.md

Author: Wes McKinney <[email protected]>

Closes apache#885 from wesm/ARROW-1252 and squashes the following commits:

e603f388 [Wes McKinney] Fix up markdown formatting of underscores
797215b2 [Wes McKinney] Release announcement blog post
3babc7b4 [Wes McKinney] Add release page
3fd41e11 [Wes McKinney] First cut revising install page
b8416ee5 [Wes McKinney] Add changelog to CHANGELOG.md
3f9dec05 [Wes McKinney] Start on 0.5.0 website updates
wesm committed Jul 25, 2017
1 parent ad8db9d commit 540ca92
Showing 7 changed files with 553 additions and 93 deletions.
274 changes: 206 additions & 68 deletions CHANGELOG.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions dev/make_changelog.py
@@ -74,6 +74,7 @@ def format_changelog_website(issues, out):
CATEGORIES = {
'New Feature': NEW_FEATURE,
'Improvement': NEW_FEATURE,
'Wish': NEW_FEATURE,
'Task': NEW_FEATURE,
'Test': NEW_FEATURE,
'Bug': BUGFIX
114 changes: 114 additions & 0 deletions site/_posts/2017-07-24-0.5.0-release.md
@@ -0,0 +1,114 @@
---
layout: post
title: "Apache Arrow 0.5.0 Release"
date: "2017-07-25 00:00:00 -0400"
author: wesm
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

The Apache Arrow team is pleased to announce the 0.5.0 release. It includes
[**130 resolved JIRAs**][1] with some new features, expanded integration
testing between implementations, and bug fixes. The Arrow memory format has
remained stable since the 0.3.x and 0.4.x releases.

See the [Install Page][2] to learn how to get the libraries for your
platform. The [complete changelog][5] is also available.

## Expanded Integration Testing

In this release, we added compatibility tests for dictionary-encoded data
between Java and C++. This enables the distinct values (the *dictionary*) in a
vector to be transmitted as part of an Arrow schema, while the record batches
contain integer indices that refer into that dictionary.

So we might have:

```
data (string): ['foo', 'bar', 'foo', 'bar']
```

In dictionary-encoded form, this could be represented as:

```
indices (int8): [0, 1, 0, 1]
dictionary (string): ['foo', 'bar']
```
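
The same idea can be seen from Python. The sketch below uses pyarrow's
`dictionary_encode()` method; the call names reflect later pyarrow releases and
are shown only as an illustration, not the exact 0.5.0 API.

```
import pyarrow as pa

# A plain string array with repeated values
data = pa.array(['foo', 'bar', 'foo', 'bar'])

# Dictionary-encode it: the distinct values become the dictionary,
# and the array itself becomes small integer indices into it
encoded = data.dictionary_encode()

print(encoded.indices)     # [0, 1, 0, 1]
print(encoded.dictionary)  # ['foo', 'bar']
```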

In upcoming releases, we plan to complete integration testing for the remaining
data types (including more complicated types like unions and decimals) on the
road to a future 1.0.0 release.

## C++ Activity

We completed a number of significant pieces of work in the C++ part of Apache
Arrow.

### Using jemalloc as default memory allocator

We decided to use [jemalloc][4] as the default memory allocator unless it is
explicitly disabled. This memory allocator has significant performance
advantages in Arrow workloads over the default `malloc` implementation. We will
publish a blog post going into more detail about this and why you might care.
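
From Python, you can inspect which allocator backs the default memory pool. The
sketch below assumes a recent pyarrow build; `backend_name` and the jemalloc
pool are only present in builds compiled with that support and may not exist in
0.5.0.

```
import pyarrow as pa

# The default pool backs allocations made through pyarrow unless a
# different pool is passed explicitly.
pool = pa.default_memory_pool()
print(pool.backend_name)           # e.g. 'jemalloc' when it is enabled

# Total bytes currently allocated through Arrow's memory pools
print(pa.total_allocated_bytes())
```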

### Sharing more C++ code with Apache Parquet

We imported the compression library interfaces and dictionary encoding
algorithms from the [Apache Parquet C++ library][3]. The Parquet library now
depends on this code in Arrow, and we will be able to use it more easily for
data compression in Arrow use cases.

As part of incorporating Parquet's dictionary encoding utilities, we have
developed an `arrow::DictionaryBuilder` class to enable building
dictionary-encoded arrays iteratively. This can help save memory and yield
better performance when interacting with databases, Parquet files, or other
sources whose columns may contain many duplicate values.
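
The C++ class itself is not shown here, but the idea behind an incremental
dictionary builder is easy to sketch: keep a value-to-index map and emit an
index for each incoming value. The plain-Python toy below is a conceptual
illustration only, not the `arrow::DictionaryBuilder` API.

```
class ToyDictionaryBuilder:
    """Conceptual stand-in for an incremental dictionary builder."""

    def __init__(self):
        self.dictionary = []   # distinct values, in first-seen order
        self.indices = []      # encoded output
        self._positions = {}   # value -> position in self.dictionary

    def append(self, value):
        pos = self._positions.get(value)
        if pos is None:
            pos = len(self.dictionary)
            self._positions[value] = pos
            self.dictionary.append(value)
        self.indices.append(pos)


builder = ToyDictionaryBuilder()
for value in ['foo', 'bar', 'foo', 'bar']:
    builder.append(value)

print(builder.indices)     # [0, 1, 0, 1]
print(builder.dictionary)  # ['foo', 'bar']
```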

### Support for LZ4 and ZSTD compressors

We added LZ4 and ZSTD compression library support. In ARROW-300 and other
planned work, we intend to add some compression features for data sent via RPC.
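
For the curious, the codecs are also reachable from Python. The snippet below
is a rough sketch assuming a pyarrow build with LZ4 and ZSTD enabled; the
`compress`/`decompress` helpers shown here come from later pyarrow releases.

```
import pyarrow as pa

payload = b'arrow' * 1000

for codec in ('lz4', 'zstd'):
    compressed = pa.compress(payload, codec=codec, asbytes=True)
    # decompress() needs the original size when it is not stored in the frame
    restored = pa.decompress(compressed, decompressed_size=len(payload),
                             codec=codec, asbytes=True)
    assert restored == payload
    print(codec, len(payload), '->', len(compressed))
```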

## Python Activity

We fixed many bugs affecting Parquet and Feather users and smoothed several
other rough edges in normal Arrow use. We also added some additional
Arrow type conversions: structs, lists embedded in pandas objects, and Arrow
time types (which deserialize to the `datetime.time` type).
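
A quick illustration of two of these conversions, as they behave in current
pyarrow (shown as a sketch rather than the exact 0.5.0 surface):

```
import datetime
import pandas as pd
import pyarrow as pa

# Lists embedded in a pandas column convert to an Arrow list type
lists = pa.array(pd.Series([[1, 2], [3], None]))
print(lists.type)         # list<item: int64>

# Python time objects map to an Arrow time type and round-trip back
times = pa.array([datetime.time(12, 30), datetime.time(23, 59)])
print(times.type)         # time64[us]
print(times.to_pylist())  # [datetime.time(12, 30), datetime.time(23, 59)]
```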

In upcoming releases we plan to continue to improve [Dask][7] support and
performance for distributed processing of Apache Parquet files with pyarrow.

## The Road Ahead

We have much work ahead of us to build out Arrow integrations in other data
systems, improving their processing performance and their interoperability with
one another.

We are discussing the roadmap to a future 1.0.0 release on the [developer
mailing list][6]. Please join the discussion there.

[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.5.0
[2]: http://arrow.apache.org/install
[3]: http://github.com/apache/parquet-cpp
[4]: https://github.com/jemalloc/jemalloc
[5]: http://arrow.apache.org/release/0.5.0.html
[6]: http://mail-archives.apache.org/mod_mbox/arrow-dev/
[7]: http://github.com/dask/dask