ARROW-1353: [Website] Update website for 0.6.0 release and add short release blog post

Author: Wes McKinney <[email protected]>

Closes apache#967 from wesm/ARROW-1353 and squashes the following commits:

804fe35 [Wes McKinney] Escape underscores in CHANGELOG.md
1b7c4b6 [Wes McKinney] Finish 0.6.0 blog post
a78cb94 [Wes McKinney] Some updates for 0.6.0 site update
wesm committed Aug 16, 2017
1 parent 0faa17c commit 5bf07cf
Showing 6 changed files with 399 additions and 21 deletions.
110 changes: 108 additions & 2 deletions CHANGELOG.md
@@ -17,13 +17,119 @@
under the License.
-->

# Apache Arrow 0.6.0 (14 August 2017)

## Bug

* ARROW-1192 - [JAVA] Improve splitAndTransfer performance for List and Union vectors
* ARROW-1195 - [C++] CpuInfo doesn't get cache size on Windows
* ARROW-1204 - [C++] lz4 ExternalProject fails in Visual Studio 2015
* ARROW-1225 - [Python] pyarrow.array does not attempt to convert bytes to UTF8 when passed a StringType
* ARROW-1237 - [JAVA] Expose the ability to set lastSet
* ARROW-1239 - issue with current version of git-commit-id-plugin
* ARROW-1240 - security: upgrade logback to address CVE-2017-5929
* ARROW-1242 - [Java] security - upgrade Jackson to mitigate 3 CVE vulnerabilities
* ARROW-1245 - [Integration] Java Integration Tests Disabled
* ARROW-1248 - [Python] C linkage warnings in Clang with public Cython API
* ARROW-1249 - [JAVA] Expose the fillEmpties function from Nullable<Varlength>Vector.mutator
* ARROW-1263 - [C++] CpuInfo should be able to get CPU features on Windows
* ARROW-1265 - [Plasma] Plasma store memory leak warnings in Python test suite
* ARROW-1267 - [Java] Handle zero length case in BitVector.splitAndTransfer
* ARROW-1269 - [Packaging] Add Windows wheel build scripts from ARROW-1068 to arrow-dist
* ARROW-1275 - [C++] Default static library prefix for Snappy should be "\_static"
* ARROW-1276 - Cannot serializer empty DataFrame to parquet
* ARROW-1283 - [Java] VectorSchemaRoot should be able to be closed() more than once
* ARROW-1285 - PYTHON: NotImplemented exception creates empty parquet file
* ARROW-1287 - [Python] Emulate "whence" argument of seek in NativeFile
* ARROW-1290 - [C++] Use array capacity doubling in arrow::BufferBuilder
* ARROW-1291 - [Python] `pa.RecordBatch.from_pandas` doesn't accept DataFrame with numeric column names
* ARROW-1294 - [C++] New Appveyor build failures
* ARROW-1296 - [Java] templates/FixValueVectors reset() method doesn't set allocationSizeInBytes correctly
* ARROW-1300 - [JAVA] Fix ListVector Tests
* ARROW-1306 - [Python] Encoding? issue with error reporting for `parquet.read_table`
* ARROW-1308 - [C++] ld tries to link `arrow_static` even when -DARROW_BUILD_STATIC=off
* ARROW-1309 - [Python] Error inferring List type in `Array.from_pandas` when inner values are all None
* ARROW-1310 - [JAVA] Revert ARROW-886
* ARROW-1312 - [C++] Set default value to `ARROW_JEMALLOC` to OFF until ARROW-1282 is resolved
* ARROW-1326 - [Python] Fix Sphinx build in Travis CI
* ARROW-1327 - [Python] Failing to release GIL in `MemoryMappedFile._open` causes deadlock
* ARROW-1328 - [Python] `pyarrow.Table.from_pandas` option `timestamps_to_ms` changes column values
* ARROW-1330 - [Plasma] Turn on plasma tests on manylinux1
* ARROW-1335 - [C++] `PrimitiveArray::raw_values` has inconsistent semantics re: offsets compared with subclasses
* ARROW-1338 - [Python] Investigate non-deterministic core dump on Python 2.7, Travis CI builds
* ARROW-1340 - [Java] NullableMapVector field doesn't maintain metadata
* ARROW-1342 - [Python] Support strided array of lists
* ARROW-1343 - [Format/Java/C++] Ensuring encapsulated stream / IPC message sizes are always a multiple of 8
* ARROW-1350 - [C++] Include Plasma source tree in source distribution
* ARROW-187 - [C++] Decide on how pedantic we want to be about exceptions
* ARROW-276 - [JAVA] Nullable Value Vectors should extend BaseValueVector instead of BaseDataValueVector
* ARROW-573 - [Python/C++] Support ordered dictionaries data, pandas Categorical
* ARROW-884 - [C++] Exclude internal classes from documentation
* ARROW-932 - [Python] Fix compiler warnings on MSVC
* ARROW-968 - [Python] RecordBatch [i:j] syntax is incomplete

## Improvement

* ARROW-1093 - [Python] Fail Python builds if flake8 yields warnings
* ARROW-1121 - [C++] Improve error message when opening OS file fails
* ARROW-1140 - [C++] Allow optional build of plasma
* ARROW-1149 - [Plasma] Create Cython client library for Plasma
* ARROW-1173 - [Plasma] Blog post for Plasma
* ARROW-1211 - [C++] Consider making `default_memory_pool()` the default for builder classes
* ARROW-1213 - [Python] Enable s3fs to be used with ParquetDataset and reader/writer functions
* ARROW-1219 - [C++] Use more vanilla Google C++ formatting
* ARROW-1224 - [Format] Clarify language around buffer padding and alignment in IPC
* ARROW-1230 - [Plasma] Install libraries and headers
* ARROW-1243 - [Java] security: upgrade all libraries to latest stable versions
* ARROW-1251 - [Python/C++] Revise build documentation to account for latest build toolchain
* ARROW-1253 - [C++] Use pre-built toolchain libraries where prudent to speed up CI builds
* ARROW-1255 - [Plasma] Check plasma flatbuffer messages with the flatbuffer verifier
* ARROW-1257 - [Plasma] Plasma documentation
* ARROW-1258 - [C++] Suppress dlmalloc warnings on Clang
* ARROW-1259 - [Plasma] Speed up Plasma tests
* ARROW-1260 - [Plasma] Use factory method to create Python PlasmaClient
* ARROW-1264 - [Plasma] Don't exit the Python interpreter if the plasma client can't connect to the store
* ARROW-1274 - [C++] `add_compiler_export_flags()` throws warning with CMake >= 3.3
* ARROW-1288 - Clean up many ASF license headers
* ARROW-1289 - [Python] Add `PYARROW_BUILD_PLASMA` option like Parquet
* ARROW-1301 - [C++/Python] Add remaining supported libhdfs UNIX-like filesystem APIs
* ARROW-1303 - [C++] Support downloading Boost
* ARROW-1315 - [GLib] Status check of arrow::ArrayBuilder::Finish() is missing
* ARROW-1323 - [GLib] Add `garrow_boolean_array_get_values()`
* ARROW-1333 - [Plasma] Sorting example for DataFrames in plasma
* ARROW-1334 - [C++] Instantiate arrow::Table from vector of Array objects (instead of Columns)

## New Feature

* ARROW-1076 - [Python] Handle nanosecond timestamps more gracefully when writing to Parquet format
* ARROW-1104 - Integrate in-memory object store from Ray
* ARROW-1246 - [Format] Add Map logical type to metadata
* ARROW-1268 - [Website] Blog post on Arrow integration with Spark
* ARROW-1281 - [C++/Python] Add Docker setup for running HDFS tests and other tests we may not run in Travis CI
* ARROW-1305 - [GLib] Add GArrowIntArrayBuilder
* ARROW-1336 - [C++] Add arrow::schema factory function
* ARROW-439 - [Python] Add option in `to_pandas` conversions to yield Categorical from String/Binary arrays
* ARROW-622 - [Python] Investigate alternatives to `timestamps_to_ms` argument in pandas conversion

## Task

* ARROW-1270 - [Packaging] Add Python wheel build scripts for macOS to arrow-dist
* ARROW-1272 - [Python] Add script to arrow-dist to generate and upload manylinux1 Python wheels
* ARROW-1273 - [Python] Add convenience functions for reading only Parquet metadata or effective Arrow schema from a particular Parquet file
* ARROW-1297 - 0.6.0 Release
* ARROW-1304 - [Java] Fix checkstyle checks warning

## Test

* ARROW-1241 - [C++] Visual Studio 2017 Appveyor build job

# Apache Arrow 0.5.0 (23 July 2017)

## Bug

* ARROW-1074 - `from_pandas` doesnt convert ndarray to list
* ARROW-1079 - [Python] Empty "private" directories should be ignored by Parquet interface
* ARROW-1081 - C++: arrow::test::TestBase::MakePrimitive doesn't fill `null_bitmap`
* ARROW-1096 - [C++] Memory mapping file over 4GB fails on Windows
* ARROW-1097 - Reading tensor needs file to be opened in writeable mode
* ARROW-1098 - Document Error?
112 changes: 112 additions & 0 deletions site/_posts/2017-08-16-0.6.0-release.md
@@ -0,0 +1,112 @@
---
layout: post
title: "Apache Arrow 0.6.0 Release"
date: "2017-08-16 00:00:00 -0400"
author: wesm
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

The Apache Arrow team is pleased to announce the 0.6.0 release. It includes
[**90 resolved JIRAs**][1], introducing the new Plasma shared memory object
store along with improvements and bug fixes across the language
implementations. The Arrow memory format has remained stable since the 0.3.x
release.

See the [Install Page][2] to learn how to get the libraries for your
platform. The [complete changelog][5] is also available.

## Plasma Shared Memory Object Store

This release includes the [Plasma Store][7], which you can read more about in
the linked blog post. This system was originally developed as part of the [Ray
Project][8] at the [UC Berkeley RISELab][9]. We recognized that Plasma would be
highly valuable to the Arrow community as a tool for shared memory management
and zero-copy deserialization. Additionally, we believe we will be able to
develop a stronger software stack through sharing of IO and buffer management
code.

The Plasma store is a server application which runs as a separate process. A
reference C++ client, with Python bindings, is made available in this
release. Clients can be developed in Java or other languages in the future to
enable simple sharing of complex datasets through shared memory.

## Arrow Format Addition: Map type

We added a Map logical type to represent ordered and unordered maps in
memory. This corresponds to the `MAP` logical type annotation in the Parquet
format (where maps are represented as repeated structs).

Map is represented as a list of structs. It is the first example of a logical
type whose physical representation is a nested type. We have not yet created
Map containers in any of the language implementations, but this can be done in
a future release.

As an example, the Python data:

```
data = [{'a': 1, 'bb': 2, 'cc': 3}, {'dddd': 4}]
```

could be represented in an Arrow `Map<String, Int32>` as:

```
Map<String, Int32> = List<Struct<keys: String, values: Int32>>
  is_valid: [true, true]
  offsets: [0, 3, 4]
  values: Struct<keys: String, values: Int32>
    children:
      - keys: String
          is_valid: [true, true, true, true]
          offsets: [0, 1, 3, 5, 9]
          data: abbccdddd
      - values: Int32
          is_valid: [true, true, true, true]
          data: [1, 2, 3, 4]
```
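
The offsets in this layout can be derived mechanically from the Python data. The following is a minimal sketch in plain Python, not part of any Arrow library: `to_map_layout` is a hypothetical helper name, and it assumes every key and value is non-null (so the `is_valid` bitmaps are all true) and that dict insertion order is preserved (Python 3.7+).

```python
def to_map_layout(rows):
    """Flatten a list of dicts into list-of-structs style buffers.

    Hypothetical illustration only: nulls are not handled, so every
    is_valid entry would be true.
    """
    list_offsets = [0]      # where each map starts/ends in the struct child
    key_offsets = [0]       # byte offsets into the concatenated key data
    key_data = bytearray()  # UTF-8 bytes of all keys, back to back
    values = []             # Int32 child: one value per map entry

    for row in rows:
        for key, value in row.items():
            key_data.extend(key.encode("utf-8"))
            key_offsets.append(len(key_data))
            values.append(value)
        # After each row, record how many entries have been written so far.
        list_offsets.append(len(values))

    return {
        "offsets": list_offsets,
        "keys": {"offsets": key_offsets, "data": bytes(key_data)},
        "values": values,
    }


data = [{'a': 1, 'bb': 2, 'cc': 3}, {'dddd': 4}]
layout = to_map_layout(data)
print(layout["offsets"])          # [0, 3, 4]
print(layout["keys"]["offsets"])  # [0, 1, 3, 5, 9]
print(layout["keys"]["data"])     # b'abbccdddd'
print(layout["values"])           # [1, 2, 3, 4]
```

Reading the result back works the same way in reverse: map `i` owns the struct entries from `offsets[i]` up to `offsets[i + 1]`, and key `j` is the byte slice between `keys.offsets[j]` and `keys.offsets[j + 1]`.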

## Python Changes

Some highlights of Python development outside of bug fixes and general API
improvements include:

* New `strings_to_categorical=True` option when calling `Table.to_pandas`, which
  yields pandas `Categorical` types from Arrow binary and string columns
* Expanded Hadoop Filesystem (HDFS) functionality to improve compatibility with
  Dask and other HDFS-aware Python libraries
* s3fs and other Dask-oriented filesystems can now be used with
  `pyarrow.parquet.ParquetDataset`
* More graceful handling of pandas nanosecond timestamps when writing to
  Parquet format. You can now pass `coerce_timestamps='ms'` to cast to
  milliseconds, or `'us'` for microseconds.
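
To make the last item concrete, casting a nanosecond timestamp to a coarser unit amounts to dividing by a fixed factor. The sketch below is plain Python for illustration only and is not pyarrow's implementation: `coerce_timestamps` here is a hypothetical standalone function, and it simply reports whether any sub-unit precision would be discarded. How pyarrow itself treats lost precision is not covered here.

```python
# Nanoseconds per target unit, matching the 'ms' and 'us' choices named above.
NANOS_PER_UNIT = {"ms": 1_000_000, "us": 1_000}


def coerce_timestamps(values_ns, unit):
    """Cast nanosecond epoch timestamps to `unit` by integer division.

    Returns the coerced values and a flag that is True when at least one
    value had a non-zero remainder, i.e. precision was dropped.
    """
    if unit not in NANOS_PER_UNIT:
        raise ValueError("unit must be 'ms' or 'us', got %r" % (unit,))
    factor = NANOS_PER_UNIT[unit]
    coerced = [v // factor for v in values_ns]
    lossy = any(v % factor != 0 for v in values_ns)
    return coerced, lossy


ts_ns = [1502841600000000000, 1502841600123456789]
as_ms, ms_lossy = coerce_timestamps(ts_ns, "ms")
as_us, us_lossy = coerce_timestamps(ts_ns, "us")
print(as_ms, ms_lossy)  # [1502841600000, 1502841600123] True
print(as_us, us_lossy)  # [1502841600000000, 1502841600123456] True
```

The second timestamp carries nanosecond digits that neither target unit can hold, which is why both calls report a lossy cast.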

## Toward Arrow 1.0.0 and Beyond

We are still discussing the roadmap to the 1.0.0 release on the [developer
mailing list][6]. The focus of the 1.0.0 release will likely be memory format
stability and hardening integration tests across the remaining data types
implemented in Java and C++. Please join the discussion there.

[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.6.0
[2]: http://arrow.apache.org/install
[3]: http://github.com/apache/parquet-cpp
[5]: http://arrow.apache.org/release/0.6.0.html
[6]: http://mail-archives.apache.org/mod_mbox/arrow-dev/
[7]: http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
[8]: https://ray-project.github.io/ray/
[9]: https://rise.cs.berkeley.edu/