
Update deps #14

Open

wants to merge 87 commits into base: master
Conversation

@kayibal kayibal commented Aug 5, 2019

This PR updates sparsity to work with the latest pandas version (0.25.0) and the latest Dask version (2.2.0). Be sure to have the latest versions of the filesystem packages installed too.

kayibal and others added 30 commits April 17, 2017 16:34
distributed assign support
Add property columns and index to dask.SparseFrame and increase version to 0.8.0
…ules downgrade to 0.19 until dask releases patch
This is useful mainly to avoid dask processes sharing really big arrays in case the categories get really big
This is useful mainly to avoid dask processes sharing really big arrays in case the categories get really big
was broken if loc returned a single location or integers were used as indexers
kayibal and others added 30 commits April 20, 2018 12:47
Implements to_npz for distributed collection.
Also fixes a small issue with optimized distributed join.
Removes some deprecation warnings by updating calls from _keys() to __dask_keys__(), as well as updating the import from dask.optimize to dask.optimization.

* Update indexer instantiation. Allow loc on index with duplicates.

* Support latest versions of pandas (>=0.23.0)

* Update circleci configuration to v2

* fix indexing error with older scipy versions (<1.0.0)

* Support column indexing in _xs method

* raise error if sparse frame is indexed (__getitem__) with None
This resolves problems that appeared after changing drtools' FileSystems behaviour.

Eventually this should be handled more elegantly. Currently there's some duplicated code which is the same as in filesystem module in drtools. Maybe we should make FileSystems a separate package (opensource) and use it both in sparsity and drtools?
* Raise error when initializing with unaligned indices
Now it detects whether pandas appended 2 description rows at the end
and removes them only if necessary.
Previously original DataFrame's index/columns would be preserved
and passed index/columns would be ignored.

Now passed index/columns are used but a SyntaxWarning is issued.

Fixes #52.
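A minimal sketch of the new precedence, with hypothetical names (the real logic lives in the SparseFrame constructor):

```python
import warnings

# Hypothetical sketch: an explicitly passed index now overrides the source
# DataFrame's own index, and a SyntaxWarning flags that the original labels
# were discarded.
def resolve_index(frame_index, passed_index=None):
    if passed_index is None:
        return frame_index
    warnings.warn("passed index overrides the DataFrame's own index",
                  SyntaxWarning)
    return passed_index

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    labels = resolve_index(["x", "y"], ["a", "b"])
```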
`data` currently can't be a list anyway. Its `.shape` attribute is used
at the very beginning of init method, so it has to be array-like.
- column names are preserved in groupby_agg
- when groupby_agg is used with Multiindex and level=, resulting
index has values only for specified level
- when grouping by column, this column is not present in result

Fixes #58.
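The three fixed behaviours mirror plain-pandas groupby semantics, which can be illustrated directly (a small example, not the sparsity code itself):

```python
import pandas as pd

# Plain-pandas semantics the groupby_agg fix aligns with: the grouping
# column moves into the index (so it is absent from the result columns),
# and aggregated columns keep their original names.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
result = df.groupby("key").sum()
```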
* More info in setup.py

* Fix link in readme.
* enable tracking on documentation page
* Update documentation link.
* Implement distributed groupby sum and apply_concat_apply function for SparseFrame

* add test for different index datatypes

* implement sort_index

* implement __len__

* implement rename, optimize groupby_sum and join

implements distributed rename method and adds quicker routines to groupby_sum if divisions are known. Adds support for joining sp.SparseFrames onto a distributed SparseFrame.

* implement distributed set_index

* Number of lines output in __repr__ changed.

* Create folders when writing to local filesystem

* Fix empty dtype

* Implement distributed drop.

* Always add npz extension when writing SparseFrame to npz format

* Fix metadata handling on set_index method

* Add method for dask SparseFrame and tuple divisions type

* Support empty divisions

* Pass on divisions on sort_index

* More restrictive pandas version requirement, as the .drop method fails with pandas==0.20.3

* Fix bug where empty dataframe would create wrongly sized shuffle array

* Fix bug where join with in memory sparse frame would return rows from meta_nonempty

* Update dask version in setup.py

* Update deprecated set_options call

* Fix moto and boto versions

* Update test dependencies
Fix behaviour when passing Index to __getitem__
Fixes #74
* Rename io modules to io_ and fix some version conflicts

Numpy 1.16.* is not compatible with sparsity 0.20.*, thus we need to fix
the setup.py. When using scipy<1.0.0, empty column access does not work,
so that dependency had to be adjusted here as well.
This also renames the io modules to io_ to avoid shadowing Python's
internal io module.

* Fix incompatibility with numpy>=1.16.0, potential security issue.

Due to a security issue (CVE-2019-6446) numpy changed the default value
of allow_pickle in np.load to False, which led to errors when reading
sparse frames from npz archives. This commit fixes it by explicitly
allowing pickled objects, so reading sparse frames from unknown sources
remains a security risk.
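A minimal sketch of the behaviour described above, assuming an object-dtype array stands in for the pickled metadata inside a sparsity npz archive (the array name `index` is illustrative):

```python
import io
import numpy as np

# Object-dtype arrays are serialized via pickle inside .npz archives, so
# since numpy 1.16.3 (CVE-2019-6446) np.load must opt in explicitly.
buf = io.BytesIO()
index = np.array(["a", "b", "c"], dtype=object)  # object dtype forces pickling
np.savez(buf, index=index)

# Default allow_pickle=False: accessing the pickled array raises ValueError.
buf.seek(0)
pickle_blocked = False
try:
    np.load(buf)["index"]
except ValueError:
    pickle_blocked = True

# The fix: opt in explicitly, accepting the pickle security trade-off.
buf.seek(0)
restored = np.load(buf, allow_pickle=True)["index"]
```

The trade-off is exactly the one the commit message notes: `allow_pickle=True` restores reading, but archives from unknown sources can execute arbitrary code on load.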
* Add support for dask persist

This adds support for dask persist method.

* Test persist functionality

* PRETTY rename import
* Check for type of meta in `apply_and_enforce`

It was possible that, although the computed type is SparseFrame, another
type is returned (if meta was not a SparseFrame).

Imports are not changed, just reorganized.

* Simple __getitem__ for dask SparseFrames.

Support for dsp[index] syntax. Doesn't aim to work the same as in
pandas, just maps __getitem__ onto partitions.
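A toy sketch of that design, with entirely hypothetical names (the real implementation maps over dask partitions; here partitions are just plain dicts):

```python
from operator import getitem

# Hypothetical stand-in for a distributed SparseFrame: a list of
# per-partition dicts. dsp[key] simply forwards the key to each
# partition's __getitem__ rather than emulating pandas label logic.
class ToyDistributedFrame:
    def __init__(self, partitions):
        self.partitions = partitions

    def map_partitions(self, func, *args):
        # apply func to every partition, keeping the partitioned shape
        return ToyDistributedFrame([func(p, *args) for p in self.partitions])

    def __getitem__(self, key):
        return self.map_partitions(getitem, key)

dsp = ToyDistributedFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
selected = dsp["a"]
```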

* Add getitem test with empty frame

* todense() returns Series when there is one empty column

Previously it returned a DataFrame, even though a 1-column non-empty
SparseFrame returned a Series.

Imports are only re-organized.

* Add .todense() method to Dask SparseFrame

It works by mapping SparseFrame.todense onto partitions.
It was necessary to allow `map_partitions` to return types other
than SparseFrame, so the kwarg `cls` was added. This implies that one
cannot pass `cls` as a keyword argument to the mapped function (because
it will be consumed by `map_partitions` and not passed on).
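The `cls` caveat can be sketched with a simplified, hypothetical `map_partitions` (not the actual sparsity signature), showing how the kwarg is consumed before the mapped function ever sees it:

```python
# Hypothetical simplification: map_partitions pops `cls` for its own use
# (choosing the output container), so `cls` can never reach the mapped
# function as a keyword argument.
def map_partitions(func, partitions, **kwargs):
    cls = kwargs.pop("cls", list)  # consumed here, not forwarded
    return cls(func(p, **kwargs) for p in partitions)

out = map_partitions(len, [[1, 2], [3]], cls=tuple)
```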

* Support reindex in case of empty frame

* More elegant way to implement todense function. (#80)

This leverages the dask.delayed object API to achieve the same result,
which was previously a hack between map_partitions and initializing
dd.DataFrame directly.
Turns out the bug #76 was already fixed in commit
e8fa03f
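A stdlib-only sketch of the delayed pattern (the real commit uses dask.delayed and feeds the per-partition results into a dask DataFrame; every name below is illustrative):

```python
# Stand-in for dask.delayed: capture each per-partition call lazily,
# then "compute" by running every captured call.
def delayed(func, *args):
    return lambda: func(*args)

def todense_partition(rows):
    # stand-in for SparseFrame.todense on one partition
    return [dict(r) for r in rows]

partitions = [[{"a": 1}], [{"a": 2}, {"a": 3}]]
tasks = [delayed(todense_partition, p) for p in partitions]  # lazy graph
dense = [task() for task in tasks]  # compute all partitions
```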
* Bugfix: sf['missing column'] raises KeyError

Previously it returned last column.

* Add test for dask version
This change adds support for pandas>0.23, including 0.24 and 0.25.0
Other cases import a name without underscore.
There is no such argument in pandas 0.23.4
3 participants