
Update deps #14

Open

wants to merge 87 commits into base: master
Conversation

@kayibal kayibal commented Aug 5, 2019

This PR updates sparsity to work with the latest pandas version (0.25.0) and the latest Dask version (2.2.0). Be sure to have the latest versions of the filesystem packages installed too.

kayibal and others added 30 commits April 17, 2017 16:34
distributed assign support
Add property columns and index to dask.SparseFrame and increase version to 0.8.0
…ules downgrade to 0.19 until dask releases patch
This is useful mainly to avoid dask processes sharing really big arrays in case the categories get really big
This is useful mainly to avoid dask processes sharing really big arrays in case the categories get really big
was broken if loc returned a single location or integers were used as indexers
kayibal and others added 30 commits April 20, 2018 12:47
Implements to_npz for distributed collection.
Also fixes a small issue with optimized distributed join.
Removes some deprecation warnings by updating calls from _keys() to __dask_keys__(), as well as updating the import from dask.optimize to dask.optimization.

* Update indexer instantiation. Allow loc on index with duplicates.

* Support latest versions of pandas (>=0.23.0)

* Update circleci configuration to v2

* fix indexing error with older scipy versions (<1.0.0)

* Support column indexing in _xs method

* raise error if sparse frame is indexed (__getitem__) with None
This resolves problems that appeared after changing drtools' FileSystems behaviour.

Eventually this should be handled more elegantly. Currently there's some duplicated code which is the same as in filesystem module in drtools. Maybe we should make FileSystems a separate package (opensource) and use it both in sparsity and drtools?
* Raise error when initializing with unaligned indices
Now it detects whether pandas appended 2 description rows at the end
and removes them only if necessary.
Previously original DataFrame's index/columns would be preserved
and passed index/columns would be ignored.

Now passed index/columns are used but a SyntaxWarning is issued.

Fixes #52.
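A minimal sketch of the new precedence, with hypothetical names (the real logic lives in the SparseFrame constructor):

```python
import warnings

# Hypothetical sketch: an explicitly passed index now overrides the source
# DataFrame's own index, and a SyntaxWarning flags that the original labels
# were discarded.
def resolve_index(frame_index, passed_index=None):
    if passed_index is None:
        return frame_index
    warnings.warn("passed index overrides the DataFrame's own index",
                  SyntaxWarning)
    return passed_index

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    labels = resolve_index(["x", "y"], ["a", "b"])
```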
`data` currently can't be a list anyway. Its `.shape` attribute is used
at the very beginning of init method, so it has to be array-like.
- column names are preserved in groupby_agg
- when groupby_agg is used with Multiindex and level=, resulting
index has values only for specified level
- when grouping by column, this column is not present in result

Fixes #58.
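The three fixed behaviours mirror plain-pandas groupby semantics, which can be illustrated directly (a small example, not the sparsity code itself):

```python
import pandas as pd

# Plain-pandas semantics the groupby_agg fix aligns with: the grouping
# column moves into the index (so it is absent from the result columns),
# and aggregated columns keep their original names.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
result = df.groupby("key").sum()
```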
* More info in setup.py

* Fix link in readme.
* enable tracking on documentation page
* Update documentation link.
* Implement distributed groupby sum and apply_concat_apply function for SparseFrame

* add test for different index datatypes

* implement sort_index

* implement __len__

* implement rename, optimize groupby_sum and join

implements distributed rename method and adds quicker routines to groupby_sum if divisions are known. Adds support for joining sp.SparseFrames onto a distributed SparseFrame.

* implement distributed set_index

* Number of lines output in __repr__ changed.

* Create folders when writing to local filesystem

* Fix empty dtype

* Implement distributed drop.

* Always add npz extension when writing SparseFrame to npz format

* Fix metadata handling on set_index method

* Add method for dask SparseFrame and tuple divisions type

* Support empty divisions

* Pass on divisions on sort_index

* More restrictive pandas version requirement, as the .drop method fails with pandas==0.20.3

* Fix bug where empty dataframe would create wrongly sized shuffle array

* Fix bug where join with in memory sparse frame would return rows from meta_nonempty

* Update dask version in setup.py

* Update deprecated set_options call

* Fix moto and boto versions

* Update test dependencies
Fix behaviour when passing Index to __getitem__
Fixes #74
* Rename io modules to io_ and fix some version conflicts

Numpy 1.16.* is not compatible with sparsity 0.20.*, thus we need to fix
the setup.py. When using scipy<1.0.0, empty column access does not work,
so that dependency had to be adjusted here as well.
This also renames the io modules to io_ to avoid shadowing Python's
internal io module.

* Fix incompatibility with numpy>=1.16.0, potential security issue.

Due to a security issue (CVE-2019-6446) numpy changed the default value
of allow_pickle in np.load to False, which led to errors when reading
sparse frames from npz archives. This commit fixes it by explicitly
allowing pickled objects, so reading sparse frames from unknown sources
remains a security risk.
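A minimal sketch of the behaviour described above, assuming an object-dtype array stands in for the pickled metadata inside a sparsity npz archive (the array name `index` is illustrative):

```python
import io
import numpy as np

# Object-dtype arrays are serialized via pickle inside .npz archives, so
# since numpy 1.16.3 (CVE-2019-6446) np.load must opt in explicitly.
buf = io.BytesIO()
index = np.array(["a", "b", "c"], dtype=object)  # object dtype forces pickling
np.savez(buf, index=index)

# Default allow_pickle=False: accessing the pickled array raises ValueError.
buf.seek(0)
pickle_blocked = False
try:
    np.load(buf)["index"]
except ValueError:
    pickle_blocked = True

# The fix: opt in explicitly, accepting the pickle security trade-off.
buf.seek(0)
restored = np.load(buf, allow_pickle=True)["index"]
```

The trade-off is exactly the one the commit message notes: `allow_pickle=True` restores reading, but archives from unknown sources can execute arbitrary code on load.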
* Add support for dask persist

This adds support for dask persist method.

* Test persist functionality

* PRETTY rename import
* Check for type of meta in `apply_and_enforce`

It was possible that, although the computed type is SparseFrame, another
type is returned (if meta was not a SparseFrame).

Imports are not changed, just reorganized.

* Simple __getitem__ for dask SparseFrames.

Support for dsp[index] syntax. Doesn't aim to work the same as in
pandas, just maps __getitem__ onto partitions.
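A toy sketch of that design, with entirely hypothetical names (the real implementation maps over dask partitions; here partitions are just plain dicts):

```python
from operator import getitem

# Hypothetical stand-in for a distributed SparseFrame: a list of
# per-partition dicts. dsp[key] simply forwards the key to each
# partition's __getitem__ rather than emulating pandas label logic.
class ToyDistributedFrame:
    def __init__(self, partitions):
        self.partitions = partitions

    def map_partitions(self, func, *args):
        # apply func to every partition, keeping the partitioned shape
        return ToyDistributedFrame([func(p, *args) for p in self.partitions])

    def __getitem__(self, key):
        return self.map_partitions(getitem, key)

dsp = ToyDistributedFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
selected = dsp["a"]
```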

* Add getitem test with empty frame

* todense() returns Series when there is one empty column

Previously it returned a DataFrame, even though a 1-column non-empty
SparseFrame returned a Series.

Imports are only re-organized.

* Add .todense() method to Dask SparseFrame

It works by mapping SparseFrame.todense onto partitions.
It was necessary to allow `map_partitions` to return types other
than SparseFrame, so the kwarg `cls` was added. This implies that one
cannot pass `cls` as a keyword argument to the mapped function (because
it will be consumed by `map_partitions` and not passed on).
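The `cls` caveat can be sketched with a simplified, hypothetical `map_partitions` (not the actual sparsity signature), showing how the kwarg is consumed before the mapped function ever sees it:

```python
# Hypothetical simplification: map_partitions pops `cls` for its own use
# (choosing the output container), so `cls` can never reach the mapped
# function as a keyword argument.
def map_partitions(func, partitions, **kwargs):
    cls = kwargs.pop("cls", list)  # consumed here, not forwarded
    return cls(func(p, **kwargs) for p in partitions)

out = map_partitions(len, [[1, 2], [3]], cls=tuple)
```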

* Support reindex in case of empty frame

* More elegant way to implement todense function. (#80)

This leverages the dask.delayed object API to achieve the same result,
which was previously a hack between map_partitions and initializing
dd.DataFrame directly.
Turns out the bug #76 was already fixed in commit
e8fa03f
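A stdlib-only sketch of the delayed pattern (the real commit uses dask.delayed and feeds the per-partition results into a dask DataFrame; every name below is illustrative):

```python
# Stand-in for dask.delayed: capture each per-partition call lazily,
# then "compute" by running every captured call.
def delayed(func, *args):
    return lambda: func(*args)

def todense_partition(rows):
    # stand-in for SparseFrame.todense on one partition
    return [dict(r) for r in rows]

partitions = [[{"a": 1}], [{"a": 2}, {"a": 3}]]
tasks = [delayed(todense_partition, p) for p in partitions]  # lazy graph
dense = [task() for task in tasks]  # compute all partitions
```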
* Bugfix: sf['missing column'] raises KeyError

Previously it returned last column.

* Add test for dask version
This change adds support for pandas>0.23, including 0.24 and 0.25.0
Other cases import a name without underscore.
There is no such argument in pandas 0.23.4
3 participants