
Duplicate memory allocations #98

Open
3 tasks
observingClouds opened this issue May 9, 2022 · 4 comments
Labels
help wanted · performance

Comments

@observingClouds
Owner

Currently, the input data passed to xbitinfo.get_bitinformation is duplicated when calling

X = ds[var].values
Main.X = X

Main.X = X is a deep-copy operation. This is generally a problem for large datasets, because even a single copy of the dataset can be too much to fit into memory. Furthermore, I observed that the memory is not freed when calling xbitinfo.get_bitinformation again.

This results in several issues/tasks:

  • allocate memory only once, or even better
  • allocate memory lazily
  • free memory after Julia call
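As a point of reference, on the Python side it is easy to check which NumPy operations actually copy a buffer (a minimal sketch; the copy made by Main.X = X itself happens on the Julia side and is not captured by this check):

```python
import numpy as np

a = np.ones((1000, 2000), dtype=np.float32)  # 8 MB of data

b = a.copy()       # explicit deep copy, like the one made by Main.X = X
c = a              # plain assignment: no copy, same buffer
v = a[:, :1000]    # slicing: a view, no copy either

print(a.nbytes)                # 8000000 bytes per full copy
print(np.shares_memory(a, b))  # False -> b is a separate allocation
print(np.shares_memory(a, c))  # True
print(np.shares_memory(a, v))  # True
```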
observingClouds added the help wanted and performance labels on May 9, 2022
@milankl
Collaborator

milankl commented May 9, 2022

Just FYI this is the memory allocation that happens within Julia

julia> using BitInformation
julia> A = rand(Float32,100,200);
julia> @time bitinformation(A);
  0.001958 seconds (281 allocations: 13.328 KiB)

julia> A = rand(Float32,1000,2000);
julia> @time bitinformation(A);
  0.118632 seconds (281 allocations: 13.328 KiB)

As you can see, it's independent of the array size and should only contain the counter arrays (but I haven't checked, as it's so small anyway). I don't know how you want to check memory allocation on the Python side, but in case it includes the allocations within Julia, this is the lower bound.
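One way to check allocations on the Python side is the standard-library tracemalloc module, which NumPy reports its array buffers to (a sketch; it only traces allocations made through Python/NumPy, not those made inside the Julia runtime):

```python
import tracemalloc

import numpy as np

tracemalloc.start()

A = np.ones((1000, 2000), dtype=np.float32)  # ~8 MB buffer
B = A.copy()                                 # a second ~8 MB allocation

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"currently allocated: {current / 1e6:.1f} MB")  # roughly 16 MB
```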

@observingClouds
Owner Author

Thanks for these numbers!
All the solutions I can currently think of require working with chunks in some way. The easiest way I can think of to make the chunking work is to do it on the Python side, call BitInformation.jl on each chunk, and combine the results afterwards.

@milankl, can the calculation of the information content be separated by chunks? I thought that an array could easily be chunked into 1D stripes along the dimension the information content should be retrieved from, and the results afterwards combined by taking the average (assuming chunks have identical size). I assumed that something like the following would yield identical results:

import xbitinfo as xb
import numpy as np
import xarray as xr

xr.set_options(display_style="text")
ds = xr.tutorial.load_dataset("eraint_uvz")
ds_selection = ds[['z']].isel(latitude=10,month=1)  # selection to have only 2D data
print(ds_selection)
"""
<xarray.Dataset>
Dimensions:    (level: 3, longitude: 480)
Coordinates:
  * longitude  (longitude) float32 -180.0 -179.2 -178.5 ... 177.8 178.5 179.2
  * level      (level) int32 200 500 850
Data variables:
    z          (level, longitude) float32 1.151e+05 1.151e+05 ... 1.359e+04
Attributes:
    Conventions:  CF-1.0
    Info:         Monthly ERA-Interim data. Downloaded and edited by fabien.m...
"""

# Apply BitInformation.jl on each `level` dimension separately
bitinfo_chunks = {}
for level in range(3):
    bitinfo_chunks[level] = xb.get_bitinformation(ds_selection.isel(level=level), dim='longitude').z.values
# Combine information content across `level`s by a simple mean
bitinfo_chunks_combined = np.array([np.hstack(v) for v in bitinfo_chunks.values()]).mean(axis=0)
print(bitinfo_chunks_combined)
"""
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.24053964 0.27775998 0.2840053  0.83751976 0.77918835
 0.71564043 0.55766442 0.37690905 0.33012151 0.08764519 0.15900722
 0.30933046 0.10688834 0.21276485 0.30603788 0.06007946 0.1924336
 0.0458691  0.1069354 ]
"""

# Apply BitInformation.jl on all `level`s simultaneously
bitinfo_all = xb.get_bitinformation(ds_selection, dim='longitude').z.values
print(bitinfo_all)
"""
[0.         0.         0.         0.         0.91829583 0.91829583
 0.91829583 0.91829583 0.91829583 0.         0.91829583 0.91829583
 0.         0.8119377  0.46199519 0.94738475 0.89059076 0.79981288
 0.70050295 0.51163389 0.2799572  0.33485714 0.08154634 0.05379317
 0.28785182 0.07245253 0.07962509 0.30929713 0.06492267 0.11333437
 0.18299123 0.14846125]
"""

but the outputs are obviously not identical. Is this a bug in our wrapper, do I need to combine the fields differently, or is this expected?

Understanding how the analysis of a dataset can be split into chunks would be a good starting point for fixing this issue.

@milankl
Collaborator

milankl commented May 9, 2022

The easiest way I can think of, is to do the chunking on the python side and call BitInformation.jl on each of these chunks and combine the results afterwards. @milankl, can the calculation of the information content be separated by chunks? I thought that an array could easily be chunked into 1D-stripes along the dimension the information content should be retrieved from and afterwards combined by taking the average (assuming chunks have identical size).

I agree; as the Python side will be xarray-aware, the chunking should happen there. You would then call the bitpair_count function with every chunk, add the bitpair arrays in Python, and call a mutual_information function which converts the bit counts to probabilities and then to information (see below). Most of that is already set up, but I'll need to smooth some edges to make this workflow work.

I assumed that something like the following would yield identical results, but the outputs are obviously not identical. Is this a bug in our wrapper, do I need to combine the field differently or is this expected?

Yes, you can't average information calculated from chunks. Think about a bitstream like 00001111, for which the entropy is 1 bit. However, if you cut it into two chunks (0000, 1111), the entropy in each of them is 0. Fortunately, the biggest part of the work is done in the bitpair counting

https://github.com/milankl/BitInformation.jl/blob/5f3ebbd135e427c68048988fdc24f6f2d5cb71e9/src/bit_count.jl#L67-L76

which can be called with the counter array C being zeros, but also the C from the previous chunk.
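To make this concrete, here is a toy NumPy sketch of the workflow described above; entropy, bitpair_counts and mutual_information are simplified stand-ins for BitInformation.jl's internals, not its actual API:

```python
import numpy as np

def entropy(bits):
    """Shannon entropy (in bits) of a 0/1 stream."""
    p1 = float(np.mean(bits))
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

def bitpair_counts(bits, C=None):
    """Count adjacent (bits[i], bits[i+1]) pairs, accumulating into an
    existing counter array C, analogous to BitInformation.jl's counters."""
    if C is None:
        C = np.zeros((2, 2), dtype=int)
    for a, b in zip(bits[:-1], bits[1:]):
        C[a, b] += 1
    return C

def mutual_information(C):
    """Convert pooled bitpair counts into mutual information (in bits)."""
    p = C / C.sum()
    joint_indep = np.outer(p.sum(axis=1), p.sum(axis=0))
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / joint_indep[nz])).sum())

stream = np.array([0, 0, 0, 0, 1, 1, 1, 1])
chunks = [stream[:4], stream[4:]]

# Wrong: averaging per-chunk information -> 0 bits,
# although the full stream clearly carries information
print(np.mean([entropy(c) for c in chunks]))  # 0.0
print(entropy(stream))                        # 1.0

# Right: accumulate the counts across chunks, convert to information once
C = None
for chunk in chunks:
    C = bitpair_counts(chunk, C)
print(mutual_information(C))  # 1.0
```

Note that splitting a 1D stream this way still drops the single pair that straddles the chunk boundary; in the setup discussed above, chunks would run along a dimension other than the one the information is computed along, so no pairs are lost.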

@milankl
Collaborator

milankl commented Feb 27, 2023

Btw, this here

[screenshot of the PyCall.jl README section describing PyArray]

from https://github.com/JuliaPy/PyCall.jl#readme tells me that it should be possible to call Julia code from Python without a copy by using PyArray. I've defined everything for <:AbstractArray, so the functions should continue to work with PyArray. But I obviously haven't looked into the details.
