
Duplicate memory allocations #98

Open
3 tasks
observingClouds opened this issue May 9, 2022 · 4 comments
Labels
help wanted · performance

Comments

@observingClouds
Owner

Currently, the input data passed to xbitinfo.get_bitinformation is duplicated when calling

X = ds[var].values
Main.X = X

Main.X = X is a deep-copy operation. This is generally a problem for large datasets, because even a single copy of the dataset can be too much to fit into memory. Furthermore, I observed that the memory is not freed when calling xbitinfo.get_bitinformation again.

This results in several issues/tasks:

  • allocate memory only once, or even better
  • allocate memory lazily
  • free memory after Julia call
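As a point of reference, on the Python side it is easy to check which NumPy operations actually copy a buffer (a minimal sketch; the copy made by Main.X = X itself happens on the Julia side and is not captured by this check):

```python
import numpy as np

a = np.ones((1000, 2000), dtype=np.float32)  # 8 MB of data

b = a.copy()       # explicit deep copy, like the one made by Main.X = X
c = a              # plain assignment: no copy, same buffer
v = a[:, :1000]    # slicing: a view, no copy either

print(a.nbytes)                # 8000000 bytes per full copy
print(np.shares_memory(a, b))  # False -> b is a separate allocation
print(np.shares_memory(a, c))  # True
print(np.shares_memory(a, v))  # True
```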
observingClouds added the help wanted and performance labels on May 9, 2022
@milankl
Collaborator

milankl commented May 9, 2022

Just FYI this is the memory allocation that happens within Julia

julia> using BitInformation
julia> A = rand(Float32,100,200);
julia> @time bitinformation(A);
  0.001958 seconds (281 allocations: 13.328 KiB)

julia> A = rand(Float32,1000,2000);
julia> @time bitinformation(A);
  0.118632 seconds (281 allocations: 13.328 KiB)

As you can see, it's independent of the array size and should only contain the counter arrays (but I haven't checked, as it's so small anyway). I don't know how you want to check memory allocation on the Python side, but in case it includes the allocations within Julia, this is the lower bound.
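One way to check allocations on the Python side is the standard-library tracemalloc module, which NumPy reports its array buffers to (a sketch; it only traces allocations made through Python/NumPy, not those made inside the Julia runtime):

```python
import tracemalloc

import numpy as np

tracemalloc.start()

A = np.ones((1000, 2000), dtype=np.float32)  # ~8 MB buffer
B = A.copy()                                 # a second ~8 MB allocation

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"currently allocated: {current / 1e6:.1f} MB")  # roughly 16 MB
```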

@observingClouds
Owner Author

Thanks for these numbers!
All the solutions I can currently think of require working with chunks in some way. The easiest way I can think of to make the chunking work is to do it on the Python side, call BitInformation.jl on each chunk, and combine the results afterwards.

@milankl, can the calculation of the information content be separated by chunks? I thought that an array could easily be chunked into 1D stripes along the dimension the information content should be retrieved from, and the results afterwards combined by taking the average (assuming chunks have identical size). I assumed that something like the following would yield identical results:

import xbitinfo as xb
import numpy as np
import xarray as xr

xr.set_options(display_style="text")
ds = xr.tutorial.load_dataset("eraint_uvz")
ds_selection = ds[['z']].isel(latitude=10,month=1)  # selection to have only 2D data
print(ds_selection)
"""
<xarray.Dataset>
Dimensions:    (level: 3, longitude: 480)
Coordinates:
  * longitude  (longitude) float32 -180.0 -179.2 -178.5 ... 177.8 178.5 179.2
  * level      (level) int32 200 500 850
Data variables:
    z          (level, longitude) float32 1.151e+05 1.151e+05 ... 1.359e+04
Attributes:
    Conventions:  CF-1.0
    Info:         Monthly ERA-Interim data. Downloaded and edited by fabien.m...
"""

# Apply BitInformation.jl on each `level` dimension separately
bitinfo_chunks = {}
for level in range(3):
    bitinfo_chunks[level] = xb.get_bitinformation(ds_selection.isel(level=level), dim='longitude').z.values
# Combine information content across `level`s by a simple mean
bitinfo_chunks_combined = np.array([np.hstack(v) for v in bitinfo_chunks.values()]).mean(axis=0)
print(bitinfo_chunks_combined)
"""
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.24053964 0.27775998 0.2840053  0.83751976 0.77918835
 0.71564043 0.55766442 0.37690905 0.33012151 0.08764519 0.15900722
 0.30933046 0.10688834 0.21276485 0.30603788 0.06007946 0.1924336
 0.0458691  0.1069354 ]
"""

# Apply BitInformation.jl on all `level`s simultaneously
bitinfo_all = xb.get_bitinformation(ds_selection, dim='longitude').z.values
print(bitinfo_all)
"""
[0.         0.         0.         0.         0.91829583 0.91829583
 0.91829583 0.91829583 0.91829583 0.         0.91829583 0.91829583
 0.         0.8119377  0.46199519 0.94738475 0.89059076 0.79981288
 0.70050295 0.51163389 0.2799572  0.33485714 0.08154634 0.05379317
 0.28785182 0.07245253 0.07962509 0.30929713 0.06492267 0.11333437
 0.18299123 0.14846125]
"""

but the outputs are obviously not identical. Is this a bug in our wrapper, do I need to combine the fields differently, or is this expected?

Understanding how the analysis of a dataset can be split into chunks would be a good starting point for fixing this issue.

@milankl
Collaborator

milankl commented May 9, 2022

The easiest way I can think of, is to do the chunking on the python side and call BitInformation.jl on each of these chunks and combine the results afterwards. @milankl, can the calculation of the information content be separated by chunks? I thought that an array could easily be chunked into 1D-stripes along the dimension the information content should be retrieved from and afterwards combined by taking the average (assuming chunks have identical size).

I agree; as the Python side will be xarray-aware, the chunking should happen there. You would then call the bitpair_count function with every chunk, add the bitpair arrays in Python, and call a mutual_information function which converts the bit counts to probabilities and then to information (see below). Most of that is already set up, but I'll need to smooth some edges to make this workflow work.

I assumed that something like the following would yield identical results, but the outputs are obviously not identical. Is this a bug in our wrapper, do I need to combine the field differently or is this expected?

Yes, you can't average information calculated from chunks. Think about a bitstream like 00001111, for which the entropy is 1 bit. However, if you cut it into two chunks (0000, 1111), the entropy in each of them is 0. Fortunately, the biggest part of the work is done in the bitpair counting

https://github.com/milankl/BitInformation.jl/blob/5f3ebbd135e427c68048988fdc24f6f2d5cb71e9/src/bit_count.jl#L67-L76

which can be called with the counter array C being zeros, but also the C from the previous chunk.
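To make this concrete, here is a toy NumPy sketch of the workflow described above; entropy, bitpair_counts and mutual_information are simplified stand-ins for BitInformation.jl's internals, not its actual API:

```python
import numpy as np

def entropy(bits):
    """Shannon entropy (in bits) of a 0/1 stream."""
    p1 = float(np.mean(bits))
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

def bitpair_counts(bits, C=None):
    """Count adjacent (bits[i], bits[i+1]) pairs, accumulating into an
    existing counter array C, analogous to BitInformation.jl's counters."""
    if C is None:
        C = np.zeros((2, 2), dtype=int)
    for a, b in zip(bits[:-1], bits[1:]):
        C[a, b] += 1
    return C

def mutual_information(C):
    """Convert pooled bitpair counts into mutual information (in bits)."""
    p = C / C.sum()
    joint_indep = np.outer(p.sum(axis=1), p.sum(axis=0))
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / joint_indep[nz])).sum())

stream = np.array([0, 0, 0, 0, 1, 1, 1, 1])
chunks = [stream[:4], stream[4:]]

# Wrong: averaging per-chunk information -> 0 bits,
# although the full stream clearly carries information
print(np.mean([entropy(c) for c in chunks]))  # 0.0
print(entropy(stream))                        # 1.0

# Right: accumulate the counts across chunks, convert to information once
C = None
for chunk in chunks:
    C = bitpair_counts(chunk, C)
print(mutual_information(C))  # 1.0
```

Note that splitting a 1D stream this way still drops the single pair that straddles the chunk boundary; in the setup discussed above, chunks would run along a dimension other than the one the information is computed along, so no pairs are lost.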

@milankl
Collaborator

milankl commented Feb 27, 2023

Btw, this here

[screenshot of the PyCall.jl README section describing PyArray]

from https://github.com/JuliaPy/PyCall.jl#readme tells me that it should be possible to call Julia code from Python without a copy by using PyArray. I've defined everything for <:AbstractArray, so the functions should continue to work with PyArray. But I obviously haven't looked into the details.
