Data Storage Protocol

John Brandt edited this page Sep 5, 2020 · 4 revisions

Project database

The projects are stored in a comma-separated file with the columns lat, long, unique_path, and name.
This file is loaded into 4-predict and 4-download, where projects are indexed by name or unique_path.
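
A minimal sketch of loading and indexing such a file, assuming it has a header row with exactly these four column names. The sample rows and the helper name load_projects are made up for illustration and are not taken from the repository.

```python
import csv
import io

# Hypothetical sample rows; the real file's contents are not shown on this page.
SAMPLE_CSV = """lat,long,unique_path,name
6.69,-1.62,ghana/ashanti/kumasi-1/,kumasi
-1.95,30.06,rwanda/kigali/gasabo-2/,gasabo
"""

def load_projects(handle, key="name"):
    """Read the project file and index each row by `key` (name or unique_path)."""
    if key not in ("name", "unique_path"):
        raise ValueError("key must be 'name' or 'unique_path'")
    projects = {}
    for row in csv.DictReader(handle):
        # Coordinates arrive as strings; convert them to floats.
        row["lat"] = float(row["lat"])
        row["long"] = float(row["long"])
        projects[row[key]] = row
    return projects

by_name = load_projects(io.StringIO(SAMPLE_CSV))
by_path = load_projects(io.StringIO(SAMPLE_CSV), key="unique_path")
```

Either index returns the full row, so a lookup by name also yields the unique_path needed to locate the data.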

Data typing

Everything in raw/* is stored as int16, via np.trunc(array * 65535).astype(np.int16), because the original reflectance values are int16 and minimal calculations have occurred
Everything in interim/* is float32, via np.float32(array), because there are still calculations to be done
Everything in processed/* is int32, via np.trunc(array * 65535).astype(np.int32)
All calculations are float32 and all tensors are float32, so on loading any array, call np.float32(array) and assert that its values are between -10 and 10.
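
The save and load steps for processed/* can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, and dividing by 65535 on load is an assumption about how the scaling is undone, since the page only states the cast and the range check.

```python
import numpy as np

SCALE = 65535  # scale factor used in the conversions described above

def to_processed(array):
    """Store a float array as int32, as described for processed/*."""
    return np.trunc(array * SCALE).astype(np.int32)

def load_as_float32(stored, scale=SCALE):
    """Cast to float32 on load and check the value range.

    Assumption: dividing by `scale` inverts the scaling applied on save.
    """
    array = np.float32(stored) / np.float32(scale)
    assert array.dtype == np.float32
    assert array.min() >= -10 and array.max() <= 10, "values outside [-10, 10]"
    return array

original = np.array([0.0, 0.25, 1.0], dtype=np.float32)
stored = to_processed(original)      # int32 values
restored = load_as_float32(stored)   # float32 values
```

Because np.trunc discards the fractional part, each restored value can differ from the original by up to 1/65535.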

Data naming conventions

unique_path is constructed as country/admin1/name-uniqueid/
Local and cloud storage locations are distinguished by a local_prefix and a cloud_prefix
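
A short sketch of how these pieces might fit together. The helper names, prefix values, and project values are hypothetical; only the country/admin1/name-uniqueid/ layout and the two prefix names come from this page.

```python
import posixpath

def build_unique_path(country, admin1, name, unique_id):
    """Return country/admin1/name-uniqueid/ with a trailing slash."""
    return posixpath.join(country, admin1, f"{name}-{unique_id}") + "/"

def resolve(prefix, unique_path):
    """Prepend a local or cloud prefix to a unique path."""
    return posixpath.join(prefix, unique_path)

# Hypothetical prefixes and project values, for illustration only.
local_prefix = "/data/"
cloud_prefix = "s3://example-bucket/"

unique_path = build_unique_path("ghana", "ashanti", "kumasi", "1")
local_path = resolve(local_prefix, unique_path)
cloud_path = resolve(cloud_prefix, unique_path)
```

Keeping the prefix separate from unique_path means the same project key addresses both the local copy and the cloud copy.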

Data storage type

Currently hickle

Data persistency

All data in raw/ is persistent
All other data is processed on demand and should be deleted from its respective folders before closing the Docker containers
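
The cleanup rule could be sketched as below, assuming raw/, interim/, and processed/ sit side by side under one data root. The function name and the demo directory are hypothetical, and how such a step would be hooked into container shutdown is not specified here.

```python
import shutil
import tempfile
from pathlib import Path

ON_DEMAND_DIRS = ("interim", "processed")  # folders that hold non-persistent data

def clear_on_demand_data(data_root):
    """Delete everything inside interim/ and processed/; leave raw/ untouched."""
    removed = []
    for folder in ON_DEMAND_DIRS:
        target = Path(data_root) / folder
        if not target.is_dir():
            continue
        for child in target.iterdir():
            if child.is_dir():
                shutil.rmtree(child)
            else:
                child.unlink()
            removed.append(child)
    return removed

# Demonstration on a throwaway directory tree.
demo_root = Path(tempfile.mkdtemp())
for folder in ("raw", "interim", "processed"):
    (demo_root / folder).mkdir()
    (demo_root / folder / "tile.hkl").write_bytes(b"")
removed = clear_on_demand_data(demo_root)
```

The folders themselves are kept so that later on-demand processing can write into them without recreating the layout.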

Issues

Data in processed/* needs only 16 bits per value but is saved as int32, because int16 is signed and tops out at 32767, below the 65535 scale factor
hickle does not seem to allow streaming to or from S3, so storage may revert to pickle in the future
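
Part of pickle's appeal for this purpose is that it reads from and writes to any binary file-like object, so no file on disk is required. A minimal sketch of that round trip, using an in-memory buffer as a stand-in for an S3 object. No S3 client call is shown, and the function names and payload are hypothetical.

```python
import io
import pickle

def dump_to_stream(obj):
    """Serialize `obj` with pickle into an in-memory, file-like buffer."""
    buffer = io.BytesIO()
    pickle.dump(obj, buffer, protocol=pickle.HIGHEST_PROTOCOL)
    buffer.seek(0)  # rewind so a consumer reads from the start
    return buffer

def load_from_stream(stream):
    """Deserialize an object from any binary file-like object."""
    return pickle.load(stream)

# Hypothetical payload; in practice this would be an array from raw/ or processed/.
payload = {"name": "kumasi", "values": [0, 16383, 65535]}
stream = dump_to_stream(payload)
restored = load_from_stream(stream)
```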