Data Storage Protocol

John Brandt edited this page Sep 5, 2020 · 4 revisions

Project database

  • Projects are stored in a comma-separated file with the columns lat, long, unique_path, and name
  • This file is loaded into 4-predict and 4-download and indexed by name or unique_path
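
The lookup described above could be sketched roughly as follows. This is only an illustration: the function name load_projects, the header row, and the sample project are assumptions, and the actual loading code in 4-predict and 4-download may differ.

```python
import csv
import io

def load_projects(csv_file):
    """Read project rows and index them by name and by unique_path."""
    by_name, by_path = {}, {}
    for row in csv.DictReader(csv_file):
        project = {
            "lat": float(row["lat"]),
            "long": float(row["long"]),
            "unique_path": row["unique_path"],
            "name": row["name"],
        }
        by_name[project["name"]] = project
        by_path[project["unique_path"]] = project
    return by_name, by_path

# Hypothetical sample row; the real file may use different header names.
sample = io.StringIO(
    "lat,long,unique_path,name\n"
    "-1.5,36.8,kenya/nairobi/demo-001/,demo\n"
)
by_name, by_path = load_projects(sample)
```

Either dictionary returns the same record, so a project can be found by whichever key is at hand.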

Data typing

  • Everything in raw/* is stored as int16, via np.trunc(array * 65535).astype(np.int16), because the original reflectance values are int16 and minimal calculations have occurred
  • Everything in interim/* is float32, via np.float32(array), because there are still calculations to be done
  • Everything in processed/* is int32, via np.trunc(array * 65535).astype(np.int32)
  • All calculations are float32 and all tensors are float32, so on loading any array, call np.float32(array) and assert that its values are between -10 and 10
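
A minimal sketch of the processed/* round trip, using the expressions quoted above. The helper names to_processed and to_float32 are hypothetical, and dividing by 65535 on load is an assumption: the text only specifies the float32 cast and the range check.

```python
import numpy as np

SCALE = 65535  # scale factor quoted in the text

def to_processed(array):
    """Scale and truncate to int32, as described for processed/*."""
    return np.trunc(array * SCALE).astype(np.int32)

def to_float32(stored, scale=SCALE):
    """Cast to float32 on load and check the expected value range."""
    # Assumption: stored integers are divided by the same scale factor.
    out = np.float32(stored) / np.float32(scale)
    assert out.min() >= -10 and out.max() <= 10, "values outside [-10, 10]"
    return out

reflectance = np.array([0.0, 0.25, 1.0], dtype=np.float32)
stored = to_processed(reflectance)
restored = to_float32(stored)
```

Truncation loses at most one step of 1/65535 per value. Note that int16 tops out at 32767, so with a 65535 scale factor only inputs up to roughly 0.5 fit in int16 without overflow; the sketch uses the int32 variant to sidestep that.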

Data naming conventions

  • unique_path is created as country/admin1/name-uniqueid/
  • Local and cloud locations are distinguished with a local_prefix and a cloud_prefix
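
The naming scheme above might be assembled as in the sketch below. The helper names, the prefix values, and the sample project are all hypothetical placeholders chosen for illustration.

```python
def make_unique_path(country, admin1, name, unique_id):
    """Build country/admin1/name-uniqueid/ from its parts."""
    return f"{country}/{admin1}/{name}-{unique_id}/"

def with_prefix(prefix, unique_path):
    """Join a storage prefix and a unique_path with exactly one slash."""
    return prefix.rstrip("/") + "/" + unique_path

# Hypothetical prefixes; the real values are not given in this page.
local_prefix = "/data/"
cloud_prefix = "s3://example-bucket/"

unique_path = make_unique_path("kenya", "nairobi", "demo", "001")
local_location = with_prefix(local_prefix, unique_path)
cloud_location = with_prefix(cloud_prefix, unique_path)
```

Keeping unique_path free of any prefix means the same key resolves to both the local and the cloud copy.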

Data storage type

  • Currently hickle

Data persistency

  • All data in raw/ is persistent
  • All other data is processed on demand and should be deleted from the respective folders before closing the Docker containers
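
One possible cleanup step for the rule above is sketched here. It assumes that raw/, interim/ and processed/ sit side by side under a single data root, which this page does not state; clear_transient is a hypothetical helper, not part of the project.

```python
import shutil
import tempfile
from pathlib import Path

# Assumption: raw/, interim/ and processed/ share one data root.
TRANSIENT_FOLDERS = ("interim", "processed")

def clear_transient(data_root):
    """Empty the on-demand folders and leave raw/ untouched."""
    for folder in TRANSIENT_FOLDERS:
        target = Path(data_root) / folder
        if target.is_dir():
            shutil.rmtree(target)  # remove the folder and everything in it
        target.mkdir(parents=True, exist_ok=True)  # recreate it empty

# Demonstration in a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as root:
    for folder in ("raw", "interim", "processed"):
        (Path(root) / folder).mkdir()
        (Path(root) / folder / "tile.hkl").touch()
    clear_transient(root)
    remaining = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))
```

After the call only raw/ still holds a file, while interim/ and processed/ exist but are empty.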

Issues

  • Values in processed/* span a 16-bit range but are saved as int32, because int16 is signed
  • hickle does not seem to allow streaming to or from S3, so storage may revert to pickle in the future