@ADCollard

I am trying to create a zarr converter for the CrIS PCA files that are in netCDF format. I am trying to make it use ObsBuilder and a YAML control file in order to be maximally flexible for other data types.

A lot of this was written with ChatGPT, so it may not be the best solution. So far I have been unable to pass the dimensions correctly to container.add.

I could do with some pointers....

@ADCollard ADCollard requested a review from rmclaren November 26, 2025 16:03
@ADCollard ADCollard assigned ADCollard and unassigned ADCollard Nov 26, 2025
@rmclaren rmclaren changed the base branch from feature/data_v5 to feature/data_v6 November 26, 2025 17:01
@rmclaren
Contributor

I merged the latest from data_v6 and made this the base branch. Please do a git pull...

@ADCollard
Author

I just pushed a version that fails in a more meaningful place. Basically, it balks at the final container.add() when I supply a 2D array (it's a silent exit, so there is no error message).

python  gen_data.py 2024-04-01 2024-04-02 cris_pca zarr
python /scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/src/reader.py 2024-04-01 2024-04-02 cris_pca zarr 

*** CrisPcaObsBuilder CONSTRUCTOR ***
    ENCODER_YAML = /scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/mapping/cris_pca.yaml
    DIM PATH MAP: {'location': '*', 'npc_global': '*/npc_global'}
*** _make_description(): using ENCODER_YAML ***

=== NcdfRunner starting ===
FILES FOUND: 9
CHECKING: SNDR.J1.CRIS.20180217T0006.m06.g002.PCA_RED.beta.v03_00.W.251029165834.nc
MATCH: SNDR.J1.CRIS.20180217T0006.m06.g002.PCA_RED.beta.v03_00.W.251029165834.nc
***** Entering make_obs *****
*** load_input() CALLED: /scratch3/NCEPDEV/stmp/Andrew.Collard/CrIS_PCA/data1/SNDR.J1.CRIS.20180217T0006.m06.g002.PCA_RED.beta.v03_00.W.251029165834.nc
    dims: Frozen({'npc_local': 10, 'wnum_all': 2223, 'atrack': 45, 'xtrack': 30, 'fov': 9, 'npc_global': 150, 'red_spectral_region': 25, 'pca_outlier': 100, 'fov_poly': 8, 'wnum_lw': 717, 'wnum_mw': 869, 'wnum_sw': 637})
*** preprocess_dataset() CALLED ***
    atrack=45, xtrack=30, fov=9 -> nlocs=12150
    flattening lat -> latitude
    flattening lon -> longitude
    converting obs_time_tai93 -> UNIX seconds
    flattening global_pc_score to (location, 150)
*** preprocess complete, vars: ['location', 'scan_position', 'latitude', 'longitude', 'time', 'global_pc_score']
    _dims_for_var(latitude, ('location',)) -> ['*']
Adding latitude from latitude with dim_paths ['*']
  shape = (12150,)
    _dims_for_var(longitude, ('location',)) -> ['*']
Adding longitude from longitude with dim_paths ['*']
  shape = (12150,)
    _dims_for_var(time, ('location',)) -> ['*']
Adding time from time with dim_paths ['*']
  shape = (12150,)
    _dims_for_var(timestamp, ('location',)) -> ['*']
Adding timestamp from time with dim_paths ['*']
  shape = (12150,)
    _dims_for_var(scan_position, ('location',)) -> ['*']
Adding scan_position from scan_position with dim_paths ['*']
  shape = (12150,)
    _dims_for_var(global_pc_score, ('location', 'npc_global')) -> ['*', '*/npc_global']
Adding global_pc_score from global_pc_score with dim_paths ['*', '*/npc_global']
  shape = (12150, 150)
Fatal Python error: Segmentation fault

Current thread 0x00007f0a74825b80 (most recent call first):
  File "/scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/mapping/cris_pca.py", line 184 in make_obs
  File "/scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/src/runner.py", line 48 in _make_obs
  File "/scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/src/runner.py", line 185 in run
  File "/scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/src/runner.py", line 205 in run
  File "/scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/src/reader.py", line 259 in _append_data_for_day
  File "/scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/src/reader.py", line 242 in create_yearly_data
  File "/scratch3/NCEPDEV/da/Andrew.Collard/git/ocelot/data_prep/src/reader.py", line 308 in <module>

@rmclaren
Contributor

rmclaren commented Nov 26, 2025

Make sure your dim path strings are all caps, so ['*', '*/NPCGLOBAL'] or something. These path strings have to be valid "query" strings, so they need to be all caps. Also avoid special characters. This is just an artifact from BUFR.
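A minimal sketch of a pre-flight check for the rule described above, so a bad path fails with a readable error before it reaches container.add(). The helper name `normalize_dim_path` is hypothetical and not part of ObsBuilder, and the allowed character set (upper-case letters, digits, and underscores, plus a bare `*`) is an assumption pieced together from this thread rather than a documented specification.

```python
import re

# Assumed rule (not taken from the library documentation): each path segment
# is either "*" or consists of upper-case letters, digits and underscores.
_SEGMENT = re.compile(r"^(\*|[A-Z0-9_]+)$")


def normalize_dim_path(path: str) -> str:
    """Upper-case every segment of a dimension path and validate the result.

    Hypothetical helper for illustration only.
    """
    segments = [seg.upper() for seg in path.split("/")]
    bad = [seg for seg in segments if not _SEGMENT.match(seg)]
    if bad:
        raise ValueError(f"invalid dimension path segment(s) in {path!r}: {bad}")
    return "/".join(segments)


print(normalize_dim_path("*"))             # *
print(normalize_dim_path("*/npc_global"))  # */NPC_GLOBAL
```

Running the paths from the YAML file through a check like this would turn the failure seen above into an ordinary Python exception for paths containing other special characters.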

@ADCollard
Author

> Make sure your dim path strings are all caps, so ['*', '*/NPCGLOBAL'] or something. These path strings have to be valid "query" strings, so they need to be all caps. Also avoid special characters. This is just an artifact from BUFR.

D'oh! I didn't even think of that! I seem to have progressed a little further now. Thanks.

@rmclaren
Contributor

Sorry about that, I never knew we were going to use this beyond BUFR. Basically you just need to update the dimension path string in the YAML file...

@rmclaren
Contributor

I think the underscore is fine actually

@ADCollard
Author

> Sorry about that, I never knew we were going to use this beyond BUFR. Basically you just need to update the dimension path string in the YAML file...

Not a problem. Thanks for your help. This has been a good exercise in digging into the obs_builder code for me.

@ADCollard
Author

@rmclaren Would you mind taking a look at the current iteration of this? When I run:

python  gen_data.py 2024-04-01 2024-04-02 cris_pca zarr

I get the following output (note the dim_2 in the directory names):

ls /scratch3/NCEPDEV/stmp/Andrew.Collard/ocelot_data/cris_pca_2024.zarr
dim_2			   global_pc_score_dim_2_119  global_pc_score_dim_2_14	 global_pc_score_dim_2_25  global_pc_score_dim_2_46  global_pc_score_dim_2_67  global_pc_score_dim_2_88
.
.
.
.
global_pc_score_dim_2_118  global_pc_score_dim_2_139  global_pc_score_dim_2_24	 global_pc_score_dim_2_45  global_pc_score_dim_2_66  global_pc_score_dim_2_87

I was hoping for npc_global rather than dim_2, given this in mapping/cris_pca.yaml:

dimensions:
  - name: location
    source: "location"
    path: "*"

  - name: npc_global
    source: npc_global
    path: "*/NPCGLOBAL"

@rmclaren
Contributor

rmclaren commented Dec 4, 2025

@ADCollard Sorry, you threw me for a loop there (I'm a little slow this time of year...). So two things:

  1. In the YAML file, the dimensions section goes inside of the encoder section (it is a property of the encoder). You have it outside... This is why it is not seeing your dimension name.

  2. It is not necessary to manually define the Location dimension. It defines this one automatically. You are using it in your code also, just be aware that the encoder doesn't need that one (you don't have to define it manually).
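The two points above can be sketched on the parsed YAML as a plain Python dict. This is only an illustration: the function name `move_dimensions_into_encoder` is hypothetical, the `"variables"` key is a placeholder for whatever the real encoder section contains, and dropping the `location` entry simply follows the remark that the encoder defines that dimension automatically.

```python
def move_dimensions_into_encoder(config: dict) -> dict:
    """Return a copy of config with top-level 'dimensions' nested under 'encoder'.

    Hypothetical helper: the 'location' dimension is dropped on the way,
    on the assumption that the encoder creates it automatically.
    """
    fixed = dict(config)                 # shallow copy, leaves the input untouched
    dims = fixed.pop("dimensions", None)
    if dims:
        encoder = dict(fixed.get("encoder") or {})
        existing = list(encoder.get("dimensions") or [])
        extra = [d for d in dims if d.get("name") != "location"]
        encoder["dimensions"] = existing + extra
        fixed["encoder"] = encoder
    return fixed


# Layout as it was in the mapping file: 'dimensions' sits beside 'encoder'.
config = {
    "dimensions": [
        {"name": "location", "source": "location", "path": "*"},
        {"name": "npc_global", "source": "npc_global", "path": "*/NPCGLOBAL"},
    ],
    "encoder": {"variables": []},  # placeholder content, not the real section
}

fixed = move_dimensions_into_encoder(config)
print([d["name"] for d in fixed["encoder"]["dimensions"]])  # ['npc_global']
```

In practice the fix is just an indentation change in the YAML file; the dict version is only meant to make the intended nesting explicit.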

@rmclaren
Contributor

rmclaren commented Dec 4, 2025

Another thing I noticed is that in runner.py, you are not filtering the files to the ones in the date range... (?)
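One possible way to do that filtering, as a rough sketch: it assumes every granule name carries a `YYYYMMDDTHHMM` start stamp between dots, as in the file name shown in the log earlier in this thread, and that the end date is exclusive. The function names are hypothetical and this is not the existing runner.py logic.

```python
import re
from datetime import datetime

# Assumed file-name convention: a ".YYYYMMDDTHHMM." field, as in
# SNDR.J1.CRIS.20180217T0006.m06.g002.PCA_RED.beta.v03_00.W.251029165834.nc
_STAMP = re.compile(r"\.(\d{8}T\d{4})\.")


def granule_time(filename: str):
    """Return the granule start time parsed from the name, or None if absent."""
    match = _STAMP.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%dT%H%M")


def files_in_range(filenames, start: datetime, end: datetime):
    """Keep files whose start time satisfies start <= time < end."""
    kept = []
    for name in filenames:
        stamp = granule_time(name)
        if stamp is not None and start <= stamp < end:
            kept.append(name)
    return kept


files = ["SNDR.J1.CRIS.20180217T0006.m06.g002.PCA_RED.beta.v03_00.W.251029165834.nc"]
print(files_in_range(files, datetime(2024, 4, 1), datetime(2024, 4, 2)))  # []
```

With a filter along these lines, the 2018 granule matched in the log above would be skipped for a 2024-04-01 to 2024-04-02 run.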

Comment on lines +1 to +8
dimensions:
  - name: location
    source: "location"
    path: "*"

  - name: npc_global
    source: npc_global
    path: "*/NPCGLOBAL"
The "dimensions" section should be inside "encoder". Otherwise it will not be applied.
