Conversation

@amametjanov
Member

@amametjanov amametjanov commented Dec 8, 2025

Change default IO type from NETCDF4C to PNETCDF

Checklist

  • Linting
  • Building
    • CMake build does not produce any new warnings from changes in this PR
  • Testing
    • Add a comment to the PR titled Testing with the following:
      • Which machines CTest unit tests
        have been run on, and indicate that they are all passing.
      • The Polaris omega_pr test suite
        has passed, using the Polaris e3sm_submodules/Omega baseline
      • Document machine(s), compiler(s), and the build path(s) used for -p for both the baseline (Polaris e3sm_submodules/Omega) and the PR build
      • Indicate "All tests passed" or document failing tests
      • Document testing used to verify the changes including any tests that are added/modified/impacted.
      • Performance related PRs: Please include a relevant PACE experiment link documenting performance before and after.

Fixes #333
Closes #334

@amametjanov
Member Author

amametjanov commented Dec 8, 2025

Testing:

  • aurora+oneapi-ifxgpu: 100% tests passed, 0 tests failed out of 35
  • frontier:
    • craygnu: 100% tests passed
    • craygnu-mphipcc: 100%
    • craycray: 80% tests passed, 7 tests failed out of 35 (unrelated to IO)
    • craycray-mphipcc: 100%
    • crayamd: 100%
    • crayamd-mphipcc: 100%
  • pm:
    • gnu: 100%
    • intel: 100%
    • gnugpu: 100%
  • chrysalis:
    • intel: 100%
    • gnu: 100%

@xylar

xylar commented Dec 9, 2025

Thanks very much, @amametjanov! This looks promising.

@xylar

xylar commented Dec 9, 2025

@amametjanov, I've got 3 tests for this in the queue, one on Chrysalis and 2 on Frontier. But wait times seem to be a bit long in both places. I'll keep you posted.

@amametjanov
Member Author

To fix the IO_Test ctest, this PR needs scorpio PR #670, which expands support for CDF5 data types (like 64-bit ints).

@xylar

xylar commented Dec 9, 2025

@amametjanov, maybe this will be fixed by E3SM-Project/scorpio#670 but what I'm seeing on Chrysalis with this branch is:

PIO: FATAL ERROR: Aborting... An error occured, Waiting on pending requests on file (output.nc, ncid=21) failed (Number of pending requests on file = 1, Number of variables with pending requests = 1, Number of request blocks = 1, Current block being waited on = 0, Number of requests in current block = 1).. NetCDF: Operation not allowed in define mode (err=-39). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/home/ac.xylar/e3sm_work/e3sm/azamat/mod-dflt-io-type/externals/scorpio/src/clib/pio_darray_int.cpp: 2192)

The polaris output is available at:

/lcrc/group/e3sm/ac.xylar/polaris_0.9/chrysalis/test_20251209/omega-pr-mod-dflt-io-type

It seems like this won't be a short-term fix for Omega if a scorpio fix is needed, because that would mean:

  • scorpio fix gets merged
  • scorpio release happens
  • scorpio gets updated in E3SM
  • E3SM/master gets merged into Omega/develop
  • This branch goes in

It feels like we should look into whether there's some alternative way to address #323 in the next week or two.

Update scorpio from v1.8.2 2025-Jul-14 to v1.9.0 2025-Nov-21.
Also add fix for PnetCDF CDF5 types.
@amametjanov amametjanov force-pushed the azamat/mod-dflt-io-type branch from 4d44424 to e68388f Compare December 21, 2025 23:03
@amametjanov amametjanov marked this pull request as ready for review December 21, 2025 23:26
@amametjanov
Member Author

Xylar, please check with the updated head of this branch to see if it fixes the "NetCDF: Operation not allowed in define mode" errors in polaris tests. All ctests are passing for me.

This branch updates the scorpio submodule ahead of E3SM/master's version, which is still on v1.8.2 2025-Jul-14. When scorpio gets updated in E3SM/master (to v1.9.0 or later), the E3SM/master merge into Omega/develop will subsume this branch's updates.

@xylar

xylar commented Dec 22, 2025

Thanks, @amametjanov! I'll retest as soon as I can.


@xylar xylar left a comment

I tested the omega_pr suite using the fix in E3SM-Project/polaris#442, pointing to this branch for the Omega build.

I was able to run successfully with both Intel and Gnu on Chrysalis. I discovered that I can't log in to either Aurora or Frontier at the moment. I'm trying on Perlmutter (CPU and GPU) next.

In the meantime, two small questions/comments.

@xylar

xylar commented Jan 5, 2026

On Perlmutter-CPU (Both Intel and Gnu) and -GPU (Gnu-GPU), I'm seeing the same hanging behavior reported in E3SM-Project/polaris#396 as we had seen previously. It seems like maybe that behavior is unfortunately independent of this PIO problem.

@grnydawn

@amametjanov, I ran the tests for this PR on Frontier, but I got the same PIO error:

PIO: FATAL ERROR: Aborting... invalid IO type (/lustre/orion/cli115/scratch/grnydawn/cronjobs/tasks/polaris_cdash/polaris/e3sm_submodules/Omega/externals/scorpio/src/clib/pio_lists.cpp: 120)

Please see the Frontier test results on the Omega CDash dashboard at https://my.cdash.org/index.php?project=omega

@xylar

xylar commented Jan 27, 2026

@amametjanov, could you please let me know when this is ready for me to re-test in Polaris?

@amametjanov
Member Author

Yes, this is ready, please re-run Polaris tests. 🙏

@xylar xylar added the bug Something isn't working label Jan 29, 2026
@xylar xylar self-assigned this Jan 29, 2026
@xylar

xylar commented Jan 29, 2026

Testing

I successfully ran the omega_pr suite from Polaris using this branch and:

  • Chrysalis with intel
  • Chrysalis with gnu
  • Frontier with craygnu
  • Frontier with craygnu-mphipcc
  • Frontier with craycray
  • Frontier with craycray-mphipcc
  • Frontier with crayamd
  • Frontier with crayamd-mphipcc

(Unchecked items are still in the queue -- update soon...)

I also verified that one output.nc file produced by Omega was in cdf5 format, as expected. I presume the same will be true for all files.
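
A check like this can be repeated across all output files by reading the leading bytes of each one, since the classic netCDF family stores its format version in the fourth byte and HDF5-based files start with the HDF5 signature. A minimal sketch, assuming the files are local and readable; the names `netcdf_format` and `check_all` and the label strings are illustrative and not part of Omega or Polaris:

```python
from pathlib import Path

# Fourth byte of a classic-family file ("CDF" + version byte).
_CLASSIC_VERSIONS = {
    1: "classic (CDF-1)",
    2: "64-bit offset (CDF-2)",
    5: "64-bit data (CDF-5)",
}
# Signature at offset 0 of an HDF5 file (the common case for netCDF-4).
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_format(path):
    """Return a label for the on-disk format of the file at `path`."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(_HDF5_SIGNATURE):
        return "netCDF-4 (HDF5)"
    if len(header) >= 4 and header[:3] == b"CDF":
        return _CLASSIC_VERSIONS.get(header[3], "unknown")
    # Empty, truncated, or non-netCDF files end up here.
    return "unknown"


def check_all(directory, pattern="*.nc"):
    """Map each matching file under `directory` to its format label."""
    paths = sorted(Path(directory).rglob(pattern))
    return {str(p): netcdf_format(p) for p in paths}
```

Where the netCDF utilities are installed, `ncdump -k output.nc` reports the same information and prints `cdf5` for a CDF5 file.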

@amametjanov and @grnydawn, thank you so much for figuring out these issues and fixing them!

@xylar

xylar commented Jan 29, 2026

My Frontier tests are now running. All the CPU tests did okay. The GPU tests are very slow by comparison, and they are taking more than the 1 hour I had allocated. I don't know for sure but I presume the slowness is not from this PR.

I also saw the I/O failure in one craygnu-mphipcc test: ocean/planar/manufactured_solution/convergence_both/del2. It is discouraging that I am seeing this failure but the situation is clearly far better than it was before.

I will rerun both the tests that timed out and the one that failed. We will see what happens.

Update: It seems like the file system on Frontier might be a problem. My resubmitted jobs are hanging just trying to load the environment.

@xylar

xylar commented Jan 29, 2026

In the crayamd-mphipcc run on Frontier, I'm seeing a different error after trying to rerun following a timeout:

PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Operation not allowed in data mode (file = output.nc) (/ccs/home/xylar/e3sm_work/Omega/azamat/mod-dflt-io-type/externals/scorpio/src/clib/pio_getput_int.cpp: 544)

This is for the ocean/spherical/icos/rotation_2d test case.

@xylar

xylar commented Jan 29, 2026

When I try to rerun the failed test (ocean/planar/manufactured_solution/convergence_both/del2) for craygnu-mphipcc, I see a different error:

PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Unknown file format (file = output.nc) (/ccs/home/xylar/e3sm_work/Omega/azamat/mod-dflt-io-type/externals/scorpio/src/clib/pioc_support.cpp: 5503)

The original error was the same as we have seen before:

PIO: FATAL ERROR: Aborting... invalid IO type (/ccs/home/xylar/e3sm_work/Omega/azamat/mod-dflt-io-type/externals/scorpio/src/clib/pio_lists.cpp: 120)

I presume these errors might indicate that Omega isn't overwriting the output.nc file as expected if a run either times out or fails. If so, that issue would presumably be beyond the scope of this PR.
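
If leftover files are indeed the cause, one test-side workaround would be to delete any stale output before a rerun. A minimal sketch, assuming the file name `output.nc` from the errors above; the helper `remove_stale_output` is hypothetical and not part of Omega or Polaris:

```python
from pathlib import Path


def remove_stale_output(work_dir, name="output.nc"):
    """Delete a leftover output file so a rerun starts from a clean state.

    Returns True if a file was removed, False if none existed.
    """
    path = Path(work_dir) / name
    if path.is_file():
        path.unlink()
        return True
    return False
```

This only sidesteps the symptom in the test workflow; it does not address whether Omega itself should overwrite an existing file.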

@rljacob
Member

rljacob commented Jan 29, 2026

For Frontier, craygnu and craygnu-mphipcc are the only compilers E3SM cares about. Don't spend more than a token amount of E3SM time looking at the others.

@xylar

xylar commented Jan 29, 2026

Okay, thanks @rljacob. That wasn't clear to me.

@xylar

xylar commented Jan 29, 2026

I set up ocean/planar/manufactured_solution/convergence_both/del2 and reran it with craygnu-mphipcc, this time successfully, so the problem there seems to be transient.

@xylar

xylar commented Jan 29, 2026

Nope, now I'm seeing the usual error message:

PIO: FATAL ERROR: Aborting... invalid IO type (/ccs/home/xylar/e3sm_work/Omega/azamat/mod-dflt-io-type/externals/scorpio/src/clib/pio_lists.cpp: 120)

but this time in the ocean/planar/manufactured_solution/convergence_both/default (again with craygnu-mphipcc).

So I think this PR should go in but we can't consider this problem to be solved.

@xylar xylar merged commit 1b5cc62 into develop Jan 29, 2026
1 check passed
@xylar xylar deleted the azamat/mod-dflt-io-type branch January 29, 2026 18:24
@amametjanov
Member Author

Thank you for re-running Polaris tests (and merging).
I re-ran omega ctests on latest post-merge develop with all 6 frontier compilers and saw just 2 ctests fail with craycray on cpus; all others pass:

  • 11 - HORZOPERATORS_PLANE_TEST and 12 - HORZOPERATORS_SPHERE_TEST

I heard that frontier scratch filesystem was hanging and slow this week: maybe that's the culprit.

xylar added a commit to xylar/polaris that referenced this pull request Feb 2, 2026
This merge updates the e3sm_submodules/Omega submodule from [f2e951a](https://github.com/E3SM-Project/Omega/tree/f2e951a) to [fc53608](https://github.com/E3SM-Project/Omega/tree/fc53608).

This update includes the following MPAS-Ocean and MPAS-Frameworks PRs (check mark indicates bit-for-bit with previous PR in the list):
- [ ]  (ocn) E3SM-Project/Omega#325
- [ ]  (ocn) E3SM-Project/Omega#329

Labels

bug Something isn't working Omega

Development

Successfully merging this pull request may close these issues.

Changing Omega default Netcdf file format to CDF5

5 participants