FMS2io collective netcdf reads hang when the io layout is not the same as the layout #1617

Open
uramirez8707 opened this issue Dec 2, 2024 · 0 comments

Comments

@uramirez8707
Contributor

Describe the bug
FMS2io collective netcdf reads hang when the io layout is not the same as the layout.
This happens because domain_read calls netcdf_read_data inside an if (fileobj%is_root) block, so only the io root PE makes the call:

if (fileobj%is_root) then
  if (fileobj%adjust_indices) then
    !< If the file is distributed, the file only contains the io global domain
    c(xdim_index) = 1
    c(ydim_index) = 1
  else
    !< If the file is not distributed read, the file contains the global domain,
    !! so you only need to read the global io domain
    c(xdim_index) = xgbegin
    c(ydim_index) = ygbegin
  endif
  e(xdim_index) = xgsize
  e(ydim_index) = ygsize
  !Read in the global io domain
  select type(vdata)
    type is (integer(kind=i4_kind))
      call allocate_array(buf_i4_kind, e)
      call netcdf_read_data(fileobj, variable_name, buf_i4_kind, &
                            unlim_dim_level=unlim_dim_level, &
                            corner=c, edge_lengths=e, broadcast=.false.)
    type is (integer(kind=i8_kind))
      call allocate_array(buf_i8_kind, e)
      call netcdf_read_data(fileobj, variable_name, buf_i8_kind, &
                            unlim_dim_level=unlim_dim_level, &
                            corner=c, edge_lengths=e, broadcast=.false.)
    type is (real(kind=r4_kind))
      call allocate_array(buf_r4_kind, e)
      call netcdf_read_data(fileobj, variable_name, buf_r4_kind, &
                            unlim_dim_level=unlim_dim_level, &
                            corner=c, edge_lengths=e, broadcast=.false.)
    type is (real(kind=r8_kind))
      call allocate_array(buf_r8_kind, e)
      call netcdf_read_data(fileobj, variable_name, buf_r8_kind, &
                            unlim_dim_level=unlim_dim_level, &
                            corner=c, edge_lengths=e, broadcast=.false.)
    class default
      call error("unsupported variable type: domain_read_2d: file: "//trim(fileobj%path)//" variable:"// &
               & trim(variable_name))
  end select
endif

When collective I/O is enabled, netcdf_read_data does the following:

if(fileobj%use_collective) then
  varid = get_variable_id(fileobj%ncid, trim(variable_name), msg=append_error_msg)
  ! NetCDF does not have the ability to specify collective I/O at
  ! the file basis so we must activate at the variable level
  err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
  call check_netcdf_code(err, append_error_msg)
  select type(buf)
    type is (integer(kind=i4_kind))
      err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
    type is (integer(kind=i8_kind))
      err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
    type is (real(kind=r4_kind))
      err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
    type is (real(kind=r8_kind))
      err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
    class default
      call error("Unsupported variable type: "//trim(append_error_msg))
  end select
  call check_netcdf_code(err, append_error_msg)
  call unpack_data_2d(fileobj, varid, variable_name, buf)

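Since the variable is set to collective access, every rank that has the file open must take part in the nf90_get_var call. Only the root PE reaches it, so the call waits forever on the other ranks and the code hangs.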

When the io layout is the same as the layout, each rank reads its own section of the domain, so all the ranks reach the nf90_get_var call and the read completes.
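One possible direction, sketched here and not taken from the FMS source or from any agreed fix, is to have every rank enter the collective read, with non-root ranks requesting zero elements. The helper below only computes the corner and edge_lengths each rank would pass. The name get_read_bounds and the numeric values are hypothetical, x and y are assumed to be dimensions 1 and 2, and whether zero-length counts are accepted in collective mode should be checked against the netCDF and HDF5 builds in use.

```fortran
program collective_read_bounds
  implicit none
  integer :: c(2), e(2)

  ! An io root rank requests the whole io domain (values are made up).
  call get_read_bounds(.true., .false., 5, 9, 16, 8, c, e)
  print '(a,2i4,a,2i4)', "root:     corner=", c, "  edge_lengths=", e
  if (any(c /= [5, 9]) .or. any(e /= [16, 8])) error stop "unexpected root bounds"

  ! A non-root rank requests nothing, but would still make the collective call.
  call get_read_bounds(.false., .false., 5, 9, 16, 8, c, e)
  print '(a,2i4,a,2i4)', "non-root: corner=", c, "  edge_lengths=", e
  if (any(c /= 1) .or. any(e /= 0)) error stop "unexpected non-root bounds"

contains

  !> Hypothetical helper: bounds that a rank would pass to a collective read.
  subroutine get_read_bounds(is_root, adjust_indices, xgbegin, ygbegin, &
                             xgsize, ygsize, c, e)
    logical, intent(in)  :: is_root         !< .true. on the io root rank
    logical, intent(in)  :: adjust_indices  !< .true. if the file is distributed
    integer, intent(in)  :: xgbegin, ygbegin, xgsize, ygsize
    integer, intent(out) :: c(2), e(2)

    if (.not. is_root) then
      ! Zero-sized request: the rank participates but reads no data.
      c = 1
      e = 0
      return
    endif

    if (adjust_indices) then
      ! Distributed file: it only holds the io global domain.
      c = 1
    else
      ! Single file: start at the beginning of the io global domain.
      c = [xgbegin, ygbegin]
    endif
    e = [xgsize, ygsize]
  end subroutine get_read_bounds

end program collective_read_bounds
```

With bounds like these, the netcdf_read_data call would also have to move out of the if (fileobj%is_root) block so that all ranks reach nf90_get_var. An alternative that avoids zero-sized requests would be to switch the variable to independent access for reads that only the root PE performs.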

To Reproduce
This small test reproduces the issue. The program works when the io layout is the same as the layout, but it hangs when it is not.
https://github.com/uramirez8707/FMS/blob/pnetcdf_test/test_fms/fms2_io/test_domain_pnetcdf.F90

Expected behavior
The test case should run to completion whether or not the io layout matches the layout.

System Environment
This happens on any system.

Additional context
N/A
