What version of the code are you using?
release-cesm2.1.5-0-g7a6c5b0 (describe version attached)
Have you made any changes to files in the source tree?
No changes to the source tree.
Describe every step you took leading up to the problem:
/work/07644/oxygen/ls6/cesm-2.1.5/cime/scripts/create_newcase \
--case "${CASE}" \
--compset FHIST_BGC \
--res f09_f09_mg17 \
--machine ls6
./xmlchange NTASKS=-8
./xmlchange DOUT_S=TRUE
./xmlchange STOP_N=5
./xmlchange RESUBMIT=0
./xmlchange RUN_REFCASE=b.e21.B1850.f09_g17.CMIP6-piControl.001
./xmlchange RUN_REFDATE=1231-01-01
./xmlchange RUN_TYPE=hybrid
./xmlchange GET_REFCASE=TRUE
./xmlchange RUN_STARTDATE=1850-01-01
./xmlchange STOP_OPTION=ndays
./xmlchange GMAKE_J=64
./xmlchange SSTICE_YEAR_ALIGN=1
./xmlchange SSTICE_YEAR_START=0
./xmlchange SSTICE_YEAR_END=0
./xmlchange SSTICE_DATA_FILENAME="${SCRATCH}/cesm-input-data/SSTICE/sstice_cmip6_pi-Control_clim_40101_200012_diddled.nc"
./case.setup
./case.build --clean-all
./case.build
./case.submit
Namelists are also attached.
If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.
This is a port for TACC's Lonestar6. The config xmls are attached.
Describe your problem or question:
When scaling this case to multiple nodes (LS6 has 128 cores per node), CESM encounters a PIO error during initialization and crashes. This particular run failed during LND, but a few runs failed during ATM as well:
pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 135 :
Specified netCDF file does not exist.
MPI error (MPI_File_open) : Other I/O error , error stack:
ADIO_OPEN(219): open failed on a remote node
pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 135 :
Unknown error in file operation
Image PC Routine Line Source
cesm.exe 0000000002AD6196 Unknown Unknown Unknown
cesm.exe 00000000027B6E21 pio_support_mp_pi 118 pio_support.F90
cesm.exe 00000000027B4E65 pio_utils_mp_chec 59 pio_utils.F90
cesm.exe 00000000028C4135 ionf_mod_mp_creat 135 ionf_mod.F90
cesm.exe 00000000027A4AB6 piolib_mod_mp_cre 2663 piolib_mod.F90
cesm.exe 0000000001AA62BA ncdio_pio_mp_ncd_ 262 ncdio_pio.F90.in
cesm.exe 0000000001AD9761 restfilemod_mp_re 441 restFileMod.F90
cesm.exe 0000000001AD8F28 restfilemod_mp_re 91 restFileMod.F90
cesm.exe 00000000019D0FC5 clm_initializemod 548 clm_initializeMod.F90
cesm.exe 00000000019B9A05 lnd_comp_mct_mp_l 233 lnd_comp_mct.F90
cesm.exe 000000000043B3C4 component_mod_mp_ 267 component_mod.F90
cesm.exe 0000000000429BC4 cime_comp_mod_mp_ 1237 cime_comp_mod.F90
cesm.exe 00000000004383E9 MAIN__ 114 cime_driver.F90
cesm.exe 000000000041A8E2 Unknown Unknown Unknown
libc-2.28.so 000014BFE6A597E5 __libc_start_main Unknown Unknown
cesm.exe 000000000041A7EE Unknown Unknown Unknown
Here is what I have tried that did not work and produced the same error (sometimes at different stages in the run):
- Testing on 1, 2, 4, and 8 nodes (the attached logs are from the 8-node run). Test runs without namelist modifications (except for use_init_interp in user_nl_clm) passed once, but not a second time. After rebuilding and setting the namelists, only the 1-node run worked, at 2 simulated years per day (SYPD). However, one iteration of the 1-node run seemingly failed while writing out the restart files.
- Switching PIO_TYPENAME from the default 'pnetcdf' to 'netcdf' and 'netcdf4p'. Only 'netcdf' worked, but performance on the 1-node run dropped from 2 SYPD to 0.8 SYPD.
- Lowering the PIO_STRIDE
- Adjusting the PE layout to put LND on a single node
- Switching to PIO2
- Adjusting the disk striping in the run directory from the default 72 targets to 8, 4, and 1 target, and changing the chunk size as well.
- Adding I_MPI_IO_HINTS, I_MPI_EXTRA_FILESYSTEM, and I_MPI_EXTRA_FILESYSTEM_LIST
- Adding an explicit romio_hints file to the run directory
- Temporarily disabling grid interpolation in the LND model and copying finidat from successful case manually
- Re-downloading the source code
- Building/running in an Apptainer container with the GNU and AOCC compilers, and switching from IMPI to MVAPICH2
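For anyone trying to reproduce the PIO-related attempts above, they can be sketched roughly as below. This is only an illustration: the helper names (`run`, `set_pio_type`, `set_pio_stride`) and the `DRY_RUN` switch are hypothetical, and the stride value is a placeholder rather than the exact setting from the failing runs. By default the script only prints the `xmlchange` commands; set `DRY_RUN=0` and run it from inside the case directory to apply them.

```shell
#!/usr/bin/env bash
# Sketch of the PIO-related attempts listed above. Values are illustrative
# placeholders, not the exact settings from the failing runs.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands; 0 = execute them in a case directory

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Switch the PIO I/O backend (the default in this case was pnetcdf).
set_pio_type() {
  run ./xmlchange "PIO_TYPENAME=$1"
}

# Change the stride between I/O tasks.
set_pio_stride() {
  run ./xmlchange "PIO_STRIDE=$1"
}

set_pio_type netcdf
set_pio_stride 8   # placeholder value
```

Changing these settings was followed by the same case.build and case.submit steps as in the original setup.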
Attachments
- lnd.log.2945034.260224-215744.txt (114.7 KB)
- describe_version.txt (4.6 KB)
- cpl.log.2945034.260224-215744.txt (41.8 KB)
- config_machines.xml.txt (694 bytes)
- config_compilers.xml.txt (3.9 KB)
- cesm.log.2945034.260224-215744.txt (92 KB)
- atm.log.2945034.260224-215744.txt (394.8 KB)
- user_nl_clm.txt (25 bytes)
- user_nl_cam.txt (14.2 KB)