Hello
I have a PIO issue that I think is similar to two unanswered threads ("CESM2.2 ; ERROR: ionf_mod.F90" and "errors occur when the cesm2.1.3 start running").
I am using CESM 2.1.3 (CLM5), and I have modified the soil layer structure of the land model (discussed here: Lower boundary conditions (differences between CLM4.5 and CLM5)).
These are the steps I took:
- I did the AD spin-up following the documentation, and it ran as expected. I only ran it for 40 years, not the 200 years suggested in the documentation (1.5.5. Spinup of CLM5.0-BGC-Crop — ctsm release-clm5.0 documentation).
- Then I tried the ND spin-up (again for a shorter period than suggested), and it failed.
Code:
./create_newcase --case $CASES2_1_3/spinupshortND --compset IHistClm50BgcCrop --res f19_g17 --machine beluga --compiler intel --mpilib intelmpi --walltime 14:00:00 --run-unsupported
Code:
# copy pointer and restart file from AD spin up you did before to the current case (ND spin up)
cp /cesm2_1_3_OUT/spinupshort/run/spinupshort.clm2.r* .
cp /cesm2_1_3_OUT/spinupshort/run/rpointer.* .
./case.setup
# Double the CPUs (4 nodes x 40 cores = 160 cores instead of 2 nodes x 40 cores = 80 cores) to shorten the lengthy spin-up. With 80 cores, 10 years took 12 h; with 160 cores, 10 years took 6-7 h.
./xmlchange NTASKS=160
./xmlchange NTASKS_ESP=1
Code:
./xmlchange RUN_TYPE=startup
./xmlchange STOP_OPTION=nyears,STOP_N=10
Code:
# Edit user_nl_clm file (excluding all variables except for TSOI and TSOI_ICE, and pointing to the finidat path from last AD spin up)
emacs user_nl_clm
soil_layerstruct= '23SL_3.5m_D500'
use_init_interp = .true.
finidat ='/cesm2_1_3_OUT/spinupshortND/spinupshort.clm2.r.0040-01-01-00000.nc'
&clm_inparm
hist_empty_htapes = .true.
hist_fincl1 = 'TSOI', 'TSOI_ICE'
/
./case.build
Code:
./xmlchange RUN_STARTDATE=0040-01-01
./xmlchange RESUBMIT=5
./xmlchange STOP_DATE=1000101
./case.submit
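To make the submission numbering explicit: with RUN_STARTDATE=0040-01-01, STOP_N=10 years and RESUBMIT=5, the run is split into six 10-year submissions (the initial one plus five resubmits). A small sketch of that arithmetic, where `submission_for_year` is just a helper name made up for this post:

```shell
#!/usr/bin/env bash
# Which submission (1-based) contains a given model year?
# Arguments: start year, years per submission (STOP_N), model year.
# Illustrative helper only; it is not part of CIME.
submission_for_year() {
    local start_year=$1 stop_n=$2 year=$3
    echo $(( (year - start_year) / stop_n + 1 ))
}

# Start year 40, 10 years per submission, model year 87
submission_for_year 40 10 87   # prints 5 (the submission covering years 80-90)
```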
The job ran until model year 87 and then failed during the 5th submission (one resubmit was still remaining when it failed). Below is the message I see in the cesm log file:
Code:
NetCDF: Unknown file format
pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
NetCDF: Unknown file format
Image PC Routine Line Source
cesm.exe 000000000143C756 Unknown Unknown Unknown
cesm.exe 000000000127E231 pio_support_mp_pi 118 pio_support.F90
cesm.exe 000000000127C35D pio_utils_mp_chec 74 pio_utils.F90
cesm.exe 000000000133ADE6 ionf_mod_mp_open_ 235 ionf_mod.F90
cesm.exe 000000000126DBB5 piolib_mod_mp_pio 2831 piolib_mod.F90
cesm.exe 0000000001187117 shr_dmodel_mod_mp 885 shr_dmodel_mod.F90
cesm.exe 00000000011860B9 shr_dmodel_mod_mp 675 shr_dmodel_mod.F90
cesm.exe 000000000121AF8A shr_strdata_mod_m 743 shr_strdata_mod.F90
cesm.exe 00000000004DEBC6 datm_comp_mod_mp_ 664 datm_comp_mod.F90
cesm.exe 00000000004DC69D atm_comp_mct_mp_a 247 atm_comp_mct.F90
cesm.exe 000000000042FFAA component_mod_mp_ 728 component_mod.F90
cesm.exe 0000000000416F4D cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 000000000042FC47 MAIN__ 125 cime_driver.F90
cesm.exe 00000000004132CE Unknown Unknown Unknown
libc-2.24.so 00001552166E02E0 __libc_start_main Unknown Unknown
cesm.exe 00000000004131EA Unknown Unknown Unknown
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
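The traceback passes through datm_comp_mod.F90, shr_strdata_mod.F90 and shr_dmodel_mod.F90 before reaching the open call in ionf_mod.F90, so the file that could not be opened is presumably one of the data-atmosphere forcing files rather than a CLM restart file (this is my reading of the traceback, not something the log states). A rough way to check whether such a file is damaged is to look at its first bytes: a NetCDF classic file starts with `CDF` followed by the byte 1, 2 or 5, and a NetCDF-4 file normally starts with the 8-byte HDF5 signature. The sketch below checks only that; `nc_magic` is a helper name made up for this post, and a file that passes can still be truncated further in.

```shell
#!/usr/bin/env bash
# Classify a file by its leading magic bytes.
# Prints one of: netcdf-classic, netcdf4-hdf5, unknown
# Hypothetical helper; it only inspects the first 8 bytes of the file.
nc_magic() {
    local file="$1" hex
    # First 8 bytes as one lowercase hex string, e.g. "43444601..."
    hex=$(head -c 8 -- "$file" 2>/dev/null | od -An -tx1 | tr -d ' \n')
    case "$hex" in
        43444601*|43444602*|43444605*) echo "netcdf-classic" ;;  # "CDF" + version byte
        894844460d0a1a0a*)             echo "netcdf4-hdf5" ;;    # HDF5 signature at offset 0
        *)                             echo "unknown" ;;
    esac
}

# Report every file given on the command line that does not look like NetCDF.
for f in "$@"; do
    if [ "$(nc_magic "$f")" = "unknown" ]; then
        echo "suspect: $f"
    fi
done
```

Usage, with the script saved under a hypothetical name and the forcing file paths filled in by hand: `bash check_nc_magic.sh /path/to/forcing/*.nc`. Any file reported as suspect would be a candidate for the "Unknown file format" error.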
There are also these lines in the cesm log file:
Code:
( 159) bc11608.int.ets1.calculquebec.ca
NetCDF: Invalid dimension ID or name
NetCDF: Invalid dimension ID or name
NetCDF: Invalid dimension ID or name
NetCDF: Invalid dimension ID or name
NetCDF: Invalid dimension ID or name
NetCDF: Variable not found
NetCDF: Invalid dimension ID or name
NetCDF: Invalid dimension ID or name
Also, this appears at the very top of the cesm log file:
Code:
Invalid PIO rearranger comm max pend req (comp2io), 0
Resetting PIO rearranger comm max pend req (comp2io) to 64
This seems odd to me: if there was a problem from the start, why did the run proceed without trouble until year 87 and then suddenly fail with a NetCDF error?
Do you think this is related to the fact that I changed the number of cores and nodes at the beginning? Should I go back to the previous 2 nodes (80 cores)? I don't know what happened here.
Moreover, I ran nearly the same setup on another cluster (4 nodes with 64 cores each, 256 cores in total), with 10 years of AD spin-up followed by 6 years of ND. I know that is short, but the point was only to test whether the ND spin-up finishes when the number of nodes and cores is left unchanged, and it finished without any issue. Does that make my hypothesis, that the error is related to doubling the cores, more plausible? Or did this second ND test succeed only because it was short (the failed run was 10 times longer), and would it also have failed had it run longer?
Thanks for your support.