
PIO error in spin up ND


wvsi3w
Member
Hello,
I have an issue with PIO that I think is similar to these two unanswered threads: (CESM2.2 ; ERROR: ionf_mod.F90) and (errors occur when the cesm2.1.3 start running).

I am using CESM 2.1.3 (CLM5) and I have modified the soil layer structure of the land model (discussed here: Lower boundary conditions (differences between CLM4.5 and CLM5))

These are the steps I took:
- I ran the AD spin-up following the documentation and it ran as expected (I only tested it for 40 years rather than the 200 years suggested in 1.5.5. Spinup of CLM5.0-BGC-Crop — ctsm release-clm5.0 documentation); a sketch of that setup is included after this list.
- I then tried a test ND spin-up (again for a shorter period than suggested), and it failed.
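For reference, the AD spin-up case was set up roughly as follows. This is a sketch based on the CTSM spin-up documentation rather than a verbatim copy of my scripts; the compset alias and the 10-year/3-resubmit split are assumptions, while the machine settings mirror the ND case below.

Code:
./create_newcase --case $CASES2_1_3/spinupshort --compset I1850Clm50BgcCrop --res f19_g17 --machine beluga --compiler intel --mpilib intelmpi --walltime 14:00:00 --run-unsupported
cd $CASES2_1_3/spinupshort
./case.setup

# turn on accelerated decomposition (AD) mode for the spin-up
./xmlchange CLM_ACCELERATED_SPINUP=on

# 40 years total for this test: 10 years per submission, 3 resubmits
./xmlchange STOP_OPTION=nyears,STOP_N=10,RESUBMIT=3

./case.build
./case.submit

The ND case was then created as follows: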

Code:
./create_newcase --case $CASES2_1_3/spinupshortND --compset IHistClm50BgcCrop --res f19_g17 --machine beluga --compiler intel --mpilib intelmpi --walltime 14:00:00 --run-unsupported

Code:
# copy the restart and rpointer files from the earlier AD spin-up into the run directory of the current (ND) case
cp /cesm2_1_3_OUT/spinupshort/run/spinupshort.clm2.r* .
cp /cesm2_1_3_OUT/spinupshort/run/rpointer.* .

./case.setup

# double the cores (from 2 nodes x 40 cores = 80 to 4 nodes x 40 cores = 160) to speed up the lengthy spin-up;
# with 80 cores it took ~12 h per 10 simulated years, with 160 cores ~6-7 h per 10 years
./xmlchange NTASKS=160
./xmlchange NTASKS_ESP=1
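For what it's worth, the resulting task layout can be double-checked before submitting with the standard CIME case tools (nothing case-specific assumed here):

Code:
./pelayout      # print the NTASKS / NTHRDS / ROOTPE layout per component
./preview_run   # show the batch directives and run command that case.submit will use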

Code:
./xmlchange RUN_TYPE=startup
./xmlchange STOP_OPTION=nyears,STOP_N=10
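These run-time settings can be verified at any point with xmlquery:

Code:
./xmlquery RUN_TYPE,STOP_OPTION,STOP_N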

Code:
# Edit user_nl_clm (write only TSOI and TSOI_ICE to the history files, and point finidat at the restart file from the AD spin-up)
emacs user_nl_clm
soil_layerstruct= '23SL_3.5m_D500'
use_init_interp = .true.

finidat ='/cesm2_1_3_OUT/spinupshortND/spinupshort.clm2.r.0040-01-01-00000.nc'

&clm_inparm
    hist_empty_htapes = .true.
    hist_fincl1 = 'TSOI', 'TSOI_ICE'
/

./case.build
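As a sanity check after editing user_nl_clm, the generated namelist can be previewed with the standard CIME tool (it simply writes the resolved namelists into CaseDocs):

Code:
./preview_namelists
# then inspect CaseDocs/lnd_in for soil_layerstruct, finidat and the hist_* settings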

Code:
./xmlchange RUN_STARTDATE=0040-01-01
./xmlchange RESUBMIT=5
./xmlchange STOP_DATE=1000101

./case.submit

The job ran up to year 87 and then failed during the 5th submission (only one resubmit was remaining). Below is the message I see in the cesm log file:
Code:
 NetCDF: Unknown file format
 pio_support::pio_die:: myrank=          -1 : ERROR: ionf_mod.F90:         235 :
  NetCDF: Unknown file format
Image              PC                Routine            Line        Source
cesm.exe           000000000143C756  Unknown               Unknown  Unknown
cesm.exe           000000000127E231  pio_support_mp_pi         118  pio_support.F90
cesm.exe           000000000127C35D  pio_utils_mp_chec          74  pio_utils.F90
cesm.exe           000000000133ADE6  ionf_mod_mp_open_         235  ionf_mod.F90
cesm.exe           000000000126DBB5  piolib_mod_mp_pio        2831  piolib_mod.F90
cesm.exe           0000000001187117  shr_dmodel_mod_mp         885  shr_dmodel_mod.F90
cesm.exe           00000000011860B9  shr_dmodel_mod_mp         675  shr_dmodel_mod.F90
cesm.exe           000000000121AF8A  shr_strdata_mod_m         743  shr_strdata_mod.F90
cesm.exe           00000000004DEBC6  datm_comp_mod_mp_         664  datm_comp_mod.F90
cesm.exe           00000000004DC69D  atm_comp_mct_mp_a         247  atm_comp_mct.F90
cesm.exe           000000000042FFAA  component_mod_mp_         728  component_mod.F90
cesm.exe           0000000000416F4D  cime_comp_mod_mp_        3465  cime_comp_mod.F90
cesm.exe           000000000042FC47  MAIN__                    125  cime_driver.F90
cesm.exe           00000000004132CE  Unknown               Unknown  Unknown
libc-2.24.so       00001552166E02E0  __libc_start_main     Unknown  Unknown
cesm.exe           00000000004131EA  Unknown               Unknown  Unknown
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

There are also these lines in the cesm log file:
Code:
(  159)  bc11608.int.ets1.calculquebec.ca
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name

Also, this appears at the very top of the cesm log file:
Code:
 Invalid PIO rearranger comm max pend req (comp2io),            0
 Resetting PIO rearranger comm max pend req (comp2io) to           64

This is odd to me: if there was a problem with the setup in the first place, why did it run fine until year 87 and then suddenly fail with a NetCDF error?
Do you think this is related to the fact that I changed the number of cores and nodes at the beginning? Should I go back to the previous 2 nodes (80 cores)? I don't know what happened here.

Moreover, I tried running the same thing, with some minor differences, on another cluster (4 nodes of 64 cores each = 256 cores), using a 10-year AD spin-up followed by a 6-year ND spin-up (I know that is short, but it was only meant to test whether the model finishes the ND spin-up when I do not change the number of nodes and cores between AD and ND), and it finished the ND run without any issue. So my hypothesis that the error is related to doubling the cores becomes a bit more plausible, right? Or did that second ND test only succeed because it was short, and would it also have failed if it had been ten times longer like the first one?

Thanks for your support.
 

Attachments

  • cesm log file.txt (329.5 KB)

slevis

Moderator
Please remind me whether these runs use custom datm inputs rather than the default ones that come with the model. If so, tell me whether the ND simulation fails while reading a datm input file that has not been read before by any of the other simulations that worked. If so, then I would suspect a problem with the particular datm input file.

If my hypothesis is correct, then your AD simulation would have also failed when it got to that year.

If my hypothesis is wrong, then your hypothesis may be correct.
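As a quick check, assuming the netCDF command-line utilities are available on your cluster (the path below is only a placeholder for whichever stream file the datm log points at), you can test whether that file is readable:

Code:
ncdump -h /path/to/suspect_datm_stream_file.nc | head

If the header cannot be dumped, the file is likely truncated or corrupted on disk.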
 


wvsi3w
Member
Please remind me whether these runs use custom datm inputs rather than the default ones that come with the model. If so, tell me whether the ND simulation fails while reading a datm input file that has not been read before by any of the other simulations that worked. If so, then I would suspect a problem with the particular datm input file.

If my hypothesis is correct, then your AD simulation would have also failed when it got to that year.

If my hypothesis is wrong, then your hypothesis may be correct.
Dear Samuel Levis,
Thanks a lot for your response,
I only used the default inputs; nothing was changed except the soil layer structure I mentioned, as I needed to get familiar with the AD/ND spin-up procedure before doing my real simulation.
The AD finished without an issue.

I have another thought: could this be related to the fact that I ran the ND job with 160 cores while the AD run was done with the default 80 cores? Is there any restriction on changing the core count between runs? It would make sense: the AD run used 80 cores (2 nodes), then I switched to 160 cores (4 nodes) for the ND run, and that might have created some issues (?)... silly me.

Another thought: the cluster it was running on has had some issues recently (slowness and minor system problems), and that could also have caused this error (?)
 


wvsi3w
Member
These are possible reasons.
Dear Sam,
I have a question regarding the AD spin-up.
I have noticed that the cluster I am using will be temporarily shut down for a couple of days, and I was wondering whether I should wait until the shutdown is over, or whether I can run the AD spin-up now and continue from its latest restart file once the system is back online.

The case is running now and I can see a restart file for CLM, but is it a good idea to restart from the last restart file of the AD spin-up in a situation like this? Would a shutdown in the middle of the AD spin-up, followed by a restart from its last restart file, change the spin-up in any significant way?

Also, if I can do this, what would the process be? Should I set something like RUN_TYPE=startup and add the finidat path and rpointer files from the interrupted AD spin-up?

Thanks in advance.
 

slevis

Moderator
Yes, restart files give you the option of continuing a simulation as if it never stopped. To do this you just need to change CONTINUE_RUN from FALSE to TRUE in env_run.xml.
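In practice, from the case directory (standard CIME commands only):

Code:
# once the interrupted run has written a restart file and updated the rpointer files
./xmlchange CONTINUE_RUN=TRUE
./case.submit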
 


wvsi3w
Member
Yes, restart files give you the option of continuing a simulation as if it never stopped. To do this you just need to change CONTINUE_RUN from FALSE to TRUE in env_run.xml.
Thank you Sam for your response,
I did change the CONTINUE_RUN to TRUE and did ./case.build and ./case.submit but it failed:

Code:
Input/output error: '/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.7.4/lib/python3.7/textwrap.py'
Can't locate Config/SetupTools.pm in @INC (you may need to install the Config::SetupTools module) (@INC contains: /utils/perl5lib /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/lib/perl5/site_perl /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/lib/perl5 /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/lib/perl5 /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/lib/perl5/site_perl /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/perl/5.22.4/lib/perl5/site_perl/5.22.4/x86_64-linux-thread-multi /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/perl/5.22.4/lib/perl5/site_perl/5.22.4 /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/perl/5.22.4/lib/perl5/5.22.4/x86_64-linux-thread-multi /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/perl/5.22.4/lib/perl5/5.22.4 .) at /lustre03/project/6001010/my_cesm_sandbox/components/cism//cime_config/buildnml line 22.

I actually did load the modules before I changed the CONTINUE_RUN. These are the modules I load every time:
"StdEnv/2018.3 perl/5.22.4 python/3.7.4 cmake/3.16.3 intelmpi/2018.3.222 hdf5-mpi/1.10.3 netcdf-mpi/4.4.1.1 netcdf-fortran-mpi/4.4.4"

I don't know what went wrong here, because the error seems to be related to the Python module, which is odd: I have always run the model successfully with those modules, and it only failed this time, with CONTINUE_RUN=TRUE. Maybe I did something wrong in the process?
 

slevis

Moderator
I do not know what went wrong in your process. Here is the process that works on our computers. I recommend that you try it by starting over, in case something got messed up. I think you have run simulations successfully with this process:
1) ./create_newcase ...
2) ./case.setup
3) ./case.build
4) ./case.submit

Now, say your simulation stopped, and some time before stopping (possibly immediately before stopping) it wrote a restart file. If so, then I would change CONTINUE_RUN to TRUE and
5) ./case.submit
 


wvsi3w
Member
I do not know what went wrong in your process. Here is the process that works on our computers. I recommend that you try it by starting over, in case something got messed up. I think you have run simulations successfully with this process:
1) ./create_newcase ...
2) ./case.setup
3) ./case.build
4) ./case.submit

Now, say your simulation stopped, and some time before stopping (possibly immediately before stopping) it wrote a restart file. If so, then I would change CONTINUE_RUN to TRUE and
5) ./case.submit
Thanks a lot for your message.
I figured out what was wrong: my system updates its modules from time to time. Below are the modules I changed to get it running without that error:

Code:
Currently Loaded Modules:
  1) CCconfig          3) gcccore/.9.3.0  (H)      5) intel/2020.1.217 (t)   7) libfabric/1.10.1       9) StdEnv/2020 (S)  11) cmake/3.27.7 (t)
  2) gentoo/2020 (S)   4) imkl/2020.1.217 (math)   6) ucx/1.8.0              8) openmpi/4.0.3    (m)  10) mii/1.1.2

  Where:
   S:     Module is Sticky, requires --force to unload or purge
   m:     MPI implementations / Implémentations MPI
   math:  Mathematical libraries / Bibliothèques mathématiques
   t:     Tools for development / Outils de développement
   H:                Hidden Module

Inactive Modules:
  1) perl/5.36.1   2) python/3.12.4   3) intelmpi/2021.9.0   4) netcdf-mpi/4.9.2   5) hdf5-mpi/1.14.2   6) netcdf-fortran-mpi/4.6.1

Also, thanks for the steps you listed. Before this I thought I had to run ./case.build and then ./case.submit whenever I changed anything in the case; good to know that a change like CONTINUE_RUN only needs a resubmit.

The case is running now from the last restart file without an issue.
 