Mpirun and Job Submission Errors

Will Smith
New Member
Hi there,

I am currently encountering issues while running a case with CESM2.3.alpha17. The problems involve an mpirun error, along with the messages "Submitted job case.run with id None" and "Submitted job case.st_archive with id None". I have attached the cesm.log file, other relevant case files, and a screenshot of the MPI library settings for your review.

Here are the settings I used for the case:
./xmlchange NTASKS=2
./xmlchange STOP_OPTION=ndays
./xmlchange STOP_N=1
./xmlchange RESUBMIT=0
./xmlchange NTHRDS=1
./xmlchange JOB_WALLCLOCK_TIME=96:00:00
./xmlchange CREATE_ESMF_PET_FILES=TRUE
./case.setup
./preview_namelists
echo "empty_htapes = .true." >> user_nl_cam
echo "hist_empty_htapes = .true." >> user_nl_clm
echo "rtmhist_nhtfrq = -876000" >> user_nl_mosart
echo "history_frequency = 100" >> user_nl_cism
./preview_namelists
./case.setup --reset
./case.build --skip-provenance-check
./preview_run
./case.submit

The attached screenshot shows the settings related to the MPI library, which I suspect might be linked to the issues I'm experiencing. Could you please help me identify where things might be going wrong?

Thank you for your assistance.

Best,
Will
 

Attachments

  • 2024-05-01 10.08.56.png (80.2 KB)
  • test7.6.1.sh.e4924383.zip (480 bytes)
  • test7.6.1.sh.o4924383.zip (3.1 KB)
  • cesm.log.240430-104502.zip (6.1 KB)

Will Smith
New Member
For reference,

Many thanks
 

Attachments

  • test7.6.1.sh.o4924383.txt (20.1 KB)
  • test7.6.1.sh.e4924383.txt (513 bytes)
  • cesm.log.240430-104502.txt (21.5 KB)

jedwards

CSEG and Liaisons
Staff member
According to the cesm log, the run is failing because of an incompatible netCDF file. I can't tell from that log which file is causing the problem; that is probably recorded in one of the component logs. The message is:

Abort with message NetCDF: NC_UNLIMITED size already in use in file pio_nc.c at line 2107
Abort with message NetCDF: NC_UNLIMITED size already in use in file pio_nc.c at line 2107
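
One way to narrow down which component hit this error is to search every log in the run directory for the message. The following is a minimal sketch, assuming the logs sit in a single directory and follow the `<component>.log.<timestamp>` naming seen in the attachments. `find_logs_with` is a hypothetical helper name, not a CESM tool.

```shell
#!/bin/sh
# Hypothetical helper: list the log files in a directory that contain a string.
# Usage: find_logs_with RUN_DIR PATTERN
find_logs_with() {
    run_dir=$1
    pattern=$2
    # -l prints only the names of matching files; -F treats the pattern as a fixed string.
    grep -lF -- "$pattern" "$run_dir"/*.log.* 2>/dev/null
}

# Example (the run directory is an assumption; substitute your own):
# find_logs_with /path/to/case/run "NC_UNLIMITED size already in use"
```

If only cesm.log is listed, the component logs did not record the message and another approach is needed.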
 

Will Smith
New Member
Hi Jedwards,

Thanks for the information. I've checked the log files of the other components, but I can't find any netCDF-related issues in them. Could you please provide a more detailed explanation? Attached are the log files of the other components and a screenshot of the netCDF version for your review.

Best,
Will
 

Attachments

  • 2024-05-01 16.37.50.png (26 KB)
  • atm.log.240430-104502.txt (24.4 KB)
  • glc.log.240430-104502.txt (16.1 KB)
  • lnd.log.240430-104502.txt (91.2 KB)
  • med.log.240430-104502.txt (60.8 KB)
  • rof.log.240430-104502.txt (6.6 KB)

jedwards

CSEG and Liaisons
Staff member
It looks like the problem may be with:
/mnt/iusers01/fatpou01/sees01/s29826zs/scratch/Projects/inputdata/atm/datm7/atm_forcing.datm7.GSWP3.0.5d.v1.c170516/Solar/clmforc.GSWP3.c2011.0.5x0.5.Solr.2000-01.nc

Please confirm the md5sum of that file is 5ee6f7fe2c4b8110a9d44a9beacc48b4
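
A checksum comparison like this can be scripted. The sketch below assumes GNU `md5sum` is available; `check_md5` is a hypothetical helper name introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's MD5 checksum with an expected value.
# Usage: check_md5 FILE EXPECTED_HEX
check_md5() {
    file=$1
    expected=$2
    # md5sum prints "<hash>  <filename>"; keep only the hash field.
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file has $actual, expected $expected"
        return 1
    fi
}

# Example, using the file path and checksum given above:
# check_md5 /path/to/clmforc.GSWP3.c2011.0.5x0.5.Solr.2000-01.nc 5ee6f7fe2c4b8110a9d44a9beacc48b4
```

A mismatch would suggest an incomplete or corrupted download of the input file.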

Why do you have ./xmlchange CREATE_ESMF_PET_FILES=TRUE? Are there any errors in the PET files?
 

Will Smith
New Member
Hi,

I confirm the md5sum of that file is 5ee6f7fe2c4b8110a9d44a9beacc48b4. As for ./xmlchange CREATE_ESMF_PET_FILES=TRUE, I set it only for the stable operation of the model.
 

Attachments

  • 2024-05-01 17.19.17.png (11.4 KB)

jedwards

CSEG and Liaisons
Staff member
You didn't answer about errors in the PET files - I recommend you leave that value set as FALSE.
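
For anyone unsure how to look for errors in the PET files, a minimal sketch follows. It assumes the ESMF per-task logs are written to the run directory with names matching `PET*.ESMF_LogFile` and that error entries contain the word `ERROR`; both are assumptions about the ESMF log format, and `pet_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print ERROR lines from ESMF PET log files in a run directory.
# Assumes the files are named PET*.ESMF_LogFile (an assumption about ESMF naming).
# Usage: pet_errors RUN_DIR
pet_errors() {
    run_dir=$1
    # -H prefixes each match with its file name; a missing file is silently ignored.
    grep -H "ERROR" "$run_dir"/PET*.ESMF_LogFile 2>/dev/null
}

# Example (the run directory is an assumption; substitute your own):
# pet_errors /path/to/case/run
```

No output means that none of the PET files in that directory contain an ERROR entry.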
 

Will Smith
New Member
Hi,
I'm sorry for the oversight earlier. I've been unable to find any information on PET files. I've set CREATE_ESMF_PET_FILES to FALSE, but I'm still facing issues with mpirun failing. I've attached the updated cesm.log for your reference. Additionally, I tested the MPI by running a basic MPI parallel programme, and I've attached a screenshot of the results. Could you offer any advice on how to resolve this issue?

Best,
Will
 

Attachments

  • cesm.log.240501-174431.txt (21.5 KB)
  • 2024-05-01 19.03.06.png (82.7 KB)

jedwards

CSEG and Liaisons
Staff member
Again, this is not an mpirun issue. It is an issue with an input or output file, but I can't tell from the logs you are providing which file is causing the problem. Perhaps you should try running in debug mode:
./xmlchange DEBUG=TRUE
./case.build --clean-all
./case.build
./case.submit
 

Will Smith
New Member
Hi Jedwards,

I followed your advice to run in debug mode, but the case build failed. Attached is the log file that was generated, for your reference. Could you please suggest any adjustments that might help resolve this issue?

Thank you for your continued support.
 

Attachments

  • CDEPS.bldlog.240502-141728.txt (136.3 KB)

Yongsheng zheng
Member
[zys@zys mycase]$ tail -n 14 /home/zys/cesm_files/output/mycase/run/cesm.log.240520-183341
cesm.exe 0000000000421CE2 Unknown Unknown Unknown
libc-2.17.so 00007F0F37344555 __libc_start_main Unknown Unknown
cesm.exe 0000000000421BE9 Unknown Unknown Unknown

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 32820 RUNNING AT zys
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

May I ask what is causing this run to fail?
 

Yongsheng zheng
Member
In reply to: "You may only ask after you have followed the instructions for reporting a problem in the forum."

In CESM 2.1.3, I ran ./case.submit and the problem above occurred. I am not sure whether the number of CPUs is insufficient or whether something else is wrong.


[zys@zys ~]$ cd /home/zys/model/cesm/cesm_2.1.3/cime/scripts/cases/compset_QPC6
[zys@zys compset_QPC6]$ ./xmlchange NTASKS=56
[zys@zys compset_QPC6]$ ./case.submit
File /home/zys/model/cesm/cesm_2.1.3/cime/scripts/cases/compset_QPC6/LockedFiles/env_mach_pes.xml has been modified
found difference in NTASKS : case 56 locked 28
ERROR: Invoke case.setup --reset
[zys@zys compset_QPC6]$ ./xmlchange NTASKS=56
[zys@zys compset_QPC6]$ ./case.setup --reset
Successfully cleaned batch script .case.run
Creating batch scripts
Writing case.run script from input template /home/zys/model/cesm/cesm_2.1.3/cime/config/cesm/machines/template.case.run
Creating file .case.run
Writing case.st_archive script from input template /home/zys/model/cesm/cesm_2.1.3/cime/config/cesm/machines/template.st_archive
Creating file case.st_archive
If an old case build already exists, might want to run 'case.build --clean' before building
You can now run './preview_run' to get more info on how your case will be run
[zys@zys compset_QPC6]$ ./case.build --skip-provenance-check
Building case in directory /home/zys/model/cesm/cesm_2.1.3/cime/scripts/cases/compset_QPC6
sharedlib_only is False
model_only is False
Generating component namelists as part of build
Creating component namelists
Calling /home/zys/model/cesm/cesm_2.1.3/components/cam//cime_config/buildnml
...calling cam buildcpp to set build time options
CAM namelist copy: file1 /home/zys/model/cesm/cesm_2.1.3/cime/scripts/cases/compset_QPC6/Buildconf/camconf/atm_in file2 /home/zys/cesm_files/output/compset_QPC6/run/atm_in
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/slnd/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sice/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/data_comps/docn/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/srof/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sglc/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/swav/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sesp/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
Building gptl with output to file /home/zys/cesm_files/output/compset_QPC6/bld/gptl.bldlog.240520-215038
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/build_scripts/buildlib.gptl
Building mct with output to file /home/zys/cesm_files/output/compset_QPC6/bld/mct.bldlog.240520-215038
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/build_scripts/buildlib.mct
Building pio with output to file /home/zys/cesm_files/output/compset_QPC6/bld/pio.bldlog.240520-215038
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/build_scripts/buildlib.pio
Building csm_share with output to file /home/zys/cesm_files/output/compset_QPC6/bld/csm_share.bldlog.240520-215038
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/build_scripts/buildlib.csm_share
Building atm with output to /home/zys/cesm_files/output/compset_QPC6/bld/atm.bldlog.240520-215038
Building lnd with output to /home/zys/cesm_files/output/compset_QPC6/bld/lnd.bldlog.240520-215038
Building ice with output to /home/zys/cesm_files/output/compset_QPC6/bld/ice.bldlog.240520-215038
Building ocn with output to /home/zys/cesm_files/output/compset_QPC6/bld/ocn.bldlog.240520-215038
Building rof with output to /home/zys/cesm_files/output/compset_QPC6/bld/rof.bldlog.240520-215038
Building glc with output to /home/zys/cesm_files/output/compset_QPC6/bld/glc.bldlog.240520-215038
Building wav with output to /home/zys/cesm_files/output/compset_QPC6/bld/wav.bldlog.240520-215038
Building esp with output to /home/zys/cesm_files/output/compset_QPC6/bld/esp.bldlog.240520-215038
cam built in 1.841096 seconds
sglc built in 2.662916 seconds
sesp built in 2.649319 seconds
slnd built in 2.695173 seconds
swav built in 2.702382 seconds
sice built in 2.754334 seconds
srof built in 2.775750 seconds
docn built in 2.818056 seconds
Building cesm with output to /home/zys/cesm_files/output/compset_QPC6/bld/cesm.bldlog.240520-215038
Time spent not building: 0.691438 sec
Time spent building: 7.701663 sec
MODEL BUILD HAS FINISHED SUCCESSFULLY
[zys@zys compset_QPC6]$ ./case.submit
Creating component namelists
Calling /home/zys/model/cesm/cesm_2.1.3/components/cam//cime_config/buildnml
CAM namelist copy: file1 /home/zys/model/cesm/cesm_2.1.3/cime/scripts/cases/compset_QPC6/Buildconf/camconf/atm_in file2 /home/zys/cesm_files/output/compset_QPC6/run/atm_in
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/slnd/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sice/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/data_comps/docn/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/srof/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sglc/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/swav/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sesp/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
Checking that inputdata is available as part of case submission
Loading input file list: 'Buildconf/cam.input_data_list'
Loading input file list: 'Buildconf/docn.input_data_list'
Loading input file list: 'Buildconf/cpl.input_data_list'
Check case OK
submit_jobs case.run
Submit job case.run
Starting job script case.run
Generating namelists for /home/zys/model/cesm/cesm_2.1.3/cime/scripts/cases/compset_QPC6
Creating component namelists
Calling /home/zys/model/cesm/cesm_2.1.3/components/cam//cime_config/buildnml
CAM namelist copy: file1 /home/zys/model/cesm/cesm_2.1.3/cime/scripts/cases/compset_QPC6/Buildconf/camconf/atm_in file2 /home/zys/cesm_files/output/compset_QPC6/run/atm_in
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/slnd/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sice/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/data_comps/docn/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/srof/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sglc/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/swav/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/components/stub_comps/sesp/cime_config/buildnml
Calling /home/zys/model/cesm/cesm_2.1.3/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
-------------------------------------------------------------------------
- Prestage required restarts into /home/zys/cesm_files/output/compset_QPC6/run
- Case input data directory (DIN_LOC_ROOT) is /home/zys/model/cesm/inputdata
- Checking for required input datasets in DIN_LOC_ROOT
-------------------------------------------------------------------------
2024-05-20 21:50:53 MODEL EXECUTION BEGINS HERE
run command is mpirun -n 56 /home/zys/cesm_files/output/compset_QPC6/bld/cesm.exe >> cesm.log.$LID 2>&1
ERROR: RUN FAIL: Command 'mpirun -n 56 /home/zys/cesm_files/output/compset_QPC6/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /home/zys/cesm_files/output/compset_QPC6/run/cesm.log.240520-215052.
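
On the question of whether the machine has enough CPUs, one quick check is to compare the requested task count with the number of cores on the host. This is only a sketch: `check_tasks_vs_cores` is a hypothetical helper, `nproc` is assumed to be available (GNU coreutils), and a passing check does not by itself explain the failure above.

```shell
#!/bin/sh
# Hypothetical helper: compare a requested MPI task count with the cores on this host.
# Usage: check_tasks_vs_cores NTASKS [CORES]
check_tasks_vs_cores() {
    ntasks=$1
    # Use the core count reported by nproc unless one is passed explicitly.
    cores=${2:-$(nproc)}
    if [ "$ntasks" -gt "$cores" ]; then
        echo "oversubscribed: $ntasks tasks on $cores cores"
        return 1
    fi
    echo "ok: $ntasks tasks on $cores cores"
}

# Example, using the task count from the run command above:
# check_tasks_vs_cores 56
```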
 

jedwards

CSEG and Liaisons
Staff member
You have still not followed the reporting instructions or read the output provided by the model.
See log file for details: /home/zys/cesm_files/output/compset_QPC6/run/cesm.log.240520-215052
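
As a starting point for reading that log, the first lines that mention an error or abort are often more informative than the final traceback. A minimal sketch, assuming a grep that supports `-m` (GNU and BSD grep do); `first_errors` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: show the first few lines in a log that mention an error or abort.
# Usage: first_errors LOGFILE [MAX_LINES]
first_errors() {
    log=$1
    max=${2:-5}
    # -n adds line numbers, -i ignores case, -E enables alternation,
    # -m stops after the given number of matching lines.
    grep -n -i -E -m "$max" 'error|abort' "$log"
}

# Example, using the log path printed by the model:
# first_errors /home/zys/cesm_files/output/compset_QPC6/run/cesm.log.240520-215052
```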

Also note that the latest cesm2.1.x release is 2.1.5. The first thing you should try is to update to that release and see whether your issue has already been addressed.
 