B cases hanging during initialization of atm

usha k

Usha K H
New Member
Hello,

I'm running CESM v2.1.5 with the B1850 compset. The run keeps hanging near the end of initialization, irrespective of the number of nodes. None of the log files show an error message. I have come across similar issues in the forum, but none of the suggestions help. I am using the Intel compiler v2023.1.0, with Intel MPI (impi) as my MPI library and PBS for job submission. I have used up to 9 nodes, with a maximum of 128 tasks per node (it is our institute's system). As suggested in other discussions, I have set export OMP_STACKSIZE=256M, ulimit -c unlimited, etc. The PE layout is the default one:
./pelayout
Comp NTASKS NTHRDS ROOTPE
CPL : 1024/ 1; 0
ATM : 1024/ 1; 0
LND : 512/ 1; 0
ICE : 512/ 1; 512
OCN : 128/ 1; 1024
ROF : 512/ 1; 0
GLC : 1024/ 1; 0
WAV : 256/ 1; 0
ESP : 1/ 1; 0
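For reference, a minimal sketch of how a layout like this is usually inspected and adjusted from the case directory, assuming a standard CIME case (the task count below is purely illustrative, not a recommendation):

# Query the current decomposition per component
./xmlquery NTASKS,NTHRDS,ROOTPE
./pelayout

# Example: shrink one component's task count, then reset the setup
# (a clean rebuild is safest after any layout change)
./xmlchange NTASKS_ATM=512
./case.setup --reset
./case.build --clean-all && ./case.build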

Kindly help. I have attached my run log and machine files (machine name is champ).
 

Attachments

  • atm_log.txt (2.1 KB)
  • cpl_log.txt (41 KB)
  • cesm_log.txt (11 KB)
  • config_compilers.txt (46.9 KB)
  • config_machines.txt (125.5 KB)

jedwards

CSEG and Liaisons
Staff member
There is no indication of a problem here. I notice that you are not setting a batch system (e.g. pbs or slurm).
Are you sharing the compute nodes with other jobs? Have you had a champ system expert look at the problem with you?
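A quick way to confirm what the case itself thinks the batch system is, assuming a standard CIME case directory (these are stock CIME query commands):

./xmlquery BATCH_SYSTEM   # should report pbs if PBS is configured for this machine
./preview_run             # shows the qsub command and mpirun line CIME will actually use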
 

usha k

Usha K H
New Member
jedwards said:
There is no indication of a problem here. I notice that you are not setting a batch system (e.g. pbs or slurm). Are you sharing the compute nodes with other jobs? Have you had a champ system expert look at the problem with you?

Hi, actually I have set the batch system to pbs in both config_machines and config_batch. I have now attached the env_mach_specific file. The compute nodes are not shared. Also, I am using the maximum number of cores available per node (i.e. 128 cores). I even tried running with DEBUG=TRUE. Since there are no errors, the champ system people are not able to help at this point.

What am I missing here?
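For reference, a minimal sketch of how a debug build is normally enabled from the case directory (DEBUG is the standard CIME flag; a clean rebuild is required for it to take effect):

./xmlchange DEBUG=TRUE
./case.build --clean-all
./case.build
./case.submit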
 

Attachments

  • env_mach_specific.txt (1.3 KB)

jedwards

CSEG and Liaisons
Staff member
You should have a PBS log file in your case directory - have you looked for errors there?
 

usha k

Usha K H
New Member
Hi, I could not find any errors. I am sharing the contents of the file named run.B1850_test_impi.o188826.o188826:

Setting resource.RLIMIT_STACK to -1 from (-1, -1)
Generating namelists for /home/ushah/cesm_work/cases/B1850_test_impi
- Prestaging REFCASE (/home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01) to /scratch/ushah/cesm_work/output/B1850_test_impi/B1850_test_impi/run
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.drv
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.atm
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.ice
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.glc
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.lnd
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.ocn.restart
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.ocn.ovf
Copy rpointer /home/ushah/cesm_work/inputdata/cesm2_init/b.e20.B1850.f09_g17.pi_control.all.299_merge_v3/0134-01-01/rpointer.rof
Creating component namelists
Calling /home/ushah/cesm_work/cesm215_intel/components/cam//cime_config/buildnml
CAM namelist copy: file1 /home/ushah/cesm_work/cases/B1850_test_impi/Buildconf/camconf/atm_in file2 /scratch/ushah/cesm_work/output/B1850_test_impi/B1850_test_impi/run/atm_in
Calling /home/ushah/cesm_work/cesm215_intel/components/clm//cime_config/buildnml
Calling /home/ushah/cesm_work/cesm215_intel/components/cice//cime_config/buildnml
Calling /home/ushah/cesm_work/cesm215_intel/components/pop//cime_config/buildnml
Calling /home/ushah/cesm_work/cesm215_intel/components/mosart//cime_config/buildnml
Running /home/ushah/cesm_work/cesm215_intel/components/cism//cime_config/buildnml
Calling /home/ushah/cesm_work/cesm215_intel/components/ww3//cime_config/buildnml
Calling /home/ushah/cesm_work/cesm215_intel/cime/src/components/stub_comps/sesp/cime_config/buildnml
Calling /home/ushah/cesm_work/cesm215_intel/cime/src/drivers/mct/cime_config/buildnml
Finished creating component namelists
-------------------------------------------------------------------------
- Prestage required restarts into /scratch/ushah/cesm_work/output/B1850_test_impi/B1850_test_impi/run
- Case input data directory (DIN_LOC_ROOT) is /home/ushah/cesm_work/inputdata
- Checking for required input datasets in DIN_LOC_ROOT
-------------------------------------------------------------------------
2025-07-03 16:20:31 MODEL EXECUTION BEGINS HERE
run command is /app/compilers/oneapi/2023/mpi/2021.9.0/bin/mpirun -bootstrap=ssh -np 1152 -ppn 128 /scratch/ushah/cesm_work/output/B1850_test_impi/B1850_test_impi/bld/cesm.exe >> cesm.log.$LID 2>&1

There are no other log files in the case directory, as far as I know.
 

jedwards

CSEG and Liaisons
Staff member
That is the file I was referring to. I would try a case (or an MPI hello-world example) on a single node; if that works, try two nodes, and so on until you get to the full size of the problem. It looks like you are just using two nodes. Sometimes a memory error will cause the model to fail without a log message, so try changing MAX_MPITASKS_PER_NODE to 64 and then 32 to reduce the number of tasks per node and thus the per-node memory requirement.
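A minimal sketch of that bottom-up test, assuming the oneAPI 2023 toolchain shown in the logs above (mpiicc is the Intel MPI wrapper for icc; task counts are illustrative, and the mpirun lines should be run from inside an interactive PBS job on the compute nodes):

# 1. MPI hello-world, scaled up one node at a time
cat > hello_mpi.c << 'EOF'
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpiicc hello_mpi.c -o hello_mpi
mpirun -bootstrap=ssh -np 128 -ppn 128 ./hello_mpi   # 1 node
mpirun -bootstrap=ssh -np 256 -ppn 128 ./hello_mpi   # 2 nodes, and so on

# 2. In the CESM case directory, reduce tasks per node to lower per-node memory use
./xmlchange MAX_MPITASKS_PER_NODE=64    # then 32 if it still hangs
./case.setup --reset
./case.build --clean-all && ./case.build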
 

usha k

Usha K H
New Member
Hi, I tried a single node with MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE set to 128, and the issue persisted. I also tried reducing them to 64 and 32, with the same result. I am sharing the output of ./preview_run for 64 tasks per node.

CASE INFO:
nodes: 9
total tasks: 576
tasks per node: 64
thread count: 1

BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment I_MPI_CC=icc
Setting Environment I_MPI_CXX=icpc
Setting Environment I_MPI_FC=ifort
Setting Environment OMP_STACKSIZE=256M
Setting Environment NETCDF_PATH=/apps/netcdf-CandFortran_ic2023
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
qsub -q workq -l walltime=48:00:00 -v ARGS_FOR_SCRIPT='--resubmit' .case.run

MPIRUN (job=case.run):
/app/compilers/oneapi/2023/mpi/2021.9.0/bin/mpirun -bootstrap=ssh -np 576 -ppn 64 /scratch/ushah/cesm_work/output/B1850_test_impi/B1850_test_impi/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
Setting Environment I_MPI_CC=icc
Setting Environment I_MPI_CXX=icpc
Setting Environment I_MPI_FC=ifort
Setting Environment OMP_STACKSIZE=256M
Setting Environment NETCDF_PATH=/apps/netcdf-CandFortran_ic2023
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
qsub -q workq -l walltime=0:20:00 -W depend=afterok:0 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
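One way to make a single-node test unambiguous is to shrink the whole case onto one node rather than keeping the nine-node decomposition; a sketch, assuming the standard CIME group variables (setting NTASKS/ROOTPE without a component suffix applies to every component, and 128 here is illustrative):

./xmlchange NTASKS=128
./xmlchange ROOTPE=0
./xmlchange NTHRDS=1
./case.setup --reset
./case.build --clean-all && ./case.build
./preview_run     # should now report nodes: 1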
 

jedwards

CSEG and Liaisons
Staff member
Did you try an MPI hello-world on 1 node, then 2, and so on, before attempting to run CESM? When you try CESM, start with a simple compset such as X or A; you are currently attempting to start at the finish line.
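A minimal sketch of that progression, assuming create_newcase lives under the cime/scripts directory of the source tree shown earlier and that the machine entry is named champ (the case name and resolution are illustrative):

cd /home/ushah/cesm_work/cesm215_intel/cime/scripts
./create_newcase --case /home/ushah/cesm_work/cases/X_test --compset X --res f19_g17 --machine champ --run-unsupported
cd /home/ushah/cesm_work/cases/X_test
./case.setup
./case.build
./case.submit

If X (all dead components) runs cleanly, move up to A (all data components), and only then back to B1850.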
 