
intermittent runtime hanging

I find that CESM and standalone CAM both hang intermittently when running on multiple processes, even on a single node. Is there a race condition? This is a generic_linux_intel port, and the model is CESM1_0. The run always hangs at a similar place in the log file. In CESM, I noticed that I could sometimes stop it from hanging by changing the ntasks values in env_mach_pes.xml. I'm not sure what other info would be helpful.

Here is the tail end of the log output from a CAM run, showing the point of the hang. This was with OMP_NUM_THREADS 1 and ntasks = 10. It also hung with ntasks = 12.

...procid 8 assigned 4 latitude values from 32
through 35
procid 9 assigned 4 latitude values from 36
through 39
procid 10 assigned 4 latitude values from 40
through 43
procid 11 assigned 3 latitude values from 44
through 46
gid 1 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
gid 5 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 10 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 9 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
gid 11 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 0 imxy 72 jmxy 3 4 4
gid 3 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 4 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 4 4 4 4
4 4 3 kmyz 30
gid 6 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
gid 7 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
gid 8 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
gid 2 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 kmyz 30
4 4 4 4 4 4
4 4 3 kmyz 30 
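
Because the output of all ranks is interleaved in a log like the one above, it can be hard to see by eye whether every rank reached the same point before the hang. The following is a minimal sketch, not part of CESM or CAM, that counts the `gid`, `imxy`, `jmyz`, and `kmyz` markers in such a log. The function names `tally_markers` and `missing_ranks` are hypothetical, and the sketch assumes that each rank prints each marker exactly once.

```python
import re
from collections import Counter


def tally_markers(log_text):
    """Return (sorted rank ids seen after 'gid', Counter of decomposition markers)."""
    # Rank ids are taken from tokens of the form "gid <n>".
    gids = sorted({int(g) for g in re.findall(r"\bgid\s+(\d+)", log_text)})
    # Count how often each marker appears, regardless of which rank printed it.
    counts = Counter(re.findall(r"\b(imxy|jmyz|kmyz)\b", log_text))
    return gids, counts


def missing_ranks(gids, ntasks):
    """Return the ranks in 0..ntasks-1 that never printed a 'gid' line."""
    seen = set(gids)
    return [r for r in range(ntasks) if r not in seen]


if __name__ == "__main__":
    # Small hand-made sample in the same style as the log above (assumption:
    # three tasks were requested, only two have reported so far).
    sample = """\
 gid 1 imxy 72 jmxy 3 4 4
 4 4 4 4 4 4
 4 4 3 jmyz 3 4 4
 gid 0 imxy 72 jmxy 3 4 4
 4 4 3 kmyz 30
"""
    gids, counts = tally_markers(sample)
    print("ranks seen:", gids)
    print("marker counts:", dict(counts))
    print("ranks not seen:", missing_ranks(gids, ntasks=3))
```

If the marker counts equal the number of tasks, all ranks got at least as far as this decomposition output, which suggests the hang occurs later. A lower count for one marker points to ranks that stopped earlier.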
 

eaton

CSEG and Liaisons
Have you run with debug flags on?
What are the configure and build-namelist commands you're using?  
 

jedwards

CSEG and Liaisons
Staff member
What you are describing sounds like a race condition; however, the origin of that condition is not always easy to determine. If you find a problem in the model, we would be grateful to learn the details; however, the problem could also be in your machine's MPI library or your compiler's OpenMP threading implementation. Good luck.
 
