peter_m_kalmus@jpl_nasa_gov
New Member
I find that CESM and standalone CAM both hang intermittently when running on multiple procs, even on the same node. Is there a race condition? This is a generic_linux_intel port. It always hangs in a similar place in the log file.

Here is the tail end of the log output from a CAM run, showing the point of the hang. This was with OMP_NUM_THREADS = 1 and ntasks = 10. It also hung with ntasks = 12. In CESM, I noticed that I could sometimes stop it from hanging by changing the ntasks values in env_mach_pes.xml. I'm not sure what other info would be helpful. This is CESM1_0.

...procid 8 assigned 4 latitude values from 32
through 35
procid 9 assigned 4 latitude values from 36
through 39
procid 10 assigned 4 latitude values from 40
through 43
procid 11 assigned 3 latitude values from 44
through 46
gid 1 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
gid 5 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 10 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 9 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
gid 11 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 0 imxy 72 jmxy 3 4 4
gid 3 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
gid 4 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 4 4 4 4
4 4 3 kmyz 30
gid 6 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
gid 7 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
gid 8 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
gid 2 imxy 72 jmxy 3 4 4
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 4 4 4 4
4 4 3 jmyz 3 4 4
4 4 3 jmyz 3 4 4
4 4 4 4 4 4
4 4 3 kmyz 30
4 4 3 kmyz 30
4 4 4 4 4 4
4 4 3 kmyz 30
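
Since the output from the different ranks is interleaved, one rough way to check whether every rank got as far as the decomposition printout before the hang is to count how often each marker appears in the log tail. The following is only a minimal sketch: the function name `count_markers` is hypothetical, and it assumes each rank prints each of the `gid`, `jmxy`, `jmyz` and `kmyz` markers exactly once, which I have not verified against the CAM source.

```python
import re
from collections import Counter

# Markers that appear in the decomposition printout in the log tail.
# Assumption: each MPI rank prints each marker once.
MARKERS = ("gid", "jmxy", "jmyz", "kmyz")


def count_markers(log_text):
    """Return a Counter with the number of occurrences of each marker."""
    counts = Counter()
    for marker in MARKERS:
        # \b keeps "jmyz" from matching inside some longer token.
        counts[marker] = len(re.findall(r"\b" + marker + r"\b", log_text))
    return counts


# A few lines copied from the log tail, as a small usage example.
sample = (
    "gid 1 imxy 72 jmxy 3 4 4\n"
    "4 4 4 4 4 4\n"
    "4 4 3 jmyz 3 4 4\n"
    "4 4 4 4 4 4\n"
    "4 4 3 kmyz 30\n"
)
for marker, n in count_markers(sample).items():
    print(marker, n)
```

If a marker's count is lower than the number of tasks, that would suggest some ranks stalled before printing it. If all counts match the number of tasks, the hang is more likely somewhere after this printout.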