This is another message about porting CESM to the NOC Anemone machine. The first test now runs to completion on the login node, but when I try a batch run on the main set of compute nodes, it fails in the (initial?) bootstrap section. As recommended by other users of Anemone, I am using Intel MPI. The only output is in the cesm log file, which contains:
check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on compute002 (pid 38367, exit code 256)
poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1031): error setting up the bootstrap proxies
error setting up the bootstrap proxies
Possible reasons:
1. Host is unavailable. Please check that all hosts are available.
2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
4. slurm bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
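To check whether the failure is specific to CESM, I put together a minimal batch script that exercises the Intel MPI hydra launcher on its own, following reason 4 above by also trying the ssh bootstrap. This is only a sketch of my test; the node counts and time limit are specific to my setup:

```shell
#!/bin/bash
#SBATCH --job-name=impi-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00

# Confirm the bootstrap proxy is on PATH on the launch node.
which hydra_bstrap_proxy || echo "hydra_bstrap_proxy not found"

# Try the default (slurm) bootstrap, then fall back to ssh,
# as suggested by reason 4 in the error message.
mpiexec.hydra -n 2 hostname
I_MPI_HYDRA_BOOTSTRAP=ssh mpiexec.hydra -n 2 hostname
```

If the ssh fallback succeeds where the default fails, that would point at the slurm bootstrap rather than at CESM itself.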
Having searched the CESM and Intel forums, I suspect this is a system error.
Unfortunately the many other users of Anemone have no problems of this sort. As far as I can tell they all submit jobs using batch submission files, so their recommendations refer to various #SBATCH lines, environment variables or (and I tried it) replacing mpirun with mpiexec.hydra. For various reasons there are no professional support staff except for the installation of system updates.
So to run CESM, I really have to sort this out for myself. I have tried various changes to parameters in the xml files, e.g. asking for verbose output, without success or further output. I have therefore had to hunt through the CIME Python files, starting with case.submit, to see if I could deduce how a job is submitted and what final set of commands is sent to sbatch - but have found the process slow, frustrating and so far without success.
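For what it is worth, the most promising idea I have had so far is to wrap Python's subprocess machinery so that every command a library issues gets printed before it runs. This wrapping trick is my own, not a CIME feature, and I have not verified which subprocess entry points CIME actually uses - a sketch:

```python
import subprocess

# Keep a reference to the real implementation.
_real_run = subprocess.run

def traced_run(*args, **kwargs):
    # Print the exact command line before it is executed, so a
    # final sbatch/mpirun invocation would show up in the terminal.
    cmd = args[0] if args else kwargs.get("args")
    print(f"[trace] running: {cmd}")
    return _real_run(*args, **kwargs)

subprocess.run = traced_run

# Demonstration: any later code that calls subprocess.run is traced.
result = subprocess.run(["echo", "hello"], capture_output=True, text=True)
print(result.stdout.strip())
```

If CIME launches jobs through subprocess.Popen or os.system instead, those would need the same treatment.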
I found the file ".case.run" in the mycase directory. This starts with a few #SBATCH lines, but these are followed by more Python and, again, calls to a series of routines deep within the CIME hierarchy to do the hard work.
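For anyone unfamiliar, the shape of that file is roughly this: a Python script whose leading comment lines double as batch directives, so sbatch reads them while Python ignores them. The body below is my stand-in, not the actual CIME contents:

```python
#!/usr/bin/env python3
#SBATCH --job-name=mycase.run
#SBATCH --nodes=2

# The real .case.run imports CIME and hands off to routines deep in
# the library; this stand-in just shows the hybrid structure.
def main():
    print("would call CIME's run logic here")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```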
So:
1. Is there a flag that allows me to see the commands finally sent to the batch system - which I could compare with the ones used by other users?
2. If not, is there a CIME Python routine into which I can insert print or other statements to give me the same information?
3. If not, is there a CIME document which explains how the batch submission process works and what it is trying to do at each stage?
Thanks for any help.