Is there a way to see the script used to set up a CESM batch run?

djw

David Webb
New Member
This is another message about porting CESM to the NOC Anemone machine. The first test now runs to completion on the login node, but when I try a batch run, using the main set of compute nodes, it fails in the (initial?) bootstrap section. As recommended by other users of Anemone, I am using the Intel MPI library. The only output is in the cesm log file, which contains:

check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on compute002 (pid 38367, exit code 256)
poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1031): error setting up the bootstrap proxies
error setting up the bootstrap proxies
Possible reasons:
1. Host is unavailable. Please check that all hosts are available.
2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
4. slurm bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
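
For reference, reasons 3 and 4 above point at two knobs that can be tried directly from the shell before blaming CESM itself. The lines below are only a sketch: the port range and launcher choice are guesses, not confirmed Anemone settings.

export I_MPI_PORT_RANGE=20000:20100        # open a known port range through the firewall (example values)
export I_MPI_HYDRA_BOOTSTRAP=ssh           # or "slurm", to have the proxies launched via srun
mpiexec.hydra -bootstrap ssh -n 4 hostname     # minimal check that the bootstrap proxies start at all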

Having searched the CESM and Intel forums, I suspect this is a system error.

Unfortunately the many other users of Anemone have no problems of this sort. As far as I can tell they all submit jobs using batch submission files, so their recommendations refer to various #SBATCH lines, environment variables or (and I tried it) replacing mpirun by mpiexec.hydra. For various reasons there are no professional support staff, apart from the installation of system updates.

So to run CESM, I really have to sort this out for myself. I have tried various changes to parameters in the xml files, e.g. asking for verbose output, without success or further output. I have therefore had to hunt through the CIME python files, starting with case.submit, to see if I could deduce how a job is submitted and what final set of commands is sent to sbatch - but I have found the process slow, frustrating and so far unsuccessful.
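
Rather than reading every file, one way to narrow the hunt is to search the CIME tree for the point where the sbatch command line is assembled. The path below is only an assumption about where CIME sits in the checkout:

grep -rn "sbatch" cime/ | less         # look for where the submit command is built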

I found the file ".case.run" in the mycase directory. This starts with a few #SBATCH lines, but these are followed by more Python and, again, calls to a series of routines deep within the CIME hierarchy to do the hard work.

So:
1. Is there a flag that allows me to see the commands finally sent to the batch system, which I could compare with the ones used by other users?
2. If not, is there a CIME python routine into which I can insert print or other statements to give me the same information?
3. If not, is there a CIME document which explains how the batch submission process works and what it is trying to do at each stage?

Thanks for any help.
 

jedwards

CSEG and Liaisons
Staff member
./preview_run will show you the syntax of the submit command, the mpirun command, and any environment variables that you set. I think that should give you the information you need, but you can also add a --debug flag to your case.submit command; that will produce a lot more output to stdout and will also create a file case.submit.log with even more debug information.
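
For example, from the case directory (the path here is just a placeholder):

cd ~/cases/mycase           # your case root
./preview_run               # prints the submit command, the mpirun command and the environment settings
./case.submit --debug       # much more verbose; also writes case.submit.log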
 

djw

David Webb
New Member
Working on the basis of the preview_run command and the contents of .case.run, I tried out different directives and command options in a simple 'echo Hello world' test; a sketch of the kind of test script I used is below. This showed, for example, that for four tasks (as used by the Basic CESM test), the slurm and mpiexec.hydra combination did not like specifying four nodes with a task on each, or two nodes with two tasks on each.
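
The tests were along these lines, with the node and tasks-per-node directives varied between runs; the values shown are illustrative rather than the exact combination that worked:

#!/bin/bash
#SBATCH --ntasks=4                 # four tasks, as in the Basic CESM test
#SBATCH --nodes=2                  # varied between runs: one, two or four nodes
#SBATCH --ntasks-per-node=2        # with the matching tasks-per-node value
mpiexec.hydra -n 4 echo "Hello world"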

However, in the end that was not the problem with running the CESM test. Instead, in the file config_batch.xml, I had unfortunately included <directive>--cpus-per-task={{ thread_count }}</directive>, copied from one of the other machines using slurm. When I removed this directive, the batch system worked correctly and the test ran to completion.
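
To confirm the change had taken effect, the regenerated .case.run can be checked directly (the case path is a placeholder, and the case may need ./case.setup --reset first so that the script is rebuilt):

grep SBATCH ~/cases/mycase/.case.run       # list the batch directives that will actually be submitted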

So many thanks. D.
 