Univa Grid Engine and scripts_regression_tests.py

Hello,
I'm porting CESM2 to a Linux cluster using GNU compilers and OpenMPI 1.10.1. We will be running on a 40-core node, but I hope to test on 16 cores.
I'm using the machine name eddie, and that seems to be recognised OK when I run scripts_regression_tests.py.

CIMEROOT is set up correctly (from .bashrc)

I have three questions, please:

1. The batch system is Univa Grid Engine.
I've created .cime/*.xml files, which I attach; I've checked them against the XSD schema.
When I run scripts_regression_tests.py, the new cases report:

"Batch_system_type is univa
ERROR: Did not find univa in valid values for BATCH_SYSTEM: ['nersc_slurm', 'lc_slurm', 'moab', 'pbs', 'lsf', 'slurm', 'cobalt', 'cobalt_theta', 'none']"

Have I configured univa wrongly in config_batch.xml, or have I missed something else?
(Univa Grid Engine is essentially the same as the old Sun Grid Engine.)
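
For illustration, here is a simplified sketch of the shape of batch_system entry I mean (the parallel-environment name, directives, and queue are placeholders, not the exact contents; the real file is attached as config_batch.xml.txt):

    <config_batch version="2.1">
      <batch_system type="univa" MACH="eddie">
        <!-- Grid Engine commands -->
        <batch_query>qstat</batch_query>
        <batch_submit>qsub</batch_submit>
        <batch_cancel>qdel</batch_cancel>
        <batch_directive>#$</batch_directive>
        <directives>
          <!-- "mpi" is a placeholder parallel-environment name -->
          <directive>-pe mpi {{ total_tasks }}</directive>
          <directive>-l h_rt={{ job_wallclock_time }}</directive>
        </directives>
        <queues>
          <queue walltimemax="48:00:00" default="true">standard</queue>
        </queues>
      </batch_system>
    </config_batch>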

I'm attaching the .cime/*.xml files (with a .txt extension added so the forum accepts them).
Also attached: the terminal output from scripts_regression_tests.py > eddie_login_node_tests.txt 2>&1,
and the output from ./describe_version > eddie-version.txt.


2. We have login nodes that can run qsub, and worker nodes that can't (unless a script sshes back into a login node to do the qsub).
Each job can last up to two days, so automated resubmission will be needed for CESM. Can the configuration allow resubmission/continuation jobs to work via that ssh?

3. When running scripts_regression_tests.py, I find the test_pylint tests fail on the login nodes because it can't create more threads (see eddie_login_node_tests.txt).
Can I easily configure scripts_regression_tests.py to do no multithreading, so that I can run the whole test suite on the login node, even if much more slowly?


Thanks for your attention!
Mike
 

Attachments

  • eddie_login_node_tests.txt (500.3 KB)
  • config_batch.xml.txt (3.5 KB)
  • config_compilers.xml.txt (605 bytes)
  • config_machines.xml.txt (1.7 KB)

jedwards

CSEG and Liaisons
Staff member
1. Batch system valid values are defined in cime/src/drivers/mct/cime_config/config_component.xml, in the BATCH_SYSTEM entry; you will need to add univa there.
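
The entry looks something like this (the surrounding fields are sketched from memory and may differ slightly in your CIME version; the one change you need is adding univa to valid_values):

    <entry id="BATCH_SYSTEM">
      <type>char</type>
      <!-- add univa to the list of recognised batch systems -->
      <valid_values>nersc_slurm,lc_slurm,moab,pbs,lsf,slurm,cobalt,cobalt_theta,univa,none</valid_values>
      <default_value>none</default_value>
      <group>config_batch</group>
      <file>env_batch.xml</file>
      <desc>The batch system type to use for this machine.</desc>
    </entry>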

2. You can use ssh to return to the login node and run qsub from there (see the stampede-skx machine for an example).
You can also use the --resubmit-immediate option to case.submit to submit all the jobs at once and let the queueing system handle the dependencies.
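
In config_batch.xml the ssh approach looks roughly like this (login01 is a placeholder for any host that can run qsub; see the stampede-skx entry for the real pattern):

    <batch_system type="univa" MACH="eddie">
      <!-- submit from a worker node by sshing back to a login node -->
      <batch_submit>ssh login01 cd $CASEROOT ; qsub</batch_submit>
    </batch_system>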

3. scripts_regression_tests.py does nothing to control threads in pylint and the default number of threads is 1. Is it possible there is an alias or different default on your system causing it to try to use more threads?
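
One thing worth checking on the login node: pylint's -j/--jobs flag controls the number of worker processes, and a wrapper or rc file could be raising it (somemodule.py below stands in for whatever is being linted).

    # See whether pylint is wrapped by an alias or script on this machine
    type pylint

    # Force single-process linting (-j 0 would auto-detect all cores)
    pylint --jobs=1 somemodule.py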
 
Thanks Jim. Just to confirm, these questions are closed:
1. I added "univa" as you said, and that resolved it.
2. Not got there yet, but thanks for confirming that these approaches are supported.
3. On our cluster we have login nodes (usually used for doing qsub), interactive nodes, and the routine worker nodes. I now run scripts_regression_tests.py on an interactive node, which gives processes more memory than the login node does. That resolved it. (The other thing that opened up this way of running the script is that in the past we were unable to run qsub from the interactive nodes; that constraint no longer exists.)
 