Issue installing on Centos 8 with slurm and lmod

william.wilson

William Wilson
New Member
Our partition name is "core", which is our default partition. Also, I agree that sbatch should probably be used, but if you do need ssh then you would need to set up an ssh keypair for monsoon, which goes in .ssh/authorized_keys in your home directory. The following link has information on generating keypairs.
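For reference, a minimal sketch of setting up such a keypair (assuming OpenSSH and a home directory that is shared with the compute nodes; adjust the key type and file names to your site policy):

# generate a keypair (add a passphrase if site policy requires one)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
# authorize the new public key for logins back into monsoon
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys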

 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
Erik, can you think of any other workaround, or a config file that maybe hasn't been set up correctly, that would cause slurm's "out-of-memory handler" to kill jobs and keep slurm from resubmitting?

Hmmm. Well, I would look in the cpl.log file to see how much memory is being used and make sure you aren't using too much for your specific machine. I'd also talk to the system administrators for your machine about what could cause this error in slurm. You may also need to tell slurm how much memory you need for your case; the default that slurm gives you may be too small. Another question is whether you can submit to the queue from a compute node. Sometimes you have to do special things to get that to work (certain modules that need to be added, or additional paths for batch executables). But talk to your sys admins about that and what you need to do for it.
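As a rough sketch of both suggestions (the file names and memory value here are illustrative, not specific to this machine), you could check the coupler's reported memory high-water marks and then raise the per-job memory request via a SLURM directive:

# look for the memory high-water marks the coupler reports in the run directory
grep -i memory cpl.log.*
# a larger per-job memory request is expressed as a SLURM batch directive, e.g.
#SBATCH --mem=32000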
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
The changes I made in .cime/config_batch.xml were not reflected in env_batch.xml when I created a new case.

The changes in the $HOME/.cime directory are only picked up when you are using the specific machine listed in the files in that directory, and only for the user who has that directory under their home. If there's a typo somewhere (in the config files, or in how you refer to the machine in create_newcase) it won't work. I'd suggest adding a new machine there and doing something really simple and obvious, or perhaps trying it on your laptop; that's the case where I've used this before. Try the simplest thing first, get that to work, and then hopefully you can move on to the more complicated cases. We know this particular machine has some complexities that we don't have on other machines, so trying it in the simpler case can help inform what's going on.

The other thing I wonder about is the details of the batch setup for this machine, and that's something you'll need help with from your sys admins. Does it have login nodes that you can use interactively, or does everything have to be submitted to compute nodes? And are there differences between the login and compute nodes that are causing trouble? I'm thinking some of that may be coming into play here, for example if compute nodes can't see your $HOME/.cime directory.
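As one hedged illustration of the $HOME/.cime approach (the machine name, memory value, and queue name below are placeholders, and the exact schema depends on your CIME version), a minimal config_batch.xml entry for a SLURM machine might look roughly like:

<?xml version="1.0"?>
<config_batch version="2.1">
  <!-- "mymachine" is a placeholder; it must match the machine name passed to create_newcase -->
  <batch_system MACH="mymachine" type="slurm">
    <directives>
      <directive>--mem=32000</directive>
    </directives>
    <queues>
      <queue default="true">core</queue>
    </queues>
  </batch_system>
</config_batch>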
 

william.wilson

William Wilson
New Member
I am the system admin. I took a copy of a working env_batch.xml file, copied it to ~/.cime/config_batch.xml, and am running it as myself with our specific machine setup. Jon Wells and I are working on getting CESM going; this is my first time trying to get CESM set up. We are on a cluster that uses slurm for its batch system. Jobs do seem to be firing off, but we are having issues.
 

jedwards

CSEG and Liaisons
Staff member
The files in $HOME/.cime/ are appended to the default files from the source tree, so you should not repeat any sections that are in the default file; doing so should result in an error. I haven't tried copying env_batch.xml from a case to $HOME/.cime/config_batch.xml, but I'm surprised that it did not cause an error. Do you have xmllint in your path?
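A quick way to sanity-check the file yourself (this only verifies that the XML is well formed, not that it matches the CIME schema) is:

xmllint --noout ~/.cime/config_batch.xml && echo "well-formed"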
 

william.wilson

William Wilson
New Member
The document I was working from did not indicate that the files are appended, so we will not go down that route for what we want to do. And yes, xmllint is in our path; we are on CentOS 8.3. Mainly what we need to get past is what Jon Wells is asking about now.
 

jonwells04

Jon Wells
New Member
Hi Erik and J, thank you again for all the help!

We updated the --mem tag to --mem=32000 in the config_batch.xml file for our machine and have gotten past the memory errors, at least on simpler example cases (I1850Clm50Sp, f09_g17).

We have a new issue:

The batch job submits through slurm; 4 nodes are reserved initially (8 of 24 CPUs and 32 GB of 128 GB available on each node), with a separate node reserved for the st_archive process. The job on the first 4 nodes runs for about 3 minutes and then ends.

The st_archive job then queues up and, when running, lasts for about a minute. We were hoping you could shed some light on what you think is happening. We're not entirely sure what the standard behavior is, but we suspect we are not capturing an error, or that we need to tweak slurm/CESM settings to have the processes run continuously on the reserved nodes.
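One hedged way to see what SLURM itself recorded for the finished jobs (the job id below is a placeholder) is sacct, which reports the state, exit code, runtime, and memory use of each step:

sacct -j 123456 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS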

I'm stuck in the queue on our batch system, so I'll post this for now and update the files in a second post later.
 

jedwards

CSEG and Liaisons
Staff member
You should try running scripts_regression_tests.py; this should illuminate any problems.
To make sure that the run and archive are working correctly, an ERR test is a good choice:
./create_test ERR.f19_g17.X
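If it helps, these are typically invoked from the CIME scripts area of the checkout; a hedged sketch, assuming a standard CESM layout with CIME under cime/ (paths may differ in your installation):

# run the regression tests from the tests directory
cd cime/scripts/tests
./scripts_regression_tests.py
# create the ERR test from the scripts directory
cd ..
./create_test ERR.f19_g17.X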
 

william.wilson

William Wilson
New Member
I'm running the test script and it has been sitting on the following for close to an hour.


test_configure (__main__.K_TestCimeCase) ... ok
test_create_test_longname (__main__.K_TestCimeCase) ... ok
test_env_loading (__main__.K_TestCimeCase) ... skipped 'Skipping env load test - Only works on mappy'
test_self_build_cprnc (__main__.K_TestCimeCase) ...
 

jonwells04

Jon Wells
New Member

The run.cesmtest does not appear to have an error and ends with:
run command is mpirun -np 32 /scratch/jw2636/cesm/scratch/cesmtest/bld/cesm.exe >> cesm.log.$LID 2>&1
check for resubmit
dout_s True
mach monsoon
resubmit_num 0

The env_batch.xml and cesm.log are also attached. It appears that something is being completed and output, but resubmission isn't working?
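As an aside, "resubmit_num 0" just means the case was not set up to resubmit itself; if you want the run to continue automatically, one hedged sketch (run from the case directory) is:

# ask CIME to resubmit the run twice after the first segment completes
./xmlchange RESUBMIT=2
./case.submit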

We'll update with results of scripts_regression_tests.py ASAP

Thanks!
 

Attachments

  • Monsoon-slurm.zip (8.5 KB)

jedwards

CSEG and Liaisons
Staff member
This cesm.log seems to indicate that the model is hanging without completing the run. Could you please include all of the component log files as well?
 

jedwards

CSEG and Liaisons
Staff member
The run did complete; it wasn't obvious from just the cesm.log. In your case directory there should be slurm log files for the model run and for the st_archiver. Have you checked them for error messages?
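A quick, hedged way to scan those logs (the file names below are illustrative; use whatever slurm actually wrote into the case directory) is:

grep -iE "error|fail" run.cesmtest* st_archive.cesmtest*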
 

jonwells04

Jon Wells
New Member
That sounds like good news to me!

I apologize; we're not CESM experts and expected a longer run time. The st_archive.cesmtest log has two warnings but no errors. run.cesmtest doesn't appear to have errors, but it shows resubmit 0, if that matters.

Now that we potentially have a completed run we'll ask someone on our end to take a look and check the output. Hopefully we're almost there. Thank you again!
 

Attachments

  • run-st_archive.zip (1.5 KB)

jonwells04

Jon Wells
New Member
Thank you for the link; we'll start port validation as well. I suspect we have further tweaks to make, but we're happy to be at the validation step now that we can submit and complete case runs. Thanks!
 