Issue installing on Centos 8 with slurm and lmod

william.wilson

William Wilson
New Member
Our partition name is "core", which is our default partition. Also, I agree that sbatch should probably be used, but if you do need ssh then you would need to set up an ssh keypair for monsoon, which goes in .ssh/authorized_keys in your home directory. The following link has information on generating keypairs.
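For reference, a minimal sketch of setting up such a keypair (assuming OpenSSH and a home directory that is shared with the compute nodes; adjust the key type and file names to your site policy):

# generate a keypair (add a passphrase if site policy requires one)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
# authorize the new public key for logins back into monsoon
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys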

 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
Erik, can you think of any other workaround, or a config file that maybe hasn't been set up correctly, that would cause slurm's "out-of-memory handler" to kill jobs and keep slurm from resubmitting?

Hmmm. Well, I would look in the cpl.log file to see how much memory is being used and make sure you aren't using too much for your specific machine. I'd also talk to the system administrators for your machine about what could cause this error in slurm. You may also need to tell slurm how much memory you need for your case; the default that slurm gives you may be too small. Another question is whether you can submit to the queue from a compute node. Sometimes you have to do special things to get that to work (certain modules that need to be added, or additional paths for batch executables). But talk to your sys admins about that and what you need to do for it.
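As a rough sketch of both suggestions (the file names and memory value here are illustrative, not specific to this machine), you could check the coupler's reported memory high-water marks and then raise the per-job memory request via a SLURM directive:

# look for the memory high-water marks the coupler reports in the run directory
grep -i memory cpl.log.*
# a larger per-job memory request is expressed as a SLURM batch directive, e.g.
#SBATCH --mem=32000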
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
The changes I made in .cime/config_batch.xml were not reflected in env_batch.xml when I created a new case.

The changes in the $HOME/.cime directory are only picked up when you are using the specific machine listed in the files in that directory, and only for the user who has that directory under their home. If there's a typo somewhere (in the config files, or in how you refer to the machine in create_newcase) it won't work. I'd suggest adding a new machine there and doing something really simple and obvious, or perhaps trying it on your laptop; that's the case where I've used this before. Try the simplest thing first, get that to work, and then hopefully you can move on to the more complicated cases. We know this particular machine has some complexities that we don't have on other machines, so trying it in the simpler case can help inform what's going on.

The other thing I wonder about is the details of the batch setup for this machine, and that's something you'll need help with from your sys admins. Does it have login nodes that you can use interactively, or does everything have to be submitted to compute nodes? And are there differences between the login and compute nodes that are causing trouble? I'm thinking some of that may be coming into play here, for example if compute nodes can't see your $HOME/.cime directory.
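As one hedged illustration of the $HOME/.cime approach (the machine name, memory value, and queue name below are placeholders, and the exact schema depends on your CIME version), a minimal config_batch.xml entry for a SLURM machine might look roughly like:

<?xml version="1.0"?>
<config_batch version="2.1">
  <!-- "mymachine" is a placeholder; it must match the machine name passed to create_newcase -->
  <batch_system MACH="mymachine" type="slurm">
    <directives>
      <directive>--mem=32000</directive>
    </directives>
    <queues>
      <queue default="true">core</queue>
    </queues>
  </batch_system>
</config_batch>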
 

william.wilson

William Wilson
New Member
I am the system admin. I took a copy of a working env_batch.xml file, copied it to ~/.cime/config_batch.xml, and am running it as myself with our specific machine setup. Jon Wells and I are working on getting CESM going; this is my first time trying to get CESM set up. We are on a cluster that uses slurm for its batch system. Jobs do seem to be firing off, but we are having issues.
 

jedwards

CSEG and Liaisons
Staff member
The files in $HOME/.cime/ are appended to the default files from the source tree, so you should not repeat any sections that are in the default file; doing so should result in an error. I haven't tried copying env_batch.xml from a case to $HOME/.cime/config_batch.xml, but I'm surprised that it did not cause an error. Do you have xmllint in your path?
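A quick way to sanity-check the file yourself (this only verifies that the XML is well formed, not that it matches the CIME schema) is:

xmllint --noout ~/.cime/config_batch.xml && echo "well-formed"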
 

william.wilson

William Wilson
New Member
The document I was working from did not indicate that the files are appended, so we will not go down that route for what we want to do. And yes, xmllint is in our path; we are on CentOS 8.3. Mainly what we need to get past is what Jon Wells is asking about now.
 

jonwells04

Jon Wells
New Member
Hi Erik and J, thank you again for all the help!

We updated the --mem tag to --mem=32000 in the config_batch.xml file for our machine and have gotten past the memory errors, at least on simpler example cases (I1850Clm50Sp, f09_g17).

We have a new issue:

The batch job submits through slurm; 4 nodes are reserved initially (8 of 24 CPUs and 32 GB of 128 GB available on each node), with a separate node reserved for the st_archive process. The job on the first 4 nodes runs for about 3 minutes and then ends.

The st_archive job then queues up and, when running, lasts for about a minute. We were hoping you could shed some light on what you think is happening. We're not entirely sure what the standard behavior is, but we suspect we are not capturing an error, or that we need to tweak slurm/CESM settings to have the processes run continuously on the reserved nodes.
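One hedged way to see what SLURM itself recorded for the finished jobs (the job id below is a placeholder) is sacct, which reports the state, exit code, runtime, and memory use of each step:

sacct -j 123456 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS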

I'm stuck in the queue on our batch system, so I'll post this for now and update the files in a second post later.
 

jedwards

CSEG and Liaisons
Staff member
You should try running scripts_regression_tests.py; this should illuminate any problems.
To make sure that the run and archive are working correctly, an ERR test is a good choice:
./create_test ERR.f19_g17.X
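If it helps, these are typically invoked from the CIME scripts area of the checkout; a hedged sketch, assuming a standard CESM layout with CIME under cime/ (paths may differ in your installation):

# run the regression tests from the tests directory
cd cime/scripts/tests
./scripts_regression_tests.py
# create the ERR test from the scripts directory
cd ..
./create_test ERR.f19_g17.X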
 

william.wilson

William Wilson
New Member
I'm running the test script and it has been sitting on the following for close to an hour.


test_configure (__main__.K_TestCimeCase) ... ok
test_create_test_longname (__main__.K_TestCimeCase) ... ok
test_env_loading (__main__.K_TestCimeCase) ... skipped 'Skipping env load test - Only works on mappy'
test_self_build_cprnc (__main__.K_TestCimeCase) ...
 

jonwells04

Jon Wells
New Member

The run.cesmtest does not appear to have an error and ends with:
run command is mpirun -np 32 /scratch/jw2636/cesm/scratch/cesmtest/bld/cesm.exe >> cesm.log.$LID 2>&1
check for resubmit
dout_s True
mach monsoon
resubmit_num 0

The env_batch.xml and cesm.log are also attached. It appears that something is being completed and output, but resubmission isn't working?
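As an aside, "resubmit_num 0" just means the case was not set up to resubmit itself; if you want the run to continue automatically, one hedged sketch (run from the case directory) is:

# ask CIME to resubmit the run twice after the first segment completes
./xmlchange RESUBMIT=2
./case.submit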

We'll update with results of scripts_regression_tests.py ASAP

Thanks!
 

Attachments

  • Monsoon-slurm.zip (8.5 KB)

jedwards

CSEG and Liaisons
Staff member
This cesm.log seems to indicate that the model is hanging without completing the run. Could you please include all of the component log files as well?
 

jedwards

CSEG and Liaisons
Staff member
The run did complete; it wasn't obvious from just the cesm.log. In your case directory there should be slurm log files for the model run and for the st_archiver. Have you checked them for error messages?
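A quick, hedged way to scan those logs (the file names below are illustrative; use whatever slurm actually wrote into the case directory) is:

grep -iE "error|fail" run.cesmtest* st_archive.cesmtest*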
 

jonwells04

Jon Wells
New Member
That sounds like good news to me!

I apologize; we're not CESM experts and expected a longer run time. The st_archive.cesmtest log has two warnings but no errors. run.cesmtest doesn't appear to have errors, but it shows resubmit 0, if that matters.

Now that we potentially have a completed run we'll ask someone on our end to take a look and check the output. Hopefully we're almost there. Thank you again!
 

Attachments

  • run-st_archive.zip (1.5 KB)

jonwells04

Jon Wells
New Member
Thank you for the link; we'll start port validation as well. I suspect we have further tweaks to make, but we're happy to be at the validation step now that we can submit and complete case runs. Thanks!
 