Wall-clock limit only applies to the short-term archival (not the main simulation)

slubis

Sandro Lubis
New Member
Hello,

I have run a CESM2 simulation on CISL (Cheyenne) at a rate of 1.6 simulated years per day (24 hours). Following CISL's suggestion, I am running the simulation in segments of 9 months per 12-hour wall-clock limit, as follows:

./xmlchange --file env_run.xml --id STOP_OPTION --val nmonths
./xmlchange --file env_run.xml --id STOP_N --val 9
./xmlchange JOB_WALLCLOCK_TIME=12:00:00
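
As a rough check that a 9-month segment fits in a 12-hour wall-clock limit at the quoted rate of 1.6 simulated years per day, here is a minimal sketch in plain shell with awk. The function name estimate_wallclock_hours is made up for illustration and is not part of CESM or CIME.

```shell
#!/bin/sh
# Hypothetical helper (not a CESM/CIME tool): estimate the wall-clock hours
# needed for one run segment.
#   $1 = throughput in simulated years per day (1.6 in this thread)
#   $2 = segment length in simulated months (9 in this thread)
estimate_wallclock_hours() {
  awk -v sypd="$1" -v months="$2" \
    'BEGIN { printf "%.2f\n", (months / 12) / sypd * 24 }'
}

estimate_wallclock_hours 1.6 9   # prints 11.25
```

At that rate, 9 months needs about 11.25 hours, which leaves only a small margin under 12:00:00.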

However, the system keeps rejecting it with the following error (see below). It seems that the wall-clock limit does NOT apply to my main simulation but only to the short-term archival. Any suggestions on how to handle this issue? The CISL team asked me to report it here.

===
Submitting job script qsub -q share -l walltime=12:00:00 -A URIC0004 -W depend=afterok:8128909.chadmin1.ib0.cheyenne.ucar.edu -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
ERROR: Command: 'qsub -q share -l walltime=12:00:00 -A URIC0004 -W depend=afterok:8128909.chadmin1.ib0.cheyenne.ucar.edu -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive' failed with error 'qsub:
ERROR: your job has been rejected.

your declared wallclock time (43200 seconds)
exceeds your maximum limit of 21600 seconds
the queue limit is 21600 seconds

Please contact cislhelp@ucar.edu
for assistance' from dir '/glade/u/home/slubis/cases/FWHIST'
===

Best regards,
Sandro
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Sandro,

The issue is that the short-term archive script defaults to the "share" queue, which only has a max wallclock time of six hours. To be honest the archive script should almost never take that long, so I would recommend changing its wallclock time back to 20 minutes like so:

./xmlchange JOB_WALLCLOCK_TIME=0:20:00 --subgroup case.st_archive

before running ./case.submit to submit the case again. Of course if that doesn't work please let us know.
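
To see why the 12:00:00 request was rejected while 0:20:00 is accepted, here is a minimal sketch that converts a wall-clock string to seconds and compares it with the 21600-second limit quoted in the error message. The helper names to_seconds and fits_limit are hypothetical and only illustrate the arithmetic; they are not part of CIME or PBS.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CIME or PBS).

# Convert an HH:MM:SS string to seconds. awk treats the fields as decimal
# numbers, so values such as "08" are not misread as octal.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed if the wall-clock time in $1 is within the limit in seconds in $2.
fits_limit() {
  [ "$(to_seconds "$1")" -le "$2" ]
}

SHARE_LIMIT=21600   # six hours, taken from the qsub error message above

for wt in 12:00:00 0:20:00; do
  if fits_limit "$wt" "$SHARE_LIMIT"; then
    echo "$wt ($(to_seconds "$wt") s) fits within $SHARE_LIMIT s"
  else
    echo "$wt ($(to_seconds "$wt") s) exceeds $SHARE_LIMIT s"
  fi
done
```

The 12:00:00 request comes out at 43200 seconds, matching the value in the rejection message, while 0:20:00 is 1200 seconds.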

Hope that helps, and have a great day!

Jesse
 

slubis

Sandro Lubis
New Member
Hi Jesse,

Thanks for your suggestion, it works well. However, something looks very suspicious in my run, as it turned out to be extremely expensive. I ran a 9-month simulation with the FWHIST compset in CESM2, and it cost me almost 90,000 core-hours. I ran with 384 tasks x 3 threads for each component (atm, lnd, ocn, ice, etc.), but the cost does not change much even if I rerun with 1152 tasks and 1 thread. The model cost is roughly the same, 15373 pe-hrs/simulated year, which is still very expensive. It is a prescribed-SST simulation, so I did not expect that much even with interactive chemistry. Any suggestions?

Best,
Sandro
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Sandro,

It looks like your throughput is pretty close to what is posted on the CESM2 timing page for FWHIST, which is ~16,000 core-hours per simulated year (you can see the throughput for each simulation by looking in your case's timing/cesm_timing.XXX files). So my guess is that the 90,000 hours were actually spent over multiple model runs/submissions (which can happen during debugging), as opposed to just one nine-month simulation.
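
The arithmetic behind that guess can be sketched in a few lines of shell with awk, using only the numbers quoted in this thread (15373 pe-hrs per simulated year, a 9-month run, and roughly 90,000 core-hours charged). The function names are hypothetical and chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for a back-of-the-envelope cost check.

# Core-hours for one segment: $1 = pe-hrs per simulated year, $2 = months.
segment_cost() {
  awk -v rate="$1" -v months="$2" 'BEGIN { printf "%.0f\n", rate * months / 12 }'
}

# Simulated years that a given charge would buy: $1 = core-hours, $2 = rate.
years_for_charge() {
  awk -v charge="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", charge / rate }'
}

segment_cost 15373 9             # prints 11530
years_for_charge 90000 15373     # prints 5.9
```

A single 9-month run at the reported rate should cost about 11,500 core-hours, so 90,000 core-hours corresponds to nearly six simulated years, which is consistent with several submissions rather than one.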

In terms of reducing cost, it looks like you are generating daily output, so if you don't need that then you can remove that entry from user_nl_cam and likely save on cost. You can also try running at 2-degree resolution, running with specified chemistry, or load-balancing the case yourself to see if you get slightly better throughput. However, to be honest, ~16,000 core-hours per simulated year is pretty good for WACCM with full chemistry at 1-degree resolution, as that configuration is just very expensive to run (which to my knowledge is mostly due to chemistry).

Hope that helps, and sorry for the "bad" news!

Jesse
 