Wall-clock limit only applies to the short-term archival (not the main simulation)

slubis

Sandro Lubis
New Member
Hello,

I have run a CESM2 simulation on CISL (Cheyenne) at a rate of 1.6 simulated years per day (24 hours). Following CISL's suggestion, I am running the simulation in segments of 9 months per 12-hour wall-clock limit, as follows:

./xmlchange --file env_run.xml --id STOP_OPTION --val nmonths
./xmlchange --file env_run.xml --id STOP_N --val 9
./xmlchange JOB_WALLCLOCK_TIME=12:00:00

However, the system keeps rejecting it with the following error (see below). It seems that the wall-clock limit does NOT apply to my main simulation but only to the short-term archival. Any suggestions on how to handle this issue? The CISL team asked me to report it here.

===
Submitting job script qsub -q share -l walltime=12:00:00 -A URIC0004 -W depend=afterok:8128909.chadmin1.ib0.cheyenne.ucar.edu -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
ERROR: Command: 'qsub -q share -l walltime=12:00:00 -A URIC0004 -W depend=afterok:8128909.chadmin1.ib0.cheyenne.ucar.edu -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive' failed with error 'qsub:
ERROR: your job has been rejected.

your declared wallclock time (43200 seconds)
exceeds your maximum limit of 21600 seconds
the queue limit is 21600 seconds

Please contact cislhelp@ucar.edu
for assistance' from dir '/glade/u/home/slubis/cases/FWHIST'
===

Best regards,
Sandro
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Sandro,

The issue is that the short-term archive script defaults to the "share" queue, which only has a max wallclock time of six hours. To be honest the archive script should almost never take that long, so I would recommend changing its wallclock time back to 20 minutes like so:

./xmlchange JOB_WALLCLOCK_TIME=0:20:00 --subgroup case.st_archive

before running ./case.submit to submit the case again. Of course if that doesn't work please let us know.
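
For reference, the numbers in the rejection message can be checked with a small shell sketch. It is only an illustration: the helper names `to_seconds` and `fits_queue_limit` are hypothetical and not part of CESM or PBS, and the 21600-second limit is simply copied from the error output quoted in the first post.

```shell
# Convert an hh:mm:ss wallclock string to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Exit status 0 when the requested wallclock time fits within the limit
# (limit given in seconds), non-zero otherwise.
fits_queue_limit() {
    [ "$(to_seconds "$1")" -le "$2" ]
}

# Limit in seconds, taken from the qsub rejection message in this thread.
QUEUE_LIMIT=21600

for t in 12:00:00 0:20:00; do
    if fits_queue_limit "$t" "$QUEUE_LIMIT"; then
        echo "$t fits within the limit"
    else
        echo "$t exceeds the limit"
    fi
done
```

With these inputs, 12:00:00 corresponds to the 43200 seconds that qsub rejected, while 0:20:00 is 1200 seconds and fits comfortably.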

Hope that helps, and have a great day!

Jesse
 

slubis

Sandro Lubis
New Member
Hi Jesse,

Thanks for your suggestion; it works well. However, something in my run looks very suspicious, as it turns out to be extremely expensive. I ran a 9-month simulation with the FWHIST compset in CESM2, and it cost me almost 90000 core hours. I ran the simulation with 384 x 3 for each component (atm, lnd, ocn, ice, etc.), but the cost does not change much even if I rerun with 1152 tasks and 1 thread. The model cost is roughly the same, 15373 pe-hrs/simulated year, which is still very expensive. It is a prescribed-SST simulation, so I did not expect that much even with interactive chemistry. Any suggestions?

Best,
Sandro
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi Sandro,

It looks like your throughput is pretty close to what is posted on the CESM2 timing page for FWHIST, which is ~16000 core-hours per simulated year (you can see your throughput for each simulation by looking in your case's timing/cesm_timing.XXX files). So my guess is that the 90,000 hours were actually spent over multiple model runs/submissions (which can happen during debugging), as opposed to just one nine-month simulation.
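
As a rough sanity check of that reasoning, the arithmetic can be sketched in shell. The helper names `run_cost` and `runs_in_total` are made up for illustration, the inputs are the figures quoted in this thread, and the sketch assumes that pe-hrs and charged core hours are directly comparable, which may not hold exactly for a given allocation.

```shell
# Cost of one run: model cost (pe-hrs per simulated year) scaled by run length.
run_cost() {
    awk -v rate="$1" -v months="$2" 'BEGIN { printf "%.0f\n", rate * months / 12 }'
}

# Number of runs of a given cost that add up to a total charge.
runs_in_total() {
    awk -v total="$1" -v cost="$2" 'BEGIN { printf "%.1f\n", total / cost }'
}

# 15373 pe-hrs/simulated year and 9 months are the values reported above.
cost=$(run_cost 15373 9)
echo "one 9-month run: ~${cost} pe-hrs"
echo "9-month runs covered by 90000 core hours: ~$(runs_in_total 90000 "$cost")"
```

Under those assumptions a single nine-month run comes to roughly 11500 pe-hrs, so a 90000-hour charge would correspond to several submissions, not one.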

In terms of reducing cost, it looks like you are generating daily output, so if you don't need that then you can remove that entry from user_nl_cam and likely save on cost. You can also try running at 2-degree resolution, running with specified chemistry, or you can try load-balancing yourself to see if you get slightly better output. However, to be honest the ~16,000 core-hours a year is pretty good for WACCM with full chemistry at 1-degree resolution, as that configuration is just very expensive to run (which to my knowledge is mostly due to chemistry).

Hope that helps, and sorry for the "bad" news!

Jesse
 

wvsi3w

wvsi3w
Member
Hi,

So if the wall-clock limit only applies to the short-term archival (not the main simulation), what should we do to run a simulation (for instance, CLM5 using 1850 input data)?

I asked chatGPT and below is the answer:
"To do this, you need to modify the job submission script that you're using to run CESM. Specifically, you need to remove or comment out the line that specifies the wall-clock time limit. This line will likely be in the form of a PBS, SLURM, or other batch system directive."


What do you think about this answer? If you think it is not helpful, what should we do to run the simulation without a wall-clock limit?
 

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hi wvsi3w,

The main simulation and the short term archiver are two different submitted jobs, and thus have two different wallclock times. To modify the wallclock time of the main simulation you simply need to run the following command in your case directory

Code:
./xmlchange JOB_WALLCLOCK_TIME=<time> --subgroup case.run

Where <time> is whatever wallclock time you want (usually in hh:mm:ss format).
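
One rough way to choose that wallclock value is to estimate it from the model throughput. The sketch below is an illustration under stated assumptions, not a CESM tool: `estimate_wallclock` is a hypothetical helper, and the inputs (9 months at 1.6 simulated years per day) are the figures quoted earlier in this thread. It only prints the xmlchange command; it does not run it.

```shell
# Estimate the wallclock time (hh:mm:ss) needed to simulate a number of months,
# given the throughput in simulated years per wallclock day.
estimate_wallclock() {
    awk -v months="$1" -v sypd="$2" 'BEGIN {
        # Simulated years / throughput = wallclock days; 86400 seconds per day.
        secs = int(months / 12 / sypd * 86400 + 0.5)
        printf "%02d:%02d:%02d\n", int(secs / 3600), int((secs % 3600) / 60), secs % 60
    }'
}

# 9 months at 1.6 simulated years/day, as reported in the first post.
wallclock=$(estimate_wallclock 9 1.6)
echo "estimated run time: ${wallclock}"

# Print (do not execute) the command that would set it for the main simulation.
echo "./xmlchange JOB_WALLCLOCK_TIME=${wallclock} --subgroup case.run"
```

The estimate carries no safety margin, so in practice the requested time would probably be rounded up, while still staying within the limit of the queue that the job is submitted to.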

Also, in general there likely aren't enough CESM examples online for chatGPT to state the correct procedure accurately, so I would recommend avoiding chatGPT for specific model configuration questions and instead posting your question on the forum here.

Hope that helps, and have a great day!

Jesse
 

wvsi3w

wvsi3w
Member
Thanks, Jesse, for the answer.
I agree that chatGPT is deficient on this matter.

However, I need to know how I should run CLM5 (the land component of CESM) with 1850 data for a test. I ran create_newcase, case.build, case.setup, and case.submit before, and it worked with my configuration.
Prior to this, I ran it for 10 minutes just to test it, and it completed. Now I need to know how to run it without specifying a wallclock time.
When we set a wallclock time, we are saying that it needs to run for that specific time, but I need it to run for as long as it takes to simulate the input data (1850).
 