
Time settings for SPINUP run

Evgeny_Chur

Evgeniy Churyulin
New Member
Dear All,

I have a problem running the CTSM model on the LEVANTE cluster.

Let me explain the problem:
I set all the necessary namelists and model parameters and ran the model. Several short tests (2 to 3 months) completed successfully and produced output, so in those cases the model works fine. Now I want to run a SPINUP, and I have prepared a run script for it (spinup_test2.bash).

I have done three test runs with different numbers of nodes (run 1: 8 nodes, run 2: 20 nodes, run 3: 36 nodes). Each time I hit the same problem, but with a different number of output files (run 1: only 2 months, run 2: 5 months, run 3: 8 months). My log files contain this message:

srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
0: slurmstepd: error: *** STEP 5855979.0 ON l20500 CANCELLED AT 2023-07-04T18:13:03 DUE TO TIME LIMIT ***

It looks like the cluster killed my job because of a time limit, but all of my time-related parameters are set to 12 hours. I have attached the log files (cesm, atm and land logs). Only cesm.log reports the error; the other files contain no error messages.

What I know:
1) The job is killed after almost 2 hours of run time. If I increase the number of nodes, I get more output before that happens, so the problem seems to be related to my settings. However:
a) My config_batch.xml file sets a time limit of 12 hours.
b) I set the CTSM wallclock time to 12 hours (and to 1 hour 20 minutes for the archive job):
./xmlchange JOB_WALLCLOCK_TIME=12:00:00
./xmlchange JOB_WALLCLOCK_TIME=01:20:00 --subgroup case.st_archive

My model parameters are in set.txt and my run parameters are in run.txt.
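
One way to see whether the 12-hour setting actually reached the scheduler is to compare what the case stores with what Slurm enforced for the killed job. The following is only a sketch: it assumes it is run from the case directory, that `sacct` accounting is available on LEVANTE, and the function name `check_walltime` is made up for illustration. The job ID is the one from the log message above.

```shell
#!/usr/bin/env bash
# Sketch: show the wallclock limit stored in the case and the limit Slurm
# applied to a finished job. check_walltime is a hypothetical helper name.

check_walltime() {
    local jobid="${1:-}"

    # The xmlquery tool only exists inside a case directory.
    if [ ! -x ./xmlquery ]; then
        echo "not in a case directory (no ./xmlquery here)" >&2
        return 1
    fi

    # Wallclock time and queue stored in the case for the run job.
    ./xmlquery JOB_WALLCLOCK_TIME,JOB_QUEUE --subgroup case.run

    # Time limit and elapsed time Slurm recorded for the job, if sacct exists.
    if [ -n "$jobid" ] && command -v sacct >/dev/null 2>&1; then
        sacct -j "$jobid" --format=JobID,Partition,Timelimit,Elapsed,State
    fi
}

# Usage, from the case directory:
#   check_walltime 5855979
```

If the Timelimit column shows about 2 hours while the case reports 12:00:00, the limit is probably being imposed by the partition or by the submit options rather than by the CTSM settings.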

My problem:
I don't understand why I am hitting this 2-hour time limit or how to solve it. Do I have to change something else in the CTSM settings, or should I check my LEVANTE SBATCH parameters? Can you help me solve this problem?


Best regards,
Evgenii
 

Attachments

  • config_batch.txt
    25.6 KB · Views: 0
  • config_machines.txt
    4.1 KB · Views: 0
  • cesm.log.5855979.txt
    34.5 KB · Views: 1
  • cpl.log.5855979.txt
    141 KB · Views: 0
  • lnd.log.5855979.txt
    486.5 KB · Views: 1
  • atm.log.5855979.txt
    27.2 KB · Views: 0
  • run.txt
    1.2 KB · Views: 2
  • set.txt
    15.7 KB · Views: 1

slevis

Moderator
Ask the LEVANTE system administrators if you are submitting to a queue with a 2-hour limit. If so, then you will need to run shorter simulations and restart more frequently.
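
If the queue limit does turn out to be 2 hours, the run can be split into segments that each finish inside the limit. The sketch below is an illustration, not a tested recipe for LEVANTE: `plan_segments` is a hypothetical helper, the 240-month total is an assumed spinup length, and the 8 months per job is the throughput reported above for 36 nodes. It prints `xmlchange` commands for the standard STOP_OPTION, STOP_N and RESUBMIT case variables instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: split a long run into segments that fit under a queue time limit.
# The numbers passed in are assumptions; replace them with measured values.

# plan_segments TOTAL_MONTHS MONTHS_PER_WINDOW
#   TOTAL_MONTHS       length of the whole simulation in months (assumed)
#   MONTHS_PER_WINDOW  months completed before the limit hit (e.g. 8)
plan_segments() {
    local total="$1" per_window="$2"

    # Keep one month of headroom so restart files are written before the limit.
    local stop_n=$(( per_window - 1 ))
    if [ "$stop_n" -lt 1 ]; then
        echo "throughput too low to fit one month per job" >&2
        return 1
    fi

    # Number of segments, rounded up. RESUBMIT counts the jobs after the first.
    local segments=$(( (total + stop_n - 1) / stop_n ))

    echo "./xmlchange STOP_OPTION=nmonths"
    echo "./xmlchange STOP_N=${stop_n}"
    echo "./xmlchange RESUBMIT=$(( segments - 1 ))"
}

# Example: 240 months in total, 8 months per 2-hour job on 36 nodes.
plan_segments 240 8
```

The one-month headroom is a design choice: a job that is cancelled at the time limit may not have written restart files, so each segment should end cleanly before the limit.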