Dear All,
I have a problem running the CTSM model on the LEVANTE cluster.
I will try to explain the problem:
I set all the necessary namelists and model parameters and ran the model. I have done several successful short tests (2 - 3 months) and got output. In those cases, the model works fine. Now I want to run a spin-up, and I have prepared a run script for it (spinup_test2.bash).
I have done three test runs with different numbers of nodes (run 1: 8 nodes, run 2: 20 nodes, run 3: 36 nodes). Every run failed in the same way, but after producing a different number of output files (run 1: only 2 months, run 2: 5 months, run 3: 8 months). My log files contain this message:
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
0: slurmstepd: error: *** STEP 5855979.0 ON l20500 CANCELLED AT 2023-07-04T18:13:03 DUE TO TIME LIMIT ***
It looks as if my job was killed by the cluster because of a time limit, but all of my time-related parameters are set to 12 hours. I have attached the log files (cesm, atm and land logs). Only cesm.log contains information about the error; the other files show no errors.
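To put the three runs in perspective, here is a rough estimate of how many simulated months each configuration would reach if the 12-hour limit were actually applied. This is only a sketch: it assumes each job ran for about 2 hours before being cancelled (I have not checked the exact elapsed time) and that throughput stays constant. The function name `months_in_limit` is my own.

```shell
#!/usr/bin/env bash
# Rough throughput estimate from the three test runs.
# Assumption: each run was cancelled after about 2 hours of wallclock time.

# months_in_limit MONTHS_DONE ELAPSED_HOURS LIMIT_HOURS
# Scales the months completed so far to a longer wallclock limit
# (integer arithmetic, constant throughput assumed).
months_in_limit() {
    local months_done=$1 elapsed_h=$2 limit_h=$3
    echo $(( months_done * limit_h / elapsed_h ))
}

elapsed_h=2    # assumed elapsed time before cancellation
limit_h=12     # the limit I requested

# "nodes:months" pairs taken from the three runs described above
for run in 8:2 20:5 36:8; do
    nodes=${run%%:*}
    months=${run##*:}
    echo "$nodes nodes: about $(months_in_limit "$months" "$elapsed_h" "$limit_h") months in ${limit_h} h"
done
```

Even with the full 12 hours, a long spin-up would need several resubmissions, so the effective limit matters a lot here.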
What I know:
1) The problem appears after almost 2 hours of run time. If I increase the number of nodes, I get more output before the job is killed, so the problem may be related to my settings. However:
a) My config_batch.xml file has a time limit of 12 hours.
b) I set the wallclock time for CTSM to 12 hours:
./xmlchange JOB_WALLCLOCK_TIME=12:00:00
./xmlchange JOB_WALLCLOCK_TIME=01:20:00 --subgroup case.st_archive
My model parameters are listed in set.txt and the run parameters in run.txt.
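A minimal sketch of how the requested limit could be compared with what the case has stored and what Slurm actually enforced. It assumes the script is run from the case directory and that `sacct` is available on LEVANTE; the job ID is the one from the log message above, and `hms_to_seconds` is a helper name of my own. The checks are skipped if the tools are not found.

```shell
#!/usr/bin/env bash
# Sketch: compare the wallclock limit I requested with the values stored
# in the case and the limit Slurm recorded for the cancelled job.

# Convert HH:MM:SS to seconds so that limits can be compared numerically.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

requested="12:00:00"
echo "requested limit: $(hms_to_seconds "$requested") s"

# What the case itself has stored for each job subgroup
if [ -x ./xmlquery ]; then
    ./xmlquery JOB_WALLCLOCK_TIME --subgroup case.run
    ./xmlquery JOB_WALLCLOCK_TIME --subgroup case.st_archive
fi

# What Slurm recorded for the cancelled job (job ID from cesm.log)
if command -v sacct >/dev/null 2>&1; then
    sacct -j 5855979 --format=JobID,Timelimit,Elapsed,State
fi
```

If the `Timelimit` column shows about 2 hours instead of 12, the limit is presumably being set somewhere outside these `xmlchange` calls.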
My problem:
I don't understand why I am hitting this 2-hour time limit or how to solve it. Do I have to change something else in the CTSM settings, or should I check my LEVANTE SBATCH parameters? Can you help me solve this problem?
Best regards,
Evgenii