
Job not completing at wallclock time

rsansom

Rachel Sansom
New Member
I'm currently porting CESM2 to our local HPC system at the University of Leeds and have been having issues with resubmissions. The initialisation job runs, but any resubmit just runs from the beginning again. I can't find any errors in the log files, nor any confirmation that the job finished successfully. It seems the job is perhaps just being cut off at the wallclock time limit without actually wrapping up the run. If I use the immediate resubmission flag, I have the same issue: each run begins from the start again instead of picking up where the last one left off. The batch scheduler we use is SGE, so I wonder if it's a porting issue with that. Any ideas on where the configuration might be wrong?

I have attached the log files, stdout and stderr, and my .xml configs.
 

Attachments

  • run.FX2000_f19_f19_mg16_resub_nosta.o3395095.txt
    2.1 KB · Views: 1
  • run.FX2000_f19_f19_mg16_resub_nosta.e3395095.txt
    142 bytes · Views: 2
  • rof.log.220114-114222.txt
    5.6 KB · Views: 0
  • ocn.log.220114-114222.txt
    34.4 KB · Views: 0
  • lnd.log.220114-114222.txt
    61.7 KB · Views: 1
  • ice.log.220114-114222.txt
    93.9 KB · Views: 0
  • cpl.log.220114-114222.txt
    67 KB · Views: 4
  • config_machines.xml.txt
    4.6 KB · Views: 0
  • config_compilers.xml.txt
    2.9 KB · Views: 1
  • config_batch.xml.txt
    3.3 KB · Views: 0

jedwards

CSEG and Liaisons
Staff member
According to the coupler log, your model is not completing, so a second run will start from the beginning. I can't be sure, but it looks as if you are hanging after day 3. The last timestamp it wrote was 2022-01-14 12:09:29. Can you tell me when the wallclock time expired?
 

rsansom

Rachel Sansom
New Member
Thanks for your reply. So it had JOB_WALLCLOCK_TIME of 30 mins, and I can see model execution starting at 2022-01-14 11:42:25 and my job end time is down as 12:12:22. I can see in the job info that it was killed due to wallclock time. I still wonder if I haven't configured the batch scheduler correctly, so somehow the timings of the run are not being communicated properly, but I'm not sure what exactly the culprit would be.
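
To make the timing explicit, here is a minimal sketch of the arithmetic using the three timestamps quoted in this thread. It assumes the job end time of 12:12:22 falls on the same date (2022-01-14) as the other two timestamps; the variable names are just for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps quoted in the thread (date of job_end is an assumption).
model_start = datetime.strptime("2022-01-14 11:42:25", FMT)
last_cpl_stamp = datetime.strptime("2022-01-14 12:09:29", FMT)
job_end = datetime.strptime("2022-01-14 12:12:22", FMT)

# JOB_WALLCLOCK_TIME of 30 minutes, as stated above.
wallclock_limit = timedelta(minutes=30)

elapsed_at_end = job_end - model_start           # model run time when the job ended
elapsed_at_last_stamp = last_cpl_stamp - model_start
silent_gap = job_end - last_cpl_stamp            # time with no new coupler timestamp

print("elapsed at job end:        ", elapsed_at_end)
print("elapsed at last cpl stamp: ", elapsed_at_last_stamp)
print("gap before the job ended:  ", silent_gap)
print("margin left on wallclock:  ", wallclock_limit - elapsed_at_end)
```

The model had been executing for just under the 30 minute limit when the job ended, which is consistent with the scheduler killing it at the wallclock limit rather than the model finishing on its own.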
 

jedwards

CSEG and Liaisons
Staff member
If you just want to test the resubmit capability, there are tests in scripts_regression_tests.py to do that. If you want to test with the particular case you are attempting, then you need to increase JOB_WALLCLOCK_TIME and possibly also reduce the run length so that it has sufficient time to run to completion.
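
As a rough sketch of that second suggestion, the helper below only builds and prints the `./xmlchange` command lines you would run from the case directory; it does not execute anything. The function names and the example values (a 2 hour wallclock, a 1 day run, one resubmission) are hypothetical and should be adjusted to your machine and case.

```python
def hms(seconds: int) -> str:
    """Format a number of seconds as HH:MM:SS, the form JOB_WALLCLOCK_TIME uses."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def xmlchange_commands(wallclock_seconds: int, stop_n: int,
                       stop_option: str = "ndays", resubmit: int = 1) -> list:
    """Return xmlchange command lines for a longer wallclock and a shorter run.

    The values passed in are illustrative assumptions, not recommendations.
    """
    return [
        f"./xmlchange JOB_WALLCLOCK_TIME={hms(wallclock_seconds)}",
        f"./xmlchange STOP_OPTION={stop_option},STOP_N={stop_n}",
        f"./xmlchange RESUBMIT={resubmit}",
    ]


# Example: 2 hours of wallclock for a 1 day run, resubmitted once.
commands = xmlchange_commands(wallclock_seconds=2 * 3600, stop_n=1)
for cmd in commands:
    print(cmd)
```

The idea is simply to leave enough margin between the run length and the wallclock limit that the model can write its restart files and exit cleanly before the scheduler kills the job.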
 