
Unexpected termination of the simulation, CESM2, BW1850

xyli@geo_uio_no

New Member
Dear all,

I submitted a job using the following steps:
$ ./create_newcase --case ~/cases/BW1850_restart2 --compset BW1850 --res f09_g17 --machine tetralith
$ ./case.setup
$ ./case.build

I also changed the following parameters:
STOP_N: 24
REST_OPTION: nhours
REST_N: 2

The simulation ran for a while and then terminated unexpectedly.
I checked all the log files but couldn't find an error message.

Any idea how this can happen without an obvious error message?
Please find my log files attached.

Cheers,

Xiangyu

 

jedwards

CSEG and Liaisons
Staff member
It's failing while trying to write the cam restart file - have you tried restart tests with simple cases on this machine? For example, try
./create_test ERS.f09_g17_rx01.A
and if that works, try
./create_test ERS.f09_f09_mg17.F1850
Do all of the compute nodes have write access to the output file?
 

xyli@geo_uio_no

New Member
Thanks.

I ran the test, but the job was rejected at submission: the normal queue on this machine appears to require at least 4 nodes.
$ #./create_newcase --case ~/cases/FW1850_SO2 --compset FW1850 --res f09_f09_mg17 --machine fram --project NN1004K

./create_test ERS.f09_g17.A --machine fram --project NN1004K
Testnames: ['ERS.f09_g17.A.fram_intel']
Creating test directory /cluster/work/users/xiangyuli/cesm/ERS.f09_g17.A.fram_intel.20190412_144817_hica32
RUNNING TESTS:
ERS.f09_g17.A.fram_intel
Starting CREATE_NEWCASE for test ERS.f09_g17.A.fram_intel with 1 procs
Finished CREATE_NEWCASE for test ERS.f09_g17.A.fram_intel in 1.499997 seconds (PASS)
Starting XML for test ERS.f09_g17.A.fram_intel with 1 procs
Finished XML for test ERS.f09_g17.A.fram_intel in 0.272974 seconds (PASS)
Starting SETUP for test ERS.f09_g17.A.fram_intel with 1 procs
Finished SETUP for test ERS.f09_g17.A.fram_intel in 1.875918 seconds (PASS)
Starting SHAREDLIB_BUILD for test ERS.f09_g17.A.fram_intel with 1 procs
Finished SHAREDLIB_BUILD for test ERS.f09_g17.A.fram_intel in 141.881331 seconds (PASS)
Starting MODEL_BUILD for test ERS.f09_g17.A.fram_intel with 4 procs
Finished MODEL_BUILD for test ERS.f09_g17.A.fram_intel in 23.417793 seconds (PASS)
Starting RUN for test ERS.f09_g17.A.fram_intel with 1 proc on interactive node and 32 procs on compute nodes
Finished RUN for test ERS.f09_g17.A.fram_intel in 2.760764 seconds (FAIL). [COMPLETED 1 of 1]
Case dir: /cluster/work/users/xiangyuli/cesm/ERS.f09_g17.A.fram_intel.20190412_144817_hica32
Errors were:
submit_jobs case.test
Submit job case.test
ERROR: Command: 'sbatch --time 00:59:00 -p normal --account NN1004K .case.test --skip-preview-namelist' failed with error 'sbatch: error: --nodes >= 4 required for normal and optimist jobs
sbatch: error: Batch job submission failed: Node count specification invalid' from dir '/cluster/work/users/xiangyuli/cesm/ERS.f09_g17.A.fram_intel.20190412_144817_hica32'

Due to presence of batch system, create_test will exit before tests are complete.
To force create_test to wait for full completion, use --wait
At test-scheduler close, state is:
FAIL ERS.f09_g17.A.fram_intel (phase RUN)
Case dir: /cluster/work/users/xiangyuli/cesm/ERS.f09_g17.A.fram_intel.20190412_144817_hica32
test-scheduler took 172.466439962 seconds

Yes, I have write permission on all nodes.
Fresh simulations do work well.
This problem only occurs when restarting a crashed simulation.
For crashed simulations, there is no "rest/" folder,
so I copied all the restart files from the run directory.
Could this be a problem?
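
One thing that could be checked when restart files are copied by hand is whether every rpointer.* file in the run directory still names a restart file that is actually there. The sketch below is only an illustration, not part of CESM: the function name check_rpointers is hypothetical, and it assumes that the first line of each rpointer.* file holds the name of the restart file for that component.

```shell
#!/bin/sh
# check_rpointers DIR (hypothetical helper, not a CESM tool):
# for every rpointer.* file in DIR, report whether the restart file
# named on its first line is present in DIR.
check_rpointers() {
    rundir=$1
    missing=0
    for pointer in "$rundir"/rpointer.*; do
        # Skip the unexpanded pattern when no rpointer files exist.
        [ -f "$pointer" ] || continue
        # ASSUMPTION: the first line holds the restart file name.
        read -r restart < "$pointer" || true
        restart=$(basename "$restart")
        if [ -f "$rundir/$restart" ]; then
            echo "ok: $restart"
        else
            echo "MISSING: $restart (named in $(basename "$pointer"))"
            missing=1
        fi
    done
    # Exit status 0 only if every named restart file was found.
    return "$missing"
}
```

It would be called as check_rpointers followed by the path of the run directory; a MISSING line would suggest that the copied set of restart files is incomplete or does not match the pointer files.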


 

jedwards

CSEG and Liaisons
Staff member
It looks like you haven't properly defined your batch queues. It will be a lot easier to solve problems using something small like that A case than with a full BW1850 simulation. You may want to discuss with your system administrators or support staff.
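
For reference, the sbatch rejection in the log above comes down to simple arithmetic. The sketch below is only an illustration with hypothetical variable names: the 32 tasks come from the test log, the minimum of 4 nodes comes from the sbatch message, and 32 tasks per node is an assumption about the machine that should be checked against its documentation.

```shell
#!/bin/sh
# Illustration only: compare the nodes a job would request with the queue minimum.
total_tasks=32      # from the log: "32 procs on compute nodes"
tasks_per_node=32   # ASSUMPTION: tasks (cores) per node on this machine
min_nodes=4         # from the sbatch error: "--nodes >= 4 required"

# Ceiling division: number of nodes needed to hold all tasks.
nodes=$(( (total_tasks + tasks_per_node - 1) / tasks_per_node ))

if [ "$nodes" -lt "$min_nodes" ]; then
    echo "job needs $nodes node(s), but the queue requires at least $min_nodes"
else
    echo "job needs $nodes node(s), which satisfies the queue minimum of $min_nodes"
fi
```

Under these assumptions the small A test fits on fewer nodes than the queue accepts, which is consistent with the submission being refused until the batch queue definitions for the machine account for that limit.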
 