liushan@mail_iap_ac_cn
New Member
Hi all,
I encountered a problem when running a CCSM3 case at T42_gx1v3 resolution with component set B.
I have already completed a 100-model-year control run at T31_gx3v5 on the same machine without any problem.
But when I changed the resolution to T42_gx1v3, the run failed after 12 years and 5 months (model time). I then restarted the run, but the restart run also failed after another 12 years and 5 months.
The job.o file shows:
rank 4 in job 1 compute-0-17_53982 caused collective abort of all ranks
exit status of rank 4: killed by signal 9
rank 0 in job 1 compute-0-17_53982 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
Thu Sep 9 18:22:39 CST 2010 -- CSM EXECUTION HAS FINISHED
There are no error messages in the various log files except ice.log, which contains ‘(ice) terminating before coupler’.
The input data I used are those distributed with CCSM3.
1. Could anyone give me a clue as to why it always fails after 12 years and 5 months?
2. Another question: I restarted the former run from 0012-01-01. After the restart run failed, I compared the data file ‘cam2.h0.0012-03-01.nc’ with the file for the same period from the former run, and they are not identical. For example, the V wind ranges from -19.2612 to 19.0705 m/s in the former run, but from -23.415 to 26.3333 m/s in the restart run. A restart run is supposed to continue the original run bit for bit, as if it had never stopped. So why are the two not identical?
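For reference, the kind of comparison I mean can be sketched roughly as below. This is only a minimal illustration: it assumes the V values from each file have already been read into plain lists of floats (reading the netCDF files is not shown), the function name `compare_fields` is made up for this post, and the toy lists contain only the extreme values quoted above plus one placeholder value, not the real fields.

```python
def compare_fields(original, restart):
    """Summarize how two equally sized lists of values differ."""
    if not original or len(original) != len(restart):
        raise ValueError("fields must be non-empty and the same length")
    # Element-wise absolute differences between the two runs.
    diffs = [abs(a - b) for a, b in zip(original, restart)]
    return {
        "original_range": (min(original), max(original)),
        "restart_range": (min(restart), max(restart)),
        # Number of positions where the two runs disagree at all.
        "n_different": sum(1 for a, b in zip(original, restart) if a != b),
        "max_abs_diff": max(diffs),
        # True only if every value matches exactly.
        "identical": all(a == b for a, b in zip(original, restart)),
    }


# Toy input: only the extremes quoted in this post, with 0.0 as a placeholder.
original_v = [-19.2612, 0.0, 19.0705]
restart_v = [-23.415, 0.0, 26.3333]

summary = compare_fields(original_v, restart_v)
print(summary["original_range"], summary["restart_range"])
print("identical:", summary["identical"])
```

Counting the differing points and the largest difference, rather than looking only at the minimum and maximum, would show whether the two runs diverge everywhere or only at a few grid points.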
Thanks in advance!