problem with the number of processes set

wjp2000_cn@yahoo_com_cn · Jul 16, 2006

Recently I have been trying to compile and run CCSM3.0 on a Linux cluster connected by Gigabit Ethernet and on an SGI4700 machine. The cluster processors are Xeon64 at 3.4 GHz, and the compilers are Intel icc and ifort.
The model now runs successfully for the X and A compsets. For compset B at T31_gx3v5 resolution, things run OK: a one-month initial run with ntasks set to (cpl 4, cam 10, clm 1, csim 5, pop 12) completed successfully. For compset C at T31_gx3v5 resolution, a one-month initial run with ntasks set to (cpl 13, datm 1, dlnd 1, dice 1, pop 16) also completed successfully.
For compset B at T42_gx1v3 resolution, a one-month initial run with ntasks set to (cpl 3, cam 8, clm 1, csim 4, pop 16) proceeds normally until the end of the main integration step, where an error occurs in the output_driver subroutine of the pop module. Compset C has a similar problem. On the SGI4700, a one-month initial run of compset B reports the error "Tm < Tmin" in csim on the 27th day and then exits.
Furthermore, for T42_gx1v3 on the cluster, cam can only run on more than 4 processes and the pop module only on more than 12 processes. On the SGI4700, cam can only run on more than 6 processes and pop only on more than 16 processes. cpl and clm can run on any number of processes.
Can anyone help me?
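
For reference, the three task layouts quoted above can be added up to see how many MPI processes each run needs in total. This is only a small Python sketch: it assumes each component runs with its own set of tasks, so the total is the plain sum, and the dictionary labels and function names are made up for illustration and are not part of the CCSM scripts.

```python
# Task layouts quoted in the post: component name -> number of MPI tasks.
# The labels ("B T31_gx3v5", ...) are illustrative keys, not CCSM settings.
LAYOUTS = {
    "B T31_gx3v5": {"cpl": 4, "cam": 10, "clm": 1, "csim": 5, "pop": 12},
    "C T31_gx3v5": {"cpl": 13, "datm": 1, "dlnd": 1, "dice": 1, "pop": 16},
    "B T42_gx1v3": {"cpl": 3, "cam": 8, "clm": 1, "csim": 4, "pop": 16},
}


def total_tasks(layout):
    """Sum the per-component task counts (assumes one task set per component)."""
    return sum(layout.values())


def describe(layout):
    """Format a layout as 'cpl=4 cam=10 ...' for easy comparison."""
    return " ".join(f"{name}={ntasks}" for name, ntasks in layout.items())


if __name__ == "__main__":
    for label, layout in LAYOUTS.items():
        print(f"{label}: {describe(layout)} -> {total_tasks(layout)} tasks")
```

Under that assumption, all three layouts add up to the same total, so the runs differ in how the tasks are distributed among components rather than in how many processes are requested.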

njn01 · Jul 17, 2006

wjp2000_cn said:
The model now runs successfully for the X and A compsets. For compset B at T31_gx3v5 resolution, things run OK: a one-month initial run with ntasks set to (cpl 4, cam 10, clm 1, csim 5, pop 12) completed successfully. For compset C at T31_gx3v5 resolution, a one-month initial run with ntasks set to (cpl 13, datm 1, dlnd 1, dice 1, pop 16) also completed successfully.
For compset B at T42_gx1v3 resolution, a one-month initial run with ntasks set to (cpl 3, cam 8, clm 1, csim 4, pop 16) proceeds normally until the end of the main integration step, where an error occurs in the output_driver subroutine of the pop module. Compset C has a similar problem. On the SGI4700, a one-month initial run of compset B reports the error "Tm < Tmin" in csim on the 27th day and then exits.

You have successful initial 1-month T31_gx3v5 B and C runs, but there is some sort of problem at the end of a T42_gx1v3 run, correct?

Are you running the CCSM model "out of the box," with no changes to any code, timesteps, initial conditions, etc, except for the number and distribution of processors?

What are the error messages at the end of the log files for each of the component models? Are there any system error messages?

wjp2000_cn said:
Furthermore, for T42_gx1v3 on the cluster, cam can only run on more than 4 processes and the pop module only on more than 12 processes. On the SGI4700, cam can only run on more than 6 processes and pop only on more than 16 processes. cpl and clm can run on any number of processes.
Can anyone help me?

I don't understand what you mean when you say that cam and pop can only run on more than a certain number of processors. Why not?

In general, our group is unable to provide customized support for the CCSM model on unsupported platforms, due to very limited resources. Perhaps with more information from you, the community will be able to provide suggestions to help you out.

wjp2000_cn@yahoo_com_cn · Jul 20, 2006

If I do not modify the code and scripts, CCSM cannot run on our systems. To get CCSM to run, I modified some of the code, including replacing "MPI_REAL8" with "MPI_DOUBLE_PRECISION", changing the initial values of variables in shr_msg_mod.F90, assigning the logname value directly in some places, adding open statements before some read statements, etc.
On the Xeon cluster, when cam is given 2 processes, or pop is given 1, 4, 5, 8, or 10 processes, the error message is like this: "Bad file descriptor, write fd=8 error". On the SGI4700, when cam is given 2 or 4 processes, the error message is similar. When pop is given 1, 4, 5, 8, or 10 processes, the error message is like this: "dead connection".
Yesterday I debugged the code step by step with write statements and found where the error occurs at the end of a one-month initial run in the pop module. The location is tavg.F, lines 1752, 1753, and 1755; these statements write netCDF output, and the data are 3D arrays. With these statements commented out, the one-month T42_gx1v3 initial run with the configuration (cpl 4, cam 10, clm 1, csim 5, pop 12) completed successfully on the Xeon cluster. The netCDF I installed is netcdf3.6.1. I have tried two versions: one I compiled myself from source with ifort, the other the downloaded binary for linux2.6-x86. If lines 1752, 1753, and 1755 are not commented out, the error message is the same, something like "Bad file descriptor".
From this experience, I suspect that something may be wrong with my netCDF installation. Could anyone help me analyze this and give some suggestions?
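
To keep track of which layouts are worth retrying, the failing process counts listed above can be put in a small lookup table. This is only a bookkeeping sketch in Python, not a diagnosis: the machine labels and function name are hypothetical, the second pop list is read here as applying to the SGI4700, and a count missing from the table only means that no failure was reported for it.

```python
# Process counts reported as failing in this post, keyed by (machine, component).
# The machine labels are hypothetical names chosen for this sketch.
REPORTED_FAILING = {
    ("xeon_cluster", "cam"): {2},
    ("xeon_cluster", "pop"): {1, 4, 5, 8, 10},
    ("sgi4700", "cam"): {2, 4},
    ("sgi4700", "pop"): {1, 4, 5, 8, 10},  # assumed to refer to the SGI4700
}


def flagged_components(machine, layout):
    """Return the components whose task count was reported as failing."""
    return sorted(
        name
        for name, ntasks in layout.items()
        if ntasks in REPORTED_FAILING.get((machine, name), set())
    )


if __name__ == "__main__":
    # The T42_gx1v3 layout that completed once the tavg.F writes were commented out.
    completed = {"cpl": 4, "cam": 10, "clm": 1, "csim": 5, "pop": 12}
    print(flagged_components("xeon_cluster", completed))

    # A made-up trial layout that reuses two counts reported as failing.
    trial = {"cpl": 4, "cam": 2, "clm": 1, "csim": 5, "pop": 8}
    print(flagged_components("sgi4700", trial))
```

An empty result only says that none of the counts in a layout matches a reported failure; it does not show that the layout will run.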
