wl@eimb_ru
Member
I have managed to get CCSM3 (not 3.0.1 beta14) running on Debian Linux with kernel 2.6.15, on an SMP machine with 4 AMD Opteron processors.
The compiler is PGI 6.1-2, 64-bit target on x86-64 Linux.
I had to use MPICH-2, because only dead components worked with MPICH-1.
With active components and MPICH-1 I got the error described in the MPICH documentation: "p4_error: Found a dead connection while looking for messages". The documentation also states that this is a system bug and that a workaround will appear in future versions of MPICH.
The problem.
All tests prescribed in section 7 of the CCSM manual pass; however, the logs contain a lot of messages like these:
(shr_timer_start) ERROR: invalid timer number: *****
(shr_timer_stop) ERROR: invalid timer number: *****
(shr_timer_start) ERROR: invalid timer number: *****
(shr_timer_stop) ERROR: invalid timer number: *****
(shr_timer_start) ERROR: invalid timer number: 32767
(shr_timer_stop) ERROR: invalid timer number: 32767
These messages appear only in the coupler's logs immediately after the messages
"(tStamp_write) cpl model date xxx xxx xxx xxx".
or
"comm diag xxx sorr 21 xxxxxxxxxx send ice Faxc_snow"
The "invalid timer number" messages always appear in groups of 6.
I tried adding "write" statements to the coupler code, inside the main integration loop, to track all timer numbers assigned by the coupler.
35 timer number variables are OK; they have values from 0 to 35.
Timer variable t29 was 32767, but it is never used in the code.
I also tried changing the format of the statement that prints the error message, and got the following:
(shr_timer_start) ERROR: invalid timer number: 411286210
(shr_timer_stop) ERROR: invalid timer number: 411286210
(shr_timer_start) ERROR: invalid timer number: 1066474452
(shr_timer_stop) ERROR: invalid timer number: 1066474452
(shr_timer_start) ERROR: invalid timer number: 32767
(shr_timer_stop) ERROR: invalid timer number: 32767
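For what it's worth, the rows of asterisks in the first log excerpt are what Fortran prints when an integer does not fit the field width of an `Iw` edit descriptor, which would explain why widening the format reveals the real values. A minimal Python sketch of that rule follows; the function name `fortran_iw` and the original width of 5 are assumptions for illustration (the actual format statement in the timer code is not quoted in this post).

```python
def fortran_iw(value: int, width: int) -> str:
    """Emulate Fortran's Iw output edit descriptor (illustrative helper)."""
    text = str(value)  # a minus sign counts toward the field width
    if len(text) > width:
        # Fortran fills the whole field with asterisks on overflow
        return "*" * width
    return text.rjust(width)  # otherwise right-justify in the field


# Timer numbers taken from the log excerpts in this post
for timer_id in (32767, 411286210, 1066474452):
    print(f"I5: {fortran_iw(timer_id, 5)}   I12: {fortran_iw(timer_id, 12)}")
```

With a width of 5, only 32767 fits; the two larger values collapse to `*****`, matching the pattern of the original messages.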
It looks as if some timer numbers were assigned -1 and then converted to an unsigned number.
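A quick way to check that guess is to see what -1 becomes when its bits are reinterpreted as unsigned. The sketch below is plain Python arithmetic, not CCSM code, and the helper name `as_unsigned` is made up for illustration. None of the results match the logged values, and 32767 is the largest signed 16-bit integer rather than a wrapped -1, so uninitialized timer variables may be worth ruling out as well.

```python
def as_unsigned(value: int, bits: int) -> int:
    """Reinterpret a two's-complement integer as unsigned (illustrative helper)."""
    return value & ((1 << bits) - 1)


# Timer numbers taken from the widened-format log excerpt in this post
observed = [411286210, 1066474452, 32767]

# What -1 would look like at common integer widths
candidates = {bits: as_unsigned(-1, bits) for bits in (16, 32, 64)}
for bits, wrapped in candidates.items():
    print(f"-1 as unsigned {bits}-bit: {wrapped}")

print("any logged value matches:", any(v in candidates.values() for v in observed))
print("32767 is the signed 16-bit maximum:", 32767 == 2**15 - 1)
```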
Increasing the maximum number of timers from 200 to 400 didn't help.
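That is what one would expect if the check behind the message is a simple range test: the logged IDs fail it whether the limit is 200 or 400. The sketch below is a hypothetical Python stand-in, not the actual timer source; the function `is_valid_timer` and the rule that valid IDs run from 0 to the maximum are assumptions for illustration.

```python
def is_valid_timer(timer_id: int, max_timers: int) -> bool:
    """Hypothetical range test: valid IDs are assumed to be 0..max_timers."""
    return 0 <= timer_id <= max_timers


# Timer numbers taken from the widened-format log excerpt in this post
logged_ids = [411286210, 1066474452, 32767]

for max_timers in (200, 400):
    rejected = [t for t in logged_ids if not is_valid_timer(t, max_timers)]
    print(f"max_timers={max_timers}: rejected {rejected}")
```

Under these assumptions all three logged IDs are rejected at both limits, so the limit itself is unlikely to be the cause.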
These messages appeared with MPICH-2, with dead components as well. The model with dead components, run with MPICH-1 (latest version), didn't produce such messages.
Now I run 2 cpl, 2 cam, 2 clm, 2 csim, and 2 pop.
This error didn't prevent the tests from passing; however, it *is* an error, and I have to get rid of it.
Any ideas? What could it be?