Any ideas about error "Invalid timer number"?

I have managed to get CCSM3 (not 3.0.1 beta14) running on Debian Linux with kernel 2.6.15, on an SMP machine with 4 AMD Opteron processors.
The compiler is PGI 6.1-2, 64 bit target on x86-64 Linux.
I had to use MPICH-2, because only the dead components worked with MPICH-1.
With active components and MPICH-1 I got an error that is described in the MPICH documentation: "p4_error: Found a dead connection while looking for messages". The documentation also states that this is a system bug and that a workaround will be included in future versions of MPICH.

The problem.
All the tests prescribed in section 7 of the CCSM manual pass; however, the logs contain a lot of messages like these:

(shr_timer_start) ERROR: invalid timer number: *****
(shr_timer_stop) ERROR: invalid timer number: *****
(shr_timer_start) ERROR: invalid timer number: *****
(shr_timer_stop) ERROR: invalid timer number: *****
(shr_timer_start) ERROR: invalid timer number: 32767
(shr_timer_stop) ERROR: invalid timer number: 32767

These messages appear only in the coupler's logs immediately after the messages
"(tStamp_write) cpl model date xxx xxx xxx xxx".
or
"comm diag xxx sorr 21 xxxxxxxxxx send ice Faxc_snow"

The "invalid timer number" messages always appear in groups of 6.

I tried adding "write" statements to the coupler code, inside the main integration loop, to track all the timer numbers assigned by the coupler.

35 timer number variables are OK; they have values from 0 to 35.
Timer variable t29 was 32767, but it is never used in the code.

I also tried modifying the format of the statement that prints the error message, and got the following:

(shr_timer_start) ERROR: invalid timer number: 411286210
(shr_timer_stop) ERROR: invalid timer number: 411286210
(shr_timer_start) ERROR: invalid timer number: 1066474452
(shr_timer_stop) ERROR: invalid timer number: 1066474452
(shr_timer_start) ERROR: invalid timer number: 32767
(shr_timer_stop) ERROR: invalid timer number: 32767

It looks as if some timer numbers were assigned -1 and then converted to unsigned numbers.
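
A side note on the asterisks in the first set of messages: Fortran formatted output fills an integer field with asterisks when the value needs more characters than the field width allows, so the "*****" entries and the large numbers above are very likely the same values. A minimal sketch, assuming the original statement used a 5-character integer field (i5 is a guess based on the five asterisks, not something taken from shr_timer_mod.F90):

```fortran
program field_width_demo
   implicit none
   integer :: n

   n = 411286210                         ! one of the values reported above
   ! Hypothetical narrow format: 9 digits cannot fit into 5 characters,
   ! so the whole field is filled with asterisks.
   write(*,'(a,i5)')  'ERROR: invalid timer number: ', n
   ! A wider field shows the actual value.
   write(*,'(a,i12)') 'ERROR: invalid timer number: ', n

   n = 32767                             ! 5 digits, so this one fits in i5
   write(*,'(a,i5)')  'ERROR: invalid timer number: ', n
end program field_width_demo
```

This would explain why only the 32767 entries were readable before the format was changed.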

Increasing the maximum number of timers from 200 to 400 didn't help.

These messages also appeared with MPICH-2 and dead components. The model with dead components, run with MPICH-1 (latest version), didn't produce such messages.

Now I run 2 cpl, 2 cam, 2 clm, 2 csim, and 2 pop.

This error didn't prevent the tests from passing; however, it *is* an error, and I have to get rid of it.

Any ideas? What could it be?
 
I have slightly modified the error messages in the file shr_timer_mod.F90 so that they include the name of the component the file is linked into.
That is, for example, "'(pop) ERROR: invalid timer number', n"
instead of
"'ERROR: invalid timer number', n"
Modified copies of the file were put into SourceMods. This allowed each component to produce unique error messages.
The error messages are generated by the coupler only.
They always appear in groups of 6 (as listed above).
The number of groups changes when the component numbers are changed; the dependency is still unclear to me.

The question:

There are many more timer variables declared in the coupler's source than are actually used. Why?
 

kauff

wl2776 said:
35 timer number variables are OK; they have values from 0 to 35.
Timer variable t29 was 32767, but it is never used in the code.

I also tried modifying the format of the statement that prints the error message, and got the following:

(shr_timer_start) ERROR: invalid timer number: 411286210
(shr_timer_stop) ERROR: invalid timer number: 411286210
(shr_timer_start) ERROR: invalid timer number: 1066474452
(shr_timer_stop) ERROR: invalid timer number: 1066474452
(shr_timer_start) ERROR: invalid timer number: 32767
(shr_timer_stop) ERROR: invalid timer number: 32767

It looks as if some timer numbers were assigned -1 and then converted to unsigned numbers.

Increasing the maximum number of timers from 200 to 400 didn't help.

Any ideas? What could it be?

Apparently this bit of code is missing:
call shr_timer_get(t29,"This is timer t29")
so that integer::t29 does not have a valid timer number (e.g. 1-200),
or the value of t29 has somehow been corrupted.

I've never seen this error in the released code, so I'll guess it's the result of some code modification that you've done.
 
Thank you for the answer. Unfortunately, it looks like we don't fully understand each other.

kauff said:
Apparently this bit of code is missing:
call shr_timer_get(t29,"This is timer t29")
so that integer::t29 does not have a valid timer number (e.g. 1-200), or the value of t29 has somehow been corrupted.

Yes, that's right: timer t29 is not initialized.
As I mentioned, it is never used in the code.
The same is true of timers ti2, ti3, ti4, ti5, ti6, ti7, ti8, ti9 and tm6, tm7, tm8, tm9.

Please look at the file
${ccsm3_0_source_dir}/models/cpl/cpl6/main.f90

Lines 98 through 102 contain the declarations of the timer variables. They include 12 EXTRANEOUS variables, which are NEVER USED in the code. I have listed those timer variables above.

kauff said:
I've never seen this error in the released code, so I'll guess it's the result of some code modification that you've done.

No.
I didn't modify the coupler code.

All my code modifications consisted of write statements in order to trace the program execution.
I also modified the code of CAM and CLM by placing calls to shr_msg_stdio after the MPI initialization.
 
I have put write and call shr_sys_flush statements immediately after every call to shr_timer_start.

The errors appear after line 774, which contains
call flux_atmOcn(con_Xo2c%bundle,bun_Sa2c_o,cpl_control_dead_ao,bun_aoflux_o )

Perhaps there are memory leaks in this subroutine.
Or the compiler is generating weird code.

The messages show up at call shr_timer_start(t14). The other timer start and stop calls don't produce them.
With timer t14 commented out, the messages show up after call shr_timer_start(t15) instead.
t14=16 throughout.
 
The solution was to add 'save' at line 101 of the file models/cpl/cpl6/flux_mod.F90.
This line contains the timer variable declarations in the flux_atmOcn subroutine.
The errors were produced by the timers inside this subroutine.
For some reason the timer variable values were not preserved between calls to the subroutine.

The 'save' keyword is present several lines below, but it does not seem to take effect.
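
For anyone hitting the same thing, the pattern involved can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the cpl6 source: demo_timer_get, demo_timer_start and flux_demo are hypothetical stand-ins for shr_timer_get, shr_timer_start and flux_atmOcn. The timer number is obtained once, on the first call, so it has to survive until the later calls, and that is only guaranteed when the variable has the save attribute.

```fortran
module demo_timer_mod
   implicit none
   private
   public :: flux_demo

   integer, save :: n_timers = 0     ! number of timers handed out so far

contains

   ! Hand out the next free timer number (stand-in for shr_timer_get).
   subroutine demo_timer_get(n, name)
      integer,          intent(out) :: n
      character(len=*), intent(in)  :: name
      n_timers = n_timers + 1
      n = n_timers
      write(*,'(a,i4,2a)') 'timer ', n, ' assigned to ', name
   end subroutine demo_timer_get

   ! Reject numbers that were never handed out (stand-in for shr_timer_start).
   subroutine demo_timer_start(n)
      integer, intent(in) :: n
      if (n < 1 .or. n > n_timers) then
         write(*,'(a,i12)') '(demo_timer_start) ERROR: invalid timer number: ', n
      end if
   end subroutine demo_timer_start

   ! Stand-in for a routine that sets up its timer on the first call only.
   subroutine flux_demo()
      logical, save :: first_call = .true.
      integer, save :: t0            ! 'save' keeps the timer number between calls
      if (first_call) then
         call demo_timer_get(t0, 'flux_demo')
         first_call = .false.
      end if
      call demo_timer_start(t0)      ! needs the value stored on the first call
   end subroutine flux_demo

end module demo_timer_mod

program save_demo
   use demo_timer_mod, only: flux_demo
   implicit none
   integer :: i
   do i = 1, 3
      call flux_demo()
   end do
end program save_demo
```

Without the save attribute on t0 (and with no blanket save statement in effect), the value of t0 is undefined on the second and later calls as far as the Fortran standard is concerned, so arbitrary numbers like the ones in the logs are a permitted outcome.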
 

kauff

wl2776 said:
The solution was to add 'save' at line 101 of the file models/cpl/cpl6/flux_mod.F90.
This line contains the timer variable declarations in the flux_atmOcn subroutine.
The errors were produced by the timers inside this subroutine.
For some reason the timer variable values were not preserved between calls to the subroutine.

The 'save' keyword is present several lines below, but it does not seem to take effect.

Apparently the PGI compiler ("PGI 6.1-2, 64 bit target on x86-64 Linux") is not saving the variable values even though there is a save statement. This is a scary thought -- hopefully this is not happening in other subroutines.
 