
Infiniband for CCSM3

scs_wy@yahoo_cn

New Member
Hi everyone,
I am trying to run CCSM3 on my cluster (AMD Opteron, SUSE 10, PGI 7.0, InfiniBand, Torque, Intel MPI).
I can run T42_gx1v3 successfully, but T42_gx3v5 does not run smoothly. Strangely, there is no error output; the processes just stop.
I have never encountered this situation before. My guess is that the InfiniBand network may not fully support CCSM3. Is that right?
Can anyone give me some advice? Thanks!
 
I don't think InfiniBand is to blame here: in my own tests I noticed that the communication was, in the end, established through the ssh client.

What do you mean by 'not so smooth'? Maybe the data throughput is much higher than in the smaller gx1v3 case, and that is what makes the difference between the two tests: the data traffic reaches a level where the processors cannot keep up with the data demand.
 

scs_wy@yahoo_cn

New Member

Thank you for your hint, avenger.
This is the last part of the output file:
---------------------------------------------------------------------------------------------
(main) start of main integration loop
(main) -------------------------------------------------------------------------
(tStamp_write) cpl model date 0001-01-01 00000s wall clock 2009-05-31 16:19:57 avg dt 0s dt 0s
(cpl_map_npFixNew3) compute bilinear weights & indicies for NP region.
(cpl_map_npFixNew3) compute bilinear weights & indicies for NP region.
(main) -------------------------------------------------------------------------
(main) start of main integration loop
(main) -------------------------------------------------------------------------
(tStamp_write) cpl model date 0001-01-01 00000s wall clock 2009-05-31 16:19:57 avg dt 0s dt 0s
(cpl_map_npFixNew3) compute bilinear weights & indicies for NP region.
(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0
(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0

(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0
(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0
(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0
(cpl_bundle_copy) WARNING: bundle aoflux_o has accum count = 0
(flux_atmOcn) FYI: this routine is not threaded
(flux_atmOcn) FYI: this routine is not threaded(flux_atmOcn) FYI: this routine is not threaded

(flux_atmOcn) FYI: this routine is not threaded(flux_atmOcn) FYI: this routine is not threaded

(flux_atmOcn) FYI: this routine is not threaded
(flux_atmOcn) FYI: this routine is not threaded
print_memusage iam 3 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 34124 20737 886 678 0print_memusage iam 1 stepon after dynpkg. -1 in the next line means unavailableprint_memusage iam 0 stepon after dynpkg. -1 in the next line means unavailable
print_memusage iam 2 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 34155 20763 886 678 0


print_memusage: size, rss, share, text, datastack= 38963 25219 918 678 0
print_memusage: size, rss, share, text, datastack= 33893 20700 886 678 0
print_memusage iam 10 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 33675 20355 891 678 0
print_memusage iam 16 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 33904 20152 890 678 0
print_memusage iam 9 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 33684 20288 891 678 0
print_memusage iam 8 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 33967 20395 892 678 0
print_memusage iam 31 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 33571 20001 891 678 0
print_memusage iam 38 stepon after dynpkg. -1 in the next line means unavailable
678 0
print_memusage: size, rss, share, text, datastack= 35513 21935 891 678 0
print_memusage iam 35 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 35130 20223 876 678 0
print_memusage iam 17 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 35420 21744 891 678 0
print_memusage iam 18 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 35297 21621 891 678 0


print_memusage iam 26 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 35019 21490 891 678 0
print_memusage iam 25 stepon after dynpkg. -1 in the next line means unavailableprint_memusage iam 24 stepon after dynpkg. -1 in the next line means unavailableprint_memusage iam 14 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 35339 21525 891 678 0
print_memusage iam 20 stepon after dynpkg. -1 in the next line means unavailable
print_memusage: size, rss, share, text, datastack= 35091 21550 892 678 0
print_memusage iam 32 stepon after dynpkg. -1 in the next line means unavailable
rank 92 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 92: killed by signal 9
rank 91 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 91: killed by signal 9
rank 90 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 90: killed by signal 9
rank 88 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 88: killed by signal 9
rank 111 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 111: killed by signal 9
rank 108 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 108: killed by signal 9
rank 100 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 100: killed by signal 9
rank 97 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 97: killed by signal 9
rank 96 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 96: killed by signal 9
rank 126 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 126: killed by signal 9
rank 118 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 118: killed by signal 9
rank 117 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 117: killed by signal 9
rank 115 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 115: killed by signal 9
Stopping mpdboot ...
Sun May 31 16:19:58 CST 2009 -- CSM EXECUTION HAS FINISHED
---------------------------------------------------------------------------------------------

It seems that CCSM3 started running but then stopped suddenly for an unknown reason.
The log file of every component (cpl, atm, ocn, lnd, ice) shows no error output; it just stops.
I don't know MPI well, so could you give me some more detail?
Many thanks!
 
Well, I am also a beginner with MPI. You said you installed Intel MPI, right? When I installed Intel MPI, I ran some tests from the documentation to check that the nodes were really talking to each other. Since the boxes were reachable over both the InfiniBand and the gigabit networks, I took extra care to ensure that the addresses were routed through the ib0 interface.

Since you have already run CCSM3 successfully at the other resolution (gx1v3), that does not seem to be the problem. Maybe you skipped a setting while building the gx3v5 version. Would it be too hard to rebuild the gx1v3 case from scratch (as you did for gx3v5) and compare?

The exit statuses of the processes really do indicate abnormal termination: signal 9 is SIGKILL, which is generally a last resort for killing a process without giving it a chance to save data or clean up. This kind of signal should not be issued even on error (where SIGQUIT would be expected), but maybe it just comes from something in the scripts, where the author passed -9 (SIGKILL) to a kill command.
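To make the signal numbers in the output easier to read, here is a minimal Python sketch that pulls the "killed by signal" lines out of a log and maps each number to its name. The helper names (`signal_name`, `killed_ranks`) and the regular expression are my own assumptions based on the lines posted above, not part of CCSM3 or Intel MPI, and the names it prints are the ones defined by the system where the script runs.

```python
import re
import signal

# Pattern for lines like "exit status of rank 92: killed by signal 9"
# (assumed from the output posted above; other MPI launchers may word it differently).
KILLED_RE = re.compile(r"exit status of rank (\d+): killed by signal (\d+)")


def signal_name(number):
    """Return the symbolic name of a signal number, or a fallback string."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this platform.
        return f"signal {number}"


def killed_ranks(log_text):
    """Map each MPI rank reported as killed to the name of the signal it received."""
    return {
        int(rank): signal_name(int(sig))
        for rank, sig in KILLED_RE.findall(log_text)
    }


sample = """\
rank 92 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 92: killed by signal 9
rank 91 in job 1 node49_59375 caused collective abort of all ranks
exit status of rank 91: killed by signal 9
"""

for rank, name in sorted(killed_ranks(sample).items()):
    print(f"rank {rank}: {name}")
```

On a Linux node the loop should report both ranks with the name the system gives to signal 9. Pointing `killed_ranks` at the full output file shows at a glance which ranks were killed, which may help narrow down the nodes or components involved.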

I'm sorry I can't give you more precise information, but I hope this helps you find the solution. As these forums are fairly quiet, I hope you don't mind if I share some ideas. :)

Soon I will be testing CCSM3 on two SGI machines with the Intel 10 or 11 compilers and Intel MPI. Once I reach that stage, I think I will be of more help.
 