Problem running CCSM4 on IBM p6

chris_fletcher@utoronto_ca · Jun 10, 2010

Hi, we're trying to port CCSM4 to the IBM Power6 in Toronto (where CCSM3 runs well) and the model builds ok but we can't get it to run. The error message in ccsm.log is:

0: INTERNAL ERROR : catalog was closed, or catalog was not initialized.
0: sayMessage will not print the error message.

which seems to be something strange at system-level -- has anyone seen it before?

Here are the details:
System: IBM Power6, AIX, xlf v12.1.0.2, NetCDF-4, IBM default MPI, LoadLeveler
Run: Compset B1850CN, 0.9x1.25_gx1v6, "generic IBM" case settings, 128PEs.
(NOTE: the error doesn't seem to depend on the details of the run; we've tried a few different configurations, all failed)

The atmosphere initializes ok, then it crashes while initializing the land model. The last few lines of lnd.log look like this:

total runoff cells numr = 116332 numrl = 84511 numro = 31821
rtm decomp info proc = 0 begr = 1 endr = 7282 numr = 7282
proc = 0 begrl= 1 endrl= 6409 numrl= 6409
proc = 0 begro= 1 endro= 873 numro= 873

And the last few lines of ccsm.log (error message at the end):

15: proc= 15 clump no = 1 clump id= 16 beg pft = 26685 end pft = 28331 total pfts per clump = 1647
1: rtm decomp info proc = 1 begr = 7283 endr = 14564 numr = 7282
1: proc = 1 begrl= 6410 endrl= 12430 numrl= 6021
1: proc = 1 begro= 874 endro= 2134 numro= 1261
7: rtm decomp info proc = 7 begr = 51011 endr = 58302 numr = 7292
7: proc = 7 begrl= 41437 endrl= 45596 numrl= 4160
7: proc = 7 begro= 9575 endro= 12706 numro= 3132
14: rtm decomp info proc = 14 begr = 102485 endr = 109838 numr = 7354
14: proc = 14 begrl= 75192 endrl= 80532 numrl= 5341
14: proc = 14 begro= 27294 endro= 29306 numro= 2013
15: rtm decomp info proc = 15 begr = 109839 endr = 116332 numr = 6494
15: proc = 15 begrl= 80533 endrl= 84511 numrl= 3979
15: proc = 15 begro= 29307 endro= 31821 numro= 2515
0:INTERNAL ERROR : catalog was closed, or catalog was not initialized.
0: sayMessage will not print the error message.

The only thing that ran successfully was compset X, where all models are dead. Every other configuration we've tried fails (e.g. compset C) with the same error each time.

Google suggested that adding -binitfini poe_remote_main to the linking step would provide more informative error messages, but this didn't change anything.

Thanks in advance for any suggestions.
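
For anyone triaging a similar crash, here is a minimal Python sketch that groups ccsm.log lines by the task number at the start of each line, to see which tasks emitted the error and what each task printed last. It assumes every tagged line has the form "<task>: text" (or "<task>:text"), as in the excerpt above. The function names are made up for illustration and are not part of CCSM.

```python
import re

# Assumption: task-tagged lines look like "<task>: text" or "<task>:text",
# matching the ccsm.log excerpt quoted in this post.
TASK_LINE = re.compile(r"^\s*(\d+):\s*(.*)$")


def split_task(line):
    """Return (task, message), or None for a line without a task prefix."""
    match = TASK_LINE.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def tasks_reporting(lines, needle):
    """Sorted task numbers whose output contains the string `needle`."""
    hits = set()
    for line in lines:
        parsed = split_task(line)
        if parsed and needle in parsed[1]:
            hits.add(parsed[0])
    return sorted(hits)


def last_message_per_task(lines):
    """Map each task number to the last message it wrote."""
    last = {}
    for line in lines:
        parsed = split_task(line)
        if parsed:
            last[parsed[0]] = parsed[1]
    return last


if __name__ == "__main__":
    sample = [
        "15: proc = 15 begro= 29307 endro= 31821 numro= 2515",
        "0:INTERNAL ERROR : catalog was closed, or catalog was not initialized.",
        "0: sayMessage will not print the error message.",
    ]
    print(tasks_reporting(sample, "INTERNAL ERROR"))
    print(last_message_per_task(sample))
```

Run on the full log, this shows whether only task 0 reports the error or every task does, which helps separate a launcher-level problem from a model-level one.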

eaton · Jun 11, 2010

It might be useful to try running CAM in standalone mode, i.e., try running the script
$CCSM_ROOT/models/atm/cam/bld/run-ibm.csh
You'll need to look at the script and edit a few lines that set the locations of things like the root of the source tree, input data, and work directories. Also, if you're not using the LSF batch system then set ntasks manually since the LSB_HOSTS environment variable won't be present.

The reason this may help is that the CAM standalone script depends more on your default runtime environment; it doesn't try to control it as much as the CCSM scripts do, and it's possible that there are settings in the CCSM scripts that are appropriate to NCAR but not to your installation.

chris_fletcher@utoronto_ca · Jun 11, 2010

Hi, thanks for the suggestions.

We tried running CAM stand-alone; it fails in the same way (but at a slightly different place -- while initializing the ice model). But I did find something that could be relevant. The file

$CCSM_ROOT/models/atm/cam/bld/Makefile.in

that shipped with the CCSM4 release contains modifications to C and F Flags related to an xlf compiler patch for v12.1.0.6:

-tb -B/contrib/xlf/12.01.0000.0006_fppatch/ -qxflag=fixdivsimpl

We didn't have those files on our system, so I removed those switches from the Makefile and it compiled without a problem.

We are running xlf 12.1.0.2 here (sysadmin are upgrading us soon, apparently). So could it be our compiler versions that are causing these crashes? And what was wrong with 12.1.0.6 that needed patching? Are NCAR still using this patch on bluefire?

Thanks again.

eaton · Jun 11, 2010

Good catch to remove the options that implement a compiler fix. That should have gotten into the release. I don't know the details of what was being fixed, but it was related to SMT. We needed a firmware fix for SMT, and the compiler patch was a temporary workaround that prevented the compiler from generating the instructions that were causing the SMT problem. We now have the firmware patch and are no longer using the compiler patch. The problems that were related to SMT manifested themselves as irreproducible results. I haven't ever seen anything like the error message you're getting.

If CAM standalone exhibits the same problem as the full CCSM then I'd continue working with it to figure out the problem since it's a simpler testing environment.

Maybe one of the first things to try is to make sure you can run serially, e.g., use a low resolution like 10x15 and just run a few steps interactively. To do that, supply configure with the arguments "-hgrid 10x15 -nospmd -nosmp". If that's successful, then at least you know the compiler is working for a non-threaded build. Then start adding complexity. Try pure MPI before running with threading. Threading typically seems to cause more problems than MPI.
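
The "start simple, then add complexity" approach can be written down as an ordered list of configure argument sets, stopping at the first one that fails. This is only a sketch: the serial arguments are the ones quoted above, while -spmd and -smp are assumed to be the positive counterparts of -nospmd and -nosmp. The build_and_run callback is a hypothetical placeholder for whatever builds and runs a few steps at your site.

```python
# Arguments for the low-resolution test grid quoted in the post above.
BASE = ["-hgrid", "10x15"]

# Ordered from simplest to most complex. The "-spmd" and "-smp" flags are
# assumed to be the positive forms of "-nospmd" and "-nosmp".
LADDER = [
    ("serial", BASE + ["-nospmd", "-nosmp"]),
    ("pure MPI", BASE + ["-spmd", "-nosmp"]),
    ("threads only", BASE + ["-nospmd", "-smp"]),
    ("MPI + threads", BASE + ["-spmd", "-smp"]),
]


def first_failure(ladder, build_and_run):
    """Return the name of the first step whose run fails, or None.

    `build_and_run` is a hypothetical callback: it receives the list of
    configure arguments and returns True if the short test run succeeded.
    """
    for name, args in ladder:
        if not build_and_run(args):
            return name
    return None


if __name__ == "__main__":
    # Stand-in callback that pretends every MPI build fails.
    print(first_failure(LADDER, lambda args: "-spmd" not in args))
```

The first failing step narrows the search: a failure at "pure MPI" with a working serial build points at the MPI/launcher layer rather than at the compiler.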

chris_fletcher@utoronto_ca · Jun 14, 2010

OK, I have managed to run CAM in serial mode. Interestingly, the model fails with the same error when I add MPI back in (via -spmd), but it runs fine with threading enabled (-nospmd -smp ntasks = 1).

We are going to try to get our system up to the same specs as bluefire. Could you please give us the current versions being used for:
xlf
xlc
poe
MPI
System firmware

Also, it seems xlf v13.1 is now available. Would CCSM be expected to run under it?

Thanks again.

eaton · Jun 15, 2010

There's a local command I can use to get the following info:

***************************************************
NCAR SOFTWARE LEVELS: Mon Jun 14 19:30:47 MDT 2010.
***************************************************

AIX: bos.mp 5.3.10.1
CSM: csm.core 1.7.1.4
LoadLeveler: LoadL.full 3.5.1.3
GPFS: gpfs.base 3.2.1.14
VSD: rsct.vsd.vsdd 4.1.0.23
POE: ppe.poe 5.1.1.3
PESSL: pessl.rte.smp 3.3.0.2
ESSL: essl.rte.smp 4.4.0.1
FORTRAN: xlfrte 12.1.0.7
PERL: perl.rte 5.8.2.100
C: xlC.rte 10.1.0.3

I can make a specific request to our IBM rep for more info if that doesn't contain everything you need (I don't see anything specific about mpi, but maybe that's part of another package).

We're still using the xlf 12 compiler. Porting to a new compiler is always an adventure. Of course we expect the newer one to work, but if it's not giving bit-for-bit identical answers then a port validation is required, which is a lot of work.

chris_fletcher@utoronto_ca · Jun 21, 2010

Just to update, we have now identified the likely source of this error (our outdated poe distribution) and I think we can close this thread.

CCSM now runs on our machine with updated poe/MPI (from old version 5.1.0.2 to new version 5.1.1.6). We have not yet installed the firmware patch to fix the p6 reproducibility bug, but it sounds like this is unrelated to the runtime errors we were seeing.

Thanks for all the suggestions.
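
Since the fix came down to a fileset level, here is a small Python sketch for comparing local software levels against a reference listing in the "LABEL: fileset level" format posted above. It assumes levels are dotted integers and that the local fileset names match the reference names (e.g. "ppe.poe"). The function names are illustrative only.

```python
def parse_level(text):
    """Turn a dotted level such as '5.1.1.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def parse_listing(listing):
    """Parse lines like 'POE: ppe.poe 5.1.1.3' into {fileset: level}.

    Lines that do not have exactly two fields after the first colon, or
    whose second field is not a dotted integer level, are skipped.
    """
    levels = {}
    for line in listing.splitlines():
        _label, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) != 2:
            continue
        try:
            levels[fields[0]] = parse_level(fields[1])
        except ValueError:
            continue
    return levels


def older_than_reference(local, reference):
    """Sorted fileset names whose local level is below the reference level."""
    return sorted(
        name
        for name, level in local.items()
        if name in reference and level < reference[name]
    )


if __name__ == "__main__":
    reference = parse_listing(
        "NCAR SOFTWARE LEVELS: Mon Jun 14 19:30:47 MDT 2010.\n"
        "POE: ppe.poe 5.1.1.3\n"
        "FORTRAN: xlfrte 12.1.0.7\n"
    )
    # Old local poe level reported in this thread.
    local = {"ppe.poe": parse_level("5.1.0.2")}
    print(older_than_reference(local, reference))
```

Comparing tuples of integers avoids the string-comparison trap where "5.1.10" would sort before "5.1.2".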
