CESM in CLM mode on Linux Intel Cluster

dbhart.sandia@...
CESM in CLM mode on Linux Intel Cluster

I am running the CESM1 model in CLM-only mode (the I* compsets). I am trying to run the I_1948-2004 compset, and every time I do so the model crashes sometime during month 7 of year 6. This has occurred on two different generic_linux_intel systems, both with 8 PEs per node. The problem has occurred with the ICN, I4804, I4804CN, and I compsets, and occurs regardless of the number of nodes used (tested with 8-64 PEs).

I have also tried changing to a 5-year restart-and-resubmit cycle. This appears to work around the failure, but it will be a problem for our intended application of the code.

If anyone has any suggestions as to why this would be occurring, or where I should try to get more debug information to track down the problem, I would greatly appreciate it.

Thanks,
David Hart
Geoscience Research and Applications
Sandia National Laboratories

Machines are CentOS 5.5, Intel 11.1, 2x Intel Xeon Quad-Core, 48GB RAM, Infiniband connections.

The error output from ccsm.log is as follows (no other log files show the error):
[.....]
BalanceCheck: soil balance error nstep = 97378 point = 22492 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 97381 point = 12313 imbalance = -0.000001 W/m2
pio_support::pio_die:: myrank= 0 : ERROR: nf_mod.F90: 582 :
NetCDF: Not a valid ID
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libc.so.6 000000377C6799B0 Unknown Unknown Unknown
ccsm.exe 0000000000D8BDF3 Unknown Unknown Unknown
ccsm.exe 0000000000C03203 Unknown Unknown Unknown
ccsm.exe 0000000000BC20C4 Unknown Unknown Unknown
ccsm.exe 0000000000BB6144 Unknown Unknown Unknown
ccsm.exe 0000000000A3B716 shr_dmodel_mod_mp 685 shr_dmodel_mod.F90
ccsm.exe 0000000000A38D7C shr_dmodel_mod_mp 527 shr_dmodel_mod.F90
ccsm.exe 0000000000AE7753 shr_strdata_mod_m 555 shr_strdata_mod.F90
ccsm.exe 00000000004C7FCE datm_comp_mod_mp_ 750 datm_comp_mod.F90
ccsm.exe 00000000004BFF9F atm_comp_mct_mp_a 96 atm_comp_mct.F90
ccsm.exe 0000000000424895 MAIN__ 2037 ccsm_driver.F90
ccsm.exe 000000000040E01C Unknown Unknown Unknown
libc.so.6 000000377C61D994 Unknown Unknown Unknown
ccsm.exe 000000000040DF29 Unknown Unknown Unknown
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 13782 on
node node19.cluster exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libmlx4-rdmav2.so 00002AAAAB0FC710 Unknown Unknown Unknown

David Hart
Geoscience Research & Applications
Sandia National Laboratories
Albuquerque, NM USA

devarajun@...

Please check the input data files and also their sizes. It seems to me the problem is with the input data.

Dev

dbhart.sandia@...

Dev,

Thanks. I went through all the input data, and the main problem is, of course, that there is no printout indicating which file is the problem. I verified that this is not an input data issue by copying all the data files I was using to a different-architecture system, where the run completed fine. So I think this must be an issue with MPI or NetCDF somewhere...
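For reference, one quick way I sanity-checked the files for truncation or corruption (a throwaway sketch of my own, not a CESM tool, and it cannot catch files with wrong contents) is to scan for the NetCDF magic bytes at the start of each file:

```python
# Throwaway checker: walk a directory tree and flag *.nc files that do
# not begin with a NetCDF magic number -- "CDF\x01"/"CDF\x02" for the
# classic formats, "\x89HDF" for HDF5-based netCDF-4 files. A truncated
# or corrupted download will usually fail this check.
import os

MAGIC = (b"CDF\x01", b"CDF\x02", b"\x89HDF")

def find_bad_netcdf(root):
    bad = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if not name.endswith(".nc"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                head = f.read(4)
            if not head.startswith(MAGIC):
                bad.append(path)
    return bad
```

In my case every file passed, which is consistent with the data being fine on the other system.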

I thought it might somehow be a memory issue (though these nodes have 48 GB of RAM each), but I was able to run in single-point mode (PTS), and it crashed in _exactly_ the same place while using a tenth of the memory. My next guess would be some sort of stack issue, but I've tried the stacksize setting and it did nothing.
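To be concrete, by "the stacksize setting" I mean raising the soft stack limit to the hard limit before launching the model (the csh equivalent is `limit stacksize unlimited` when the hard limit allows it). A minimal illustration using Python's `resource` module:

```python
# Illustrative only: raise this process's soft stack limit up to the
# hard limit, the same effect as "ulimit -s" in bash or
# "limit stacksize" in csh. Child processes launched afterwards
# (e.g. mpiexec and the model executable) inherit the new limit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
```

Setting this (in the run script, before mpiexec) made no difference to the crash here.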

I would still like to get this going on my local cluster, rather than the other system I verified the input data on, so if anyone has any suggestions on what in my MPI or NetCDF (or HDF5?) builds is bad, I'd appreciate the help.

David

David Hart
Geoscience Research & Applications
Sandia National Laboratories
Albuquerque, NM USA

bozbiyik@...

I am also running the CESM land model, with the I-2000CN configuration, on a Cray XT5.
I get the following error around the same time step; it looks very similar.

The log file goes like this:

[...]
BalanceCheck: soil balance error nstep = 97285 point = 5110 imbalance = 0.000000 W/m2
BalanceCheck: soil balance error nstep = 97568 point = 3735 imbalance = -0.000001 W/m2

pio_support::pio_die:: myrank= 0 : ERROR: nf_mod.F90: 582
: NetCDF: Not a valid ID
_pmii_daemon(SIGCHLD): [NID 00836] PE 0 exit signal Aborted
[NID 00836] 2011-03-08 16:14:40 Apid 1585660: initiated application termination
...

Any help is very much welcome!

Anil Bozbiyik

Climate and Environmental Physics
Bern, Switzerland
