Runtime problems, porting to a new Linux cluster

brose@...

I am porting CESM 1.2.0 to a new Linux cluster at U.Albany.

I have successfully compiled and run complete test cases with -compset X and -compset S.

My problem is that for every other case that I have tried (including -compsets B, E and F at various resolutions), I get a successful build, but the model fails at runtime.

Here are the last few lines of an example cesm.log file from one of my failed runs (in this case -compset B, but I get the same errors for any other compset I've tried):

 

 Opened existing file b40.1850.track1.1deg.006.cam.i.0863-01-01-00000.nc

       65536

 Opened existing file 

 /data/rose_scr/cesm_inputdata/atm/cam/topo/USGS-gtopo30_0.9x1.25_remap_c051027.

 nc      131072

--------------------------------------------------------------------------

mpirun noticed that process rank 84 with PID 59514 on node snow-04 exited on signal 11 (Segmentation fault).

--------------------------------------------------------------------------

3 total processes killed (some possibly by mpirun during cleanup)

 

The crash does NOT always occur on the same node. It DOES always seem to occur after reading the topography input file, as above. I have checked the input file (which I obtained from the svn repository at NCAR) and it seems fine (e.g. I can open and view it with ncview).

I am using the latest version 14 of the Intel compilers, and netcdf libraries that were built with the same compilers.

Any suggestions?

Thanks,

Brian 

Brian Rose
University at Albany

brose@...

I have isolated the error to the FV dynamical core of CAM.

The crash occurs during execution of subroutine cam_initial() in $CCSMROOT/models/atm/cam/src/dynamics/fv/inital.F90. The same runtime error occurs for any compset that includes the finite volume CAM. A test configuration using a different dynamical core builds and runs to completion successfully (-compset F_AMIP_CAM5 -res ne30np4_gx1v6).

I am using version 14.0.1 of the intel compilers, and openmpi 1.6.4. Any suggestions as to what is causing the finite volume CAM model to fail?

 

Brian Rose
University at Albany

jedwards

Hi Brian,


Have you tried compiling with DEBUG=TRUE in env_build.xml? If it works in this mode, you can try reducing the optimization for just the fv-dycore files, or even for just the inital.F90 file. We haven't updated to Intel 14.x yet, so it's certainly possible that you've uncovered a new compiler bug.

CESM Software Engineer

brose@...

Thanks for the suggestions, jedwards.

It does indeed compile and run successfully with DEBUG=TRUE. It also compiles and runs successfully with DEBUG=FALSE and FFLAGS:= -O1 (but fails at runtime with -O2 or -O3).

I'm afraid that changing optimization settings for only one section of the code is beyond my scripting abilities. I'll be happy to test this if you can post instructions on how to set it up.

- Brian

Brian Rose
University at Albany

brose@...

What's the latest version of the Intel compilers that has been tested with CESM?

Brian Rose
University at Albany

jedwards

Hi Brian,

Intel 13.1.2 is what we are currently using.  


To compile a file or files with different compiler flags, create a Depends.{machine} or Depends.{compiler} file in your case directory, where {machine} or {compiler} matches the CESM name for your machine or compiler, then put the special Makefile instructions in that file. For example, to compile inital.F90 at reduced optimization you might write a Depends.intel file that looks like:

inital.o: inital.F90
    $(FC) -c $(INCLDIR) $(INCS) $(FFLAGS) $(FREEFLAGS) -O0 $<

Note that the whitespace in front of $(FC) must be a tab, and that you will need to run clean_build, or at least touch the inital.F90 file, before running build again.
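
For anyone still hunting for an unknown culprit, the same mechanism can cover several objects at once through a GNU make static pattern rule. This is only a sketch: REDUCED_OPT_OBJS is a name made up for illustration, and the objects listed are simply files named in this thread, not a tested list. It also assumes the build Makefile locates each .F90 source on its search path, as the single-file rule does.

```make
# Depends.intel (sketch): build a hand-picked set of objects at lower optimization.
# REDUCED_OPT_OBJS is a hypothetical variable name; edit the list as needed.
REDUCED_OPT_OBJS = \
inital.o \
spmd_dyn.o

# Static pattern rule: each listed .o is built from the matching .F90.
# The recipe line must start with a tab. -O1 is placed after $(FFLAGS) on the
# assumption that the compiler honors the last optimization flag it is given.
$(REDUCED_OPT_OBJS): %.o: %.F90
	$(FC) -c $(INCLDIR) $(INCS) $(FFLAGS) $(FREEFLAGS) -O1 $<
```

Halving the list between rebuilds should narrow the problem to a single file in a few iterations.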

 

- Jim

CESM Software Engineer

brose@...

Thanks Jim.

After a lot of trial and error, I have found the offending file: $CCSMROOT/models/atm/cam/src/dynamics/fv/spmd_dyn.F90

I can now compile and run successfully with optimization set to -O2 globally, and the following lines in Depends.intel:

spmd_dyn.o: spmd_dyn.F90
	$(FC) -c $(INCLDIR) $(INCS) $(FFLAGS) $(FREEFLAGS) -O1 $<

Going from -O1 to -O2 on this particular file is causing my run-time crash.
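
A possible refinement, untested here, is to apply the override only to optimized builds so that debug builds keep their usual flags. The sketch below assumes the DEBUG setting from env_build.xml reaches the Depends file as a make variable holding TRUE or FALSE; check the build Makefile before relying on that.

```make
# Depends.intel (sketch): lower optimization for spmd_dyn.F90 in non-debug builds only.
# Assumption: DEBUG is visible here as a make variable with the value TRUE or FALSE.
ifeq ($(DEBUG),FALSE)
spmd_dyn.o: spmd_dyn.F90
	$(FC) -c $(INCLDIR) $(INCS) $(FFLAGS) $(FREEFLAGS) -O1 $<
endif
```

As with any Depends rule, the recipe line must begin with a tab, and spmd_dyn.F90 needs a touch (or the case a clean_build) before the change takes effect.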

Thanks for your help in sleuthing this out. I hope this post helps others who may migrate to Intel 14.*

Brian

Brian Rose
University at Albany

jedwards

Hi Brian,

Thank you! This will be a great help to us.

Jim

 

CESM Software Engineer
