errors executing CESM1.0.4 after successful build

hoell@geog_ucsb_edu · Apr 4, 2013

Hello,I'm having difficulty running CESM1.0.4 after what appears to be a successful build. Two executables are created in different places after the build. One of the executables interacts with the nodes/processors, but fails because some necessary components are out of place. The other executable does not interact with the nodes/processors. I would be grateful if someone could shed some insight into my problem. I have outlined my methods below.Here is my system information:Operating system: Red Hat Enterprise Version 5, Linux 2.6.18-238.el5
Fortran compiler: PGI v.11.10
MPI: OpenMPI v1.4.3 built with PGI v.11.10
netCDF: netCDF 4.1.2 built with OpenMPI v1.4.3Here is how CESM1.0.4 was built on my machine. The Macros* files were updated with the appropriate paths prior to the build.> /home/hoell/Desktop/cesm/experiments/source/control/scripts/create_newcase -case test -res 1.9x2.5_gx1v6 -compset B_1850-2000_CN -mach generic_linux_pgi -scratchroot $CSMSCRATCH -din_loc_root_csmdata $CSMDATA -max_tasks_per_node 8where $CSMSCRATCH = "/home/hoell/Desktop/cesm/experiments/run/cesm_scratch" and $CSMDATA = "/home/hoell/Desktop/cesm/inputdata"Last line of CESM output: Successfully created the case for generic_linux_pgi> ./configure -caseLast line of CESM output: Successfully configured the case for generic_linux_pgi> ./test.generic_linux_pgi.buildLast lines of CESM output:- Locking file env_build.xml
- Locking file Macros.generic_linux_pgi
CCSM BUILDEXE SCRIPT HAS FINISHED SUCCESSFULLYWhen CESM builds, it produces two .exe files in separate places. I execute them as follows. In the future, I'll be sending them to the queue using a script similar to the examples included, but for now I'm testing to see whether the executables run properly with minimum effort.1) $CSMSCRATCH/test.ccsm.exe> mpirun -np 4 test.ccsm.exe(t_initf) Read in prof_inparm namelist from: drv_in
PGFIO-F-209/OPEN/unit=99/'OLD' specified for file which does not exist.
File name = drv_in
In source file /home/hoell/Desktop/cesm/experiments/source/control/models/drv/driver/seq_io_mod.F90, at line number 164-- This executable interacts with the nodes and processors, but cannot find some files. These files, such as drv_in are located in $CSMSCRATCH/run, where ccsm.exe resides. I try that executable.2) $CSMSCRATCH/run/ccsm.exe[node95:29390] *** An error occurred in MPI_Group_range_incl
[node95:29390] *** on communicator MPI_COMM_WORLD
[node95:29390] *** MPI_ERR_RANK: invalid rank
[node95:29390] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)-- This executable does not interact well with the nodes or processors.Does anyone have any insights on what I may be doing wrong?Thank you, Andy

tcraig · Apr 4, 2013

I'm not sure where the test.ccsm.exe binary is coming from. i don't believe we normally get that. what happens if you do

cd $CSMSCRATCH/run
mpirun -np ccsm.exe

all the input files are in the run directory. you probably have to cd to that directory before you startup the binary. that's what's done in the batch scripts that are generated.

hoell@geog_ucsb_edu · Apr 4, 2013

Hi,
I followed your directions and I got the same error as in my initial post. Do you have any other suggestions? Thanks.
Andy

tcraig · Apr 4, 2013

i assume you did -np 4. does drv_in exist in your run directory?

hoell@geog_ucsb_edu · Apr 4, 2013

Yes, I did use np -4.
Yes, drv_in resides in run/.
Andy

tcraig · Apr 5, 2013

It seems like the model is not running in the run directory even when you launch it there. Have you
checked with the systems folks to confirm? Have you tried running a batch job?

hoell@geog_ucsb_edu · Apr 5, 2013

I have tried running a batch job, but unfortunately the result is the same.
The system administrators don't understand the strange behavior either. The suggested that I ask the CESM folks.
We agree that the model seems to not run from the directory in which it resides. It seems strange that the model would build the executable in the directory and then behave as if the .exe should not be in that particular directory. All of the help documentation indicates that ccsm.exe in $SCRATCH/run should be the executable to run.

jedwards · Apr 9, 2013

Have you tried running a simple hello_word type mpi program on this system? That may be a reasonable next step.

hoell@geog_ucsb_edu · Apr 9, 2013

Yes, a simple program works with our mpi program. Also, I am able to run CAM5 on the system using the same configuration.

sorensen@gfy_ku_dk · May 7, 2013

Hi Andy, Did you ever get to the bottom of this error? I'm experiencing exactly the same on a CESM 1.1.1 build. Cheers,

jedwards · May 7, 2013

I think that this may be due to a bug in the rather dated version of openmpi you are using. Can you try again with a newer version? You reported 1.4.3 latest is 1.7.1

sorensen@gfy_ku_dk · May 8, 2013

I found the error on my system, which might not be related at all (but I did get the same/similar error). [manjula:22198] *** An error occurred in MPI_Group_range_incl[manjula:22198] *** on communicator MPI_COMM_WORLD[manjula:22198] *** MPI_ERR_RANK: invalid rank[manjula:22198] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort My problem turned out to be that, in the file env_mach_pes.xml, the number of PEs, TOTALPES, is automatically set to the maximum of the NTASKS_* variables (which is 16 by default). So all I had to do was to: a) Use at least -np 16 or b) set all NTASKS in env_mach_pes.xml to the desired number and run cesm_setup -clean and cesm_setup again.

zhongq@cma_gov_cn · Jun 22, 2013

The same error occured in my case run "-compset X -res f19_g16", as:[cn3645:23109] *** An error occurred in MPI_Group_range_incl
[cn3645:23109] *** reported by process [321323009,47863115546624]
[cn3645:23109] *** on communicator MPI_COMM_WORLD
[cn3645:23109] *** MPI_ERR_RANK: invalid rank
[cn3645:23109] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cn3645:23109] *** and potentially your MPI job)The openmpi version is 1.7.1, the -np is 16(default),but the error still exists. Is there any other solution for this error? Any suggestion would be thankful.

santos · Jun 24, 2013

In env_mach_pes.xml, were the NTASKS variables also set to use 16 pes? A mismatch between the two was the cause of the earlier problem in this thread. Otherwise, this is probably a different issue.

zhongq@cma_gov_cn · Jun 25, 2013

Thank you Santos. Yes, the NTASKS=16 in env_mach_pes.xml as the attachment shows.Do you have other suggestions?

santos · Jun 25, 2013

Hmm. This is not clear to me. It seems like there is something wrong with either the environment on your machine, or your batch script.I noticed that you made another post here, and you were asked to try a different version of the model. Which version are you using?

santos · Jun 26, 2013

I don't see any batch system information in your script (the lines that start with "#PBS" or "#BSUB"). Make sure that you can run a simple "Hello World" MPI program on your machine with 16 processors, before trying to port CESM.

santos · Jun 30, 2013

Hmm. This is a different error from the one you mentioned before. What have you changed since then?At a glance it looks like this could still be a problem related to MPI, but I'm not confident. I am not familiar with MCT's m_GlobalSegMap module.

zhongq@cma_gov_cn · Aug 26, 2013

I am using cesm version 1_0_4 now. To verify your hunch that there may be something wrong with the env set or the batch script, i upload the detail revisions when i port the model. could you help to check them.since the case are successfully build and ccsm.exe are produced, can i confirm that the machine environment are set correctly? To the batch script, only the mpi command is reset, the number of tasks is default number.your help are greatly appreciated.

zhongq@cma_gov_cn · Aug 26, 2013

Yes, the hello world can run successfullyhowever, when excute ccsm.exe through the same batch command, the error occurs as follows :330m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
331 008.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
332 --------------------------------------------------------------------------
333 MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
334 with errorcode 2.Do you have further idea or suggestion on this problem? Thank you!

errors executing CESM1.0.4 after successful build

New Member

Member

New Member

Member

New Member

Member

New Member

CSEG and Liaisons

New Member

New Member

CSEG and Liaisons

New Member

New Member

Member

New Member

Member

Member

Member

New Member

New Member