
failed to run single-point case

Dear all,

I was trying to run a single-point case with PTS_MODE in CLM4.0 but failed.
When I created the case, I set max_tasks_per_node to 8.
I checked the env_mach_pes.xml and it is as follows:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[The pasted env_mach_pes.xml contents were eaten by the forum software and are not recoverable.]
%%%%%%%%%%%%%%%%%%%%%%%

So it does not support parallel mode because there is only one grid cell. I also set USE_MPISERIAL to FALSE (otherwise it would complain), and everything goes smoothly until the run.
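For reference, I made that change from the case directory with xmlchange; I believe env_conf.xml is the file that holds USE_MPISERIAL:

./xmlchange -file env_conf.xml -id USE_MPISERIAL -val FALSE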

Here is the error message:

%%%%%%%%%%%%%%%%

(seq_comm_printcomms) ID layout : global pes vs local pe for each ID
gpe LND ATM OCN ICE GLC CPL GLOBAL CPLATM CPLLND CPLICE CPLOCN CPLGLC nthrds
--- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
0 : 0 0 0 0 0 0 0 0 0 0 0 0 1

(t_initf) Read in prof_inparm namelist from: drv_in


1 pes participating in computation for CLM

-----------------------------------

NODE# NAME
( 0) water
application called MPI_Abort(comm=0x84000002, 1) - process 0
rank 0 in job 3 water_42640 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In the run file, I tried both "mpirun -np 1 ./ccsm.exe >&! ccsm.log.$LID"
and "./ccsm.exe >&! ccsm.log.$LID"; neither works.

I would appreciate any advice.

Thanks,
Rui
 
I have also tried "Running Supported Single-point/Regional Datasets" and failed again.

I created the case with: "./create_newcase -case /home2/meir02/ccsm4_0/test -mach generic_linux_pgi -compset I -res pt1_pt1 -scratchroot /home2/meir02/ccsm4_0/test -din_loc_root_csmdata /home2/meir02/ccsm4_0/inputdata -max_tasks_per_node 8"

Then I went to the case directory to search for the variable CLM_1PT_NAME in env_conf.xml but did not find it.

I hope to find an answer on this forum.
Thanks,

Rui
 
A similar error message occurs when running in the second mode, "Running Supported Single-point/Regional Datasets". This may be an MPI issue. I hope to find some clues on this forum.

Rui

meir02 said:
[original post quoted in full; see above]
 
For the same run, this time I changed MPISERIAL_SUPPORT to TRUE and kept USE_MPISERIAL set to TRUE, but the model failed while compiling ccsm.

Here is the error message:

"-Ktrap=fp -Mfree /home2/meir02/cesm1_0/models/drv/driver/ccsm_driver.F90
gmake: *** No rule to make target `/home2/meir02/cesm1_0/test/test/lib/libmpi-serial.a', needed by `/home2/meir02/cesm1_0/test/test/run/ccsm.exe'. Stop."

Is it because the machine does not support mpi-serial, so I cannot force that variable to TRUE? I also did not find any mpi-serial related library under /usr/local/mpi/lib.

How can I tell whether the machine supports mpi-serial, and how do I run the single-point case if it does not? I hope to get some advice from this forum.

Thank you for any help.
Rui
 

Erik Kluzek (erik), CSEG and Liaisons, Staff member
Rui

Currently it's fairly complicated to set MPISERIAL_SUPPORT to TRUE. You have to modify the Macros files as well as the mkbatch files in scripts/ccsm_utils/Machines. The release in May will make this easier. So I would recommend installing MPI and getting help from your systems folks to learn how to run MPI and get it installed.

If installing MPI is a problem and you need to get MPISERIAL_SUPPORT to work, you'll need to look at how machines that have MPISERIAL_SUPPORT=TRUE reference USE_MPISERIAL==TRUE and do things differently in the Macros and mkbatch files. For example, bluefire has MPISERIAL_SUPPORT=TRUE and has this in Macros.bluefire:

...
ifeq ($(USE_MPISERIAL),TRUE)
   FC := xlf90_r
   CC := cc_r
else
   FC := mpxlf90_r
   CC := mpcc_r
endif
...
ifeq ($(USE_MPISERIAL),TRUE)
   INC_MPI := $(CODEROOT)/utils/mct/mpi-serial
   LIB_MPI :=
else
   INC_MPI :=
   LIB_MPI :=
endif
...
ifeq ($(MODEL),mct)
   ...
   ifeq ($(USE_MPISERIAL),TRUE)
      CONFIG_ARGS= --enable-mpiserial
   endif
   ...
endif

ifeq ($(MODEL),pio)
   ...
   ifeq ($(USE_MPISERIAL),TRUE)
      CONFIG_ARGS += --enable-mpiserial
   endif
endif

That's pretty much what you need for your Macros file. You need to make sure the compiler names are correct for your machine, for both the serial and MPI versions.
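For a PGI-based generic Linux machine like yours, the analogous compiler block might look something like this sketch; the wrapper names mpif90/mpicc are assumptions and depend on your MPI install:

ifeq ($(USE_MPISERIAL),TRUE)
   FC := pgf90    # plain PGI compilers for the serial build
   CC := pgcc
else
   FC := mpif90   # MPI compiler wrappers; names depend on your MPI install
   CC := mpicc
endif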

The mkbatch.bluefire file then has this in it...


if ($USE_MPISERIAL == "FALSE") then
   mpirun.lsf /contrib/bin/ccsm_launch /contrib/bin/job_memusage.exe ./ccsm.exe >&! ccsm.log.$LID
else
   /contrib/bin/job_memusage.exe ./ccsm.exe >&! ccsm.log.$LID
endif

You need to add the same if to your mkbatch file (although you would remove the "/contrib/bin/job_memusage.exe" part and just run ccsm.exe).
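On a generic Linux machine the same if would reduce to something like this sketch (the mpirun launcher is an assumption; use whatever your MPI install provides):

if ($USE_MPISERIAL == "FALSE") then
   mpirun -np 1 ./ccsm.exe >&! ccsm.log.$LID
else
   ./ccsm.exe >&! ccsm.log.$LID
endif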

Good luck...
 
Dear Erik,

Thank you very much for your reply. I now understand that if USE_MPISERIAL is TRUE, we can run the model without involving MPI; otherwise we need MPI to run the model, even for a single point using only one processor. However, in my first and second posts in this thread, I was using MPI, and it worked fine for a global run but failed for the single-point and regional runs. Could you provide any clues to solve this?

Best regards,
Rui
 

Erik Kluzek (erik), CSEG and Liaisons, Staff member
Hi Rui

OK so the problem from the beginning of the thread is...


"NODE# NAME
( 0) water
application called MPI_Abort(comm=0x84000002, 1) - process 0
rank 0 in job 3 water_42640 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

In the run file, I tried both "mpirun -np 1 ./ccsm.exe >&! ccsm.log.$LID"
and "./ccsm.exe >&! ccsm.log.$LID"; neither works."

I can't really tell from this what is going on. I suggest you try a dead-simple MPI program and see if you can get it to work, and have your systems people help you get MPI up and running. Sometimes there are system-level things you have to do to get it working. Also, you want to look in all the log files so you have a better idea of where it's crashing.
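As a sketch of such a test, assuming an MPICH-style mpif90 wrapper and mpirun launcher on your machine (adjust the names to your install):

program hello_mpi
   implicit none
   include 'mpif.h'
   integer :: ierr, rank, nprocs
   ! initialize MPI and report this rank's identity
   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
   print *, 'Hello from rank', rank, 'of', nprocs
   call MPI_Finalize(ierr)
end program hello_mpi

Compile and run it with something like "mpif90 hello_mpi.f90 -o hello_mpi" and then "mpirun -np 1 ./hello_mpi". If even that aborts the way ccsm.exe does, the problem is in the MPI setup rather than in the model.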

The User's Guide has a section about troubleshooting run-time problems...

http://www.cesm.ucar.edu/models/cesm1.0/clm/models/lnd/clm/doc/UsersGuide/x3390.html

Use the advice it gives to query the log files and see if you can tell where it's dying.
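For example, from the run directory something like this (assuming CESM1.0's usual log file naming) shows the end of each component log:

tail -20 ccsm.log.*   # driver/model output; usually contains the final error
tail -20 lnd.log.*    # CLM messages
tail -20 atm.log.*    # data atmosphere (datm) messages
tail -20 cpl.log.*    # coupler messages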
 
Dear Erik,

Thanks for the information. I have figured it out: it's because a single-point run does not support a global initial file. This is mentioned in the CLM User's Guide within CESM, but not in the one within CCSM, and I had been reading the CLM User's Guide from CCSM.

Sincerely,
Rui
 