Problem: mpirun starts model executables in different working directories

cam and clm start in my home directory instead of $EXEROOT/all.
Because of that they cannot read their input files and do not work.
The other executables start in the right place.

My system configuration.
Single SMP computer with 4 AMD Opteron processors.
OS Debian Linux, kernel 2.6.15

mpich 1.2.7p1
configured with --with-device=ch_p4 --with-comm=shared

PGI compiler v 6.1-2
I use 'generic_linux' model configuration.

I also used recommendations from PGI for compiling MPICH: http://www.pgroup.com/resources/mpich/mpich126_pgi60.htm

I have added the following lines to the file shr_msg_mod.F90:

line 77: character(256) :: curr_wd

line 95: call GetCWD(curr_wd)

line 100: write(6,F00) model,":: curr_wd is",curr_wd

In other words, I made this module get its current working directory and print it.
I have put this file in the directories SourceMods/src.cam, SourceMods/src.cpl, SourceMods/src.clm, SourceMods/src.pop, SourceMods/src.csim.

I also modified the model run script. It now generates an mpirun procgroup file of the following form:

chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/cpl
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/csim
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/csim
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/clm
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/clm
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/pop
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/pop
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/cam
chena 1 /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all/cam

chena is the name of my computer.
I am trying to run the components with 2 tasks each.
I created the case with create_test, but changed the number of tasks in env_mach.generic_linux and in .cache/generic_linux.cache.
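
As a rough illustration of the procgroup layout above, the following Python sketch builds the same kind of file: one line per task, each with the host name, the count 1, and the full path of the executable. The function name procgroup_lines and the task table are hypothetical, and the real run script is a shell script, not Python.

```python
def procgroup_lines(host, exe_dir, tasks):
    """Return procgroup lines of the form '<host> 1 <exe_dir>/<component>'.

    tasks is an ordered list of (component, ntasks) pairs; each task gets
    its own line, matching the listing in the post.
    """
    lines = []
    for component, ntasks in tasks:
        path = f"{exe_dir}/{component}"
        lines.extend(f"{host} 1 {path}" for _ in range(ntasks))
    return lines

# Values copied from the listing in the post; the helper itself is an assumption.
EXE_DIR = "/home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all"
TASKS = [("cpl", 1), ("csim", 2), ("clm", 2), ("pop", 2), ("cam", 2)]

if __name__ == "__main__":
    print("\n".join(procgroup_lines("chena", EXE_DIR, TASKS)))
```

Every line carries the absolute path of the executable, but the path says nothing about the directory the process starts in.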

Then the model 'run' script executes mpirun and all of the listed processes are started (top and ps x show them).
These processes sit in memory and consume some processor time. They do not produce any output files and are terminated either by the scheduler or manually with qdel.

The log files then contain the following:

The 'main' log file, where PBS redirects the output of run scripts:

t_setoption: option disabled: Usr Sys
(shr_msg_chdir) atm:: curr_wd is/home/veremeev
(shr_msg_chdir)
(shr_msg_chdir) file atm_stdio.nml doesn't exist, cwd has *not* been changed
(shr_msg_chStdOut) file atm_stdio.nml doesn't exist, unit 6 has *not* been changed
(shr_msg_chStdIn) file atm_stdio.nml doesn't exist, unit 5 has *not* been changed
(cpl_comm_init) setting up communicators, name = atm
===================================

... and :

t_setoption: option disabled: Usr Sys
(shr_msg_chdir) lnd:: curr_wd is/home/veremeev
(shr_msg_chdir)
(shr_msg_chdir) file lnd_stdio.nml doesn't exist, cwd has *not* been changed
(shr_msg_chStdOut) file lnd_stdio.nml doesn't exist, unit 6 has *not* been changed
(shr_msg_chStdIn) file lnd_stdio.nml doesn't exist, unit 5 has *not* been changed
(cpl_comm_init) setting up communicators, name = lnd
===================================

... and:

(shr_msg_chdir) ice:: curr_wd is/home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/all
(shr_msg_chdir) read ice_stdio.nml, changed cwd to /home/veremeev/TER.01a.T31_gx3v5.B.generic_linux.040330/ice
(shr_msg_chStdIn) read ice_stdio.nml, unit 5 connected to ice.stdin

I cannot find similar output anywhere from the coupler and ocean components. However, their logs say that they have successfully read their input files, so they are OK.

To summarize, mpirun starts clm and cam in my home directory and the other components in $EXEROOT/all.
Because of that, cam and clm cannot read their stdio files and other data and do not work.
The other components are OK.

I also tried to use mpiexec from P. Wyckoff instead of MPICH's mpirun (http://www.osc.edu/~pw/mpiexec/index.php).

It behaved the opposite way: cam and clm started in $EXEROOT/all and the other components somewhere else.

I cannot understand the reason for this strange behaviour.
Please help.
 
There are 2 bugs: one in CAM and one in CLM.

In both, shr_msg_stdio is called too early, before MPI_INIT has run.

For CAM:

File: cam.F90, line 101

call shr_msg_stdio('atm')

must be AFTER

call cpl_interface_init(cpl_fields_atmname,mpicom)

For CLM:
File: program_csm.F90, line 165

call shr_msg_stdio ('lnd')

must be moved lower, AFTER the MPI setup calls.

mpirun starts the programs with rsh.
Before connecting to a compute node with rsh, it constructs a very long command line with various -p4xxx switches (-p4pg, -p4wd). This command line is processed by the MPI_INIT subroutine.

Initially, the program being run starts in the user's home directory (/home/veremeev in my case).
MPI_INIT should then move it to the correct location, specified by the -p4wd switch.

In this case clm and cam started in /home/veremeev and immediately tried to load their stdio files, before MPI_INIT had been called.
That failed, of course.
MPI_INIT then moved them to $EXEROOT/all, but by that point it no longer mattered.
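
The sequence described above can be imitated without MPI. The Python sketch below is only a model of the behaviour: fake_mpi_init is a hypothetical stand-in for what MPI_INIT does with the -p4wd switch (change into the given directory), and the two temporary directories stand in for the home directory and $EXEROOT/all. It shows that looking for atm_stdio.nml before the init call fails, while the same lookup after it succeeds.

```python
import os
import tempfile

def fake_mpi_init(argv):
    """Hypothetical stand-in for MPI_INIT: honour '-p4wd <dir>' by chdir."""
    if "-p4wd" in argv:
        os.chdir(argv[argv.index("-p4wd") + 1])

def stdio_file_visible(name="atm_stdio.nml"):
    """Mimic the stdio check: is the namelist in the current directory?"""
    return os.path.exists(name)

def run_demo():
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as home, \
         tempfile.TemporaryDirectory() as exe_dir:
        # The namelist exists only in the run directory, not in "home".
        open(os.path.join(exe_dir, "atm_stdio.nml"), "w").close()
        argv = ["cam", "-p4wd", exe_dir]
        try:
            os.chdir(home)                 # process starts in the home directory
            before = stdio_file_visible()  # lookup done BEFORE init (the bug)
            fake_mpi_init(argv)            # init moves the process to exe_dir
            after = stdio_file_visible()   # lookup done AFTER init (the fix)
        finally:
            os.chdir(start)                # leave the temp dirs before cleanup
    return before, after

if __name__ == "__main__":
    print(run_demo())
```

This is the same reordering as in the fix: the stdio lookup only works once the init call has put the process in its run directory.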
 
"The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output."
 