
mpiexec error

I started getting an error when I submitted my WACCM simulation to the NASA pleiades-san system. Once the job began to run, it wrote this error to the cesm log file and stopped:

asremexec (host 'r307i3n3'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 21243 Killed $mpicmdline_prefix -f $paramfile

I contacted NAS support at NASA and they said this:

There is an unresolved issue on pleiades that causes MPI jobs to fail at startup. The only known workaround is to retry. The following knowledge base article has more details.

http://www.nas.nasa.gov/hecc/support/kb/MPT-Startup-Failures_469.html

I tried the suggestions in that article, but to no avail. They said there is currently no timetable for resolving this issue. Does anyone know of a workaround that would still let the model run?
Thanks,
 

jedwards

CSEG and Liaisons
Staff member
How did you implement this in CESM? In your run script change
Code:
   mpiexec_mpt -n ${maxtasks} $EXEROOT/cesm.exe >&! cesm.log.$LID
to
   touch cesm.log.$LID
   setenv PATH $PATH:/u/scicon/tools/bin
   several_tries mpiexec -n ${maxtasks} $EXEROOT/cesm.exe >>& cesm.log.$LID

If everything is working correctly you should then see the error message repeated SEVERAL_TRIES_NTRIES times.
If you do not, then either it's not working correctly or the timeout variable SEVERAL_TRIES_MAXTIME should be increased.
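For readers who want to see what "retry at startup" amounts to, here is a minimal sketch of a retry wrapper in POSIX sh. It is not the real several_tries script from /u/scicon/tools/bin, whose internals are not shown in this thread. The function name retry_cmd and the variables RETRY_NTRIES, RETRY_MAXTIME and RETRY_SLEEP are made up for illustration, and the rule "only retry when the command failed within RETRY_MAXTIME seconds" is an assumption about what a startup timeout would mean.

```shell
#!/bin/sh
# Hypothetical retry helper. NOT the real several_tries; all names are illustrative.
#   RETRY_NTRIES  - maximum number of attempts (default 5)
#   RETRY_MAXTIME - a failure later than this many seconds is treated as a
#                   real run-time failure and is not retried (assumption)
#   RETRY_SLEEP   - seconds to wait between attempts (default 10)
retry_cmd() {
    ntries=${RETRY_NTRIES:-5}
    maxtime=${RETRY_MAXTIME:-60}
    pause=${RETRY_SLEEP:-10}
    attempt=1
    status=0
    while [ "$attempt" -le "$ntries" ]; do
        echo "retry_cmd: attempt $attempt of $ntries: $*"
        start=$(date +%s)
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -gt "$maxtime" ]; then
            # Ran too long to be a startup failure, so do not retry.
            echo "retry_cmd: failed after ${elapsed}s, not retrying" >&2
            return "$status"
        fi
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$ntries" ]; then
            sleep "$pause"
        fi
    done
    echo "retry_cmd: giving up after $ntries tries" >&2
    return "$status"
}
```

Because this is sh and the CESM run script is csh, a wrapper like this would have to live in its own script file and be called from the run script, the same way several_tries is called above.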
 
Hi,

Yes, I implemented it correctly. It gave me this message in the cesm log file:

/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 55006 Killed                  $mpicmdline_prefix -f $paramfile
pfe21.akren 335> more cesm.log.140520-164732
asremexec (host 'r307i7n10'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 12420 Killed                  $mpicmdline_prefix -f $paramfile
pfe21.akren 336> more cesm.log.140521-110841
/u/scicon/tools/bin/several_tries is running the command mpiexec_mpt -n 384 /nobackup/akren/waccm_test/bld/cesm.exe
asremexec (host 'r311i1n3'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 36888 Killed                  $mpicmdline_prefix -f $paramfile
/u/scicon/tools/bin/several_tries is running the command mpiexec_mpt -n 384 /nobackup/akren/waccm_test/bld/cesm.exe
asremexec (host 'r311i1n3'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 36965 Killed                  $mpicmdline_prefix -f $paramfile
/u/scicon/tools/bin/several_tries is running the command mpiexec_mpt -n 384 /nobackup/akren/waccm_test/bld/cesm.exe
asremexec (host 'r311i1n3'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 37042 Killed                  $mpicmdline_prefix -f $paramfile
/u/scicon/tools/bin/several_tries is running the command mpiexec_mpt -n 384 /nobackup/akren/waccm_test/bld/cesm.exe
asremexec (host 'r311i1n3'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 37119 Killed                  $mpicmdline_prefix -f $paramfile
/u/scicon/tools/bin/several_tries is running the command mpiexec_mpt -n 384 /nobackup/akren/waccm_test/bld/cesm.exe
asremexec (host 'r311i1n3'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 37196 Killed                  $mpicmdline_prefix -f $paramfile
/u/scicon/tools/bin/several_tries is running the command mpiexec_mpt -n 384 /nobackup/akren/waccm_test/bld/cesm.exe
asremexec (host 'r311i1n3'): request failed - unable to start interactive connection
/nasa/sgi/mpt/2.08r7/bin/mpiexec_mpt.real: line 335: 37276 Killed                  $mpicmdline_prefix -f $paramfile

However, I have tried again today to run the model, and it appears that it is up and running again as it did not give that error. I just wanted to let you and others know that the problem may be fixed.
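
For anyone checking a similar log, one quick way to confirm that the retries actually happened is to count the attempt and failure lines. This is only a sketch: the helper name count_matches is made up, and the two search strings are simply copied from the log lines quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of a file contain a fixed string.
# grep -c exits non-zero when the count is 0, so "|| true" keeps that case quiet.
count_matches() {
    grep -c -F -- "$1" "$2" || true
}

# Report launch attempts and startup failures found in a cesm log file.
report_tries() {
    echo "attempts: $(count_matches 'several_tries is running the command' "$1")"
    echo "failures: $(count_matches 'unable to start interactive connection' "$1")"
}
```

Calling report_tries with the path to a cesm log prints the two counts. If the attempt count is lower than SEVERAL_TRIES_NTRIES, that matches the situation described earlier in the thread where the timeout may need to be increased.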
 