mmills@ucar_edu
Member
This thread is devoted to issues related to running CESM/WACCM on the Pleiades cluster at NASA's Advanced Supercomputing (NAS) Division. Scripts included with the released version of CESM are designed to build and run on Pleiades out of the box.
In this post, I am posting fixes for two issues we've run into while running the current release (CESM1.0.2) on Pleiades. The first has prevented some users from building CESM on Pleiades. The second relates to the long-term archiving script. Both issues should be addressed in future releases.
1) "Catastrophic disk failure"
The CESM1.0.2 scripts build CESM with Intel Fortran (ifort) version 10.1. In February, NAS moved to a new NFS server for Pleiades. This caused some file inode numbers to grow larger than 2^32. The Intel 10.x compilers cannot compile files with inode numbers that large: the compile step fails with Catastrophic errors, or the link step fails with Internal Compiler errors. FYI, you can get the inode number for a directory with the command:
Code:
ls -ldi directory_name
and look in the first column. Similarly, you can get the inode numbers for your files with:
Code:
ls -li
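To check a whole source tree at once, the two commands above can be combined into a small filter. This is only a sketch: `large_inodes` is a made-up helper name, not part of CESM or the NAS tools, and the default limit of 4294967296 is simply 2^32 written out.

```shell
#!/bin/sh
# Sketch: list entries whose inode number exceeds a limit (default 2^32).
# large_inodes is a hypothetical helper name, not part of CESM or NAS tooling.
large_inodes() {
    dir="${1:-.}"
    limit="${2:-4294967296}"   # 2^32, the threshold mentioned in the post
    # ls -di prints "inode path" for each entry that find hands to it;
    # awk keeps only the lines whose first field is above the limit.
    find "$dir" -exec ls -di {} + |
        awk -v limit="$limit" '$1 + 0 > limit + 0'
}
```

Running `large_inodes /path/to/your/source` prints one "inode path" line per affected file or directory. No output means nothing under that path is above the limit.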
The simple solution is to upgrade to one of the Intel 11.x compilers. To do so, modify the env_mach_specific file in your case directory. Replace the following line:
Code:
module load comp/intel/10.1.021_64 mpi-mvapich2/1.4.1/intel netcdf/4.0-i10.1 nas
with
Code:
module load comp/intel/11.0.069_64 mpi-mvapich2/1.4.1/intel netcdf/4.0-i10.1 nas
Then modify your Macros.pleiades or Macros.pleiades_wes file. Move the -132 from the FFLAGS line to the FIXEDFLAGS line.
You can now build the model with ifort 11.
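If you have several case directories to update, the module swap described above could be scripted along these lines. This is a sketch, not something shipped with CESM: `use_ifort11` and `show_132` are hypothetical helper names, and the `sed` pattern assumes the compiler module string appears exactly as quoted above. The `-132` move is left as a manual edit, since the layout of your Macros file may differ. `show_132` only reports where the flag currently sits.

```shell
#!/bin/sh
# Sketch: switch the compiler module in env_mach_specific from ifort 10.1
# to ifort 11.0. use_ifort11 and show_132 are hypothetical helper names.
use_ifort11() {
    file="${1:-env_mach_specific}"
    # Keep a backup as FILE.bak, then replace only the compiler module name;
    # the mpi and netcdf modules on the same line are left untouched.
    sed -i.bak 's#comp/intel/10\.1\.021_64#comp/intel/11.0.069_64#' "$file"
}

# Print the numbered lines of a Macros file that contain -132, so you can
# see which flags line it sits on before moving it by hand.
show_132() {
    grep -n -e '-132' "$1"
}
```

For example, run `use_ifort11 env_mach_specific` in the case directory, then `show_132 Macros.pleiades` to locate the flag.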
2) Long-term archiving
The long-term archiving scripts move output from the short-term archive directory on /nobackup/$USER to the mass store system on lou. However, the scripts in the release fail to delete the files from the short-term archive after they have been copied to lou. Instead, the files remain in tempdirs in the short-term archive.
To fix this, go to your source code directory and modify the file scripts/ccsm_utils/Tools/ccsm_mswrite. Look for this section:
Code:
# If NAS pleiades at NASA/AMES
if( ${MACH} == "pleiades" | ${MACH} == "pleiades_wes" ) then
  set myld = `pwd`
  echo "ccsm_mswrite: ssh -q bridge2 scp -q ${myld}/${lf} lou:${rdf} "
  ssh -q bridge2 "scp -q ${myld}/${lf} lou:${rdf}"
  exit
endif
Modify this section by adding several lines after the "ssh -q bridge2..." line:
Code:
# If NAS pleiades at NASA/AMES
if( ${MACH} == "pleiades" | ${MACH} == "pleiades_wes" ) then
  set myld = `pwd`
  echo "ccsm_mswrite: ssh -q bridge2 scp -q ${myld}/${lf} lou:${rdf} "
  ssh -q bridge2 "scp -q ${myld}/${lf} lou:${rdf}"
  sleep 5
  echo "$UTILROOT/Tools/ccsm_msread ${rdf} checkmssfile"
  $UTILROOT/Tools/ccsm_msread ${rdf} checkmssfile
  if (-e checkmssfile) then
    echo "cmp -s ${myld}/${lf} checkmssfile"
    cmp -s ${myld}/${lf} checkmssfile
    if ($status == 0) then
      echo rm ${myld}/${lf}
      rm -f ${myld}/${lf}
      rm -f checkmssfile
    else
      echo archiving FAILED for file ${myld}/${lf}
    endif
  endif
  exit
endif
Files should now be deleted after archiving.
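The idea behind the added lines is copy, read back, compare, and delete the local file only if the two copies are byte-identical. Here is a stand-alone sketch of that pattern in plain sh, in case you want to try the logic away from Pleiades. `verify_then_delete` is a hypothetical name, and `cp` is only a stand-in for the scp to lou and for ccsm_msread, so this is not a replacement for the csh code above.

```shell
#!/bin/sh
# Sketch of the verify-then-delete pattern. verify_then_delete is a
# hypothetical helper; cp stands in for the scp-to-lou and ccsm_msread
# steps so the example runs on any machine.
verify_then_delete() {
    src="$1"
    dest="$2"
    check="${src}.check"
    cp "$src" "$dest" || return 1      # stand-in for: scp file to lou
    cp "$dest" "$check" || return 1    # stand-in for: ccsm_msread read-back
    if cmp -s "$src" "$check"; then
        # Copies are byte-identical, so the local file can go.
        rm -f "$src" "$check"
        return 0
    fi
    # Mismatch: keep the local file and report the failure.
    echo "archiving FAILED for file $src" >&2
    rm -f "$check"
    return 1
}
```

As in the csh version, the local file is removed only after `cmp -s` reports that the read-back copy matches. On a mismatch the original stays in place.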