CESM on Stampede2 (TACC)

mdfowler@...

Hello all, 

 

I've been attempting to get CESM 1.2.2 up and running on TACC's KNL system, Stampede2, but we've run into a few issues. The problem seems to be specific to the architecture (or possibly to the compiler type and version), as the same model works on the NERSC Cori KNL system but not on Stampede2. Unfortunately, we aren't able to fall back to older compilers or try other versions, as they aren't available on this system.

 

 

We are running at a resolution of two degrees in the B_1850-2000_CN compset. The main error message being produced now is the following, which seems to stem from the ice_transport_remap.F90 file: 

forrtl: severe (154): array index out of bounds

Image              PC                Routine            Line        Source
cesm.exe           0000000001D7BFC9  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AFFE1144370  Unknown               Unknown  Unknown
cesm.exe           0000000000E3815B  ice_transport_rem         678  ice_transport_remap.F90
cesm.exe           0000000000E205E8  ice_transport_dri         549  ice_transport_driver.F90
cesm.exe           0000000000DF90D6  ice_step_mod_mp_s         679  ice_step_mod.F90
cesm.exe           0000000000C987E3  ice_comp_mct_mp_i         631  ice_comp_mct.F90
cesm.exe           0000000000417E59  ccsm_comp_mod_mp_        3248  ccsm_comp_mod.F90
cesm.exe           0000000000439142  MAIN__                     91  ccsm_driver.F90
cesm.exe           000000000041467E  Unknown               Unknown  Unknown
libc-2.17.so       00002AFFE1674B35  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000414569  Unknown               Unknown  Unknown

This message was first produced almost 1.5 years into a long simulation, but appeared again after just 15 days when we re-submitted the same run script. It seems there is a root problem we're unaware of that is leading to unpredictable failures.

 

We are using the Intel compiler (version 17.0.4), with the following flags supplied to the model in config_compilers.xml:

<compiler COMPILER="intel">
  <!-- http://software.intel.com/en-us/articles/intel-composer-xe/ -->
  <ADD_CPPDEFS> -DFORTRANUNDERSCORE -DNO_R16 -DCRMACCEL -DSTRATOKILLER </ADD_CPPDEFS>
  <ADD_CFLAGS compile_threaded="true"> -openmp </ADD_CFLAGS>
  <ADD_FFLAGS compile_threaded="true"> -openmp </ADD_FFLAGS>
  <ADD_LDFLAGS compile_threaded="true"> -openmp </ADD_LDFLAGS>
  <FREEFLAGS> -free </FREEFLAGS>
  <FIXEDFLAGS> -fixed -132 </FIXEDFLAGS>
  <ADD_FFLAGS DEBUG="TRUE"> -g -CU -check pointers -fpe0 -ftz </ADD_FFLAGS>
  <ADD_FFLAGS DEBUG="FALSE"> -O2 </ADD_FFLAGS>
  <FFLAGS> -no-opt-dynamic-align -fp-model precise -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs </FFLAGS>
  <CFLAGS> -O2 -fp-model precise </CFLAGS>
  <FFLAGS_NOOPT> -O0 </FFLAGS_NOOPT>
  <FC_AUTO_R8> -r8 </FC_AUTO_R8>
  <SFC> ifort </SFC>
  <SCC> icc </SCC>
  <SCXX> icpc </SCXX>
  <MPIFC> mpif90 </MPIFC>
  <MPICC> mpicc </MPICC>
  <MPICXX> mpicxx </MPICXX>
  <CXX_LINKER>FORTRAN</CXX_LINKER>
  <CXX_LDFLAGS> -cxxlib </CXX_LDFLAGS>
  <SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
</compiler>

 

Any insight would be much appreciated. 

Best,

Meg Fowler

max@...
This case did not work for me on Stampede2, so for my initial testing I switched to I_1850_CN, which runs fine. For B_1850_CN did you get runtime warnings from NetCDF? Are you sure all the input files are on hand?

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

Thanks for checking this out, Max. It's odd that something would be systematically wrong with the B_1850_CN (or the B_1850-2000_CN) compset. The appropriate files are on hand, but a few NetCDF errors/warnings did appear in the cesm log (e.g., "NetCDF: Invalid dimension", "NetCDF: Variable not found", "NetCDF: Attribute not found", etc.). These had appeared even in simulations that were able to run successfully for a few months, however, so I'm unsure how this relates to this particular failure mode.

 

Did you make any changes in the machine files of CESM to run on Stampede2 instead of 1? If so, could I glance at the modifications you made? 

 

As of now, it seems we can work around the problem by assigning only 32 tasks per node instead of 64. The error doesn't seem to be related to a memory limitation, though (based on some memory-usage checking by TACC support), so it's also unclear why this gets rid of the array-index-out-of-bounds error we were hitting previously.

max@...

So ICE is not active in I_1850_CN—that explains why it works, but B_1850* doesn't.

Interesting that you can run with MAX_TASKS_PER_NODE=32; I have been using 64. I will test that myself!

I was able to run B_1850_CN with MAX_TASKS_PER_NODE=64 up to the point where it started writing restart files, or at least that is how it appeared to me, a relatively inexperienced user (I'm really a benchmarker). But all components time-stepped, without any NetCDF warnings (those started in the shutdown phase). To achieve this partial success, I built MODEL="cice" with -O0 instead of -O2, so I suspect a compiler bug. I am continuing to chase it.

As for sharing my port, personally I would love to—but I am working under contract, so first I'd need to ask my client.

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

It certainly does seem to be related to the ice model in particular. How do you go about changing the optimization flag for only that model component? Is this something you include in the machine build script? If it makes any difference, I've still got MAX_TASKS_PER_NODE set to 256 (in theory, each node has 64 cores with 4 threads per core, so 256 logical cores) and specify --ntasks-per-node=32 via SBATCH to get a functioning simulation.
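For reference, the relevant lines of our SLURM job script look roughly like the sketch below (the node count is illustrative only and depends on the case's total task count):

#SBATCH --nodes=8               # illustrative node count for this sketch
#SBATCH --ntasks-per-node=32    # 32 MPI tasks per KNL node, i.e. half the physical cores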

Understandable that you need to check before sharing these machine files. If your client is open to it, though, I'd be incredibly appreciative! Happy to share what we've worked out machine-file-wise in return as well.

-Meg

 

max@...

(I'm in Germany, so this'll be my last post for the day.)

<FFLAGS MODEL="cice"> your compiler flags here </FFLAGS>

I'm working on a binary search for the problem file(s); to do that, I modified Macros, again using gmake conditionals.
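As a rough illustration of the idea (a minimal sketch only, assuming the usual CESM Makefile variable names such as FC, FREEFLAGS, FFLAGS_NOOPT, and INCLDIR; the file singled out here is just an example, not my actual suspect list):

ifeq ($(MODEL), cice)
  # build one suspect file without optimization while the rest of cice keeps its normal flags
  ice_transport_remap.o: ice_transport_remap.F90
	$(FC) -c $(INCLDIR) $(FREEFLAGS) $(FFLAGS_NOOPT) $<
endif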

Word around here is that CESM doesn't work with OpenMP on Stampede2, maybe KNL in general—another pending bug hunt for me!

An option not open to me, but perhaps to you, would be to build B_1850_CN (or whatever) on an Intel system that isn't Phi, with the Intel tools, to see whether it's the hardware/tools or the source.

But I shouldn't say any more mmph mmph mmph

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

Interesting, I wasn't aware that you could specify compiler flags for each component of the model - good to know! I've heard that OpenMP is a problem for CESM as well - it has been an issue on multiple machines now, and stems from the ice_transport_remap.F90 file. I believe you can still use OpenMP for the other model components, but it can't be enabled for the ice model.

As far as other systems go, the model seems to work fine on NERSC's Cori-KNL machine, even with 64 tasks per node. This is partly why we're having such a hard time finding the exact cause of this problem - the same model works on one machine, but takes some odd configurations to get working on another machine with fairly similar architecture. So far, the 32-tasks-per-node workaround is holding, but it's frustrating to be unable to use the entire node.

max@...

Do you happen to know the size of memory on Cori-KNL nodes (cat /proc/meminfo)? Never mind—I checked: both Cori and Stampede2 KNL nodes have 96GB. I had suspected a memory issue (32 tasks have twice as much memory apiece as 64: roughly 3 GB versus 1.5 GB per task). That would also have explained why multithreading fails when ICE is active. It's still possible that on Stampede2 the memory isn't all available for some reason.

I will test 32 tasks per node myself, for B_1850_CN, the compset that I'd most like to run.

My client allowed me to share configuration, but not results; so once I get comfortable with CESM I could give some specifics. But I didn't change much, at least it doesn't seem like a lot.

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

I agree that a memory limitation would make sense. That said, we've worked with someone at TACC on this, and from their results it isn't obvious that memory is the problem. Simulations with 32 tasks per node have plenty of available memory; it's hard to say with 64 tasks per node, because those jobs fail too quickly to get a useful memory reading. Perhaps you'll have more luck determining whether memory is the main problem. Did the B_1850_CN compset you tried work?

The test you suggest on Cori might be useful, though I worry that the results might not be directly applicable, since 64 tasks per node works on Cori but not on Stampede2. Next time I'm testing over there, I'll be sure to check.

max@...

All my latest tests have failed, including a run with MAX_TASKS_PER_NODE=32.

There appear to be "magic numbers" in cice—were you aware of that? nproc=320 or 640, for example; using those you get very (?) different decompositions than with nproc=512 or 1024.

But even using the "blessed" values of nproc my runs are dying with a "ridging error." So now I've increased the iteration limit from 20 to 100; and I'm testing a build with "-fp-model precise" instead of "source" in case there is some precision issue at play.

How many total cores are you running on? Are you using the Intel 17 tools?

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

I wasn't aware of any "magic numbers" in cice, no. I assume they won't lead to different solutions scientifically, though, correct? I've dealt with the "ridging error" frequently. One thing to try is to increase your node count by one (so if you're running a job that would only require 8 nodes, try assigning it 9 instead). That's gotten around the issue occasionally. The most recent case in which I've encountered the ice ridging error actually seems to be related to settings in the ocean model: I had specified an input spun-up ocean file to the POP model, but hadn't changed init_ts_file_fmt from 'bin' to 'nc' (the input file was netCDF, not binary, in my case).
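For anyone else who hits this, the relevant POP namelist entries look roughly like the sketch below (this assumes the usual &init_ts_nml group; the file path is just a placeholder):

&init_ts_nml
   init_ts_file     = '/path/to/spunup_ocean_state.nc'   ! placeholder path
   init_ts_file_fmt = 'nc'    ! must match the file: 'nc' for netCDF, 'bin' for binary
/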

Updating to -fp-model precise should also help. That solved a few error messages we had been encountering previously. Right now, I'm running on 256 cores - 32 tasks per node. I've assigned 9 nodes (for some reason our model wouldn't work on 8 nodes, though the problem doesn't seem to be reproducible). We're using Intel 17 still; I think that's the main one on Stampede2. 


max@...
Please see models/ice/cice/bld/cice_decomp.xml (and I have no idea about the science...). I'll check OCN init_ts_file_fmt. I already tried "precise" - no apparent effect; I suppose "strict" would be the last hope. So you are getting your case to work on Stampede2???

Max R. Dechantsreiter
Performance Jones L.L.C.

max@...

Promise not to laugh—the following is my configuration (excerpted from config_compilers.xml). It is a work in progress....

<compiler COMPILER="intel">
  <!-- http://software.intel.com/en-us/articles/intel-composer-xe/ -->
  <ADD_CPPDEFS> -DFORTRANUNDERSCORE -DNO_R16 -DLinux -DCPRINTEL </ADD_CPPDEFS>
  <CFLAGS> -O2 -fp-model precise </CFLAGS>
  <CXX_LDFLAGS> -cxxlib </CXX_LDFLAGS>
  <CXX_LINKER>FORTRAN</CXX_LINKER>
  <FC_AUTO_R8> -r8 </FC_AUTO_R8>
  <AAA_FFLAGS> -g -traceback -convert big_endian -assume byterecl -assume realloc_lhs -fp-model source </AAA_FFLAGS>
  <FFLAGS> $(AAA_FFLAGS) -O2 -ftz -qno-opt-dynamic-align </FFLAGS>
  <FFLAGS_NOOPT> $(AAA_FFLAGS) -O0 </FFLAGS_NOOPT>
  <FIXEDFLAGS> -fixed -132 </FIXEDFLAGS>
  <FREEFLAGS> -free </FREEFLAGS>
  <MPICC> mpiicc </MPICC>
  <MPICXX> mpicxx </MPICXX>
  <MPIFC> mpif90 </MPIFC>
  <SCC> icc </SCC>
  <SCXX> icpc </SCXX>
  <SFC> ifort </SFC>
  <SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
  <ADD_FFLAGS DEBUG="TRUE"> -O0 -g -check uninit -check bounds -check pointers -fpe0 </ADD_FFLAGS>
  <!-- <ADD_FFLAGS DEBUG="FALSE"> -O2 </ADD_FFLAGS> -->
  <ADD_CFLAGS compile_threaded="true"> -qopenmp </ADD_CFLAGS>
  <ADD_FFLAGS compile_threaded="true"> -qopenmp </ADD_FFLAGS>
  <ADD_LDFLAGS compile_threaded="true"> -qopenmp </ADD_LDFLAGS>
  <ADD_CPPDEFS MODEL="pop2"> -D_USE_FLOW_CONTROL </ADD_CPPDEFS>
</compiler>

<compiler MACH="stampede2">
  <CONFIG_ARGS> --host=Linux </CONFIG_ARGS>
  <ADD_CPPDEFS>-DHAVE_NANOTIME</ADD_CPPDEFS>
  <NETCDF_PATH>$(TACC_NETCDF_DIR)</NETCDF_PATH>
  <PNETCDF_PATH>$(TACC_PNETCDF_DIR)</PNETCDF_PATH>
  <PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
</compiler>

<compiler MACH="stampede2" COMPILER="intel">
  <ADD_SLIBS>$(shell $(NETCDF_PATH)/bin/nf-config --flibs) -L$(TACC_PNETCDF_LIB) -lpnetcdf</ADD_SLIBS>
  <FFLAGS MODEL="cice">$(FFLAGS_NOOPT)</FFLAGS>
  <ADD_CFLAGS>-xMIC-AVX512</ADD_CFLAGS>
  <ADD_FFLAGS>-xMIC-AVX512</ADD_FFLAGS>
  <ADD_FFLAGS_NOOPT>-xMIC-AVX512</ADD_FFLAGS_NOOPT>
</compiler>

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

Thanks for sharing your compiler settings! Below are ours as well; a lot of trial and error has gone into this, so I can't guarantee that everything in there is necessary or helpful. But so far, this seems to be a functioning configuration:

 

<compiler COMPILER="intel">
  <!-- http://software.intel.com/en-us/articles/intel-composer-xe/ -->
  <ADD_CPPDEFS> -DFORTRANUNDERSCORE -DNO_R16 -DCRMACCEL -DSTRATOKILLER </ADD_CPPDEFS>
  <ADD_CFLAGS compile_threaded="true"> -openmp </ADD_CFLAGS>
  <ADD_FFLAGS compile_threaded="true"> -openmp </ADD_FFLAGS>
  <ADD_LDFLAGS compile_threaded="true"> -openmp </ADD_LDFLAGS>
  <FREEFLAGS> -free </FREEFLAGS>
  <FIXEDFLAGS> -fixed -132 </FIXEDFLAGS>
  <ADD_FFLAGS DEBUG="TRUE"> -g -CU -check pointers -fpe0 -ftz </ADD_FFLAGS>
  <ADD_FFLAGS DEBUG="FALSE"> -O2 </ADD_FFLAGS>
  <FFLAGS> -no-opt-dynamic-align -fp-model precise -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs </FFLAGS>
  <CFLAGS> -O2 -fp-model precise </CFLAGS>
  <FFLAGS_NOOPT> -O0 </FFLAGS_NOOPT>
  <FC_AUTO_R8> -r8 </FC_AUTO_R8>
  <SFC> ifort </SFC>
  <SCC> icc </SCC>
  <SCXX> icpc </SCXX>
  <MPIFC> mpif90 </MPIFC>
  <MPICC> mpicc </MPICC>
  <MPICXX> mpicxx </MPICXX>
  <CXX_LINKER>FORTRAN</CXX_LINKER>
  <CXX_LDFLAGS> -cxxlib </CXX_LDFLAGS>
  <SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
</compiler>



<compiler MACH="stampede-knl">
  <PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
  <NETCDF_PATH>$(TACC_NETCDF_DIR)</NETCDF_PATH>
  <!--PNETCDF_PATH>$(TACC_NETCDF_DIR)</PNETCDF_PATH-->
  <!-- <ADD_CPPDEFS> -DHAVE_NANOTIME -DCLOUDKILLER </ADD_CPPDEFS> THIS IS FOR CLOUDKILLER -->
  <!-- <ADD_CPPDEFS> -DHAVE_NANOTIME -DASYMTSI </ADD_CPPDEFS> THIS IS FOR ASYMTSI -->
  <ADD_CPPDEFS> -DHAVE_NANOTIME </ADD_CPPDEFS>
</compiler>


 

<compiler MACH="stampede-knl" COMPILER="intel">
  <MPICC>mpicc</MPICC>
  <MPIFC>mpif90</MPIFC>
  <MPICXX>mpicxx</MPICXX>
  <SFC>ifort</SFC>
  <SCC>icc</SCC>
  <SCXX>icpc</SCXX>
  <ADD_FFLAGS> -xMIC-AVX512 </ADD_FFLAGS>
  <ADD_CFLAGS> -xHost </ADD_CFLAGS>
  <ADD_SLIBS>$(shell $(NETCDF_PATH)/bin/nf-config --flibs) -L$(TACC_HDF5_LIB) -lhdf5</ADD_SLIBS>
  <ADD_LDFLAGS>-L$(TACC_HDF5_LIB) -lhdf5</ADD_LDFLAGS>
  <TRILINOS_PATH>$(TRILINOS_PATH)</TRILINOS_PATH>
</compiler>


max@...
Thanks for sharing.

This is perplexing, as our configurations aren't very different.

You don't have "-D_USE_FLOW_CONTROL" in POP2 CPPDEFS; I did—I've removed it now.

Your CPPDEFS have CRMACCEL and STRATOKILLER, which I don't. No idea what those do, but I'll add them.

Beyond those, the only significant difference I see is your SLIBS: are you loading hdf5 and netcdf, or phdf5 and parallel-netcdf? I've been using PnetCDF, and also loading parallel-netcdf (with phdf5). When you invoke "nf-config --flibs" you get "-L$(TACC_HDF5_DIR)/lib" but not "-lhdf5", so that's a possible factor, although I've tried it both ways, including adding or not adding it to LDFLAGS.

I have never set TRILINOS_PATH, not thinking it mattered because I don't switch on Trilinos—do you?

More experiments, oh joy!

P.S.: mpicc is not the same as mpiicc.

Notes added later: "-DINTEL -DCPRINTEL" get added to Macros anyway, so your removal of "-DIntel -DCPRINTEL" from CPPDEFS seems a no-op. Also I found an entry

<compiler>
  <ADD_CPPDEFS MODEL="pop2"> -D_USE_FLOW_CONTROL </ADD_CPPDEFS>
</compiler>

which applies "_USE_FLOW_CONTROL" to pop2 for all compilers, so again a distinction between our configurations without a difference.

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

I'm honestly not sure why our model works and yours doesn't, given the very similar compiler options. Have you also updated settings in config_machines? These are our settings:

<machine MACH="stampede-knl">
        <DESC>TACC DELL, os is Linux, 16 pes/node, batch system is SLURM</DESC>
        <OS>LINUX</OS>
        <COMPILERS>intel,intelmic,intel14,intelmic14</COMPILERS>
        <MPILIBS>impi,mvapich2,mpi-serial</MPILIBS>
        <RUNDIR>$SCRATCH/cesm/$CASE/run</RUNDIR>
        <EXEROOT>$SCRATCH/cesm/$CASE/bld</EXEROOT>
        <!--sungduk commented out <CESMSCRATCHROOT>$SCRATCH</CESMSCRATCHROOT> -->
        <!--sungduk commented out <DIN_LOC_ROOT>/scratch/projects/xsede/CESM/inputdata</DIN_LOC_ROOT> -->
        <!--sungduk commented out <DIN_LOC_ROOT_CLMFORC>/scratch/projects/xsede/CESM/inputdata/lmwg</DIN_LOC_ROOT_CLMFORC> -->
        <!--sungduk: it seems that STAMPEDE has CESM inputdata in the above directories, but permission problem arises. So I have to use manual downloading option. To do that i added the following two lines. -->
        <DIN_LOC_ROOT>$ENV{WORK}/inputdata</DIN_LOC_ROOT>
        <DIN_LOC_ROOT_CLMFORC>$ENV{WORK}/inputdata</DIN_LOC_ROOT_CLMFORC>
        <DOUT_S_ROOT>$SCRATCH/cesm/archive/$CASE</DOUT_S_ROOT>
        <DOUT_L_MSROOT>csm/$CASE</DOUT_L_MSROOT>
        <CCSM_BASELINE>/work/04268/tg835671/stampede2/ccsm_baselines</CCSM_BASELINE>
        <CCSM_CPRNC>/work/04268/tg835671/stampede2/cprnc</CCSM_CPRNC>
        <BATCHQUERY>squeue</BATCHQUERY>
        <BATCHSUBMIT>sbatch</BATCHSUBMIT>
        <SUPPORTED_BY>srinathv -at- ucar.edu</SUPPORTED_BY>
        <GMAKE_J>16</GMAKE_J>
        <MAX_TASKS_PER_NODE>256</MAX_TASKS_PER_NODE>
        <PES_PER_NODE>64</PES_PER_NODE>
</machine>

I wouldn't think that this would make a huge difference, but it's possible that some of the settings matter (e.g., PES_PER_NODE). Most of the modules (pnetcdf, hdf5, etc.) I use are loaded in the env_mach_specific.stampede-knl file:

# -------------------------------------------------------------------------
# Stampede build specific settings
# -------------------------------------------------------------------------

#source /etc/profile.d/tacc_modules.csh
source /etc/profile.d/z01_lmod.csh

#module purge
#module load TACC TACC-paths Linux cluster cluster-paths perl cmake

# Replacing above two lines with following 3 based on TACC support advice
module reset
module load hdf5 netcdf pnetcdf intel
module load cmake

module load impi
I don't believe I switch on Trilinos, no, but it was set before and so I haven't changed it. Have you had any luck? Hopefully some of this will help. I'll add a warning, though: my previously successful long simulation has just failed after running for almost 150 years, and I've yet to debug the exact cause, so that remains to be done.

-Meg 

 

max@...

I will try the "module reset..." sequence.

Are you doing a cold start? (CLM_FORCE_COLDSTART in env_run.xml)
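(If it helps, a quick way to check or flip that setting from the case directory, assuming the usual CESM 1.2 xmlchange syntax:)

grep CLM_FORCE_COLDSTART env_run.xml                             # check the current value
./xmlchange -file env_run.xml -id CLM_FORCE_COLDSTART -val on    # force a CLM cold start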

I'm using MCT - you? (I gather the choice is MCT or ESMF; I have ESMF prêt-à-porter, but haven't tried it yet.)

As an aside, I did a fresh start of B_1850_CN, to an unpopulated DIN directory:

Getting init_ts_file_fmt from /scratch/01882/maxd/CESM_DIN1/cesm1_2_2_1/inputdata/ccsm4_init/b40.1850.track1.1deg.006/0863-01-01/rpointer.ocn.restart

and in that file I saw that init_ts_file_fmt is set to "nc", so everything regarding input appears kosher.

My env_mach_specific is more complicated than yours, perhaps:

#! /bin/csh -f

# -------------------------------------------------------------------------
# Stampede2 build specific settings
# -------------------------------------------------------------------------

#source /etc/profile.d/tacc_modules.csh
source /etc/profile.d/z01_lmod.csh

module purge
module load TACC perl cmake

if ( $COMPILER != "intel" ) then
        echo "Unsupported COMPILER=$COMPILER"
        exit
else # COMPILER == "intel"
        module load intel/17.0.4
        if ( $MPILIB == "mpi-serial" ) then
                module load hdf5
                module load netcdf
        else if ( $MPILIB != "impi" ) then
                echo "Unsupported MPILIB=$MPILIB"
                exit
        else
                module load impi/17.0.3
                if ( $PIO_TYPENAME == "netcdf" ) then
                        module load hdf5
                        module load netcdf
                else if ( $PIO_TYPENAME == "netcdf4p" ) then
                        module load phdf5
                        module load parallel-netcdf
                else if ( $PIO_TYPENAME == "pnetcdf" ) then
                        module load hdf5
                        module load netcdf
                        module load pnetcdf/1.8.1
                else
                        echo "Unsupported PIO_TYPENAME=$PIO_TYPENAME"
                        exit
                endif
        endif
endif

# -------------------------------------------------------------------------
# Build and runtime environment variables - edit before the initial build
# -------------------------------------------------------------------------

limit stacksize unlimited
limit datasize  unlimited

(I'm still working out some refinements so I won't have to sync too many settings.)

I need to keep PES_PER_NODE=68 because I use that value to construct a pin map, although at this stage I'm not using it. Anyway MAX_TASKS_PER_NODE is what matters to mkbatch.

I've run various cases dozens of times in total since Friday, with my best result being B_1850-2000_CAM5 (0.9x1.25_gx1v6) on 256 cores (no multithreading), which did fine until dying in CAM. By the way, the same build on 320 cores crashed with what was probably a NaN somewhere, because NetCDF couldn't represent it.

This is all very frustrating; I have the feeling I'm missing something stupid, because a model like this shouldn't be so fragile.

I expect to gain access to Skylake soon, so perhaps I'll have better luck on that, maybe even learn what's going wrong on KNL.

Max R. Dechantsreiter
Performance Jones L.L.C.

mdfowler@...

I'm not forcing a cold start on the CLM model, no. For my case, I want it to start from more spun-up initial conditions. I am also using the MCT interface though. 

As for the env_mach_specific file, I'd only included the top few lines before; the full file is much closer to yours in complexity:

#! /bin/csh -f

# -------------------------------------------------------------------------
# Stampede build specific settings
# -------------------------------------------------------------------------

#source /etc/profile.d/tacc_modules.csh
source /etc/profile.d/z01_lmod.csh

#module purge
#module load TACC TACC-paths Linux cluster cluster-paths perl cmake

# Replacing above two lines with following 3 based on TACC support advice
module reset
module load hdf5 netcdf pnetcdf intel
module load cmake
module load impi

echo "**These are the modules loaded before compiler and mpi are selected**"
module list

# sungduk: added intelACC option for CRM Acceleration (-DCRMACC) turn on
if ($COMPILER == "intel" || $COMPILER == "intel14" || $COMPILER == "intelACC") then
  echo "Building for Xeon Host"

  if ($COMPILER == "intel" || $COMPILER == "intelACC") then
    module load intel/17.0.4
    if ($MPILIB != "mpi-serial") then
      module load pnetcdf/1.8.1
      setenv PNETCDF_PATH $TACC_PNETCDF_DIR
    endif
  else if ($COMPILER == "intel14") then
    module load intel/14.0.1.106
  endif

  if ($MPILIB == "mvapich2") then
    module load mvapich2
  else if ($MPILIB == "impi") then
    module unload mvapich2

    if ($COMPILER == "intel14") then
      module load impi/4.1.3.049
    else
      module load impi
    endif
  endif

  if ($COMPILER == "intel14") then
    setenv TACC_NETCDF_DIR /work/02463/srinathv/netcdf/4.2.1.1/intel/14.0.1.106/snb
  else
    module load hdf5
    module load netcdf
  endif

else if ($COMPILER == "intelmic" || $COMPILER == "intelmic14") then
  echo "Building for Xeon Phi"

  if ($COMPILER == "intelmic") then
    module load intel/13.1.1.163
  else if ($COMPILER == "intelmic14") then
    module load intel/14.0.1.106
  endif

  if ($MPILIB == "impi") then
    module unload mvapich2
    if ($COMPILER == "intelmic14") then
      module load impi/4.1.2.040
    else
      module load impi
    endif
  endif

  if ($COMPILER == "intelmic14") then
    setenv TACC_NETCDF_DIR /work/02463/srinathv/netcdf/4.2.1.1/intel/14.0.1.106/mic
  else
    setenv TACC_NETCDF_DIR /work/02463/srinathv/netcdf/4.2.1.1/intel/13.1.1.163/mic
  endif

endif

setenv NETCDF_PATH $TACC_NETCDF_DIR

# -------------------------------------------------------------------------
# Build and runtime environment variables - edit before the initial build
# -------------------------------------------------------------------------

limit stacksize unlimited
limit datasize  unlimited


I don't see too many glaring differences there. One question - does your model throw an error about loading the perl module? Mine did previously, and when I checked with TACC I was told it's no longer a module, but is available by default on the system. We might also be using different impi versions, though I think mine typically picks up the default.

One more thing to try may be setting I_MPI_PIN_DOMAIN in your mkbatch script. I've used this as an analog to "-c" on Cori-KNL, which users have said is key. If you want to run with 64 tasks per node, set it to 4; for 32 tasks per node, set it to 8. If you send me an email (mdfowler@uci.edu), I can give you the directory to my Machine files on Stampede as well; perhaps a direct "diff" would illuminate some differences that could get the model up and running.
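In the run script, that setting looks roughly like this (csh; the value 8 goes with the 32-tasks-per-node layout described above):

# pin each MPI rank to a domain of 8 logical CPUs (32 tasks per KNL node)
setenv I_MPI_PIN_DOMAIN 8
# for 64 tasks per node, the analogous value would be 4
#setenv I_MPI_PIN_DOMAIN 4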

max@...

Yes, I stopped loading Perl, using /bin/perl by default. Also I think you got bad advice: "module reset" doesn't purge modules; I do a purge, then load only the modules I need, just to be safe. (My configuration achieved self-awareness yesterday: it's now much more sophisticated, after I put in some hours....)

AFAIK there is only one IMPI on Stampede2.

Use I_MPI_DEBUG=4 (setenv I_MPI_DEBUG 4) to see where ranks are placed/bound. I have used I_MPI_PIN_DOMAIN, but at the moment am less concerned about task/thread placement than getting interesting cases to run. But I will check output again, just to confirm the MPI ranks are where they ought to be.

Yesterday I found something, and fixed it. Now I can run B_1850_CN for at least a short simulated while. It still fails in longer tests, and I'm tracking down the reason.

Because I'm getting into sensitive (i.e., competitive advantage) territory, I gladly accept your invitation to take our discussion off line.

Max R. Dechantsreiter
Performance Jones L.L.C.

max@...

Next time you run on Cori, try this. While CESM is running, log onto one of the nodes it's using, get the PIDs of its MPI tasks, and monitor VmPeak in /proc/$pid/status. That gives the maximum memory used by the process since it started—the memory high-water mark, in other words. The information is only available while the process exists, so you have to check while CESM is running.
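Something along these lines should do it (csh, run on the compute node while the job is up; cesm.exe is the executable name from the traceback earlier in this thread):

# print the VmPeak line from /proc for every cesm.exe task on this node
foreach pid ( `pgrep cesm.exe` )
    echo "PID $pid:"
    grep VmPeak /proc/$pid/status
end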

I will do the same on Stampede2 with a MAX_TASKS_PER_NODE=32 run, if that works for me.

Max R. Dechantsreiter
Performance Jones L.L.C.
