
CESM on Stampede2 (TACC)

Hello all,

I've been attempting to get CESM 1.2.2 up and running on Stampede's KNL system, Stampede2, but we've run into a few issues. This seems to be architecture (or potentially compiler type and version) specific, as this model works on the NERSC Cori KNL system but not on Stampede2. Unfortunately, we are not able to back-migrate compilers or try other versions, as they are not available on this system.

We are running at a resolution of two degrees in the B_1850-2000_CN compset. The main error message being produced now is the following, which seems to stem from the ice_transport_remap.F90 file:

Code:
forrtl: severe (154): array index out of bounds
Image              PC                Routine            Line        Source
cesm.exe           0000000001D7BFC9  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AFFE1144370  Unknown               Unknown  Unknown
cesm.exe           0000000000E3815B  ice_transport_rem         678  ice_transport_remap.F90
cesm.exe           0000000000E205E8  ice_transport_dri         549  ice_transport_driver.F90
cesm.exe           0000000000DF90D6  ice_step_mod_mp_s         679  ice_step_mod.F90
cesm.exe           0000000000C987E3  ice_comp_mct_mp_i         631  ice_comp_mct.F90
cesm.exe           0000000000417E59  ccsm_comp_mod_mp_        3248  ccsm_comp_mod.F90
cesm.exe           0000000000439142  MAIN__                     91  ccsm_driver.F90
cesm.exe           000000000041467E  Unknown               Unknown  Unknown
libc-2.17.so       00002AFFE1674B35  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000414569  Unknown               Unknown  Unknown

This message was produced after almost 1.5 years during a long simulation, but was produced again after just 15 days when we re-submitted the same run script. It seems there's a root problem we're unaware of that is leading to unpredictable failures.

We are using the Intel compiler (version 17.0.4), with the following flags supplied to the model in config_compilers.xml:

Code:
-DFORTRANUNDERSCORE -DNO_R16 -DCRMACCEL -DSTRATOKILLER
-openmp
-openmp
-openmp
-free
-fixed -132
-g -CU -check pointers -fpe0 -ftz
-O2
-no-opt-dynamic-align -fp-model precise -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs
-O2 -fp-model precise
-O0
-r8
ifort
icc
icpc
mpif90
mpicc
mpicxx
FORTRAN
-cxxlib
TRUE

Any insight would be much appreciated.

Best,
Meg Fowler
 
This case did not work for me on Stampede2, so for my initial testing I switched to I_1850_CN, which runs fine.

For B_1850_CN did you get runtime warnings from NetCDF? Are you sure all the input files are on hand?
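A quick way to check, assuming the usual cesm.log name in the run directory (the path below is only a placeholder):

Code:
# sketch: tally NetCDF messages in the CESM log; adjust the run-directory path
grep -i netcdf /path/to/run/cesm.log.* | sort | uniq -c | sort -rn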
 
Thanks for checking this out, Max. It's odd that something would be systematically wrong with the B_1850_CN (or B_1850-2000_CN) compset. The appropriate files are on hand, but there were a few NetCDF errors/warnings that appeared in the cesm log (e.g., "NetCDF: Invalid dimension", "NetCDF: Variable not found", "NetCDF: Attribute not found", etc.). These had appeared even in simulations that were able to successfully run for a few months, however, so I'm unsure how they relate to this particular failure mode.

Did you make any changes in the machine files of CESM to run on Stampede2 instead of Stampede 1? If so, could I glance at the modifications you made?

As of now, it seems that we can work around the problem by assigning only 32 tasks per node instead of 64. The error doesn't seem to be related to a memory limitation, though (based on some memory-usage checking by TACC support), so it's also unclear why this gets rid of the array-index-out-of-bounds problem we were experiencing previously.
 

So ICE is not active in I_1850_CN—that explains why it works, but B_1850* doesn't.



Interesting that you can run with MAX_TASKS_PER_NODE=32; I have been using 64.
I will test that myself!



I was able to run B_1850_CN with MAX_TASKS_PER_NODE=64 up to the point where it started writing restart files,
or at least that is how it appeared to me, a relatively inexperienced user (I'm really a benchmarker).
But all components time-stepped, without any NetCDF warnings (those started in the shutdown phase).
To achieve this partial success, I built MODEL="cice" with O0 instead of O2, so I suspect a compiler bug.
I am continuing to chase it.



As for sharing my port, personally I would love to—but I am working under contract, so first I'd need to ask my client.
 
It certainly does seem to be related to the ice model in particular. How do you go about changing the optimization flag for only that model component? Is this something you include in the machine build script?

If it makes any difference, I've still got MAX_TASKS_PER_NODE set to 256 (in theory, each node has 64 cores with 4 threads per core, so 256 logical cores) and specify --ntasks-per-node=32 via SBATCH to get a functioning simulation.

Understandable that you need to check before sharing your machine files. If your client is open to it, though, I'd be incredibly appreciative! Happy to share what we've worked out machine-file-wise in return as well.

-Meg
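For what it's worth, the SBATCH side of that is just something like the following (the node count is only a placeholder):

Code:
#SBATCH --nodes=8               # placeholder node count
#SBATCH --ntasks-per-node=32    # 32 MPI tasks per KNL node works for us; 64 does not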
 
(I'm in Germany, so this'll be my last post for the day.)


Could you post your compiler flags here?



I'm working on a binary search for the problem file(s); to do that, I modified Macros, again using gmake conditionals.



Word around here is that CESM doesn't work with OpenMP on Stampede2, maybe KNL in general—another pending bug hunt for me!



An option not open to me, but perhaps to you, would be to build B_1850_CN (or whatever) on an Intel system that isn't Phi, with Intel tools, to see whether it's the hardware/tools or the source.



But I shouldn't say any more mmph mmph mmph
 
Interesting, I wasn't aware that you could specify compiler flags for each component of the model - good to know! I've heard that OpenMP is a problem for CESM as well - this has been an issue on multiple machines now, and stems from the ice_transport_remap.F90 file. I believe you can still use OpenMP for the other model components, but the ice model can't be enabled for OMP. As far as other systems go, the model seems to work fine on NERSC's Cori-KNL machine, even with 64 tasks per node. This is partly why we're having such a hard time finding the exact cause of this problem - the same model works on one machine, but takes some odd configurations to get working on another machine with fairly similar architecture. So far, the 32 tasks work-around is holding, but it's frustrating to be unable to use the entire node. 
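In case it's useful, a sketch of how I've seen per-component threading set in CESM 1.2: keep OpenMP for the other components but leave CICE single-threaded via env_mach_pes.xml. The thread counts below are illustrative, not what we actually run; double-check the variable names against your own env_mach_pes.xml, and a rebuild is needed afterwards.

Code:
# illustrative only; run from the case directory, then rebuild the case
./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 1
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 4
./cesm_setup -clean
./cesm_setup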
 

Do you happen to know the size of memory on Cori-KNL nodes (cat /proc/meminfo)? Never mind—I checked: both Cori and Stampede2 KNL nodes have 96GB. I had suspected a memory issue (32 tasks have twice as much memory apiece as 64). This would have also explained why multithreading fails when ICE is active. It's still possible that on Stampede2 the memory isn't all available for some reason.



I will test 32 tasks per node myself, for B_1850_CN, the compset that I'd most like to run.



My client allowed me to share configuration, but not results; so once I get comfortable with CESM I could give some specifics. But I didn't change much, at least it doesn't seem like a lot.
 

Next time you run on Cori, try this.
While CESM is running, log onto one of the nodes it's using, get the PIDs of its MPI tasks, and monitor VmPeak in /proc/$pid/status. That gives the maximum memory used by the process since it started (its memory high-water mark). The information is only available while the process exists, so you have to check while CESM is running.
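Something like this, run on a compute node while the job is alive (a sketch, assuming the executable is named cesm.exe):

Code:
# csh sketch: print the memory high-water mark of each CESM task on this node
foreach pid ( `pgrep -u $USER cesm.exe` )
    echo $pid `grep VmPeak /proc/$pid/status`
end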



I will do the same on Stampede2 with a MAX_TASKS_PER_NODE=32 run, if that works for me.

 
I agree that a memory limitation would make sense. That said, we've worked with someone at TACC on this, and from their results it doesn't seem so obvious that memory is the problem. Simulations with 32 tasks per node have plenty of available memory; it's hard to say with 64 tasks per node, because those jobs fail too quickly to get a useful memory reading. Perhaps you'll have more luck determining whether memory is the main problem. Did the B_1850_CN compset you tried work?

The test you suggest on Cori might be useful, though I worry the results might not be directly applicable, since 64 tasks per node works on Cori but not on Stampede2. Next time I'm testing over there, I'll be sure to check.
 

All my latest tests have failed, including a run with MAX_TASKS_PER_NODE=32.



There appear to be "magic numbers" in cice—were you aware of that? nproc=320 or 640, for example; using those, you get (apparently) very different decompositions than with nproc=512 or 1024.



But even using the "blessed" values of nproc my runs are dying with a "ridging error." So now I've increased the iteration limit from 20 to 100; and I'm testing a build with "-fp-model precise" instead of "source" in case there is some precision issue at play.



How many total cores are you running on? Are you using the Intel 17 tools?
 

Promise not to laugh—the following is my configuration (excerpted from config_compilers.xml). It is a work in progress....



Code:
<compiler COMPILER="intel">
  <!-- http://software.intel.com/en-us/articles/intel-composer-xe/ -->
  <ADD_CPPDEFS> -DFORTRANUNDERSCORE -DNO_R16 -DLinux -DCPRINTEL </ADD_CPPDEFS>
  <CFLAGS> -O2 -fp-model precise </CFLAGS>
  <CXX_LDFLAGS> -cxxlib </CXX_LDFLAGS>
  <CXX_LINKER>FORTRAN</CXX_LINKER>
  <FC_AUTO_R8> -r8 </FC_AUTO_R8>
  <AAA_FFLAGS> -g -traceback -convert big_endian -assume byterecl -assume realloc_lhs -fp-model source </AAA_FFLAGS>
  <FFLAGS> $(AAA_FFLAGS) -O2 -ftz -qno-opt-dynamic-align </FFLAGS>
  <FFLAGS_NOOPT> $(AAA_FFLAGS) -O0 </FFLAGS_NOOPT>
  <FIXEDFLAGS> -fixed -132 </FIXEDFLAGS>
  <FREEFLAGS> -free </FREEFLAGS>
  <MPICC> mpiicc </MPICC>
  <MPICXX> mpicxx </MPICXX>
  <MPIFC> mpif90 </MPIFC>
  <SCC> icc </SCC>
  <SCXX> icpc </SCXX>
  <SFC> ifort </SFC>
  <SUPPORTS_CXX>TRUE</SUPPORTS_CXX>
  <ADD_FFLAGS DEBUG="TRUE"> -O0 -g -check uninit -check bounds -check pointers -fpe0 </ADD_FFLAGS>
  <!-- <ADD_FFLAGS DEBUG="FALSE"> -O2 </ADD_FFLAGS> -->
  <ADD_CFLAGS compile_threaded="true"> -qopenmp </ADD_CFLAGS>
  <ADD_FFLAGS compile_threaded="true"> -qopenmp </ADD_FFLAGS>
  <ADD_LDFLAGS compile_threaded="true"> -qopenmp </ADD_LDFLAGS>
  <ADD_CPPDEFS MODEL="pop2"> -D_USE_FLOW_CONTROL </ADD_CPPDEFS>
</compiler>

<compiler MACH="stampede2">
  <CONFIG_ARGS> --host=Linux </CONFIG_ARGS>
  <ADD_CPPDEFS>-DHAVE_NANOTIME</ADD_CPPDEFS>
  <NETCDF_PATH>$(TACC_NETCDF_DIR)</NETCDF_PATH>
  <PNETCDF_PATH>$(TACC_PNETCDF_DIR)</PNETCDF_PATH>
  <PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
</compiler>

<compiler MACH="stampede2" COMPILER="intel">
  <ADD_SLIBS>$(shell $(NETCDF_PATH)/bin/nf-config --flibs) -L$(TACC_PNETCDF_LIB) -lpnetcdf</ADD_SLIBS>
  <FFLAGS MODEL="cice">$(FFLAGS_NOOPT)</FFLAGS>
  <ADD_CFLAGS>-xMIC-AVX512</ADD_CFLAGS>
  <ADD_FFLAGS>-xMIC-AVX512</ADD_FFLAGS>
  <ADD_FFLAGS_NOOPT>-xMIC-AVX512</ADD_FFLAGS_NOOPT>
</compiler>
 
I wasn't aware of any "magic numbers" in cice, no. I assume that it won't lead to scientifically different solutions, though, correct?

I've dealt with the "ridging error" frequently. One thing to try is to increase your node count by one (so if you're running a job that would only require 8 nodes, try assigning it 9 instead). That's gotten around the issue occasionally. The most recent case in which I've encountered the ice ridging error actually seems to be related to settings in the ocean model: I had specified an input spun-up ocean file for the POP model, but hadn't changed init_ts_file_fmt from 'bin' to 'nc' (the input file was netCDF, not binary, in my case). Updating to -fp-model precise should also help; that solved a few error messages we had been encountering previously.

Right now, I'm running on 256 cores - 32 tasks per node. I've assigned 9 nodes (for some reason our model wouldn't work on 8 nodes, though the problem doesn't seem to be reproducible). We're still using Intel 17; I think that's the main one on Stampede2.
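For reference, the POP change I'm describing amounts to something like the following; the .nc path is only a placeholder, and as far as I know these POP2 namelist settings go in user_nl_pop2 in the case directory in CESM 1.2.

Code:
# sketch: point POP2 at a netCDF initial-state file; run from the case directory
echo " init_ts_file = '/path/to/spunup_ocean_ic.nc'" >> user_nl_pop2
echo " init_ts_file_fmt = 'nc'" >> user_nl_pop2
./preview_namelists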
 
Thanks for sharing your compiler settings! Below are ours as well; a lot of trial and error has gone into this, so there's no guarantee that everything in there is necessary or will help. But so far, this seems to be a functioning configuration (the values from our config_compilers.xml):

Code:
-DFORTRANUNDERSCORE -DNO_R16 -DCRMACCEL -DSTRATOKILLER
-openmp
-openmp
-openmp
-free
-fixed -132
-g -CU -check pointers -fpe0 -ftz
-O2
-no-opt-dynamic-align -fp-model precise -convert big_endian -assume byterecl -ftz -traceback -assume realloc_lhs
-O2 -fp-model precise
-O0
-r8
ifort
icc
icpc
mpif90
mpicc
mpicxx
FORTRAN
-cxxlib
TRUE

And the Stampede2 machine-specific entries:

Code:
lustre
$(TACC_NETCDF_DIR)
-DHAVE_NANOTIME
mpicc
mpif90
mpicxx
ifort
icc
icpc
-xMIC-AVX512
-xHost
$(shell $(NETCDF_PATH)/bin/nf-config --flibs) -L$(TACC_HDF5_LIB) -lhdf5
-L$(TACC_HDF5_LIB) -lhdf5
$(TRILINOS_PATH)
 
Please see models/ice/cice/bld/cice_decomp.xml (and I have no idea about the science...).

I'll check OCN init_ts_file_fmt.

I already tried "precise" - no apparent effect. I suppose "strict" would be the last hope.

So you are getting your case to work on Stampede2???
 
Thanks for sharing.

This is perplexing, as our configurations aren't very different.



You don't have "-D_USE_FLOW_CONTROL" in POP2 CPPDEFS; I did—I've removed it now.



Your CPPDEFS have CRMACCEL and STRATOKILLER, which I don't. No idea what those do, but I'll add them.



Beyond those, the only significant difference I see is your SLIBS: are you loading hdf5 and netcdf, or phdf5 and parallel-netcdf? I've been using PnetCDF, and also loading parallel-netcdf (with phdf5). When you invoke "nf-config --flibs" you get "-L$(TACC_HDF5_DIR)/lib" but not "-lhdf5", so that's a possible factor, although I've tried it both ways, including adding or not adding it to LDFLAGS.



I have never set TRILINOS_PATH, not thinking it mattered because I don't switch on Trilinos— do you?



More experiments, oh joy!



P.S.: mpicc is not the same as mpiicc.



Notes added later: "-DINTEL -DCPRINTEL" are added to Macros anyway, so your removal of "-DIntel -DCPRINTEL" from CPPDEFS seems a no-op. Also, I found an entry



Code:
<compiler>
  <ADD_CPPDEFS MODEL="pop2"> -D_USE_FLOW_CONTROL </ADD_CPPDEFS>
</compiler>



which applies "_USE_FLOW_CONTROL" to pop2 for all compilers, so again a distinction between our configurations without a difference.

 
I'm definitely not sure why our model works and yours doesn't, given the very similar compiler options. Have you also updated settings in config_machines? These are our settings:

Code:
TACC DELL, os is Linux, 16 pes/node, batch system is SLURM
LINUX
intel,intelmic,intel14,intelmic14
impi,mvapich2,mpi-serial
$SCRATCH/cesm/$CASE/run
$SCRATCH/cesm/$CASE/bld
$ENV{WORK}/inputdata
$ENV{WORK}/inputdata
$SCRATCH/cesm/archive/$CASE
csm/$CASE
/work/04268/tg835671/stampede2/ccsm_baselines
/work/04268/tg835671/stampede2/cprnc
squeue
sbatch
srinathv -at- ucar.edu
16
256
64

I wouldn't think that this would make a huge difference, but it's possible that some changes are important (e.g., PES_PER_NODE). Most of the modules (pnetcdf, hdf5, etc.) I use are loaded in the env_mach_specific.stampede-knl file:

Code:
# -------------------------------------------------------------------------
# Stampede build specific settings
# -------------------------------------------------------------------------
#source /etc/profile.d/tacc_modules.csh
source /etc/profile.d/z01_lmod.csh

#module purge
#module load TACC TACC-paths Linux cluster cluster-paths perl cmake
#Replacing above two lines with following 3 based on TACC support advice
module reset
module load hdf5 netcdf pnetcdf intel
module load cmake
module load impi

I don't believe I switch on Trilinos, no, but it was set before and so I haven't changed it. Have you had any luck? Hopefully some of this will help. I'll add a warning, though: my successful long simulation has just failed after running for almost 150 years, and I've yet to debug the exact cause, so that remains to be done. -Meg
 
I will try the "module reset..." sequence.

Are you doing a cold start? (CLM_FORCE_COLDSTART in env_run.xml)

I'm using MCT - you? (I gather the choice is MCT or ESMF; I have ESMF ready to go, but haven't tried it yet.)

As an aside, I did a fresh start of B_1850_CN, to an unpopulated DIN directory:



Getting init_ts_file_fmt from /scratch/01882/maxd/CESM_DIN1/cesm1_2_2_1/inputdata/ccsm4_init/b40.1850.track1.1deg.006/0863-01-01/rpointer.ocn.restart



and in that file I saw that init_ts_file_fmt is set to "nc" so everything regarding input appears kosher.

My env_mach_specific is more complicated than yours, perhaps:

Code:
#! /bin/csh -f

# -------------------------------------------------------------------------
# Stampede2 build specific settings
# -------------------------------------------------------------------------

#source /etc/profile.d/tacc_modules.csh
source /etc/profile.d/z01_lmod.csh

module purge
module load TACC perl cmake

if ( $COMPILER != "intel" ) then
        echo "Unsupported COMPILER=$COMPILER"
        exit
else # COMPILER == "intel"
        module load intel/17.0.4
        if ( $MPILIB == "mpi-serial" ) then
                module load hdf5
                module load netcdf
        else if ( $MPILIB != "impi" ) then
                echo "Unsupported MPILIB=$MPILIB"
                exit
        else
                module load impi/17.0.3
                if ( $PIO_TYPENAME == "netcdf" ) then
                        module load hdf5
                        module load netcdf
                else if ( $PIO_TYPENAME == "netcdf4p" ) then
                        module load phdf5
                        module load parallel-netcdf
                else if ( $PIO_TYPENAME == "pnetcdf" ) then
                        module load hdf5
                        module load netcdf
                        module load pnetcdf/1.8.1
                else
                        echo "Unsupported PIO_TYPENAME=$PIO_TYPENAME"
                        exit
                endif
        endif
endif

# -------------------------------------------------------------------------
# Build and runtime environment variables - edit before the initial build
# -------------------------------------------------------------------------

limit stacksize unlimited
limit datasize  unlimited


(I'm still working out some refinements so I won't have to sync too many settings.)

I need to keep PES_PER_NODE=68 because I use that value to construct a pin map, although at this stage I'm not using it. Anyway MAX_TASKS_PER_NODE is what matters to mkbatch.

I've run various cases dozens of times in total since Friday, with my best result being B_1850-2000_CAM5 (0.9x1.25_gx1v6) on 256 cores (no multithreading), which did fine until dying in cam. By the way, the same build on 320 cores crashed with what was probably a NaN somewhere, because NetCDF couldn't represent it.

This is all very frustrating; I have the feeling I'm missing something stupid, because a model like this shouldn't be so fragile.

I expect to gain access to Skylake soon, so perhaps I'll have better luck on that, maybe even learn what's going wrong on KNL.
 
I'm not forcing a cold start on the CLM model, no. For my case, I want it to start from more spun-up initial conditions. I am also using the MCT interface, though. As for the env_mach_specific file, I'd only included the top few lines before - mine looks much more similar to yours in its complexity:

Code:
#! /bin/csh -f

# -------------------------------------------------------------------------
# Stampede build specific settings
# -------------------------------------------------------------------------
#source /etc/profile.d/tacc_modules.csh
source /etc/profile.d/z01_lmod.csh

#module purge
#module load TACC TACC-paths Linux cluster cluster-paths perl cmake
#Replacing above two lines with following 3 based on TACC support advice
module reset
module load hdf5 netcdf pnetcdf intel
module load cmake
module load impi

echo "**These are the modules loaded before compiler and mpi are selected**"
module list

# sungduk: added intelACC option for CRM Acceleration (-DCRMACC) turn on
if ($COMPILER == "intel" || $COMPILER == "intel14" || $COMPILER == "intelACC") then
  echo "Building for Xeon Host"
  if ($COMPILER == "intel" || $COMPILER == "intelACC") then
    module load intel/17.0.4
    if ($MPILIB != "mpi-serial") then
      module load pnetcdf/1.8.1
      setenv PNETCDF_PATH $TACC_PNETCDF_DIR
    endif
  else if ($COMPILER == "intel14") then
    module load intel/14.0.1.106
  endif
  if ($MPILIB == "mvapich2") then
    module load mvapich2
  else if ($MPILIB == "impi") then
    module unload mvapich2
    if ($COMPILER == "intel14") then
      module load impi/4.1.3.049
    else
      module load impi
    endif
  endif
  if ($COMPILER == "intel14") then
    setenv TACC_NETCDF_DIR /work/02463/srinathv/netcdf/4.2.1.1/intel/14.0.1.106/snb
  else
    module load hdf5
    module load netcdf
  endif
else if ($COMPILER == "intelmic" || $COMPILER == "intelmic14") then
  echo "Building for Xeon Phi"
  if ($COMPILER == "intelmic") then
    module load intel/13.1.1.163
  else if ($COMPILER == "intelmic14") then
    module load intel/14.0.1.106
  endif
  if ($MPILIB == "impi") then
    module unload mvapich2
    if ($COMPILER == "intelmic14") then
      module load impi/4.1.2.040
    else
      module load impi
    endif
  endif
  if ($COMPILER == "intelmic14") then
    setenv TACC_NETCDF_DIR /work/02463/srinathv/netcdf/4.2.1.1/intel/14.0.1.106/mic
  else
    setenv TACC_NETCDF_DIR /work/02463/srinathv/netcdf/4.2.1.1/intel/13.1.1.163/mic
  endif
endif

setenv NETCDF_PATH $TACC_NETCDF_DIR

# -------------------------------------------------------------------------
# Build and runtime environment variables - edit before the initial build
# -------------------------------------------------------------------------
limit stacksize unlimited
limit datasize  unlimited
I don't see too many glaring differences there. One question: does your model throw an error about loading the perl module? Mine did previously, and when I checked with TACC I was told that it's no longer a module but is available by default from the system. We might also be using different impi versions, though I think mine uses the default most often.

One more thing to try may be setting I_MPI_PIN_DOMAIN in your mkbatch script. I've used this as an analog to "-c" on Cori-KNL, which users have said is key. If you want to run with 64 tasks per node, set it to 4; for 32 tasks per node, set it to 8.

If you send me an email (mdfowler@uci.edu), I can give you the path to my machine files on Stampede as well; perhaps doing a direct "diff" would illuminate some differences that can get the model up and running?
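A sketch of how that can be set in a csh batch script (illustrative; the domain sizes just follow the rule of thumb above):

Code:
# in the csh batch script, before the ibrun/srun line
# 64 tasks/node -> I_MPI_PIN_DOMAIN=4 ; 32 tasks/node -> I_MPI_PIN_DOMAIN=8
setenv I_MPI_PIN_DOMAIN 8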
 

Yes, I stopped loading Perl, using /bin/perl by default. Also I think you got bad advice: "module reset" doesn't purge modules; I do a purge, then load only the modules I need, just to be safe. (My configuration achieved self-awareness yesterday: it's now much more sophisticated, after I put in some hours....)


AFAIK there is only one IMPI on Stampede2.

Use I_MPI_DEBUG=4 (setenv I_MPI_DEBUG 4) to see where ranks are placed/bound. I have used I_MPI_PIN_DOMAIN, but at the moment I'm less concerned about task/thread placement than about getting interesting cases to run. But I will check the output again, just to confirm the MPI ranks are where they ought to be.



Yesterday I found something, and fixed it. Now I can run B_1850_CN for at least a short simulated while. It still fails in longer tests, and I'm tracking down the reason.



Because I'm getting into sensitive (i.e., competitive advantage) territory, I gladly accept your invitation to take our discussion off line.
 