CESM on Stampede2 (TACC)

Hello, I've been experiencing a similar problem: the model randomly crashes at line 1879 in ice_transport_remap.F90. Our compiler flags contain "-check uninit", and the log indicated that a variable used as an array index was uninitialized. None of the conditions in the branches that initialize it were satisfied because of a NaN. I'd like to know if you have solved the problem. If so, would you mind sharing your solution or workaround?
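
For illustration, here is a minimal standalone Fortran sketch of that failure mode (hypothetical, not the actual CICE code): every ordered comparison with a NaN evaluates to false, so an if/else-if chain with no final else can leave an index variable unassigned, which is exactly what "-check uninit" then reports.

  program nan_branch_demo
    use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
    implicit none
    real    :: x
    integer :: idx                      ! deliberately never initialized
    x = ieee_value(x, ieee_quiet_nan)   ! x is NaN
    ! For an ordinary real, exactly one branch would fire;
    ! for NaN, both comparisons are .false. and idx stays undefined.
    if (x < 0.0) then
       idx = 1
    else if (x >= 0.0) then
       idx = 2
    end if
    print *, idx   ! ifort's -check uninit flags this use at run time
  end program nan_branch_demo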
 
I wish I had an easy answer for you. There were a series of issues that cropped up along the way before we could solve the error, resulting in additional errors. But if you're running on Stampede2 (KNL), I can at least share the final configuration options with you. In config_compilers.xml:

  PIO filesystem hints: lustre
  NetCDF path: $(TACC_NETCDF_DIR)
  CPPDEFS: -DHAVE_NANOTIME

Ultimately, it also seemed that we were running out of space on the nodes we requested. I believe we wound up running with an extra node; for example, if we ran with 64 tasks per node and wanted to run on 2 nodes, I would increase the total nodes requested from 2 to 3, just to be on the safe side. There's no real logical reason this should work, and I'm not at all sure that it's still necessary (it shouldn't be), but you could always try it. Let me know if it would help to share any other bits of code or files regarding the workflow.

Cheers,
Meg
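
As a rough illustration of that extra-node workaround (a sketch only: Stampede2 uses Slurm, and CESM normally generates these batch directives itself through its case scripts), the request might look like:

  #SBATCH --nodes=3        # one node more than the 2 actually needed
  #SBATCH --ntasks=128     # still 64 tasks/node x 2 nodes' worth of MPI ranks

Spreading the same number of tasks over an extra node lowers the per-node memory footprint, which may be why this helped.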
 
Thanks for answering. Actually, I'm not running on Stampede2. I compared our configuration with yours, and it seems that only the "-DHAVE_NANOTIME" is different. I have inspected our memory usage, and I'm pretty sure we aren't running out of memory. Our model often crashes on the 15th day of a month; I'm wondering if your situation was similar. Also, could you please share your compiler and MPI configuration, or any modifications to the code?
 
Hi, I am facing the same kind of issue while running compset F_2000_CAM5 at resolution f09_f09. It's getting killed with an MPI abort error. I am trying to run it on a single node with a 64-core AMD EPYC processor, which has around 512 GB of memory, using OpenMPI and GCC.

All the available memory is getting used up while running, which forces the job to be killed. I am pretty unclear about the solution; I have tried different combinations of process allocation, but I still get the same error. For me, compset X runs fine.
Does anyone have any suggestions to resolve it?

NetCDF: Invalid dimension ID or name
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node llvm-sp38 exited on signal 9 (Killed).


"NTASKS_ATM" value="32"
"NTASKS_LND" value="32"
"NTASKS_ICE" value="32"
"NTASKS_OCN" value="16"
"NTASKS_CPL" value="16"
"NTASKS_GLC" value="16"
"NTASKS_ROF" value="8"
"NTASKS_WAV" value="8"

"TOTALPES" value="64"
 
I tried compset A with resolution f19_g16 and compset B1850CN with T31_g37, but I am still getting the same error: all the memory gets used up during component initialization, forcing the machine to kill the job.
Any suggestions to resolve this?
 
Apologies for the late replies to these. The compiler settings I used were:

  MPI compiler wrappers: mpicc, mpif90, mpicxx
  Serial compilers: ifort, icc, icpc
  Target flags: -xMIC-AVX512 (KNL) / -xHost
  Link libraries: $(shell $(NETCDF_PATH)/bin/nf-config --flibs) -L$(TACC_HDF5_LIB) -lhdf5
  Trilinos path: $(TRILINOS_PATH)

As for errors where all the memory is being used: I would try reducing the number of tasks you're assigning to the node. If it's failing with 64 cores, try reducing it further and see if that helps. I'd also try a first test without setting any of the NTASKS_component variables, to see what the default options are. They're probably the same, but maybe it's best not to set them for now. If that still runs you out of memory, does running on two nodes instead work? Or does cutting each model component's NTASKS in half help?

Hope some of that helps, but I'm sure you've tried these options before...
-Meg
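
For context, here is a sketch of how settings like these are typically laid out in a config_compilers.xml machine block. The tag names follow standard CIME conventions and are an assumption here, not copied from the actual file:

  <compiler MACH="stampede2-knl" COMPILER="intel">
    <MPICC> mpicc </MPICC>   <!-- MPI wrapper compilers -->
    <MPIFC> mpif90 </MPIFC>
    <MPICXX> mpicxx </MPICXX>
    <SFC> ifort </SFC>       <!-- serial compilers -->
    <SCC> icc </SCC>
    <SCXX> icpc </SCXX>
    <FFLAGS>
      <append> -xMIC-AVX512 </append>  <!-- KNL target; -xHost on other node types -->
    </FFLAGS>
    <CFLAGS>
      <append> -xMIC-AVX512 </append>
    </CFLAGS>
    <SLIBS>
      <append> $(shell $(NETCDF_PATH)/bin/nf-config --flibs) -L$(TACC_HDF5_LIB) -lhdf5 </append>
    </SLIBS>
  </compiler>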
 