
Failure with COSP on Mira when using OpenMP threading

Probably a compiler/runtime bug rather than a code error, but building CAM5 with COSP (either version 1.3 or 1.4) and OpenMP and running with more than one thread on Mira fails during the first call to cospsimulator_intr_run with the error (in the core file)

***FAULT Encountered unhandled signal 0x0000000b (11) (SIGSEGV)
Generated by interrupt..................0x00000008 (Data TLB Miss Exception DEAR=0x0000001c09ebf160 ESR=0x0000000000800000)

This happens every time. It occurs with multiple versions of CESM, but was verified most recently with cesm1_4_beta03:

./create_newcase -case XXX -compset FAMIPC5 -mach mira -res ne30_g16

(in env_build.xml)

(in env_mach_pes.xml, making sure that there is plenty of memory available)

...

(in user_nl_cam, and with any other COSP settings, including disabling most of the individual options)

docosp = .true.

I also tried varying, in env_mach_specific,

setenv XLSMPOPTS "stack=XXX"

up to as large as "stack=1024000000".

"Never mind". There does not appear to be a problem with COSP or with the IBM compiler. As on other systems, the thread stack size needs to be increased in order to use COSP, and by no more than on other systems. Unfortunately, I did not realize that the stack size is hardwired in mkbatch.mira to be 32MB:

runjob --label short -p ${procs} -n ${ntasks} ${LOCARGS} --envs BG_THREADLAYOUT=1 --envs OMP_STACKSIZE=32M --envs OMP_NUM_THREADS=${mthrds} : ${EXEROOT}/cesm.exe >&! cesm.log.$LID

Thus my experiments changing the settings in env_mach_specific did nothing. If feasible, it might be useful to modify this logic so that the thread stack size can be set in env_mach_specific, as on other systems.

(Thanks to Az Mametjanov for figuring this out.)
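
A minimal sketch of the suggestion above, written in POSIX sh for illustration only (mkbatch.mira itself appears to be a csh script, judging by the >&! redirection). It is not the actual CESM logic: the function name build_runjob_envs and the variable CESM_THREAD_STACK are hypothetical. The idea is to fall back to the hardwired 32M only when no override is supplied, and to pass the value explicitly through --envs so that it reaches the compute environment.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real mkbatch.mira logic.
# CESM_THREAD_STACK is an invented, user-settable variable name.

# Build the "--envs" portion of the runjob command line.
#   $1 = number of OpenMP threads (defaults to 1)
# The thread stack size is taken from CESM_THREAD_STACK when it is set
# and non-empty; otherwise the 32M value quoted in the post is used.
build_runjob_envs() {
    stack="${CESM_THREAD_STACK:-32M}"
    threads="${1:-1}"
    printf '%s' "--envs BG_THREADLAYOUT=1 --envs OMP_STACKSIZE=${stack} --envs OMP_NUM_THREADS=${threads}"
}
```

With CESM_THREAD_STACK=256M set before the call, build_runjob_envs 4 would yield the same arguments as the hardwired line but with OMP_STACKSIZE=256M; the string would then be spliced into the existing runjob invocation in place of the fixed --envs arguments.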
 

jedwards

CSEG and Liaisons
Staff member
Because of the way environment variables work on Mira, setting them in env_mach_specific doesn't work: they don't get transmitted to the compute environment. But finding them in mkbatch is also difficult, obscure, and annoying. We will look into ways to improve this in the development code.
 

jacob@mcs_anl_gov

Rob Jacob
New Member
Are all the setenv commands in env_mach_specific.mira ignored? It looks like some care was taken in setting them, and not all are repeated on the runjob command line.
 