
new machine build, run error

kwythers@...

I am working to get a new build (cesm1_2_1) running on a linux cluster. The build appears to finish successfully as I get the message:

- Locking file env_build.xml
CESM BUILDEXE SCRIPT HAS FINISHED SUCCESSFULLY

after building the test case. However, after submitting the job to our queue (we use the PBS queuing system), I get an error fairly quickly and the expected run files do not appear:

wythersk@node1084 [~/cases/testcase] $ ls
archive_metadata.sh env_derived README.case user_nl_cam
Buildconf env_mach_pes.xml README.science_support user_nl_cice
CaseDocs env_mach_specific run user_nl_clm
CaseStatus #env_run.xml# SourceMods user_nl_cpl
cesm_setup env_run.xml testcase.build user_nl_pop2
check_case exedir testcase.clean_build user_nl_rtm
check_input_data inputdata testcase.o1084694 xmlchange
create_production_test LockedFiles testcase.run xmlquery
Depends.intel logs testcase.submit
env_build.xml Macros Tools
env_case.xml preview_namelists

I think I’ve tracked it down to a problem opening the USGS-gtopo30_4x5_remap_c050520.nc file. Does this look familiar to anyone, or am I chasing the wrong thing here? Thank you in advance. Here is the section of the cesm.log file that I am referring to:

8 pes participating in computation
-----------------------------------
TASK# NAME
0 node0316
1 node0316
2 node0316
3 node0316
4 node0316
5 node0316
6 node0316
7 node0316
Opened existing file
/home/reichpb/wythersk/cases/testcase/inputdata/atm/cam/inic/fv/cami_0001-01-01
_4x5_L26_c060608.nc 65536
Opened existing file
/home/reichpb/wythersk/cases/testcase/inputdata/atm/cam/topo/USGS-gtopo30_4x5_r
emap_c050520.nc 131072
forrtl: severe (174): SIGSEGV, segmentation fault occurred

jedwards

Hi,


First, please consider cesm1.2.2 instead of 1.2.1. It's new and improved. :-) A lot of times when porting to a new machine the problem is your user environment settings, in particular the user stack size limit and data limit. We recommend setting them both to unlimited. Use the limit command in csh or the ulimit command in bash to check the limit settings.

CESM Software Engineer
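
For anyone following along, here is a minimal sketch of the limit check described above, assuming a bash shell. The function name report_limits is made up for illustration. Raising a soft limit only works up to the hard limit set by the administrator, so the sketch reports a failure instead of assuming success.

```shell
#!/usr/bin/env bash
# Print the soft limits that most often matter when porting to a new machine.
# report_limits is a hypothetical helper name, not part of CESM.
report_limits() {
    echo "stack size (kbytes): $(ulimit -S -s)"
    echo "data seg size (kbytes): $(ulimit -S -d)"
}

report_limits

# Try to raise both soft limits; this is refused if the hard limit is lower.
ulimit -S -s unlimited 2>/dev/null \
    || echo "could not raise stack limit (hard limit: $(ulimit -H -s))"
ulimit -S -d unlimited 2>/dev/null \
    || echo "could not raise data limit (hard limit: $(ulimit -H -d))"

report_limits
```

Limits set in a login shell do not necessarily carry over to a batch job, so it may be worth placing the same ulimit lines in the run script that PBS executes. That is an assumption about how a given cluster is configured, not something confirmed in this thread.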

kwythers@...

Here are the results from ulimit. Disk was unlimited. I changed stack to unlimited.

wythersk@node1084 [~/cases/testcase/run] $ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191956
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 10000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
wythersk@node1084 [~/cases/testcase/run] $

 

However, same error on the USGS netCDF file:

 

 Opened existing file
 /home/reichpb/wythersk/cases/testcase/inputdata/atm/cam/inic/fv/cami_0001-01-01
 _4x5_L26_c060608.nc       65536
 Opened existing file
 /home/reichpb/wythersk/cases/testcase/inputdata/atm/cam/topo/USGS-gtopo30_4x5_r
 emap_c050520.nc      131072
forrtl: severe (174): SIGSEGV, segmentation fault occurred

 

Other ideas?

jedwards

Can you dump the file using ncdump?  

Check that the md5sum matches the expected value:

78bff47e307c5fb2395204c9f833a480  /glade/p/cesmdata/cseg/inputdata/atm/cam/topo/USGS-gtopo30_128x256_c050520.nc

You may also get more information by compiling with DEBUG=TRUE and setting core file size to a non-zero value.  

 

CESM Software Engineer
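
The two file checks suggested above can be sketched as a pair of small bash helpers. The names check_md5 and dump_header are made up for illustration. The only tools assumed are md5sum from coreutils and ncdump from the netCDF utilities, and the sketch reports it if ncdump is not on the PATH.

```shell
#!/usr/bin/env bash
# Compare a file's MD5 checksum with an expected value.
# Returns 0 on a match, 1 on a mismatch or an unreadable file.
# check_md5 and dump_header are hypothetical helper names.
check_md5() {
    local file=$1 expected=$2 actual
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK       $file"
    else
        echo "MISMATCH $file (got $actual, expected $expected)"
        return 1
    fi
}

# Dump only the header (dimensions, variables, attributes) of a netCDF file.
dump_header() {
    if command -v ncdump >/dev/null 2>&1; then
        ncdump -h "$1"
    else
        echo "ncdump not found on PATH" >&2
        return 1
    fi
}
```

A call would look like `check_md5 USGS-gtopo30_4x5_remap_c050520.nc <expected-md5>`, followed by `dump_header` on the same file. The expected value has to come from a trusted copy of the same file name; a checksum for a file at a different resolution will not match.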

kwythers@...

My md5sums look right:

md5sum USGS-gtopo30_4x5_remap_c050520.nc 

0a0b1d5f9403dd00eebc18c521f27234  USGS-gtopo30_4x5_remap_c050520.nc

 

Here is the dump file:

Attachment: 
jedwards

This all looks fine; you are going to need to dig deeper:

You may also get more information by compiling with DEBUG=TRUE and setting core file size to a non-zero value. 

CESM Software Engineer

kwythers@...

Can you confirm that you mean line 133 in env_run.xml, and that I should change value="0" to value="TRUE"? In addition, I'm not sure where the "core file size" option is changed to a "non-zero" value.

wythersk@node1082 [~/cases/testcase] $ grep -n DEBUG env_run.xml
133:<entry id="PIO_DEBUG_LEVEL"   value="0"  />

jedwards

Core file size is one of the limits in your environment; you printed it out a few posts ago. DEBUG is set in env_build.xml, not env_run.xml, and you should change the value using the xmlchange utility:

./xmlchange DEBUG=TRUE

 

 

CESM Software Engineer
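
Put together, the debug rebuild might look like the sketch below, assuming bash and a case named testcase as in the directory listing earlier in the thread. The helper name enable_core_dumps is made up for illustration, and the clean rebuild step is an assumption: a change to env_build.xml typically requires one, but check what the build script reports.

```shell
#!/usr/bin/env bash
# Sketch only: run from the case directory (e.g. ~/cases/testcase).

# Allow core files. Fall back to the hard limit if "unlimited" is refused.
# enable_core_dumps is a hypothetical helper name.
enable_core_dumps() {
    ulimit -S -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"
    echo "core file size (blocks): $(ulimit -S -c)"
}

enable_core_dumps

# The CESM steps only make sense inside a case directory, so guard them.
if [ -x ./xmlchange ] && [ -f env_build.xml ]; then
    ./xmlchange DEBUG=TRUE              # DEBUG lives in env_build.xml
    grep 'id="DEBUG"' env_build.xml     # confirm the new value
    ./testcase.clean_build              # assumption: a clean rebuild is needed
    ./testcase.build
else
    echo "not in a CESM case directory; skipping xmlchange and rebuild"
fi
```

The core file limit has to be in effect in the environment where the model actually runs, so for a batch job the ulimit line probably belongs in the run script as well.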

kwythers@...

Now (with DEBUG TRUE, and core file size set to 1024) I am having trouble with the build process. From:

 

more exedir/atm.bldlog.141007-095202

 

catastrophic error: **Internal compiler error: segmentation violation signal raised** Please report this error along with the circumstances in which it occurred in a Software Problem Report.  Note: File and line given may not be explicit cause of this error.
compilation aborted for /home/reichpb/wythersk/cesm/dev/1.2.1/models/atm/cam/src/dynamics/fv/sw_core.F90 (code 1)
gmake: *** [sw_core.o] Error 1
gmake: *** Waiting for unfinished jobs....

 

wythersk@node1084 [~/cases/f45g37_B1850CN] $ 

 

Any chance this is related to my original issue?

jedwards

You failed to report what compiler you are using and you failed to update to cesm1.2.2 as requested.  

Please update to 1.2.2 and let us know what compiler you are using.

CESM Software Engineer
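
One way to collect the compiler information requested above is sketched below, assuming bash. The function name report_compilers and the list of candidate compiler commands are assumptions; edit the list to match what is installed at your site. MPI wrapper commands usually pass --version through to the underlying compiler, but that depends on the MPI installation.

```shell
#!/usr/bin/env bash
# Print the version banner of each Fortran or MPI compiler found on PATH.
# report_compilers and the candidate list are hypothetical; adjust as needed.
report_compilers() {
    local found=0 c
    for c in ifort mpif90 mpiifort gfortran pgf90; do
        if command -v "$c" >/dev/null 2>&1; then
            echo "== $c: $(command -v "$c")"
            "$c" --version 2>&1 | head -n 2
            found=1
        fi
    done
    [ "$found" -eq 1 ] || echo "no known Fortran compiler found on PATH"
}

report_compilers
```

Pasting this output into a reply, together with the CESM version in use, gives the developers what they asked for.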
