
Segmentation fault when writing history tape

ginah@udel_edu

New Member
Hi all,

I am running CAM3.1 in parallel on an IBM and have hit a problem I can't get past. The executable builds and the model runs, but when it comes to writing the first history tape I get a segmentation fault. Has anyone had this problem when running the model in parallel? Also, does anyone have example namelist input files and run scripts designed for LoadLeveler, other than the one supplied in the distribution package?

Any help on this matter would be greatly appreciated,
Thanks, Gina.

My error looks like:

nstep, te 2158 3338227904.71270752 -6.74207538624604563 0.672977540751995540E-03 98460.2334231531422
NSTEP = 2158 8.869766793863316E-05 7.482799460988471E-06 252.719 9.84602E+04 2.429279527729273E+01 0.79 0.23
nstep, te 2159 3338189144.76121378 -6.99457323074340831 0.698181354071695441E-03 98460.2202542149898

INICFILE: Writing clm initial conditions dataset at ./camrun.test3.clm2.i.0000-10-01-00000.nc at nstep = 2159

(PUTFIL): Issuing shell cmd:(mswrite -t 365 ./camrun.test3.clm2.i.0000-10-01-00000.nc /GINAH/csm/camrun.test3/lnd/init/camrun.test3.clm2.i.0000-10-01-00000.nc && /bin/rm ./camrun.test3.clm2.i.0000-10-01-00000.nc )&
sh: mswrite: not found
NSTEP = 2159 8.869911583397781E-05 7.484388238825524E-06 252.715 9.84602E+04 2.429339648030512E+01 0.79 0.23
nstep, te 2160 3338151705.11139965 -7.27428056081136098 0.726101031155778525E-03 98460.2244422104850
ERROR: 0031-250 task 2: Segmentation fault
ERROR: 0031-250 task 0: Segmentation fault
ERROR: 0031-250 task 1: Segmentation fault
ERROR: 0031-250 task 3: Segmentation fault
CAM run failed
 

ginah@udel_edu

New Member
Hi all,

I have been testing different run configurations. I can write out history tapes fine, but as soon as any of my runs gets to one month, or timestep 2232, it crashes with a segmentation fault. I am baffled and don't know what the problem is.
 

raeder

Member
What's the environment where you're having this segmentation fault problem?

I'm also getting segmentation faults on the new IBM at NCAR.
When they happen seems to depend on the optimization I've
chosen during the compilation. I'm only doing 6 hour runs, but
-O2 -qstrict -Q leads to segfault on the first timestep
-O3 leads to segfault after the last timestep and
debug mode (-O0 -qinitauto=FF911299 -qflttrap=ov:zero:inv:en)
allows it to finish correctly (so there's nothing to debug!)

Kevin Raeder
497-1307
 

olson

Member
Which month are you starting the run in? Are you asking the model to dump monthly
initial conditions files? The reason I ask is that there is a compiler bug on bluefire and bluevista
where the model seg faults at the point of writing information to build an initial conditions
file. A patch was put into cam3_5_45 to work around this problem. Alternatively, you can
put "inithist = 'none'" in the cam namelist and the model will not attempt to dump
any initial files. The default behavior of CAM is to dump an initial file on Jan 1 of each model year.
 

olson

Member
It's also been noted that if you're running the model in message passing mode only (not the hybrid
MPI/OpenMP mode), the model seg faults in the data ocean model code. You might want to try
hybrid mode if that's possible for you.
 
Hi Jerry,

I think I'm having the problem that you mentioned. I'm running CAM3.1 with the SOM on bluevista. I submitted a one-year run, and it wrote the monthly restart and history files, but after writing the clm initial conditions file it integrated two time steps and then quit. Here is the end of my output file:

NSTEP = 26278 8.962900357007745E-05 8.485867875905686E-06 250.607 9.84350E+04 2.172000460379315E+01 0.86 0.33
nstep, te 26279 3301593435.27761412 -2.01705199956893910 0.201388966477155675E-03 98434.9891228663328

INICFILE: Writing clm initial conditions dataset at ./b42.lmaxp.nodust.clm2.i.0001-01-01-00000.nc at nstep = 26279

(PUTFIL): Issuing shell cmd:(mswrite -t 365 ./b42.lmaxp.nodust.clm2.i.0001-01-01-00000.nc /LISA/csm/b42.lmaxp.nodust/lnd/init/b42.lmaxp.nodust.clm2.i.0001-01-01-00000.nc && /bin/rm ./b42.lmaxp.nodust.clm2.i.0001-01-01-00000.nc )&
NSTEP = 26279 8.963032773243502E-05 8.481723635321414E-06 250.606 9.84350E+04 2.171675127452699E+01 0.86 0.33
nstep, te 26280 3301576176.38610888 -2.40986280540625275 0.240608536461071703E-03 98434.9587973627058
ERROR: 0031-250 task 4: Segmentation fault
ERROR: 0031-250 task 5: Segmentation fault
ERROR: 0031-250 task 0: Segmentation fault
ERROR: 0031-250 task 1: Segmentation fault
ERROR: 0031-250 task 7: Segmentation fault
ERROR: 0031-250 task 6: Segmentation fault
ERROR: 0031-250 task 2: Segmentation fault
ERROR: 0031-250 task 3: Segmentation fault
Job /usr/local/lsf/7.0/aix5-64/bin/poejob -euidevice sn_all -euilib us /ptmp/lisa/b42.lmaxp.nodust/bld/cam

TID HOST_NAME COMMAND_LINE STATUS TERMINATION_TIME
===== ========== ================ ======================= ===================
00006 bv0303en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53
00007 bv0303en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53
00004 bv1005en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53
00005 bv1005en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53
00002 bv0804en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53
00003 bv0804en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53
00000 bv0306en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53
00001 bv0306en.u /ptmp/lisa/b42.l Exit (139) 07/30/2008 10:20:53

******************************************************************************************************
Any idea what is going on? Do you need more information?

Thanks,
Lisa
 

olson

Member
Lisa,

Try

inithist = 'none'

in the CAMEXP portion of the namelist. That will prevent the production of atm initial condition
files.
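
For anyone unsure where the setting goes, a minimal sketch of the namelist group might look like the fragment below. Only the inithist line comes from this thread. The group name is taken from the CAMEXP reference above, and the comment line stands in for whatever settings your own namelist already has, which are assumed to stay unchanged.

```
&camexp
 ! ... your existing camexp settings stay as they are ...
 inithist = 'none'   ! from this thread: do not write atm initial condition files
/
```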
 

ginah@udel_edu

New Member
Hi Jerry and Lisa,

Thanks for the information. Setting inithist='none' fixed my problem.

On another note, could you tell me what processor settings you use when running CAM-SOM on your IBM? I am using a similar system with the LoadLeveler queueing system, and would be interested in what you have found to be optimal without being so large that the job stays in the queue forever.

What values do you normally assign to node, total_tasks, MP_NODES, MP_TASKS_PER_NODE and OMP_NUM_THREADS?

If you have a sample runscript that you normally use I would really appreciate a look at it.

Thanks, Gina.
 

rneale

Rich Neale
CAM Project Scientist
Staff member
For running on bluefire now you can use
setenv OMP_NUM_THREADS 4
and in the BSUB commands
#BSUB -n 32 # number of MPI tasks
#BSUB -R "span[ptile=16]" # max tasks per node
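
To see what these settings add up to, here is a back-of-the-envelope sketch in plain shell arithmetic. The three numbers are the ones quoted above. The variable names are made up for illustration, and the node count assumes the scheduler fills each node up to the ptile limit, which it is not obliged to do.

```shell
#!/bin/sh
# Values quoted in the post above.
ntasks=32        # from "#BSUB -n 32" (number of MPI tasks)
ptile=16         # from span[ptile=16] (max tasks per node)
omp_threads=4    # from "setenv OMP_NUM_THREADS 4"

# Nodes needed if every node is filled to the ptile limit (rounded up).
nodes=$(( (ntasks + ptile - 1) / ptile ))

# OpenMP threads on one full node, and across the whole job.
threads_per_node=$(( ptile * omp_threads ))
total_threads=$(( ntasks * omp_threads ))

echo "nodes=$nodes threads_per_node=$threads_per_node total_threads=$total_threads"
```

With the values above this prints nodes=2 threads_per_node=64 total_threads=128, i.e. 2 nodes, each carrying 16 MPI tasks with 4 OpenMP threads apiece.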
 