
CAM3.1.p1 T85 run killed

Hello,

I just downloaded the CAM3.1.p1 source code, the T85 dataset, and the datasets for all resolutions. I built the model successfully on my Linux machine and set it up to run for a few years starting in 1996; however, it crashes in the same place every time, before the timesteps even begin. I get only the message "killed" on my command line, and the model output file is simply cut off mid-write with no error message. It ends like this:
-----------------------------------------------------
Successfully initialized variables for accumulation

Attempting to initialize time variant variables
Reading initial data from initial dataset
(GETFIL): attempting to find local file clmi_0000-09-01_128x256_c040422.nc
(GETFIL): using
/extra/c
-----------------------------------------------------
My namelist contains the following:

&camexp
absems_data = '/extra/ccm33/home/CAM3.1/inputdata_T85/atm/cam2/rad/abs_ems_factors_fastvx.c030508.nc'
aeroptics = '/extra/ccm33/home/CAM3.1/inputdata_T85/atm/cam2/rad/AerosolOptics_c040105.nc'
bnd_topo = '/extra/ccm33/home/CAM3.1/inputdata_T85/atm/cam2/topo/topo-from-cami_0000-09-01_128x256_L26_c040422.nc'
bndtvaer = '/extra/ccm33/home/CAM3.1/inputdata_T85/atm/cam2/rad/AerosolMass_V_128x256_clim_c031022.nc'
bndtvo = '/extra/ccm33/home/CAM3.1/inputdata_T85/atm/cam2/ozone/pcmdio3.r8.64x1_L60_clim_c970515.nc'
bndtvs = '/extra/ccm33/home/CAM3.1/inputdata_T85/atm/cam2/sst/sst_HadOIBl_bc_128x256_1949_2001_c020812.nc'
caseid = 'camrun_3.1_T85_elnino'
iyear_ad = 1950
ncdata = '/extra/ccm33/home/CAM3.1/inputdata_T85/atm/cam2/inic/gaus/cami_0000-09-01_128x256_L26_c040422.nc'
nelapse = -1100
nsrest = 0
start_ymd = 19960601
sstcyc = .false.

FEXCL1 = 'CMFDQ', 'CMFDQR', 'CMFDT', 'CMFMC', 'DCQ', 'DTCOND', 'DTH', 'FICE', 'FLNSOI', 'GCLDLWP', 'ICEFRAC', 'ICLDIWP', 'ICLDLWP', 'LANDFRAC', 'LHFLXOI', 'OMEGAT', 'PRECSC', 'PRECSL', 'QC', 'QFLX', 'SFCLDICE', 'SFCLDLIQ', 'SFQ', 'SHFLXOI', 'SNOWHICE', 'SNOWHLND', 'SOLIN', 'TGCLDIWP', 'TGCLDLWP', 'TREFHT', 'VD01', 'VV', 'UU'

FINCL1 = 'PRECTMX', 'PRECLFRQ', 'PRECCINT', 'PRECCav', 'OMEGA500', 'CLDST'

/
&clmexp
finidat = '/extra/ccm33/home/CAM3.1/inputdata_T85/lnd/clm2/inidata_2.1/cam/clmi_0000-09-01_128x256_c040422.nc'
fpftcon = '/extra/ccm33/home/CAM3.1/inputdata_T85/lnd/clm2/pftdata/pft-physiology'
fsurdat = '/extra/ccm33/home/CAM3.1/inputdata_T85/lnd/clm2/srfdata/cam/clms_128x256_c031031.nc'
/
---------------------------------------------------------------

Any thoughts? Do I need to do anything special for a T85 run that you don't have to do for a T42 run? CAM3.0.p1 at T42 resolution worked fine on my machine, so I'm not sure what could have gone wrong with the higher resolution.

Thanks,
Cathy
 

eaton

CSEG and Liaisons
There's a good chance this is a "not enough memory" problem. T85 requires 4x the memory of T42. Start by making sure your processes have access to all available stack, with a command like "ulimit -s unlimited".
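As a minimal sketch (assuming a bash-like login shell on a Linux cluster; the executable name "cam" and namelist file "namelist.in" are placeholders for whatever your build produced), the steps would look something like this:
-----------------------------------------------------
# check the current limits; "stack size" should end up unlimited
ulimit -a

# remove the soft stack-size limit for this shell and its child processes
ulimit -s unlimited
# (csh/tcsh equivalent: limit stacksize unlimited)

# launch the model from the same shell so the limit is inherited
./cam < namelist.in > cam.log 2>&1

# if the run is still "Killed" with no error message, the Linux
# out-of-memory killer usually leaves a note in the kernel log
dmesg | grep -i -e killed -e "out of memory"
-----------------------------------------------------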
 
How much memory does CAM T42 require? Perhaps the cluster here does not have enough memory to run the T85 resolution. I already made sure stacksize, etc. were set to unlimited.

Thanks,
Cathy
 

eaton

CSEG and Liaisons
Here are some numbers for the total memory requirements of CAM3.1 (Eulerian dynamics, T85) running on an IBM POWER4 cluster. The performance tool I used doesn't give separate numbers for stack size, but these should at least provide a rough idea of the memory required.
The other point they illustrate is that if your cluster nodes don't have enough memory when running with, say, only 2 MPI tasks, you can reduce the per-process memory requirement by running with 4 MPI tasks. Another strategy for reducing the memory requirement on a single node is to assign only 1 task per node, even if the node has 2 CPUs. In pure MPI mode this means one processor on the node sits idle, but if it avoids oversubscribing the memory, which can lead to performance-destroying swapping, it may be worth it (see the example command after the table).

# MPI processes    Master process size (MB)    Slave process size (MB)
---------------    ------------------------    -----------------------
        2                   1390                        1130
        4                    911                         600
        8                    670                         336
       16                    550                         202
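As a rough sketch of the one-task-per-node approach (MPICH-style mpirun options assumed; the node names, task count, and the "cam"/"namelist.in" file names are placeholders):
-----------------------------------------------------
# machines file that lists each 2-cpu node only once, so MPI places
# a single CAM process per node (node names are placeholders):
#   node01
#   node02
#   node03
#   node04

# start 4 MPI tasks, one per node; each task then has the whole
# node's memory to itself
mpirun -np 4 -machinefile ./machines ./cam < namelist.in > cam.log 2>&1
-----------------------------------------------------
Combined with the table above, this lets you choose a task count whose per-process size fits within the physical memory of your nodes.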
 