
A problem after submitting the job

Hi, we have been using CESM1_1_2 for simulation runs, but we have run into a problem. The case built successfully, but when we submitted the job it stopped in less than one minute (during initialization). We hope you can help. The related files look like this.

The end of the ccsm.log file:

Opened existing file
/home/mak/input/share/domains/domain.lnd.fv1.9x2.5_gx1v6.090206.nc
51
Opened existing file
/home/mak/input/share/domains/domain.lnd.fv1.9x2.5_gx1v6.090206.nc
51
Opened existing file
/home/mak/input/lnd/clm2/surfdata/surfdata_1.9x2.5_simyr2000_c091005.nc
51
Opened existing file
/home/mak/input/lnd/clm2/pftdata/pft-physiology.c110425.nc
51
Opened existing file
/home/mak/input/lnd/clm2/surfdata/surfdata_1.9x2.5_simyr2000_c091005.nc
51
rank 0 in job 1 node23.local_38486 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

The end of the cpl.log file:

(seq_mct_drv) : Initialize each component: atm, lnd, rof, ocn, ice, and glc
(seq_mct_drv) : Initialize atm component ATM
(seq_mct_drv) : Initialize lnd component LND

The end of the lnd.log file:

Attempting to read global land mask from
/home/mak/input/share/domains/domain.lnd.fv1.9x2.5_gx1v6.090206.nc
(GETFIL): attempting to find local file domain.lnd.fv1.9x2.5_gx1v6.090206.nc
(GETFIL): using
/home/mak/input/share/domains/domain.lnd.fv1.9x2.5_gx1v6.090206.nc
lat/lon grid flag (isgrid2d) is T
ncd_inqvid: variable LANDMASK is not on dataset
decomp precompute numg,nclumps,seglen1,avg_seglen,nsegspc= 5663
8 F 35.39375 20.00000
Surface Grid Characteristics
longitude points = 144
latitude points = 96
total number of land gridcells = 5663
Decomposition Characteristics
clumps per process = 1
gsMap Characteristics
lnd gsmap glo num of segs = 485

Attempting to read ldomain from
/home/mak/input/share/domains/domain.lnd.fv1.9x2.5_gx1v6.090206.nc

We couldn't figure out what is going wrong here; we hope you can help.
 

santos

Member
The phrase "killed by signal 9" usually means that the job was killed by the batch system, or a daemon or administrator on the system. Check with the administrators of your machine to see if they know why this job was killed.
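One common source of an external SIGKILL on clusters is the kernel's out-of-memory (OOM) killer. If you can log into the compute node named in the log (node23 here), the kernel log often records such kills; this is only a sketch, and the exact log locations and your read permissions depend on the system:

```shell
# Check the kernel ring buffer for OOM-killer activity (may require root)
dmesg | grep -i -E "out of memory|killed process"

# On systems that keep persistent kernel logs, the same messages may be here
grep -i "oom" /var/log/messages 2>/dev/null | tail
```

If nothing turns up, the batch system's own accounting logs (ask your administrator where they live) are the next place to look.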
 
Thanks for your reply. The administrator didn't do anything to kill this job, so it was probably killed automatically by the batch system or a daemon. How can I get more information about that? We still don't know what is wrong in our case settings. Hope you can help.
 

santos

Member
The only thing I can see from the log is that the job was killed. The job might have been killed if you went over one of your job's limits, but that's not likely if it only ran for a moment. It could have been a random system error, but I'm guessing that you tried re-running it and it didn't work.

Sometimes a job gets killed because of excessive resource use, so you might want to check that it has enough memory (e.g. that you used enough nodes for the specific compset and resolution you are using).

It's also possible for the compiler's runtime to send that signal to itself (though it's a bit odd to send SIGKILL to itself). You can try running with DEBUG set to TRUE (this is in env_build.xml).

That's about all I can think of. Your system's administrator is the one to ask for information about your particular machine (e.g. the batch system, debugging tools installed there, tools for monitoring resource usage, and so on).
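In CESM1-era scripts, the DEBUG setting can be changed from the case directory with the xmlchange utility and then the case rebuilt; a minimal sketch, where "mycase" stands in for your actual case name and the exact invocation may differ slightly between script versions:

```shell
# In the case root: turn on debug compilation
./xmlchange -file env_build.xml -id DEBUG -val TRUE

# Remove the old optimized build, then rebuild with debug flags
./mycase.clean_build
./mycase.build

# Resubmit the job
./mycase.submit
```

A full clean build is needed because DEBUG changes the compiler flags for everything that was already compiled.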
 
Thanks a lot for your patient reply. I tried what you mentioned, setting DEBUG to TRUE, and then it worked. It built successfully and started running (for more than 3 hours now), but it is running very slowly for a 5-day simulation. It has produced some output files (mostly component log files), and some .nc files are being written, but still very slowly. Now that it has some output files, where can I look for the errors (I mean the errors that occurred when I wasn't using debugging)? I want to submit it without DEBUG on to make it faster. Thanks a lot for your help.
 

jedwards

CSEG and Liaisons
Staff member
You can turn DEBUG back off and edit the Macros file to reduce the optimization level in FFLAGS. It seems like your problem is occurring in lnd initialization, so it may be possible to limit the change to the lnd model.
 
Thanks for your advice. I didn't add any flags to FFLAGS when configuring the case, and my FFLAGS are:

FFLAGS := -i4 -gopt -Mlist -time -Mextend -byteswapio -Mflushz -Kieee

What kind of optimization should I reduce in this case? Thanks for your swift reply.
 
Really sorry to bother you again. I tried some settings to remove all of the optimizations from FFLAGS, but it failed the same way. I really can't figure out what is wrong in this case. All of my trials stop with the end of the ccsm.log file looking like this:

rank 13 in job 1 node22.local_33207 caused collective abort of all ranks
exit status of rank 13: killed by signal 9

Sometimes it ends with signal 11. And you are right that the case stops when initializing the lnd component. Hope you can help.
 

jedwards

CSEG and Liaisons
Staff member
Look at the difference in your build logs between the flags used to build clm with DEBUG true and false. You can use

ifeq ($(MODEL), clm)
  FFLAGS =
endif

in the Macros file to change the flags for only the land model.
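As a concrete sketch of that conditional: to build only clm without optimization, a fragment along these lines could go in Macros. The flag list here is just the poster's existing PGI FFLAGS with -O0 appended to disable optimization (PGI applies some optimization by default unless -O0 is given); this is an illustration, not the exact flags any given CESM port uses:

```makefile
# Hypothetical Macros override: unoptimized flags for the land model only.
# The CESM build system sets MODEL per component, so other components
# keep their normal FFLAGS.
ifeq ($(MODEL), clm)
  FFLAGS := -i4 -gopt -Mlist -time -Mextend -byteswapio -Mflushz -Kieee -O0
endif
```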
 
Thanks so much, it worked. I compared the differences between the two logs; the debug build log was missing the flags you mentioned. I did what you suggested, and it ran successfully. But by the way, will it affect the scientific outcome of the experiments? Thanks so much.
 

jedwards

CSEG and Liaisons
Staff member
> But by the way, will it affect the scientific outcome of the experiments?

A number of factors will affect the results of a simulation. The hardware, the compiler vendor and optimization level, and other lower-level software can all influence the results. Usually we expect that none of these factors will change the general climate of the model results.
 