Running CCSM3 in the K configuration

I am trying to set up a CCSM3 run in the K configuration on blackforest. As soon as I
change from the default datasets, the run bombs for no apparent reason (i.e., there is no official "error" message). All of
the components stop in a shared subroutine, shr_sys_flush, which is in
shr_sys_mod.F90. Every component calls this subroutine, which executes "call
flush_(unit)". I cannot find "flush" anywhere in the code, so I assume it is some
sort of machine function.

All the components stop with the following:

D1: In pm_child_sig_handler, signal=15, task=9

The signal and the task number vary, but the rest is identical. Any
idea what this means?

Thanks for any help or suggestions you can offer.


Jake Sewall
 

weiyu

New Member
Blackforest is a supported machine, so everything should work without any changes. You can run a short test to confirm this before you change anything.
This test runs for 10 days, writes a restart file at day 5, and then does a 5-day restart run. The test result is shown in the "TestStatus" file.

The steps for a short test are as follows:
1) In the scripts directory, run:
./create_test -testname TER.01a.T42_gx1v3.K.blackforest
(You can change the resolution to suit your requirements.)
2) This creates a test case directory for TER.01a......
Go to that directory and run "*.build" first to build the model interactively.
After the build finishes, submit the job "*.test" with the "llsubmit" command.
("*.test" can also build the model in batch mode if you did not build it interactively first.)
3) After the model finishes running, look at TestStatus.
If it shows "PASS", everything should be fine.
4) If you change anything, repeat steps 1 to 3.
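
The check in step 3 can be scripted. Here is a minimal Python sketch, assuming TestStatus is a small text file in which the word PASS appears on success. The exact layout of the file is an assumption, and both function names are hypothetical helpers, not part of the CCSM scripts.

```python
from pathlib import Path


def status_is_pass(status_text: str) -> bool:
    """Return True if any line of the status text contains the word PASS."""
    return any("PASS" in line.split() for line in status_text.splitlines())


def check_test_status(path: str = "TestStatus") -> bool:
    """Read the status file (the default path is an assumption) and report the result."""
    status_file = Path(path)
    if not status_file.exists():
        # No status file: the test has not finished, or it failed before writing one.
        return False
    return status_is_pass(status_file.read_text())
```

Calling check_test_status() from the test case directory returns False both when the file is missing and when it holds anything other than a PASS line, so a failed comparison step is not mistaken for success.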

Please let me know if you have any questions.

Thanks.

Wei.
 
Wei-

TestStatus.out does not return PASS; however, it also does not return any useful error. The entire contents of TestStatus.out are:

Restart Test log is
Initial Test log is
Usage: restart_compare.pl file1 file2

The model bombs with the error I was receiving previously, then tries to restart and bombs again because no restart files are present.

If you have any other ideas as to what the original "error" message or the contents of TestStatus indicate, please let me know.

Thanks.

Jake Sewall
 

weiyu

New Member
Hi, Jake,

Did you test it with the default initial data or with your own data set?
If you tested with the default data set, I would have to reproduce the error and fix it. If you tested with your own data set, I could take a look at your log files and see whether I can find the problem quickly. But we are not going to debug the problem.

Anyway, you need to test first without any changes. If that works, you can try your own data set. For your own data set, you can point me to your run scripts. I have an account on blackforest, so I can take a look.

By the way, which version of the code are you using? Is it the ccsm3.0 release version?
What resolution did you use for the test?

Thanks.

Wei.
 
Wei-

I used my own data. I don't need to test with the default data; that runs just fine. The problem I have is that when I switch to my datasets, the model doesn't run.
That in itself is not so much a problem as the fact that it doesn't return an error message that I can understand. The end of the stdout file tells me to see the coupler log. The coupler log is empty.

Every single component ends with the line:

D1: In pm_child_sig_handler, signal=15, task=9

This does not help me debug. I have saturated the code with print statements, and every component stops in the same shared subroutine. That routine does not have a problem. Consequently, in order to debug further, I think I need to know what:

D1: In pm_child_sig_handler, signal=15, task=9

means. I am sorry if that was not clear from my original post. I thought that CCSM folks might know what this error is. If you do not know, do you know who might? Maybe it is a machine code and SCD would know more?

I am using beta22.

Thanks.

Jake
 

weiyu

New Member
Hi, Jake,

D1: In pm_child_sig_handler, signal=15, task=9

task=9 means that this message is from MPI task 9. You can look at your poe.cmd file to find out which component has rank 9.
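
To make the lookup concrete, here is a minimal Python sketch. It assumes poe.cmd lists one executable per line in task order, so that the first line is task 0. Both that layout and the sample contents below are assumptions for illustration; check them against your own poe.cmd.

```python
def component_for_task(poe_cmd_text: str, task: int) -> str:
    """Return the poe.cmd line for an MPI task, assuming one line per task, rank 0 first."""
    lines = [ln.strip() for ln in poe_cmd_text.splitlines() if ln.strip()]
    if not 0 <= task < len(lines):
        raise IndexError(f"task {task} is outside the {len(lines)} tasks listed")
    return lines[task]


# Hypothetical poe.cmd contents, for illustration only (not a real CCSM layout).
sample = "\n".join(["cpl"] * 2 + ["ice"] * 2 + ["lnd"] * 2 + ["ocn"] * 2 + ["atm"] * 8)
print(component_for_task(sample, 9))
```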

By the way, would you please point me to your run scripts? I could take a quick look.

Thanks.

Wei
 

murphys

Member
Jake,

CCSM does not have the best error messages, that is for sure. It is something that the CSEG group would like to improve as part of its model unification project.

Basically, that error message indicates that the model is dying in task 9, which
is one of the MPI tasks. The rest of that message is IBM-speak, and even the
consultants down in SCD would have to look up in the manual what signal 15 means, etc.
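
For reference, signal numbers can be looked up without the manual. On POSIX systems, signal 15 is SIGTERM (a termination request) and signal 11 is SIGSEGV (a segmentation fault), so a task reporting signal=15 was typically shut down from outside rather than crashing itself. A small Python sketch of the lookup follows; the helper name is hypothetical, and the numbers are resolved as defined on the machine where the script runs.

```python
import signal


def signal_name(number: int) -> str:
    """Translate a signal number into its symbolic name on this machine."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


for number in (11, 15):
    print(number, signal_name(number))
```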

Can you provide Wei with the exact location of your scripts on blackforest? Since
the model runs with our data but not with your data, it is probably something you have done to the scripts. He is going to have to take a closer look.

sylvia
 
Wei-

My scripts are in:

/home/blackforest/jsewall/ccsm3_0_beta22/scripts/casek2

I have changed almost nothing in them:

Project #
queue name
Data file name

The default data files for the ocean model are hardwired someplace I couldn't find, so in addition to changing the information in the prestage build scripts, I had to hardwire my file names in the code.

As the only difference between a case that runs and a case that doesn't is my data, there is obviously a problem with my data. I had hoped that the error message might shed some light on what was wrong with it.

The conclusion I have reached is that my SST data exist only over the ocean, whereas the default files have data everywhere. I think the interpolation routines must have a problem with this, and the land/sea masks aren't matching up the way they should.

I am testing this (which is why I need to merge two files, Sylvia) and will see if I am correct. If I am, I will have to see whether I can live with the results I get from making up random SST data over the continents, or whether I need to try to alter the model to deal with an input SST dataset vs. an input air temperature dataset.
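
One way to test this hypothesis before rerunning the model is to look for points where the SST file has no data but the mask still expects ocean. Here is a minimal Python sketch using plain nested lists. The fill value, the tiny grids, and the function names are all hypothetical, and a real check would read the fields from the actual data files.

```python
FILL = -999.0  # hypothetical missing-data flag; use the value from your own file


def missing_points(sst, fill=FILL):
    """Return (row, col) indices where the SST field carries the fill value."""
    return [
        (j, i)
        for j, row in enumerate(sst)
        for i, value in enumerate(row)
        if value == fill
    ]


def unmasked_gaps(sst, ocean_mask, fill=FILL):
    """Return missing SST points that the mask still treats as ocean (mask == 1)."""
    return [(j, i) for (j, i) in missing_points(sst, fill) if ocean_mask[j][i] == 1]


# Toy 2x3 example: 1 marks ocean in the mask, 0 marks land.
sst = [[28.1, FILL, 27.5], [FILL, 26.9, 26.2]]
ocean_mask = [[1, 0, 1], [1, 1, 1]]
print(unmasked_gaps(sst, ocean_mask))
```

Any point in the result is a place where the mask calls for ocean data but the file has none, which is the kind of mismatch described above.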

Jake
 

murphys

Member
Hi Jake,

We took a look at the poe.stderr.* file in your script directory. This is where you need to debug. If you look for the first line that contains a capital ERR and look above it, you will usually see an error message.
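
That search can be sketched in a few lines of Python. The function name, the number of context lines, and the sample log lines are hypothetical; only the idea of finding the first line containing ERR and reading upward comes from the advice above.

```python
def first_error_with_context(lines, marker="ERR", before=5):
    """Return the lines leading up to, and including, the first line containing marker."""
    for index, line in enumerate(lines):
        if marker in line:
            return lines[max(0, index - before): index + 1]
    return []


# Made-up log lines, for illustration only.
log = [
    "normal output line 1",
    "normal output line 2",
    "some explanatory message from the system",
    "ERROR: something failed",
    "ERROR: a later, secondary failure",
]
for line in first_error_with_context(log, before=2):
    print(line)
```

For a real file, read it with open(path).read().splitlines() and pass the resulting list in.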

This is the message and the file we viewed:
"A file or directory in the path name does not exist."
in
/home/blackforest/jsewall/ccsm3_0_beta22/scripts/
TER.01a.T42_gx1v3.K.blackforest.110314/poe.stderr.20995.0

You are not specifying the data correctly in the scripts.

Here is a quote from page 33 of the user's manual that may help you out here:

"An empty input data root directory tree is also provided as a future place holder for custom user-generated input datasets. This is set in the env_mach.$MACH file via the environment variable $DIN_LOC_ROOT_USER. If the user wishes to use any user-modified input datasets in place of the officially released version, these should be placed in the appropriate subdirectory of $DIN_LOC_ROOT_USER."

Check your env_mach.$MACH for this, see what you are setting, and make sure your data is in the right place to begin with.
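
A quick pre-flight check along these lines can catch a missing file before the job is submitted. This is a minimal Python sketch: the helper name, the fallback directory, and the relative file names are hypothetical, and only the $DIN_LOC_ROOT_USER variable comes from the manual.

```python
import os


def missing_inputs(root, relative_paths):
    """Return the dataset paths under root that do not exist on disk."""
    root = root.rstrip("/") or "/"  # tolerate a trailing slash on the root directory
    full_paths = [os.path.join(root, rel.lstrip("/")) for rel in relative_paths]
    return [path for path in full_paths if not os.path.exists(path)]


# Hypothetical fallback directory and file names, for illustration only.
root = os.environ.get("DIN_LOC_ROOT_USER", "/path/to/user/inputdata")
for path in missing_inputs(root, ["ocn/my_sst.nc", "atm/my_initial.nc"]):
    print("missing:", path)
```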

sylvia
 
Sylvia-

The first line containing ERROR in the file:

/home/blackforest/jsewall/ccsm3_0_beta22/scripts/
TER.01a.T42_gx1v3.K.blackforest.110314/poe.stderr.20995.0

Is the following:

D3: Message type 1 from source 15
ERROR: 0031-250 task 15: Segmentation fault

The next one says:

ERROR: 0031-250 task 5: Terminated

then

ERROR: 0031-250 task 2: Terminated

then

ERROR: 0031-250 task 1: Terminated

etc. etc. for each task.
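
In a listing like this, the tasks reported as "Terminated" are usually casualties of the first failure, and the task with a different reason (here, the segmentation fault in task 15) is the place to start. Here is a minimal Python sketch that filters them out, assuming every line follows the "ERROR: 0031-250 task N: reason" pattern seen above; the function name is hypothetical.

```python
import re

# Pattern taken from the stderr excerpt above; other message codes are not handled.
TASK_ERROR = re.compile(r"ERROR: 0031-250 task (\d+): (.+)")


def root_cause_tasks(lines):
    """Return (task, reason) pairs whose reason is not a plain 'Terminated'."""
    causes = []
    for line in lines:
        match = TASK_ERROR.search(line)
        if match and match.group(2).strip() != "Terminated":
            causes.append((int(match.group(1)), match.group(2).strip()))
    return causes


stderr_lines = [
    "D3: Message type 1 from source 15",
    "ERROR: 0031-250 task 15: Segmentation fault",
    "ERROR: 0031-250 task 5: Terminated",
    "ERROR: 0031-250 task 2: Terminated",
    "ERROR: 0031-250 task 1: Terminated",
]
print(root_cause_tasks(stderr_lines))  # [(15, 'Segmentation fault')]
```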

I can find the line you refer to, but it is much later in the stderr file. In this case, the problem was a trailing / that I left on a directory name.

This is not, however, the problem with my original case (not the test one), which finds the data without difficulty (the appropriate datasets appear in /ptmp). This is most likely because I have full path names hardwired in the code and namelists. I have updated env.$machine and presume it will continue to find the datasets.

I have fixed the problem in the test case and am running it again.

Thanks

Jake
 

weiyu

New Member
Hi, Jake,

I looked at your scripts and the error messages in the casek2 directory.
Here are my observations:
1) The error message shows that the atm component has a segmentation fault.
2) There is a message about cp domain.gx1v3_010723.nc: permission not allowed. It looks like either you are trying to overwrite this file but it does not have write permission, or the file does not exist.
3) Set up a 2-day startup run with "env_run", and put in your datasets one by one, so that you know which dataset causes the problem.
4) At this point, I think you do not need to run the restart test, since it failed in the startup run. A 2-day startup run is enough.
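
The one-by-one approach in step 3 can be sketched as a small Python loop. Everything here is hypothetical: the dataset roles and file names are invented, and run_ok stands in for configuring a 2-day startup run with the given files, submitting it, and checking whether it completed.

```python
def find_bad_datasets(default_files, custom_files, run_ok):
    """Swap in one custom dataset at a time and report which swaps make the run fail.

    default_files, custom_files: dicts mapping a dataset role to a file name.
    run_ok: callable that takes a dict of files and returns True if the run succeeds.
    """
    failures = []
    for role, custom in custom_files.items():
        trial = dict(default_files)
        trial[role] = custom  # change exactly one input relative to the working case
        if not run_ok(trial):
            failures.append(role)
    return failures


# Illustration with a fake run that fails whenever the custom SST file is used.
defaults = {"sst": "default_sst.nc", "domain": "default_domain.nc"}
customs = {"sst": "my_sst.nc", "domain": "my_domain.nc"}
print(find_bad_datasets(defaults, customs, lambda files: files["sst"] != "my_sst.nc"))
```

Because each trial differs from the working default case by a single file, a failure points directly at that file.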


Wei.
 