
CLM50%SP to CLM50%BGC on Derecho

skramer

Sydney Kramer
New Member
I have been able to successfully build and submit a case using the F2000climo compset with clm5.0:Satellite phenology but am now running into issues trying to run an identical case just with clm5.0:BGC (vert. resol. CN and methane) with prognostic crop.
In other words:

I ran one experiment successfully with:
./create_newcase -case /glade/u/home/skramer/casefiles/2000_docn_ACblob2 -compset=2000_CAM60_CLM50%SP_CICE%PRES_DOCN%DOM_MOSART_CISM2%NOEVOLVE_SWAV -res=f09_f09_mg17 -mach=derecho -project=######## --run-unsupported

but am now running into trouble running the experiment with:
./create_newcase -case /glade/u/home/skramer/casefiles/2000_docn_ACblob2_BGC -compset=2000_CAM60_CLM50%BGC-CROP_CICE%PRES_DOCN%DOM_MOSART_CISM2%NOEVOLVE_SWAV -res=f09_f09_mg17 -mach=derecho -project=######## --run-unsupported

I ran the equivalent experiment on Cheyenne with CLM50%BGC-CROP (and an earlier version of CESM2) without any issue, so I think this is a Derecho issue. I see that other forum posts describe similar issues with CLM50%BGC-CROP on Derecho, but my problem is different. My case built fine with ./case.build, but when I submit it, it runs for the full 12 hours of wall time, is aborted, and I get no output. It does not throw any errors, and the log shows nothing useful to identify where the issue is.
The case is located here:

/glade/u/home/skramer/casefiles/2000_docn_ACblob2_BGC
and the case run logs are located here:

/glade/derecho/scratch/skramer/2000_docn_ACblob2_BGC/run

The errors I am getting via email:
PBS Job Id: 5940910.desched1
Job Name: run.2000_docn_ACblob2_BGC
Aborted by PBS Server
Job exceeded resource walltime
See job standard error file

PBS Job Id: 5940911.desched1
Job Name: st_archive.2000_docn_ACblob2_BGC
Aborted by PBS Server
Job deleted as result of dependency on job 5940910.desched1

Is CLM50%BGC-CROP not available on Derecho? If so, how is it able to build but then not run successfully? This is the only change from another experiment that ran without any issue.
Any insight would be greatly appreciated. Thank you!
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
From the log files, I think the model is running fine but it is indeed running out of wall clock time. The cpl logs indicate it ran from 20000101 and died around 20041211. The cpl log indicates it is taking about 23.9 seconds per day. If you multiply that out, it indicates that you can only get about 4.94 years in a 12 hour period. That's about when it is dying.
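The arithmetic above can be checked with a short Python sketch. It takes the 23.9 seconds per model day quoted from the cpl log and the 12-hour wall time from the thread; the 365-day (no-leap) year and the function name `simulated_years` are assumptions made for illustration.

```python
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365  # assumption: no-leap calendar


def simulated_years(seconds_per_model_day, walltime_hours):
    """Model years that fit in a wall-time window at a given cost per model day."""
    model_days = walltime_hours * SECONDS_PER_HOUR / seconds_per_model_day
    return model_days / DAYS_PER_YEAR


# 23.9 s per model day (from the cpl log) and a 12-hour wall-time limit.
years = simulated_years(23.9, 12)
print(f"about {years:.2f} simulated years fit in 12 hours")
```

The result lands just under 5 years, which matches a run that starts at 20000101 and stops in December 2004.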
 

skramer

Sydney Kramer
New Member
It was only supposed to run for 5 years so I'd be shocked if it died and provided no output a month shy of completion...

Something has to be wrong. When it ran with CLM50%SP on Derecho, it ran 15 years/day. It also ran 10 years in a 12-hour wall time with CLM50%BGC-CROP on Cheyenne. There is no way this model, which is an AGCM (not coupled), would take 12 hours to run only 5 years.

Any ideas?
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
It produced history files for all of the components up until it died, so I'm not sure what you mean by no output.
BGC-Crop is slower but it shouldn't be that much slower. But is this your SP case:

2000_docn_ACblob2

That case seems to have gotten about 10.8 years/day.
 

skramer

Sydney Kramer
New Member
It didn't put any of the monthly output files into the archive where they are supposed to go. Yes, that is the SP case, and I was running that as well as another model 2000_docn_Ctrl2 at the same time each day, so each model was running that much per day in the allotted 12-hour wall time window.

I expected the BGC-Crop to be slower but it shouldn't be that much slower.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
The short-term archive will fail if your main run fails. That's what this message means:

PBS Job Id: 5940911.desched1
Job Name: st_archive.2000_docn_ACblob2_BGC
Aborted by PBS Server
Job deleted as result of dependency on job 5940910.desched1

All of your history files are still in your run directory.

When it says per day, it means per wall-clock day, so you have to divide by two to determine the number of years in a 12-hour period.
 

skramer

Sydney Kramer
New Member
Thank you for letting me know those are located in my run directory; that is very helpful. I am still concerned that the output may not be trustworthy and that something is wrong, since it is taking so much more time than it would have on Cheyenne and is much slower than the SP case.

Thank you so much for looking into this!
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I don't think it's that much slower than the SP case. As I mentioned, your SP case was getting about 10.8 years/day, which means that it could run about 5.4 years in a 12-hour period. Your BGC-Crop case can almost run 5 years in a 12-hour period.
How many processors were you using in your case on Cheyenne? Your BGC-Crop case is using 512 processors.
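The comparison above can be reproduced with a small Python sketch that converts a throughput in simulated years per wall-clock day into years per wall-time window, and into the wall time a run of a given length would need. The 10.8 years/day and 23.9 seconds per model day figures come from this thread; the function names and the 365-day (no-leap) year are assumptions for illustration.

```python
HOURS_PER_DAY = 24
SECONDS_PER_DAY = 86400
DAYS_PER_YEAR = 365  # assumption: no-leap calendar


def years_in_window(years_per_wallclock_day, walltime_hours):
    """Simulated years completed within a wall-time window."""
    return years_per_wallclock_day * walltime_hours / HOURS_PER_DAY


def hours_needed(target_years, years_per_wallclock_day):
    """Wall-clock hours required to simulate target_years."""
    return target_years / years_per_wallclock_day * HOURS_PER_DAY


def years_per_day_from_cpl(seconds_per_model_day):
    """Throughput in simulated years per wall-clock day from the cpl log cost."""
    return SECONDS_PER_DAY / seconds_per_model_day / DAYS_PER_YEAR


sp_rate = 10.8                           # SP case, years per wall-clock day
bgc_rate = years_per_day_from_cpl(23.9)  # BGC-Crop case, derived from the cpl log

print(f"SP: {years_in_window(sp_rate, 12):.1f} years in 12 h")
print(f"BGC-Crop: {bgc_rate:.1f} years/day, "
      f"{hours_needed(5, bgc_rate):.1f} h needed for 5 years")
```

Under these numbers the 5-year BGC-Crop run would need slightly more than 12 hours, which is consistent with the job being killed about a month short of completion.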
 