CESM 2.1.5 F2000climo halting in modal_aero_lw after nerr_dopaer = 1000

BenT

Ben Timmermans
New Member
Hello,

I have a fresh "port" of CESM 2.1.5 to a local cluster, and appear to have a clean build against Intel compilers (2023).

However, running the F2000climo compset, at both F09 and F19, aerosols seem to be causing trouble, ultimately resulting in the job catching too many errors and apparently self-terminating:

In cesm.log:
...
*** halting in modal_aero_lw after nerr_dopaer = 1000
ERROR: Unknown error submitted to shr_abort_abort.
...

In atm.log, shortly before job is terminated:
...
WARNING: Aerosol optical depth is unreasonably high in this layer.
dopaer( 1 , 23 , 3 , 129 )= 106.322967237323
...

The warning is given many times until the job stops.
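
To see whether the bad values are confined to particular indices, the repeated warnings can be tallied from atm.log. The following is a minimal sketch, assuming the warning lines keep the `dopaer( a , b , c , d )= value` layout quoted above. The function name `summarize_dopaer` is hypothetical, and no meaning is assumed for the four index positions.

```python
import re
from collections import Counter

# Matches lines such as: dopaer( 1 , 23 , 3 , 129 )= 106.322967237323
DOPAER_RE = re.compile(
    r"dopaer\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*(\S+)"
)

def summarize_dopaer(lines):
    """Count warnings per value of each index position and track the largest value."""
    counts = [Counter() for _ in range(4)]
    max_value = None
    total = 0
    for line in lines:
        match = DOPAER_RE.search(line)
        if match is None:
            continue
        total += 1
        for position in range(4):
            counts[position][int(match.group(position + 1))] += 1
        try:
            value = float(match.group(5))
        except ValueError:
            # Unparseable field (e.g. overflow asterisks); keep the count only.
            continue
        if max_value is None or value > max_value:
            max_value = value
    return {"total": total, "counts": counts, "max_value": max_value}

# Example using the line quoted from atm.log above.
sample = [
    "WARNING: Aerosol optical depth is unreasonably high in this layer.",
    "dopaer( 1 , 23 , 3 , 129 )= 106.322967237323",
]
summary = summarize_dopaer(sample)
```

For a real run, pass the open log file instead of `sample`, e.g. `summarize_dopaer(open(path_to_atm_log))`. If one index position shows only a handful of distinct values, that may help narrow down where the spurious optical depths originate.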

Initially I was worried about compiling, but what little information I could find on this error suggests it could be corrupted, or mis-read, input data? However, I have checked the input data integrity using "./check_input_data" script which passes all data, and I have not done anything other than follow standard installation and build procedures. Everything seems ok ...

Can anyone suggest what is going on here, or how I could debug it?
Thanks
 

fischer

CSEG and Liaisons
Staff member
Hi Ben,

I just tried running an F2000climo at f19 on our system with intel 2024. I don't have access to intel 2023. I didn't get the warning messages or the error. Something you can try doing is rebuilding with debugging turned on.

./xmlchange DEBUG=TRUE

You can also try using QPC6 for a compset instead of F2000climo. They're basically the same, except QPC6 is an aquaplanet.

Since you ran ./check_input_data --chksum, it makes me wonder if there's an issue with the intel 2023 compiler. Do you have access to any other compilers?

Thanks
Chris
 

BenT

Ben Timmermans
New Member
Hi Chris, many thanks for your input. There are very few examples online of this type of error so I can only assume it's very unusual.

I could switch to GCC although on my platform I think it would require a lot of new package building (i.e. packages and libs against GCC are not already available on the system).

I can also add that these results were obtained under PIO_VERSION=1. I don't know why this was not set to version 2, which I think should be the case?

However, having set version 2, I am now facing an error in the PIO build, which looks clean enough until it gets to the end and states:

...
gmake[2]: Leaving directory '/dssgfs01/scratch/benerm/cesm/TEST4_F2000c_f19/bld/intel/impi/debug/nothreads/pio/pio2'
[100%] Built target piof
gmake[1]: Leaving directory '/dssgfs01/scratch/benerm/cesm/TEST4_F2000c_f19/bld/intel/impi/debug/nothreads/pio/pio2'
/opt/software/rocky9/eb/software/CMake/3.26.3-GCCcore-12.3.0/bin/cmake -E cmake_progress_start /dssgfs01/scratch/benerm/cesm/TEST4_F2000c_f19/bld/intel/impi/debug/nothreads/pio/pio2/CMakeFiles 0
ERROR: CIME models require NETCDF in PIO build

This is not terribly helpful, because as far as I can tell, my machine and compiling configs are picking up the NetCDF dependencies ok. I have since tried various permutations of compiler and machine environment options, but every time I get the same error.
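
One way to see what the PIO2 configure step actually resolved is to list the NetCDF-related entries in its CMake cache. This is a sketch only, assuming the standard CMake layout in which `CMakeCache.txt` sits at the top of the build tree (here, the `pio2` directory from the log above). The helper names `cache_entries` and `report` are hypothetical.

```python
from pathlib import Path

def cache_entries(cache_text, needle="netcdf"):
    """Return (key, value) pairs from CMakeCache.txt text whose key mentions `needle`.

    The match is case-insensitive, so the default also picks up PnetCDF entries.
    """
    hits = []
    for raw in cache_text.splitlines():
        line = raw.strip()
        # Skip blanks and the two comment styles CMake writes into the cache.
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        key, sep, value = line.partition("=")
        if sep and needle.lower() in key.lower():
            hits.append((key, value))
    return hits

def report(build_dir):
    """Print matching cache entries, flagging values that end in NOTFOUND."""
    cache = Path(build_dir) / "CMakeCache.txt"
    for key, value in cache_entries(cache.read_text()):
        flag = "  <-- not found" if value.endswith("NOTFOUND") else ""
        print(f"{key} = {value}{flag}")
```

Calling `report(...)` with the `pio2` build directory would print each entry. Any value ending in `NOTFOUND` points at a library or include path that CMake failed to locate, which can be compared against the paths set in the machine and compiler configs.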

Incidentally, I also discovered that a long time back, my colleague already posted on something similar w.r.t. CESM v 2.1.4:


He *was* able to run 2.1.4 ok, although under a different set of package builds (not sure if Intel or GCC). So this seems to imply a mis-configured environment, which could still be the case, although I fail to see where the problem could be.

Unfortunately, I have to say the setup and build process for CESM obfuscates so many aspects of the build environment that it is pretty difficult to work through problems. (Many hours have been consumed so far...)
 

fischer

CSEG and Liaisons
Staff member
Hi Ben,

CESM 2.1.5 defaults to PIO_VERSION=1, so that is expected. Could you tell me the create_newcase command you're using? I want to make sure I'm testing the same configuration as you. I'm also going to ask the CAM folks if they have encountered this error before.

Can you try using your colleague's build configuration? It'll work with 2.1.5. But it sounds like the build packages he used might not be available anymore.

Did you have a chance to try the run with DEBUG turned on?

It'll also be helpful if you could attach any files you modified for the port to your machine, and the number of tasks you're trying to use for your tests.

Thanks
Chris
 

BenT

Ben Timmermans
New Member
Hi Chris,

Thanks for coming back on this.

I have been working to resolve bugs in my config, and I now better understand the relationships between the config files (phew ...!).

1. Using Intel compilers, I can now successfully build PIO_VERSION 1 & 2, with pnetcdf recognised ok.
2. QPC6
./create_newcase --case /dssgfs01/scratch/benerm/cesm/TEST8_QPC6_f19 --compset QPC6 --res f19_f19_mg17 --machine anemone
Following your useful suggestion, I tried this case and it appears to run successfully to completion! This is based on my local config_machines.xml file I have created.
3. F2000climo
./create_newcase --case /dssgfs01/scratch/benerm/cesm/TEST7_F2000c_f19 --compset F2000climo --res f19_f19_mg17 --machine anemone
Unfortunately, in spite of my progress, I am still seeing the original error (seemingly associated with aerosol depth) using either PIO_VERSION 1 or 2, with either NetCDF or PNetCDF.

Using DEBUG, I am not sure there is any more failure information to report? I guess this is because the model is actually running, and causing spurious values that are caught by the model itself, rather than completely crashing the code. This is why it feels like an input problem, rather than a build problem.

As I said, I have run the input data check command and it seems ok, but note that I earlier (first off) tried FC2000climo somewhat by mistake. I am vaguely concerned that somehow data might have got muddled up? With that said, I have not modified or messed with any input data, so I don't know why or how that could have happened. What would be the most relevant config files?
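
One way to rule out a muddled or truncated input file independently of the input data check is to compare checksums between two copies of the input data tree, for example a local copy and a colleague's. This is a minimal sketch; the function names `md5sum` and `compare_trees` are hypothetical, and both directory arguments are placeholders for wherever the two copies live.

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def compare_trees(root_a, root_b):
    """Return relative paths present in both trees whose contents differ."""
    root_a, root_b = Path(root_a), Path(root_b)
    differing = []
    for file_a in sorted(root_a.rglob("*")):
        if not file_a.is_file():
            continue
        relative = file_a.relative_to(root_a)
        file_b = root_b / relative
        # Files missing from the second tree are skipped, not reported.
        if file_b.is_file() and md5sum(file_a) != md5sum(file_b):
            differing.append(relative.as_posix())
    return differing
```

An empty list from `compare_trees` means every file found in both trees is byte-identical, which would make a corrupted input file a less likely explanation.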

Any more thoughts or suggestions?
 