This list is intended to be as extensive as possible. If you have any further suggestions, feel free to reply to this post!
Check for Known Problems
It is recommended to check the known problems lists when starting a new run, and whenever encountering a new issue. The following post has a list of WACCM-specific issues, as well as links to the official CESM release notes:
Both this link and the release notes are updated periodically with new information.
CESM Log Files
If a run has crashed, you may find a useful message in the CESM log files. Look in your run directory for a file with a name that starts with "cesm.log." for error messages from the compiler and batch system. If the model aborts, it will print a message here, which may begin with the string "ERROR" or "ENDRUN". Occasionally, there may be useful information in a different log file, such as the atm or cpl logs.
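As a rough illustration of the search described above, the following Python sketch scans the "cesm.log." files in a run directory for the two marker strings. The function name and the choice to skip gzipped logs are assumptions made for this example; they are not part of CESM.

```python
from pathlib import Path

# Marker strings mentioned above; a model abort message may begin with either.
MARKERS = ("ERROR", "ENDRUN")


def find_abort_messages(run_dir):
    """Return (file name, line number, text) for each marked line in cesm.log.* files."""
    hits = []
    for log in sorted(Path(run_dir).glob("cesm.log.*")):
        if log.suffix == ".gz":
            # Compressed logs are skipped in this sketch.
            continue
        with open(log, errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if any(marker in line for marker in MARKERS):
                    hits.append((log.name, lineno, line.rstrip()))
    return hits
```

Calling find_abort_messages("/path/to/run") returns a list that can be printed or inspected; an empty list only means that neither marker appeared, not that the run succeeded.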
The Batch System
Your machine's batch system may terminate runs due to excessive resource use (e.g. exceeding wall time limit), or due to an incorrect run file. Before going further in debugging a crash, it may be a good idea to check that your run did not simply exceed a time limit.
One way to check this is to look in the standard output file in your case directory (the name varies by system, but it usually contains a number corresponding to the number of the batch job). If your job was killed by the batch system, the last line of this file will look like this:
Sun Dec 22 20:14:07 MST 2013 -- CSM EXECUTION BEGINS HERE
If CESM ran successfully, or if it failed but the job continued, a status message would be printed after this line. If the job never executed at all, this file may not exist, or may not contain any information specific to CESM.
If the last line in the standard output file says "EXECUTION BEGINS HERE", that means that the job was terminated abnormally, usually due to excessive resource use, or because of a fault in the system.
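The check above can be sketched in Python as follows. This is only a minimal example: the function name is made up for illustration, and the path to the standard output file has to be supplied by you, since its name varies by system.

```python
def ended_at_execution_banner(stdout_path):
    """Return True if the last non-blank line contains 'EXECUTION BEGINS HERE'."""
    last = ""
    with open(stdout_path, errors="replace") as fh:
        for line in fh:
            if line.strip():
                last = line.strip()
    # No status message after the banner suggests the job was terminated abnormally.
    return "EXECUTION BEGINS HERE" in last
```

A result of True suggests checking wall time and other resource limits before looking for a bug in the model itself.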
Many HPC systems produce "light" core files that consist of text backtraces. These may tell you the file and line number where an error occurred, or they may provide a hex code that you can use to get a line number using the addr2line utility. Turning on DEBUG in env_build.xml may improve the output here.
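If a text backtrace gives hex codes, a small Python wrapper around addr2line might look like the sketch below. It assumes the addresses appear as "0x"-prefixed hex values, which may not match the format on your system, and the function names are hypothetical. The -e option names the executable and -f asks addr2line to print function names as well.

```python
import re
import subprocess

# Assumption: addresses in the backtrace are written as 0x-prefixed hex.
HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")


def addr2line_command(executable, backtrace_text):
    """Build an addr2line command for the unique addresses found in the text."""
    addrs = []
    for addr in HEX_ADDR.findall(backtrace_text):
        if addr not in addrs:
            addrs.append(addr)
    return ["addr2line", "-f", "-e", executable] + addrs


def resolve_addresses(executable, backtrace_text):
    """Run addr2line and return its output, or '' if no addresses were found."""
    cmd = addr2line_command(executable, backtrace_text)
    if len(cmd) == 4:
        # With no addresses, addr2line would wait for input on stdin.
        return ""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

The output is most useful when the executable was built with debugging symbols.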
Traditional core files are binary files, which can be analyzed by a debugger. For instance, you can run "gdb /path/to/cesm.exe /path/to/corefile.123", and then use the "backtrace" command. There are many other command line and GUI debuggers from different vendors, and different core file formats may require different programs. A detailed explanation of debugger use is beyond the scope of this guide.
Unfortunately, we don't offer direct support for running CESM in a debugger, as debuggers differ considerably across HPC systems. However, if you have experience running MPI code in a debugger, this may be a quick way to find a problem.
DEBUG Mode and Compiler Options
Turning on DEBUG mode (the DEBUG setting in env_build.xml) enables compiler options that can help in a few ways.
Firstly, DEBUG turns on additional checks provided by the compiler, such as bounds checking. If these checks detect an error and abort the run, this information will appear in the CESM log file.
Secondly, DEBUG turns off optimization and adds the "-g" option, which may cause more detailed information to be provided in core files and debuggers, or in the compiler output in the CESM log file. Removing optimization improves debugging output, but it also changes answers.
Thirdly, DEBUG turns on floating point trapping. If an invalid floating point operation occurs, such as floating point overflow, or arithmetic with a NaN, the run will abort and print an error message explaining the problem in the log file.
In some cases, you may want to enable some checks from DEBUG mode, but not all of them. This can be done by editing the Macros file in your case directory to add the desired compiler flags. The most common reason to do this is to keep most of the DEBUG mode flags, but without turning optimization off.
Disabling threading is a suggestion that applies mostly to runs involving new code that interacts with the dynamics. It may be useful if there is a crash, or if answers are not reproducible.
In the env_mach_pes.xml file, check the values of the variables with names beginning with "NTHRDS". If any of these variables has a value higher than 1, threading is enabled. Since threading is a common source of bugs in new code, you may want to try disabling it. To do so, create a new case with the same settings as your original run, but set all NTHRDS variables to 1 immediately after creating the case. Alternatively, you can reuse your original case, if you clean the build and run configure/cesm_setup with the -clean option beforehand.
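The NTHRDS check can also be scripted. The Python sketch below assumes that env_mach_pes.xml stores each setting as an "entry" element with "id" and "value" attributes; please verify this against the file in your own case directory, since the layout may differ between CESM versions. The function name is made up for this example.

```python
import xml.etree.ElementTree as ET


def threaded_components(xml_path):
    """Return {variable name: value} for every NTHRDS* entry greater than 1."""
    root = ET.parse(xml_path).getroot()
    threaded = {}
    # Assumed layout: <entry id="NTHRDS_ATM" value="2" />
    for entry in root.iter("entry"):
        name = entry.get("id", "")
        if name.startswith("NTHRDS"):
            value = int(entry.get("value", "1"))
            if value > 1:
                threaded[name] = value
    return threaded
```

An empty dictionary means that no NTHRDS variable exceeds 1, so threading is already off.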
Changing the Compiler or MPI Library
Problems with new code are rarely due to compiler bugs. However, if you have reason to suspect that the compiler or MPI library is causing a problem, you may change either by giving the relevant options to create_newcase.
Check Memory Use
Some memory and timing statistics are printed after each model day to the coupler log ("cpl.log.xxxxxx-xxxxxx"). You may want to check that the reported memory use doesn't exceed the memory available to each processor of your system. If it does, you may find that running with more processors will solve the problem.
Instability in the FV dycore
Crashes in WACCM are sometimes caused by instability in the FV dycore, which is almost always due to a value of "nspltvrm" that is too low, allowing levels to cross in the dycore's vertically Lagrangian advection. In CESM 1.2 and later, this error should not occur for supported WACCM runs, and in cases where it does occur, an error message should be printed to the CESM log file.
However, in CESM 1.0 and CESM 1.1, this error is harder to diagnose, since the run may appear to crash in an unrelated physics parameterization. Attached to this post is a gzipped copy of the file "te_map.F90". You can download this file, run gunzip on it, and place it in your CAM SourceMods, or in the models/atm/cam/src/dynamics/fv directory of your CESM source tree. The modified code detects this error before it causes a crash, and prints a message to the CESM log file before aborting the run. When that happens, the CESM log file should contain a message recommending an increase to "NSPLTVRM", which is, in fact, the recommended fix.
The "state_debug_checks" namelist setting was introduced in CESM 1.2. It is not available in earlier CESM versions. If you set "state_debug_checks = .true." in the CAM namelist, CAM will perform some basic validation of the physics state. Specifically, it will check for infinite and NaN values in variables such as the wind speed, and will also check that some variables, such as temperature, are always positive. If any of these checks fails, an error message will be printed that mentions which variable contained the bad data, as well as the physics package that most likely introduced the erroneous data into the physics state.
This debugging option has a mild cost, so it is fine to leave it on for non-production runs, to catch any errors made during development. It does not change answers.
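To give a feel for the kind of validation described above, here is a conceptual Python sketch. It is not CAM's implementation (which is Fortran and operates on the physics state); the field names "T" and "U" and the function name are hypothetical, chosen only to illustrate the two kinds of checks.

```python
import math


def check_state(fields, positive=("T",)):
    """Return (field name, index, reason) for each suspicious value.

    fields maps a field name to a list of floats; names listed in
    `positive` must be strictly greater than zero.
    """
    problems = []
    for name, values in fields.items():
        for i, v in enumerate(values):
            if not math.isfinite(v):
                # Catches NaN as well as +/- infinity.
                problems.append((name, i, "not finite"))
            elif name in positive and v <= 0.0:
                problems.append((name, i, "non-positive"))
    return problems
```

For example, check_state({"T": [250.0, -1.0]}) flags the second temperature value because it is not positive.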
Sean Patrick Santos
CESM Software Engineering Group