How to debug CESM2 simulations

Eric

Eric
Member
Hello,

I was wondering if there is a way to debug CESM2 runs: to execute the code line by line and inspect variable values while the model runs. I read through the official guide but did not find anything on this. I saw some threads suggesting debug = .true., but I still don't know how to use it. I would appreciate it if somebody could provide detailed instructions.
 

cacraig

Cheryl Craig
CSEG and Liaisons
Staff member
Yes it is possible to run CESM2 jobs using a debugger.

If you happen to have access to NCAR's cheyenne machine, they have the debugger DDT installed on it. There are CISL web pages which describe how to use it. I personally have not used it, so cannot give direct guidance on it.

I personally use the debugger "Totalview" on a local computing cluster. Note that Totalview is a debugger that must be purchased - it is not free.

To run a CESM job with any debugger, the following lines should work the same. Just replace the "totalview" in the last line with the name of your debugger. Make sure you do ./xmlchange DEBUG=TRUE to get code which you can use with a debugger.

I do the usual create_newcase, case.setup and case.build commands and then instead of doing case.submit I execute the following commands in my sandbox:
------------------------------------
source .env_mach_specific.sh

RUNDIR=`./xmlquery RUNDIR -value`
EXEROOT=`./xmlquery EXEROOT -value`
LID=`date '+%y%m%d-%H%M%S'`

cd $RUNDIR
mkdir timing
mkdir timing/checkpoints
echo `pwd`
export OMP_NUM_THREADS=$nthreads
totalview ${EXEROOT}/cesm.exe
------------------------------------
Note that the .env_mach_specific.sh script assumes you are using the bash shell. If you are using csh, there is a corresponding .env_mach_specific.csh that you would source as the first line instead.
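
For anyone who repeats these steps often, they can be wrapped in a small bash function. This is only a minimal sketch, not part of CIME: the name `cesm_debug_launch` and its arguments are made up for illustration. It assumes the case has already been built, uses `mkdir -p` so a rerun does not fail on an existing timing directory, and uses the `--value` spelling of the xmlquery flag that appears later in this thread.

```shell
# Hypothetical helper (not part of CIME): start a built CESM case under a debugger.
# Usage: cesm_debug_launch <case_dir> [debugger_command] [nthreads]
cesm_debug_launch() {
    local case_dir="$1"
    local debugger="${2:-totalview}"   # replace with the debugger you have
    local nthreads="${3:-1}"           # single thread eases debugging

    (
        # Work in a subshell so the caller's directory and environment are untouched.
        cd "$case_dir" || exit 1

        # Load the modules and environment the case was built with.
        source ./.env_mach_specific.sh

        # Ask the case where it runs and where the executable was built.
        rundir=$(./xmlquery RUNDIR --value)
        exeroot=$(./xmlquery EXEROOT --value)

        cd "$rundir" || exit 1
        mkdir -p timing/checkpoints    # creates timing/ as well

        # Exported so the debugger and cesm.exe inherit it.
        export OMP_NUM_THREADS="$nthreads"

        "$debugger" "${exeroot}/cesm.exe"
    )
}
```

From the sandbox this would be called as, for example, `cesm_debug_launch ~/cases/mycase totalview 1`, where the case path is a placeholder.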
 

Eric

Eric
Member
Hi Cheryl,

RUNDIR=`./xmlquery RUNDIR -value`
EXEROOT=`./xmlquery EXEROOT -value`
LID=`date '+%y%m%d-%H%M%S'`

I assume these three lines should be occurring in .env_mach_specific.sh, right? You do not use export so I guess you are not defining environmental variables here.

And is nthreads an environmental variable? I am a bit confused about this command:

export OMP_NUM_THREADS=$nthreads

Should I look for information and replace $nthreads with some specific numbers?

And should I execute these commands on cheyenne login node or casper?

Thank you!!

Best
Eric
 

cacraig

Cheryl Craig
CSEG and Liaisons
Staff member
As my default shell is bash, I simply copy/paste all of these commands into my shell while I am sitting in my sandbox directory (where I ran the case.build command). The RUNDIR, EXEROOT and LID lines do not go inside `.env_mach_specific.sh`; that script is sourced first, and the remaining lines are then typed in the same shell.

My script is a bash script and as I said, I typically run with a single processor and single thread to ease debugging. At the top of my script, I set
np=1
nthreads=1

You can start a bash shell by typing "bash" and then copy/paste these commands in your sandbox directory. Once you are done debugging, type "exit" to return to your native shell.

If on the other hand you want to do this all in csh, you will need to convert the lines to use csh syntax.
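
On the question of `export`, here is a minimal sketch of the difference, using only standard bash. RUNDIR, EXEROOT, LID and nthreads in the commands above are plain shell variables, used only by later lines typed in the same shell. OMP_NUM_THREADS is exported because the debugger and cesm.exe run as child processes and only see exported variables.

```shell
unset nthreads                     # start clean in case it was exported earlier
nthreads=1                         # plain shell variable: visible only in this shell
export OMP_NUM_THREADS=$nthreads   # exported: inherited by child processes

# A child process sees the exported variable but not the plain one.
child_view=$(bash -c 'echo "nthreads=[${nthreads}] OMP_NUM_THREADS=[${OMP_NUM_THREADS}]"')
echo "$child_view"
# prints: nthreads=[] OMP_NUM_THREADS=[1]
```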
 

Eric

Eric
Member
Hi Cheryl,

You said that at the top of your script, you set:
np=1
nthreads=1

Would you mind sharing your bash script with me? Thank you!

Best
Eric
 

cacraig

Cheryl Craig
CSEG and Liaisons
Staff member
To run my case in single processor/single thread mode, I follow exactly the steps below. I used the term "script" loosely, as I really just copy/paste. I start this process in cime/scripts. If you are already in bash, you can omit the first line (bash) and the exit at the end. Unless you have Totalview, you will need to change the totalview line to use your own debugger.

bash
cd cime/scripts
./create_newcase <usual stuff> --res f10_f10_mg37 --pecount 1   # f10_f10_mg37 is a very coarse grid but is good for debugging most issues; --pecount 1 says to run with a single processor
cd <your case dir>
./xmlchange NTASKS=1     # this is probably redundant, but I do it every time
./xmlchange DEBUG=TRUE   # compile CESM with DEBUG on
./case.setup
./case.build
np=1
nthreads=1
source .env_mach_specific.sh
RUNDIR=`./xmlquery RUNDIR -value`
EXEROOT=`./xmlquery EXEROOT -value`
LID=`date '+%y%m%d-%H%M%S'`
cd $RUNDIR
mkdir timing
mkdir timing/checkpoints
echo `pwd`
export OMP_NUM_THREADS=$nthreads
totalview ${EXEROOT}/cesm.exe
exit
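
Before launching the debugger it can save time to confirm that the case is set to DEBUG and that an executable exists. The following is a sketch under assumptions, not a CIME tool: `preflight_debug_case` is a hypothetical name, and it assumes `./xmlquery DEBUG --value` prints TRUE or FALSE. It only reads the case's current setting, so if DEBUG was changed after the last build, a rebuild is still needed.

```shell
# Hypothetical pre-flight check (not part of CIME).
# Usage: preflight_debug_case <case_dir>
preflight_debug_case() {
    local case_dir="$1"
    local debug exeroot

    # Query the case's current settings from inside the case directory.
    debug=$(cd "$case_dir" && ./xmlquery DEBUG --value) || return 2
    exeroot=$(cd "$case_dir" && ./xmlquery EXEROOT --value) || return 2

    if [ "$debug" != "TRUE" ]; then
        echo "DEBUG is '$debug'; run ./xmlchange DEBUG=TRUE and rebuild" >&2
        return 1
    fi
    if [ ! -x "${exeroot}/cesm.exe" ]; then
        echo "no executable at ${exeroot}/cesm.exe; run ./case.build first" >&2
        return 1
    fi
    echo "ok: DEBUG=TRUE, executable ${exeroot}/cesm.exe"
}
```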
 

Eric

Eric
Member
Hi Cheryl,

Thank you for these detailed instructions!

1. I tried this set of commands with the fully coupled case BHIST_BPRP. However, after I set its ntasks to 1 and tried building the case, the build failed:

ERROR: BUILD FAIL: buildexe failed, cat /glade/scratch/xygao/case1_BHIST_BPRP_singlecore/bld/cesm.bldlog.210516-170841

In the log file, the last several lines say that:

ld: failed to convert GOTPCREL relocation; relink with --no-relax
/glade/u/home/xygao/cases/case1_BHIST_BPRP_singlecore/Tools/Makefile:874: recipe for target '/glade/scratch/xygao/case1_BHIST_BPRP_singlecore/bld/cesm.exe' failed
gmake: *** [/glade/scratch/xygao/case1_BHIST_BPRP_singlecore/bld/cesm.exe] Error 1

Do you have any idea about why the build would fail under ntasks=1?

2. Because ntasks=1 makes the model build fail, I left the PE layout at its default. That is to say, the debugger will also use 20 nodes, 720 tasks, and one thread. Here is the command set that I used:

cd my_cesm_sandbox/cime/scripts
./create_newcase --case ~/cases/case4_BHIST_BPRP --compset BHIST_BPRP --res f09_g17 --project $PROJECT
cd ~/cases/case4_BHIST_BPRP
./case.setup
./xmlchange DEBUG=TRUE
qcmd -- ./case.build
np=720
nthreads=1
source .env_mach_specific.sh
RUNDIR=`./xmlquery RUNDIR --value`
EXEROOT=`./xmlquery EXEROOT --value`
LID=`date '+%y%m%d-%H%M%S'`
cd $RUNDIR
mkdir timing
mkdir timing/checkpoints
echo `pwd`
export OMP_NUM_THREADS=$nthreads
ddt --connect ${EXEROOT}/cesm.exe

However, the debugging stopped at line 58 in the main function (cime_driver.F90): call cime_pre_init1(esmf_logfile_option)
and reported error:

MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()

aborting job

Do you have any idea what causes this failure? I guess there are still some issues with the environment settings. I would appreciate it if you could give me some clues.

Best
Eric
 

cacraig

Cheryl Craig
CSEG and Liaisons
Staff member
Eric-

I'm afraid that I have shared the bulk of my knowledge on this. As I am always debugging CAM F compsets, they are not quite as complex as B compsets and usually work well with ntasks=1. That said, I will reach out privately to a couple of people who I know debug B compsets and use ddt to see if they can give you some additional advice.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
Eric!

I'm also an Erik (with a different spelling). I use the DDT debugger, which is part of Arm Forge, on cheyenne.

Here are most of the steps for that:


CISL also has documentation on the process here:


You need both parts to get this working.
 

Eric

Eric
Member
Hi Erik!

Thank you so much! This is super helpful.

I have a small question: it says to add module load for allinea-forge in the env_mach_specific.xml.ddt
I was wondering where I should add: module load arm-forge/20.2?
It seems strange if I just put it in front of the first line of env_mach_specific.xml.ddt

Best
Eric
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
This comment on an earlier issue shows how it gets embedded into the env_mach_specific.xml file. Some of the specifics have changed since then, but it shows you the general idea.


And you are right it wouldn't work if added as the first line of the file. It needs to be where the other default module loads happen. And it needs to be in a section that's always invoked and not something that's dependent on the compiler or something else.
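
As a rough illustration of that placement, here is a config fragment. It is a sketch and an assumption about the layout of env_mach_specific.xml, which differs between machines and CIME versions: the load command goes in the `modules` block that has no compiler or mpilib attribute, alongside the machine's other default loads. Only the module name arm-forge/20.2 comes from this thread; everything elided in the comments should be left as your file already has it.

```xml
<module_system type="module">
  <!-- init_path and cmd_path entries unchanged -->
  <modules>
    <!-- the machine's existing unconditional commands stay here -->
    <command name="load">arm-forge/20.2</command>
  </modules>
  <modules compiler="intel">
    <!-- compiler-specific loads: not the place for the debugger module -->
  </modules>
</module_system>
```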
 

Eric

Eric
Member
Hi Erik,

Thanks! I have figured out how to module load arm-forge/20.2.

And I executed:
./case.setup --reset --keep env_mach_specific.xml

It reports:
error: unrecognized arguments: --keep

In the Github page that you sent me, it says:
The "-keep (-k) option is a critical step that is now required.

Did they change the setting so -k is not required now?

Best
Eric
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
This is likely a CESM version issue. You must be using an older version of CESM where this wasn't an option. If you are using CESM2.1.x series, I think it might not have been available then. I'm not sure about CESM2.2.0 either, but it is in the latest version of cime with CESM I'm using.
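
One way to find out whether a given checkout has the option is to look for it in the script's own help text. The sketch below assumes that `./case.setup --help` lists every option it accepts. The function names are hypothetical, and it is meant to be run from the case directory.

```shell
# Hypothetical helper: does this script's --help output mention the given flag?
supports_flag() {
    local script="$1" flag="$2"
    "$script" --help 2>&1 | grep -q -e "$flag"
}

# Hypothetical wrapper: use --keep when available, otherwise fall back to a plain reset.
reset_keeping_env() {
    if supports_flag ./case.setup --keep; then
        ./case.setup --reset --keep env_mach_specific.xml
    else
        echo "case.setup has no --keep option here; running plain --reset" >&2
        ./case.setup --reset
    fi
}
```

After a plain reset it is worth checking `./preview_run` to confirm that the edited module settings survived.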
 

Eric

Eric
Member
Hi Erik,

I see. I just used:

./case.setup --reset

and I checked ./preview_run. It looks like the modification has taken effect.

If the simulation will use 720 tasks in total, I should select MPI processes to be 720 in the DDT interface, right?

And I was wondering how long I can stay in DDT doing the debug? For example, if I stay in DDT for 2 hours, will this consume a large part of my core-hour resources?

Best
Eric
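
As a rough way to reason about the cost question: assuming the allocation is charged for whole nodes for the full wall-clock time the job holds them, with a queue factor of 1 (an assumption about the charging policy, which CISL documents per queue), an interactive session costs nodes × cores per node × hours no matter how idle the debugger is. The 36 cores per node follows from the 720 tasks on 20 nodes mentioned earlier; the function name is hypothetical.

```shell
# Hypothetical helper: estimate core-hours for a job that holds whole nodes.
# Assumes charge = nodes * cores_per_node * wall_hours, queue factor 1.
core_hours() {
    local nodes="$1" cores_per_node="$2" wall_hours="$3"
    echo $(( nodes * cores_per_node * wall_hours ))
}

full_layout=$(core_hours 20 36 2)   # 20 nodes held for a 2-hour DDT session
single_node=$(core_hours 1 36 2)    # one whole node for the same 2 hours
echo "20 nodes: ${full_layout} core-hours; 1 node: ${single_node} core-hours"
# prints: 20 nodes: 1440 core-hours; 1 node: 72 core-hours
```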
 

Eric

Eric
Member
Hi Erik,

I am now able to debug CESM2.1.3 runs with DDT. However, it only works if I submit the job to the regular queue; if I submit it to the share queue and try to debug, it reports errors. Since I use only one node and one core for debugging, the share queue would be the best choice for saving computing time. My guess is that the modifications in env_run.xml implicitly tie the job to the regular queue. Do you have any idea about this?

Thanks!
 