Question on restarting a regional case

stevenDH

Member
Hi all

I have been struggling to restart from a previous run when doing regional simulations with ctsm5.3.012. I previously set up spinup simulations successfully over a region of Central Africa with various model configurations (I2000Clm50Sp, I2000ClmFATES). My goal now is to start a new run with transient climate. From previous experience with single-site runs, I know how to set this up with the xml commands and a finidat for the FATES initialisation. The regional land surface and domain datasets were created with the Python commands provided for this.

However, my regional runs crash early in the simulation. They end just before finishing the initialisation of history output (see lnd log) with a segmentation fault and not much more information: the cesm log attached mentions a memory mismatch, and the lnd log has no specific information on the crash. Other simulations that do not start from restart files work fine, so it doesn't seem to be related to limited resources or anything like that. Is there currently a known bug related to restarting runs from regional simulations?

The only thing to note is that when I submit my case, the model tries to download a mosart restart file (see below), even though my regional runs never created any mosart output during the spinup. That leads to a side question: is it normal that mosart doesn't run for these compsets, or is it incompatible with the regional setup?

Any input is much appreciated!
Cheers
Steven


Model mosart missing file finidat = '/scratch/brussel/vo/000/bvo00003/vsc46573/cesm/output/FATES_SPINUP_I2000Clm50Fates_congo_def.03-25/run/FATES_SPINUP_I2000Clm50Fates_congo_def.03-25.mosart.r.2150-01-01-00000.nc'
Checking server ftp://gridanon.cgd.ucar.edu:2811/cesm/inputdata/ with protocol gftp
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
Client protocol gftp not enabled
Checking server ftp://ftp.cgd.ucar.edu/cesm/inputdata/ with protocol wget
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
Using protocol wget with user anonymous and passwd user@example.edu
Could not connect to repo 'ftp://ftp.cgd.ucar.edu/cesm/inputdata/'
This is most likely either a proxy, or network issue .(location 2)
Checking server ftp.cgd.ucar.edu/cesm/inputdata with protocol ftp
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
Using protocol ftp with user anonymous and passwd user@example.edu
server address ftp.cgd.ucar.edu root path cesm/inputdata
ftp login timeout! [Errno 111] Connection refused
Checking server - Revision 70792: /trunk/inputdata with protocol svn
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
Using protocol svn with user and passwd
Loading input file list: 'Buildconf/ctsm.input_data_list'
Loading input file list: 'Buildconf/cpl.input_data_list'
Loading input file list: 'Buildconf/datm.input_data_list'
Loading input file list: 'Buildconf/mosart.input_data_list'
Model mosart missing file finidat = '/scratch/brussel/vo/000/bvo00003/vsc46573/cesm/output/FATES_SPINUP_I2000Clm50Fates_congo_def.03-25/run/FATES_SPINUP_I2000Clm50Fates_congo_def.03-25.mosart.r.2150-01-01-00000.nc'
Cannot download file since it lives outside of the input_data_root '/scratch/brussel/vo/000/bvo00003/vsc46573/cesm/inputdata'
GET_REFCASE is false, the user is expected to stage the refcase to the run directory.
Check case OK
 

stevenDH

Member
Sorry, for some reason I cannot upload the log files. Please find the runscript attached and the cesm log printed below.

[node714:3345833:0:3345833] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4)
[node714:3345802:0:3345802] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4)
[node714:3345828:0:3345828] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4)
[node714:3345813:0:3345813] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4)
==== backtrace (tid:3345817) ====
0 0x000000000003fc30 __GI___sigaction() :0
1 0x0000000000829016 __fatesrestartinterfacemod_MOD_set_restart_vectors() ???:0
2 0x00000000005675f8 __clmfatesinterfacemod_MOD_restart() ???:0
3 0x0000000000548bb1 __clm_instmod_MOD_clm_instrest() ???:0
4 0x000000000061a748 __restfilemod_MOD_restfile_write() ???:0
5 0x0000000000545c56 __clm_initializemod_MOD_initialize2() ???:0
6 0x0000000000512b85 __lnd_comp_nuopc_MOD_initializerealize() lnd_comp_nuopc.F90:0
7 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
8 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
9 0x0000000000866c3b ESMCI::VMK::enter() ???:0
10 0x000000000087da87 ESMCI::VM::enter() ???:0
11 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
12 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
13 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
14 0x00000000010da7ec __nuopc_driver_MOD_loopmodelcompss() NUOPC_Driver.F90:0
15 0x00000000010dd19c __nuopc_driver_MOD_initializeipdv02p3() NUOPC_Driver.F90:0
16 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
17 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
18 0x0000000000866c3b ESMCI::VMK::enter() ???:0
19 0x000000000087da87 ESMCI::VM::enter() ???:0
20 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
21 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
22 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
23 0x00000000010da7ec __nuopc_driver_MOD_loopmodelcompss() NUOPC_Driver.F90:0
24 0x00000000010dd2b9 __nuopc_driver_MOD_initializeipdv02p3() NUOPC_Driver.F90:0
25 0x0000000001110e25 __nuopc_driver_MOD_initializegeneric() NUOPC_Driver.F90:0
26 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
27 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
28 0x0000000000866c3b ESMCI::VMK::enter() ???:0
29 0x000000000087da87 ESMCI::VM::enter() ???:0
30 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
31 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
32 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
33 0x000000000041d0bc MAIN__() esmApp.F90:0
34 0x0000000000411ae7 main() ???:0
35 0x000000000002a610 __libc_start_call_main() ???:0
36 0x000000000002a6c0 __libc_start_main_alias_2() :0
37 0x0000000000411b15 _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
==== backtrace (tid:3345804) ====
0 0x000000000003fc30 __GI___sigaction() :0
1 0x0000000000829016 __fatesrestartinterfacemod_MOD_set_restart_vectors() ???:0
2 0x00000000005675f8 __clmfatesinterfacemod_MOD_restart() ???:0
3 0x0000000000548bb1 __clm_instmod_MOD_clm_instrest() ???:0
4 0x000000000061a748 __restfilemod_MOD_restfile_write() ???:0
5 0x0000000000545c56 __clm_initializemod_MOD_initialize2() ???:0
6 0x0000000000512b85 __lnd_comp_nuopc_MOD_initializerealize() lnd_comp_nuopc.F90:0
7 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
8 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
9 0x0000000000866c3b ESMCI::VMK::enter() ???:0
10 0x000000000087da87 ESMCI::VM::enter() ???:0
11 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
12 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
13 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
14 0x00000000010da7ec __nuopc_driver_MOD_loopmodelcompss() NUOPC_Driver.F90:0
15 0x00000000010dd19c __nuopc_driver_MOD_initializeipdv02p3() NUOPC_Driver.F90:0
16 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
17 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
18 0x0000000000866c3b ESMCI::VMK::enter() ???:0
19 0x000000000087da87 ESMCI::VM::enter() ???:0
20 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
21 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
22 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
23 0x00000000010da7ec __nuopc_driver_MOD_loopmodelcompss() NUOPC_Driver.F90:0
24 0x00000000010dd2b9 __nuopc_driver_MOD_initializeipdv02p3() NUOPC_Driver.F90:0
25 0x0000000001110e25 __nuopc_driver_MOD_initializegeneric() NUOPC_Driver.F90:0
26 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
27 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
28 0x0000000000866c3b ESMCI::VMK::enter() ???:0
29 0x000000000087da87 ESMCI::VM::enter() ???:0
30 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
31 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
32 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
33 0x000000000041d0bc MAIN__() esmApp.F90:0
34 0x0000000000411ae7 main() ???:0
35 0x000000000002a610 __libc_start_call_main() ???:0
36 0x000000000002a6c0 __libc_start_main_alias_2() :0
37 0x0000000000411b15 _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x14b87543fc2f in ???
#1 0x829016 in ???
#2 0x5675f7 in ???
#3 0x548bb0 in ???
#4 0x61a747 in ???
#5 0x545c55 in ???
#6 0x512b84 in ???
#7 0x14b8769ad887 in ???
#8 0x14b8769adbbb in ???
#9 0x14b876c97c3a in ???
#10 0x14b876caea86 in ???
#11 0x14b8769ac4b4 in ???
#12 0x14b876e9c8fb in ???
#13 0x14b8770cb945 in ???
#14 0x14b87750b7eb in ???
#15 0x14b87750e19b in ???
#16 0x14b8769ad887 in ???
#17 0x14b8769adbbb in ???
#18 0x14b876c97c3a in ???
#19 0x14b876caea86 in ???
#20 0x14b8769ac4b4 in ???
#21 0x14b876e9c8fb in ???
#22 0x14b8770cb945 in ???
#23 0x14b87750b7eb in ???
#24 0x14b87750e2b8 in ???
#25 0x14b877541e24 in ???
#26 0x14b8769ad887 in ???
#27 0x14b8769adbbb in ???
#28 0x14b876c97c3a in ???
#29 0x14b876caea86 in ???
#30 0x14b8769ac4b4 in ???
#31 0x14b876e9c8fb in ???
#32 0x14b8770cb945 in ???
#33 0x41d0bb in ???
#34 0x411ae6 in ???
#35 0x14b87542a60f in ???
#36 0x14b87542a6bf in ???
#37 0x411b14 in ???
#38 0xffffffffffffffff in ???
==== backtrace (tid:3345827) ====
0 0x000000000003fc30 __GI___sigaction() :0
1 0x0000000000829016 __fatesrestartinterfacemod_MOD_set_restart_vectors() ???:0
2 0x00000000005675f8 __clmfatesinterfacemod_MOD_restart() ???:0
3 0x0000000000548bb1 __clm_instmod_MOD_clm_instrest() ???:0
4 0x000000000061a748 __restfilemod_MOD_restfile_write() ???:0
5 0x0000000000545c56 __clm_initializemod_MOD_initialize2() ???:0
6 0x0000000000512b85 __lnd_comp_nuopc_MOD_initializerealize() lnd_comp_nuopc.F90:0
7 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
8 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
9 0x0000000000866c3b ESMCI::VMK::enter() ???:0
10 0x000000000087da87 ESMCI::VM::enter() ???:0
11 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
12 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
13 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
14 0x00000000010da7ec __nuopc_driver_MOD_loopmodelcompss() NUOPC_Driver.F90:0
15 0x00000000010dd19c __nuopc_driver_MOD_initializeipdv02p3() NUOPC_Driver.F90:0
16 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
17 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
18 0x0000000000866c3b ESMCI::VMK::enter() ???:0
19 0x000000000087da87 ESMCI::VM::enter() ???:0
20 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
21 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
22 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
23 0x00000000010da7ec __nuopc_driver_MOD_loopmodelcompss() NUOPC_Driver.F90:0
24 0x00000000010dd2b9 __nuopc_driver_MOD_initializeipdv02p3() NUOPC_Driver.F90:0
25 0x0000000001110e25 __nuopc_driver_MOD_initializegeneric() NUOPC_Driver.F90:0
26 0x000000000057c888 ESMCI::FTable::callVFuncPtr() ???:0
27 0x000000000057cbbc ESMCI_FTableCallEntryPointVMHop() ???:0
28 0x0000000000866c3b ESMCI::VMK::enter() ???:0
29 0x000000000087da87 ESMCI::VM::enter() ???:0
30 0x000000000057b4b5 c_esmc_ftablecallentrypointvm_() ???:0
31 0x0000000000a6b8fc __esmf_compmod_MOD_esmf_compexecute() ???:0
32 0x0000000000c9a946 __esmf_gridcompmod_MOD_esmf_gridcompinitialize() ???:0
33 0x000000000041d0bc MAIN__() esmApp.F90:0
34 0x0000000000411ae7 main() ???:0
35 0x000000000002a610 __libc_start_call_main() ???:0
36 0x000000000002a6c0 __libc_start_main_alias_2() :0
37 0x0000000000411b15 _start() ???:0
=================================

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
srun: error: node714: task 20: Segmentation fault (core dumped)
srun: Terminating StepId=11979402.0
[2026-04-23T12:17:09.755] error: *** STEP 11979402.0 ON node714 CANCELLED AT 2026-04-23T12:17:09 DUE TO TASK FAILURE ***
srun: error: node714: tasks 0-19,21-39: Terminated
srun: Force Terminated StepId=11979402.0
 


slevis

Moderator
Staff member
Since you mentioned an inconsistency with mosart, I definitely recommend resolving that first. In your case's env_build.xml, change MOSART_MODE to NULL. I hope that resolves the mosart problem and possibly the whole crash. If not, I would next change your RUN_TYPE from hybrid to startup, and then specify the finidat you want in the case's user_nl_clm.
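For reference, the changes suggested above could be scripted roughly as follows. This is only a sketch, assuming it is pointed at a case directory that contains the usual xmlchange tool; the helper names and the finidat path in the usage note are hypothetical, and by default nothing is modified (dry run), the commands are only printed.

```python
import subprocess
from pathlib import Path


def suggested_changes(finidat):
    """Return the xmlchange commands and the user_nl_clm line described above."""
    commands = [
        ["./xmlchange", "MOSART_MODE=NULL"],  # setting suggested for the mosart mismatch
        ["./xmlchange", "RUN_TYPE=startup"],  # startup instead of hybrid
    ]
    namelist_line = f"finidat = '{finidat}'"
    return commands, namelist_line


def apply_changes(case_dir, finidat, dry_run=True):
    """Print the changes; only touch the case when dry_run is False."""
    commands, namelist_line = suggested_changes(finidat)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Runs the case's own xmlchange script from inside the case directory.
            subprocess.run(cmd, cwd=case_dir, check=True)
    print(f"user_nl_clm: {namelist_line}")
    if not dry_run:
        # Append the initial-conditions file to the CLM namelist overrides.
        with open(Path(case_dir) / "user_nl_clm", "a") as nl:
            nl.write(namelist_line + "\n")
    return commands, namelist_line
```

Calling `apply_changes("/path/to/case", "/path/to/restart_file.nc")` only prints what would be done; passing `dry_run=False` would actually run the commands and append to user_nl_clm.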

If the run still fails, here's some more general troubleshooting that I do with all my cases. I like looking at diff case_new case_works > dif.out at the whole directory level. Scroll through the dif.out file and see if anything stands out as strange. The most interesting diffs usually appear in env_run.xml and in the user_nl_* files, though the mosart difference would appear in env_build.xml, as I mentioned above.
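The same comparison can be narrowed to the files that usually matter. The sketch below is not part of CTSM or CIME; it assumes two case directories and compares only their top-level env_*.xml and user_nl_* files using Python's standard difflib. The function names are made up for illustration.

```python
import difflib
import fnmatch
from pathlib import Path

# File name patterns that usually hold the interesting differences.
PATTERNS = ("env_*.xml", "user_nl_*")


def config_files(case_dir):
    """Map file name -> path for the matching top-level files of a case."""
    return {
        p.name: p
        for p in Path(case_dir).iterdir()
        if p.is_file() and any(fnmatch.fnmatch(p.name, pat) for pat in PATTERNS)
    }


def diff_cases(case_new, case_works):
    """Return unified-diff lines (working case -> new case) for the config files."""
    new, works = config_files(case_new), config_files(case_works)
    report = []
    for name in sorted(set(new) | set(works)):
        if name not in new or name not in works:
            missing = case_new if name not in new else case_works
            report.append(f"{name}: missing in {missing}")
            continue
        old_lines = works[name].read_text().splitlines()
        new_lines = new[name].read_text().splitlines()
        report.extend(
            difflib.unified_diff(
                old_lines,
                new_lines,
                fromfile=f"{case_works}/{name}",
                tofile=f"{case_new}/{name}",
                lineterm="",
            )
        )
    return report
```

Printing `"\n".join(diff_cases("case_new", "case_works"))` gives output similar to dif.out, with lines starting with `+` being settings that exist only in the new case.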
 

stevenDH

Member
Hi Sam
Thanks so much for the quick reply. Both the MOSART_MODE change and RUN_TYPE=startup failed identically, which made clear that the problem had to be related to my finidat not being read in properly. I tried your troubleshooting trick (I didn't realise diff worked at directory level, that's really useful!) and didn't find any notable differences, except that in my restart run I had use_init_interp set to true in my user_nl_clm. I don't recall exactly why I added that, so I removed it again and now the run initialises as it should!

Many thanks for the help in troubleshooting this, it's much appreciated!
Cheers
Steven
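A quick way to spot a leftover setting like this before submitting is to list what a user_nl_clm file actually sets. The following is a minimal sketch, not a full Fortran namelist parser: it assumes one `name = value` assignment per line, treats lines starting with `!` as comments, and does not handle inline comments. The function names are hypothetical.

```python
import re

# One "name = value" assignment per line; the value is kept as written.
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*(.*\S)\s*$")


def namelist_settings(text):
    """Return {name: value} for the uncommented assignments in a user_nl file."""
    settings = {}
    for line in text.splitlines():
        if line.strip().startswith("!"):
            continue  # whole-line comment
        match = ASSIGNMENT.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def init_interp_enabled(text):
    """True if use_init_interp is set to a true value in the namelist text."""
    value = namelist_settings(text).get("use_init_interp", "")
    return value.lower() in {".true.", "true", ".t.", "t"}
```

For example, `init_interp_enabled(open("user_nl_clm").read())` would return True for the setup that crashed in this thread and False after the line was removed.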
 
Solution