
CLM run crashes after first resubmit

James King
Member
Hi all,

I'm doing a regional CLM-only run over Africa with anomaly forcing from DATM. To run from 2015-2100 I separate the run into 17-year blocks and set RESUBMIT=4. The model runs fine for the initial set of 17 years and writes history files as expected. However, when the first resubmitted run starts, the model crashes almost immediately with what looks to be an MPI error. In the cesm.log file:

37: NetCDF: Invalid dimension ID or name
37: NetCDF: Variable not found
37: NetCDF: Variable not found
0:(seq_domain_areafactinit) : min/max mdl2drv 1.00000000000000 1.00000000000000 areafact_a_ATM
0:(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00000000000000 areafact_a_ATM
1246:MPT ERROR: Rank 1246(g:1246) received signal SIGSEGV(11).
1246: Process ID: 45777, Host: r11i2n22, Program: /glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.002/bld/cesm.exe
1246: MPT Version: HPE MPT 2.21 11/28/19 04:21:40
1246:
1246:MPT: --------stack traceback-------
1680:MPT ERROR: Rank 1680(g:1680) received signal SIGSEGV(11).
1680: Process ID: 32512, Host: r11i4n13, Program: /glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.002/bld/cesm.exe
1680: MPT Version: HPE MPT 2.21 11/28/19 04:21:40
1680:
1680:MPT: --------stack traceback-------
1605:MCT::m_Rearranger::Rearrange_: TargetAV size is not appropriate for this Rearranger
1605:MCT::m_Rearranger::Rearrange_: error, InRearranger%RecvRouter%lAvsize=3, AttrVect_lsize(TargetAV)=0.
1605:645.MCT(MPEU)::die.: from MCT::m_Rearranger::Rearrange_()
1605:MPT ERROR: Rank 1605(g:1605) is aborting with error code 2.
1605: Process ID: 30829, Host: r12i4n15, Program: /glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.002/bld/cesm.exe
1605: MPT Version: HPE MPT 2.21 11/28/19 04:21:40


etc. I've been told previously not to worry about the NetCDF errors here, and there are no error messages in any of the component logs. My question is: why does the model work on the initial submission but not the resubmission, and is there anything I can change that might persuade it to keep running for more than 17 years? I am using CESM2.2.0 on Cheyenne.
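For reference, the segmentation described above corresponds to CIME settings along these lines (a sketch run from the case directory; the exact dates and values in the failing case may differ):

```shell
# Run 2015 onward as five 17-year segments: one initial run plus four resubmits.
# Standard CIME xmlchange commands; CIME flips CONTINUE_RUN=TRUE on resubmit.
./xmlchange RUN_STARTDATE=2015-01-01
./xmlchange STOP_OPTION=nyears
./xmlchange STOP_N=17
./xmlchange RESUBMIT=4
./case.submit
```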

Many thanks,

James
 

Keith Oleson
CSEG and Liaisons
Staff member
I was looking at /glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.002/run but the directory seems to be empty. Do you have a case with the error that I can look at? Thanks.
 

James King
Member
Hi Keith,

That's because I cleared the directory out in the process of rebuilding the case to try a few ideas for fixing it. The case directory

/glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.005/run

is part of the same set of experiments and gave the same error message.

Thanks,

James
 

Keith Oleson
CSEG and Liaisons
Staff member
Got it, thanks. I don't see anything useful either; I'm not familiar with this rearranger error.
Can you recompile in DEBUG mode, restart again, and see if you get a better traceback?
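If it helps, the usual recipe for a debug rebuild with CIME is roughly the following (a sketch, assuming a standard case directory):

```shell
# Enable debug compilation (bounds checking, better tracebacks), then rebuild
./xmlchange DEBUG=TRUE
./case.build --clean-all
./case.build
./case.submit
```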
 

Keith Oleson
CSEG and Liaisons
Staff member
You could also try something like a 1-month run and then a 1-month continue run, with DEBUG on in both cases, to simplify the troubleshooting.
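That short test could be set up roughly like this (a sketch; the second submission picks up from the restart files written by the first):

```shell
# Short initial run in debug mode
./xmlchange DEBUG=TRUE
./case.build --clean-all && ./case.build
./xmlchange STOP_OPTION=nmonths,STOP_N=1,RESUBMIT=0
./case.submit

# ...after it completes, a 1-month continue run from the restarts
./xmlchange CONTINUE_RUN=TRUE
./case.submit
```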
 

James King
Member
Hi Keith,

Thanks for your suggestions. I ran for 2 months in debug mode as suggested and the cesm.log file ended with lots of lines like this:

1548:forrtl: severe (408): fort: (2): Subscript #1 of the array HISTO has value 57990 which is greater than the upper bound of 40633
1548:
1548:Image PC Routine Line Source
1548:cesm.exe 000000000460F586 Unknown Unknown Unknown
1548:cesm.exe 0000000000B697EC histfilemod_mp_hf 3024 histFileMod.F90
1548:cesm.exe 0000000000B82B26 histfilemod_mp_hi 3530 histFileMod.F90
1548:cesm.exe 00000000008F6133 clm_driver_mp_clm 1348 clm_driver.F90
1548:cesm.exe 000000000088921B lnd_comp_mct_mp_l 457 lnd_comp_mct.F90
1548:cesm.exe 0000000000467C55 component_mod_mp_ 737 component_mod.F90
1548:cesm.exe 000000000042F481 cime_comp_mod_mp_ 2626 cime_comp_mod.F90
1548:cesm.exe 000000000044F79C MAIN__ 133 cime_driver.F90
1548:cesm.exe 0000000000407BE2 Unknown Unknown Unknown
1548:libc-2.22.so 00002B12CFC47A35 __libc_start_main Unknown Unknown
1548:cesm.exe 0000000000407AE9 Unknown Unknown Unknown
 

Keith Oleson
CSEG and Liaisons
Staff member
One possible problem I see now is that you are requesting PFT-level output for at least one variable that is only available at the column level or higher (TOTLITC). We've seen that cause strange behavior before, and the model doesn't warn you about the problem. So I suggest you go through all of your variables and make sure the subgrid level you are requesting is actually available for each one.
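As an illustration, history variables can be split across tapes by subgrid level in user_nl_clm. This is a sketch only — the tape numbers and the variables on tape 2 are hypothetical examples of PFT-level fields, not the actual output list from this case:

```fortran
! Tape 1: default grid-cell-averaged output; keep column-level-only
! fields such as TOTLITC here
hist_fincl1 = 'TOTLITC'

! Tape 2: 1D vector output at PFT level, only for variables that are
! actually defined at the PFT level
hist_fincl2 = 'GPP', 'TLAI'
hist_dov2xy(2) = .false.
hist_type1d_pertape(2) = 'PFTS'
```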
 