Hi all,
I'm doing a regional CLM-only run over Africa with anomaly forcing from DATM. To run from 2015 to 2100, I split the run into 17-year blocks and set RESUBMIT=4. The model runs fine for the initial 17 years and writes history files as expected. However, when the first resubmitted run starts, the model crashes almost immediately with what looks like an MPI error. In the cesm.log file:
37: NetCDF: Invalid dimension ID or name
37: NetCDF: Variable not found
37: NetCDF: Variable not found
0:(seq_domain_areafactinit) : min/max mdl2drv 1.00000000000000 1.00000000000000 areafact_a_ATM
0:(seq_domain_areafactinit) : min/max drv2mdl 1.00000000000000 1.00000000000000 areafact_a_ATM
1246:MPT ERROR: Rank 1246(g:1246) received signal SIGSEGV(11).
1246: Process ID: 45777, Host: r11i2n22, Program: /glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.002/bld/cesm.exe
1246: MPT Version: HPE MPT 2.21 11/28/19 04:21:40
1246:
1246:MPT: --------stack traceback-------
1680:MPT ERROR: Rank 1680(g:1680) received signal SIGSEGV(11).
1680: Process ID: 32512, Host: r11i4n13, Program: /glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.002/bld/cesm.exe
1680: MPT Version: HPE MPT 2.21 11/28/19 04:21:40
1680:
1680:MPT: --------stack traceback-------
1605:MCT::m_Rearranger::Rearrange_: TargetAV size is not appropriate for this Rearranger
1605:MCT::m_Rearranger::Rearrange_: error, InRearranger%RecvRouter%lAvsize=3, AttrVect_lsize(TargetAV)=0.
1605:645.MCT(MPEU)::die.: from MCT::m_Rearranger::Rearrange_()
1605:MPT ERROR: Rank 1605(g:1605) is aborting with error code 2.
1605: Process ID: 30829, Host: r12i4n15, Program: /glade/scratch/jamesking/i.clm5.AfrSSP370_climate_CO2_noLULCC.002/bld/cesm.exe
1605: MPT Version: HPE MPT 2.21 11/28/19 04:21:40
etc. I've previously been told not to worry about the NetCDF errors here, and there are no error messages in any of the component logs. My question is: why does the model work in the initial submission but not in the resubmission, and is there anything I can change to get it to continue running beyond the first 17 years? I am using CESM2.2.0 on Cheyenne.
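To make the run segmentation concrete, here is a minimal Python sketch of the block boundaries I expect from the settings above. It assumes each block starts on 1 January and that RESUBMIT counts resubmissions in addition to the initial submission; the function name `run_blocks` is just for illustration and is not part of CESM or CIME.

```python
def run_blocks(start_year, block_years, resubmit):
    """Return (first_year, last_year) for each submission.

    Assumption: the initial submission is followed by `resubmit`
    resubmissions, each covering `block_years` whole model years.
    """
    blocks = []
    for i in range(resubmit + 1):
        first = start_year + i * block_years
        blocks.append((first, first + block_years - 1))
    return blocks


# Settings from this post: start in 2015, 17-year blocks, RESUBMIT=4.
blocks = run_blocks(2015, 17, 4)
for n, (first, last) in enumerate(blocks):
    label = "initial" if n == 0 else f"resubmit {n}"
    print(f"{label}: {first}-{last}")
```

Under those assumptions the five submissions cover 2015 through 2099, and the crash I'm seeing happens at the start of the second block, i.e. the one beginning in 2032.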
Many thanks,
James