
Startup run error in B1850

ganbaranaito

takufuu
Member
Hello everyone, I want to use the B1850 compset for a pre-industrial (PI) experiment. I chose 'startup' for RUNTYPE, and the model ran successfully for more than 60 model years, but it crashed in year 63. The main error messages are listed below. I don't know how to solve this, so any advice is welcome. Thanks in advance.

1. Errors in the cesm.log file:
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global index: 4765
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_dstwet3 1d global index: 4621
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global index: 4331
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global index: 4188
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global index: 4620
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global index: 4189
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global index: 4762
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_dstwet3 1d global index: 4477
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global index: 4474
Image PC Routine Line Source
cesm.exe 0000000002F89744 Unknown Unknown Unknown
cesm.exe 0000000002C1655E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 0000000000435EB0 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000431C87 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041885D cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000431557 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000414BDE Unknown Unknown Unknown
libc-2.17.so 00002B0655ABAB35 __libc_start_main Unknown Unknown
cesm.exe 0000000000414AE9 Unknown Unknown Unknown
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 129
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 29
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 60
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 36
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 64
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 98
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 104
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 105
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 111
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 75
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 71
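Each NaN message carries the coupler's global index of the bad ATM point, so it can be worth checking whether the failures cluster in one region. Below is a minimal shell sketch that lists field, index, and an approximate longitude/latitude for every message, and it tolerates messages wrapped across lines. The index-to-coordinate mapping is an assumption to verify against your grid: it treats the index as 1-based and longitude-fastest on a regular grid with 144 longitudes starting at 0 degrees E and 96 latitudes from -90 to 90 inclusive (the f19 finite-volume grid). The function name nan_points is hypothetical.

```sh
#!/bin/sh
# List field name, global index and approximate lon/lat for each
# "check_fields NaN found" message in a CESM log.
# ASSUMPTIONS (verify for your grid): the ATM global index is 1-based and
# longitude-fastest on a regular lat/lon grid with NLON longitudes starting
# at 0 degrees E and NLAT latitudes running from -90 to 90 inclusive.
# Defaults are for f19 (144 x 96).
nan_points() {
  log=$1
  nlon=${2:-144}
  nlat=${3:-96}
  awk -v nlon="$nlon" -v nlat="$nlat" '
    # Join all lines so messages wrapped across lines are still parsed.
    { buf = buf " " $0 }
    END {
      n = split(buf, parts, "check_fields +NaN +found +in +")
      for (k = 2; k <= n; k++) {
        m = split(parts[k], w, " ")
        field = ""; idx = ""
        for (i = 1; i < m; i++) {
          if (w[i] == "field" && field == "") field = w[i + 1]
          if (w[i] == "index:") { idx = w[i + 1]; break }
        }
        if (field == "" || idx == "") continue
        g = idx - 1                       # 0-based global index
        ilon = g % nlon                   # 0-based longitude column
        jlat = int(g / nlon)              # 0-based latitude row
        printf "%s %d lon=%.2f lat=%.2f\n", field, idx,
               ilon * 360 / nlon, -90 + jlat * 180 / (nlat - 1)
      }
    }
  ' "$log"
}
```

Usage: nan_points cesm.log | sort -u. Under the assumed mapping, index 4765 comes out as lon=30.00 lat=-27.47, and all nine indices in the log above fall between about 22.5 and 30 degrees E and 27 and 35 degrees S, i.e. over southern Africa; if the ordering assumption is wrong for your setup, the coordinates will be wrong too.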


2. Here are all the steps I took:

./create_newcase --case $CASEROOT --compset B1850 --res f19_g17 --mach NJU

cd $CASEROOT

./xmlchange --file env_run.xml --id DIN_LOC_ROOT --val $INPUTDIR
./xmlchange --file env_run.xml --id RUNDIR --val $RUNDIR
./xmlchange --file env_run.xml --id RUNTYPE --val 'startup'

./xmlchange NTASKS_ATM=168,NTHRDS_ATM=1,ROOTPE_ATM=0
./xmlchange NTASKS_ICE=168,NTHRDS_ICE=1,ROOTPE_ICE=0
./xmlchange NTASKS_LND=168,NTHRDS_LND=1,ROOTPE_LND=0
./xmlchange NTASKS_CPL=168,NTHRDS_CPL=1,ROOTPE_CPL=0
./xmlchange NTASKS_ROF=168,NTHRDS_ROF=1,ROOTPE_ROF=0
./xmlchange NTASKS_OCN=168,NTHRDS_OCN=1,ROOTPE_OCN=0
./xmlchange NTASKS_GLC=168,NTHRDS_GLC=1,ROOTPE_GLC=0
./xmlchange NTASKS_WAV=168,NTHRDS_WAV=1,ROOTPE_WAV=0
./xmlchange NTASKS_ESP=168,NTHRDS_ESP=1,ROOTPE_ESP=0

./case.setup

./case.build

./xmlchange --file env_run.xml --id RESUBMIT --val '9'
./xmlchange --file env_run.xml --id CONTINUE_RUN --val 'FALSE'
./xmlchange --file env_run.xml --id STOP_N --val '10'
./xmlchange --file env_run.xml --id STOP_OPTION --val 'nyears'
./xmlchange --file env_run.xml --id REST_N --val '5'
./xmlchange --file env_run.xml --id REST_OPTION --val 'nyears'
./xmlchange --file env_run.xml --id DOUT_S --val 'FALSE'

./case.submit

No other changes.
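The nine PE-layout commands differ only in the component name, so they can be written as one loop. A minimal sketch: set_uniform_pes and the XMLCHANGE override variable are hypothetical helpers for illustration, not part of CIME; the component list and the NTASKS/NTHRDS/ROOTPE values are the ones used above.

```sh
#!/bin/sh
# Apply the same PE layout (NTASKS tasks, 1 thread, root PE 0) to every
# component. Run from $CASEROOT before ./case.setup.
# XMLCHANGE defaults to ./xmlchange; set XMLCHANGE=echo for a dry run
# that only prints the argument strings.
set_uniform_pes() {
  ntasks=$1
  xmlchange_cmd=${XMLCHANGE:-./xmlchange}
  for comp in ATM ICE LND CPL ROF OCN GLC WAV ESP; do
    $xmlchange_cmd "NTASKS_${comp}=${ntasks},NTHRDS_${comp}=1,ROOTPE_${comp}=0" || return 1
  done
}
```

Usage: set_uniform_pes 168 reproduces the nine xmlchange lines above; XMLCHANGE=echo set_uniform_pes 168 prints them without touching the case.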
 

sacks

Bill Sacks
CSEG and Liaisons
Staff member
I am transferring this to the atmosphere (CAM) forums, because I think the "singular matrix" errors are coming from CAM.

Can you please add some of the other information requested in the post "Information to include in help requests", including:
- What model version are you using?
- What compiler and compiler version are you using?

Please also look through that post for the other information requested there.
 

ganbaranaito

takufuu
Member
Thank you for your reply!
The model version is CESM2.1.3 and the compiler is Intel 18.0.0. The library versions are netCDF 4.6.2, PnetCDF 1.8.1, and HDF5 1.10.2.
The log files are too large to be uploaded.

The errors at the end of the cesm.log file are given above.
Here is the end of the atm.log file:

240000 620314
Total Mass= 985.179061840328 (mb), Dry Mass= 982.880000048656 (mb)
Total Precipitable Water = 23.4450772950108 (kg/m**2)
PS max = 1038.99950750644 min = 548.817888738102
U max = 73.8815154907612 min = -70.5741671015342
V max = 56.8320794416801 min = -69.9922782796526
T max = 307.246674863901 min = 184.452984615097
W (mb/day) max = 664.672102643920 min = -922.700451008047
Average Height (geopotential units) = 1237.30659359359
PRECC max = 45.0360539930045 min = 0.000000000000000E+000
PRECL max = 65.5661268350766 min = 0.000000000000000E+000
Total precp= 2.63017809108050 CON= 1.44361683837637 LS= 1.18656125270413

nstep, te 1072224 0.26033455325552850E+10 0.26033478275472641E+10 0.12720019618471122E-03 0.98517906184032807E+05
chem_surfvals_set: ncdate= 620315 co2vmr= 2.847000000000000E-004
READ_NEXT_TRCDATA CH4_CHML
READ_NEXT_TRCDATA contvolcano
READ_NEXT_TRCDATA contvolcano
READ_NEXT_TRCDATA contvolcano
READ_NEXT_TRCDATA contvolcano
READ_NEXT_TRCDATA contvolcano
nstep, te 1072225 0.26033084301162963E+10 0.26033104844930463E+10 0.11386408258086426E-03 0.98517894675270363E+05
nstep, te 1072226 0.26032710017447295E+10 0.26032730139075894E+10 0.11152438458667965E-03 0.98517884076783856E+05
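The atm.log tail carries two time stamps that can be cross-checked: 620314 appears to be a date in yymmdd form (year 62, March 14), and the nstep counter should agree with it. A minimal sketch, assuming a 365-day (noleap) calendar, a 0001-01-01 start date, 48 atmosphere steps per day (1800 s; ATM_NCPL in env_run.xml is one place to check this), and equal STOP_N-year submissions starting in year 1. The function names are hypothetical.

```sh
#!/bin/sh
# Map a CAM step count to a model date and to the resubmission segment.
# ASSUMPTIONS: noleap (365-day) calendar, run start 0001-01-01 00:00,
# a fixed number of atmosphere steps per day (48 = 1800 s in the usage),
# and equal-length segments of STOP_N years starting in year 1.
nstep_to_date() {
  nstep=$1
  steps_per_day=$2
  days=$((nstep / steps_per_day))     # whole days elapsed since the start
  year=$((1 + days / 365))            # 365 days per year, no leap days
  doy=$((days % 365))                 # 0-based day within that year
  month=1
  for mdays in 31 28 31 30 31 30 31 31 30 31 30 31; do
    if [ "$doy" -lt "$mdays" ]; then
      break
    fi
    doy=$((doy - mdays))
    month=$((month + 1))
  done
  printf '%04d-%02d-%02d\n' "$year" "$month" "$((doy + 1))"
}

# Which STOP_N-year submission (1-based) a given model year falls in.
run_segment() {
  year=$1
  stop_n=$2
  echo $(( (year - 1) / stop_n + 1 ))
}

# Usage:
#   nstep_to_date 1072224 48   # prints 0062-03-15
#   run_segment 62 10          # prints 7
```

Under these assumptions, nstep 1072224 corresponds to the start of 0062-03-15, consistent with the 620314 stamp just before it, and year 62 falls in the seventh of the ten 10-year submissions (STOP_N=10 with RESUBMIT=9 gives 10 x 10 = 100 model years in total).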
 

katec

CSEG and Liaisons
Staff member
Often when we see NaNs in these fields in the coupler, it is because of an instability in the coupled system. These instabilities can occur because of mismatches in the initial conditions, or occasionally they just pop up randomly in some runs. So, start by making sure your initial start-up land and atm files match your resolution and model version. Then, you could try perturbing a few parameters to see if that gets you past the instability. If your model crashed at year 63, I might try restarting from year 60 and setting clubb_gamma_coef to a value 0.000003 larger than the current one (you can find the value of clubb_gamma_coef in the atm_in file in your case directory).
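The parameter perturbation can be scripted so the new value is computed from the one actually in use. A minimal sketch, assuming atm_in contains a line of the form "clubb_gamma_coef = <value>" (possibly with a Fortran D exponent); it prints a line meant to be appended to user_nl_cam, the usual place for CAM namelist changes. The helper name is hypothetical, and the restart from year 60 itself is not covered here.

```sh
#!/bin/sh
# Print a user_nl_cam line that raises clubb_gamma_coef by a small amount.
# ASSUMPTIONS: the namelist line looks like "clubb_gamma_coef = <number>"
# (Fortran D exponents allowed); the default delta 0.000003 follows the
# suggestion above. Returns non-zero if the parameter is not found.
perturbed_gamma_line() {
  atm_in=$1
  delta=${2:-0.000003}
  awk -v delta="$delta" '
    tolower($0) ~ /^[ \t]*clubb_gamma_coef[ \t]*=/ {
      split($0, kv, "=")
      v = kv[2]
      gsub(/[ \t,]/, "", v)     # drop whitespace and a trailing comma
      gsub(/[dD]/, "e", v)      # Fortran 1.0D0 -> awk-readable 1.0e0
      printf "clubb_gamma_coef = %.9g\n", v + delta
      found = 1
      exit
    }
    END { if (!found) exit 1 }
  ' "$atm_in"
}
```

Usage (file location assumed): perturbed_gamma_line CaseDocs/atm_in >> user_nl_cam, then ./preview_namelists to confirm the new value appears in the generated atm_in.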

Also, look at the model output from just before the crash. If you see any regions with very high winds, that could indicate where the instability is forming and what is needed to address/solve it.
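Locating the winds requires the history files, but a quick first pass is possible with the global wind extrema CAM already prints to atm.log (the "U max = ... min = ..." lines in the tail above). This is a different, coarser check than looking at the output: it shows when the extrema grow, not where. A minimal sketch; the 150 m/s default is a hypothetical cutoff, not a model threshold, and the function name is hypothetical.

```sh
#!/bin/sh
# Print atm.log lines whose global U or V extrema exceed a speed limit.
# LIMIT is a hypothetical cutoff in m/s (default 150); choose your own.
flag_high_winds() {
  log=$1
  limit=${2:-150}
  awk -v limit="$limit" '
    # Matches lines like: " U max = 73.88 min = -70.57"
    ($1 == "U" || $1 == "V") && $2 == "max" && $5 == "min" {
      vmax = $4 + 0
      vmin = $7 + 0
      if (vmax > limit || -vmin > limit)
        printf "line %d: %s max=%s min=%s\n", NR, $1, $4, $7
    }
  ' "$log"
}
```

Usage: flag_high_winds atm.log 150 | tail lists the most recent days whose extrema exceed the cutoff; the dates of those days then suggest which history files to inspect for the region.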

Also, why do you change the RUNDIR? Are you pointing to a pre-compiled executable? Consider running a case where you don't change RUNDIR and let the model build into the same RUNDIR as the case originally expected.
 