ERROR: component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global

ganbaranaito

takufuu
Member
Hello, everyone.

I want to run a pre-industrial (PI) experiment with the B1850 compset. My case built and started running without errors. However, it crashed after 3 model years with the following error in cesm.log: "ERROR: component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global". I don't know how to solve it. Can you give me some suggestions? Thank you.

RUN_TYPE is 'hybrid' and RUN_REFCASE is 'b.e20.B1850.f19_g17.release_cesm2_1_0.020'. Both are the default settings. I don't know whether they could cause the error.


1. The following is every step I took:

./create_newcase --case $CASEROOT --compset B1850 --res f19_g17 --mach NJU

cd $CASEROOT

./xmlchange --file env_run.xml --id DIN_LOC_ROOT --val $INPUTDIR
./xmlchange --file env_run.xml --id RUNDIR --val $RUNDIR

./xmlchange NTASKS_ATM=168,NTHRDS_ATM=1,ROOTPE_ATM=0
./xmlchange NTASKS_ICE=168,NTHRDS_ICE=1,ROOTPE_ICE=0
./xmlchange NTASKS_LND=168,NTHRDS_LND=1,ROOTPE_LND=0
./xmlchange NTASKS_CPL=168,NTHRDS_CPL=1,ROOTPE_CPL=0
./xmlchange NTASKS_ROF=168,NTHRDS_ROF=1,ROOTPE_ROF=0
./xmlchange NTASKS_OCN=168,NTHRDS_OCN=1,ROOTPE_OCN=0
./xmlchange NTASKS_GLC=168,NTHRDS_GLC=1,ROOTPE_GLC=0
./xmlchange NTASKS_WAV=168,NTHRDS_WAV=1,ROOTPE_WAV=0
./xmlchange NTASKS_ESP=168,NTHRDS_ESP=1,ROOTPE_ESP=0

./case.setup

./case.build

./xmlchange --file env_run.xml --id RESUBMIT --val '0'
./xmlchange --file env_run.xml --id CONTINUE_RUN --val 'FALSE'
./xmlchange --file env_run.xml --id STOP_N --val '10'
./xmlchange --file env_run.xml --id STOP_OPTION --val 'nyears'
./xmlchange --file env_run.xml --id REST_N --val '6'
./xmlchange --file env_run.xml --id REST_OPTION --val 'nmonth'
./xmlchange --file env_run.xml --id DOUT_S --val 'FALSE'

./case.submit


2. The cesm.log file is too large to upload, so here are the relevant error messages:

xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4619
ERROR:
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 4333
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4763
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4331
index: 4188
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4621
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4189
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4907
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4765
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4477
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4909
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
xm_wpxp band solver: singular matrix
wp2_wp3 band solver: singular matrix
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Sa_z 1d global
index: 4473
Image PC Routine Line Source
cesm.exe 0000000002F89744 Unknown Unknown Unknown
cesm.exe 0000000002C1655E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 0000000000435EB0 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000431C87 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041885D cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000431557 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000414BDE Unknown Unknown Unknown
libc-2.17.so 00002B2557148B35 __libc_start_main Unknown Unknown
cesm.exe 0000000000414AE9 Unknown Unknown Unknown
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 129
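
Because the full log is too large to share, a quick tally of which coupler fields are reported as NaN can help narrow things down. The following is a minimal sketch, assuming the log lines have the same form as the excerpt above. The helper name count_nan_fields is hypothetical and not part of CESM.

```shell
#!/bin/sh
# Tally how often each coupler field is reported as NaN in a CESM log.
# count_nan_fields is a hypothetical helper name, not part of CESM.
count_nan_fields() {
  awk '/NaN found/ {
         # The field name is the token right after the word "field".
         for (i = 1; i < NF; i++)
           if ($i == "field") count[$(i + 1)]++
       }
       END { for (f in count) print f, count[f] }' "$1" | sort
}
```

Calling it as `count_nan_fields cesm.log` prints one line per field name followed by its count. If a single field such as Sa_z dominates, the blow-up most likely starts in the component that sends that field.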

I am looking forward to your replies! Thanks.

Gaya
 

Liu W

liuwei
Member
I also get a check_fields error, with NaN found in the WAV instance, using CESM2.1.3. The error can also be reproduced in startup runs, where it appears at random model times. Can you run B1850 successfully as a startup run?
 

ganbaranaito

takufuu
Member
Yes, I can run B1850 as a startup run without errors, and it has now run for more than 30 years. By the way, every step I took for the startup run was the same as for the hybrid run, except for RUN_TYPE.
 

ganbaranaito

takufuu
Member
@Liu W Hello, have you solved this problem? It is still troubling me. Any advice? Thank you in advance.
 

katec

CSEG and Liaisons
Staff member
Hi there, I have seen this error before, and it was caused by a mismatch between the startup land files and the version of the land model being used. In your case, I suspect that your ref case, which was made with CESM 2.0, may not work very well with CESM 2.1. On Cheyenne, we have this ref case for B1850:
b.e21.B1850.f19_g17.CMIP6-piControl-2deg.001
Can you give this one a try?
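
For reference, pointing a hybrid case at a different reference case is usually done with xmlchange from the case directory. A minimal sketch follows. REFDATE and REFDIR are placeholders (assumptions, not values given in this thread), and the restart files must already be on your machine.

```shell
# Run from the case directory ($CASEROOT).
# REFDATE and REFDIR are placeholders: use the date of the restart set you
# actually have and the directory where you staged it.
REFCASE=b.e21.B1850.f19_g17.CMIP6-piControl-2deg.001
REFDATE=YYYY-MM-DD
REFDIR=/path/to/staged/restarts

./xmlchange RUN_TYPE=hybrid
./xmlchange RUN_REFCASE=$REFCASE,RUN_REFDATE=$REFDATE

# Stage the files by hand instead of letting the scripts fetch them.
./xmlchange GET_REFCASE=FALSE
cp "$REFDIR"/* "$RUNDIR"/

# Confirm the settings before submitting.
./xmlquery RUN_TYPE,RUN_REFCASE,RUN_REFDATE,GET_REFCASE
```

With GET_REFCASE=FALSE, the restart and rpointer files of the reference case have to be in $RUNDIR before ./case.submit.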
 

katec

CSEG and Liaisons
Staff member
@Liu W This sounds like a bit of an instability in the WAV model. Consider using the CESM2.1 ref case above, and if that doesn't help, you may need to increase the timestep for the WAV model to ensure stability.
 

ganbaranaito

takufuu
Member
Thank you for your reply!
I don't have an account on Cheyenne. Could you share these restart files with me?
Thank you in advance!
 

Liu W

liuwei
Member
@katec Thanks for your reply! I used the CESM2-CMIP6 PI control case (b.e21.B1850.f09_g17.CMIP6-piControl.001) but got the same error. I also tried increasing the timestep for the WAV model (from 1800s to 3600s and 7200s), but that didn't work. In addition, each time I changed the timestep, the model terminated at a different time. Should I keep increasing the timestep for the WAV model? By the way, will increasing the WAV timestep significantly change the simulation results? I noticed that the coupling frequency of the WAV model in CESM1 is daily.
 

dbailey

CSEG and Liaisons
Staff member
I am moving this to the CAM forum and hopefully someone can answer this there. This sounds like an intermittent issue in the CAM radiation code. Sometimes changing pertlim in user_nl_cam will fix this.
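
For readers unfamiliar with it, pertlim is a CAM namelist variable that applies a small random perturbation to the initial temperature field. It is set in user_nl_cam in the case directory. A minimal sketch, where the value 1.0e-14 is only an illustrative roundoff-level choice (an assumption, not a value given in this thread):

```shell
# Append a pertlim setting to user_nl_cam (run from the case directory).
# 1.0e-14 is an illustrative value, not a recommendation from this thread.
cat >> user_nl_cam <<'EOF'
pertlim = 1.0e-14
EOF
```

The change takes effect the next time the namelists are generated, for example by ./preview_namelists or ./case.submit.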
 

ganbaranaito

takufuu
Member
Yes, it is indeed an intermittent issue. For example, one run crashed at model year 23. If I restart this run from the restart file for 23-01-01, it runs past the point where it crashed. Then it may crash at a different point (e.g. at model year 28 or later) or crash with a segmentation fault. I also tried changing pertlim, but that didn't fix the issue. Anyway, thank you for replying!
 

katec

CSEG and Liaisons
Staff member
Hi, it seems like there are a few different problems floating around in this thread. I'm going to focus on @ganbaranaito's problem because he started the thread. The main difference between a startup and a hybrid model run is the set of initial files that are read in to initialize the model. I still think you should try a different, more applicable reference case. Do you have a Globus endpoint or FTP server that I could send these files to? Another thing I see here is that you are running every component on the same 168 tasks. On Cheyenne, we offset some components and give them a bit more resources for a 2 degree run. The default layout on Cheyenne looks like:
./xmlchange NTASKS_ATM=288,NTHRDS_ATM=1,ROOTPE_ATM=0
./xmlchange NTASKS_ICE=108,NTHRDS_ICE=1,ROOTPE_ICE=144
./xmlchange NTASKS_LND=144,NTHRDS_LND=1,ROOTPE_LND=0
./xmlchange NTASKS_CPL=288,NTHRDS_CPL=1,ROOTPE_CPL=0
./xmlchange NTASKS_ROF=40,NTHRDS_ROF=1,ROOTPE_ROF=0
./xmlchange NTASKS_OCN=288,NTHRDS_OCN=1,ROOTPE_OCN=288
./xmlchange NTASKS_GLC=36,NTHRDS_GLC=1,ROOTPE_GLC=0
./xmlchange NTASKS_WAV=36,NTHRDS_WAV=1,ROOTPE_WAV=252
./xmlchange NTASKS_ESP=1,NTHRDS_ESP=1,ROOTPE_ESP=0
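
To see what this layout implies: each component occupies MPI tasks ROOTPE through ROOTPE + NTASKS - 1, so the job needs as many tasks as the largest value of ROOTPE + NTASKS. The sketch below works that out for the values above. The variable names are illustrative, and the table simply restates the numbers from the list.

```shell
# Columns: component NTASKS ROOTPE (values copied from the layout above).
layout='ATM 288 0
ICE 108 144
LND 144 0
CPL 288 0
ROF 40 0
OCN 288 288
GLC 36 0
WAV 36 252
ESP 1 0'

# Total MPI tasks = max over components of (ROOTPE + NTASKS).
total=$(printf '%s\n' "$layout" | awk '
  { last = $3 + $2; if (last > max) max = last }
  END { print max }')
echo "total MPI tasks: $total"   # prints: total MPI tasks: 576
```

So the ocean gets its own 288 tasks (288 to 575) and runs concurrently with the atmosphere, while land (0 to 143), sea ice (144 to 251), and waves (252 to 287) share the atmosphere's tasks 0 to 287. With NTASKS=168 and ROOTPE=0 for every component, all components instead share the same 168 tasks.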
 

ganbaranaito

takufuu
Member
Thank you for the very detailed suggestions. As for the b.e21 restart files, I saw that they have been updated on the SVN server. I will download them and give them a try.
 

Xun Li

Lix
New Member
Hello Liu, I have the same problem. Have you solved it?
 