NaN in Faxa_bcphiwet and Faxa_bcphidry

YiyangSun

New Member
Hello everyone.
I am using CESM 2.1.3 with a B2000 compset. I created the case with `./create_newcase --case casename --res f19_g17 --compset 2000_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO%ABIO-DIC_MOSART_CISM2%NOEVOLVE_WW3_BGC%BDRD --run-unsupported`.
I am running the case as a startup run, but it fails with the following errors:
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13090
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13238
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13091
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13375
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13377
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13234
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13383
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13233
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13087
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13381
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13379
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13788
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13665
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13802
Image PC Routine Line Source
libpnetcdf.so.4.0 00002B1A4B6AF82B tracebackqq_ Unknown Unknown
cesm.exe 0000000002E0AE1E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 000000000043A510 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000436200 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041BBDE cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000435A37 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000417F62 Unknown Unknown Unknown
libc-2.31.so 00002B1A4DBB0DCD __libc_start_main Unknown Unknown
cesm.exe 0000000000417E6A Unknown Unknown Unknown
Image PC Routine Line Source
libpnetcdf.so.4.0 00002B01A4F8182B tracebackqq_ Unknown Unknown
cesm.exe 0000000002E0AE1E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 000000000043A510 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000436200 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041BBDE cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000435A37 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000417F62 Unknown Unknown Unknown
libc-2.31.so 00002B01A7482DCD __libc_start_main Unknown Unknown
cesm.exe 0000000000417E6A Unknown Unknown Unknown
Image PC Routine Line Source
libpnetcdf.so.4.0 00002B938EF0982B tracebackqq_ Unknown Unknown
cesm.exe 0000000002E0AE1E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 000000000043A510 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000436200 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041BBDE cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000435A37 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000417F62 Unknown Unknown Unknown
libc-2.31.so 00002B939140ADCD __libc_start_main Unknown Unknown
cesm.exe 0000000000417E6A Unknown Unknown Unknown
Abort(1001) on node 326 (rank 326 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 326
Abort(1001) on node 250 (rank 250 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 250
The strange thing is that I could successfully run a 5-day test, but when I tried to continue it with CONTINUE_RUN, the case failed. I then set CONTINUE_RUN back to FALSE and tried a 6-month test, and it still failed. I ran `./check_inputdata --chksum` and it reported nothing wrong. Is there anything I can do to fix this? The input data was downloaded with svn (Revision 70792: /trunk/inputdata).
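A quick way to see whether the NaNs are scattered or clustered is to tally the check_fields messages and convert each 1d global index into a grid position. The sketch below does this for three of the entries from the log above. It assumes the f19 atmosphere grid has 144 longitudes by 96 latitudes, and that the 1-based index runs fastest along longitude starting from the southernmost row. The name `summarize_nans` and the regular expressions are illustrative and not part of CESM.

```python
import re
from collections import Counter

# Three entries copied from the log above; paste the full log to analyze everything.
LOG = """\
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13090
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13238
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13802
"""

FIELD_RE = re.compile(r"NaN found in (\w+) instance:\s*\d+\s+field (\w+)")
INDEX_RE = re.compile(r"1d global index:\s*(\d+)")


def summarize_nans(text, nlon=144):
    """Count NaN reports per field and map each 1d index to (index, row, col).

    nlon=144 assumes the f19 (1.9x2.5) grid and longitude-fastest ordering;
    both are assumptions, not something the log itself states.
    """
    fields = Counter()
    cells = []
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(2)] += 1
            continue
        match = INDEX_RE.search(line)
        if match:
            idx = int(match.group(1))
            row, col = divmod(idx - 1, nlon)  # convert 1-based index to 0-based row/col
            cells.append((idx, row, col))
    return fields, cells


fields, cells = summarize_nans(LOG)
print(dict(fields))
for idx, row, col in cells:
    print(idx, "-> latitude row", row, "longitude column", col)
```

Under those assumptions, all of the indices in the log above (13087 to 13802) fall in latitude rows 90 to 95, the six northernmost rows, which may suggest a localized problem rather than a global one.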
 


hplin

Haipeng Lin
Moderator
Staff member
Hello Yiyang,

Thanks for writing. When you're running a 6 month simulation, at which point does the run crash? If you set ./xmlchange DEBUG=true and rebuild the model and run, does it give you a different error?
 

YiyangSun

New Member
Hello Mr. Lin,

Thanks for your reply! I have attached my atm.log in run(1).zip. According to atm.log, the run crashes at the 2nd timestep when I set up a 6-month startup run. If I set CONTINUE_RUN to TRUE and continue from 01-06, it crashes at 0106-00:30 and 0106-01:00. I should add that the successful 5-day test was rare; most of the time the run does not succeed. For example, if I create a new case and set it to run a 6-month startup run from the beginning, it fails almost immediately, usually after about 2 to 3 timesteps. The error may be a segmentation fault (e.g., “Segmentation fault: address not mapped to object at address …” or “forrtl: severe (174): SIGSEGV, segmentation fault occurred”), with tp_core.F90 named in the backtrace. It may also be “forrtl: severe (174): SIGSEGV, SIGSEGV occurred” with the file rrtmg_lw_rtrnmc.f90, or the NaN-related error I mentioned earlier.


I’m not sure what the next step for debugging should be. What surprises and confuses me is that when I turn on DEBUG, the model runs normally and produces output without errors. However, with DEBUG enabled the runtime becomes very long, so I can’t keep running with DEBUG all the time.
 


hplin

Haipeng Lin
Moderator
Staff member
Thanks Yiyang. Another thing you could try is the latest production release (2.1.5): GitHub - ESCOMP/CESM at release-cesm2.1.5

Are you able to get CONTINUE_RUN functionality with DEBUG=true?

The segmentation fault issues are also interesting - have you tried increasing the amount of available memory / number of cores for running the model?
 

YiyangSun

New Member
Yes, CONTINUE_RUN works with DEBUG=true. As for cores, I am using 768 cores, with 64 per node. My PE layout settings are:
<values>
<value compclass="ATM">-6</value>
<value compclass="CPL">-6</value>
<value compclass="OCN">-6</value>
<value compclass="WAV">-1</value>
<value compclass="GLC">-6</value>
<value compclass="ICE">-6</value>
<value compclass="ROF">-6</value>
<value compclass="LND">-6</value>
<value compclass="ESP">1</value>
</values>
<desc>number of tasks for each component</desc>
</entry>
<entry id="NTASKS_PER_INST">
<type>integer</type>
<values>
<value compclass="ATM">384</value>
<value compclass="OCN">384</value>
<value compclass="WAV">64</value>
<value compclass="GLC">384</value>
<value compclass="ICE">384</value>
<value compclass="ROF">384</value>
<value compclass="LND">384</value>
<value compclass="ESP">1</value>
<entry id="NTHRDS">
<type>integer</type>
<values>
<value compclass="ATM">1</value>
<value compclass="CPL">1</value>
<value compclass="OCN">1</value>
<value compclass="WAV">1</value>
<value compclass="GLC">1</value>
<value compclass="ICE">1</value>
<value compclass="ROF">1</value>
<value compclass="LND">1</value>
<value compclass="ESP">1</value>
</values>
<desc>number of threads for each task in each component</desc>
</entry>
<entry id="ROOTPE">
<type>integer</type>
<values>
<value compclass="ATM">0</value>
<value compclass="CPL">0</value>
<value compclass="OCN">-6</value>
<value compclass="WAV">0</value>
<value compclass="GLC">0</value>
<value compclass="ICE">-6</value>
<value compclass="ROF">0</value>
<value compclass="LND">0</value>
<value compclass="ESP">0</value>
</values>
 

hplin

Haipeng Lin
Moderator
Staff member
Thanks Yiyang, I would try to increase the cores or available memory to see if the run improves further. I am not sure why the NaN issues arise, but it may help with seemingly "random" segfault locations.
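To see what increasing the cores would change, the PE layout posted earlier in the thread can be resolved into MPI rank ranges. The sketch below assumes, as I understand the CIME convention, that a negative NTASKS or ROOTPE value is a node count to be multiplied by the tasks per node (64 here, from the post), and that ranks are assigned with a stride of 1. The helper names `resolve` and `layout` are hypothetical.

```python
# Values copied from the XML posted in the thread; 64 tasks per node is from the post.
TASKS_PER_NODE = 64

NTASKS = {"ATM": -6, "CPL": -6, "OCN": -6, "WAV": -1, "GLC": -6,
          "ICE": -6, "ROF": -6, "LND": -6, "ESP": 1}
ROOTPE = {"ATM": 0, "CPL": 0, "OCN": -6, "WAV": 0, "GLC": 0,
          "ICE": -6, "ROF": 0, "LND": 0, "ESP": 0}


def resolve(value, tasks_per_node=TASKS_PER_NODE):
    """Turn a negative (node-count) setting into a task count (assumed convention)."""
    return -value * tasks_per_node if value < 0 else value


def layout(ntasks, rootpe):
    """Return {component: (first_rank, last_rank)} and the total task count.

    Assumes a rank stride of 1 for every component.
    """
    ranges = {}
    for comp, n in ntasks.items():
        first = resolve(rootpe[comp])
        ranges[comp] = (first, first + resolve(n) - 1)
    total = max(last for _, last in ranges.values()) + 1
    return ranges, total


ranges, total = layout(NTASKS, ROOTPE)
for comp, (first, last) in ranges.items():
    print(f"{comp}: ranks {first}-{last}")
print("total tasks:", total)
```

With these assumptions the layout comes to 768 tasks, matching the core count in the post: ATM, CPL, LND, ROF and GLC share ranks 0 to 383, while OCN and ICE run on ranks 384 to 767. The aborting ranks in the original log (250 and 326) sit in the first range, which is at least consistent with the NaNs being flagged in an ATM field.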
 