
NaN in Faxa_bcphiwet Faxa_bcphidry

YiyangSun

New Member
Hello everyone.
I am using CESM 2.1.3 with a B2000 compset. I created the case with `./create_newcase --case casename --res f19_g17 --compset 2000_CAM60_CLM50%BGC-CROP_CICE_POP2%ECO%ABIO-DIC_MOSART_CISM2%NOEVOLVE_WW3_BGC%BDRD --run-unsupported`.
I am running the case as a startup run, but I encounter the following error:
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13090
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13238
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13091
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13375
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13377
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13234
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13383
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13233
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13087
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13381
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13379
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13788
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13665
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13802
Image PC Routine Line Source
libpnetcdf.so.4.0 00002B1A4B6AF82B tracebackqq_ Unknown Unknown
cesm.exe 0000000002E0AE1E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 000000000043A510 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000436200 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041BBDE cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000435A37 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000417F62 Unknown Unknown Unknown
libc-2.31.so 00002B1A4DBB0DCD __libc_start_main Unknown Unknown
cesm.exe 0000000000417E6A Unknown Unknown Unknown
Image PC Routine Line Source
libpnetcdf.so.4.0 00002B01A4F8182B tracebackqq_ Unknown Unknown
cesm.exe 0000000002E0AE1E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 000000000043A510 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000436200 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041BBDE cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000435A37 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000417F62 Unknown Unknown Unknown
libc-2.31.so 00002B01A7482DCD __libc_start_main Unknown Unknown
cesm.exe 0000000000417E6A Unknown Unknown Unknown
Image PC Routine Line Source
libpnetcdf.so.4.0 00002B938EF0982B tracebackqq_ Unknown Unknown
cesm.exe 0000000002E0AE1E shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 000000000043A510 component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000436200 component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041BBDE cime_comp_mod_mp_ 3465 cime_comp_mod.F90
cesm.exe 0000000000435A37 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000417F62 Unknown Unknown Unknown
libc-2.31.so 00002B939140ADCD __libc_start_main Unknown Unknown
cesm.exe 0000000000417E6A Unknown Unknown Unknown
Abort(1001) on node 326 (rank 326 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 326
Abort(1001) on node 250 (rank 250 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 250
The strange thing is that I can successfully run a 5-day test, but when I try to continue it with CONTINUE_RUN, the case fails with this error. When I set CONTINUE_RUN to FALSE and ran a 6-month test instead, it still failed. I ran ./check_inputdata --chksum and it reported nothing wrong. Is there anything I can do to fix this? The inputdata was downloaded via svn (Revision 70792: /trunk/inputdata).
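When many ranks report NaNs at once, it can help to condense the log into a per-field list of affected grid indices before looking for a pattern. The following is a minimal Python sketch, not part of CESM: the function name `summarize_nans` and the regular expressions are assumptions based only on the message format shown in the log above, and the sample lines are copied from that log.

```python
import re
from collections import defaultdict

# Patterns inferred from the log excerpt in this post (an assumption,
# not an official CESM log specification).
FIELD_RE = re.compile(r"NaN found in (\w+) instance:\s*(\d+) field (\w+)")
INDEX_RE = re.compile(r"1d global index:\s*(\d+)")


def summarize_nans(lines):
    """Group reported 1d global indices by (component, field).

    Each index line is paired with the most recent field line. Output from
    several MPI ranks can interleave, so the pairing is best effort.
    """
    hits = defaultdict(list)
    current = None
    for line in lines:
        field_match = FIELD_RE.search(line)
        if field_match:
            current = (field_match.group(1), field_match.group(3))
            hits[current]  # register the field even if no index follows
            continue
        index_match = INDEX_RE.search(line)
        if index_match and current is not None:
            hits[current].append(int(index_match.group(1)))
            current = None
    return {key: sorted(values) for key, values in hits.items()}


sample = """\
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13090
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphiwet
1d global index: 13238
ERROR:
component_mod:check_fields NaN found in ATM instance: 1 field Faxa_bcphidry
1d global index: 13091
""".splitlines()

summary = summarize_nans(sample)
for (component, field), indices in sorted(summary.items()):
    print(component, field, indices)
```

To run it on a full log, pass the lines of the file instead of `sample`, for example `summarize_nans(open(path))` with `path` pointing at the log that contains these messages. Indices that sit close together, as they do here, suggest the NaNs come from a localized region of the atmosphere grid rather than from scattered points.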
 

Attachments

  • run.zip
    316 KB · Views: 0

hplin

Haipeng Lin
Moderator
Staff member
Hello Yiyang,

Thanks for writing. When you're running a 6 month simulation, at which point does the run crash? If you set ./xmlchange DEBUG=true and rebuild the model and run, does it give you a different error?
 

YiyangSun

New Member
Hello Mr. Lin

Thanks for your reply! I have attached my atm.log in run (1).zip. According to atm.log, the run crashes at the 2nd step when I set up a 6-month startup run. If I set CONTINUE_RUN to TRUE and continue from 01-06, it crashes at 0106-00:30 and 0106-01:00. I should add that the successful 5-day test was a rare case; most of the time the run does not complete. For example:
If I create a new case and set it to run a 6-month startup run from the beginning, it will fail immediately—usually after about 2–3 timesteps. The error could be a segmentation fault (e.g., “Segmentation fault: address not mapped to object at address …” or “forrtl: severe (174): SIGSEGV, segmentation fault occurred”), and the file mentioned in the backtrace might be tp_core.F90. It could also be “forrtl: severe (174): SIGSEGV, SIGSEGV occurred” with the file rrtmg_lw_rtrnmc.f90. Or it could be the NaN-related issue I mentioned earlier.


I’m not sure what the next step for debugging should be. What surprises and confuses me is that when I turn on DEBUG, the model runs normally and produces output without errors. However, with DEBUG enabled the runtime becomes very long, so I can’t keep running with DEBUG all the time.
 

Attachments

  • run (1).zip
    196.7 KB · Views: 1

hplin

Haipeng Lin
Moderator
Staff member
Thanks Yiyang. Another thing you could try is the latest production release (2.1.5): GitHub - ESCOMP/CESM at release-cesm2.1.5

Are you able to get CONTINUE_RUN functionality with DEBUG=true?

The segmentation fault issues are also interesting - have you tried increasing the amount of available memory / number of cores for running the model?
 

YiyangSun

New Member
Yes sir, CONTINUE_RUN works with DEBUG=true. As for cores, I have set the total to 768, with 64 per node.
<values>
<value compclass="ATM">-6</value>
<value compclass="CPL">-6</value>
<value compclass="OCN">-6</value>
<value compclass="WAV">-1</value>
<value compclass="GLC">-6</value>
<value compclass="ICE">-6</value>
<value compclass="ROF">-6</value>
<value compclass="LND">-6</value>
<value compclass="ESP">1</value>
</values>
<desc>number of tasks for each component</desc>
</entry>
<entry id="NTASKS_PER_INST">
<type>integer</type>
<values>
<value compclass="ATM">384</value>
<value compclass="OCN">384</value>
<value compclass="WAV">64</value>
<value compclass="GLC">384</value>
<value compclass="ICE">384</value>
<value compclass="ROF">384</value>
<value compclass="LND">384</value>
<value compclass="ESP">1</value>
<entry id="NTHRDS">
<type>integer</type>
<values>
<value compclass="ATM">1</value>
<value compclass="CPL">1</value>
<value compclass="OCN">1</value>
<value compclass="WAV">1</value>
<value compclass="GLC">1</value>
<value compclass="ICE">1</value>
<value compclass="ROF">1</value>
<value compclass="LND">1</value>
<value compclass="ESP">1</value>
</values>
<desc>number of threads for each task in each component</desc>
</entry>
<entry id="ROOTPE">
<type>integer</type>
<values>
<value compclass="ATM">0</value>
<value compclass="CPL">0</value>
<value compclass="OCN">-6</value>
<value compclass="WAV">0</value>
<value compclass="GLC">0</value>
<value compclass="ICE">-6</value>
<value compclass="ROF">0</value>
<value compclass="LND">0</value>
<value compclass="ESP">0</value>
</values>
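
The settings pasted above can be checked against the stated total of 768 cores with a little arithmetic. The sketch below is in Python and assumes the CIME convention that a negative value for the task count or ROOTPE is a number of whole nodes rather than a number of tasks. The helper `resolve_pes` is hypothetical, and the values are copied from the fragment above together with the 64 tasks per node mentioned in the post.

```python
TASKS_PER_NODE = 64  # "per node is 64", as stated in the post


def resolve_pes(value, tasks_per_node):
    """Return a task count or offset; negative inputs are treated as whole nodes.

    Assumption: this mirrors how CIME interprets negative PE settings.
    """
    return -value * tasks_per_node if value < 0 else value


# Values copied from the pasted fragment.
ntasks = {"ATM": -6, "CPL": -6, "OCN": -6, "WAV": -1, "GLC": -6,
          "ICE": -6, "ROF": -6, "LND": -6, "ESP": 1}
rootpe = {"ATM": 0, "CPL": 0, "OCN": -6, "WAV": 0, "GLC": 0,
          "ICE": -6, "ROF": 0, "LND": 0, "ESP": 0}

# (first task, number of tasks) for each component
layout = {
    comp: (resolve_pes(rootpe[comp], TASKS_PER_NODE),
           resolve_pes(ntasks[comp], TASKS_PER_NODE))
    for comp in ntasks
}

# The job needs enough tasks to cover the highest-numbered task in use.
total_pes = max(root + count for root, count in layout.values())

for comp, (root, count) in layout.items():
    print(f"{comp}: tasks {root}..{root + count - 1}")
print("total tasks:", total_pes)
```

Under that assumption, OCN and ICE occupy tasks 384 to 767 while ATM, CPL, GLC, ROF and LND share tasks 0 to 383, which matches the 768 cores (12 nodes of 64) reported in the post.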
 

hplin

Haipeng Lin
Moderator
Staff member
Thanks Yiyang, I would try to increase the cores or available memory to see if the run improves further. I am not sure why the NaN issues arise, but it may help with seemingly "random" segfault locations.
 