jhe6@ncsu_edu
New Member
Hi, I'm modifying the aerosol component of CESM/CAM5 (CESM1.0.5) on Yellowstone. My model runs for several months but then crashes with a segmentation fault. Below is the error message from the ccsm.log file:

ERROR: 0031-250 task 210: Segmentation fault
214:forrtl: error (78): process killed (SIGTERM)
214:Image PC Routine Line Source
214:libpthread.so.0 00002B639E3B0245 Unknown Unknown Unknown
214:libpoe.so 00002B63A571EBA3 Unknown Unknown Unknown
214:INFO: 0031-306 pm_atexit: pm_exit_value is 1.
212:forrtl: error (78): process killed (SIGTERM)
212:Image PC Routine Line Source
212:libpthread.so.0 00002B312C33E245 Unknown Unknown Unknown
212:libpoe.so 00002B31336ACBA3 Unknown Unknown Unknown
212:INFO: 0031-306 pm_atexit: pm_exit_value is 1.
194:forrtl: error (78): process killed (SIGTERM)
194:Image PC Routine Line Source
194:libpthread.so.0 00002B0F9456E245 Unknown Unknown Unknown
194:libpoe.so 00002B0F9B8DCBA3 Unknown Unknown Unknown
194:INFO: 0031-306 pm_atexit: pm_exit_value is 1.

A core file (core_lite) was also generated:
Thread 12 (Thread 0x2aad727bb700 (LWP 9096)):
#0 0x00002aad656f0d03 in epoll_wait () from /lib64/libc.so.6
#1 0x00002aad6c2dd9b5 in poe_exiting_thread () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpoe.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x2aad729bc700 (LWP 9097)):
#0 0x00002aad656f0d03 in epoll_wait () from /lib64/libc.so.6
#1 0x00002aad6c2e20c8 in pm_child_sig_thread () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpoe.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x2aad72bbd700 (LWP 9098)):
#0 0x00002aad64f71245 in sigwait () from /lib64/libpthread.so.0
#1 0x00002aad6c2dfba3 in pm_async_thread () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpoe.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x2aad766f6700 (LWP 9175)):
#0 0x00002aad64f6d3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aad731cc2cb in hal_ibl_user_intr_hndlr () from /opt/ibmhpc/pe1209/base/gnu/lib64/libhal64_ibm.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x2aad768f7700 (LWP 9176)):
#0 0x00002aad656f0d03 in epoll_wait () from /lib64/libc.so.6
#1 0x00002aad731cbf34 in hal_ibl_async_intr_hndlr () from /opt/ibmhpc/pe1209/base/gnu/lib64/libhal64_ibm.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x2aad7ed47700 (LWP 9216)):
#0 0x00002aad64f6d3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aad67a3b53a in shm_dispatcher_thread(void*) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpami.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x2aad7f149700 (LWP 9227)):
#0 0x00002aad64f6d3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aad67a2c793 in _compl_hndlr_thr(void*) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpami.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x2aad7f5ef700 (LWP 9238)):
#0 0x00002aad64f6d3dc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aad67a4e3a8 in rc_ibl_intr_hndlr(void*) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpami.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x2aad7f7f0700 (LWP 9239)):
#0 0x00002aad656f0d03 in epoll_wait () from /lib64/libc.so.6
#1 0x00002aad67a4e0de in rc_ibl_async_intr_hndlr(void*) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpami.so
#2 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#3 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aad7f9f1700 (LWP 9240)):
#0 0x00002aad64f6d75b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002aad67a14a58 in _timer_arm(timer_service_t*) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpami.so
#2 0x00002aad67a15048 in _lapi_tmr_thrd(void*) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpami.so
#3 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#4 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aad7fbf2700 (LWP 9259)):
#0 0x00002aad656e92f3 in select () from /lib64/libc.so.6
#1 0x00002aad6c0b0175 in Connection::Wait() () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpnsd.so
#2 0x00002aad6c09febd in internal_pnsd_api_wait_for_updates(int, unsigned int*, char*, nrt_adapter_t*, unsigned short*, char**, int*, char**) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpnsd.so
#3 0x00002aad6c0a019c in pnsd_api_wait_for_updates () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpnsd.so
#4 0x00002aad67a40739 in preempt_monitor_thread(void*) () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpami.so
#5 0x00002aad64f697f1 in start_thread () from /lib64/libpthread.so.0
#6 0x00002aad656f070d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aad6c08fc60 (LWP 9022)):
#0 0x00002aad656b57bd in waitpid () from /lib64/libc.so.6
#1 0x00002aad65649329 in do_system () from /lib64/libc.so.6
#2 0x00002aad65649660 in system () from /lib64/libc.so.6
#3 0x00002aad6c2dd6e8 in pm_linux_print_coredump () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpoe.so
#4 0x00002aad6c2dd7ee in pm_lwcf_signal_handler () from /opt/ibmhpc/pe1209/base/gnu/lib64/libpoe.so
#5
#6 0x0000000000a0d089 in rrtmg_lw_rtrnmc_mp_rtrnmc_ ()
#7 0x0000000000a05c8d in rrtmg_lw_rad_mp_rrtmg_lw_ ()
#8 0x00000000005bb151 in radlw_mp_rad_rrtmg_lw_ ()
#9 0x00000000005a7efe in radiation_mp_radiation_tend_ ()
#10 0x0000000000e8a832 in tphysbc_ ()
#11 0x0000000000560f55 in physpkg_mp_phys_run1_ ()
#12 0x000000000048ff53 in cam_comp_mp_cam_run1_ ()
#13 0x000000000047e99a in atm_comp_mct_mp_atm_run_mct_ ()
#14 0x000000000041085f in ccsm_comp_mod_mp_ccsm_run_ ()
#15 0x0000000000422db1 in MAIN__ ()
#16 0x000000000040ec2c in main ()

It seems that the crash has something to do with rrtmg_lw_rtrnmc_mp_rtrnmc_(), which I have never modified. I have modified many modules, and the model compiles and runs for several months without any problem, so it is difficult for me to locate the bug that eventually causes the segmentation fault. Because the ccsm.log file doesn't give much useful information about why the model crashed, is there a way to get ccsm.log to report which module or subroutine caused the crash? Thank you very much!
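For reading a trace like the one above, here is a minimal Python sketch that keeps only the frames belonging to the model executable (those without a trailing "from <library>") and splits each symbol into module and procedure. It assumes the Intel Fortran naming pattern module_mp_procedure_ that the symbols in this trace appear to follow; the function name model_frames and the regular expression are illustrative, not part of CESM or gdb, and frames whose argument lists contain spaces are simply skipped.

```python
import re

# One gdb frame, e.g. "#6 0x0000000000a0d089 in rrtmg_lw_rtrnmc_mp_rtrnmc_ ()".
# Shared-library frames end with "from /path/to/lib.so"; executable frames do not.
FRAME = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+ in (\S+)\s+\(.*?\)(\s+from\s+\S+)?")


def model_frames(trace):
    """Return (frame_number, module, procedure) for frames in the executable."""
    frames = []
    for number, symbol, library in FRAME.findall(trace):
        if library:
            continue  # frame lives in a shared library (libc, libpoe, ...)
        if "_mp_" in symbol:
            # Assumed pattern: <module>_mp_<procedure>_ (heuristic split)
            module, procedure = symbol.split("_mp_", 1)
        else:
            module, procedure = None, symbol  # not a module procedure
        frames.append((int(number), module, procedure.rstrip("_")))
    return frames


if __name__ == "__main__":
    # Excerpt of Thread 1 from the post, still fused as it was pasted.
    sample = (
        "#4 0x00002aad6c2dd7ee in pm_lwcf_signal_handler () from "
        "/opt/ibmhpc/pe1209/base/gnu/lib64/libpoe.so#5 "
        "#6 0x0000000000a0d089 in rrtmg_lw_rtrnmc_mp_rtrnmc_ ()"
        "#7 0x0000000000a05c8d in rrtmg_lw_rad_mp_rrtmg_lw_ ()"
    )
    for frame in model_frames(sample):
        print(frame)
```

The first frame it returns is the routine that was executing when the signal arrived, here subroutine rtrnmc in module rrtmg_lw_rtrnmc. That is where the fault surfaced, which is not necessarily where the bad data originated.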