Error running multi-instance single-point case

geoxiyang@gmail_com · Feb 23, 2015

Hi,

I was trying to run a multi-instance case for a single point on yellowstone. I created the case using PTCLM (so it creates surface data for the point). I added modified source mods to the case sourcemods folder, and changed user_nl_clm to include a few variables in the history tape. And the run went successfully.

Then I was trying to change env_mach_pes.xml:

    # MAX_TASKS_PER_NODE comes from $case/Tools/mkbatch.$machine
    @ ptile = $MAX_TASKS_PER_NODE / 2
    @ nthreads = 1
    @ atm_tasks = $ptile * $num_instances * 2
    @ lnd_tasks = $ptile * $num_instances * 2
    @ ice_tasks = $ptile * $num_instances
    @ ocn_tasks = $ptile * $num_instances
    @ cpl_tasks = $ptile * $num_instances
    @ glc_tasks = $ptile * $num_instances
    @ rof_tasks = $ptile * $num_instances * 2
    @ wav_tasks = $ptile * $num_instances

    ./xmlchange NTHRDS_ATM=$nthreads,NTASKS_ATM=$atm_tasks,NINST_ATM=$num_instances
    ./xmlchange NTHRDS_LND=$nthreads,NTASKS_LND=$lnd_tasks,NINST_LND=$num_instances
    ./xmlchange NTHRDS_ICE=$nthreads,NTASKS_ICE=$ice_tasks,NINST_ICE=1
    ./xmlchange NTHRDS_OCN=$nthreads,NTASKS_OCN=$ocn_tasks,NINST_OCN=1
    ./xmlchange NTHRDS_CPL=$nthreads,NTASKS_CPL=$cpl_tasks
    ./xmlchange NTHRDS_GLC=$nthreads,NTASKS_GLC=$glc_tasks,NINST_GLC=1
    ./xmlchange NTHRDS_ROF=$nthreads,NTASKS_ROF=$rof_tasks,NINST_ROF=$num_instances
    ./xmlchange NTHRDS_WAV=$nthreads,NTASKS_WAV=$wav_tasks,NINST_WAV=1
    ./xmlchange ROOTPE_ATM=0
    ./xmlchange ROOTPE_LND=0
    ./xmlchange ROOTPE_ICE=0
    ./xmlchange ROOTPE_OCN=0
    ./xmlchange ROOTPE_CPL=0
    ./xmlchange ROOTPE_GLC=0
    ./xmlchange ROOTPE_ROF=0
    ./xmlchange ROOTPE_WAV=0
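For reference, the task arithmetic in the script above can be checked outside the case directory. The following is only a rough Python sketch of that arithmetic: the function name `pe_layout` is made up for illustration, and the input values (16 tasks per node, 20 instances) are assumptions, not values taken from this case.

```python
def pe_layout(max_tasks_per_node: int, num_instances: int) -> dict:
    """Mirror the csh arithmetic from the post (csh '@' uses integer division)."""
    ptile = max_tasks_per_node // 2
    doubled = ptile * num_instances * 2   # ATM, LND and ROF get the factor of 2
    single = ptile * num_instances        # all other components
    return {
        "ATM": doubled, "LND": doubled, "ROF": doubled,
        "ICE": single, "OCN": single, "CPL": single,
        "GLC": single, "WAV": single,
    }


# Assumed inputs for illustration only.
layout = pe_layout(max_tasks_per_node=16, num_instances=20)
for comp, ntasks in layout.items():
    print(f"NTASKS_{comp}={ntasks}")
```

Printing the layout before running xmlchange makes it easy to see how many tasks each component would request for a given number of instances.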
And I also made user_nl_clm, user_nl_datm for each case.

Then I got these errors in cesm.log:

    14: NetCDF: Invalid dimension ID or name
    13: NetCDF: Variable not found
    13: NetCDF: Variable not found
    13: NetCDF: Invalid dimension ID or name
    13: NetCDF: Invalid dimension ID or name
    13: NetCDF: Invalid dimension ID or name
    13: NetCDF: Invalid dimension ID or name
    13: NetCDF: Invalid dimension ID or name
    (There are A LOT of these warnings)
    .....
    17:(seq_domain_areafactinit) : min/max mdl2drv 1.000000000000000 1.000000000000000 areafact_l_LND0018
    17:(seq_domain_areafactinit) : min/max drv2mdl 1.000000000000000 1.000000000000000 areafact_l_LND0018
    18:(seq_domain_areafactinit) : min/max mdl2drv 1.000000000000000 1.000000000000000 areafact_l_LND0019
    18:(seq_domain_areafactinit) : min/max drv2mdl 1.000000000000000 1.000000000000000 areafact_l_LND0019
    19:(seq_domain_areafactinit) : min/max mdl2drv 1.000000000000000 1.000000000000000 areafact_l_LND0020
    19:(seq_domain_areafactinit) : min/max drv2mdl 1.000000000000000 1.000000000000000 areafact_l_LND0020
    18:(seq_mct_drv) : Initialize atm component phase 2 ATM0019
    16:(seq_mct_drv) : Initialize atm component phase 2 ATM0017
    15:(seq_mct_drv) : Initialize atm component phase 2 ATM0016
    17:(seq_mct_drv) : Initialize atm component phase 2 ATM0018
    19:(seq_mct_drv) : Initialize atm component phase 2 ATM0020
    ....
    18:OMP: Warning #123: Ignoring invalid OS proc ID 3.
    18:OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
    19:OMP: Warning #123: Ignoring invalid OS proc ID 4.
    19:OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
    16:OMP: Warning #123: Ignoring invalid OS proc ID 1.
    16:OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
    17:OMP: Warning #123: Ignoring invalid OS proc ID 2.
    17:OMP: Warning #124: No valid OS proc IDs specified - not using affinity.
    INFO: 0031-251 task 15 exited: rc=-8
    INFO: 0031-251 task 16 exited: rc=-8
    INFO: 0031-251 task 17 exited: rc=-8
    INFO: 0031-251 task 18 exited: rc=-8
    INFO: 0031-251 task 19 exited: rc=-8
    5:forrtl: error (78): process killed (SIGTERM)
    5:Image              PC                Routine  Line     Source
    5:libpthread.so.0    00002B9B0225D2A5  Unknown  Unknown  Unknown
    5:libpoe.so          00002B9B06952AE2  Unknown  Unknown  Unknown
    5:libpthread.so.0    00002B9B02255851  Unknown  Unknown  Unknown
    5:libc.so.6          00002B9B0345C90D  Unknown  Unknown  Unknown

I am wondering:

1) What is the meaning of this rc=-8?
2) What are the NetCDF errors?

Thanks,
-Xi

santos · Feb 24, 2015

I don't know much about PTCLM, but I can answer your two questions:

1) rc=-8 is most likely a floating point or other arithmetic exception. SIGFPE happens to be signal 8 on most Linux systems; I'm not sure why, but on yellowstone you tend to get a negative version of the usual error codes. You might have gotten core_lite files from this error in your run directory, but there's a known problem on Yellowstone where sometimes no files are produced, or they are empty, due to a race condition.

2) The netCDF "errors" are produced when the model is checking for a variable that's not on a given file. These warnings always appear, even in runs that are working fine, because some files may contain optional fields that the model can use, but are not required. (We need to figure out how to shut the messages off, since nothing is wrong in most cases when this is printed.)
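As a rough illustration of both points, here is a small Python sketch that drops the NetCDF lookup messages from log lines and annotates task exit codes with a signal name. It is only a sketch under assumptions: the helper names `signal_name` and `triage` are made up, the regular expressions are fitted to the lines quoted in this thread, and treating the absolute value of rc as a signal number is the guess described above, not documented behavior.

```python
import re
import signal

# Patterns fitted to the log lines quoted in this thread (an assumption,
# not an exhaustive list of messages the model can print).
BENIGN = re.compile(r"NetCDF: (Variable not found|Invalid dimension ID or name)")
EXIT_RC = re.compile(r"task (\d+) exited: rc=(-?\d+)")


def signal_name(rc: int) -> str:
    """Assumption: |rc| is a signal number, as guessed in the reply above."""
    try:
        return signal.Signals(abs(rc)).name
    except ValueError:
        return "unknown"


def triage(lines):
    """Drop NetCDF lookup messages and annotate task exit lines."""
    kept = []
    for line in lines:
        if BENIGN.search(line):
            continue  # usually harmless: an optional field was not on the file
        match = EXIT_RC.search(line)
        if match:
            name = signal_name(int(match.group(2)))
            line = f"{line}  [task {match.group(1)}: possibly {name}]"
        kept.append(line)
    return kept


sample = [
    "13: NetCDF: Variable not found",
    "13: NetCDF: Invalid dimension ID or name",
    "INFO: 0031-251 task 15 exited: rc=-8",
    "5:forrtl: error (78): process killed (SIGTERM)",
]
for line in triage(sample):
    print(line)
```

Filtering this way only hides the messages in a copy of the log; it does not change what the model writes, and a NetCDF message could still matter in a case where a required variable is actually missing.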

geoxiyang@gmail_com · Feb 25, 2015

Hi Sean,

Thank you for the detailed answers. These are very helpful!

Best,
-Xi
