Fast on Single Node, Huge Slowdown on Multiple Nodes: Knights Landing

heavens

Member
I am running CESM 1.2.2 in the commissioning phase of a small Linux cluster with Knights Landing (Intel Xeon Phi 7210) processors. On a single node, I have -user_compset 1850_CAM5_CLM45_CICE_POP2_RTM_SGLC_SWAV -res 1.9x2.5_gx1v6 running quite well. dt in CPL is ~144 s, so ~20 timesteps per minute. The layout is below:

./xmlchange -file env_mach_pes.xml -id ROOTPE_GLC -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_GLC -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_GLC -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_WAV -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_WAV -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_WAV -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_LND -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_LND -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ROF -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ROF -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_ROF -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ICE -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_CPL -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_CPL -val 1
./xmlchange -file env_mach_pes.xml -id MAX_TASKS_PER_NODE -val 256
./xmlchange -file env_mach_pes.xml -id PES_PER_NODE -val 64

Then I tried 4 nodes with:

./xmlchange -file env_mach_pes.xml -id ROOTPE_GLC -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_GLC -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_GLC -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_WAV -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_WAV -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_WAV -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_LND -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_LND -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ROF -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ROF -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_ROF -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ICE -val 480
./xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 480
./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 960
./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 64
./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_CPL -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_CPL -val 1

This gives me about 1 timestep per minute and requires 41 minutes to start up, as opposed to 5 minutes on a single node. I'm really not used to CESM being so slow to load on multiple nodes. Should I be worried about the cluster interconnect? Or is there some key setting to make sure jobs on multiple nodes go faster?

Nicholas Heavens
Research Assistant Professor of Planetary Science
Hampton University
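For anyone trying to reproduce this, the place I have been looking to see where the time goes is the timing summary CESM writes into the case's timing/ directory after each completed run. A minimal sketch, assuming the CESM 1.2 file naming (ccsm_timing.<case>.<lid>); adjust the glob if your version names the file differently:

# Run from inside the case directory. Pick the most recent timing summary
# and pull out the initialization and per-component run times; the
# "Run Time" lines show whether ATM, OCN, or the coupler is absorbing the
# extra wall-clock time once the job spans more than one node.
latest=$(ls -t timing/ccsm_timing.* | head -1)
grep -E 'Init Time|Run Time' "$latest"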
 

heavens

Member
I've mostly been able to figure out what is happening. If you stretch the model across multiple nodes, it slows down massively. I've managed better performance than on a single node by placing all components except the ocean on one node and the ocean on the other. My deeper problem is why 512 Knights Landing cores on the 1.9x2.5 grid struggle to get 2 model years per wall-clock day, when 0.9x1.25 on 128 Sandy Bridge processors gets just short of 10. Is the CESM code just too badly parallelized to do anything with the MIC architecture?

Nicholas Heavens
Research Assistant Professor of Planetary Science
Hampton University
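For reference, the split that recovered the performance looks roughly like the sketch below. This is only an illustration of the idea (all non-ocean components confined to the first node's 256 task slots, the ocean alone on the second node's slots), not a tuned layout, and it assumes the same MAX_TASKS_PER_NODE/PES_PER_NODE settings as in my single-node case:

# Node 1: every component except the ocean shares tasks 0-255.
for comp in ATM LND ICE ROF GLC WAV CPL; do
  ./xmlchange -file env_mach_pes.xml -id ROOTPE_$comp -val 0
  ./xmlchange -file env_mach_pes.xml -id NTASKS_$comp -val 256
  ./xmlchange -file env_mach_pes.xml -id NTHRDS_$comp -val 1
done
# Node 2: the ocean alone on tasks 256-511, so it never straddles nodes.
./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 256
./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1

After changing env_mach_pes.xml I re-run cesm_setup -clean, then cesm_setup, and rebuild before submitting again.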
 
Hello Nicholas Heavens and CESM Staff,

I am running into similar issues on the Marconi KNL system. If I try to raise NTASKS above 1024, the model gives me npr_yz errors. The simulation executes extremely slowly. Knights Landing has 68 physical cores per node, but 272 logical cores (hardware threads) per node. I had MPI errors when I ran with 68 tasks per node, but no errors when I ran with 64 tasks per node; the Marconi staff believe it was a memory issue. Is it possible to use the logical cores to speed up the simulation?

Unfortunately, CESM 1.2.2 executes extremely slowly on the Marconi KNL system, and I need to be able to scale between 2040 and 4080 cores for computational proposals. This is difficult because of the npr_yz errors. Any advice on this issue would be much appreciated!

-Jonathan R. Buzan
University of Bern
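P.S. In case it narrows things down: if I understand it correctly, the npr_yz messages come from the decomposition of CAM's finite-volume dycore, which can be set by hand in user_nl_cam instead of being left to the defaults. The sketch below is only the kind of setting I have been experimenting with, not a recommendation; the four values are npr_y, npr_z, nprxy_x, nprxy_y, the first pair and the second pair each have to multiply to NTASKS_ATM, and the task count of 768 is just an illustration for the 1.9x2.5 grid:

# Illustrative only: give CAM an explicit 2-D decomposition that matches
# the number of atmosphere MPI tasks (assumed here to be 768).
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 768
cat >> user_nl_cam << 'EOF'
 npr_yz = 32,24,24,32
EOF

If I have the decomposition limits right (at least 3 latitudes per subdomain in y, and no more z subdomains than model levels), the 1.9x2.5 finite-volume grid tops out somewhere near 960 MPI tasks for the atmosphere, so reaching 2040-4080 cores would probably mean OpenMP threads (NTHRDS) or assigning the extra cores to other components rather than more atmosphere tasks.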
 