Fast on Single Node, Huge Slowdown on Multiple Nodes: Knights Landing

heavens

I am running CESM 1.2.2 during the commissioning phase of a small Linux cluster with Knights Landing (Intel Xeon Phi 7210) processors.

On a single node, I have

-user_compset 1850_CAM5_CLM45_CICE_POP2_RTM_SGLC_SWAV -res 1.9x2.5_gx1v6

running quite well: dt in CPL is ~144 s, so ~20 timesteps per minute.

Layout is below:

./xmlchange -file env_mach_pes.xml -id ROOTPE_GLC -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_GLC -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_GLC -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_WAV -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_WAV -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_WAV -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_LND -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_LND -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ROF -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ROF -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_ROF -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ICE -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_CPL -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 256
./xmlchange -file env_mach_pes.xml -id NTHRDS_CPL -val 1
./xmlchange -file env_mach_pes.xml -id MAX_TASKS_PER_NODE -val 256
./xmlchange -file env_mach_pes.xml -id PES_PER_NODE -val 64
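The single-node layout above gives every component the same ROOTPE, NTASKS, and NTHRDS, so the repetitive commands can be generated with a small shell loop. This is just a convenience sketch: `set_pes_uniform` is a name I made up, and it echoes the commands (a dry run) so they can be inspected before piping to `sh` from the case directory.

```shell
# Dry-run generator for a uniform PE layout: prints the same xmlchange
# commands as the per-line block above for any task/thread count.
# (set_pes_uniform is a hypothetical helper, not part of CESM.)
set_pes_uniform() {
  ntasks=$1
  nthrds=$2
  for comp in GLC WAV LND ROF ICE ATM OCN CPL; do
    echo "./xmlchange -file env_mach_pes.xml -id ROOTPE_${comp} -val 0"
    echo "./xmlchange -file env_mach_pes.xml -id NTASKS_${comp} -val ${ntasks}"
    echo "./xmlchange -file env_mach_pes.xml -id NTHRDS_${comp} -val ${nthrds}"
  done
}

set_pes_uniform 256 1
```

Running `set_pes_uniform 256 1 | sh` in the case directory would apply the 24 component settings; MAX_TASKS_PER_NODE and PES_PER_NODE would still be set separately.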

 

Then I tried 4 nodes with:

 

./xmlchange -file env_mach_pes.xml -id ROOTPE_GLC -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_GLC -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_GLC -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_WAV -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_WAV -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_WAV -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_LND -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_LND -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_LND -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ROF -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ROF -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_ROF -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ICE -val 480
./xmlchange -file env_mach_pes.xml -id NTASKS_ICE -val 480
./xmlchange -file env_mach_pes.xml -id NTHRDS_ICE -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_ATM -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 960
./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 64
./xmlchange -file env_mach_pes.xml -id NTHRDS_OCN -val 1
./xmlchange -file env_mach_pes.xml -id ROOTPE_CPL -val 0
./xmlchange -file env_mach_pes.xml -id NTASKS_CPL -val 960
./xmlchange -file env_mach_pes.xml -id NTHRDS_CPL -val 1
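For reference, ROOTPE is the global MPI rank at which a component's tasks begin, so in the 4-node layout above the ocean (ROOTPE 960, 64 tasks) occupies ranks 960-1023, after the 960 ranks shared by the other components, for 1024 PEs total. With 256 tasks per node that exactly fills 4 nodes. The check below is my own bookkeeping sketch, not a CESM utility:

```shell
# Sanity check: does the highest component (OCN at ROOTPE 960 with 64
# tasks) fit within 4 nodes of 256 tasks each?
max_tasks_per_node=256
nodes=4
total_pes=$(( 960 + 64 ))                     # highest ROOTPE + its NTASKS
budget=$(( max_tasks_per_node * nodes ))
[ "${total_pes}" -le "${budget}" ] && echo "layout fits: ${total_pes}/${budget} PEs"
```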

 

This gives me about 1 timestep per minute and requires 41 minutes to start up, as opposed to 5 minutes on a single node.

I'm not used to CESM being this slow to start up and run on multiple nodes. Should I be worried about the cluster interconnect, or is there some key setting that makes multi-node jobs run faster?

 

Nicholas Heavens

Research Assistant Professor of Planetary Science

Hampton University
heavens

I've mostly been able to figure out what is happening. If you spread a single component across multiple nodes, it slows down massively. I got better performance than on a single node by placing all components except the ocean on one node and the ocean on the other.
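A two-node split of that kind could look like the sketch below (the specific task counts are my illustration, not the exact values from the run; NTHRDS stays at 1 as before). It echoes the commands so they can be reviewed before applying:

```shell
# Illustrative two-node layout: all components except the ocean share
# ranks 0-255 (node 0); the ocean alone starts at rank 256 (node 1).
# Task counts here are assumed for illustration.
ocean_split_layout() {
  for comp in GLC WAV LND ROF ICE ATM CPL; do
    echo "./xmlchange -file env_mach_pes.xml -id ROOTPE_${comp} -val 0"
    echo "./xmlchange -file env_mach_pes.xml -id NTASKS_${comp} -val 256"
  done
  echo "./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 256"
  echo "./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 64"
}

ocean_split_layout
```

Because the ocean's rank range is disjoint from the others, it runs concurrently with them while staying on its own node, which keeps each component's internal communication on-node.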

My deeper problem is why 512 Knights Landing cores on the 1.9x2.5 grid struggle to reach 2 model years per wall-clock day, when 0.9x1.25 on 128 Sandy Bridge cores gets just short of 10.
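Taking those throughput figures at face value, the per-core gap works out to roughly a factor of 20, and since 1.9x2.5 is a coarser (cheaper) grid than 0.9x1.25, the true per-core gap is even larger. A quick arithmetic check of that ratio, using only the numbers quoted above:

```shell
# Per-core throughput ratio: (10 yr/day / 128 cores) vs (2 yr/day / 512 cores)
# = (10 * 512) / (2 * 128). Grids differ, so this understates the KNL deficit.
ratio=$(( (10 * 512) / (2 * 128) ))
echo "Sandy Bridge per-core throughput ~ ${ratio}x KNL here"   # prints 20x
```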

Is the CESM code just too poorly parallelized to do anything useful with the MIC architecture?

 

Nicholas Heavens

Research Assistant Professor of Planetary Science

Hampton University

 