
computation speed from CICE4 to CICE6

After upgrading the sea ice component of CESM1.2.1 from CICE4 to CICE6 (keeping the other components untouched), the computation became much slower: 20 model years per wall-clock day versus 35 model years before the upgrade. According to the timing information, the sea ice component is the bottleneck. Is this a normal situation for CICE6? I could not find any published comparison of computational speed between CICE versions.

I am using the f19_g16 grid on my own server.

Thanks in advance.
 

dbailey

CSEG and Liaisons
Staff member
I don't believe we have done a comprehensive timing comparison of CICE6 versus CICE4. Part of this is that the default physics options may have changed. Also, CICE6 is dynamically allocated, whereas CICE4 was statically allocated. Did you try different computational decompositions and PE layouts?
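For context, a minimal sketch of what that involves, assuming the standard CICE6 domain_nml variable names and purely illustrative values for the 320x384 gx1 ice grid used by f19_g16 (how the namelist is generated in a custom CESM1.2.1 port may differ): the ice PE count and placement are set through NTASKS_ICE and ROOTPE_ICE in env_mach_pes.xml, while the block decomposition is controlled by namelist settings along these lines:

    &domain_nml
      nprocs            = 64            ! MPI tasks assigned to the ice model
      block_size_x      = 20            ! 320/20 = 16 blocks in x
      block_size_y      = 24            ! 384/24 = 16 blocks in y
      max_blocks        = 4             ! 256 blocks / 64 tasks
      processor_shape   = 'slenderX2'
      distribution_type = 'roundrobin'  ! alternatives include 'cartesian', 'spacecurve'
      distribution_wght = 'latitude'    ! weight work toward latitudes likely to have ice
    /

Smaller blocks combined with a round-robin or space-filling-curve distribution tend to balance the ice work better than one large block per task, so it is worth timing a few combinations before concluding that CICE6 itself is slower.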
 

dupontf

Frédéric Dupont
New Member
Jumping into this discussion: I tested CICE/main (6.4.0) standalone today with different MPI partitions for a large domain, 4320x3604 (ORCA12), on an Ice Lake cluster, compiled with inteloneapi-2022.1.2 and its associated MPI library, and obtained the attached plot, which shows the speedup relative to 2160 computational cores. The performance seems to stall beyond roughly 3500 cores, that is, between 2840 and 3996 cores, or between local block sizes of 108x51 and 59x67, although 59x67 does not look that small to me. I suspect that global MPI operations may be the issue (given the large total number of cores), although I am running with halos on and add_mpi_barriers = .true. Has this been documented for massively parallel architectures?
[Attached plot: speedup relative to 2160 computational cores as a function of core count]
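To connect the core counts and block sizes quoted above, the task grids below are inferred from the totals, assuming a Cartesian decomposition with one block per MPI task (this is a reconstruction, not something stated explicitly in the post):

    2840 cores = 40 x 71 tasks -> local blocks of ceil(4320/40) x ceil(3604/71) = 108 x 51
    3996 cores = 74 x 54 tasks -> local blocks of ceil(4320/74) x ceil(3604/54) = 59 x 67
    4320 cores = 80 x 54 tasks -> local blocks of ceil(4320/80) x ceil(3604/54) = 54 x 67

Even at the largest counts each core still owns a few thousand grid points, which supports the suspicion that communication rather than computation is what stalls the scaling.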
 

dupontf

Frédéric Dupont
New Member
I realized that the diagnostics were on at every timestep. Removing them does help a bit for the highest core-count test, but it does not change the fact that that test suffers from MPI communication overhead.
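For anyone reproducing this, the per-timestep diagnostics do not have to be removed from the code; a minimal sketch, assuming the standard CICE6 setup_nml variable, is simply to write them less often:

    &setup_nml
      diagfreq = 24   ! write point/global diagnostics every 24 timesteps
                      ! rather than every timestep (diagfreq = 1)
    /

with diagfreq chosen to give, for example, one diagnostic write per model day.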
 

dupontf

Frédéric Dupont
New Member
On a better note, if I use a distribution of 80x54 = 4320 cores, I am back in business (i.e., closer to the ideal-speedup diagonal). It seems the 74x54 = 3996 case caused something strange (one node was not fully used, though I am not sure why that would be a limitation).