
Substantial slowdown on multiple nodes when porting CESM to a new machine

Sreerag

Member
Hi,
I am porting CESM2 to my university's cluster, which uses Slurm as its batch system. On a single node the model runs at a reasonable speed: 1.45 simulated years/day on 128 processors. I am using OpenMPI, and my config files are attached. I have also attached the log file from a run that timed out: the model became slower even though the number of processors was doubled.
Can anyone please help me with this?

Thanks in advance
Sreerag
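
For reference, the single-node figure above gives a baseline against which a multi-node run can be judged. The following is a minimal sketch, assuming perfect (linear) scaling as the yardstick; the function names are hypothetical, and the measured multi-node rate is a placeholder because the thread does not report one.

```python
def ideal_throughput(base_rate, base_procs, new_procs):
    """Throughput (simulated years/day) if the model scaled perfectly with processor count."""
    return base_rate * new_procs / base_procs


def parallel_efficiency(measured_rate, base_rate, base_procs, new_procs):
    """Fraction of the ideal throughput actually achieved (1.0 means perfect scaling)."""
    return measured_rate / ideal_throughput(base_rate, base_procs, new_procs)


# Baseline from the post: 1.45 simulated years/day on 128 processors.
ideal_256 = ideal_throughput(1.45, 128, 256)
print(f"ideal on 256 processors: {ideal_256:.2f} years/day")

# HYPOTHETICAL measured rate (not a number from this thread), only to show how the ratio is read.
print(f"efficiency: {parallel_efficiency(0.5, 1.45, 128, 256):.2f}")
```

Real models rarely scale perfectly, so an efficiency somewhat below 1.0 is normal; a two-node run that is slower than the single-node run in absolute terms points to a communication or process-placement problem rather than ordinary scaling loss.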
 

Attachments

  • cesm.log.17685615.230622-213429.txt
    133.4 KB · Views: 5
  • config_batch.txt
    1.1 KB · Views: 5
  • config_compiler.txt
    1.1 KB · Views: 0
  • config_machine.txt
    2.7 KB · Views: 3

jedwards

CSEG and Liaisons
Staff member
Is the OpenMPI you are using optimized for the network you have? This is really a question for your system administrators.
Your mpirun statement also has the parameter --map-by ppr:1:node, which I don't think is correct. You might try:
<arg name="tasks_per_node"> --map-by ppr:{{ tasks_per_node }}:socket:PE=$ENV{OMP_NUM_THREADS} --bind-to hwthread</arg>
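
To see what this argument would look like on the mpirun command line, here is a minimal sketch that fills in the two placeholders by hand. It is not CIME's own substitution code; the expand helper is hypothetical, and the values 64 and 1 are illustrative placeholders, not taken from the attached config files.

```python
import re

# The argument text suggested above, copied verbatim.
TEMPLATE = "--map-by ppr:{{ tasks_per_node }}:socket:PE=$ENV{OMP_NUM_THREADS} --bind-to hwthread"


def expand(template, case_vars, env):
    """Substitute {{ name }} from case_vars and $ENV{NAME} from env (hypothetical helper)."""
    out = re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(case_vars[m.group(1)]), template)
    out = re.sub(r"\$ENV\{(\w+)\}", lambda m: str(env[m.group(1)]), out)
    return out


# Placeholder values for illustration only.
print(expand(TEMPLATE, {"tasks_per_node": 64}, {"OMP_NUM_THREADS": "1"}))
```

Comparing the expanded string with the mpirun line recorded in the run log is a quick way to confirm that the new mapping is actually being used.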
 

Sreerag

Member
Hi,
I tried the suggestion above, but the problem remains.
From the cesm.log file I found that the model gets stuck after writing the first restart file, with the following errors:


Opened file EXP_30_new.cam.rs.0001-01-02-00000.nc to write 2752512
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
max rss=509.9 MB
newchild: child "a:PIO:pio_write_darray" can't be a parent of itself
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.


Could this be a problem with my port, or a mismatch in the config_machine file?

Thanks in advance
Yours Sincerely
Sreerag
 

jedwards

CSEG and Liaisons
Staff member
This indicates that you are taking an incorrect path through the code; perhaps there is another error in the log before these? It indicates that t_stopf for "a:PIO:pre_pio_write_nf" was called before t_startf, and that t_stopf for "i:cice_run_total" was called more than once.
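
One way to look for an earlier error is to strip the repeated timer messages out of the log and read what remains. The following is a minimal sketch, assuming the timer noise always starts with one of the two prefixes seen in this thread (GPTLstopf and newchild); the function name is hypothetical and the sample text is a shortened copy of the excerpt quoted above.

```python
import re

# Prefixes of the timer messages quoted in this thread; treated as noise here.
TIMER_NOISE = re.compile(r"^\s*(GPTLstopf|newchild)\b")


def non_timer_lines(log_text):
    """Return (line_number, text) pairs for non-empty lines that are not timer messages."""
    kept = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if line.strip() and not TIMER_NOISE.match(line):
            kept.append((number, line.rstrip()))
    return kept


sample = """Opened file EXP_30_new.cam.rs.0001-01-02-00000.nc to write 2752512
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
max rss=509.9 MB
newchild: child "a:PIO:pio_write_darray" can't be a parent of itself
GPTLstopf: timer "i:cice_run_total" was already off.
"""

for number, text in non_timer_lines(sample):
    print(number, text)
```

On a real log, replace the sample string with the contents of the cesm.log file; the line numbers that are printed refer to positions in the original file, so the surviving messages can be found in context.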
 

Sreerag

Member
Thank you for your reply. This information seems very relevant to my question. I am attaching the log file with this reply. I don't find an error before these messages, but there are some warnings. I am not able to find where I am going wrong. If you have any comments, I would be very interested to hear them.

The cluster I am using runs Slurm. For this log file I did not run with #SBATCH --exclusive; could that also cause some inconsistency?

Thanks in advance.
Sreerag
 

Attachments

  • cesm.log.17828904.230705-182053.txt
    95.6 KB · Views: 3

jedwards

CSEG and Liaisons
Staff member
What version of the model are you using and have you made any changes to the source code? If you
provide the compset and resolution I can try to reproduce this issue locally.
 

Sreerag

Member
Thank you very much. I am using cesm2.1.3.
I have not changed anything in the source code.
I am running:
create_newcase --case f_present_day --compset F2000climo --res f09_f09_mg17

Thanks,
Sreerag
 