Hi,
I am porting CESM2 to my university's cluster, which uses Slurm as its batch system. On a single node the model runs at a reasonable speed: 1.45 simulated years/day on 128 processors. I am using Open MPI. I am attaching my config file, along with a log file from a run that timed out because the model ran slower even when the number of processors was doubled.
Can anyone please help me with this?
Is the Open MPI build you are using optimized for the network you have? This is really a question for your system administration staff.
Your mpirun statement also includes the parameter --map-by ppr:1:node, which I don't think is correct. You might try:
<arg name="tasks_per_node"> --map-by ppr:{{ tasks_per_node }}:socket:PE=$ENV{OMP_NUM_THREADS} --bind-to hwthread</arg>
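To make the difference between the two mappings concrete, here is a minimal sketch of how many MPI ranks per node a `ppr` specification implies, based on the Open MPI convention that `ppr:N:unit` places N processes on each unit. The helper `ranks_per_node` and its `sockets_per_node` parameter are hypothetical and are not part of Open MPI or CIME; the two-socket node is an assumption, so check the real layout with your system administrators.

```python
def ranks_per_node(map_by: str, sockets_per_node: int = 2) -> int:
    """Return the MPI ranks per node implied by a 'ppr:N:unit[:PE=k]' spec.

    Illustrative helper only (hypothetical name). Handles the 'node' and
    'socket' units; sockets_per_node is an assumed hardware property.
    """
    fields = map_by.split(":")
    if len(fields) < 3 or fields[0] != "ppr":
        raise ValueError(f"not a ppr mapping: {map_by!r}")
    count, unit = int(fields[1]), fields[2]
    if unit == "node":
        return count                      # N ranks on each node
    if unit == "socket":
        return count * sockets_per_node   # N ranks on each socket
    raise ValueError(f"unsupported unit: {unit!r}")


print(ranks_per_node("ppr:1:node"))          # a single rank on each node
print(ranks_per_node("ppr:64:socket:PE=1"))  # 64 ranks on each of 2 sockets
```

Under these assumptions, `ppr:1:node` places only one rank on each node, which is unlikely to match a layout that expects many tasks per node.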
Hi,
I tried the suggestion above, but the problem remains.
From the cesm.log file, I found that the model gets stuck after writing the first restart file, with the following errors:
Opened file EXP_30_new.cam.rs.0001-01-02-00000.nc to write 2752512
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.
max rss=509.9 MB
newchild: child "a:PIO:pio_write_darray" can't be a parent of itself
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
GPTLstopf: timer "i:cice_run_total" was already off.
Could this be a problem with my port, or a mismatch in the config_machines file?
This indicates that you are taking an incorrect path through the code. Perhaps there is another error in the log before these? The messages indicate that t_stopf for "a:PIO:pre_pio_write_nf" was called before t_startf, and that t_stopf for "i:cice_run_total" was called more than once.
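As an illustration of what these two kinds of messages mean, here is a toy sketch of start/stop timer bookkeeping. It is not the GPTL implementation: the class `TimerRegistry` and its methods are hypothetical, and only the message wording is borrowed from the log above.

```python
class TimerRegistry:
    """Toy start/stop timer bookkeeping; not the real GPTL code."""

    def __init__(self):
        self.running = set()   # timers that are currently on
        self.seen = set()      # timers started at least once
        self.messages = []     # diagnostics, collected instead of printed

    def startf(self, name: str) -> None:
        self.seen.add(name)
        self.running.add(name)

    def stopf(self, name: str) -> None:
        if name not in self.seen:
            # Stop requested for a timer that was never started.
            self.messages.append(f'timer for "{name}" had not been started.')
        elif name not in self.running:
            # Timer exists but an earlier stop already turned it off.
            self.messages.append(f'timer "{name}" was already off.')
        else:
            self.running.remove(name)


timers = TimerRegistry()
timers.stopf("a:PIO:pre_pio_write_nf")   # stop without a matching start
timers.startf("i:cice_run_total")
timers.stopf("i:cice_run_total")         # normal stop, no message
timers.stopf("i:cice_run_total")         # second stop of the same timer
for message in timers.messages:
    print(message)
```

In this sketch each message comes from a mismatched start/stop pair, which is why such messages usually point to an unexpected code path rather than to the timers themselves.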
Thank you for your reply. This information is very relevant to my question. I am attaching the log file to this reply. I don't find an error before these messages, but there are some warnings. I am not able to work out where I am going wrong. If you have any comments, I would be very interested to hear them.
The cluster I am using runs Slurm. For this log file I did not submit the job with #SBATCH --exclusive. Could that also cause some inconsistency?
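One way to check for an earlier error or warning, as suggested above, is to separate the timer messages from everything that precedes them. The sketch below is a hypothetical helper (`summarize_log` is not a CESM or CIME tool); it assumes the timer messages start with `GPTLstopf` or `newchild`, as they do in the excerpt posted in this thread.

```python
import re
from collections import Counter

# Assumed prefixes of timer-related messages, taken from the posted excerpt.
TIMER_NOISE = re.compile(r"^(GPTLstopf|newchild)\b")


def summarize_log(lines):
    """Return (lines seen before the first timer message, timer message counts)."""
    before = []
    counts = Counter()
    seen_timer = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if TIMER_NOISE.match(line):
            seen_timer = True
            counts[line] += 1       # collapse repeats into a count
        elif not seen_timer:
            before.append(line)     # candidate earlier errors or warnings
    return before, counts


sample = [
    "Opened file EXP_30_new.cam.rs.0001-01-02-00000.nc to write 2752512",
    'GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.',
    'GPTLstopf thread 0: timer for "a:PIO:pre_pio_write_nf" had not been started.',
    "max rss=509.9 MB",
    'GPTLstopf: timer "i:cice_run_total" was already off.',
]
before, counts = summarize_log(sample)
print(before)
for message, n in counts.items():
    print(n, message)
```

To run it on a real log, pass an open file object for your cesm.log file in place of `sample`; the lines returned in `before` are the ones worth reading for an earlier failure.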
What version of the model are you using, and have you made any changes to the source code? If you provide the compset and resolution, I can try to reproduce this issue locally.
Thank you very much. I am using version cesm2.1.3.
I have not changed anything in the source code.
I created the case with:
create_newcase --case f_present_day --compset F2000climo --res f09_f09_mg17