Infinite hang occurs in CPL during initialization or model execution

Kihang Youn

New Member
Hi all,

My model run intermittently hangs indefinitely.
According to the model logs, the last output is in cpl.log, at the (seq_mct_drv) : creating gsmap_ax, creating dom_ax step. I am not sure whether that helps.
The odd part is that when I rerun the same case over and over, it sometimes runs fine.
Is there anything I can check to debug this further? Has anyone seen a case like this?

Best Regards,
Kihang
 

fischer

CSEG and Liaisons
Staff member
Hi Kihang,

Since it runs sometimes, I suspect it's a system issue. You can turn on debugging by using ./xmlchange DEBUG=TRUE.
Then rebuild and rerun; you might get more information about the hang. But please read the following about the information to include so that we may better assist you.


Thanks
Chris
 

Kihang Youn

New Member
Hi Chris,

Here are the log files from my run, along with more details.

Error depending on process combination

With the following process combination, the run completed without a problem.
- NTASK_OCN=1140, NTASK_ATM=6460, NTASK_CPL=6460
But when I changed the process combination to the following, the hang occurred.
- NTASK_OCN=2280, NTASK_ATM=5320, NTASK_CPL=5320
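
For reference, the two layouts can be compared with a few lines of Python. This is only a rough sketch: the dictionary keys and the `summarize` helper are made up for illustration, and treating OCN and ATM as running on disjoint ranks is an assumption, since no root-PE settings are given here.

```python
# Task counts exactly as quoted above; nothing else about the PE layout is known.
layouts = {
    "works": {"ocn": 1140, "atm": 6460, "cpl": 6460},
    "hangs": {"ocn": 2280, "atm": 5320, "cpl": 5320},
}


def summarize(layout):
    """Summarize one layout.

    ASSUMPTION: OCN and ATM occupy disjoint MPI ranks, so their task
    counts add up to the total rank count of the job.
    """
    return {
        "ocn_plus_atm": layout["ocn"] + layout["atm"],
        "cpl_matches_atm": layout["cpl"] == layout["atm"],
        "atm_to_ocn_ratio": layout["atm"] / layout["ocn"],
    }


summaries = {name: summarize(layout) for name, layout in layouts.items()}

for name, summary in summaries.items():
    print(name, summary)
```

Under that assumption both layouts use the same total number of ranks, and in both the coupler has the same task count as the atmosphere; what differs is how the ranks are split between OCN and ATM/CPL.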

Source code location where the model stops

After adding a few print statements to the code, I found that the model stops inside the call seq_map_map(mapper_Co2x(eoi), dom_oo(eoi)%data, dom_ox%data, msgtag=~). Within the seq_map_mod.F90 source code, it stops at mct_rearr_rearrange (line 483, cesm1_2_2_1).

Model log file

The model log file is attached.


Best Regards,
Kihang
 


fischer

CSEG and Liaisons
Staff member
Hi Kihang,

If you don't mind me asking, why are you using cesm1_2_2_1 and not a newer cesm2 version?

Chris
 

Kihang Youn

New Member
Hi Chris,

The model version, compset, and resolution I need to optimize are fixed, so I cannot change the version. :-(

I confirmed that the run is stuck in MPI_WAITANY inside the rearrange function, and I suspect the cause is the component's comm_world.

I will test this by adjusting the number of cpl processes.

Best Regards,
Kihang
 