
B cases hanging during initialization on Pleiades

jamiller

J Miller
New Member
Hello,

I'm running CESM v2.1.5 on NASA's Pleiades machine. The F1850 compset runs fine as expected, but for the B1850 compset, the run seems to hang near the end of the initialization. None of the log files show an error message. Depending on the number of cores, sometimes it hangs on the first run, and sometimes it hangs when starting a resubmitted run.

There's another forum post with a similar issue here. In this post, the user solved the issue by updating their compiler. I'm running the latest version of my compiler (comp-intel/2023.2.1) but the issue still persists.

My guess is that the default module/compiler configuration for Pleiades has gotten out of date, but I'm rather new to this. Does anyone have an env_mach_specific.xml file from a successful B run on Pleiades that I can use to check module versions?

Attached are the cesm log (with the middle trimmed out to fit the file size limit), cpl log, and my env_mach_specific.xml file. Thanks!
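
One way to check module versions against a known-good file is to pull the loaded modules out of both env_mach_specific.xml files and diff them. The sketch below is only illustrative: the XML layout in SAMPLE_A is an assumed, simplified stand-in rather than the attached file, the module name mpi-hpe/mpt.2.28 is inferred from the versions mentioned in this thread, and loaded_modules and module_diff are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified stand-in for an env_mach_specific.xml file.
# It is NOT the file attached to this thread.
SAMPLE_A = """<file id="env_mach_specific.xml" version="2.0">
  <module_system type="module">
    <modules>
      <command name="purge"/>
      <command name="load">comp-intel/2023.2.1</command>
      <command name="load">mpi-hpe/mpt.2.30</command>
    </modules>
  </module_system>
</file>"""

# Second hypothetical file that differs only in the MPT version.
SAMPLE_B = SAMPLE_A.replace("mpt.2.30", "mpt.2.28")


def loaded_modules(xml_text):
    """Return the text of every <command name="load"> element, in file order."""
    root = ET.fromstring(xml_text)
    return [
        cmd.text.strip()
        for cmd in root.iter("command")
        if cmd.get("name") == "load" and cmd.text
    ]


def module_diff(xml_a, xml_b):
    """Return (only_in_a, only_in_b) as sorted lists of module names."""
    a, b = set(loaded_modules(xml_a)), set(loaded_modules(xml_b))
    return sorted(a - b), sorted(b - a)


print(loaded_modules(SAMPLE_A))
print(module_diff(SAMPLE_A, SAMPLE_B))
```

With real files, the two strings would be read from disk instead of being defined inline.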
 

Attachments

  • env_mach_specific.xml.txt (2.1 KB)
  • cpl.log.txt (83.8 KB)
  • cesm.log_part2.txt (634 KB)
  • cesm.log_part1.txt (367.3 KB)

jedwards

CSEG and Liaisons
Staff member
MPT: Received signal 15

I believe that this indicates you've run out of memory. You could try increasing the task counts in the pelayout, or try a different version of the MPT library if one is available.
 

jamiller

J Miller
New Member
Thanks for the response. I am trying a few different versions of the MPT library (mpi-hpe/mpt.2.30 and 2.28), but so far they haven't made any difference.

The "MPT: Received signal 15" message comes when the run is killed - the run hangs just before this line.

I've tried setting up the B1850 case (f19_g17 resolution) on Pleiades-ivy with 960 cores (48 nodes, each with 64 GB of memory). Surely that should be enough memory.
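
As a rough check on the numbers above, the average memory available to each MPI task can be worked out directly. This is a back-of-the-envelope sketch only: it assumes tasks are spread evenly across nodes and ignores memory used by the operating system and MPI buffers, and memory_per_task_gb is a hypothetical helper name. It says nothing about how much memory the model actually needs.

```python
def memory_per_task_gb(total_tasks, nodes, mem_per_node_gb):
    """Return (tasks_per_node, average GB per task), assuming an even spread."""
    if total_tasks % nodes != 0:
        raise ValueError("tasks do not divide evenly across nodes")
    tasks_per_node = total_tasks // nodes
    return tasks_per_node, mem_per_node_gb / tasks_per_node


# Figures quoted in the post: 960 cores on 48 nodes with 64 GB per node.
tasks_per_node, gb_per_task = memory_per_task_gb(960, 48, 64)
print(tasks_per_node, gb_per_task)  # 20 3.2
```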

Do you have any suggestions for a different pelayout?
 

jamiller

J Miller
New Member
I've used the default, which gives every component 960 cores, though NTASKS_WAV had to be capped at 600. This setup hangs on the initial run.
./xmlquery NTASKS
NTASKS: ['CPL:960', 'ATM:960', 'LND:960', 'ICE:960', 'OCN:960', 'ROF:960', 'GLC:960', 'WAV:600', 'ESP:960']
I've also tried 600 cores for all components, which does the initial run fine but hangs on the resubmitted run.
Giving it 1200 cores seems to break the ocean model during initialization, and gives the error:
POP aborting...
(init_moc_ts_transport_arrays) SH is not a regular lat-lon grid. The southern boundary for region 2 ("Atlantic") cannot be specified.
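
When comparing layouts like the ones above, it can help to turn the xmlquery output into a dictionary and flag components whose task count differs from the largest one. The sketch below parses the exact line quoted in this post; parse_ntasks is a hypothetical helper, and the parsing assumes the output keeps this "NAME: [list]" shape.

```python
import ast

# The line printed by ./xmlquery NTASKS in the post above.
XMLQUERY_OUTPUT = (
    "NTASKS: ['CPL:960', 'ATM:960', 'LND:960', 'ICE:960', 'OCN:960', "
    "'ROF:960', 'GLC:960', 'WAV:600', 'ESP:960']"
)


def parse_ntasks(line):
    """Convert "NTASKS: ['CPL:960', ...]" into {'CPL': 960, ...}."""
    # Split off the variable name at the first colon, then read the list literal.
    _, _, list_text = line.partition(":")
    entries = ast.literal_eval(list_text.strip())
    result = {}
    for entry in entries:
        component, _, count = entry.partition(":")
        result[component] = int(count)
    return result


ntasks = parse_ntasks(XMLQUERY_OUTPUT)
largest = max(ntasks.values())
# Components running on fewer tasks than the largest component.
smaller = {comp: n for comp, n in ntasks.items() if n < largest}
print(largest, smaller)  # 960 {'WAV': 600}
```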
 

wbr2023

彬睿王
Member
Hi, have you solved this problem? I'm experiencing the same issue.
 

wbr2023

彬睿王
Member
I ran the B2000 compset using CESM2.1.3 and submitted the job through the SLURM workload manager, requesting 9 nodes (each with 64 cores), for a total of 576 cores. None of my log files show any error messages, and the job status indicates it's still running, but the cesm.log file stops at the point shown in the screenshot.
1746622296397.png
This is the job submission script:
1746622874536.png
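
A job that still shows as running while cesm.log has stopped growing can be spotted by checking how long ago the log was last modified. The following is a minimal sketch, not part of CESM: seconds_since_update and looks_hung are hypothetical helpers, the 1800-second threshold is an arbitrary assumption, and the demo uses a temporary file in place of a real cesm.log.

```python
import os
import tempfile
import time


def seconds_since_update(path, now=None):
    """Seconds elapsed since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_hung(path, max_idle_seconds=1800, now=None):
    """True if the log has not been written to for longer than the threshold."""
    return seconds_since_update(path, now) > max_idle_seconds


# Demo with a temporary file standing in for cesm.log; `now` is passed
# explicitly so the result does not depend on the wall clock.
with tempfile.NamedTemporaryFile(suffix=".log") as fake_log:
    mtime = os.path.getmtime(fake_log.name)
    fresh = looks_hung(fake_log.name, now=mtime + 60)
    stale = looks_hung(fake_log.name, now=mtime + 3600)

print(fresh, stale)  # False True
```

A quiet log is only a hint: some initialization phases legitimately write nothing for a while, so the threshold would need tuning for the case at hand.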
 

jedwards

CSEG and Liaisons
Staff member
CESM 2.1.5 is the latest release in that series; please update and try again. It also looks like you did not follow the porting instructions and have instead written a custom submit script. Please follow the CESM procedure for porting and running jobs, and let us know if you have any trouble doing so.
 

jamiller

J Miller
New Member
I wasn't able to figure out the issue. My runs would hang in the same spot as yours on either the initial or resubmitted runs depending on the MPI or compiler version or the number of cores, so you could try changing those.

I assumed the issue was specific to Pleiades, but it looks like you're using a different system that I'm not familiar with. If you're able to figure it out, I'd be interested too!
 