
Cheyenne errors with large ensemble sizes

adamhb

Adam Hanbury-Brown
New Member
Thanks in advance for any help / guidance on the following issue!

I am getting the following error in the cesm.log file while trying to run a 1512-member, single site ensemble with CLM-FATES:

/glade/u/apps/ch/opt/mpt/2.22/bin/omplace: line 704: 39929 Bus error dplace -p $placefile "$@"

The error surfaces as a model execution error and a case.run error about 7 hours after model execution starts. However, it appears that no output files had been written at the time of the error, so I don't think the simulations ever actually started. I have been able to run a successful 216-member ensemble with the same case build settings, so scaling up to the larger ensemble size seems to be the issue.

I am running on Cheyenne and have attached my build script for more details. The case dir is here: /glade/u/home/adamhb/cases/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff

The text below is written to the CaseStatus file.

2023-10-09 06:39:07: model execution error
ERROR: Command: 'mpiexec_mpt -p "%g:" -np 1512 omplace -tm open64 -vv /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '/bin/sh: line 1: 39806 Bus error mpiexec_mpt -p "%g:" -np 1512 omplace -tm open64 -vv /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/bld/cesm.exe >> cesm.log.$LID 2>&1' from dir '/glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/run'
---------------------------------------------------
2023-10-09 06:39:07: case.run error
ERROR: RUN FAIL: Command 'mpiexec_mpt -p "%g:" -np 1512 omplace -tm open64 -vv /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/run/cesm.log.3757830.chadmin1.ib0.cheyenne.ucar.edu.231008-224802


For a little more context, I have also tried running a 5004-member ensemble where the only difference in the case build settings was that I added the --multi-driver flag to create_newcase. This simulation also failed under similar circumstances (no output files were written even though the CaseStatus log showed the model running for multiple hours), but with a *different error* written to the cesm.log file:

-1:MPT: shepherd terminated: r10i0n13.ib0.cheyenne.ucar.edu - job aborting
 

Attachments

  • run-script-large-ensemble-example.txt

jedwards

CSEG and Liaisons
Staff member
I think you should try running without the omplace argument. Also, if you see that the run has started but hasn't begun writing output files after about 5 minutes, just kill it; it's not going to run.
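One way to act on that advice is to check the run directory for recently written files and flag the job for killing if nothing is appearing. This is only a sketch, not an official CESM tool: the run-directory path uses a placeholder case name, and the 5-minute grace period comes from the advice above.

```shell
#!/bin/sh
# Hedged sketch: check whether a started run is actually writing output.
# MYCASE is a placeholder; the 5-minute grace period is an assumption.
RUNDIR=${1:-/glade/scratch/$USER/archive/MYCASE/run}
GRACE_MIN=${2:-5}

# Count run-dir files modified within the last $GRACE_MIN minutes.
recent=$(find "$RUNDIR" -maxdepth 1 -type f -mmin -"$GRACE_MIN" 2>/dev/null | wc -l)

if [ "$recent" -eq 0 ]; then
    echo "stalled: no files written in the last $GRACE_MIN minutes"
    # qdel "$jobid"   # on Cheyenne (PBS) you would kill the job here
else
    echo "active: $recent file(s) recently modified"
fi
```

Run it with the case's run directory as the first argument; it only reports, and the actual qdel is left commented out so nothing is killed by accident.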

To remove dplace:
 

adamhb

Adam Hanbury-Brown
New Member
Thanks for the quick reply jedwards. Do you have any tips on how/where to remove the "omplace" argument? I do see a line in the case's
env_mach_specific.xml file: <arg name="zthreadplacement"> omplace -tm open64 </arg>

Do I delete this line entirely or make a change here? It looked like you were going to add instructions on how to remove "dplace"... typo?

Also do you recommend using the --multi-driver flag in the call to create_newcase for large ensembles?
Thanks again!
Adam
 

jedwards

CSEG and Liaisons
Staff member
The --multi-driver flag is only for MCT and is ignored in this case.

Try just deleting the zthreadplacement line. You may need to make this change in the source tree, in config_machines.xml, rather than in env_mach_specific.xml. Have you tried on Derecho?
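For reference, the block containing that line looks roughly like the following in env_mach_specific.xml (and similarly inside the Cheyenne entry of config_machines.xml). The surrounding arguments are illustrative and may differ on your system; only the quoted zthreadplacement line is taken from this thread. Deleting that one line stops the executable from being wrapped in omplace:

```xml
<mpirun mpilib="mpt">
  <executable>mpiexec_mpt</executable>
  <arguments>
    <arg name="labelstdout">-p "%g:"</arg>
    <arg name="num_tasks">-np {{ total_tasks }}</arg>
    <!-- delete this line to remove omplace from the launch command -->
    <arg name="zthreadplacement"> omplace -tm open64 </arg>
  </arguments>
</mpirun>
```

After editing, rebuild the case so the change is picked up in the generated run command.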
 

adamhb

Adam Hanbury-Brown
New Member
OK, thanks.

I will try deleting that line in ctsm/ccs_config/machines/config_machines.xml and also in env_mach_specific.xml, just to be sure.

No I have not tried on Derecho (haven't migrated work yet).
 

adamhb

Adam Hanbury-Brown
New Member
I deleted the zthreadplacement line in the config_machines.xml file and rebuilt the case (case dir: /glade/u/home/adamhb/cases/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff). The omplace argument no longer appeared in the case's env_mach_specific.xml, which tells me that worked. However, an hour after model execution started there were still no history files being written, so I killed the job. The CaseStatus file (attached) shows case.run starting 8 hours after case.submit (maybe because I'm using the economy queue?), and model execution starting another hour after that.

Any other ideas for what might be going on, or what to try next would be greatly appreciated! The build script used this time is the same as above.
 

Attachments

  • CaseStatus.txt