Thanks in advance for any help / guidance on the following issue!
I am getting the following error in the cesm.log file while trying to run a 1512-member, single-site ensemble with CLM-FATES:
/glade/u/apps/ch/opt/mpt/2.22/bin/omplace: line 704: 39929 Bus error dplace -p $placefile "$@"
CaseStatus reports the failure as a model execution error followed by a case.run error, about 7 hours after model execution starts. However, no output files had been written at the time of the error, so I don’t think the simulations ever actually started. A 216-member ensemble with the same case build settings ran successfully, so the problem seems to be tied to scaling up to the larger ensemble size.
I am running on Cheyenne and have attached my build script for more details. The case dir is here: /glade/u/home/adamhb/cases/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff
The text below is written to the CaseStatus file.
2023-10-09 06:39:07: model execution error
ERROR: Command: 'mpiexec_mpt -p "%g:" -np 1512 omplace -tm open64 -vv /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '/bin/sh: line 1: 39806 Bus error mpiexec_mpt -p "%g:" -np 1512 omplace -tm open64 -vv /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/bld/cesm.exe >> cesm.log.$LID 2>&1' from dir '/glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/run'
---------------------------------------------------
2023-10-09 06:39:07: case.run error
ERROR: RUN FAIL: Command 'mpiexec_mpt -p "%g:" -np 1512 omplace -tm open64 -vv /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /glade/scratch/adamhb/archive/ca_5pfts_1512mem_100823_-17e2acb6a_FATES-031f28ff/run/cesm.log.3757830.chadmin1.ib0.cheyenne.ucar.edu.231008-224802
For a little more context, I have also tried running a 5004-member ensemble where the only difference in the case build settings was that I added the --multi-driver flag to create_newcase. This simulation also failed under similar circumstances (no output files were written even though the CaseStatus log showed that the model had been running for multiple hours), but with a *different error* written to the cesm.log file:
-1:MPT: shepherd terminated: r10i0n13.ib0.cheyenne.ucar.edu - job aborting
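One way to double-check the "no output files were written" observation is to list the files in the run directory that were modified after the job started. The snippet below is only a minimal sketch: the function name `files_modified_since` and the filename patterns (`.nc` for history/restart files, `.log.` for log files) are assumptions for illustration, not part of CIME or CESM.

```python
import os
import sys
from datetime import datetime


def files_modified_since(run_dir, start_epoch, patterns=(".nc", ".log.")):
    """Return (path, mtime) pairs for files under run_dir whose name contains
    one of `patterns` and whose modification time is >= start_epoch,
    sorted oldest first. The patterns are illustrative assumptions."""
    hits = []
    for root, _dirs, names in os.walk(run_dir):
        for name in names:
            if not any(p in name for p in patterns):
                continue
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if mtime >= start_epoch:
                hits.append((path, mtime))
    return sorted(hits, key=lambda item: item[1])


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_output.py <run_dir> <job start time, ISO format>
    # The start time is interpreted in the local time zone.
    run_dir = sys.argv[1]
    start = datetime.fromisoformat(sys.argv[2]).timestamp()
    for path, mtime in files_modified_since(run_dir, start):
        print(datetime.fromtimestamp(mtime).isoformat(), path)
```

If this prints only log files and no `.nc` files, that would be consistent with the model stalling during initialization rather than failing partway through the run.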