srun error in two clusters

wvsi3w · Jun 15, 2023

I tried a historical compset on two clusters:

./create_newcase --case /home/meisam/scratch/cases/histJuneTestBeluga --compset IHistClm50BgcCrop --res f19_g17 --machine beluga --walltime 02:00:00 --run-unsupported
AND
./create_newcase --case /home/meisam/scratch/cases/histJune --compset IHistClm50BgcCrop --res f19_g17 --machine narval --walltime 02:00:00 --run-unsupported

The config_machine file and other related files are attached for both runs (on two clusters).

After I submitted the jobs on the two clusters (Beluga and Narval), both failed after running for a couple of minutes with the following messages:
On Beluga: "case.run error
ERROR: RUN FAIL: Command 'srun -n 80 --ntasks-per-node=40 /scratch/meisam/histJuneTestBeluga/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed"

On Narval: "case.run error
ERROR: RUN FAIL: Command 'srun -n 128 --ntasks-per-node=64 /scratch/meisam/histJune/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed"

The two log files for these errors are also attached.

What do you think the issue is? Do I need to change the number of tasks per node in both configs? I ask because these numbers (80 and 128) are above the limits of these two clusters. I don't understand the error. What can we learn from it and from the two CESM log files that I have attached?
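For reference, in an srun command `-n` gives the total number of tasks and `--ntasks-per-node` gives how many of them are placed on each node, so 80 and 128 are job-wide totals rather than per-node counts. The sketch below is only an illustration of that arithmetic: the helper name `node_count` is hypothetical, and the shortened command strings stand in for the full commands quoted above.

```python
import math
import re


def node_count(srun_command):
    """Return (total_tasks, tasks_per_node, nodes_needed) for an srun command line."""
    # -n is the total task count; the lookbehind avoids matching inside other flags.
    total = int(re.search(r"(?<!\S)-n\s+(\d+)", srun_command).group(1))
    per_node = int(re.search(r"--ntasks-per-node=(\d+)", srun_command).group(1))
    return total, per_node, math.ceil(total / per_node)


for cmd in (
    "srun -n 80 --ntasks-per-node=40 cesm.exe",    # Beluga case
    "srun -n 128 --ntasks-per-node=64 cesm.exe",   # Narval case
):
    print(node_count(cmd))
```

Both commands therefore ask for two nodes. Whether 40 and 64 tasks per node match the actual node sizes on each cluster is something to confirm against the cluster documentation.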

Thank you for your help

oleson · Jun 15, 2023

For the error on Narval, try adding this to your user_nl_clm:

use_init_interp = .true.

That should come out of the box but doesn't for some reason.

Not sure what to say about the error on Beluga. It dies trying to allocate some arrays so maybe you are running out of memory...

wvsi3w · Jun 16, 2023

Thank you for your response. I added that line to the user_nl_clm file, rebuilt the case, and resubmitted it. The line shows up in the lnd_in file, but the submitted job failed again with the following error message (I have uploaded the error log file as "CESM Narval Error 2").

case.run error

ERROR: RUN FAIL: Command 'srun -n 128 --ntasks-per-node=64 /scratch/meisam/histJune/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed.

For Beluga:

[bc11536:29920:0:29920] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x5ef000005ca)
Do you think it is related to wrongly dimensioned objects that may cause a segmentation fault?

wvsi3w · Jun 19, 2023

Hello again,
In the previous log file, it said that "Consider rerunning with the following in user_nl_clm:
init_interp_fill_missing_with_natveg = .true."
I have done that and tried the run again, but I seem to get the same error. CaseStatus says: ERROR: RUN FAIL: Command 'srun -n 128 --ntasks-per-node=64 /scratch/meisam/histJune/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed.

The new log file from the run is attached; I believe it is the same as before.

slevis · Jun 19, 2023

Narval:
If you changed the case's user_nl_clm to include
`init_interp_fill_missing_with_natveg = .true.`
then it seems strange to get exactly the same error. Check the lnd_in file in the /run directory. If it doesn't include your change, then the change didn't take effect for some reason, and you should try again. Sometimes it even helps to start over clean with a new case if nothing seems to work the way that you would expect. Before doing that though, it also often helps to keep a backup of your work, in case you need to refer to it again.
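The lnd_in check described above can be scripted. This is a minimal sketch, assuming lnd_in stores each setting on its own `name = value` line; the function name `namelist_value`, the sample text, and its group name are made up for illustration and are not taken from a real lnd_in.

```python
import re


def namelist_value(text, name):
    """Return the raw value assigned to `name` in namelist-style text, or None."""
    pattern = re.compile(
        rf"^\s*{re.escape(name)}\s*=\s*(.+?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


# Hypothetical excerpt; in practice read the real file from the run directory,
# e.g. text = open("lnd_in").read()
sample = """&example_group
 use_init_interp = .true.
 init_interp_fill_missing_with_natveg = .true.
/
"""

for key in ("use_init_interp", "init_interp_fill_missing_with_natveg"):
    print(key, "->", namelist_value(sample, key))
```

A result of `None` would mean the setting never reached lnd_in, which is the situation described above where the change did not take effect.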

Beluga:
@oleson suggested that you may be running out of memory. You suggested that some object may have the wrong dimensions. We may have more insights if you let us know whether you have ever run the model successfully on Beluga. If you have, then how does this case differ from the successful cases? Have you changed code, datasets, anything else?

wvsi3w · Jun 20, 2023

Thanks, slevis, for your reply. I have checked the lnd_in file and it has that change in it. It is attached.

I saw a file named "user_nl_clm~", so I deleted it, built the case again, and submitted it. That also failed. I have attached this log file as well; it shows the same error. Weird. I will start over with a clean new case and share the results soon.

About Beluga: no, I only got as far as the submit step, and after running for 2 minutes the job failed with that error. I haven't run any simulations on Beluga before. I suspected wrong dimensions because the cluster support team guessed that this was the issue when I contacted them; however, they strongly suggested asking experts (such as this forum) about the error I am facing.

One question I have is what this error message is actually saying: "ERROR: RUN FAIL: Command 'srun -n 128 --ntasks-per-node=64 /scratch/meisam/histJune/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed." This is the first thing you see in the CaseStatus file. I know that the following line refers us to the log file, but is there anything specific about this error message that we should consider?
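Since the RUN FAIL line mainly points to the cesm.log file, one way to get at the details is to pull the error-like lines out of that log. The sketch below is only an illustration: the marker strings are the ones that happen to appear in this thread, and the helper name `error_lines` is hypothetical.

```python
import re

# Marker strings taken from messages quoted in this thread; extend as needed.
MARKERS = re.compile(r"ERROR|srun: error|Caught signal", re.IGNORECASE)


def error_lines(log_text, context=1):
    """Return (line_number, text) pairs for marker lines plus `context` lines after each."""
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if MARKERS.search(line):
            keep.update(range(i, min(len(lines), i + context + 1)))
    return [(i + 1, lines[i]) for i in sorted(keep)]


# In practice: log_text = open("cesm.log.<LID>").read()
sample_log = "Insta fail\nsrun: error: bc12017: task 0: Exited with exit code 255\n"
for number, text in error_lines(sample_log, context=0):
    print(number, text)
```

The first marker line in the log is usually a better starting point than the last one, since later messages are often consequences of the first failure.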

wvsi3w · Jun 21, 2023

Hello,
I did the run again with a new case from the beginning, and it shows the same error (log attached). I added the two lines "use_init_interp = .true." and "init_interp_fill_missing_with_natveg = .true."; they appear in the lnd_in file, but the error keeps showing up.
What do you think about it?

It says:
"ERROR initInterp set_mindist: Cannot find any input points matching output point:
subgrid level, index = column 80683
lat, lon = 1.745329251994330E+034 , 1.745329251994330E+034
ltype: 9
ctype: 71

Consider rerunning with the following in user_nl_clm:
init_interp_fill_missing_with_natveg = .true."
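One detail in this message may be worth a look: the reported `lat, lon` value is numerically equal to 1.0e36 degrees converted to radians. The sketch below only checks that arithmetic. Reading 1.0e36 as a missing-value flag is an assumption, not something the log states, but if it holds, column 80683 may simply have no valid coordinates rather than a real location that failed to match.

```python
import math

reported = 1.745329251994330e34    # lat/lon value printed in the error message
candidate = math.radians(1.0e36)   # 1.0e36 degrees expressed in radians

print(candidate)
print(math.isclose(reported, candidate, rel_tol=1e-12))
```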

oleson · Jun 21, 2023

I tried to repeat your case on our machine here using release-cesm2.1.3:

./create_newcase --case cesm213R_2deg_IHistClm50BgcCrop --compset IHistClm50BgcCrop --res f19_g17 --run-unsupported
cd cesm213R_2deg_IHistClm50BgcCrop
./case.setup
./case.build
./case.submit

At this point I received the same error you got originally:

Did you mean to set use_init_interp = .true. in user_nl_clm?
(Setting use_init_interp = .true. is needed when doing a
transient run using an initial conditions file from a non-transient run,
or a non-transient run using an initial conditions file from a transient run,
or when running a resolution or configuration that differs from the initial conditions.)
ERROR:
ERROR in /home/meisam/my_cesm_sandbox/components/clm/src/main/ncdio_pio.F90.in
at line 368
check_dim ERROR: mismatch of input dimension 14853 with expected value
17716 for variable landunit

I then added this to user_nl_clm as instructed earlier:

use_init_interp = .true.

and then resubmitted the case again and it ran fine.

I did not have to set:

init_interp_fill_missing_with_natveg = .true.

So at this point I can only assume there is something wrong with your port of CESM to your machine or you've made some changes to either the code or namelist, etc. that I'm not aware of.

I assume you've ported CIME to your machine per the porting guide:

Porting and validating CIME on a new platform (CIME master documentation, esmci.github.io)

wvsi3w · Jun 22, 2023

Thank you, Keith, for your reply.
I did this six months ago with the help of a colleague (who is not available at the moment), and as far as I remember we went through all of the porting steps.
I will try to do it again, but I have a question first. I wanted to test the historical compset because the project I have in mind consists of single-point simulations (for borehole temperature profiling through transient and equilibrium simulations). Do you think I need to fix the current issue, or is the single-point analysis totally independent of it?
I personally think it is necessary to overcome my current issue with the historical compset simulation.

oleson · Jun 22, 2023

I agree. Even if the single point simulation ran to completion there would be reason to doubt the results.

wvsi3w · Jun 22, 2023

I tried a different compset, "IHistClm50Bgc" (same resolution), and oddly it worked. With IHistClm50Bgc instead of IHistClm50BgcCrop, the case ran without an error. After ./case.submit it downloaded some input data and kept running for the full 2 hours. I had set the wall-clock time to 2 hours, so the 5-year simulation starting in 1850 stopped in 1853 (last history file: clm2.h0.1853-04.nc) when it ran out of time (TIMEOUT).

How is this possible? The only difference between this compset and the previous one is Crop.
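On the timeout: the last history file gives a rough way to estimate the wall-clock time a full 5-year run would need. This is a back-of-the-envelope sketch, assuming the run started in January 1850, that clm2.h0.1853-04.nc marks a completed month, and that throughput stays constant; it ignores start-up and input download time. The function names are hypothetical.

```python
def months_completed(start_year, last_year, last_month):
    """Months covered from January of start_year through last_month of last_year."""
    return (last_year - start_year) * 12 + last_month


def hours_needed(target_months, done_months, hours_used):
    """Linear extrapolation of wall-clock time from the observed throughput."""
    return target_months * hours_used / done_months


done = months_completed(1850, 1853, 4)       # last history file: clm2.h0.1853-04.nc
estimate = hours_needed(5 * 12, done, 2.0)   # 5 simulated years, 2 hours used
print(done, estimate)
```

Under those assumptions the run covered 40 months in 2 hours, so 60 months would take about 3 hours; a `--walltime` somewhat above that would leave some margin.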

slevis · Jun 26, 2023

If the IHistClm50BgcCrop case still doesn't work, you might try the following:
- Look at the differences between the two cases using the diff command on the two case directories (the directories created by the create_newcase command).
- Manually put the changes needed for the crop case (the one that doesn't work) into the non-crop case (the one that works). This may be tedious, so it's your decision whether to try my suggestion, which may still not work in the end.
- Build and hope that the manually edited case runs.

@oleson was able to run and could not recreate the error that you see, so I think we have reached the end of our ability to help with this.
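The directory comparison suggested above could also be done from Python. This is a minimal sketch using the standard-library `filecmp` module; the helper name `case_differences` and the example paths are hypothetical. Note that `filecmp.dircmp` compares files shallowly by default (type, size, and modification time), so a plain `diff -r` remains the more thorough check.

```python
import filecmp
import os


def case_differences(left, right):
    """Recursively list files that differ or exist in only one of two directories."""
    report = []

    def walk(cmp, prefix=""):
        for name in sorted(cmp.left_only):
            report.append(("only in left", os.path.join(prefix, name)))
        for name in sorted(cmp.right_only):
            report.append(("only in right", os.path.join(prefix, name)))
        for name in sorted(cmp.diff_files):
            report.append(("differs", os.path.join(prefix, name)))
        # Descend into subdirectories present on both sides.
        for name, sub in sorted(cmp.subdirs.items()):
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right))
    return report


# Example call with hypothetical case directories:
# for kind, path in case_differences("cases/crop_case", "cases/nocrop_case"):
#     print(kind, path)
```

Files such as user_nl_clm and the env_*.xml files are probably the most informative entries in the resulting list.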

wvsi3w · Jul 4, 2023

Hello again,
After porting CLM5 (following the documentation), I ran scripts_regression_tests.py on the Beluga cluster (a Compute Canada system). It ran 129 tests in 10199.481s and reported FAILED (failures=3, skipped=11) with this message:
Detected failures, leaving directory: /scratch/XXXXX/scripts_regression_test.20230704_101948 .

In the "scripts_regression_test.20230704_101948" directory, one of the errors is the following, which is inside the st_archive_resubmit_test directory:

ERROR: BUILD_COMPLETE is not true
Please rebuild the model interactively.

The second error I found is the following (in TESTRUNFAIL_P1.f19_g16_rx1.A.beluga_intel.20230704_114343):
case.run error
ERROR: RUN FAIL: Command 'srun -n 1 --ntasks-per-node=40 /scratch/XXXXX/scripts_regression_test.20230704_101948/TESTRUNFAIL_P1.f19_g16_rx1.A.beluga_intel.20230704_114343/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /scratch/XXXXX/scripts_regression_test.20230704_101948/TESTRUNFAIL_P1.f19_g16_rx1.A.beluga_intel.20230704_114343/run/cesm.log.38408315.230704-114433

The log file mentioned in the previous line shows this:
Insta fail
srun: error: bc12017: task 0: Exited with exit code 255

I couldn't find the third error.

P.S. I validated the config files with the xmllint command, and they all passed.
