My apologies; I must've gotten confused with another post. My guess is that the issue you're hitting is related to the default shared-memory limit Docker applies - typically 64MB - which is insufficient for high numbers of MPI ranks. Try adding the flag:
--shm-size=512M
... to your 'docker run' command. Basically, every MPI process stores some data in /dev/shm (a shared-memory filesystem), but Docker defaults to a small allocation there. That's typically fine for a 4- or 8-core laptop, but not for a 48-core system like yours. For the GNU/MPICH combination in use, 512MB is likely enough for 48 ranks; worst case, try 1G as well.
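For example, a minimal sketch of the full command (the image name and application here are just placeholders for whatever you're actually running):

docker run --rm -it --shm-size=512M my-mpi-image mpirun -np 48 ./my_app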
And yes, if you have 48 cores, using only 6 or 12 won't give you the full performance. It's not always perfectly linear, since memory bandwidth matters a lot as well, but I'd say try the above and then set the number of tasks correctly. For the container: if you're using the Jupyter version, MAX_TASKS_PER_NODE should be set automatically, but if you're not using the Jupyter version then yes, you need to set it explicitly. Setting both of those variables, plus NTASKS=48, should work.
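If it helps, here's roughly what that could look like passed on the 'docker run' line (assuming the container picks these up as plain environment variables; the image name is again a placeholder):

docker run --rm -it --shm-size=512M -e MAX_TASKS_PER_NODE=48 -e NTASKS=48 my-mpi-image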
If not, let me know and we'll try to solve it quickly.
Cheers,
- Brian