Tutorial gets error on yellowstone: something to do with too many tasks

Hi!

I am teaching a course running CESM, using a tutorial I taught from two years ago. With the same script that ran successfully then, I now get the following error:
Execute poe command line: poe  /glade/scratch/mahowald/TestCLM_1/bld/cesm.exe
ATTENTION: 0031-393  Ignoring -resd/MP_RESD specified for batch job
ATTENTION: 0031-408  64 tasks allocated by Resource Manager, continuing...
ATTENTION: 0031-606 Unrecognized environment variable, MP_EAGER_LIMIT_LOCAL.
ERROR: 0031-758 AFFINITY: [ys5922] Oversubscribe: 32 tasks in total, each task requires 1 resource,
but there are only 16 available resource. Affinity can not be applied
ERROR: 0031-161  EOF on socket connection with node ys5922-ib
INFO: 0031-639  Exit status from pm_respond = -1

The run directory for the CESM is:
~mahowald/TestCLM_1

the scratch directory is:
/glade/scratch/mahowald/TestCLM_1/

Could you help me figure out what changed on Yellowstone in the last two years that might have impacted this, and/or how to fix this problem? CISL suggested I add the following before submitting the CESM run script: "setenv MP_TASK_AFFINITY cpu", or better yet to put it directly in the CESM run script somewhere before the command mpirun.lsf.

Which I did, but it still didn't work. Does anyone have any other suggestions?
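For reference, here is roughly where I put it in the run script (only the relevant lines are shown; the exe path is the one from the error output above):

setenv MP_TASK_AFFINITY cpu                                   # the line CISL suggested
mpirun.lsf /glade/scratch/mahowald/TestCLM_1/bld/cesm.exe     # the existing launch line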

Thanks very much.
Natalie
 

jedwards

CSEG and Liaisons
Staff member
In env_mach_pes.xml change MAX_TASKS_PER_NODE to 16. This should solve the problem. Currently you are trying to use 32 MPI tasks per node. Each node has 16 CPUs; they can run up to 32 threads, but we recommend not using more than 16 MPI tasks per node.
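Concretely, you can make that change from the case directory either by editing env_mach_pes.xml by hand or with the xmlchange utility, roughly like this (the flags shown are the CESM1-era form, so adjust to whatever your scripts version accepts):

cd ~mahowald/TestCLM_1                                             # the case directory from your post
./xmlchange -file env_mach_pes.xml -id MAX_TASKS_PER_NODE -val 16
# equivalently, edit env_mach_pes.xml so the entry reads:
#   <entry id="MAX_TASKS_PER_NODE" value="16" />
# then resubmit the run (clean/reconfigure first if your workflow requires it)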
 
Thanks, this fixed it! I think I could also have fixed it by just changing ptile=16 (instead of 32) in the run script; I tried that as well and it worked. Thanks!

Natalie
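For anyone else who goes the run-script route, the ptile setting is in the LSF directives at the top of the script; the changed line looks roughly like this (other #BSUB directives omitted):

#BSUB -n 64                        # total MPI tasks (matches the 64 tasks in the error output)
#BSUB -R "span[ptile=16]"          # at most 16 tasks per node, instead of ptile=32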
 