MPI_Wait error during running

tresamt

Tresa Mary
Member
Hi,
I am trying to run an F2000climo compset at 0.9x1.25 resolution. We have created our own machine and compiler configurations (attached).

We are able to run the model if a high JOB_WALLCLOCK_TIME (>10 hours) is given. For shorter JOB_WALLCLOCK_TIME values, the following error occurs within a few seconds (30-40):

Abort(1007330318) on node 1 (rank 1 in comm 0): Fatal error in internal_Wait: Message truncated, error stack:
internal_Wait(89): MPI_Wait(request=0x7fffa471334c, status=0x692bae0) failed
MPIR_Wait(911)...:

The log file is also attached. I would like to understand why this is happening.
 


erik

Erik Kluzek
CSEG and Liaisons
Staff member
I don't see the log file; maybe it was too big to attach.

Since this happens quickly, my guess is that the MPI or batch environment is different when you run with a shorter wallclock time. Maybe the job lands on a node with less memory or with different MPI settings? I think this is likely an issue with the particular machine you are running on. So I suggest you check with the system admins for your machine and have them let you know what might be different in the environments for the two cases.
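
One way to make that comparison concrete, as a minimal sketch: save the output of `env` from inside a job submitted with each wallclock setting, then list the variables that differ. The function names `parse_env` and `diff_env` and the sample values are hypothetical, chosen only for illustration; they are not taken from the thread or from CESM.

```python
def parse_env(text):
    """Parse `env`-style output (one KEY=VALUE per line) into a dict."""
    env = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not KEY=VALUE
        key, _, value = line.partition("=")
        env[key.strip()] = value
    return env


def diff_env(text_a, text_b):
    """Return {key: (value_a, value_b)} for keys that differ.

    A value of None means the key is missing from that dump.
    """
    a, b = parse_env(text_a), parse_env(text_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Made-up sample dumps; replace with the files saved from your two jobs.
    long_job = "SLURM_JOB_PARTITION=long\nOMP_NUM_THREADS=1\n"
    short_job = "SLURM_JOB_PARTITION=short\nOMP_NUM_THREADS=1\nFI_PROVIDER=tcp\n"
    for key, (va, vb) in diff_env(long_job, short_job).items():
        print(f"{key}: long={va!r} short={vb!r}")
```

Any variable that shows up here (queue or partition, MPI or fabric settings, memory limits) is a reasonable thing to raise with the system administrators.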
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
We don't have the resources to examine your files in depth to validate that they are correct. You will need to do that with your own team. The system administrators for your machine should also be able to help you.

We do have general advice on porting CESM to other machines here:


That gives you some advice on steps to take as well as tests to do as you go through them.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
I looked at the log file and it's a general MPI error, so not very helpful. Unfortunately this is not uncommon. You should look in the various log files to see if there are other hints about what's going on. I also recommend running a simpler configuration (simpler compset and lower resolution) to see if you can get a case working. You can also build and run with DEBUG=TRUE to see if you get better error trapping.
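
As a rough sketch of the log search, the script below scans any log files passed on the command line and prints lines that look like error hints. The keyword list and the names `find_hints` and `scan_files` are assumptions for illustration, not part of CESM; extend the pattern for your model components and MPI library.

```python
import re
import sys

# Assumed keywords; adjust for your model and MPI library.
HINT_PATTERN = re.compile(
    r"error|abort|fatal|truncated|out of memory|killed", re.IGNORECASE
)


def find_hints(lines, pattern=HINT_PATTERN):
    """Return (line_number, text) for each line matching the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_files(paths):
    """Yield (path, line_number, text) for matches across several log files."""
    for path in paths:
        # errors="replace" keeps the scan going if a log has odd bytes in it
        with open(path, errors="replace") as handle:
            for number, text in find_hints(handle):
                yield path, number, text


if __name__ == "__main__":
    for path, number, text in scan_files(sys.argv[1:]):
        print(f"{path}:{number}: {text}")
```

The earliest match across the component logs is usually more informative than the final MPI abort, which is often only a consequence of an earlier failure on another rank.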
 

tresamt

Tresa Mary
Member
Thank you, Erik.
We have tried a couple of combinations now. The general trend we see is that the case works when a higher wallclock time is given.
I will try your suggestions too.
I hope the system administrators will be able to help.
Thank you
 