MPI_Wait error during running

tresamt

Tresa Mary
Member
Hi,
I am trying to run an F2000climo compset at 0.9x1.25 resolution. We have created our own machine and compiler configurations (attached).

We are able to run the model if a high JOB_WALLCLOCK_TIME (>10 hours) is given. For shorter JOB_WALLCLOCK_TIME values, the following error occurs within a few seconds (30-40): Abort(1007330318) on node 1 (rank 1 in comm 0): Fatal error in internal_Wait: Message truncated, error stack: internal_Wait(89): MPI_Wait(request=0x7fffa471334c, status=0x692bae0) failed MPIR_Wait(911)...:

The log file is also attached. I would like to understand why this is happening.
 

Attachments

  • Compiler.txt
    1.7 KB · Views: 4
  • Machine.txt
    1.6 KB · Views: 2

erik

Erik Kluzek
CSEG and Liaisons
Staff member
I don't see the log file; maybe it was too big to attach.

Since this happens quickly, my guess is that the MPI or batch environment is different when you run with a shorter wallclock time. Maybe it's a node with lower memory or different MPI settings? I think this is likely an issue with the particular machine you are running on. So I suggest you get in touch with the system admins for your machine and have them tell you what might be different in the environments for the two cases.
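
One way to check this yourself, as a rough sketch only: save the output of `env` from inside a long-wallclock job and a short-wallclock job, then compare the two dumps. The variable names in the sample data (`QUEUE`, `MPI_SETTING`) are made up for illustration and are not taken from any real machine or from the attached files; the parser also assumes one `KEY=VALUE` pair per line, so multi-line values would need extra handling.

```python
def parse_env(text):
    """Parse `env`-style output (one KEY=VALUE per line) into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            env[key.strip()] = value
    return env


def diff_env(long_run, short_run):
    """Return {name: (value_in_long_run, value_in_short_run)} for each
    variable that is missing from one dump or has a different value."""
    a, b = parse_env(long_run), parse_env(short_run)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up sample dumps (hypothetical variable names and values).
sample_long = "PATH=/usr/bin\nQUEUE=long\n"
sample_short = "PATH=/usr/bin\nQUEUE=short\nMPI_SETTING=x\n"

for name, (in_long, in_short) in diff_env(sample_long, sample_short).items():
    print(f"{name}: long={in_long!r} short={in_short!r}")
```

Any variable that shows up in this diff (queue or partition, MPI library settings, memory limits) is a good thing to raise with the system admins.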
 

tresamt

Tresa Mary
Member
Hi Erik,
The log file is attached. I have contacted the system admins too, but I also wanted to confirm my new machine specifications.
 

Attachments

  • cesm.log.178.200504-182548 (2).txt
    11.6 KB · Views: 4

tresamt

Tresa Mary
Member
Sorry, I attached the wrong log file earlier.
Please find attached the correct log file.
 

Attachments

  • cesm.txt
    48.7 KB · Views: 5

erik

Erik Kluzek
CSEG and Liaisons
Staff member
We don't have the resources to examine your files in depth to validate they are correct. You will need to do that with your own team. Your system administrators for your machine should also be able to help you.

We do have general advice on porting CESM to other machines here:


That gives you some advice on steps to take as well as tests to do as you go through them.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
I looked at the log file and it's a general MPI error, so not very helpful. Unfortunately this is not uncommon. You should look in the various log files to see if you have other hints about what's going on. I also recommend running a simpler configuration (simpler compset and lower resolution) to see if you can get a case working. Also do things like build and run with DEBUG=TRUE to see if you get better error trapping.
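
For going through the log files, a small script can save some time. The following is only a sketch: the keyword list is my own guess at useful search terms, not an official list of CESM or MPI messages, and the `scan_files` helper and its arguments are hypothetical. The second sample line is taken from the error quoted earlier in this thread; the first is invented.

```python
import re
from pathlib import Path

# Illustrative keywords only; adjust to what your components and MPI library print.
PATTERNS = re.compile(
    r"error|abort|fatal|truncated|segmentation|out of memory", re.IGNORECASE
)


def find_hints(lines):
    """Return (line_number, text) for every line that matches PATTERNS."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if PATTERNS.search(line)
    ]


def scan_files(directory, glob="*.log*"):
    """Apply find_hints to every file in `directory` whose name matches `glob`."""
    hits = {}
    for path in sorted(Path(directory).glob(glob)):
        if path.is_file():
            with open(path, errors="replace") as handle:
                found = find_hints(handle)
            if found:
                hits[path.name] = found
    return hits


# Small in-memory demonstration.
sample = [
    "model initialization complete",
    "Abort(1007330318) on node 1 (rank 1 in comm 0): Fatal error in internal_Wait: Message truncated, error stack:",
]
for number, text in find_hints(sample):
    print(number, text)
```

Pointing `scan_files` at the run directory of the failing case would list the matching lines per log file, which makes it easier to spot a component that complained before the MPI abort.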
 

tresamt

Tresa Mary
Member
Thank you, Erik.
We have tried a couple of combinations now. The general trend is that the case works when a higher wallclock time is given.
I will try your suggestions too.
I hope the system administrators will be able to help.
Thank you
 