
MPI Error in making high-resolution surface dataset

Status
Not open for further replies.

yifanc17

Yifan Cheng
New Member
Hi all,

I'm trying to create a high-resolution surface dataset (1 km over the Contiguous US) for regional simulations, following the instructions in Setting up (high-res sparse) regional-grid CTSM simulations #1919. I have successfully created the 1 km masked mesh file (please see attached CONUS_1km_mesh.png). However, when using mksurfdata_esmf to create the surface dataset:
Code:
./gen_mksurfdata_namelist --start-year 2005 --end-year 2005 --nocrop --model-mesh-nx 6464 --model-mesh-ny 2781 --model-mesh /glade/work/yifanc17/02_data/cesmdata/meshdata/CONUS_1kmx1km/lnd_mesh_CONUS_1km_c240729.nc --res CONUS_1km
I was able to generate the .namelist; however, I got MPI errors like those below when generating the .nc file (I didn't make any changes to the default .namelist):

Code:
MPICH ERROR [Rank 0] [job id e36be614-3f1c-4b4c-aedb-649bde738d9f] [Tue Jul 30 15:32:47 2024] [dec2401] - Abort(874109199) (rank 0 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(163)............: MPI_Send(buf=0x15174dcd8010, count=14803696, MPI_DOUBLE, dest=1, tag=17, comm=0xc4000011) failed
MPID_Send(499)............:
MPIDI_send_unsafe(58).....:
MPIDI_OFI_send_normal(372): OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Bad address)


aborting job:
Fatal error in PMPI_Send: Other MPI error, error stack:
PMPI_Send(163)............: MPI_Send(buf=0x15174dcd8010, count=14803696, MPI_DOUBLE, dest=1, tag=17, comm=0xc4000011) failed
MPID_Send(499)............:
MPIDI_send_unsafe(58).....:
MPIDI_OFI_send_normal(372): OFI tagged senddata failed (ofi_send.h:372:MPIDI_OFI_send_normal:Bad address)
dec2401.hsn.de.hpc.ucar.edu: rank 0 exited with code 255
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libpthread-2.31.s  000014839FD278C0  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014839DC0B94A  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014839C9CD8C5  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014839CA35FAF  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014839D796FA6  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014839D675A83  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014839BBEA79F  Unknown               Unknown  Unknown
libmpi_intel.so.1  000014839BBEB315  PMPI_Alltoallw        Unknown  Unknown
libpioc.so         00001483A755445C  pio_swapm             Unknown  Unknown
libpioc.so         00001483A7557D11  rearrange_io2comp     Unknown  Unknown
libpioc.so         00001483A757167E  PIOc_read_darray      Unknown  Unknown
mksurfdata         0000000000FCCA76  Unknown               Unknown  Unknown
mksurfdata         0000000000EAE81C  Unknown               Unknown  Unknown
mksurfdata         0000000000EADDEE  Unknown               Unknown  Unknown
mksurfdata         00000000007D55E2  Unknown               Unknown  Unknown
mksurfdata         00000000007D1CA6  Unknown               Unknown  Unknown
mksurfdata         00000000004F5833  Unknown               Unknown  Unknown
mksurfdata         0000000000448D91  Unknown               Unknown  Unknown
mksurfdata         00000000004B711A  Unknown               Unknown  Unknown
mksurfdata         00000000004292AD  Unknown               Unknown  Unknown
libc-2.31.so       000014839AB8E29D  __libc_start_main     Unknown  Unknown
mksurfdata         00000000004291DA  Unknown               Unknown  Unknown

I have attached the .log and the batch job error files. Not sure if it's because this high resolution is too computationally intensive. I wonder if creating surface dataset at this resolution on Derecho is feasible. If yes, can anyone share some experience or successful cases with me?
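For scale, here is a rough back-of-envelope check (a sketch: the mesh dimensions come from my gen_mksurfdata_namelist command above, and the count from the MPI error message):

```shell
# Back-of-envelope size of the failing MPI_Send buffer (numbers from above).
nx=6464; ny=2781                      # --model-mesh-nx / --model-mesh-ny
echo "grid cells:  $((nx * ny))"      # ~18 million cells at 1 km over CONUS
count=14803696                        # MPI_DOUBLE count in the error message
echo "send buffer: $((count * 8 / 1024 / 1024)) MiB per message"
```

If many buffers of that size are live at once, per-task memory adds up quickly at this resolution.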

Thanks a lot!
 

Attachments

  • CONUS_1km_mesh.png (100.2 KB)
  • mksurfdata_CONUS_1km.o5431509.txt (960.5 KB)
  • surfdata_CONUS_1km_hist_2005_16pfts_c240730.log.txt (28.9 KB)

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I tried it using:

./gen_mksurfdata_jobscript_single --number-of-nodes 24 --tasks-per-node 12 --namelist-file surfdata_CONUS_1km_hist_2005_16pfts_c240730.namelist

It seemed to complete successfully. Are you maybe running out of disk space?
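A quick way to check (a sketch; the path is illustrative, point it at wherever your output .nc file is being written):

```shell
# Show free space on the filesystem holding the output directory.
# /glade/work is illustrative; substitute your actual output path.
df -h /glade/work 2>/dev/null || df -h .
```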
 

yifanc17

Yifan Cheng
New Member
oleson said:
I tried it using:

./gen_mksurfdata_jobscript_single --number-of-nodes 24 --tasks-per-node 12 --namelist-file surfdata_CONUS_1km_hist_2005_16pfts_c240730.namelist

It seemed to complete successfully. Are you maybe running out of disk space?
Thank you Keith! I guess I didn't request enough nodes before; it works now!
 