
CESM 1.2 multi-instance config on small (2-node) machine

I am attempting to test a 20-instance CESM run (compset F_2000_WACCM) for use with DART. We have a 2-node Intel cluster with 12 pes per node (24 total). I'm having difficulty coming up with an env_mach_pes.xml that doesn't result in MPI errors. Earlier, I had success with a 2-instance CESM configuration that used 12 pes; now I'm trying to scale up to a 20-member ensemble. I recognize this is an undersized system for a 20-member CESM ensemble, but it is for development only.

My latest env_mach_pes.xml is below. Thanks in advance for any suggestions.
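(For readers unfamiliar with the file: a multi-instance layout in CESM 1.x env_mach_pes.xml is expressed through per-component NTASKS/NINST/ROOTPE entries. The fragment below is an illustrative sketch only, not the poster's file, which did not attach; values are hypothetical.)

```xml
<!-- Illustrative sketch, not the poster's actual env_mach_pes.xml.
     One pe per instance for 20 instances would look roughly like: -->
<entry id="NTASKS_ATM" value="20" />
<entry id="NINST_ATM"  value="20" />
<entry id="ROOTPE_ATM" value="0" />
<entry id="NTASKS_LND" value="20" />
<entry id="NINST_LND"  value="20" />
<entry id="ROOTPE_LND" value="0" />
```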

santos

Member
I don't see an attachment here, and I'm not sure what MPI errors you are getting either. Regardless, there is a certain amount of memory (configuration- and resolution-dependent) that you will need for running each WACCM case. If your two nodes do not have enough memory, they simply will not be able to run 20 cases at once, no matter how you handle the layout.
 

Sorry--I didn't realize my .xml didn't take. Here it is, followed by the MPI errors when I tried to run on 12 pes. Thanks.

(seq_comm_setcomm)  initialize ID (  1 GLOBAL          ) pelist   =     0    11     1 ( npes =    12) ( nthreads =  1)
MPI Error, rank:0, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:2, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:3, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:4, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:5, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:6, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:7, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:8, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:9, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:10, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI Error, rank:11, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI: Global rank 0 is aborting with error code 0.
Process ID: 11874, Host: n002, Program: /rotor/scratch/p1783-HALAS/cesm1_1_1/CAMDART_F_2000_WACCM/cesm.exe
MPI: --------stack traceback-------
MPI Error, rank:1, function:MPI_GROUP_RANGE_INCL, Invalid argument
MPI: Attaching to program: /proc/11874/exe, process 11874
MPI: Try: zypper install -C "debuginfo(build-id)=365e4d2c812908177265c8223f222a1665fe1035"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=8362cd0e37776b4bba3372224858dbcafcadc4ee"
MPI: (no debugging symbols found)...done.
MPI: [Thread debugging using libthread_db enabled]
MPI: Try: zypper install -C "debuginfo(build-id)=a41ac0b0b7cd60bd57473303c2c3de08856d2e06"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=3f06bcfc74f9b01780d68e89b8dce403bef9b2e3"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=d70e9482ac22a826c1cf7d04bdbb1bf06f2e707b"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=17c088070352d83e7afc43d83756b00899fd37f0"
MPI: (no debugging symbols found)...done.
MPI: Try: zypper install -C "debuginfo(build-id)=81a3a96c7c0bc95cb4aa5b29702689cf324a7fcd"
MPI: (no debugging symbols found)...done.
MPI: 0x00002aaaab67e105 in waitpid () from /lib64/libpthread.so.0
MPI: (gdb) #0  0x00002aaaab67e105 in waitpid () from /lib64/libpthread.so.0
MPI: #1  0x00002aaaab3e99a4 in mpi_sgi_system (command=) at sig.c:89
MPI: #2  MPI_SGI_stacktraceback (command=) at sig.c:269
MPI: #3  0x00002aaaab37ec42 in print_traceback (ecode=0) at abort.c:168
MPI: #4  0x00002aaaab37ee33 in MPI_SGI_abort () at abort.c:78
MPI: #5  0x00002aaaab3a50a3 in errors_are_fatal (comm=, code=) at errhandler.c:223
MPI: #6  0x00002aaaab3a5401 in MPI_SGI_error (comm=1, code=13) at errhandler.c:60
MPI: #7  0x00002aaaab3b4734 in PMPI_Group_range_incl (group=3, n=1, ranges=0x799b520, newgroup=0x7fffffff7b54) at group_range_incl.c:58
MPI: #8  0x00002aaaab3b4795 in pmpi_group_range_incl__ () from /opt/sgi/mpt/mpt-2.02/lib/libmpi.so
MPI: #9  0x000000000110f84c in seq_comm_mct_mp_seq_comm_setcomm_ ()
MPI: #10 0x0000000001114daa in seq_comm_mct_mp_seq_comm_init_ ()
MPI: #11 0x0000000000434733 in ccsm_comp_mod_mp_ccsm_pre_init_ ()
MPI: #12 0x0000000000435df2 in MAIN__ ()
MPI: #13 0x0000000000009fe0 in ?? ()
MPI: #14 0x0000000000000000 in ?? ()
MPI: (gdb) A debugging session is active.
MPI:
MPI:    Inferior 1 [process 11874] will be detached.
MPI:
MPI: Quit anyway? (y or n) [answered Y; input not from terminal]
MPI: Detaching from program: /proc/11874/exe, process 11874
 

jedwards

CSEG and Liaisons
Staff member
In general, you can't have more NTASKS_{COMP} than you have available pes; MPI doesn't like that. Reduce the number of instances, or run on one pe per instance.
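That constraint can be sanity-checked before submitting a job. The sketch below is my own illustration (not a CESM utility), assuming the documented CESM multi-instance rules: a component's NTASKS may not exceed the pes available to the job, and NTASKS must divide evenly among the NINST instances.

```python
def check_layout(total_pes, ntasks, ninst):
    """Flag the two common multi-instance layout mistakes.

    total_pes: pes available to the job (e.g. 2 nodes x 12 = 24)
    ntasks:    NTASKS_<COMP> for one component
    ninst:     NINST_<COMP> for that component
    """
    problems = []
    if ntasks > total_pes:
        problems.append(f"NTASKS={ntasks} exceeds available pes ({total_pes})")
    if ntasks % ninst != 0:
        problems.append(f"NTASKS={ntasks} not divisible by NINST={ninst}")
    return problems

# 20 instances sharing NTASKS=12 cannot be split evenly -> flagged.
print(check_layout(12, 12, 20))
# One pe per instance on the 24-pe (2-node) machine -> no problems.
print(check_layout(24, 20, 20))
```

This mirrors jedwards's advice: with 24 pes, either drop below 24 instances with one pe each, or shrink the ensemble.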
 

Can CESM execute these ensemble runs sequentially? There is an NINST_{COMP}_LAYOUT variable in env_mach_pes.xml, but the user documentation suggests this is not yet an active feature.
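(For context, the variable in question is set per component in env_mach_pes.xml; as the poster notes, the CESM 1.x documentation describes "sequential" as not yet supported, so "concurrent" is the working value. Illustrative fragment only:)

```xml
<!-- Illustrative sketch: instance layout for the atmosphere component.
     "sequential" exists as a value but is not an active feature in CESM 1.x. -->
<entry id="NINST_ATM_LAYOUT" value="concurrent" />
```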
 