
Porting issues for CESM3 pre-release (cesm3_0_beta04) on aarch64 macOS 15.2

phansel

Paul Hansel
New Member
As an evaluation of porting CESM to macOS 15.2 on Apple Silicon (arm64 Darwin), I am following the CAM F2000 control exercise in the CESM Tutorial ("1: Control case: F2000climo"). This is on a single node with 16 performance cores @ 3.7 GHz.

I've installed netcdf, hdf5, Open MPI, python3, gfortran, LAPACK, OpenBLAS, etc. via Homebrew. Versions and formulae are listed in brewlist.txt.

ESMF is built and installed with version: v8.8.0b00-240-g7228e87d3c. Python is version 3.13.1.

My CESM directory tree is as follows:
phansel@CRUMPET CESM_Data % pwd
/Users/phansel/Public/CESM_Data
phansel@CRUMPET CESM_Data % tree -L 1
.
├── esmf
├── f2000_control
├── hosts
├── inputdata
├── my_cesm_sandbox
└── outputdata

5 directories, 1 file

What version of the code are you using?
git-describe:
cesm3_0_beta04
./bin/git-fleximod status: (prompt text in the forum template still suggests checkout_externals!)
Result in git-fleximod-status.txt.
I selected this branch because none of the CESM2 branches I tried supported python3.

Have you made any changes to files in the source tree?
- I have created a file ~/.cime/config_machines.xml that passes XML validation. It is attached below.
- I have modified env_mach_specific.py under CIME to comment out resource.setrlimit() since modifying RLIMIT_STACK is evidently not supported on macOS (see python3 resource.setrlimit strange behaviour under macOS · Issue #78783 · python/cpython and other issues).
diff --git a/CIME/XML/env_mach_specific.py b/CIME/XML/env_mach_specific.py
- resource.setrlimit(attr, limits)
+ #resource.setrlimit(attr, limits)
- I have modified CIME/Tools/Makefile to force the inclusion of LAPACK. Without this, case.build returns a number of errors related to LAPACK FORTRAN functions. I'm aware that this is not the right place to link LAPACK, but it's the first place I found that worked since ~/.cime/config_compilers.xml is no longer applicable.
diff --git a/CIME/Tools/Makefile b/CIME/Tools/Makefile
-FoX_LIBS := -L$(SHAREDLIBROOT)/$(SHAREDPATH)/CDEPS/fox/lib -lFoX_dom -lFoX_sax -lFoX_utils -lFoX_fsys -lFoX_wxml -lFoX_common -lFoX_fsys
+FoX_LIBS := -L$(SHAREDLIBROOT)/$(SHAREDPATH)/CDEPS/fox/lib -lFoX_dom -lFoX_sax -lFoX_utils -lFoX_fsys -lFoX_wxml -lFoX_common -lFoX_fsys -llapack
Errors that appear without including lapack:
Undefined symbols for architecture arm64:
"_dgbsv_", referenced from:
___lapack_interfaces_MOD_dgbsv_wrap in libatm.a[272](lapack_interfaces.o)
[continued]
"_strmv_", referenced from:
___lapack_interfaces_MOD_strmv_wrap in libatm.a[272](lapack_interfaces.o)
ld: symbol(s) not found for architecture arm64
collect2: error: ld returned 1 exit status
gmake: *** [/Users/phansel/Public/CESM_Data/example3/Tools/Makefile:935: ../../cesm.exe] Error 1
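Rather than commenting out the resource.setrlimit() call entirely, one less invasive option is to attempt it and continue if the OS refuses. The following is a minimal sketch, not CIME's actual code: the helper name `try_setrlimit` is hypothetical, and it assumes the failure surfaces as a ValueError or OSError, which are the exceptions Python's resource module documents for this call.

```python
import resource


def try_setrlimit(attr, limits):
    """Apply a resource limit; return False instead of raising if it is refused.

    Hypothetical helper: `attr` is a resource.RLIMIT_* constant and `limits`
    is a (soft, hard) tuple, matching the arguments of resource.setrlimit().
    """
    try:
        resource.setrlimit(attr, limits)
        return True
    except (ValueError, OSError) as err:
        # On macOS, changing RLIMIT_STACK may be rejected; warn and carry on.
        print(f"warning: could not set rlimit {attr} to {limits}: {err}")
        return False
```

With this approach the limit is still applied on platforms that allow it, and the port to macOS only loses the stack-size change.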

Describe every step you took leading up to the problem:
- Checkout CESM at the specified branch using git-fleximod
- Make modifications to CIME as mentioned above
- Create ~/.cime/config_machines.xml as attached
- Confirm that MPI processes can ssh into localhost without issue:
ssh localhost
- Set environment variables
export CIME_MACHINE=crumpet
export NETCDF_PATH=/opt/homebrew/Cellar/netcdf/4.9.2_2
export NETCDF_FORTRAN_PATH=/opt/homebrew/Cellar/netcdf-fortran/4.6.1_1
export NETCDF_C_PATH=/opt/homebrew/Cellar/netcdf/4.9.2_2
export ESMFMKFILE=/Users/phansel/Public/CESM_Data/esmf/lib/libO/Darwin.gfortranclang.64.mpiuni.default/esmf.mk
- Create new case:
cd $CIMEROOT
./scripts/create_newcase --case $CESMDATAROOT/f2000_control --compset F2000climo --res f19_f19_mg17
- Set up the case per defaults
cd $CESMDATAROOT/f2000_control
./case.setup
- Build the case
./case.build
- Download input data - this can only be done after building the case?
./check_input_data --download
- Build again (just to be sure)
./case.build
- Submit to queue (which is none)
./case.submit
The case stops running after ~1 second and exits with this error.
run command is mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1
Exception from case_run: ERROR: RUN FAIL: Command 'mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /Users/phansel/Public/CESM_Data/outputdata/f2000_control/run/cesm.log.250112-173750
The specified log file has been moved to the archive directory and is not informative:
phansel@CRUMPET f2000_control % cat /Users/phansel/Public/CESM_Data/outputdata/archive/f2000_control/logs/cesm.log.250112-173750
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

Process name: [prterun-CRUMPET-23234@1,17]
Exit code: 1
--------------------------------------------------------------------------
Looking further into the outputdata/f2000_control/run directory, an ESMF logfile is present:
phansel@CRUMPET f2000_control % cat ../outputdata/f2000_control/run/PET0.ESMF_LogFile
20250112 173751.639 ERROR PET0 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esm.F90:950 Not valid - Invalid NTASKS value specified for component: cpl ntasks: 32 1
20250112 173751.639 ERROR PET0 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esm.F90:203 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:797 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2918 Not valid - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250112 173751.639 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1345 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esmApp.F90:134 Not valid - Passing error in return code
20250112 173751.639 INFO PET0 Finalizing ESMF with endflag==ESMF_END_ABORT
No other log files exist in the outputdata/f2000_control directory.
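In the ESMF log above, only the first ERROR line carries the root cause; the rest are "Passing error in return code" entries from callers further up the stack. A small sketch for pulling out that first error from each PET log follows. The function names are hypothetical, and the parsing assumes the whitespace-separated layout seen in this log (date, time, level, PET, message).

```python
import glob
import os


def parse_esmf_line(line):
    """Split one ESMF log line into (level, pet, message), or None if malformed."""
    parts = line.split(None, 4)  # date, time, level, PET, remainder
    if len(parts) < 5:
        return None
    _date, _time, level, pet, message = parts
    return level, pet, message.strip()


def first_error(lines):
    """Return the first (level, pet, message) entry whose level is ERROR, if any."""
    for line in lines:
        parsed = parse_esmf_line(line)
        if parsed and parsed[0] == "ERROR":
            return parsed
    return None


def summarize(run_dir):
    """Print the first error found in each PET*.ESMF_LogFile under run_dir."""
    for path in sorted(glob.glob(os.path.join(run_dir, "PET*.ESMF_LogFile"))):
        with open(path) as handle:
            found = first_error(handle)
        if found:
            print(f"{os.path.basename(path)}: {found[2]}")
```

Calling `summarize("../outputdata/f2000_control/run")` from the case directory would print one line per PET log, which makes it quick to see whether every task failed for the same reason.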

I could not find any issues on ESMF with NTASKS mentioned: Issues · esmf-org/esmf

I looked into changing NTASKS_xyz for ATM/ICE/LND etc. to 1 (from 32) via xmlchange, but the same log file showed another error. Is this an ESMF issue?

I found this thread on a similar issue, but did not find any difference in behavior when I changed mpi-serial to mpt or the number of available PEs per node to 1.


If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.

Build logs and PET0.ESMF_LogFile attached.

Describe your problem or question:
Case submission on a tutorial CAM test case fails within seconds due to an ESMF error.
 

Attachments

  • CDEPS.bldlog.250112-170508.txt (166.2 KB)
  • git-fleximod-status.txt (4 KB)
  • brewlist.txt (1.8 KB)
  • rof.bldlog.250112-170508.txt (39.8 KB)
  • pio.bldlog.250112-170508.txt (95.5 KB)
  • ocn.bldlog.250112-170508.txt (384 bytes)
  • ice.bldlog.250112-170508.txt (284.7 KB)
  • gptl.bldlog.250112-170508.txt (7.7 KB)
  • csm_share.bldlog.250112-170508.txt (182.8 KB)
  • cesm.bldlog.250112-170508.txt (122.7 KB)

phansel

Paul Hansel
New Member
I've re-created the case after rebuilding ESMF with Open MPI rather than the default, mpiuni. There are now 24 PETxx.ESMF_LogFile files in the run/ directory, but aside from timestamps, they are all identical to the log uploaded in the previous post.
export ESMFMKFILE=/Users/phansel/Public/CESM_Data/esmf/lib/libO/Darwin.gfortranclang.64.openmpi.default/esmf.mk
There's also a different output in the CESM log:
phansel@CRUMPET run % cat /Users/phansel/Public/CESM_Data/outputdata/archive/f2000_control/logs/cesm.log.250112-181747
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
Proc: [[52404,1],5]
Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
prterun has exited due to process rank 5 with PID 0 on node CRUMPET calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------



I've already run the Porting CIME MPI example found here, without issues, although I did need to remove the ".F90" from line 1 for it to compile under gfortran gcc version 14.2.0.
phansel@CRUMPET ~ % mpif90 fhello_world_mpi.F90 -o hello_world
phansel@CRUMPET ~ % mpirun -np 2 ./hello_world
HELLO_MPI - Master process:
FORTRAN90/MPI version
An MPI test program.
The number of processes is 2
Process 0 says "Hello, world!" CRUMPET.local
Process 1 says "Hello, world!" CRUMPET.local
 

fischer

CSEG and Liaisons
Staff member
Hi Paul,

I've been trying to get the f19_f19_mg17 F2000climo case to run on our systems with just 32 tasks, and haven't had any luck. I was able to run a simple case ne5_ne5_mg37 FHS94 on 32 tasks.

Could you run ./preview_run in your case directory and attach your env_mach_pes.xml? It looks like you're trying to run on more tasks than are available.

I haven't had a chance to try ESMF 8.8 yet, but I don't think that's your issue.

Chris
 

phansel

Paul Hansel
New Member
Hi Chris,

Thanks for the follow-up!

preview_run:
phansel@CRUMPET f2000_control % ./preview_run
CASE INFO:
nodes: 2
total tasks: 32
tasks per node: 16
thread count: 1
ngpus per node: 0

BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None

MPIRUN (job=case.run):
mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None
The "total tasks" parameter is surprising. Everything I've configured in config_machines.xml says 16 cores / threads. I wonder where 32 comes from?
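The "nodes: 2" line in the preview_run output is at least consistent with dividing the total task count by the tasks per node and rounding up. A minimal sketch of that arithmetic is below; the function name `nodes_needed` is hypothetical, and the formula is an assumption about how the count is derived, not CIME's actual code.

```python
import math


def nodes_needed(total_tasks, tasks_per_node, threads_per_task=1):
    """Estimate how many nodes a PE layout requires (assumed formula)."""
    return math.ceil(total_tasks * threads_per_task / tasks_per_node)


# 32 tasks at 16 tasks per node need two nodes, so a single 16-core
# machine cannot host the default layout without oversubscribing.
print(nodes_needed(32, 16))
```

By the same arithmetic, a layout with 16 tasks would fit on one node of this machine.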

./pelayout:
phansel@CRUMPET f2000_control % ./pelayout
Comp NTASKS NTHRDS ROOTPE PSTRIDE
CPL : 32/ 1; 0 1
ATM : 32/ 1; 0 1
LND : 32/ 1; 0 1
ICE : 32/ 1; 0 1
OCN : 32/ 1; 0 1
ROF : 32/ 1; 0 1
GLC : 32/ 1; 0 1
WAV : 32/ 1; 0 1
ESP : 1/ 1; 0 1
ESMF_AWARE_THREADING is False
ROOTPE is with respect to 16.0 tasks per node
Again, 32 tasks set everywhere. I'll note, as mentioned in my first post, that changing NTASKS_xyz to 1 didn't help:
$ for a in CPL ATM LND ICE OCN ROF GLC WAV ESP ; do ./xmlchange NTASKS_$a=1 ; done
But I didn't notice that the change never took effect, because I hadn't reset the case setup. I missed the warnings indicating as much!

After resetting the setup and running the next line, I get the env_mach_pes.xml as attached (with e.g. <value compclass="ATM">16</value>)
phansel@CRUMPET f2000_control % for a in CPL ATM LND ICE OCN ROF GLC WAV ESP ; do ./xmlchange NTASKS_$a=16 ; done

phansel@CRUMPET f2000_control % ./preview_run
CASE INFO:
nodes: 1
total tasks: 16
tasks per node: 16
thread count: 1
ngpus per node: 0

BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None

MPIRUN (job=case.run):
mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None

I can now build and run it, receiving a slightly different error in PET00.ESMF_LogFile: ESMF wasn't built with the PIO library enabled.
phansel@CRUMPET run % cat PET00.ESMF_LogFile
20250113 154113.128 ERROR PET00 /Users/phansel/Public/CESM_Data/esmf/src/Infrastructure/Mesh/src/ESMCI_Mesh_FileIO.C:298 ESMCI_mesh_create_from_file() Library needed by ESMF not present - This functionality requires ESMF to be built with the PIO library enabled.
20250113 154113.131 ERROR PET00 ESMCI_MeshCap.C:2605 MeshCap::meshcreatefromfilenew() Library needed by ESMF not present - Internal subroutine call returned Error
20250113 154113.131 ERROR PET00 ESMF_Mesh.F90:1970 ESMF_MeshCreateFromFile() Library needed by ESMF not present - Internal subroutine call returned Error
20250113 154113.131 ERROR PET00 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cice/src/cicecore/drivers/nuopc/cmeps/ice_comp_nuopc.F90:734 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:2918 Library needed by ESMF not present - Phase 'IPDv01p1' Initialize for modelComp 4: ICE did not return ESMF_SUCCESS
20250113 154113.131 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:1340 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2918 Library needed by ESMF not present - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250113 154113.131 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1345 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esmApp.F90:134 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 INFO PET00 Finalizing ESMF with endflag==ESMF_END_ABORT
I'll rebuild ESMF with PIO and report back.

Thanks for the help!
 

phansel

Paul Hansel
New Member
Found something that seems to be a platform-specific issue with ESMF. I'll file a bug report.
esmf % make -j16 lib
-- Found NetCDF_C: /libnetcdf.so
-- Checking NetCDF version
-- Checking NetCDF version - 4.9.2./*!<
-- Checking whether NetCDF has parallel support
-- Checking whether NetCDF has parallel support - no
-- Looking for nc_set_log_level
-- Looking for nc_set_log_level - not found
-- Checking whether NetCDF has PnetCDF support
-- Checking whether NetCDF has PnetCDF support - no
-- Checking whether NetCDF has DAP support
-- Checking whether NetCDF has DAP support - yes
-- Found CURL: /Library/Developer/CommandLineTools/SDKs/MacOSX15.2.sdk/usr/lib/libcurl.tbd (found version "8.7.1")
-- Found HDF5_HL: /opt/homebrew/lib/libhdf5_hl.dylib
-- Found HDF5_C: /opt/homebrew/lib/libhdf5.dylib
Fortran Library build is OFF
-- Check size of size_t
-- Check size of size_t - failed
-- Check size of long long
-- Check size of long long - failed
CMake Error at src/clib/CMakeLists.txt:180 (message):
size_t and long long must be the same size!


-- Configuring incomplete, errors occurred!
make[7]: ./Makefile: No such file or directory
make[7]: *** No rule to make target `./Makefile'. Stop.
make[7]: ./Makefile: No such file or directory
make[7]: *** No rule to make target `./Makefile'. Stop.
cp: ../Install/include/*: No such file or directory
cp: ../Install/lib/*: No such file or directory
make[6]: *** [tree_lib] Error 1
make[5]: *** [tree] Error 1
make[4]: *** [tree] Error 1
make[3]: *** [tree] Error 1
make[2]: *** [tree] Error 1
make[1]: *** [build_libs] Error 2
make: *** [lib] Error 2

Seems to be related to this: prepare to update to pio2 by jedwards4b · Pull Request #32 · esmf-org/esmf
 

phansel

Paul Hansel
New Member
Appears I set the path for the NetCDF includes wrong. Not a real platform-specific error. This worked to build ESMF:
phansel@CRUMPET esmf % export ESMF_NETCDF=nc-config
ESMF builds and installs successfully, but testing again after re-building the CESM case reveals the same PIO complaint. I'll re-build with ESMF_PNETCDF=pnetcdf-config and give it another shot.
 

phansel

Paul Hansel
New Member
That did work. I ended up having to build ESMF with the following env vars (and after running "brew install pnetcdf" & re-sourcing .zprofile):
ESMFMKFILE=/Users/phansel/Public/CESM_Data/esmf/lib/libO/Darwin.gfortranclang.64.openmpi.default/esmf.mk
ESMF_DIR=/Users/phansel/Public/CESM_Data/esmf
ESMF_PIO=internal
ESMF_COMM=openmpi
ESMF_NETCDF_INCLUDE=/opt/homebrew/Cellar/netcdf/4.9.2_2/include/
ESMF_NETCDF_LIBPATH=/opt/homebrew/Cellar/netcdf/4.9.2_2/lib
ESMF_NETCDF_LIBS=-lnetcdf
ESMF_NETCDF=nc-config
ESMF_PNETCDF=pnetcdf-config
I also had to run "./case.setup --reset" and clean+build the case again. ATM build is quite slow at ~41 seconds for this example.
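Since a stale ESMF install produced the same PIO complaint more than once in this thread, it may be worth confirming which options the installed library was built with before rebuilding the case. The sketch below assumes that esmf.mk records its build options as comment lines of the form `# ESMF_PIO: internal`; that layout and both function names are assumptions, so check them against your own esmf.mk.

```python
import os
import re

# Assumed layout: "# ESMF_<NAME>: <value>" comment lines inside esmf.mk.
OPTION_LINE = re.compile(r"^#\s*(ESMF_\w+):\s*(.*?)\s*$")


def esmf_build_options(mk_text):
    """Collect the ESMF_* build options recorded as comments in esmf.mk text."""
    options = {}
    for line in mk_text.splitlines():
        match = OPTION_LINE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
    return options


def options_from_env():
    """Read the file named by ESMFMKFILE and return its recorded build options."""
    with open(os.environ["ESMFMKFILE"]) as handle:
        return esmf_build_options(handle.read())
```

If `options_from_env().get("ESMF_PIO")` comes back empty, or ESMF_COMM still reads mpiuni, the case is probably linking against an older build than intended.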

Total runtime for 1 day of the CAM f2000 control exercise was 152 seconds real, 2600 seconds user, around 70 watts average power.
 