Porting issues for CESM3 pre-release (cesm3_0_beta04) on aarch64 macOS 15.2

phansel

Paul Hansel
New Member
To evaluate porting CESM to macOS 15.2 on Apple Silicon (arm64 Darwin), I am following the CAM f2000 control exercise from the CESM tutorial ("1: Control case: F2000climo" in the CESM Tutorial). This is on a single node with 16 performance cores @ 3.7 GHz.

I've installed netcdf, hdf5, openmpi, python3, gfortran, LAPACK, openBLAS, etc. via homebrew. Versions and formulas are in brewlist.txt.

ESMF is built and installed with version: v8.8.0b00-240-g7228e87d3c. Python is version 3.13.1.

My CESM directory tree is as follows:
phansel@CRUMPET CESM_Data % pwd
/Users/phansel/Public/CESM_Data
phansel@CRUMPET CESM_Data % tree -L 1
.
├── esmf
├── f2000_control
├── hosts
├── inputdata
├── my_cesm_sandbox
└── outputdata

5 directories, 1 file

What version of the code are you using?
git-describe:
cesm3_0_beta04
./bin/git-fleximod status: (prompt text in the forum template still suggests checkout_externals!)
Result in git-fleximod-status.txt.
I selected this branch because none of the CESM2 branches I tried supported python3.

Have you made any changes to files in the source tree?
- I have created a file ~/.cime/config_machines.xml that passes XML validation. It is attached below.
- I have modified env_mach_specific.py under CIME to comment out resource.setrlimit() since modifying RLIMIT_STACK is evidently not supported on macOS (see python3 resource.setrlimit strange behaviour under macOS · Issue #78783 · python/cpython and other issues).
diff --git a/CIME/XML/env_mach_specific.py b/CIME/XML/env_mach_specific.py
- resource.setrlimit(attr, limits)
+ #resource.setrlimit(attr, limits)
- I have modified CIME/Tools/Makefile to force the inclusion of LAPACK. Without this, case.build returns a number of errors related to LAPACK FORTRAN functions. I'm aware that this is not the right place to link LAPACK, but it's the first place I found that worked since ~/.cime/config_compilers.xml is no longer applicable.
diff --git a/CIME/Tools/Makefile b/CIME/Tools/Makefile
-FoX_LIBS := -L$(SHAREDLIBROOT)/$(SHAREDPATH)/CDEPS/fox/lib -lFoX_dom -lFoX_sax -lFoX_utils -lFoX_fsys -lFoX_wxml -lFoX_common -lFoX_fsys
+FoX_LIBS := -L$(SHAREDLIBROOT)/$(SHAREDPATH)/CDEPS/fox/lib -lFoX_dom -lFoX_sax -lFoX_utils -lFoX_fsys -lFoX_wxml -lFoX_common -lFoX_fsys -llapack
Errors that appear without including lapack:
Undefined symbols for architecture arm64:
"_dgbsv_", referenced from:
___lapack_interfaces_MOD_dgbsv_wrap in libatm.a[272](lapack_interfaces.o)
[continued]
"_strmv_", referenced from:
___lapack_interfaces_MOD_strmv_wrap in libatm.a[272](lapack_interfaces.o)
ld: symbol(s) not found for architecture arm64
collect2: error: ld returned 1 exit status
gmake: *** [/Users/phansel/Public/CESM_Data/example3/Tools/Makefile:935: ../../cesm.exe] Error 1
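As an alternative to commenting out the call, the setrlimit change described above could be guarded so that a refusal from the OS is reported instead of aborting. This is a minimal sketch and not CIME's actual code: try_setrlimit is a hypothetical helper name, and the exception types caught are the ones resource.setrlimit is documented to raise.

```python
import resource


def try_setrlimit(attr, limits):
    """Apply a resource limit; return False instead of raising if the OS refuses."""
    try:
        resource.setrlimit(attr, limits)
        return True
    except (ValueError, OSError) as exc:
        # macOS is reported to reject some RLIMIT_STACK changes (cpython issue #78783).
        print(f"warning: could not set rlimit {attr} to {limits}: {exc}")
        return False


if __name__ == "__main__":
    # Re-apply the current stack limits: harmless where it works, a warning where it does not.
    current = resource.getrlimit(resource.RLIMIT_STACK)
    print("applied:", try_setrlimit(resource.RLIMIT_STACK, current))
```

With a guard like this the same file could stay usable on Linux machines, where the call is expected to succeed.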

Describe every step you took leading up to the problem:
- Checkout CESM at the specified branch using git-fleximod
- Make modifications to CIME as mentioned above
- Create ~/.cime/config_machines.xml as attached
- Confirm that MPI processes can ssh into localhost without issue:
ssh localhost
- Set environment variables
export CIME_MACHINE=crumpet
export NETCDF_PATH=/opt/homebrew/Cellar/netcdf/4.9.2_2
export NETCDF_FORTRAN_PATH=/opt/homebrew/Cellar/netcdf-fortran/4.6.1_1
export NETCDF_C_PATH=/opt/homebrew/Cellar/netcdf/4.9.2_2
export ESMFMKFILE=/Users/phansel/Public/CESM_Data/esmf/lib/libO/Darwin.gfortranclang.64.mpiuni.default/esmf.mk
- Create new case:
cd $CIMEROOT
./scripts/create_newcase --case $CESMDATAROOT/f2000_control --compset F2000climo --res f19_f19_mg17
- Set up the case per defaults
cd $CESMDATAROOT/f2000_control
./case.setup
- Build the case
./case.build
- Download input data - this can only be done after building the case?
./check_input_data --download
- Build again (just to be sure)
./case.build
- Submit to queue (which is none)
./case.submit
The case stops running after ~1 second and exits with this error.
run command is mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1
Exception from case_run: ERROR: RUN FAIL: Command 'mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /Users/phansel/Public/CESM_Data/outputdata/f2000_control/run/cesm.log.250112-173750
The log file specified has been moved, but is otherwise not informative:
phansel@CRUMPET f2000_control % cat /Users/phansel/Public/CESM_Data/outputdata/archive/f2000_control/logs/cesm.log.250112-173750
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

Process name: [prterun-CRUMPET-23234@1,17]
Exit code: 1
--------------------------------------------------------------------------
Looking further into the outputdata/f2000_control/run directory, an ESMF logfile is present:
phansel@CRUMPET f2000_control % cat ../outputdata/f2000_control/run/PET0.ESMF_LogFile
20250112 173751.639 ERROR PET0 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esm.F90:950 Not valid - Invalid NTASKS value specified for component: cpl ntasks: 32 1
20250112 173751.639 ERROR PET0 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esm.F90:203 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:797 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2918 Not valid - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250112 173751.639 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1345 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Not valid - Passing error in return code
20250112 173751.639 ERROR PET0 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esmApp.F90:134 Not valid - Passing error in return code
20250112 173751.639 INFO PET0 Finalizing ESMF with endflag==ESMF_END_ABORT
No other log files exist in the outputdata/f2000_control directory.
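Since every later entry in the ESMF log is just "Passing error in return code", the first ERROR line is usually the one worth reading. A minimal sketch of pulling it out, assuming the "DATE TIME LEVEL PET MESSAGE" layout seen in the log above; first_esmf_error is a hypothetical helper, not part of ESMF or CIME.

```python
def first_esmf_error(lines):
    """Return (pet, message) for the first ERROR entry, or None if there is none."""
    for line in lines:
        # Assumed layout: DATE TIME LEVEL PET MESSAGE...
        parts = line.split(None, 4)
        if len(parts) == 5 and parts[2] == "ERROR":
            return parts[3], parts[4].strip()
    return None


if __name__ == "__main__":
    sample = [
        "20250112 173751.639 ERROR PET0 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esm.F90:950 Not valid - Invalid NTASKS value specified for component: cpl ntasks: 32 1",
        "20250112 173751.639 ERROR PET0 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Not valid - Passing error in return code",
        "20250112 173751.639 INFO PET0 Finalizing ESMF with endflag==ESMF_END_ABORT",
    ]
    print(first_esmf_error(sample))
```

To use it on a real file, pass an open file object, e.g. first_esmf_error(open("PET0.ESMF_LogFile")).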

I could not find any issues on ESMF with NTASKS mentioned: Issues · esmf-org/esmf

I looked into changing NTASKS_xyz for ATM/ICE/LND etc. to 1 (from 32) via xmlchange, but the same log file showed another error. Is this an ESMF issue?

I found this thread on a similar issue, but did not find any difference in behavior when I changed mpi-serial to mpt or the number of available PEs per node to 1.


If this is a port to a new machine: Please attach any files you added or changed for the machine port (e.g., config_compilers.xml, config_machines.xml, and config_batch.xml) and tell us the compiler version you are using on this machine.
Please attach any log files showing error messages or other useful information.

Build logs and PET0.ESMF_LogFile attached.

Describe your problem or question:
Case submission on a tutorial CAM test case fails within seconds due to an ESMF error.
 


phansel

Paul Hansel
New Member
I've re-created the case after rebuilding ESMF with openMPI rather than the default, mpiuni. There are now 24 PETxx.ESMF_LogFile files in the run/ directory, but aside from timestamps, they are all identical to the log uploaded in the previous post.
export ESMFMKFILE=/Users/phansel/Public/CESM_Data/esmf/lib/libO/Darwin.gfortranclang.64.openmpi.default/esmf.mk
There's also a different output in the CESM log:
phansel@CRUMPET run % cat /Users/phansel/Public/CESM_Data/outputdata/archive/f2000_control/logs/cesm.log.250112-181747
(t_initf) Read in prof_inparm namelist from: drv_in
(t_initf) Using profile_disable= F
(t_initf) profile_timer= 4
(t_initf) profile_depth_limit= 4
(t_initf) profile_detail_limit= 2
(t_initf) profile_barrier= F
(t_initf) profile_outpe_num= 1
(t_initf) profile_outpe_stride= 0
(t_initf) profile_single_file= F
(t_initf) profile_global_stats= T
(t_initf) profile_ovhd_measurement= F
(t_initf) profile_add_detail= F
(t_initf) profile_papi_enable= F
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 5 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
Proc: [[52404,1],5]
Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
prterun has exited due to process rank 5 with PID 0 on node CRUMPET calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------



I've already run the Porting CIME MPI example found here, without issues, although I did need to remove the ".F90" from line 1 for it to compile under gfortran gcc version 14.2.0.
phansel@CRUMPET ~ % mpif90 fhello_world_mpi.F90 -o hello_world
phansel@CRUMPET ~ % mpirun -np 2 ./hello_world
HELLO_MPI - Master process:
FORTRAN90/MPI version
An MPI test program.
The number of processes is 2
Process 0 says "Hello, world!" CRUMPET.local
Process 1 says "Hello, world!" CRUMPET.local
 

fischer

CSEG and Liaisons
Staff member
Hi Paul,

I've been trying to get the f19_f19_mg17 F2000climo case to run on our systems with just 32 tasks, and haven't had any luck. I was able to run a simple case ne5_ne5_mg37 FHS94 on 32 tasks.

Could you run ./preview_run in your case directory and attach your env_mach_pes.xml? It looks like you're trying to run on more tasks than are available.

I haven't had a chance to try ESMF 8.8 yet, but I don't think that's your issue.

Chris
 

phansel

Paul Hansel
New Member
Hi Chris,

Thanks for the follow-up!

preview_run:
phansel@CRUMPET f2000_control % ./preview_run
CASE INFO:
nodes: 2
total tasks: 32
tasks per node: 16
thread count: 1
ngpus per node: 0

BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None

MPIRUN (job=case.run):
mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None
The "total tasks" parameter is surprising. Everything I've configured in config_machines.xml says 16 cores / threads. I wonder where 32 comes from?

./pelayout:
phansel@CRUMPET f2000_control % ./pelayout
Comp NTASKS NTHRDS ROOTPE PSTRIDE
CPL : 32/ 1; 0 1
ATM : 32/ 1; 0 1
LND : 32/ 1; 0 1
ICE : 32/ 1; 0 1
OCN : 32/ 1; 0 1
ROF : 32/ 1; 0 1
GLC : 32/ 1; 0 1
WAV : 32/ 1; 0 1
ESP : 1/ 1; 0 1
ESMF_AWARE_THREADING is False
ROOTPE is with respect to 16.0 tasks per node
Again, 32 tasks set everywhere. I'll note as mentioned in post #2 that changing NTASKS_xyz to 1 didn't help:
$ for a in CPL ATM LND ICE OCN ROF GLC WAV ESP ; do ./xmlchange NTASKS_$a=1 ; done
But I didn't notice that the change hadn't taken effect, because I hadn't reset the case setup. I missed the warnings indicating as much!
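The node count reported by preview_run follows from the PE layout: 32 tasks at 16 tasks per node need 2 nodes, which a single machine cannot supply. A simplified sketch of that arithmetic, assuming PSTRIDE is 1 everywhere; this is an illustration, not CIME's actual layout logic, and the function names are hypothetical.

```python
import math


def total_tasks(layout):
    """Highest MPI rank needed plus one, assuming PSTRIDE is 1 for every component."""
    return max(rootpe + ntasks for ntasks, rootpe in layout.values())


def nodes_needed(layout, tasks_per_node):
    """Number of nodes required to hold all tasks."""
    return math.ceil(total_tasks(layout) / tasks_per_node)


if __name__ == "__main__":
    comps = ["CPL", "ATM", "LND", "ICE", "OCN", "ROF", "GLC", "WAV"]
    # (NTASKS, ROOTPE) per component, as shown by ./pelayout above.
    default_layout = {c: (32, 0) for c in comps}
    default_layout["ESP"] = (1, 0)
    print(total_tasks(default_layout), nodes_needed(default_layout, 16))  # 32 2

    # Same layout with every component reduced to 16 tasks.
    reduced = {c: (16, 0) for c in comps + ["ESP"]}
    print(total_tasks(reduced), nodes_needed(reduced, 16))  # 16 1
```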

After resetting the setup and running the next line, I get the env_mach_pes.xml as attached (with e.g. <value compclass="ATM">16</value>)
phansel@CRUMPET f2000_control % for a in CPL ATM LND ICE OCN ROF GLC WAV ESP ; do ./xmlchange NTASKS_$a=16 ; done

phansel@CRUMPET f2000_control % ./preview_run
CASE INFO:
nodes: 1
total tasks: 16
tasks per node: 16
thread count: 1
ngpus per node: 0

BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None

MPIRUN (job=case.run):
mpirun /Users/phansel/Public/CESM_Data/outputdata/f2000_control/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
Setting Environment LAPACK_LIBDIR=/opt/homebrew/opt/openblas/lib
Setting Environment MPI_TYPE_DEPTH=16
Setting Environment OMP_NUM_THREADS=1
Setting Environment OMP_STACKSIZE=256M

SUBMIT CMD:
None

I can now build and run it, receiving a slightly different error in PET00.ESMF_LogFile: ESMF wasn't built with the PIO library enabled.
phansel@CRUMPET run % cat PET00.ESMF_LogFile
20250113 154113.128 ERROR PET00 /Users/phansel/Public/CESM_Data/esmf/src/Infrastructure/Mesh/src/ESMCI_Mesh_FileIO.C:298 ESMCI_mesh_create_from_file() Library needed by ESMF not present - This functionality requires ESMF to be built with the PIO library enabled.
20250113 154113.131 ERROR PET00 ESMCI_MeshCap.C:2605 MeshCap::meshcreatefromfilenew() Library needed by ESMF not present - Internal subroutine call returned Error
20250113 154113.131 ERROR PET00 ESMF_Mesh.F90:1970 ESMF_MeshCreateFromFile() Library needed by ESMF not present - Internal subroutine call returned Error
20250113 154113.131 ERROR PET00 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cice/src/cicecore/drivers/nuopc/cmeps/ice_comp_nuopc.F90:734 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:2918 Library needed by ESMF not present - Phase 'IPDv01p1' Initialize for modelComp 4: ICE did not return ESMF_SUCCESS
20250113 154113.131 ERROR PET00 ESM0001:src/addon/NUOPC/src/NUOPC_Driver.F90:1340 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:2918 Library needed by ESMF not present - Phase 'IPDv02p1' Initialize for modelComp 1: ESM0001 did not return ESMF_SUCCESS
20250113 154113.131 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:1345 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 ensemble:src/addon/NUOPC/src/NUOPC_Driver.F90:486 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 ERROR PET00 /Users/phansel/Public/CESM_Data/my_cesm_sandbox/components/cmeps/cime_config/../cesm/driver/esmApp.F90:134 Library needed by ESMF not present - Passing error in return code
20250113 154113.131 INFO PET00 Finalizing ESMF with endflag==ESMF_END_ABORT
I'll rebuild ESMF with PIO and report back.

Thanks for the help!
 

phansel

Paul Hansel
New Member
Found something that seems to be a platform-specific issue with ESMF. I'll file a bug report.
esmf % make -j16 lib
-- Found NetCDF_C: /libnetcdf.so
-- Checking NetCDF version
-- Checking NetCDF version - 4.9.2./*!<
-- Checking whether NetCDF has parallel support
-- Checking whether NetCDF has parallel support - no
-- Looking for nc_set_log_level
-- Looking for nc_set_log_level - not found
-- Checking whether NetCDF has PnetCDF support
-- Checking whether NetCDF has PnetCDF support - no
-- Checking whether NetCDF has DAP support
-- Checking whether NetCDF has DAP support - yes
-- Found CURL: /Library/Developer/CommandLineTools/SDKs/MacOSX15.2.sdk/usr/lib/libcurl.tbd (found version "8.7.1")
-- Found HDF5_HL: /opt/homebrew/lib/libhdf5_hl.dylib
-- Found HDF5_C: /opt/homebrew/lib/libhdf5.dylib
Fortran Library build is OFF
-- Check size of size_t
-- Check size of size_t - failed
-- Check size of long long
-- Check size of long long - failed
CMake Error at src/clib/CMakeLists.txt:180 (message):
size_t and long long must be the same size!


-- Configuring incomplete, errors occurred!
make[7]: ./Makefile: No such file or directory
make[7]: *** No rule to make target `./Makefile'. Stop.
make[7]: ./Makefile: No such file or directory
make[7]: *** No rule to make target `./Makefile'. Stop.
cp: ../Install/include/*: No such file or directory
cp: ../Install/lib/*: No such file or directory
make[6]: *** [tree_lib] Error 1
make[5]: *** [tree] Error 1
make[4]: *** [tree] Error 1
make[3]: *** [tree] Error 1
make[2]: *** [tree] Error 1
make[1]: *** [build_libs] Error 2
make: *** [lib] Error 2

Seems to be related to this: prepare to update to pio2 by jedwards4b · Pull Request #32 · esmf-org/esmf
 

phansel

Paul Hansel
New Member
Appears I set the path for the NetCDF includes wrong. Not a real platform-specific error. This worked to build ESMF:
phansel@CRUMPET esmf % export ESMF_NETCDF=nc-config
ESMF builds and installs successfully, but testing again after re-building the CESM case reveals the same PIO complaint. I'll re-build with ESMF_PNETCDF=pnetcdf-config and give it another shot.
 

phansel

Paul Hansel
New Member
That did work. I ended up having to build ESMF with the following env vars (and after running "brew install pnetcdf" & re-sourcing .zprofile):
ESMFMKFILE=/Users/phansel/Public/CESM_Data/esmf/lib/libO/Darwin.gfortranclang.64.openmpi.default/esmf.mk
ESMF_DIR=/Users/phansel/Public/CESM_Data/esmf
ESMF_PIO=internal
ESMF_COMM=openmpi
ESMF_NETCDF_INCLUDE=/opt/homebrew/Cellar/netcdf/4.9.2_2/include/
ESMF_NETCDF_LIBPATH=/opt/homebrew/Cellar/netcdf/4.9.2_2/lib
ESMF_NETCDF_LIBS=-lnetcdf
ESMF_NETCDF=nc-config
ESMF_PNETCDF=pnetcdf-config
I also had to run "./case.setup --reset" and clean+build the case again. ATM build is quite slow at ~41 seconds for this example.
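Because a missing variable only showed up as a runtime PIO error much later, it may help to check the environment before starting the ESMF build. A minimal sketch, assuming the variable names listed above are the ones that matter on this setup; missing_vars and REQUIRED are hypothetical names, not part of ESMF.

```python
import os

# Variables taken from the list above; adjust for other setups.
REQUIRED = ("ESMF_DIR", "ESMF_COMM", "ESMF_PIO", "ESMF_NETCDF", "ESMF_PNETCDF")


def missing_vars(env, required=REQUIRED):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("unset before ESMF build:", ", ".join(missing))
    else:
        print("all ESMF build variables are set")
```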

Total runtime for 1 day of the CAM f2000 control exercise was 152 seconds real, 2600 seconds user, around 70 watts average power.
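For comparison with other machines, the timing above can be converted to throughput and energy figures. A small sketch of the arithmetic, assuming a 365-day model year and that the 70 W average held for the whole run; the function names are illustrative.

```python
SECONDS_PER_DAY = 86400
DAYS_PER_YEAR = 365  # assumption: 365-day (no-leap) model calendar


def simulated_years_per_day(sim_days, wall_seconds):
    """Simulated years completed per wall-clock day."""
    return (sim_days / DAYS_PER_YEAR) / (wall_seconds / SECONDS_PER_DAY)


def average_busy_cores(user_seconds, wall_seconds):
    """Ratio of CPU time to wall time, i.e. mean number of busy cores."""
    return user_seconds / wall_seconds


def energy_watt_hours(avg_watts, wall_seconds):
    """Energy used over the run, in watt-hours."""
    return avg_watts * wall_seconds / 3600


if __name__ == "__main__":
    print(f"SYPD:        {simulated_years_per_day(1, 152):.2f}")
    print(f"busy cores:  {average_busy_cores(2600, 152):.1f}")
    print(f"energy (Wh): {energy_watt_hours(70, 152):.2f}")
```

Under those assumptions the run works out to about 1.56 simulated years per wall-clock day and roughly 3 Wh per simulated day.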
 