
Problem in running CAM5.3

Dear CESM Users,

I am trying to run cesm1.2.0 (CAM5.3) as standalone CAM on a Linux machine with the PGI compiler and parallel-enabled NetCDF 4.2 libraries. I used the following command lines to set up and submit the run.

Command to configure:
/home/opt/app/cesm1_2_0/models/atm/cam/bld/configure -fc_type pgi -fc mpif90 -cc mpicc -dyn fv -hgrid 1.9x2.5 -ntasks 24 -nosmp -test

Command to build the model:
gmake

Command to build the namelist:
/home/opt/app/cesm1_2_0/models/atm/cam/bld/build-namelist -test -config /home/dilip/New_PerformanceTests/Test1_24/bld/config_cache.xml

Command lines to submit the run:
#!/bin/sh
#$ -pe mpi 24
#$ -cwd
#$ -j y
#$ -S /bin/bash
/opt/pgi/linux86-64/2013/mpi2/mpich/bin/mpirun -np 24 /home/dilip/New_PerformanceTests/Test1_24/bld/cam

Details of the machine I am using to run the model: 1 master node and 9 compute nodes, with 12 processors and 24 GB RAM per node. The master node is a Fujitsu Primergy RX 300S7 (Intel Xeon E5-2620 @ 2 GHz, 24 GB RAM, 8 TB HDD) and the compute nodes (0-8) are Fujitsu Primergy RX 200S7 (Intel Xeon E5-2620 @ 2 GHz, 24 GB RAM, 500 GB HDD). It is a Rocks Cluster 6 system with the Sun Grid Engine job scheduler (SGE 6), the PGI compiler, and CentOS 6.2 Linux as the operating system.
I got the run terminated with the following error message:

/home/opt/inputdata/atm/cam/solar/solar_ave_sc19-sc23.c090810.nc
solar_data_readnl: solar_data_type = SERIAL
solar_data_readnl: solar_data_ymd  =             0
solar_data_readnl: solar_data_tod  =             0
PGFIO/stdio: Input/output error
PGFIO-F-/OPEN/unit=99/error code returned by host stdio - 5. File name = atm_in
In source file /home/opt/app/cesm1_2_0/models/atm/cam/src/chemistry/utils/solar_data.F90, at line number 94
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1478)......................: MPI_Bcast(buf=0x24fe2e0, count=1, MPI_LOGICAL, root=0, comm=0xc4000002) failed
MPIR_Bcast_impl(1321).................:
MPIR_Bcast_intra(1119)................:
MPIR_Bcast_scatter_ring_allgather(962):
MPIR_Bcast_binomial(213)..............: Failure during collective
MPIR_Bcast_scatter_ring_allgather(955):
MPIR_Bcast_binomial(189)..............:
MPIC_Send(63).........................:
MPIDI_EagerContigShortSend(262).......: failure occurred while attempting to send an eager message
MPIDI_CH3_iStartMsg(36)...............: Communication error with rank 12

I also tried to create a similar kind of case with 18 processors, but that run terminated with a similar error. I am attaching the log files for both cases here; the number in each file name indicates the number of processors. Please help me out in running a successful CAM run. Thanking you in anticipation.
 

santos

Member
It looks to me like these are I/O errors from the system, not CESM. For some reason, it suddenly can't read the CAM namelist halfway through initialization. You should ask whoever manages your machine about this.
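One quick check you could make before going to your system administrator (a minimal sketch; the run directory and the compute-0-0 ... compute-0-8 node names are assumptions for a Rocks cluster like yours and should be adjusted) is to confirm that the run directory and atm_in are actually readable from every compute node:

# Hypothetical check that the run directory is mounted and atm_in is
# readable on every compute node; adjust RUNDIR and the node names
# for your own cluster.
RUNDIR=/home/dilip/New_PerformanceTests/Test1_24/bld
for node in compute-0-0 compute-0-1 compute-0-2 compute-0-3 compute-0-4 \
            compute-0-5 compute-0-6 compute-0-7 compute-0-8; do
    ssh $node "test -r $RUNDIR/atm_in && echo $node: OK || echo $node: CANNOT READ atm_in"
done

If any node reports that it cannot read the file, the problem is in the filesystem export or mount rather than in CAM; if they all report OK, the failure may be an intermittent I/O problem under load, which is still something for your system administrator to look at.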
 

eaton

CSEG and Liaisons
I agree with Sean's comment.  In addition, the log output from the 18-task run shows that it executed past the I/O failure that stopped the 24-task run.  This inconsistent behavior when running with different numbers of tasks suggests a problem in the MPI installation.
 

Dear Eaton & Sean,Thanks for your confermative replies for my first guess about the issue with those run failiures,We are using the mpi installed with PGI(13.7) installation, So finally we are planning to re-install the compiler (PGI), So that it pass the requirment (Flag etc.) if we missed something in earlier installation. Beacuse, A few days ago we tried to make CAM5.3 test run for 1.9x2.5 resoltuion for one day with 24(npr_yz = 12,2,2,12) processors, it took 2.716 Hrs for one day run and with npr_yz = 24,1,1,24 it took 2.638 hrs, Which unexpetedly large.So, finally i would like to take your important suggestions on Re-installation of PGI and mpi if we required to specify anything specific condition (Flags etc) during the installation.Or anything we can try apart from Re-installation.Thanking you in anticipation.Regards,Ram
 
Dear Eaton & Sean,Thanks for your confermative replies for my first guess about the issue with those run failiures,We are using the mpi installed with PGI(13.7) installation, So finally we are planning to re-install the compiler (PGI), So that it pass the requirment (Flag etc.) if we missed something in earlier installation. Beacuse, A few days ago we tried to make CAM5.3 test run for 1.9x2.5 resoltuion for one day with 24(npr_yz = 12,2,2,12) processors, it took 2.716 Hrs for one day run and with npr_yz = 24,1,1,24 it took 2.638 hrs, Which unexpetedly large.So, finally i would like to take your important suggestions on Re-installation of PGI and mpi if we required to specify anything specific condition (Flags etc) during the installation.Or anything we can try apart from Re-installation.Thanking you in anticipation.Regards,Ram
 
Dear Eaton & Sean,Thanks for your confermative replies for my first guess about the issue with those run failiures,We are using the mpi installed with PGI(13.7) installation, So finally we are planning to re-install the compiler (PGI), So that it pass the requirment (Flag etc.) if we missed something in earlier installation. Beacuse, A few days ago we tried to make CAM5.3 test run for 1.9x2.5 resoltuion for one day with 24(npr_yz = 12,2,2,12) processors, it took 2.716 Hrs for one day run and with npr_yz = 24,1,1,24 it took 2.638 hrs, Which unexpetedly large.So, finally i would like to take your important suggestions on Re-installation of PGI and mpi if we required to specify anything specific condition (Flags etc) during the installation.Or anything we can try apart from Re-installation.Thanking you in anticipation.Regards,Ram
 

eaton

CSEG and Liaisons
I don't have any expertise in installing compilers or MPI.  But just to give you some idea of the performance you should expect, I ran a 1 day test of CAM5 at 1.9x2.5 and found that a serial run took 0.63 hrs, and a run using 16 tasks (16,1,1,16) took 0.07 hrs.  That is from a 16-core node with the following processors: Intel(R) Xeon(R) CPU X5672 @ 3.20GHz.

A good way to test your MPI installation is to start with a serial run.  Next, build with MPI and do a 1 task run, which should take about the same time as the serial run and should produce identical results.  Then do a 2 task run, then 4 tasks, etc.  Each run should produce bit-for-bit identical results to the serial run (make sure the optimization level is the same in all builds).  You should see good scaling at low task counts.  For example, a 2 task run on our local linux cluster for the same configuration as above took 0.35 hrs, which is a parallel efficiency of about 90% relative to the serial run.
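To make that procedure concrete, a rough sketch of the run-and-compare loop follows (assumptions: each case has already been configured and built in its own directory following the steps in the first post, the cprnc comparison tool that ships with CESM has been built and is on your PATH, and the case directory names are placeholders):

# Serial baseline, then MPI runs with increasing task counts; each MPI run's
# history file is compared field-by-field against the serial one with cprnc,
# which should report no differences if the results are bit-for-bit identical.
( cd case_serial && ./cam )
for n in 1 2 4 8; do
    ( cd case_${n}tasks && mpirun -np ${n} ./cam )
    cprnc case_serial/*.cam.h0.*.nc case_${n}tasks/*.cam.h0.*.nc
done

Timing each of these runs will also show you where the scaling starts to fall off.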
 

 