Dear @jedwards,
I have some good news about the ECT situation on Beluga (and Narval):
In the Beluga setup there were some issues, mostly related to Open MPI vs. Intel MPI. On the one hand, config_machines.xml had this:
Code:
<COMPILERS>intel</COMPILERS>
<!-- MPILIBS: mpilibs supported on this machine, comma seperated list,
first is default, mpi-serial is assumed and not required in this list-->
<!-- NFT: intelmpi produces errors, so openmpi is recommended. -->
<MPILIBS>openmpi,intelmpi</MPILIBS>
which means it will default to openmpi, but on the other hand the openmpi module settings are commented out:
Code:
<!-- <modules mpilib="openmpi" DEBUG="FALSE">
<command name="unload">intelmpi/2018.3</command>
<command name="load">intel/2018.3</command>
<command name="load">openmpi/3.1.2</command>
<command name="load">hdf5-mpi/1.10.3</command>
<command name="load">netcdf-mpi/4.6.1</command>
<command name="load">netcdf-fortran-mpi/4.5.1</command>
</modules> -->
This is why the setup couldn't find netCDF. I have now changed it to:
Code:
<MPILIBS>intelmpi</MPILIBS>
so it will run with Intel MPI.
I also modified the Python file: by replacing "/" (true division) with "//" (integer division) in ensemble.py, the script is now compatible with Python 3, so I no longer need to load python/2.7.18.
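For reference, a minimal sketch of why that change matters (the variable names here are illustrative, not the actual ensemble.py code):
Code:
# In Python 2, "/" between two ints does integer division; in Python 3 it
# returns a float, which breaks anything that uses the result as an index.
ensemble_size = 4
half = ensemble_size / 2     # Python 3: 2.0 (a float)
members = list(range(ensemble_size))
# members[:half] raises "TypeError: slice indices must be integers" in Python 3
half = ensemble_size // 2    # integer division: 2 in both Python 2 and 3
print(members[:half])        # [0, 1] in both versions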
The modules that are necessary for this configuration on my Beluga system are:
Code:
module load StdEnv/2018.3 perl/5.22.4 python/3.7.4 cmake/3.16.3 intelmpi/2018.3.222 hdf5-mpi/1.10.3 netcdf-mpi/4.4.1.1 netcdf-fortran-mpi/4.4.4
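For completeness, the matching entry in config_machines.xml should then look roughly like the commented-out openmpi block above, but for intelmpi. This is only a sketch I built from the module list, not a copy from the actual file:
Code:
<modules mpilib="intelmpi">
  <command name="load">StdEnv/2018.3</command>
  <command name="load">intelmpi/2018.3.222</command>
  <command name="load">hdf5-mpi/1.10.3</command>
  <command name="load">netcdf-mpi/4.4.1.1</command>
  <command name="load">netcdf-fortran-mpi/4.4.4</command>
</modules>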
Then the generation of the 4-member ECT ensemble ran and completed successfully, which is very good news:
Code:
python ensemble.py --case /home/meisam/scratch/cases/ensemble.cesm_tag.000 --mach beluga --ensemble 4 --ect cam --project P99999999 --walltime 12:00:00
On Narval, the same Python issue occurred as on Beluga, and the same "//" change to ensemble.py fixed it there too. Then there was a problem that the walltime was not propagated to the scheduler, so jobs ran with a default walltime of 1 hour. This could be fixed by editing
/home/meisam/my_cesm_sandbox/cime/config/cesm/machines/config_batch.xml
and adding
Code:
<submit_args>
<arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
</submit_args>
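To be precise, the flag goes inside Narval's batch-system entry. The surrounding element below is only a sketch of how a Slurm entry in config_batch.xml typically looks, not a verbatim copy of the file:
Code:
<batch_system MACH="narval" type="slurm">
  <submit_args>
    <!-- pass the case's wallclock time to sbatch instead of the 1-hour default -->
    <arg flag="--time" name="$JOB_WALLCLOCK_TIME"/>
  </submit_args>
</batch_system>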
With that, the job completed successfully on the Narval system too.
+++++++++++++++++++++++++++++++++++++++++++++++++
However
+++++++++++++++++++++++++++++++++++++++++++++++++
There are some issues with the ECT instructions on your website.
For example, this link (6. Porting and validating CIME on a new platform — CIME master documentation) says that you need to go through the README instructions. This is what the README says:
"Once all ensemble simulations have run successfully, copy every cam history file (*.cam.h0.*) for CAM-ECT and UF-CAM-ECT) or monthly pop history file (*.pop.h.*) for POP-ECT from each ensemble run directory into a separate directory. Next create the ensemble summary using the pyCECT tool pyEnsSum.py (for CAM-ECT and UF-CAM-ECT) or pyEnsSumPop.py (for POP-ECT). For details see README_pyEnsSum.rst and README_pyEnsSumPop.rst with the pyCECT tools."
Creating test runs:
(1) Once an ensemble summary file has been created or chosen to use from $CESMDATAROOT/inputdata/validation, the simulation run(s) to be verified by ECT must be created via script ensemble.py. NOTE: It is important that the **same** resolution and compset be used in the individual runs as in the ensemble. The NetCDF ensemble summary file global attributes give this information.
(2) For example, for CAM-ECT: python ensemble.py --case /glade/scratch/cesm_user/cesm_tag/camcase.cesm_tag.000 --ect cam --mach cheyenne --project P99999999 --compset F2000climo --res f19_f19_mg17
(3) Next verify the new simulation(s) with the pyCECT tool pyCECT.py (see README_pyCECT.rst with the pyCECT tools)
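(As an aside, the copy step the README describes boils down to something like the following; the paths here are illustrative, not my actual run directories:)
Code:
# gather all CAM history files from the ensemble run directories
mkdir -p $SCRATCH/ect_hist
for d in $SCRATCH/cases/ensemble.cesm_tag.*/run; do
    cp $d/*.cam.h0.* $SCRATCH/ect_hist/
done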
But this link (Python Tools | Community Earth System Model) says to run "./addmetadata.sh" after completing the 3 runs, which contradicts the README instructions. Also, the 3 run commands that this link gives do not have ensemble numbers in them.
Also, these 3 commands contain a typo:
Code:
./addmetadata.sh --caseroot /glade/scratch/cesm_user/cesm_tag/case.esm_tag.uf.000 --histfile /glade/scratch/cesm_user/cesm_tag/case.cesm_tag.uf.000/run/case.cesm_tag.uf.000.cam.h0.0001-01-01-00000.nc
I am sure that "esm_tag" in the --caseroot path should be "cesm_tag".
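For reference, the corrected command would read:
Code:
./addmetadata.sh --caseroot /glade/scratch/cesm_user/cesm_tag/case.cesm_tag.uf.000 --histfile /glade/scratch/cesm_user/cesm_tag/case.cesm_tag.uf.000/run/case.cesm_tag.uf.000.cam.h0.0001-01-01-00000.nc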
++++++++++++++++++++++++++++++++
Then
++++++++++++++++++++++++++++++++
When I uploaded the three ECT runs to the verification page, it said "Verification complete-success: These runs PASSED according to our testing criterion". However, there is also some vague information about runs that failed; I would appreciate it if you could check the attached JPG and let me know whether it is a critical failure or not.
If I am right, the ECT process and its verification are done and my porting and configuration work well (?)
Thanks for your time,
P.S. I should mention that Bart Oldeman from The Alliance support team helped a lot with these issues.