
Trouble running on Cori

hannah6

Walter Michael Hannah
New Member
When I try to run a CESM case on Cori (NERSC), I get this runtime error:

Code:
srun: error: Invalid numeric value "2.0" for --cpus-per-task.

I'm trying to run SPCAM, but I doubt that's to blame here.
I can't figure out where this value of "2.0" is being set. Any suggestions are appreciated.
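
For reference, here's roughly how I've been hunting for it so far (a rough sketch; I'm not certain these are the right files to look in):

Code:
# From the case directory: search the resolved machine/batch files for the flag
grep -rn "cpus" env_mach_specific.xml env_batch.xml
# And search the cime source in case the value is computed there
grep -rn "cpus_per_task" /path/to/CESM/cime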
 

jedwards

CSEG and Liaisons
Staff member
This is a Python 2 vs Python 3 issue. It should be solved in the latest available code; for CESM2.1, use the cesm2.1-alphabranch branch. To do this, go to the top level of your source tree and run:

Code:
git checkout cesm2.1-alphabranch
./manage_externals/checkout_externals
 

hannah6

Walter Michael Hannah
New Member
Thanks Jim. I tried to check out that branch, but I'm getting an error from checkout_externals:

Code:
ERROR:root:Failed with output:
svn: E195012: Path '.' does not share common version control ancestry with the requested switch location. Use --ignore-ancestry to disable this check.
svn: E195012: 'https://svn-ccsm-models.cgd.ucar.edu/tools/proc_atm/chem_proc/release_tags/chem_proc5_0_03_rel' shares no common ancestry with '/global/u1/w/whannah/CESM/CESM_SRC1/components/cam/chem_proc'

Do I need to reset the submodules or something like that?
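
One workaround I might try (just a guess on my part, based on the paths in the error): remove the stale svn external and let checkout_externals fetch it fresh.

Code:
# From the top of the source tree; the path comes from the error message above
rm -rf components/cam/chem_proc
./manage_externals/checkout_externals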
 

hannah6

Walter Michael Hannah
New Member
Looks like that branch is broken/outdated as well:

Code:
ERROR: module command /opt/modules/default/bin/modulecmd python load cray-mpich/7.7.8 cray-netcdf-hdf5parallel/4.6.3.0 cray-hdf5-parallel/1.10.5.0 cray-parallel-netcdf/1.11.1.0 cmake/3.14.4 failed with message:
cray-mpich(3):ERROR:105: Unable to locate a modulefile for 'cray-mpich/7.7.8'
 

jedwards

CSEG and Liaisons
Staff member
I've updated the modules for Cori in this cime branch: jedwards/maint-5.6/nersc_module_updates. To get it, go to the cime subdirectory and do:

Code:
git pull origin
git checkout jedwards/maint-5.6/nersc_module_updates

I did not do extensive testing with these changes, but I did make sure that the model builds.
 

hannah6

Walter Michael Hannah
New Member
Jim, your changes fixed the modules, but they didn't resolve the Python issue. I'll dig into how we fixed this in E3SM and see if I can find the fix.
 

jedwards

CSEG and Liaisons
Staff member
./preview_run should show you the srun command for your submitted job. When I look at mine, I see an integer 2:

Code:
MPIRUN (job=case.test):
srun --label -n 128 -c 2 /global/cscratch1/sd/jedwards/SMS.f19_g16.X.cori-haswell_intel.20210111_141136_aa4agt/bld/cesm.exe >> cesm.log.$LID 2>&1
 

jedwards

CSEG and Liaisons
Staff member
I also checked cori-knl; it works correctly there too:

Code:
MPIRUN (job=case.test):
srun --label -n 64 -c 4 --cpu_bind=cores /global/cscratch1/sd/jedwards/SMS.f19_g16.X.cori-knl_intel.20210111_142513_zqr5hu/bld/cesm.exe >> cesm.log.$LID 2>&1
 

hannah6

Walter Michael Hannah
New Member
Mine shows 2.0 (see below). Could it be specific to SPCAM?

Code:
CASE INFO:
  nodes: 32
  total tasks: 1024
  tasks per node: 32
  thread count: 1

BATCH INFO:
  FOR JOB: case.run
    ENV:
      module command is /opt/modules/default/bin/modulecmd python rm PrgEnv-intel PrgEnv-cray PrgEnv-gnu intel cce cray-parallel-netcdf cray-parallel-hdf5 pmi cray-libsci cray-mpich2 cray-mpich cray-netcdf cray-hdf5 cray-netcdf-hdf5parallel craype-sandybridge craype-ivybridge craype
      module command is /opt/modules/default/bin/modulecmd python load PrgEnv-intel
      module command is /opt/modules/default/bin/modulecmd python switch intel intel/19.1.3.304
      module command is /opt/modules/default/bin/modulecmd python use /global/project/projectdirs/ccsm1/modulefiles/cori
      module command is /opt/modules/default/bin/modulecmd python load esmf/7.1.0r-defio-intel18.0.1.163-mpi-O-cori-haswell cray-memkind
      module command is /opt/modules/default/bin/modulecmd python swap craype craype/2.7.2
      module command is /opt/modules/default/bin/modulecmd python switch cray-libsci/20.09.1
      module command is /opt/modules/default/bin/modulecmd python load cray-mpich/7.7.16 cray-netcdf-hdf5parallel/4.7.4.0 cray-hdf5-parallel/1.12.0.0 cray-parallel-netcdf/1.12.1.0 cmake/3.18.2
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment OMP_PROC_BIND=spread
      Setting Environment OMP_PLACES=threads
      Setting Environment OMP_NUM_THREADS=1

    SUBMIT CMD:
      sbatch --time 0:30:00 -q debug --account m3312 --mail-user hannah6@llnl.gov --mail-type end --mail-type fail .case.run --resubmit

    MPIRUN (job=case.run):
      srun  --label  -n 1024  -c 2.0 /global/cscratch1/sd/whannah/cesm_scratch/CESM.f09_f09_mg17.FSPCAMS/bld/cesm.exe  >> cesm.log.$LID 2>&1

  FOR JOB: case.st_archive
    ENV:
      module command is /opt/modules/default/bin/modulecmd python rm PrgEnv-intel PrgEnv-cray PrgEnv-gnu intel cce cray-parallel-netcdf cray-parallel-hdf5 pmi cray-libsci cray-mpich2 cray-mpich cray-netcdf cray-hdf5 cray-netcdf-hdf5parallel craype-sandybridge craype-ivybridge craype
      module command is /opt/modules/default/bin/modulecmd python load PrgEnv-intel
      module command is /opt/modules/default/bin/modulecmd python switch intel intel/19.1.3.304
      module command is /opt/modules/default/bin/modulecmd python use /global/project/projectdirs/ccsm1/modulefiles/cori
      module command is /opt/modules/default/bin/modulecmd python load esmf/7.1.0r-defio-intel18.0.1.163-mpi-O-cori-haswell cray-memkind
      module command is /opt/modules/default/bin/modulecmd python swap craype craype/2.7.2
      module command is /opt/modules/default/bin/modulecmd python switch cray-libsci/20.09.1
      module command is /opt/modules/default/bin/modulecmd python load cray-mpich/7.7.16 cray-netcdf-hdf5parallel/4.7.4.0 cray-hdf5-parallel/1.12.0.0 cray-parallel-netcdf/1.12.1.0 cmake/3.18.2
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment OMP_PROC_BIND=spread
      Setting Environment OMP_PLACES=threads
      Setting Environment OMP_NUM_THREADS=1

    SUBMIT CMD:
      sbatch --time 0:30:00 -q debug --account m3312  --dependency=afterok:0 --mail-user hannah6@llnl.gov --mail-type end --mail-type fail case.st_archive --resubmit
 

jedwards

CSEG and Liaisons
Staff member
If you're willing to open up permissions on your source code on Cori, I'll try to spot the difference there. Are you sure you have the cime branch that I provided?
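Something like the following might grant read access without opening up your whole home directory (untested on my part, and it assumes the filesystem supports POSIX ACLs; adjust the path to your source tree):

Code:
# Recursively grant one user read access, plus traverse (X) on directories
setfacl -R -m u:jedwards:rX /global/u1/w/whannah/CESM/CESM_SRC1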
 

hannah6

Walter Michael Hannah
New Member
I don't think I can open up my home directory on Cori; I've tried that before without success.
If I go into the source directory, I'm on the right branch:

Code:
> git branch
* cesm2.1-alphabranch

Similarly, if I jump into the cime directory, I'm on your branch there:

Code:
> cd cime
> git branch
* jedwards/maint-5.6/nersc_module_updates
  master

So I'm pretty confused...
I could try a more typical configuration. What's a typical F compset that people use?
 

hannah6

Walter Michael Hannah
New Member
I still see "-c 2.0" for that case, so you're right that it doesn't have anything to do with the compset.
 

jedwards

CSEG and Liaisons
Staff member
So this problem is related to using Python 3. I made a bad assumption: I thought I was testing with Python 3 on Cori, but I was actually using Python 2. If you want to fix this quickly, unload the python module. I'll push a fix to the cime branch shortly.
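
For anyone following along, the difference comes down to the division operator; a quick illustration (assuming both python2 and python3 are on your PATH):

Code:
python2 -c 'print(64 / 32)'   # Python 2: int / int stays an int -> 2
python3 -c 'print(64 / 32)'   # Python 3: "/" is true division -> 2.0
python3 -c 'print(64 // 32)'  # "//" is floor division, an int in both -> 2

srun rejects "-c 2.0" because --cpus-per-task must be an integer.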
 

jedwards

CSEG and Liaisons
Staff member
This should be fixed on the cime branch now. To get the update, go to the cime directory and run:

Code:
git pull origin jedwards/maint-5.6/nersc_module_updates
 

hannah6

Walter Michael Hannah
New Member
OK, that seems to have worked for an X compset. I didn't get any log files, though. Is that supposed to happen?
 