Thanks Bill. I need your optimism!
With REST_OPTION=never, the job ran and closed normally; the only netCDF file written to the archive directories was
glc/hist/F2000_T5.cism.initial_hist.0001-01-01-00000.nc
The f10_f10.. job seemed to run successfully, with the archived files being:
./glc/hist/F10_1.cism.initial_hist.0001-01-01-00000.nc
./rest/0001-01-06-00000/F10_1.cam.rs.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cism.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.mosart.rh0.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.clm2.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cice.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cam.rh0.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cpl.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.mosart.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.clm2.rh0.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cam.r.0001-01-06-00000.nc
I doubled the memory requested for the 20-PE f10 case; created, set up, and built a new case; and on submission the hang recurred.
(The .case.run script had the requested 16000M per PE in its directives, doubled from before.)
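For concreteness, the relevant lines in .case.run looked roughly like this (not copied verbatim, so the resource and parallel-environment names below are only illustrative of Eddie's SGE-style directives):
#$ -pe mpi 20          # 20 slots, one per PE (illustrative PE name)
#$ -l h_vmem=16000M    # per-slot memory request, doubled from the earlier 8000M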
I am unsure how to read the qstat usage data, but it suggests a loop?
NetCDF files were written to the run directory and closed at 16:40.
There was no further output to the log files after that.
After 10 more minutes I ran qstat, then again after another 10 minutes; the difference was:
diff eddie_qstat_model_onhang.txt eddie_qstat_model_atkill.txt
45c45
< usage 1: wallclock=00:39:15, cpu=12:58:37, mem=54032.98685 GBs, io=50.87935 GB, iow=12.610 s, ioops=49948, vmem=23.913G, maxvmem=24.003G
---
> usage 1: wallclock=00:49:17, cpu=16:18:47, mem=68219.85471 GBs, io=50.87935 GB, iow=12.610 s, ioops=49948, vmem=23.913G, maxvmem=24.003G
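My tentative reading: over the ~10 extra minutes of wallclock the cpu field grew by about 3h20m, i.e. roughly 20 ranks x 10 minutes, while io, ioops, vmem and maxvmem did not change at all, as if all 20 processes were spinning without doing any further I/O. If it helps to catch this earlier next time, a crude polling sketch (with a placeholder job id) would be:
while true; do
    date
    qstat -j 1234567 | grep '^usage'   # print only the accumulating usage line
    sleep 600                          # check every 10 minutes
done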
I deleted the archive job, then the model job.
In the run directory, there were fewer netCDF files than in the f10 archive:
-rw-r--r-- 1 mjm eddie_users 190M Jul 23 16:14 finidat_interp_dest.nc
-rw-r--r-- 1 mjm eddie_users 12M Jul 23 16:15 F2000_T6.cism.initial_hist.0001-01-01-00000.nc
-rw-r--r-- 1 mjm eddie_users 18M Jul 23 16:39 F2000_T6.mosart.rh0.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 28M Jul 23 16:39 F2000_T6.mosart.r.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 5.7M Jul 23 16:40 F2000_T6.cice.r.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 63M Jul 23 16:40 F2000_T6.clm2.rh0.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 190M Jul 23 16:40 F2000_T6.clm2.r.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 256M Jul 23 16:40 F2000_T6.cam.r.0001-01-06-00000.nc
I attach the file RUN_LISTING.txt, the log files from the run directory, and the XML files from CASEROOT.
Is there a way to see, from the log files, the memory actually available to the MPI processes?
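For instance, would something like
grep -i memory run/cpl.log.* run/cesm.log.*
show anything relevant, or does that only pick up the coupler's own high-water accounting, if that is even enabled? (Just a guess at where to look, assuming the usual cpl.log.* / cesm.log.* names.)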
(The parallel netCDF and HDF5 libraries did all pass the standard installation tests, though not with all 20 PEs.)
I'll try 32 PEs now and add to this report if it runs before you have more time.
Wanting to be sure of the basics of the installation, I took a harder look at the regression test script results and ran a few cases independently of the regression script. I think the only one to fail was SEQ_Ln9.f19_g16_rx1.A. I just reran it with a new test id and directories:
./create_test --wait SEQ_Ln9.f19_g16_rx1.A.eddie_intel -t S23 --output-root /exports/eddie/scratch/mjm/SEQ23 --test-root /exports/eddie/scratch/mjm/T23
Finished XML for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel in 0.222210 seconds (PASS)
Starting SETUP for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel with 1 procs
Finished SETUP for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel in 1.412144 seconds (PASS)
Starting SHAREDLIB_BUILD for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel with 1 procs
Finished SHAREDLIB_BUILD for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel in 1.894555 seconds (FAIL). [COMPLETED 1 of 1]
Case dir: /exports/eddie/scratch/mjm/T23/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23
Errors were:
b"Building test for SEQ in directory /exports/eddie/scratch/mjm/T23/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23\n/exports/eddie/scratch/mjm/T23/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23/case2/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23/env_mach_specific.xml already exists, delete to replace\nWARNING: Test case setup failed. Case2 has been removed, but the main case may be in an inconsistent state. If you want to rerun this test, you should create a new test rather than trying to rerun this one.\nERROR: Wrong type for entry id 'NTASKS'"
Once more, thanks
...