Thanks Bill. I need your optimism!
With REST_OPTION=never, the job ran and closed normally; the only netCDF file written to the archive directories was
glc/hist/F2000_T5.cism.initial_hist.0001-01-01-00000.nc
The f10_f10.. job seemed to run successfully, with the archived files being:
./glc/hist/F10_1.cism.initial_hist.0001-01-01-00000.nc
./rest/0001-01-06-00000/F10_1.cam.rs.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cism.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.mosart.rh0.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.clm2.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cice.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cam.rh0.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cpl.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.mosart.r.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.clm2.rh0.0001-01-06-00000.nc
./rest/0001-01-06-00000/F10_1.cam.r.0001-01-06-00000.nc
I doubled the memory requested for the 20-PE f10 case; created, set up, and built a new case; and on submission the hang recurred.
(The .case.run script had the requested 16000M per PE in its directives, doubled from before.)
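For concreteness, the relevant lines in .case.run looked roughly like this (not copied verbatim, so the resource and parallel-environment names below are only illustrative of Eddie's SGE-style directives):
#$ -pe mpi 20          # 20 slots, one per PE (illustrative PE name)
#$ -l h_vmem=16000M    # per-slot memory request, doubled from the earlier 8000M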
I am unsure how to read the qstat usage data, but it suggests a loop?
NetCDF files were written to the run directory and closed at 16:40.
There was no further output to the log files after that.
After 10 more minutes I ran qstat, then again after another 10 minutes; the difference was:
diff eddie_qstat_model_onhang.txt eddie_qstat_model_atkill.txt
45c45
< usage 1: wallclock=00:39:15, cpu=12:58:37, mem=54032.98685 GBs, io=50.87935 GB, iow=12.610 s, ioops=49948, vmem=23.913G, maxvmem=24.003G
---
> usage 1: wallclock=00:49:17, cpu=16:18:47, mem=68219.85471 GBs, io=50.87935 GB, iow=12.610 s, ioops=49948, vmem=23.913G, maxvmem=24.003G
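My tentative reading: over the ~10 extra minutes of wallclock the cpu field grew by about 3h20m, i.e. roughly 20 ranks x 10 minutes, while io, ioops, vmem and maxvmem did not change at all, as if all 20 processes were spinning without doing any further I/O. If it helps to catch this earlier next time, a crude polling sketch (with a placeholder job id) would be:
while true; do
    date
    qstat -j 1234567 | grep '^usage'   # print only the accumulating usage line
    sleep 600                          # check every 10 minutes
done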
I deleted the archive job, then the model job.
In the run directory, there were fewer netCDF files than in the f10 archive:
-rw-r--r-- 1 mjm eddie_users 190M Jul 23 16:14 finidat_interp_dest.nc
-rw-r--r-- 1 mjm eddie_users 12M Jul 23 16:15 F2000_T6.cism.initial_hist.0001-01-01-00000.nc
-rw-r--r-- 1 mjm eddie_users 18M Jul 23 16:39 F2000_T6.mosart.rh0.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 28M Jul 23 16:39 F2000_T6.mosart.r.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 5.7M Jul 23 16:40 F2000_T6.cice.r.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 63M Jul 23 16:40 F2000_T6.clm2.rh0.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 190M Jul 23 16:40 F2000_T6.clm2.r.0001-01-06-00000.nc
-rw-r--r-- 1 mjm eddie_users 256M Jul 23 16:40 F2000_T6.cam.r.0001-01-06-00000.nc
I attach the file RUN_LISTING.txt, the log files from the run directory, and the XML files from CASEROOT.
Is there a way to see, from the log files, the memory actually available to the MPI processes?
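For instance, would something like
grep -i memory run/cpl.log.* run/cesm.log.*
show anything relevant, or does that only pick up the coupler's own high-water accounting, if that is even enabled? (Just a guess at where to look, assuming the usual cpl.log.* / cesm.log.* names.)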
(The parallel netCDF and HDF5 libraries did all pass the standard installation tests, though not with all 20 PEs.)
I'll try 32 PEs now and add to this report if it runs before you have more time.
Wanting to be sure of the basics of the installation, I took a harder look at the regression test script results and ran a few cases independently of the regression script. I think the only one to fail was SEQ_Ln9.f19_g16_rx1.A. I just reran it with a new test id and directories:
./create_test --wait SEQ_Ln9.f19_g16_rx1.A.eddie_intel -t S23 --output-root /exports/eddie/scratch/mjm/SEQ23 --test-root /exports/eddie/scratch/mjm/T23
Finished XML for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel in 0.222210 seconds (PASS)
Starting SETUP for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel with 1 procs
Finished SETUP for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel in 1.412144 seconds (PASS)
Starting SHAREDLIB_BUILD for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel with 1 procs
Finished SHAREDLIB_BUILD for test SEQ_Ln9.f19_g16_rx1.A.eddie_intel in 1.894555 seconds (FAIL). [COMPLETED 1 of 1]
Case dir: /exports/eddie/scratch/mjm/T23/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23
Errors were:
b"Building test for SEQ in directory /exports/eddie/scratch/mjm/T23/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23\n/exports/eddie/scratch/mjm/T23/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23/case2/SEQ_Ln9.f19_g16_rx1.A.eddie_intel.S23/env_mach_specific.xml already exists, delete to replace\nWARNING: Test case setup failed. Case2 has been removed, but the main case may be in an inconsistent state. If you want to rerun this test, you should create a new test rather than trying to rerun this one.\nERROR: Wrong type for entry id 'NTASKS'"
Once more, thanks
...