
Timeseries post-processing on derecho

bmduran

Brandon Duran
New Member
Hi @aswann2, I was having similar issues with postprocessing: my existing script, which worked previously, no longer ran successfully and hung on the singularity call. I believe it may be related to changes to the default NCAR environment on Derecho/Casper.

In my timeseries script, adding the following 3 lines before the module use /glade/work/bdobbins/Software/Modules call fixed this!

module purge
module load ncarenv/23.09
module reset
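
For context, here is a minimal sketch of how those lines might sit in a postprocessing script. The helper names run and setup_postprocess_env, and the DRY_RUN switch, are hypothetical and only there so the command order can be checked without touching the real environment; the module commands and path are the ones quoted in this post.

```shell
# Hypothetical helper: with DRY_RUN=1, print each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Pin the NCAR environment, then add the postprocessing module path.
# The default version (23.09) is the one reported to work in this post;
# pass a different version as the first argument to try another.
setup_postprocess_env() {
    ncarenv_version="${1:-23.09}"
    run module purge
    run module load "ncarenv/${ncarenv_version}"
    run module reset
    run module use /glade/work/bdobbins/Software/Modules
}

# Usage (assumes the Lmod `module` function is defined in the shell):
#   DRY_RUN=1 setup_postprocess_env    # preview the commands
#   setup_postprocess_env              # actually run them
```

Running it once with DRY_RUN=1 is a cheap way to confirm the order before submitting a job.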


Hope this helps!
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Oof, sorry, I didn't see these messages until now - if you're still having problems, please reply and let me know. Someone else reached out, and we updated the 'cesm_postprocessing_derecho' module to work with the new ncarenv/24.12 environment. It was an issue where the module sought to load a specific version of 'apptainer' (singularity), and that version changed with the environment update.
 

aswann2

Abigail Swann
Member
Thanks! I have it working again on derecho and at the moment I am successfully producing time series files.
 

michelle_dvorak

Michelle Dvorak
Member
I am running into the following error after invoking "create_postprocess -caseroot=`pwd`" for a new simulation:

source: read /.singularity.d/env/01-base.sh: stale NFS file handle
source: read /.singularity.d/env/10-docker2singularity.sh: stale NFS file handle
source: read /.singularity.d/env/90-environment.sh: stale NFS file handle
source: read /.singularity.d/env/94-appsbase.sh: stale NFS file handle
source: read /.singularity.d/env/95-apps.sh: stale NFS file handle
source: read /.singularity.d/env/99-base.sh: stale NFS file handle
FATAL: while reading /.singularity.d/runscript: read /.singularity.d/runscript: stale NFS file handle
 

bbucho

Ben Buchovecky
New Member
I'm also running into the same stale NFS file handle error. Were you able to find a workaround, Michelle?
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
My apologies -- I meant to reply here ages ago. I haven't dug deep into this, but did find that it seems related to symlinks or specific file systems.

So, for example, I get the NFS stale file handle error when I try to run it in a scratch directory accessed by a symbolic link from my home directory, but if I cd to /glade/derecho/scratch/<user>, it works.

Please give that a shot; if you aren't using symlinks, give me more info and I'll definitely try to take a deeper look soon.
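
One way to try the workaround above without retyping the full scratch path is to resolve the symlink first. This is only a sketch: physical_dir is a hypothetical helper, CASEROOT is a placeholder variable, and whether the resolved path avoids the stale NFS file handle error is just the observation described in this post, not a confirmed root cause.

```shell
# Hypothetical helper: print the physical (symlink-free) path of a directory.
# Runs in a subshell so the caller's working directory is unchanged.
physical_dir() {
    ( cd -P -- "$1" 2>/dev/null && pwd -P )
}

# Usage (CASEROOT is a placeholder for your own case directory):
#   cd "$(physical_dir "$CASEROOT")" || exit 1
#   create_postprocess -caseroot="$(pwd -P)"
```

If physical_dir prints something different from the path you normally cd into, you are going through a symlink.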
 

bbucho

Ben Buchovecky
New Member
Ok, I changed my ncarenv module from ncarenv/24.12 to ncarenv/23.09 following Brandon Duran's advice and the script seemed to work. But to answer your question, I'm using cd and not accessing my scratch directory by a symbolic link. I've attached the test script and its output that reproduces this error when I run it. Thanks!
 

Attachments

  • postp_derecho_test.txt (976 bytes)
  • postp_derecho_test.csh.txt (829 bytes)

dbailey

CSEG and Liaisons
Staff member
Hi all. Apparently the "old" version of the timeseries is still working on derecho, but might be fragile. You might consider using the cupid-timeseries functionality in CUPiD. Documentation is here. Currently it is set by default to only do a subset of variables and only the monthly files, but these can be configured in config.yml.

 

andreas.chrysanthou

Andreas Chrysanthou
New Member
Has anybody had issues with the timeseries on Derecho that cause the following error: integer division or modulo by zero?
I have a CESM2.1.5-WACCM6 AMIP run with 0.9x1.25 horizontal resolution and 110 levels. I was able to set it up with option 1 as detailed by @dbailey in this discussion, but no timeseries files are being generated. I've set TIMESERIES_GENERATE_ALL to FALSE in env_postprocess.xml and edited my env_timeseries.xml to only include CAM output (h0, h1 and h2 history files) with the requested variables. I also set tseries_filecat_n to 1, as I want to try just one year first before CMORising the output.

All related code (including my env_timeseries.xml) can be found here: /glade/work/andreasc/runs/f.e21.PDINT2010.f09_f09_mg17.110L.test/postprocess/, along with the logs, which I paste below. I should also mention that for the purposes of this test, I'm only using 2 cores and requesting just 2 hours.

integer division or modulo by zero
3/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-03.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
9/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-04.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
1/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-11.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
4/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-02.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
5/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-01.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
6/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-08.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
7/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-05.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
8/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-09.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
10/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-07.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
integer division or modulo by zero
2/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-12.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
11/32 opening /glade/derecho/scratch/andreasc/archive/f.e21.PDINT2010.f09_f09_mg17.110L.test/atm/hist/f.e21.PDINT2010.f09_f09_mg17.110L.test.cam.h0.2010-06.nc
time_period_freq = month_1
integer division or modulo by zero
integer division or modulo by zero
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Hi Andreas,

Let me give you one quick idea, and if this doesn't work, I'm happy to take a deeper look tomorrow.

The simple idea? Change your job script to use 36, not 32 MPI processes, via changing this line:
#PBS -l select=2:ncpus=128:mpiprocs=16

To:
#PBS -l select=2:ncpus=128:mpiprocs=18

... That's from a hazy memory that there are unfortunately some 'magic numbers' of processors you need for things to work correctly. If that doesn't work, I'll copy your data and give it a shot myself.
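
For anyone checking their own job script, the total number of MPI processes is the number of chunks (select) times mpiprocs per chunk, which matches the "N/32" rank counter in the log above. A small sketch of that arithmetic follows; pbs_total_ranks is a hypothetical helper, and it assumes a single-chunk-type select statement (no "+" joined resource lists).

```shell
# Hypothetical helper: compute select * mpiprocs from a PBS resource line.
pbs_total_ranks() {
    chunks=$(printf '%s\n' "$1" | sed -n 's/.*select=\([0-9][0-9]*\).*/\1/p')
    per_chunk=$(printf '%s\n' "$1" | sed -n 's/.*mpiprocs=\([0-9][0-9]*\).*/\1/p')
    if [ -z "$chunks" ] || [ -z "$per_chunk" ]; then
        echo "could not find select= and mpiprocs= in: $1" >&2
        return 1
    fi
    echo $(( chunks * per_chunk ))
}

pbs_total_ranks '#PBS -l select=2:ncpus=128:mpiprocs=16'   # prints 32
pbs_total_ranks '#PBS -l select=2:ncpus=128:mpiprocs=18'   # prints 36
```

This only confirms the rank count; it says nothing about which counts the timeseries tool accepts.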

Cheers,
- Brian
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Update: I did try it myself, and it seems to be working. I only did a few files, so let me know if you run into trouble down the line!
 

andreas.chrysanthou

Andreas Chrysanthou
New Member
Excellent, that worked for me as well. Many thanks for this @dobbins! This is just a test for 1 year, so is there a magic number for when I'll need to do this for multiple years if I were to use 12 cores for example? Also, does this work on Casper?
 

dobbins

Brian Dobbins
CSEG and Liaisons
Staff member
Glad to hear it, and as far as I've seen, the duration of the files isn't a factor -- so that number should work fine for longer runs, yes. And on Casper too.

Good luck, and let me know if you run into issues!
 