termination during archiving deleted tons of data?

Hi,

I've been running CESM1.2.2 in aquaplanet mode and short-term archiving data from CAM in the scratch repository of my machine (NERSC's Hopper). I was using the resubmit option to break a long integration into chunks. One of the chunks accidentally exceeded its wallclock time limit and was terminated by the job scheduler, and now all data for this case in the scratch repository has disappeared. There had previously been ~200 GB from about 50 prior successfully completed chunks of the integration. Is there a reason why the short-term archiving script would delete everything? I assume the script just adds new data to the archiving repository, so I'm not sure why all data was removed. I'm pasting below the job output from the one that got terminated.

Thanks,
jake


Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
-------------------------------------------------------------------------
CESM BUILDNML SCRIPT STARTING
- To prestage restarts, untar a restart.tar file into /scratch/scratchdirs/seeley/aqua.02/run
infile is /global/homes/s/seeley/second_project/aqua.02/Buildconf/cplconf/cesm_namelist
CAM writing dry deposition namelist to drv_flds_in
CAM writing namelist to atm_in
CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
-------------------------------------------------------------------------
CESM PRESTAGE SCRIPT STARTING
- Case input data directory, DIN_LOC_ROOT, is /project/projectdirs/ccsm1/inputdata
- Checking the existence of input datasets in DIN_LOC_ROOT
CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
Wed Sep 24 15:16:06 PDT 2014 -- CSM EXECUTION BEGINS HERE
Wed Sep 24 15:45:54 PDT 2014 -- CSM EXECUTION HAS FINISHED
(seq_mct_drv): =============== SUCCESSFUL TERMINATION OF CPL7-CCSM ===============
Archiving cesm output to /scratch/scratchdirs/seeley/archive/aqua.02
Calling the short-term archiving script st_archive.sh

st_archive.sh: start of short-term archiving
st_archive.sh: restart files from end of run will be saved,
interim restart files will be deleted
=>> PBS: job killed: walltime 1838 exceeded limit 1800
Terminated
RESUBMIT is now 49
Terminated

+ --------------------------------------------------------------------------
+ Job name: aqua.02
+ Job Id: 8176754.hopque01
+ System: hopper
+ Queued Time: Wed Sep 24 15:13:50 2014
+ Start Time: Wed Sep 24 15:15:26 2014
+ Completion Time: Wed Sep 24 15:46:05 2014
+ User: seeley
+ MOM Host: nid04754
+ Queue: debug
+ Req. Resources: mppnodect=3,mppnppn=24,mppwidth=72,walltime=00:30:00
+ Used Resources: cput=00:00:18,mem=8316kb,vmem=58764kb,walltime=00:30:41
+ Acct String: m1196
+ PBS_O_WORKDIR: /global/u1/s/seeley/second_project/aqua.02
+ Submit Args: ./aqua.02.run
+ --------------------------------------------------------------------------
 

jet

Member
Hi Jacob:

I have run into this before. Fear not: I don't think your data is lost. Your case directory under the archive directory has most likely been moved to a hidden "dot file" directory. When the short-term archive script runs, it works in this temporary directory and then moves the temporary hidden name back to the normal archive case name. Because the archiving script didn't finish, this final move never happened. The hidden directory's name will start with .sta, followed by a bunch of numbers and a date. This is very annoying, and I just about had a heart attack the first time it happened to me. I will bring it up with our system group and lobby to have the script changed. I'm hoping you will find all your files safe and sound, just hidden from plain sight. You can move the .sta directory back to its standard case name.

jt
 

jet

Member
PS. I forgot to mention that you need to do an

ls -al

to see the hidden directory. A normal ls will miss it, which is why you thought it was deleted.
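
For anyone who hits the same thing, here is a minimal shell sketch of the recovery described above. The helper names find_sta_dirs and restore_sta_dir are hypothetical and not part of CESM. The archive path in the usage comments comes from the job log in the original post, and the .sta name shown there is only a placeholder. Whether the hidden directory sits next to the case directory or inside it is an assumption either way, so the search looks two levels deep and the restore step moves entries one at a time without overwriting anything.

```shell
# Hypothetical helpers (not part of CESM) for recovering an interrupted
# short-term archive. Nothing runs until you call the functions yourself.

# List hidden .sta* directories at or just below the archive root given as $1.
find_sta_dirs() {
    find "$1" -maxdepth 2 -type d -name '.sta*'
}

# Move every entry of hidden directory $1 into case archive directory $2.
# Entries that already exist in $2 are skipped rather than overwritten.
restore_sta_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue          # glob matched nothing
        name=$(basename "$entry")
        if [ -e "$dest/$name" ]; then
            echo "skipping $name: already exists in $dest" >&2
            continue
        fi
        mv "$entry" "$dest/"
    done
}

# Example usage (paths assumed from the job log; replace the placeholder name):
# find_sta_dirs /scratch/scratchdirs/seeley/archive
# restore_sta_dir /scratch/scratchdirs/seeley/archive/.sta-PLACEHOLDER \
#                 /scratch/scratchdirs/seeley/archive/aqua.02
```

Check the output of find_sta_dirs by eye before restoring anything, since an old hidden directory from an earlier interrupted job could also match.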
 
Jet,

Although I was able to recover the archived data, I think the restart files are missing given this output in the cesm log:

PGFIO-F-209/OPEN/unit=98/'OLD' specified for file which does not exist.
File name = rpointer.drv
In source file /global/u1/s/seeley/second_project/cesm1_2_2/models/drv/shr/seq_infodata_mod.F90, at line number 657
[NID 06251] 2014-09-25 00:32:23 Apid 34801442: initiated application termination
Application 34801442 exit codes: 127
Application 34801442 exit signals: Killed
Application 34801442 resources: utime ~12s, stime ~3s, Rss ~18612, inblocks ~327478, outblocks ~1386733

There definitely aren't the usual restart netcdf files in the run directory. How can I get this going from where it left off?

Thanks,
jake
 

aliceb

Member
Hi Jake,

In your original post you have:

st_archive.sh: start of short-term archiving
st_archive.sh: restart files from end of run will be saved,
interim restart files will be deleted
=>> PBS: job killed: walltime 1838 exceeded limit 1800

These messages tell you the short-term archiver did not complete (1) and any interim restart files have been deleted (2). The interim restart file write frequency is defined in the env_run.xml file using the REST_N and REST_OPTION settings. The short-term archiver saves any interim restart files based on the DOUT_S_SAVE_INT_REST_FILES setting. You will need to restart your run from the last time a complete restart set was written in the $DOUT_S_ROOT/$CASE/rest/[date] directory. Copy these files into the run directory and resubmit your run.

More details are available from the CESM users guide at:
http://www.cesm.ucar.edu/models/cesm1.2/cesm/doc/usersguide/x1580.html#running_ccsm_restarts

and $CASEROOT xml files at:
http://www.cesm.ucar.edu/models/cesm1.2/cesm/doc/modelnl/modelnl.html
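
The copy step could be sketched in shell roughly as follows. The helper names latest_restart_dir and stage_restart_files are hypothetical and not part of CESM. The sketch assumes the dated subdirectories under rest/ have zero-padded names so that the last one in sorted order is the most recent, and it does not check that the restart set is complete, so confirm that the rpointer files and restart netcdf files are all present before resubmitting.

```shell
# Hypothetical helpers (not part of CESM) for staging the last restart set.

# Print the last dated subdirectory under the rest directory given as $1.
# Relies on the shell expanding the glob in sorted order.
latest_restart_dir() {
    latest=
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        latest=${d%/}
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Copy every file from restart directory $1 into run directory $2,
# preserving timestamps.
stage_restart_files() {
    cp -p "$1"/* "$2"/
}

# Example usage (DOUT_S_ROOT, CASE and RUNDIR must be set for your case):
# restdir=$(latest_restart_dir "$DOUT_S_ROOT/$CASE/rest")
# ls -l "$restdir"                      # inspect the set before copying
# stage_restart_files "$restdir" "$RUNDIR"
```

After the copy, the rpointer.drv file that the failed job could not open should exist in the run directory again.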
 