jacob_t_seeley@gmail_com
Hi,
I've been running CESM1.2.2 in aquaplanet mode and short-term archiving data from CAM in the scratch repository of my machine (NERSC's Hopper). I was using the resubmit option to break a long integration into chunks. One of the chunks accidentally exceeded its wallclock time limit and was terminated by the job scheduler, and now all data for this case in the scratch repository has disappeared. There had previously been ~200 GB from about 50 prior successfully completed chunks of the integration. Is there a reason why the short-term archiving script would delete everything? I assume the script just adds new data to the archiving repository, so I'm not sure why all data was removed. I'm pasting below the job output from the one that got terminated.
Thanks,
Jake
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
-------------------------------------------------------------------------
CESM BUILDNML SCRIPT STARTING
- To prestage restarts, untar a restart.tar file into /scratch/scratchdirs/seeley/aqua.02/run
infile is /global/homes/s/seeley/second_project/aqua.02/Buildconf/cplconf/cesm_namelist
CAM writing dry deposition namelist to drv_flds_in
CAM writing namelist to atm_in
CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
-------------------------------------------------------------------------
CESM PRESTAGE SCRIPT STARTING
- Case input data directory, DIN_LOC_ROOT, is /project/projectdirs/ccsm1/inputdata
- Checking the existence of input datasets in DIN_LOC_ROOT
CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
Wed Sep 24 15:16:06 PDT 2014 -- CSM EXECUTION BEGINS HERE
Wed Sep 24 15:45:54 PDT 2014 -- CSM EXECUTION HAS FINISHED
(seq_mct_drv): =============== SUCCESSFUL TERMINATION OF CPL7-CCSM ===============
Archiving cesm output to /scratch/scratchdirs/seeley/archive/aqua.02
Calling the short-term archiving script st_archive.sh
st_archive.sh: start of short-term archiving
st_archive.sh: restart files from end of run will be saved,
interim restart files will be deleted
=>> PBS: job killed: walltime 1838 exceeded limit 1800
Terminated
RESUBMIT is now 49
Terminated
+ --------------------------------------------------------------------------
+ Job name: aqua.02
+ Job Id: 8176754.hopque01
+ System: hopper
+ Queued Time: Wed Sep 24 15:13:50 2014
+ Start Time: Wed Sep 24 15:15:26 2014
+ Completion Time: Wed Sep 24 15:46:05 2014
+ User: seeley
+ MOM Host: nid04754
+ Queue: debug
+ Req. Resources: mppnodect=3,mppnppn=24,mppwidth=72,walltime=00:30:00
+ Used Resources: cput=00:00:18,mem=8316kb,vmem=58764kb,walltime=00:30:41
+ Acct String: m1196
+ PBS_O_WORKDIR: /global/u1/s/seeley/second_project/aqua.02
+ Submit Args: ./aqua.02.run
+ --------------------------------------------------------------------------
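For reference, here is a rough sketch of the timing implied by the log above. It only does arithmetic on the timestamps and walltime figures already printed in the output; the variable names are mine, and it assumes the model timestamps and the PBS summary use the same clock. It suggests the model run itself finished after the 1800 s limit had already passed, leaving st_archive.sh only about 10 seconds before the scheduler killed the job.

```python
from datetime import datetime


def seconds_between(start, end):
    """Elapsed seconds between two same-day HH:MM:SS strings."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Values copied from the job output above (same day, same clock assumed).
job_start = "15:15:26"    # "Start Time" in the PBS summary
model_start = "15:16:06"  # "CSM EXECUTION BEGINS HERE"
model_end = "15:45:54"    # "CSM EXECUTION HAS FINISHED"
walltime_limit = 1800     # requested walltime=00:30:00, in seconds
walltime_at_kill = 1838   # from "walltime 1838 exceeded limit 1800"

# How long the model itself ran.
model_runtime = seconds_between(model_start, model_end)

# How far into the job the model finished.
elapsed_at_model_end = seconds_between(job_start, model_end)

# Time st_archive.sh had between the model finishing and the kill.
archive_window = walltime_at_kill - elapsed_at_model_end

print("model runtime (s):", model_runtime)
print("job elapsed when model finished (s):", elapsed_at_model_end)
print("already over limit when archiving began:", elapsed_at_model_end > walltime_limit)
print("seconds available to st_archive.sh:", archive_window)
```

So the archiving step appears to have been interrupted almost immediately after it started, which may be relevant to what happened to the archive directory.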