[Port Validation] Runtime Failure: "ERROR: restformat: number of records on restart file not supported"

aekholm@...

Hello, 

I am in the process of performing port validation testing of CESM 1.2.1 on our local cluster, "scylla," here at WHOI. Several test cases hit the same runtime failure: the initial model run fails, and the $RUNDIR/$CASE/cesm.log.YYMMDD-hhmmss log file reports the following error.

(shr_sys_abort) ERROR: restformat: number of records on restart file not supported

 

 

This failure occurs during runtime for the following test cases, and others, but does not seem to occur when testing the or compsets:

ERS_D.f05_g16.ETEST.scylla_pgi.t01

ERS_D.f19_g16_rx1.G.scylla_pgi.t01

ERT_D.ne30_g16.B1850CN.scylla_pgi.t01

 

 

The following is taken from the test case "ERT_D.ne30_g16.B1850CN.scylla_pgi.t01."

 


$CASEROOT/TestStatus:

RUN ERT_D.ne30_g16.B1850CN.scylla_pgi.t01

 

 

TestStatus.out indicates that "initial model run failed."

$CASEROOT/TestStatus.out:

doing a 3 nmonths initial test
pass = 0
ERROR in /opt/gridengine/62u5_20110621/default/spool/scylla059/job_scripts/55303: coupler log indicates that inital model run failed

 

CaseStatus also records the run failure, and it shows SFAIL on the line after create_newcase.

$CASEROOT/CaseStatus:

test created with the following options:
case: ERT_D.ne30_g16.B1850CN.scylla_pgi.t01
casebaseid: ERT_D.ne30_g16.B1850CN.scylla_pgi
compiler: pgi
compset: B1850CN
confopts: _D
fullname: ERT_D
grid: ne30_g16
mach: scylla
test_argv: -testname ERT_D.ne30_g16.B1850CN.scylla_pgi -testroot /scratch/aekholm/cesm/cesm1_2_1/scripts
testname: ERT
create_newcase -case /scratch/aekholm/cesm/cesm1_2_1/scripts/ERT_D.ne30_g16.B1850CN.scylla_pgi.t01 -res ne30_g16 -mach scylla -compset B1850CN -testname ERT -confopts _D -compiler pgi
SFAIL ERT_D.ne30_g16.B1850CN.scylla_pgi.t01
build complete 2014-02-24 09:17:15
test submitted 2014-02-24 09:35:01
test started 2014-02-24 09:35:32
run started 2014-02-24 09:35:33
run FAILED 2014-02-24 09:39:41

 

The output log indicates the root of the error is described in the CESM runtime log.

$CASEROOT/$CASE.o55303:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
-------------------------------------------------------------------------
CESM BUILDNML SCRIPT STARTING
- To prestage restarts, untar a restart.tar file into /scratch/aekholm/cesm/run/ERT_D.ne30_g16.B1850CN.scylla_pgi.t01
infile is /scratch/aekholm/cesm/cesm1_2_1/scripts/ERT_D.ne30_g16.B1850CN.scylla_pgi.t01/Buildconf/cplconf/cesm_namelist
CAM writing dry deposition namelist to drv_flds_in
CAM writing namelist to atm_in
CLM configure done.
CLM adding use_case 1850_control defaults for var sim_year with val 1850
CLM adding use_case 1850_control defaults for var sim_year_range with val constant
CLM adding use_case 1850_control defaults for var stream_year_first_ndep with val 1850
CLM adding use_case 1850_control defaults for var stream_year_last_ndep with val 1850
CLM adding use_case 1850_control defaults for var use_case_desc with val Conditions to simulate 1850 land-use
CICE configure done.
POP2 build-namelist: ocn_grid is gx1v6
POP2 build-namelist: ocn_tracer_modules are iage
CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
-------------------------------------------------------------------------
CESM PRESTAGE SCRIPT STARTING
- Case input data directory, DIN_LOC_ROOT, is /scratch/aekholm/cesm/input
- Checking the existence of input datasets in DIN_LOC_ROOT
CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY
-------------------------------------------------------------------------
Mon Feb 24 09:36:02 EST 2014 -- CSM EXECUTION BEGINS HERE
Mon Feb 24 09:39:41 EST 2014 -- CSM EXECUTION HAS FINISHED
Model did not complete - see /scratch/aekholm/cesm/run/ERT_D.ne30_g16.B1850CN.scylla_pgi.t01/cesm.log.140224-093533
initial run failed.

The CESM runtime log gives the error "ERROR: restformat: number of records on restart file not supported."

$RUNDIR/$CASE/cesm.log.140224-093533

aekholm@scylla-a:~$ tail -n16 /scratch/aekholm/cesm/run/ERT_D.ne30_g16.B1850CN.scylla_pgi.t01/cesm.log.140224-093533
(shr_sys_abort) ERROR: restformat: number of records on restart file not supported
(shr_sys_abort) WARNING: calling shr_mpi_abort() and stopping
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 20854 on
node scylla059 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
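
In a failure like this, the first abort message in each component log is usually more informative than the MPI teardown that follows it. Below is a minimal sketch for surfacing those lines across a run directory, assuming the logs follow the `*.log.*` naming seen above. The function name `first_errors` is hypothetical and not part of CESM.

```shell
#!/bin/sh
# Hypothetical helper (not part of CESM): print the first line containing
# "ERROR" from every *.log.* file in a run directory.
first_errors() {
    rundir=$1
    for log in "$rundir"/*.log.*; do
        # Skip the unexpanded glob pattern when no logs exist.
        [ -f "$log" ] || continue
        # grep -n prefixes each match with its line number; keep the first.
        match=$(grep -n 'ERROR' "$log" | head -n 1)
        [ -n "$match" ] || continue
        printf '%s:%s\n' "$(basename "$log")" "$match"
    done
}
```

Calling `first_errors "$RUNDIR"` prints one `logname:line:message` entry per log that contains an error, which makes it easier to see whether any component other than the coupler reported a problem.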

 

 

I can find no other indication of an error in any of the build or runtime logs. If anyone could suggest how to proceed in debugging this issue, it would be greatly appreciated.

 

Thanks!

 

Alex

Alexander K. Ekholm
Engineer I, Physical Oceanography
Woods Hole Oceanographic Institution
Woods Hole, MA, USA

jedwards

Hi Alex,


I think this message is coming from ice_restart.F90. I can't tell exactly what is going on from what you've sent, but it's trying to read a binary input file. Could it be an endian issue? I tried to look up scylla to see what kind of machine it is, but couldn't find anything on your website.

CESM Software Engineer

aekholm@...

Hi, 

 

Thanks for the quick reply! Scylla is a distributed-memory Linux cluster consisting of 80 compute nodes and 2 master nodes.

 

Here is some system information:

 

aekholm@scylla-a:~$ uname -a

Linux scylla-a.whoi.edu 2.6.32-55-server #117-Ubuntu SMP Tue Dec 3 17:45:11 UTC 2013 x86_64 GNU/Linux

 

aekholm@scylla-a:~$ cat /proc/version

Linux version 2.6.32-55-server (buildd@toyol) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #117-Ubuntu SMP Tue Dec 3 17:45:11 UTC 2013

 

aekholm@scylla-a:~$ cat /proc/version_signature 

Ubuntu 2.6.32-55.117-server 2.6.32.61+drm33.26

 

aekholm@scylla-b:~$ lscpu # (master node)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                24
Thread(s) per core:    2
Core(s) per socket:    6
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               1600.000
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K

 

aekholm@scylla-b:~$ lsb_release -a

LSB Version:core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:cxx-3.0-amd64:cxx-3.0-noarch:cxx-3.1-amd64:cxx-3.1-noarch:cxx-3.2-amd64:cxx-3.2-noarch:cxx-4.0-amd64:cxx-4.0-noarch:desktop-3.1-amd64:desktop-3.1-noarch:desktop-3.2-amd64:desktop-3.2-noarch:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.0-amd64:graphics-3.0-noarch:graphics-3.1-amd64:graphics-3.1-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-3.2-amd64:printing-3.2-noarch:printing-4.0-amd64:printing-4.0-noarch:qt4-3.1-amd64:qt4-3.1-noarch

Distributor ID: Ubuntu
Description:    Ubuntu 10.04.4 LTS
Release:        10.04
Codename:       lucid

 

Please let me know if there is any additional information that I can provide to aid in debugging this issue.

 

Thanks again!

Alex

Alexander K. Ekholm
Engineer I, Physical Oceanography
Woods Hole Oceanographic Institution
Woods Hole, MA, USA

jedwards

So it shouldn't be an endian problem. Perhaps it's not finding the input file, or the one it finds is corrupted? Look in the ice.log file to find the name of the restart file that it's trying to open. Let me know the name and an md5sum, and I can compare them to what is on our server.

CESM Software Engineer
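
The checksum comparison suggested above can be sketched as follows. This is a minimal example, assuming GNU coreutils `md5sum` is available; the function name `check_md5` and its return codes are hypothetical, and the expected digest would have to come from the reference copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's MD5 digest with an expected value.
# Assumes GNU coreutils md5sum, which prints "<digest>  <filename>".
check_md5() {
    file=$1
    expected=$2
    if [ ! -f "$file" ]; then
        echo "missing: $file"
        return 2
    fi
    # Keep only the digest, the first space-separated field.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "match: $file"
    else
        echo "MISMATCH: $file (got $actual)"
        return 1
    fi
}
```

A digest of `d41d8cd98f00b204e9800998ecf8427e` is worth recognizing on sight: it is the MD5 of empty input, so it means the file has zero length rather than corrupted contents.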

aekholm@...

From the ice.log file:

 

aekholm@scylla-a:/scratch/aekholm/cesm/run/ERT_D.ne30_g16.B1850CN.scylla_pgi.t01$ grep restart ice.log.140224-093533 

  restart                   =        T

  restart_dir               = 

  restart_file              = ERT_D.ne30_g16.B1850CN.scylla_pgi.t01.cice.r

  restart_format            = nc

  restart_age               =        F

  restart_FY                =        F

  restart_lvl               =        F

  restart_pond              =        F

  restart_aero              =        F


This file does not exist in $RUNDIR (or anywhere else), so I cannot provide the md5sum. I suppose the missing file would explain the error.

 

Do you have any suggestions on how to proceed?

 

Alexander K. Ekholm
Engineer I, Physical Oceanography
Woods Hole Oceanographic Institution
Woods Hole, MA, USA

jedwards

Looking at an ice log locally, I think you want to check the variable ice_ic. It should refer to a file in inputdata.

CESM Software Engineer

aekholm@...

Ok, jedwards, thanks again for your help!

Alexander K. Ekholm
Engineer I, Physical Oceanography
Woods Hole Oceanographic Institution
Woods Hole, MA, USA

aekholm@...

It seems that ice_ic is set correctly in $CASEROOT/Buildconf/ciceconf/ice_in:

 

aekholm@scylla-b:/scratch/aekholm/cesm/cesm1_2_1/scripts/ERT_D.ne30_g16.B1850CN.scylla_pgi.t01/Buildconf/ciceconf$ grep ice_ic ice_in 

 ice_ic = '/scratch/aekholm/cesm/input/ice/cice/iced.0001-01-01.gx1v6_20080212'

 

 

However, on further investigation I found that the ice_ic input file was zero-length.

 

aekholm@scylla-a:/scratch/aekholm/cesm/input/ice/cice$ ls -l iced.0001-01-01.gx1v6_20080212 

-rw-r--r-- 1 aekholm aekholm 0 2014-02-23 15:17 iced.0001-01-01.gx1v6_20080212
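
The two manual checks above (reading ice_ic from ice_in, then inspecting the file it names) can be combined into one step. This is a minimal sketch, assuming the namelist line has the form `ice_ic = '/path/to/file'` as shown above; the function name `check_ice_ic` and its return codes are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the ice_ic path from a CICE namelist file and
# report whether the file it names is usable.
# Assumes a line of the form:  ice_ic = '/path/to/file'
check_ice_ic() {
    namelist=$1
    # Capture the text between the single quotes on the ice_ic line.
    path=$(sed -n "s/^[[:space:]]*ice_ic[[:space:]]*=[[:space:]]*'\([^']*\)'.*/\1/p" "$namelist" | head -n 1)
    if [ -z "$path" ]; then
        echo "ice_ic not found in $namelist"
        return 3
    elif [ ! -e "$path" ]; then
        echo "missing: $path"
        return 2
    elif [ ! -s "$path" ]; then
        # -s is true only for files larger than zero bytes.
        echo "zero-length: $path"
        return 1
    fi
    echo "ok: $path"
}
```

Run as `check_ice_ic "$CASEROOT/Buildconf/ciceconf/ice_in"`, this distinguishes a missing file from an empty one. That distinction matters here because the path was correct and the file existed, so an existence check alone would have passed.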

 

 

I was able to export the input file from the SVN input data repository.

 

aekholm@scylla-a:/scratch/aekholm/cesm/input/ice/cice$ svn export --username guestuser --password XXXXXX https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/ice/cice/iced.00... `pwd`/iced.0001-01-01.gx1v6_20080212 

A    /scratch/aekholm/cesm/input/ice/cice/iced.0001-01-01.gx1v6_20080212

Export complete.

 

 

After the export, I confirmed that the CICE input file is no longer zero-length.

 

aekholm@scylla-a:/scratch/aekholm/cesm/input/ice/cice$ ls -lh iced.0001-01-01.gx1v6_20080212 

-rw-r--r-- 1 aekholm aekholm 61M 2009-02-17 13:54 iced.0001-01-01.gx1v6_20080212

 

 

I've re-submitted the test case to the batch scheduler, and I'm awaiting completion of the run to confirm that this issue is resolved.

Alexander K. Ekholm
Engineer I, Physical Oceanography
Woods Hole Oceanographic Institution
Woods Hole, MA, USA
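
Since one input file turned out to be zero-length, other files under the same input data directory may have been truncated in the same way. A minimal sketch for listing them all at once, using only standard `find` options; the function name `list_empty_inputs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list every zero-length regular file under a directory.
list_empty_inputs() {
    # -type f restricts the search to regular files;
    # -size 0 matches files that occupy zero blocks, i.e. empty files.
    find "$1" -type f -size 0 -print
}
```

For example, `list_empty_inputs /scratch/aekholm/cesm/input` prints one path per empty file. Each file reported can then be re-exported from the input data repository in the same way as the iced file above.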
