
Running BHIST experiment with error: pio_die:: myrank= -1 : ERROR: nf_mod.F90: 730 : NetCDF: Variable not found

xnnzka
Member
Hello, everyone, I need your help!

My CESM version is 2.2.0.

I want to use a CESM2 LENS restart file for a branch run, but I received this error in cesm.log:
--------------------------------------------------------------------------
Opened existing file b.e21.BHISTsmbb.f09_g17.LE2-1301.012.cam.r.2000-01-01-00000.nc 65536
Opened existing file /public/project/cesm/inputdata/atm/cam/topo/fv_0.9x1.25_nc3000_Nsw042_Nrs008_Co060_Fi001_ZR_sgh30_24km_GRNL_c170103.nc 131072
Opened existing file /public/project/cesm/inputdata/atm/cam/ozone_strataero/ozone_strataero_WACCM_L70_zm5day_18500101-20150103_CMIP6ensAvg_c180923.nc 196608
Opened existing file /public/project/cesm/inputdata/atm/waccm/lb/LBC_1750-2015_CMIP6_GlobAnnAvg_c180926.nc 196608
NetCDF: Variable not found
NetCDF: Variable not found
NetCDF: Variable not found
NetCDF: Variable not found
NetCDF: Variable not found
WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
NetCDF: Variable not found
pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 730 : NetCDF: Variable not found
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In this experiment, I just set:
./xmlchange --file env_run.xml --id RUN_TYPE --val 'branch'
./xmlchange --file env_run.xml --id RUN_REFCASE --val 'b.e21.BHISTsmbb.f09_g17.LE2-1301.012'
./xmlchange --file env_run.xml --id RUN_REFDATE --val '2000-01-01'
./xmlchange --file env_run.xml --id STOP_N --val '6'
./xmlchange --file env_run.xml --id STOP_OPTION --val 'nmonths'
./xmlchange --file env_run.xml --id RESUBMIT --val '1'
I did not modify any source code or anything else, so I am very confused.

I attach my atm.log, cesm.log and cpl.log, because these are the only logs I have. It seems that lnd and the other components never start before the run is killed.

If the information I provided is not enough, please tell me!
 

Attachments

  • atm.log.txt
    32.9 KB
  • cesm.log.txt
    12.8 KB
  • cpl.log.txt
    46.4 KB

erik

Erik Kluzek
CSEG and Liaisons
Staff member
According to the cpl log file, the run only got as far as trying to initialize the ATM model, which is CAM in this case. The end of the atm log shows that it had just opened the file /public/project/cesm/inputdata/atm/waccm/lb/LBC_1750-2015_CMIP6_GlobAnnAvg_c180926.nc. The CESM log shows the failure is in PIO, so it could be a problem with that file, since that's the last thing the log mentions. It also reports as an error that a variable was not found...
pio_support::pio_die:: myrank= -1 : ERROR: nf_mod.F90: 730 : NetCDF: Variable not found

It also reports some other variables as not found. Sometimes the code can handle variables not being found, and sometimes it can't, so not all of these reports of missing variables are the error that aborts the code. But the one above does look like it is what's aborting the code, since it's followed by the MPI_ABORT call.
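As a rough illustration of the log reading described above, here is a minimal shell sketch that pulls the last "Opened existing file" entry, the number of non-fatal "Variable not found" messages, and the fatal pio_die line out of a cesm.log. The function name summarize_pio_abort is hypothetical (it is not a CESM tool), and it assumes the message formats shown in the log excerpt in this thread. Output from different MPI ranks can interleave, so the "last file opened" is only a hint, not proof of which file is at fault.

```shell
#!/usr/bin/env bash
# Sketch only: summarize_pio_abort is a hypothetical helper, not part of CESM.
# It assumes log lines shaped like the cesm.log excerpt earlier in this thread.
summarize_pio_abort() {
  # $1 = path to a cesm.log file
  awk '
    # Remember the most recently opened file (4th whitespace-separated field).
    /Opened existing file/ { last = $4 }
    # Count "Variable not found" messages that are not the fatal pio_die line.
    /NetCDF: Variable not found/ && !/pio_die/ { warn++ }
    # Stop at the first pio_die line and print a short summary.
    /pio_die/ {
      print "last file opened: " last
      print "non-fatal not-found messages before abort: " (warn + 0)
      print "fatal: " $0
      exit
    }
  ' "$1"
}
```

It would be called as `summarize_pio_abort cesm.log` from the run directory. If no pio_die line is present, the function prints nothing.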

The first suggestion I have is that you clean the build and build/run the code with DEBUG=TRUE, so that you'll see a traceback. This should give you line numbers of the code that's failing in the cesm.log file. Then you can look at the part of the code that it's failing in and add write statements to catch the error. If you don't get a traceback from doing that, you might need to figure out by hand where in the code the above file is being read so that you can do some debugging in it.
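For reference, the clean rebuild with DEBUG=TRUE could look roughly like the sketch below. The wrapper function rebuild_with_debug is a hypothetical name used only for illustration. The commands inside it (xmlchange, case.build --clean-all, case.build, case.submit) are the standard case scripts, and the sketch assumes it is pointed at an existing case directory.

```shell
#!/usr/bin/env bash
# Sketch only: rebuild_with_debug is a hypothetical wrapper, not a CESM command.
rebuild_with_debug() {
  # $1 = path to the case directory
  if [ ! -x "$1/xmlchange" ]; then
    echo "no xmlchange in $1; expected a CESM case directory" >&2
    return 1
  fi
  (
    cd "$1" || exit 1
    # Turn on debug compiler flags so a failure produces a traceback.
    ./xmlchange DEBUG=TRUE &&
    # Remove all existing build products so every component is recompiled.
    ./case.build --clean-all &&
    ./case.build &&
    ./case.submit
  )
}
```

A debug build runs noticeably slower, so it is usually worth setting DEBUG back to FALSE and rebuilding once the problem is found.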
 

xnnzka
Member
Hi, Erik! Thank you very much for your suggestions! I set INFO_DBUG=2 and DEBUG=TRUE and got new logs. I am sorry, but after reading these logs I still do not know why the run is killed, so I attach them here. Could you help me?
 

Attachments

  • lognew.zip
    320.1 KB

xnnzka
Member
As for /public/project/cesm/inputdata/atm/waccm/lb/LBC_1750-2015_CMIP6_GlobAnnAvg_c180926.nc, I deleted this nc file and CESM downloaded it again when I submitted my case. I also checked the file and did not find anything wrong. For convenience, I attach it as well.

If you have any suggestions, please tell me! Thank you very much for your work!
 

Attachments

  • LBC_1750-2015_CMIP6_GlobAnnAvg_c180926.zip
    15.7 KB

xnnzka
Member
Hi, Erik. When I switched to the restart file from 1281.001, it worked. So I think there might be something wrong with the smbb 1301.012 restart file.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
I'm glad you are getting something to work. I'm guessing that somehow that restart file got corrupted. If you do need that specific restart file, I'd try rerunning that section to get a new restart file at that point.

Does this resolve your issues sufficiently?
 

xnnzka
Member
Hi, Erik. Thank you very much for your help and support. I changed the CESM version from 2.2.0 to 2.1.4, and now I can run my case successfully! I think the CESM2 LENS restart files might not be fully compatible with CESM2.2.0.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
OK, if I've got this right, you've gone from CESM2.2.0 to the older CESM2.1.4, and now it's working, which is great. Yes, changes in the restart files could be the issue there. Restart contents change between versions often enough that there likely are differences. Sometimes you can still get it to work, especially if you do a hybrid RUN_TYPE rather than a branch. And in general I would hope the code would tell you which variable is missing from the restart file, so you at least know.
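As a rough sketch of the hybrid alternative mentioned above, the settings could be changed with the same xmlchange syntax used earlier in this thread. The wrapper configure_hybrid_run is a hypothetical name, and setting RUN_STARTDATE equal to the reference date is an assumption made for illustration (a hybrid run lets the start date differ from the reference date). The sketch also assumes the reference restart files are already available to the case.

```shell
#!/usr/bin/env bash
# Sketch only: configure_hybrid_run is a hypothetical wrapper, not a CESM command.
configure_hybrid_run() {
  # $1 = case directory, $2 = reference case name, $3 = reference date (YYYY-MM-DD)
  if [ ! -x "$1/xmlchange" ]; then
    echo "no xmlchange in $1; expected a CESM case directory" >&2
    return 1
  fi
  (
    cd "$1" || exit 1
    # Assumption: the new run starts on the same date as the reference restart,
    # so RUN_STARTDATE is set to the same value as RUN_REFDATE.
    ./xmlchange --file env_run.xml --id RUN_TYPE --val 'hybrid' &&
    ./xmlchange --file env_run.xml --id RUN_REFCASE --val "$2" &&
    ./xmlchange --file env_run.xml --id RUN_REFDATE --val "$3" &&
    ./xmlchange --file env_run.xml --id RUN_STARTDATE --val "$3"
  )
}
```

For the case in this thread, the call would be `configure_hybrid_run . b.e21.BHISTsmbb.f09_g17.LE2-1301.012 2000-01-01`, run from the case directory.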

But the original LENS was run with CESM1, and the CESM2 LENS case you are branching from (the b.e21 prefix) comes from the CESM2.1 series, so it does make sense that CESM2.2.0 had problems with its restart files, and that you just needed to use an older code base for it.

Anyway, glad it's all working! Take care.
 

xnnzka
Member
Yes, you are right! I now use the older version and a hybrid run type for my experiment. Thank you very much for your suggestions!
 