
Resubmit failure

Status
Not open for further replies.

polly

Polly Thornton
New Member
I am running a multi-instance, DATM, single-point case on Cheyenne. I set RESUBMIT=3, but the case stopped with no error messages after successfully completing the initial run and the first resubmit (so RESUBMIT is now 2). The archive directories were written successfully. When I try running the case to see if it will finish, I get this error:

ERROR: CONTINUE_RUN is true but this case does not appear to have restart files staged in /glade/scratch/pbuotte/Ituri_9pft_ensemble/run rpointer.cpl

However, the rpointer.cpl_0001, etc. files (I have 18 ensemble members) are in the run directory and point to the correct restart files, which also exist in the run directory. Thanks for any insights.
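A check like the one described above (every ensemble member has a pointer file, and each pointer names a file that is present in the run directory) can be scripted. This is only a minimal sketch: the function name `check_instance_pointers` is hypothetical, and it assumes the instance suffix is a zero-padded four-digit number (as in rpointer.cpl_0001) and that the first non-empty line of each pointer file is the restart file name.

```python
from pathlib import Path


def check_instance_pointers(rundir, ninst, prefix="rpointer.cpl"):
    """Return a list of problems found with the per-instance pointer files.

    Assumptions (not taken from CIME): suffixes are _0001, _0002, ... and the
    first non-empty line of a pointer file names the restart file.
    """
    rundir = Path(rundir)
    problems = []
    for inst in range(1, ninst + 1):
        pointer = rundir / f"{prefix}_{inst:04d}"
        if not pointer.is_file():
            problems.append(f"missing pointer file: {pointer.name}")
            continue
        # Keep only non-empty lines so trailing blank lines do not matter.
        lines = [ln.strip() for ln in pointer.read_text().splitlines() if ln.strip()]
        if not lines:
            problems.append(f"empty pointer file: {pointer.name}")
            continue
        # A relative name is resolved against the run directory.
        target = rundir / lines[0]
        if not target.is_file():
            problems.append(f"{pointer.name} points to missing file: {lines[0]}")
    return problems
```

For the case in this thread, the call would be something like `check_instance_pointers("/glade/scratch/pbuotte/Ituri_9pft_ensemble/run", 18)`; an empty list means every pointer and its target were found.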
 

slevis

Moderator
Staff member
Hi @polly,

I have never tried "multi-instance", but my first thought is that the error message above references
/glade/scratch/pbuotte/Ituri_9pft_ensemble/run/rpointer.cpl
rather than
/glade/scratch/pbuotte/Ituri_9pft_ensemble/run rpointer.cpl_0001

No other ideas for now. It's interesting that the first resubmit worked...
 

polly

Polly Thornton
New Member
Hi @slevis! Yes, it's curious that it doesn't say it's looking for the rpointer.cpl for each ensemble member. I don't know if that's just how the error message is configured or if that's what's (not) happening in the code. I also don't know why it would succeed on the first resubmit and not the next.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
@polly what model version are you using?

I suggest searching the code to see where that error you gave comes from. I suspect it is in cime, but I wasn't clear where it shows up. I'm wondering if this is a bug in the driver for multi-instance cases?
 

polly

Polly Thornton
New Member
Hi @erik. I'm running ctsm5.1.dev088. The error message comes from case_submit.py. But I didn't get any error message the first time. It just stopped running after the first resubmit. I only got the error message when I tried to submit the case manually.

I set finidat to the restart files in the user_nl_clm_* files, changed CONTINUE_RUN=FALSE, set the correct RUN_STARTDATE and it is running now. So there's nothing wrong with the restart files.
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
You shouldn't have to change the value of finidat, since that is just used for the first startup segment and NOT the continue run cases. By setting CONTINUE_RUN to FALSE that means you are starting the first startup segment over. But, maybe you just mean to say that you can show that the restart files are fine based on doing that?

In looking at cime for that version it looks like there might be issues. Are you running with MCT or NUOPC? I think you need to use NUOPC and make sure MULTI_DRIVER==TRUE.
 

polly

Polly Thornton
New Member
Yes, I set finidat to make sure nothing had gone wrong with writing the restart files.

I am running with NUOPC and MULTI_DRIVER=TRUE
 

erik

Erik Kluzek
CSEG and Liaisons
Staff member
When I look at the if structure for this in

scripts/lib/CIME/case/case_submit.py
it looks wrong to me for your case...

Python:
        # only checks for the first instance in a multidriver case
        if case.get_value("COMP_INTERFACE") == "nuopc":
            rpointer = "rpointer.cpl"
        elif case.get_value("MULTI_DRIVER"):
            rpointer = "rpointer.drv_0001"
        else:
            rpointer = "rpointer.drv"
        expect(
            os.path.exists(os.path.join(rundir, rpointer)),
            "CONTINUE_RUN is true but this case does not appear to have restart files staged in {} {}".format(
                rundir, rpointer
            ),
        )

However, looking at the latest cime code, this looks correct. So I'm thinking there was a bug in cime for multi-driver with NUOPC that has since been fixed. You can probably search for it and figure out what it is and which tag fixed it. You might then be able to use a newer version of cime or CTSM and get this to work, or backport the fix to your version.
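To make the problem with the quoted if structure concrete: because the NUOPC branch is tested first, a NUOPC case with MULTI_DRIVER set to TRUE is checked for rpointer.cpl, while the run directory only holds the suffixed files (rpointer.cpl_0001, and so on). The sketch below shows one way the two settings could be combined. It is an illustration only, not the actual fix in newer cime; the function name `select_rpointer` is hypothetical, and the _0001 suffix for the NUOPC multi-driver case is inferred from the file names reported earlier in this thread.

```python
def select_rpointer(comp_interface, multi_driver):
    """Pick the coupler pointer file to look for before a continue run.

    Illustrative only: this is not the upstream cime code.
    """
    # The base name depends on the driver: NUOPC writes rpointer.cpl,
    # the MCT driver writes rpointer.drv (as in the quoted snippet).
    base = "rpointer.cpl" if comp_interface == "nuopc" else "rpointer.drv"
    # With MULTI_DRIVER the files carry an instance suffix, so check the
    # first instance, whichever driver is in use.
    return f"{base}_0001" if multi_driver else base
```

With this ordering, a NUOPC multi-driver case would be checked for rpointer.cpl_0001, which is the file that exists in the run directory described above.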
 