
CTSM single-point simulation

ceyang

New Member
Hi All,

I asked in another post about single-point CTSM runs on Derecho and got some suggestions (i.e., to ask NCAR's Helpdesk for help), but I wonder if anyone could share their successful experiences or insights.

According to NCAR's Helpdesk, an entire node (128 cores) is charged even if I need only one core to run a single-point simulation. Cheyenne supported the "share" queue, so I wonder if anyone has found a way to run site-level CTSM simulations on Derecho without leaving 127 cores idle, for example by running multiple sites' simulations within one batch job.

Or, is using Casper the only way to run single-point CTSM in the future?

Any working solution or instructions for modifying the case.submit script would be much appreciated.


Thank you,
Ken
 

slevis

Moderator
You may find additional info in the document and discussion links provided in this post:
 

ceyang

New Member
@slevis Thanks for your update. I tried the workaround just added in the Google document. It temporarily solves the single-point run issues on Derecho.

However, there is a catch in addition to the one-hour wallclock limit: only one submitted job runs while the others are held in the Develop queue, at least in my case. So I must wait a week or longer to complete all of my single-point simulations. I would much appreciate any way to handle this issue. Or is releasing working externals for CTSM on Casper imminent?


Thanks,
Ken
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I've checked, and running on Casper isn't imminent. I would check with CISL about the inability to run more than one job at a time in the Develop queue to see if that is expected behavior. I don't see anything in the Derecho documentation that would indicate that is the case.
I do wonder if it is possible to use the "job array" capability to run multiple jobs on a single node, as indicated in the documentation here:
Single point simulations might fit the MPMD criteria outlined there.
Again, I would contact CISL to see if they could help with this.
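To illustrate the general idea of packing several independent single-core runs into one job, here is a minimal Python sketch that starts a list of commands concurrently with a capped worker pool. It is not the PBS job-array or MPMD mechanism from the Derecho documentation; the demo commands and the worker count are placeholder assumptions.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each command (an argv list) as its own process; return exit codes in order."""
    def run_one(argv):
        # Each command is a separate process, so one failure does not stop the others.
        return subprocess.run(argv).returncode

    # Threads only wait on the child processes; max_workers caps how many run at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Placeholder commands; a real job would launch one single-point case per entry.
    demo = [[sys.executable, "-c", f"print('site {i} done')"] for i in range(3)]
    print(run_all(demo))
```

Checking the returned exit codes afterwards shows which sites, if any, need to be rerun.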
 

ceyang

New Member
@oleson Thanks for your comment. I just received a response from CISL, and MPMD is what they suggest on Derecho:

"Yes, the Develop queue has a restriction that only allows for one job per user.

Often this is an issue for users that are trying to run either a pre/post processing script or other Multiple Program Multiple Data (MPMD) type workflows. If that is the case for you, we have a tool on Derecho to launch many instances of the same script/executable using a command file. You can read about that here if you are interested: NCAR HPC Documentation"
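As a rough sketch of the command-file idea CISL describes, the Python snippet below builds one command line per site and writes the lines to a text file. The site names, the `run_site.sh` script, and the `cmdfile` file name are hypothetical placeholders. The launcher tool and its options are not shown here, so the NCAR HPC Documentation remains the reference for how to submit such a file.

```python
from pathlib import Path


def command_lines(sites, template="./run_site.sh {site}"):
    """Return one command string per site."""
    # Hypothetical: run_site.sh stands in for whatever runs one single-point case.
    return [template.format(site=site) for site in sites]


def write_command_file(lines, path="cmdfile"):
    """Write the commands to a text file, one per line."""
    Path(path).write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical site identifiers, for illustration only.
    lines = command_lines(["site_A", "site_B", "site_C"])
    for line in lines:
        print(line)
    # To create the file itself: write_command_file(lines, "cmdfile")
```

Keeping each site on its own line means a failed site can be rerun by copying only that line into a new file.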
 