
Job submit limit, user's size and/or time limits when running case.submit

CGL
Member
Hi everyone. I am trying to run CESM2.1.3 and I got this sbatch error:
ERROR: Command: 'sbatch --time 0:20:00 -p cpu_parallel --dependency=afterok:7408958 case.st_archive --resubmit' failed with error 'sbatch: error: QOSMinCpuNotSatisfied
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)' from dir '/data/sxh/CESM2/CESM/cgl/scratch/test2'

It seems like the case.st_archive job is hitting this limit. Where can I change or set the limit?
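
For reference, this message comes from the Slurm scheduler's accounting/QOS limits rather than from CESM itself, so it can help to look up what the cluster allows. A rough sketch, assuming a standard Slurm setup (exact field names can vary between Slurm versions):

Show the limits on the defined QOS levels:
sacctmgr show qos format=Name,MaxWall,MaxSubmitJobsPerUser

Show the limits attached to your own account/association:
sacctmgr show assoc where user=$USER

List your currently queued jobs, to see whether a submit limit is already reached:
squeue -u $USER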
 

nusbaume (Jesse Nusbaumer)
CSEG and Liaisons
Staff member
Hi CGL,

It looks like you have two different issues listed on this thread. For your first issue you can change the case.st_archive program’s queue and wallclock time by running the following commands in your case directory:

Job queue:
./xmlchange --subgroup case.st_archive --id JOB_QUEUE --val <value>

Wallclock time:
./xmlchange --subgroup case.st_archive --id JOB_WALLCLOCK_TIME --val <value>

Where <value> is whatever job queue name or wallclock time length you want.
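
For example, using placeholder values (the queue name and time below are just illustrations, not recommendations):

./xmlchange --subgroup case.st_archive --id JOB_QUEUE --val cpu_parallel
./xmlchange --subgroup case.st_archive --id JOB_WALLCLOCK_TIME --val 0:30:00

You can then confirm the new settings with xmlquery (assuming your CIME version also accepts --subgroup there):

./xmlquery --subgroup case.st_archive JOB_QUEUE,JOB_WALLCLOCK_TIME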

In terms of your second issue, it looks like the error is happening in the atmosphere model (CAM). I would first check whether there is a useful error message in the atm.log file. Otherwise, I would try running with debugging on, which can be done as follows:

./xmlchange --id DEBUG --val TRUE

And then re-building and re-running the model, which should then hopefully give you a more specific error message.
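
Putting those steps together, a typical sequence would be (a sketch; the clean step assumes a standard CIME case.build):

./xmlchange --id DEBUG --val TRUE
./case.build --clean-all
./case.build
./case.submit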

Hope that helps, and have a great day!

Jesse
 

CGL
Member
It seems like xmlchange does not have the DEBUG option.
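
One thing worth double-checking (just a guess on my part) is that the flags are typed with two ASCII hyphens (--id, --val) rather than dashes, and that DEBUG is a recognized variable in the case, for example:

./xmlquery DEBUG
./xmlchange DEBUG=TRUE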
 

CGL
Member
I think the first issue is related to the second. Maybe some system limit is blocking the job, or something else is misconfigured. I will check this. Thanks for your reply. :)
 

CGL
Member
Hello, have you solved the second issue? I have the same problem.
I changed to a different cluster to run the model, and it worked. I think the reason is that the processes or environment got the wrong configuration when running in parallel. You should contact your cluster administrators.
 