case submit failed with --ntasks error

wvsi3w
Member
Hello,

I did a quick run with CLM5. case.setup, case.build, and case.submit all completed, but the job failed almost immediately ("Job Wall-clock time: 00:00:17"). When I looked at the log file for errors, I found this message: "srun: error: Invalid numeric value "tasks-per-node" for --ntasks".

I think this means that --ntasks expects a numeric value but is instead being given the string "tasks-per-node", so I need to make sure the option is followed by a number, not a string. The only place I can find "--ntasks-per-node" is in the "env_batch" file. Should I change that one only, or is there anywhere else I should look for "--ntasks" to edit?
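For what it's worth, that message comes from srun's own option parsing rather than from the model itself, so it can presumably be reproduced outside CESM by handing --ntasks a non-numeric value directly (hostname below is only a stand-in command, purely for illustration):

srun --ntasks=tasks-per-node hostname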

This is what I have inside the "env_batch" file:
<directives>
<directive> --job-name={{ job_id }}</directive>
<directive> --nodes={{ num_nodes }}</directive>
<directive> --ntasks-per-node={{ tasks_per_node }}</directive>
<directive> --output={{ job_id }} </directive>
<directive> --exclusive </directive>
</directives>

Should I change it to the following lines?
<directives>
<directive> --job-name={{ job_id }}</directive>
<directive> --nodes={{ num_nodes }}</directive>
<directive> --ntasks={{ num_procs }}</directive>
<directive> --output={{ job_id }} </directive>
<directive> --exclusive </directive>
</directives>

Or should I simply type the number of tasks which is 64?
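For reference, here is a minimal sketch of how such directives would typically appear in the generated batch script, assuming a SLURM #SBATCH prefix and a 1-node job with 64 tasks (the concrete numbers are only illustrative); --ntasks-per-node requests a per-node task count, while --ntasks requests the total task count:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --ntasks=64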


Also, inside the config_machines file there are other ntasks-related settings that I think (not sure) I should change:
<MAX_TASKS_PER_NODE>64</MAX_TASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>64</MAX_MPITASKS_PER_NODE>
<mpirun mpilib="openmpi">
<executable>srun</executable>
<arguments>
<arg name="num_tasks">-n {{ total_tasks }}</arg>
<arg name="tasks_per_node"> -ntasks-per-node $MAX_MPITASKS_PER_NODE </arg>
</arguments>

Should I change it also?
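For context, here is a minimal sketch of the srun invocation those arguments would expand to, assuming MAX_MPITASKS_PER_NODE=64 and total_tasks=64 (both values chosen only for illustration):

srun -n 64 -ntasks-per-node 64 <model executable>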

Thanks for your help.

Keith Oleson
CSEG and Liaisons
Staff member
I don't think you should be getting these errors if the model has been ported correctly. I'm moving this to the infrastructure/porting forum. It would be helpful if you could provide the information requested here:

 

wvsi3w
Member
+ The previous error message was: "srun: error: Invalid numeric value "tasks-per-node" for --ntasks." This means --ntasks expects a numeric value but is being passed the string "tasks-per-node". The only places I find a "--ntasks-per-node" setting are the "env_batch" file and the "env_mach_specific.xml" file in the scratch directory.

+ I first thought it failed because of a missing equals sign, i.e. that it should be --ntasks-per-node=64 and not --ntasks-per-node 64 (which I assumed was being read as "-n tasks-per-node", and "-n" is the same as "--ntasks", hence the confusing error message).


I adjusted (added the =) in these two files: /home/meisam/scratch/cases/feb7/env_mach_specific.xml and /home/meisam/my_cesm_sandbox/cime/config/cesm/machines/config_machines.xml


for the line that says:

<arg name="tasks_per_node"> -ntasks-per-node=$MAX_MPITASKS_PER_NODE </arg> (same line in both files, as the first was generated from the second).

+ That edit did not work, because the actual error was the single minus in -ntasks-per-node; it needs to be --ntasks-per-node. With one dash, "-ntasks-per-node" is interpreted as "-n tasks-per-node", which is the same as "--ntasks tasks-per-node". I've corrected that in those same two .xml files.
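To summarize the fix, here is a minimal sketch of the corrected argument (the same line in both files) and of the srun command it should now expand to, again assuming 64 total tasks and 64 tasks per node:

<arg name="tasks_per_node"> --ntasks-per-node=$MAX_MPITASKS_PER_NODE </arg>

srun -n 64 --ntasks-per-node=64 <model executable>

With the single dash, "-ntasks-per-node" was parsed as "-n" with the value "tasks-per-node", which is exactly what produced the "Invalid numeric value" message.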
 