case submit failed with --ntasks error

wvsi3w

Member
Hello,

I did a quick test run of CLM5. case.setup, case.build, and case.submit all completed, but the job failed after only 17 seconds ("Job Wall-clock time: 00:00:17"). The log file shows this error: "srun: error: Invalid numeric value "tasks-per-node" for --ntasks".

I think this means that --ntasks expects a numeric value but is receiving the string "tasks-per-node" instead, so I need to make sure the option is followed by a number. The only place I can find "--ntasks-per-node" is the "env_batch" file. Should I change only that one, or is there anywhere else I should look for "--ntasks"?

This is what I have inside the "env_batch" file:
<directives>
<directive> --job-name={{ job_id }}</directive>
<directive> --nodes={{ num_nodes }}</directive>
<directive> --ntasks-per-node={{ tasks_per_node }}</directive>
<directive> --output={{ job_id }} </directive>
<directive> --exclusive </directive>
</directives>

Should I change it to the following lines?
<directives>
<directive> --job-name={{ job_id }}</directive>
<directive> --nodes={{ num_nodes }}</directive>
<directive> --ntasks={{ num_procs }}</directive>
<directive> --output={{ job_id }} </directive>
<directive> --exclusive </directive>
</directives>

Or should I simply type in the number of tasks, which is 64?


Also, the config_machines.xml file has another tasks-per-node setting that I think (but am not sure) I should change:
<MAX_TASKS_PER_NODE>64</MAX_TASKS_PER_NODE>
<MAX_MPITASKS_PER_NODE>64</MAX_MPITASKS_PER_NODE>
<mpirun mpilib="openmpi">
<executable>srun</executable>
<arguments>
<arg name="num_tasks">-n {{ total_tasks }}</arg>
<arg name="tasks_per_node"> -ntasks-per-node $MAX_MPITASKS_PER_NODE </arg>
</arguments>

Should I change it also?

Thanks for your help.
 

oleson

Keith Oleson
CSEG and Liaisons
Staff member
I don't think you should be getting these errors if the model has been ported correctly. I'm moving this to the infrastructure/porting forum. It would be helpful if you could provide the information requested here:

 

wvsi3w

Member
+ The previous error message was: "srun: error: Invalid numeric value "tasks-per-node" for --ntasks". This means that --ntasks expects a numeric value but is being given the string "tasks-per-node" instead. The "--ntasks-per-node" parameter appears only in the "env_batch" file and in the "env_mach_specific.xml" file in the scratch directory.

+ I noticed what looked like a missing equals sign: the option should be written --ntasks-per-node=64, not --ntasks-per-node 64. The option was being interpreted as "-n tasks-per-node", and since "-n" is the same as "--ntasks", that produced the confusing error message.


I added the "=" in these two files: /home/meisam/scratch/cases/feb7/env_mach_specific.xml and /home/meisam/my_cesm_sandbox/cime/config/cesm/machines/config_machines.xml

After the edit, the line reads as follows (it is the same line in both files, because the first is generated from the second):

<arg name="tasks_per_node"> -ntasks-per-node=$MAX_MPITASKS_PER_NODE </arg>

+ That edit did not work. The real problem was the single leading dash in -ntasks-per-node: it needs to be --ntasks-per-node, because "-ntasks-per-node" is interpreted as "-n tasks-per-node", which is the same as "--ntasks tasks-per-node". I've corrected that in the same two .xml files.
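
For anyone hitting the same error, here is a sketch of what the corrected mpirun block would look like with both fixes applied (double dash, and the value attached with "="). It assumes the rest of the block matches the excerpt in the first post. The closing </mpirun> tag and the indentation are my assumptions, since the excerpt did not show them.

```xml
<mpirun mpilib="openmpi">
  <executable>srun</executable>
  <arguments>
    <arg name="num_tasks">-n {{ total_tasks }}</arg>
    <!-- Long option: double leading dash, value attached with "=" -->
    <arg name="tasks_per_node"> --ntasks-per-node=$MAX_MPITASKS_PER_NODE </arg>
  </arguments>
</mpirun>
```

Because env_mach_specific.xml in the case directory is generated from config_machines.xml, the same line has to be corrected in both files for an existing case.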
 