
total_tasks relation to NTASKS and NTHRDS

Shruti

Shruti Joshi
Member
Hello,

I was testing CESM 2.1.2 with different task and thread counts for the compsets A, X, and B1850.
I set NTASKS and NTHRDS to equal values in multiples of 4 (4, 8, 16) using ./xmlchange.
I am assuming the mpirun command takes NTASKS into account when executing.
For example: NTASKS=4 -> mpirun -np 4.

This is what I could conclude by running the A and X compsets.

But that doesn't seem to be happening with the B1850 compset.
So I tried setting MAX_TASKS_PER_NODE=8 and MAX_MPITASKS_PER_NODE=8, but the resulting mpirun command still uses 72 tasks, i.e. mpirun -np 72.

Could you please explain how this value of total_tasks is calculated for the mpirun command?
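Roughly what I ran, as a sketch (the case path is hypothetical and 4 is just the example value above):

cd $CASEROOT                # hypothetical path to the case directory
./xmlchange NTASKS=4        # sets NTASKS for every component
./xmlchange NTHRDS=4        # threads set equal to tasks, as described above
./case.setup --reset        # regenerate the PE layout after the change
./preview_run               # if available, prints the mpirun/srun command that will be used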
 

jedwards

CSEG and Liaisons
Staff member
In an A or X case all components typically start from task 0 in MPI_COMM_WORLD,
but in a B case the cice and pop models are offset so that they can run concurrently with atm and lnd.
If you run ./pelayout in your case you will see something like:
Comp NTASKS NTHRDS ROOTPE
CPL : 1800/ 1; 0
ATM : 1800/ 1; 0
LND : 540/ 1; 0
ICE : 540/ 1; 540
OCN : 648/ 1; 1800
ROF : 540/ 1; 0
GLC : 36/ 1; 0
WAV : 36/ 1; 0
IAC : 1/ 1; 0
ESP : 1/ 1; 0

In this case TOTAL_TASKS=2448 because the ocn ROOTPE=1800 and NTASKS=648, so 1800 + 648 = 2448 tasks are required in total; that offset lets the ocean run concurrently with ATM.
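As a rough illustration, the same number can be read off the ./pelayout listing, assuming it has exactly the column layout shown above:

./pelayout | awk -F'[:/;]' '/:/ {t=$2+$4; if (t>max) max=t} END {print "TOTAL_TASKS =", max}'   # max over components of NTASKS + ROOTPE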
 

ucas_qs

qiushi Zhang
Member
In an A or X case all components typically start from task 0 in MPI_COMM_WORLD,
but in a B case the cice and pop models are offset so that they can run concurrently with atm and lnd.
If you run ./pelayout in your case you will see something like:
Comp NTASKS NTHRDS ROOTPE
CPL : 1800/ 1; 0
ATM : 1800/ 1; 0
LND : 540/ 1; 0
ICE : 540/ 1; 540
OCN : 648/ 1; 1800
ROF : 540/ 1; 0
GLC : 36/ 1; 0
WAV : 36/ 1; 0
IAC : 1/ 1; 0
ESP : 1/ 1; 0

In this case TOTAL_TASKS=2448 because the ocn ROOTPE=1800 and NTASKS=648, so 1800 + 648 = 2448 tasks are required in total; that offset lets the ocean run concurrently with ATM.
Hello, can you elaborate on how the total number of tasks is calculated?
 

ucas_qs

qiushi Zhang
Member
total_tasks = max(rootpe + ntasks)
Thanks. I am running BHISTcmip6; my PE layout is below. I haven't changed the default settings. I want to run the default 5 days, but it hasn't finished after almost 6 hours, so I think there is a problem; I am not sure whether it is due to the PE layout or something else. The link below is my specific description. Can you give me some advice? Thank you.
https://bb.cgd.ucar.edu/cesm/threads/probloms-in-running-compset-bhistcmip6.5601/
Comp NTASKS NTHRDS ROOTPE
CPL : 512/ 1; 0
ATM : 512/ 1; 0
LND : 256/ 1; 0
ICE : 256/ 1; 256
OCN : 64/ 1; 512
ROF : 256/ 1; 0
GLC : 512/ 1; 0
WAV : 512/ 1; 0
ESP : 1/ 1; 0
 

jedwards

CSEG and Liaisons
Staff member
Have you followed the porting documents? You should run some less complicated cases first.
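For example, a sketch of a simpler first test (the scripts path and case name here are only illustrative):

cd my_cesm_sandbox/cime/scripts       # illustrative path to the CIME scripts directory
./create_newcase --case ~/cases/X_f19_g17 --compset X --res f19_g17
cd ~/cases/X_f19_g17
./case.setup
./case.build
./case.submit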
 

ucas_qs

qiushi Zhang
Member
Have you followed the porting documents? You should run some less complicated cases first.
Yes, I ran the C compset before and it was successful. Its resolution is T62_g16 and the stop time is one year, using one node (64 processors); it took about 1 hour. Attached is the average SST obtained. I think this verifies that the port of CESM 2.1.3 is successful, but there are problems with BHISTcmip6. In addition, I have a question: how can I modify the output variables in POP (reduce unnecessary output or add some physical fields)? There seems to be no "fincl" in user_nl_pop. Thank you very much.
[attached image: average SST]
 

jedwards

CSEG and Liaisons
Staff member
You should be able to determine if the model is progressing slowly or is stalled by looking at the cpl log file in the run directory.
The popdoc documentation chapter "10. Model diagnostics and output" describes the process of adjusting POP output.

What is the resolution of the B case? You might try an F case at the same resolution using more than 1 node. If you are able to run on a single node but not on multiple nodes it may indicate a problem in the network.
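A quick check, as a sketch (assuming the usual timestamped cpl.log.<LID> file in the run directory):

cd $RUNDIR                              # the case run directory
grep -i "model date" cpl.log.* | tail   # last model dates the coupler reported
tail -f cpl.log.*                       # new date lines should keep appearing if the run is progressing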
 

ucas_qs

qiushi Zhang
Member
You should be able to determine if the model is progressing slowly or is stalled by looking at the cpl log file in the run directory.
The popdoc documentation chapter "10. Model diagnostics and output" describes the process of adjusting POP output.

What is the resolution of the B case? You might try an F case at the same resolution using more than 1 node. If you are able to run on a single node but not on multiple nodes it may indicate a problem in the network.
Thank you very much for your help. It suddenly occurred to me that I didn't download ESMF, and I don't know if this is the cause of the problem. Do I need to download ESMF for the BHISTcmip6 or C compsets? If so, how do I select the version I need?
 

ucas_qs

qiushi Zhang
Member
No, you do not need ESMF for these cases.
Thanks. I just tried the lower resolution f19_g17 for BHISTcmip6 and it actually ran successfully, which is a little surprising. Six nodes (384 processors) were used, but I still can't figure out why the f09_g17 case failed. I consulted other people; they said that the NetCDF library does not match the Intel compiler used. I don't know whether this is credible, because I think the successful run of the C compset indicates that the server environment is OK. What is your opinion?
 

jedwards

CSEG and Liaisons
Staff member
I think you are correct. If f19_g17 is working but f09_g17 is not you might try increasing the number of nodes used for the f09 case.
Check the end of the cesm.log and the cpl.log - where is the f09_g17 case stopping?
 

ucas_qs

qiushi Zhang
Member
I think you are correct. If f19_g17 is working but f09_g17 is not you might try increasing the number of nodes used for the f09 case.
Check the end of the cesm.log and the cpl.log - where is the f09_g17 case stopping?
Thanks. The following are screenshots of the cpl and cesm logs. The default run length is 5 days. Since the job had been running for more than 20 hours and the output files had not grown, I canceled it. I initially submitted BHISTcmip6_f09_g17 with nine nodes (578 processors).
[attached screenshots: cpl and cesm logs]
 

ucas_qs

qiushi Zhang
Member
You should be able to determine if the model is progressing slowly or is stalled by looking at the cpl log file in the run directory.
The popdoc documentation chapter "10. Model diagnostics and output" describes the process of adjusting POP output.

What is the resolution of the B case? You might try an F case at the same resolution using more than 1 node. If you are able to run on a single node but not on multiple nodes it may indicate a problem in the network.
Hi jedwards. The time-averaged history-file ("tavg") output in POP does not seem to include the latent heat flux or sea surface pressure variables. How can I make POP output them? In addition, I also need the ocean temperature budget terms (such as horizontal advection, vertical advection, and so on); how do I set these to be output? Thank you very much. Looking forward to your reply.
 

ucas_qs

qiushi Zhang
Member
total_tasks = max(rootpe + ntasks)
Hi jedwards. I am running BHISTcmip6_f19_g17 and the PE layout is as below. By total_tasks = max(rootpe + ntasks) that should be 578, but my slurm log shows "run command is srun -n 384 /public3/home/sc52515/my_cesm_sandbox/output/BHISTcmip6_f19_g17/bld/cesm.exe >> cesm.log.$LID 2>&1", so 384 tasks, not 578, were used for the submission. What is the reason for this? I also want to make this case run faster; how do I change the processor layout? Thanks.
Comp NTASKS NTHRDS ROOTPE
CPL : 512/ 1; 0
ATM : 512/ 1; 0
LND : 256/ 1; 0
ICE : 256/ 1; 256
OCN : 64/ 1; 512
ROF : 256/ 1; 0
GLC : 512/ 1; 0
WAV : 512/ 1; 0
ESP : 1/ 1; 0
 

jedwards

CSEG and Liaisons
Staff member
It should be 578; I don't see how 384 could have happened. Did you run case.setup --reset after changing any of these values?
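If not, a rough sketch of the usual sequence (the xmlchange values here are only illustrative, matching the layout you posted):

./pelayout                                  # confirm the layout CIME currently has
./xmlchange NTASKS_OCN=64,ROOTPE_OCN=512    # example only: keep the ocean after the 512 atm tasks
./case.setup --reset                        # regenerate the PE and batch settings
./case.build
./preview_run                               # check that the srun -n value now matches the expected total
./case.submit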
 

ucas_qs

qiushi Zhang
Member
It should be 578; I don't see how 384 could have happened. Did you run case.setup --reset after changing any of these values?
Yes, I ran case.setup --reset after changing these values. Another strange thing: I copied $CCSMROOT/models/ocn/pop2/input_templates/gx1v7_tavg_contents to $CASE/SourceMods/src.pop/ and commented out the unnecessary streams (keeping only stream 1, which is "h"). Then I ran case.setup --reset and case.build, and submitted the case. But there were errors:
POP aborting...
FATAL ERROR: Empty stream

------------------------------------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 553
------------------------------------------------------------------------

I looked at the gx1v7_tavg_contents in $CASE/run/ and found that the other streams in it had not been commented out. I don't understand what's going on; are my changes being overridden at build time? Thanks.
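To double-check whether my edits ever reach the run directory, I could do something like this (a sketch; I am not sure the POP build picks this file up from SourceMods at all):

cd $CASEROOT
./preview_namelists                                                       # regenerate the run-directory input files
diff SourceMods/src.pop/gx1v7_tavg_contents $RUNDIR/gx1v7_tavg_contents   # see whether the edited copy made it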
 

jedwards

CSEG and Liaisons
Staff member
I'm not sure about the empty stream issue; you may need to post this question in the POP model forum.
 