
total_tasks relation to NTASKS and NTHRDS

Shruti

Shruti Joshi
Member
Hello,

I was testing CESM 2.1.2 with different task and thread counts for the compsets A, X, and B1850.
I set NTASKS and NTHRDS to equal values in multiples of 4 (4, 8, 16) using ./xmlchange.
I am assuming the mpirun command takes NTASKS into account when executing.
For example: NTASKS=4 -> mpirun -np 4.

This is what I could conclude by running the A and X compsets.

But that doesn't seem to be happening with the B1850 compset.
So I tried setting MAX_TASKS_PER_NODE=8 and MAX_MPITASKS_PER_NODE=8, but the resulting mpirun command still uses 72 tasks, i.e. mpirun -np 72.

Could you please explain how this value of total_tasks is calculated for the mpirun command?
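Roughly what I ran, as a sketch (the case path is hypothetical and 4 is just the example value above):

cd $CASEROOT                # hypothetical path to the case directory
./xmlchange NTASKS=4        # sets NTASKS for every component
./xmlchange NTHRDS=4        # threads set equal to tasks, as described above
./case.setup --reset        # regenerate the PE layout after the change
./preview_run               # if available, prints the mpirun/srun command that will be used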
 

jedwards

CSEG and Liaisons
Staff member
In an A or X case all components typically start from task 0 in MPI_COMM_WORLD,
but in a B case the cice and pop models are offset so that they can run concurrently with atm and lnd.
If you run ./pelayout in your case you will see something like:
Comp NTASKS NTHRDS ROOTPE
CPL : 1800/ 1; 0
ATM : 1800/ 1; 0
LND : 540/ 1; 0
ICE : 540/ 1; 540
OCN : 648/ 1; 1800
ROF : 540/ 1; 0
GLC : 36/ 1; 0
WAV : 36/ 1; 0
IAC : 1/ 1; 0
ESP : 1/ 1; 0

In this case TOTAL_TASKS=2448 because the ocn ROOTPE=1800 and NTASKS=648, so 1800 + 648 = 2448 tasks are required in total; that offset lets the ocean run concurrently with ATM.
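As a rough illustration, the same number can be read off the ./pelayout listing, assuming it has exactly the column layout shown above:

./pelayout | awk -F'[:/;]' '/:/ {t=$2+$4; if (t>max) max=t} END {print "TOTAL_TASKS =", max}'   # max over components of NTASKS + ROOTPE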
 

ucas_qs

qiushi Zhang
Member
In an A or X case all components typically start from task 0 in MPI_COMM_WORLD,
but in a B case the cice and pop models are offset so that they can run concurrently with atm and lnd.
If you run ./pelayout in your case you will see something like:
Comp NTASKS NTHRDS ROOTPE
CPL : 1800/ 1; 0
ATM : 1800/ 1; 0
LND : 540/ 1; 0
ICE : 540/ 1; 540
OCN : 648/ 1; 1800
ROF : 540/ 1; 0
GLC : 36/ 1; 0
WAV : 36/ 1; 0
IAC : 1/ 1; 0
ESP : 1/ 1; 0

In this case TOTAL_TASKS=2448 because the ocn ROOTPE=1800 and NTASKS=648, so 1800 + 648 = 2448 tasks are required in total; that offset lets the ocean run concurrently with ATM.
Hello, can you elaborate on how the total number of tasks is calculated?
 

ucas_qs

qiushi Zhang
Member
total_tasks = max(rootpe + ntasks)
Thanks. I am running BHISTcmip6; my PE layout is below. I haven't changed the default settings. I want to run the default 5 days, but it hasn't finished after almost 6 hours, so I think there is a problem; I am not sure whether it is due to the PE layout or something else. The link below is my specific description. Can you give me some advice? Thank you.
https://bb.cgd.ucar.edu/cesm/threads/probloms-in-running-compset-bhistcmip6.5601/
Comp NTASKS NTHRDS ROOTPE
CPL : 512/ 1; 0
ATM : 512/ 1; 0
LND : 256/ 1; 0
ICE : 256/ 1; 256
OCN : 64/ 1; 512
ROF : 256/ 1; 0
GLC : 512/ 1; 0
WAV : 512/ 1; 0
ESP : 1/ 1; 0
 

jedwards

CSEG and Liaisons
Staff member
Have you followed the porting documents? You should run some less complicated cases first.
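For example, a sketch of a simpler first test (the scripts path and case name here are only illustrative):

cd my_cesm_sandbox/cime/scripts       # illustrative path to the CIME scripts directory
./create_newcase --case ~/cases/X_f19_g17 --compset X --res f19_g17
cd ~/cases/X_f19_g17
./case.setup
./case.build
./case.submit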
 

ucas_qs

qiushi Zhang
Member
Have you followed the porting documents? You should run some less complicated cases first.
Yes, I ran the C compset before and it was successful. Its resolution is T62_g16 and the stop time is one year, using one node (64 processors); it took about 1 hour. Attached is the average SST obtained. I think this verifies that the port of CESM 2.1.3 is successful, but there are problems with BHISTcmip6. In addition, I have a question: how can I modify the output variables in POP (reduce unnecessary output or add some physical fields)? There seems to be no "fincl" in user_nl_pop. Thank you very much.
[attached image: average SST]
 

jedwards

CSEG and Liaisons
Staff member
You should be able to determine if the model is progressing slowly or is stalled by looking at the cpl log file in the run directory.
The popdoc documentation chapter "10. Model diagnostics and output" describes the process of adjusting POP output.

What is the resolution of the B case? You might try an F case at the same resolution using more than 1 node. If you are able to run on a single node but not on multiple nodes it may indicate a problem in the network.
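A quick check, as a sketch (assuming the usual timestamped cpl.log.<LID> file in the run directory):

cd $RUNDIR                              # the case run directory
grep -i "model date" cpl.log.* | tail   # last model dates the coupler reported
tail -f cpl.log.*                       # new date lines should keep appearing if the run is progressing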
 

ucas_qs

qiushi Zhang
Member
You should be able to determine if the model is progressing slowly or is stalled by looking at the cpl log file in the run directory.
The popdoc documentation chapter "10. Model diagnostics and output" describes the process of adjusting POP output.

What is the resolution of the B case? You might try an F case at the same resolution using more than 1 node. If you are able to run on a single node but not on multiple nodes it may indicate a problem in the network.
Thank you very much for your help. It suddenly occurred to me that I didn't download ESMF, and I don't know if this is the cause of the problem. Do I need to download ESMF for the BHISTcmip6 or C compsets? If so, how do I select the version I need?
 

ucas_qs

qiushi Zhang
Member
No, you do not need ESMF for these cases.
Thanks. I just tried the lower resolution f19_g17 for BHISTcmip6 and it actually ran successfully, which is a little surprising. Six nodes (384 processors) were used, but I still can't figure out why the f09_g17 case failed. I consulted other people; they said that the NetCDF library does not match the Intel compiler used. I don't know whether this is credible, because I think the successful run of the C compset indicates that the server environment is OK. What is your opinion?
 

jedwards

CSEG and Liaisons
Staff member
I think you are correct. If f19_g17 is working but f09_g17 is not you might try increasing the number of nodes used for the f09 case.
Check the end of the cesm.log and the cpl.log - where is the f09_g17 case stopping?
 

ucas_qs

qiushi Zhang
Member
I think you are correct. If f19_g17 is working but f09_g17 is not you might try increasing the number of nodes used for the f09 case.
Check the end of the cesm.log and the cpl.log - where is the f09_g17 case stopping?
Thanks. The following are screenshots of the cpl and cesm logs. The default run length is 5 days. Since the job had been running for more than 20 hours and the output files had not grown, I canceled it. I initially submitted BHISTcmip6_f09_g17 with nine nodes (578 processors).
[attached screenshots: cpl and cesm logs]
 

ucas_qs

qiushi Zhang
Member
You should be able to determine if the model is progressing slowly or is stalled by looking at the cpl log file in the run directory.
The popdoc documentation chapter "10. Model diagnostics and output" describes the process of adjusting POP output.

What is the resolution of the B case? You might try an F case at the same resolution using more than 1 node. If you are able to run on a single node but not on multiple nodes it may indicate a problem in the network.
Hi jedwards. The time-averaged history-file ("tavg") output in POP does not seem to include the latent heat flux or sea surface pressure variables. How can I make POP output them? In addition, I also need the ocean temperature budget terms (such as horizontal advection, vertical advection, and so on); how do I set these to be output? Thank you very much. Looking forward to your reply.
 

ucas_qs

qiushi Zhang
Member
total_tasks = max(rootpe + ntasks)
Hi jedwards. I am running BHISTcmip6_f19_g17 and the PE layout is as below. By total_tasks = max(rootpe + ntasks) that should be 578, but my slurm log shows "run command is srun -n 384 /public3/home/sc52515/my_cesm_sandbox/output/BHISTcmip6_f19_g17/bld/cesm.exe >> cesm.log.$LID 2>&1", so 384 tasks, not 578, were used for the submission. What is the reason for this? I also want to make this case run faster; how do I change the processor layout? Thanks.
Comp NTASKS NTHRDS ROOTPE
CPL : 512/ 1; 0
ATM : 512/ 1; 0
LND : 256/ 1; 0
ICE : 256/ 1; 256
OCN : 64/ 1; 512
ROF : 256/ 1; 0
GLC : 512/ 1; 0
WAV : 512/ 1; 0
ESP : 1/ 1; 0
 

jedwards

CSEG and Liaisons
Staff member
It should be 578; I don't see how 384 could have happened. Did you run case.setup --reset after changing any of these values?
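If not, a rough sketch of the usual sequence (the xmlchange values here are only illustrative, matching the layout you posted):

./pelayout                                  # confirm the layout CIME currently has
./xmlchange NTASKS_OCN=64,ROOTPE_OCN=512    # example only: keep the ocean after the 512 atm tasks
./case.setup --reset                        # regenerate the PE and batch settings
./case.build
./preview_run                               # check that the srun -n value now matches the expected total
./case.submit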
 

ucas_qs

qiushi Zhang
Member
It should be 578; I don't see how 384 could have happened. Did you run case.setup --reset after changing any of these values?
Yes, I ran case.setup --reset after changing these values. Another strange thing: I copied $CCSMROOT/models/ocn/pop2/input_templates/gx1v7_tavg_contents to $CASE/SourceMods/src.pop/ and commented out the unnecessary streams (keeping only stream 1, which is "h"). Then I ran case.setup --reset and case.build, and submitted the case. But there were errors:
POP aborting...
FATAL ERROR: Empty stream

------------------------------------------------------------------------
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 553
------------------------------------------------------------------------

I looked at the gx1v7_tavg_contents in $CASE/run/ and found that the other streams in it had not been commented out. I don't understand what's going on; are my changes being overridden at build time? Thanks.
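To double-check whether my edits ever reach the run directory, I could do something like this (a sketch; I am not sure the POP build picks this file up from SourceMods at all):

cd $CASEROOT
./preview_namelists                                                       # regenerate the run-directory input files
diff SourceMods/src.pop/gx1v7_tavg_contents $RUNDIR/gx1v7_tavg_contents   # see whether the edited copy made it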
 

jedwards

CSEG and Liaisons
Staff member
I'm not sure about the empty stream issue; you may need to post this question in the POP model forum.
 