
Questions on NTASKS, ROOTPE, and submission

xiangli

Xiang Li
Member
For recent gnu compiler versions you will need to add the flags
-fallow-argument-mismatch -fallow-invalid-boz to the FCFLAGS
Thanks Jim.

Do you mean I should add this statement to config_compilers.xml?

<ADD_FCFLAGS>-fallow-argument-mismatch -fallow-invalid-boz</ADD_FCFLAGS>

I'm not sure whether I did it correctly:

1707946279259.png

1707946382331.png

Could you please give more details on how exactly to add FCFLAGS?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
No - just append the line containing -ffixed-line-length-none with

-fallow-argument-mismatch -fallow-invalid-boz
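
For illustration, assuming the gnu section of config_compilers.xml has an FFLAGS entry containing -ffixed-line-length-none roughly as in a stock CESM2 tree (the <base> wrapper and the other flags shown here are assumptions and may differ from yours), the appended line might end up looking like:

<FFLAGS>
  <base> -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none -fallow-argument-mismatch -fallow-invalid-boz </base>
</FFLAGS>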
 

xiangli

Xiang Li
Member
No - just append the line containing -ffixed-line-length-none with

-fallow-argument-mismatch -fallow-invalid-boz
Thanks, Jim! Now I got past the error.

At the final step of the build, there was an error related to -lnetcdf.

1708014316129.png

1708014358717.png

I added lines 155-156 in config_compilers.xml, but it still did not work.

1708014413574.png

Do you have any suggestions?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
It looks like you have separate library locations for netcdf and netcdff but only have the netcdff path in your SLIBS
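
For illustration, an SLIBS entry that points at both libraries might look like the sketch below; both -L paths are placeholders to be replaced with the actual netcdf-fortran and netcdf-c install locations on this machine:

<SLIBS>
  <append> -L/path/to/netcdf-fortran/lib -lnetcdff -L/path/to/netcdf-c/lib -lnetcdf </append>
</SLIBS>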
 

xiangli

Xiang Li
Member
It looks like you have separate library locations for netcdf and netcdff but only have the netcdff path in your SLIBS
Hi Jim,

Yes, I added the netcdf path to my SLIBS, and now ./create_test SMS.f19_g17.X runs successfully.

1708016905206.png

By the way, do I also need to add <env name="NETCDF_PATH">/opt/apps/rhel9/netcdf-c-4.9.2</env> here:

1708016994889.png

I'm now testing a B1850 case. There was an error at the final step of the build:

1708017109911.png

1708017198159.png

Do you have any comments or suggestions on that?

Thanks,
Xiang
 

xiangli

Xiang Li
Member
No, you don't need it in both places.

Thanks, Jim!

This error appeared at the final step of building a B1850 case:

1708017109911.png


1708017198159.png


Any suggestions on that would be appreciated!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
This is due to exceeding the limits of data in your memory model - I'm not sure how you might fix it.
 

xiangli

Xiang Li
Member
This is due to exceeding the limits of data in your memory model - I'm not sure how you might fix it.
Hi Jim,

I realized that this error was because I reduced the NTASKS and ROOTPE incorrectly.

This is the default setting:

1708105476796.png

1708105509126.png

1708105540765.png

This was what I modified (both MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE were set to 8):

1708105595684.png

1708105613685.png

1708105641098.png

I did this to reduce the number of nodes the job would request.

Any suggestions on how to reduce the NTASKS and ROOTPE correctly?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
I would increase the number of tasks per node and then the ntasks per component until you get a configuration that builds.
You can also play around with the gcc memory model flags instead. In any case running cesm on shared nodes is risky business - either you are going to stall or crash due to not enough resources or you are going to cause other users to stall and crash for the same reason.
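
As a rough sketch of that approach from the case directory (every value below is illustrative, and the -mcmodel note is an assumption about which gcc memory model flag is meant, not a verified fix):

./xmlchange MAX_TASKS_PER_NODE=32,MAX_MPITASKS_PER_NODE=32   # allow more tasks on each node
./xmlchange NTASKS=-2        # negative values request whole nodes; individual components can be set with NTASKS_ATM, NTASKS_OCN, etc.
./case.setup --reset
./case.build --clean-all
# Alternatively, experiment with the gcc memory model, e.g. add -mcmodel=medium to the Fortran flags in config_compilers.xml.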
 

xiangli

Xiang Li
Member
I would increase the number of tasks per node and then the ntasks per component until you get a configuration that builds.
You can also play around with the gcc memory model flags instead. In any case running cesm on shared nodes is risky business - either you are going to stall or crash due to not enough resources or you are going to cause other users to stall and crash for the same reason.
Hi Jim,

Yes, if MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE are set too small (such as 8), errors occur either during building or during running.

During the weekend, I changed MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE to 32 and used the default 6-node setting to run the model. It took 19 minutes to successfully finish the 5-model-day run.

However, today I unexpectedly had an error when building the case:

1708464368798.png

1708464404992.png

Could you please take a look at it?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
it looks like someone or something external to cesm killed your process - this sometimes happens if you run out of memory on a shared system
 

xiangli

Xiang Li
Member
it looks like someone or something external to cesm killed your process - this sometimes happens if you run out of memory on a shared system
Hi Jim,

Exactly! Now the case can be built and submitted to run when there are no other processes on the node.

Following the porting guide, I'm running ./scripts_regression_tests. There was an error:

1708530163339.png

As you can see below, I actually copied the original machines.py and renamed it to machines_original.py. Then I modified the machines.py file by changing the machine name to "duke":

1708530350932.png

1708530322760.png

However, it seems that the script was still reading machines_original.py.

How should I deal with it?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
You really should not need to modify any of the python. Your changes should be limited to xml files.
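
For a port, the xml changes usually amount to a new machine block in config_machines.xml (plus a matching compiler block in config_compilers.xml). A heavily abridged, hypothetical sketch for a machine named duke - every value here is a placeholder:

<machine MACH="duke">
  <DESC>Hypothetical local workstation entry</DESC>
  <OS>LINUX</OS>
  <COMPILERS>gnu</COMPILERS>
  <MPILIBS>openmpi</MPILIBS>
  <CIME_OUTPUT_ROOT>/path/to/scratch</CIME_OUTPUT_ROOT>
  <DIN_LOC_ROOT>/path/to/inputdata</DIN_LOC_ROOT>
  <MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
  <MAX_MPITASKS_PER_NODE>32</MAX_MPITASKS_PER_NODE>
  <!-- module_system, batch, and mpirun sections omitted from this sketch -->
</machine>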
 

xiangli

Xiang Li
Member
You really should not need to modify any of the python. Your changes should be limited to xml files.
Hi Jim,

Thanks for the reminder!

Now I have restored the original machines.py script and rerun ./scripts_regression_tests. Here is the error message:

1708543379320.png

Any suggestions on that?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
The code you are showing me with cori-haswell in machines line 282 has been modified in the
latest maint-5.6 code which you should have - that name has been changed to derecho.
 

xiangli

Xiang Li
Member
The code you are showing me with cori-haswell in machines line 282 has been modified in the
latest maint-5.6 code which you should have - that name has been changed to derecho.
Hi Jim,

Does maint-5.6 mean CESM2.1.5.6? Can I get the latest code by redownloading the model?

Actually, I have been trying to redownload it, but I could not run ./manage_externals/checkout_externals. I remember running into this problem when I downloaded it last time, but I forget how I got past it.

1708545736753.png

Is this related to Subversion 1.14.3, which I'm currently using?

Looking forward to your suggestions!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
I see - you want to:

git checkout release-cesm2.1.5
./manage_externals/checkout_externals
cd cime
git checkout maint-5.6
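
To confirm the checkouts took effect, something along these lines, run from the CESM source root, should report the expected tag and branch (the comments describe expected output, not output captured here):

git describe --tags     # expected to report release-cesm2.1.5
git -C cime status      # expected to mention maint-5.6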
 

xiangli

Xiang Li
Member
I see - you want to:

git checkout release-cesm2.1.5
./manage_externals/checkout_externals
cd cime
git checkout maint-5.6
Thanks, Jim!

With these commands, I successfully updated the code, and now in line 282 of machines.py the name has changed to derecho.

However, I still had some errors after running ./scripts_regression_tests. Here are some screenshots.

1708630404568.png

1708630489584.png

In config_compilers.xml, what should I set MPIFC to: mpif90 or mpifort?

1708630541478.png

Looking forward to your suggestions!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
You need to check what your MPI library calls the MPI compiler wrappers and use those.
If both mpif90 and mpifort are available, I think they are identical or nearly so - either should be fine.
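
One generic way to check which wrappers an MPI installation provides and which compiler they invoke (assuming the MPI bin directory is on your PATH):

which mpif90 mpifort    # see which wrapper names exist
mpif90 --version        # the wrapper passes this through and reports the underlying compiler version
mpif90 -show            # MPICH-style wrappers print the full underlying compile command; Open MPI uses -showme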
 