
Questions on NTASKS, ROOTPE, and submission

xiangli

Xiang Li
Member
For recent gnu compiler versions you will need to add the flags
-fallow-argument-mismatch -fallow-invalid-boz to the FCFLAGS
Thanks Jim.

Do you mean I should add this statement to config_compilers.xml?

<ADD_FCFLAGS>-fallow-argument-mismatch -fallow-invalid-boz</ADD_FCFLAGS>

I'm not sure whether I did it correctly:

1707946279259.png

1707946382331.png

Could you please give more details on how exactly to add FCFLAGS?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
No - just append the line containing -ffixed-line-length-none with

-fallow-argument-mismatch -fallow-invalid-boz
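
For illustration, assuming the gnu section of config_compilers.xml has an FFLAGS entry containing -ffixed-line-length-none roughly as in a stock CESM2 tree (the <base> wrapper and the other flags shown here are assumptions and may differ from yours), the appended line might end up looking like:

<FFLAGS>
  <base> -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none -fallow-argument-mismatch -fallow-invalid-boz </base>
</FFLAGS>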
 

xiangli

Xiang Li
Member
No - just append the line containing -ffixed-line-length-none with

-fallow-argument-mismatch -fallow-invalid-boz
Thanks, Jim! Now I got past the error.

At the final step of the build, there was an error related to -lnetcdf.

1708014316129.png

1708014358717.png

I added lines 155-156 in config_compilers.xml, but it still did not work.

1708014413574.png

Do you have any suggestions?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
It looks like you have separate library locations for netcdf and netcdff but only have the netcdff path in your SLIBS
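
For illustration, an SLIBS entry that points at both libraries might look like the sketch below; both -L paths are placeholders to be replaced with the actual netcdf-fortran and netcdf-c install locations on this machine:

<SLIBS>
  <append> -L/path/to/netcdf-fortran/lib -lnetcdff -L/path/to/netcdf-c/lib -lnetcdf </append>
</SLIBS>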
 

xiangli

Xiang Li
Member
It looks like you have separate library locations for netcdf and netcdff but only have the netcdff path in your SLIBS
Hi Jim,

Yes, I added the netcdf path to my SLIBS, and now ./create_test SMS.f19_g17.X runs successfully.

1708016905206.png

By the way, do I also need to add <env name="NETCDF_PATH">/opt/apps/rhel9/netcdf-c-4.9.2</env> here:

1708016994889.png

I'm now testing a B1850 case. There was an error at the final step of the build:

1708017109911.png

1708017198159.png

Do you have any comments or suggestions on that?

Thanks,
Xiang
 

xiangli

Xiang Li
Member
No, you don't need it in both places.

Thanks, Jim!

This error appeared at the final step of building a B1850 case:

1708017109911.png


1708017198159.png


Any suggestions on that would be appreciated!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
This is due to exceeding the limits of data in your memory model - I'm not sure how you might fix it.
 

xiangli

Xiang Li
Member
This is due to exceeding the limits of data in your memory model - I'm not sure how you might fix it.
Hi Jim,

I realized that this error was because I reduced the NTASKS and ROOTPE incorrectly.

This is the default setting:

1708105476796.png

1708105509126.png

1708105540765.png

This was what I modified (both MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE were set to 8):

1708105595684.png

1708105613685.png

1708105641098.png

I did this to reduce the number of nodes the job would request.

Any suggestions on how to reduce the NTASKS and ROOTPE correctly?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
I would increase the number of tasks per node and then the ntasks per component until you get a configuration that builds.
You can also play around with the gcc memory model flags instead. In any case running cesm on shared nodes is risky business - either you are going to stall or crash due to not enough resources or you are going to cause other users to stall and crash for the same reason.
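
As a rough sketch of that approach from the case directory (every value below is illustrative, and the -mcmodel note is an assumption about which gcc memory model flag is meant, not a verified fix):

./xmlchange MAX_TASKS_PER_NODE=32,MAX_MPITASKS_PER_NODE=32   # allow more tasks on each node
./xmlchange NTASKS=-2        # negative values request whole nodes; individual components can be set with NTASKS_ATM, NTASKS_OCN, etc.
./case.setup --reset
./case.build --clean-all
# Alternatively, experiment with the gcc memory model, e.g. add -mcmodel=medium to the Fortran flags in config_compilers.xml.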
 

xiangli

Xiang Li
Member
I would increase the number of tasks per node and then the ntasks per component until you get a configuration that builds.
You can also play around with the gcc memory model flags instead. In any case running cesm on shared nodes is risky business - either you are going to stall or crash due to not enough resources or you are going to cause other users to stall and crash for the same reason.
Hi Jim,

Yes, if MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE are set too small (such as 8), errors occur either during building or during running.

During the weekend, I changed MAX_TASKS_PER_NODE and MAX_MPITASKS_PER_NODE to 32 and used the default 6-node setting to run the model. It took 19 minutes to successfully finish the 5-model-day run.

However, today I unexpectedly had an error when building the case:

1708464368798.png

1708464404992.png

Could you please take a look at it?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
it looks like someone or something external to cesm killed your process - this sometimes happens if you run out of memory on a shared system
 

xiangli

Xiang Li
Member
it looks like someone or something external to cesm killed your process - this sometimes happens if you run out of memory on a shared system
Hi Jim,

Exactly! Now the case can be built and submitted to run when there are no other processes on the node.

Following the porting guide, I'm running ./scripts_regression_tests. There was an error:

1708530163339.png

As you can see below, I actually copied the original machines.py and renamed it to machines_original.py. Then I modified the machines.py file by changing the machine name to "duke":

1708530350932.png

1708530322760.png

However, it seems that the script was still reading machines_original.py.

How should I deal with it?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
You really should not need to modify any of the python. Your changes should be limited to xml files.
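
For a port, the xml changes usually amount to a new machine block in config_machines.xml (plus a matching compiler block in config_compilers.xml). A heavily abridged, hypothetical sketch for a machine named duke - every value here is a placeholder:

<machine MACH="duke">
  <DESC>Hypothetical local workstation entry</DESC>
  <OS>LINUX</OS>
  <COMPILERS>gnu</COMPILERS>
  <MPILIBS>openmpi</MPILIBS>
  <CIME_OUTPUT_ROOT>/path/to/scratch</CIME_OUTPUT_ROOT>
  <DIN_LOC_ROOT>/path/to/inputdata</DIN_LOC_ROOT>
  <MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
  <MAX_MPITASKS_PER_NODE>32</MAX_MPITASKS_PER_NODE>
  <!-- module_system, batch, and mpirun sections omitted from this sketch -->
</machine>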
 

xiangli

Xiang Li
Member
You really should not need to modify any of the python. Your changes should be limited to xml files.
Hi Jim,

Thanks for the reminder!

Now I have restored the original machines.py script and rerun ./scripts_regression_tests. Here is the error message:

1708543379320.png

Any suggestions on that?

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
The code you are showing me with cori-haswell in machines line 282 has been modified in the
latest maint-5.6 code which you should have - that name has been changed to derecho.
 

xiangli

Xiang Li
Member
The code you are showing me with cori-haswell in machines line 282 has been modified in the
latest maint-5.6 code which you should have - that name has been changed to derecho.
Hi Jim,

Does maint-5.6 mean CESM2.1.5.6? Can I get the latest code by redownloading the model?

Actually, I have been trying to redownload it, but I could not run ./manage_externals/checkout_externals. I remember running into this problem when I downloaded it last time, but I forget how I got past it.

1708545736753.png

Is this related to Subversion 1.14.3, which I'm currently using?

Looking forward to your suggestions!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
I see - you want to:

git checkout release-cesm2.1.5
./manage_externals/checkout_externals
cd cime
git checkout maint-5.6
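
To confirm the checkouts took effect, something along these lines, run from the CESM source root, should report the expected tag and branch (the comments describe expected output, not output captured here):

git describe --tags     # expected to report release-cesm2.1.5
git -C cime status      # expected to mention maint-5.6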
 

xiangli

Xiang Li
Member
I see - you want to:

git checkout release-cesm2.1.5
./manage_externals/checkout_externals
cd cime
git checkout maint-5.6
Thanks, Jim!

With these commands, I successfully updated the code, and now in line 282 of machines.py the name has changed to derecho.

However, I still had some errors after running ./scripts_regression_tests. Here are some screenshots.

1708630404568.png

1708630489584.png

In config_compilers.xml, what should I set MPIFC to: mpif90 or mpifort?

1708630541478.png

Looking forward to your suggestions!

Thanks,
Xiang
 

jedwards

CSEG and Liaisons
Staff member
You need to check what your MPI library calls the MPI compiler wrappers and use those.
If both mpif90 and mpifort are available, I think they are identical or nearly so - either should be fine.
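
One generic way to check which wrappers an MPI installation provides and which compiler they invoke (assuming the MPI bin directory is on your PATH):

which mpif90 mpifort    # see which wrapper names exist
mpif90 --version        # the wrapper passes this through and reports the underlying compiler version
mpif90 -show            # MPICH-style wrappers print the full underlying compile command; Open MPI uses -showme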
 