bdobbins@gmail_com
Hi guys,
In case this is relevant to anyone else, the issue I was facing with extremely slow parallel I/O speeds on CESM appears to have a very simple solution - the addition of the -D_USE_FLOW_CONTROL flag to CPPDEFS in the Macros file.
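For reference, the change was literally one line appended to the machine's Macros file - something like the following (the exact set of existing CPPDEFS varies by machine, so treat this as a sketch), followed by a clean rebuild of the case so the flag gets picked up:

    CPPDEFS += -D_USE_FLOW_CONTROL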
This flag is a default for some of the larger systems (Franklin, Hopper, Kraken, etc.), but is not present in the 'generic' files, or even in some large, site-specific ones like Pleiades. This is true in both the CESM 1.0.3 and 1.0.4 releases. Unless there is some potential harm from doing so (Jim?), I'd recommend this flag be added to the generic systems' CPPDEFS for future releases.
Adding that flag in CESM 1.0.3 reduced the I/O time at the end of each model month from ~800-1600s to a much better 75-110s, basically doubling the effective rate on CAM5 physics runs and improving it by a factor of more than 5x on a CAM4 physics run. In CESM 1.0.4, with its improved PIO capabilities, it's even better, with the average I/O time on these steps down to 45-60s. This is, as yet, without much tuning of the Lustre file system - it's using the default directory settings of stripe-count=4 and stripe-size=1M.
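(For anyone who wants to experiment with the striping themselves, the standard Lustre commands look roughly like the lines below - I'm only pointing at the knobs, not recommending these particular values, and $RUNDIR here just stands for wherever your history files are written:

    lfs getstripe -d $RUNDIR          # show the directory's default stripe settings
    lfs setstripe -c 8 -s 4m $RUNDIR  # example values: 8 stripes of 4 MB for new files

Files created in the directory after that inherit the new settings; existing files keep whatever striping they were written with.)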
I also noticed that the PIO directory from CESM 1.0.4, along with the 'calcdecomp.F90' file from the PIO 1.4.0 source code, can replace the CESM 1.0.3 PIO directory tree (${CESMROOT}/models/utils/pio) in full and give CESM 1.0.3 users the speed benefit of the updated code. This also gets around an issue with older Intel compilers (e.g. release 2011.5.220), which seem to have problems with the nested modules in PIO; later compilers, such as 2011.9.293, don't have this issue. We're going to run a quick validation of CESM 1.0.3 with the 1.0.4-based PIO code, plus the updated calcdecomp.F90 file, but as these are purely I/O-level changes, I don't expect any problems.
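In case it's useful, the swap itself was basically just a copy at the directory level - the source paths below are purely illustrative, so point them at wherever your CESM 1.0.4 download and the PIO 1.4.0 source actually live:

    # keep the original 1.0.3 PIO tree around, just in case
    mv ${CESMROOT}/models/utils/pio ${CESMROOT}/models/utils/pio.cesm103
    # drop in the PIO tree from a CESM 1.0.4 download
    cp -r /path/to/cesm1_0_4/models/utils/pio ${CESMROOT}/models/utils/
    # copy calcdecomp.F90 from the PIO 1.4.0 source into whichever subdirectory
    # of the new tree holds the other PIO .F90 files (adjust paths to match)
    cp /path/to/pio1_4_0/pio/calcdecomp.F90 ${CESMROOT}/models/utils/pio/pio/

Then do a clean build of the case so the replacement PIO actually gets compiled.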
All told, depending on the physics options in use, we now run between 2.5x and 6x faster on 1-degree models, before any tuning of the PE layout.
Hope that helps someone else - it's always these small things that get you! - and thanks again to Jim for his help.
Cheers,
- Brian
(PS. I'd be interested if anyone has insight into why the lack of 'flow control' hit us so hard - is this peculiar to, say, QLogic IB cards, which typically offload some of the message processing to the CPU? Or is it universally important at >=512 cores with few I/O writers?)