Large restart file output error in a high-resolution configuration for the CAM5-EUL dynamical core

eaton

CSEG and Liaisons
I would try doubling the nodes by requesting 68 nodes and assigning only 6 tasks per node.  This gives each task twice as much memory as in the successful configuration used for the 30-level grid.  I would also try setting the namelist variable atm_pio_stride=6.  This will put just one PIO task on each node, which should minimize the overhead incurred when writing the restart file.
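As a rough sketch of that layout (assuming a PBS/Torque-style batch system, and that atm_pio_stride is picked up through the usual CAM/driver namelist mechanism for your CESM version; the exact resource syntax and the file that carries the PIO settings depend on your machine port):

  # request 68 nodes with 6 MPI tasks per node (408 MPI tasks total)
  #PBS -l nodes=68:ppn=6

and, in the namelist that carries the PIO settings:

  ! one PIO task per node: 408 tasks / stride 6 = 68 I/O tasks
  atm_pio_stride = 6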
 
I've tested it this way, but it still fails.  I use 400 CPUs on 80 nodes, so 5 CPUs/node, and also set pio_stride = 5, so pio_numtasks = 80.  The model still hangs at the same place.  Is it possible for me to split the pbuf variables according to their first dimension (ldim)?  It is straightforward to do this in restart_dynamics, but I don't know how to do it in restart_physics.
 

eaton

CSEG and Liaisons
The fact that the run hangs in the same place after you've provided much more memory per task indicates that this isn't a memory problem.  Still, it would be worthwhile to use the performance tools on your system to determine the memory high-water marks of the tasks in the different configurations and make sure that everything makes sense.

If it isn't a memory problem, then presumably it's a problem with the I/O library, which is the path being pursued in the other thread on this topic.  At the very least I'd make sure I was using the latest release of the netCDF library.  That should allow you to write netCDF files of the required size.

My feeling right now is that this is a system problem.  I've looked at the physics buffer fields in cesm1_0_5 and don't see any that are candidates for being split up, i.e., there aren't any fields with a pcnst dimension.  We've already addressed those kinds of problems as a result of doing much higher resolution runs than what you're attempting here.
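If you don't have a full performance tool handy, a minimal Fortran sketch like the one below can report a per-task memory high-water mark.  It is not a CESM routine, just an ad hoc, Linux-only diagnostic (it reads the VmHWM line from /proc/self/status) that each MPI task could call, e.g. just before and just after the restart write, to see whether any task's peak footprint jumps there:

  subroutine print_mem_highwater(label)
     ! Print this process's peak resident memory (the VmHWM line reported
     ! by the Linux kernel in /proc/self/status).  Ad hoc diagnostic only.
     character(len=*), intent(in) :: label
     character(len=256) :: line
     integer :: unit, ios
     open(newunit=unit, file='/proc/self/status', status='old', &
          action='read', iostat=ios)
     if (ios /= 0) return
     do
        read(unit, '(a)', iostat=ios) line
        if (ios /= 0) exit
        if (line(1:6) == 'VmHWM:') then
           write(*,'(a)') trim(label)//'  '//trim(line)
           exit
        end if
     end do
     close(unit)
  end subroutine print_mem_highwater

Comparing the reported values between the 80-node x 5-task and 68-node x 6-task layouts would confirm (or rule out) memory as the culprit more directly than watching where the job hangs.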
 
